Updated: 2022-12-12 09:36:09
To make the code a little more compact, we can first collect the columns we want to diff in a list:
diff_columns = [c for c in df.columns if c != 'index']
Next, select the index and iterate over diff_columns to compute the new columns. Use .alias() to rename each resulting column:
# func is assumed to be pyspark.sql.functions and w an ordered Window, both defined earlier
df_diff = df.select(
    'index',
    *[(func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
      for c in diff_columns]
)
df_diff.show()
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#|    3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
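
To see what each `_diff` column holds, here is a minimal plain-Python sketch of the same arithmetic: for every row, the log of the current value minus the log of the previous row's value, with no result for the first row (which is why the first row shows null above). The input rows and the helper name `log_diff` are hypothetical and only for illustration; they are not taken from the original DataFrame, and this is not how Spark evaluates the window internally.

```python
import math

# Hypothetical input rows (an assumption for illustration), already ordered by index.
rows = [
    {"index": 1, "col1": 1.0, "col2": 2.0},
    {"index": 2, "col1": 2.0, "col2": 4.0},
    {"index": 3, "col1": 3.0, "col2": 6.0},
]

# Same idea as in the answer: every column except the index gets diffed.
diff_columns = [c for c in rows[0] if c != "index"]


def log_diff(rows, columns):
    """Return log(current) - log(previous) per column; None for the first row."""
    out = []
    prev = None  # plays the role of lag(...): the previous row, if there is one
    for row in rows:
        new_row = {"index": row["index"]}
        for c in columns:
            if prev is None:
                new_row[c + "_diff"] = None
            else:
                new_row[c + "_diff"] = math.log(row[c]) - math.log(prev[c])
        out.append(new_row)
        prev = row
    return out


result = log_diff(rows, diff_columns)
for r in result:
    print(r)
```

Because log(a) - log(b) equals log(a / b), each value is the log of the ratio between consecutive rows, so columns that grow by the same factor produce (up to floating-point rounding) the same differences.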