且构网

Sharing all things programmer development...

PySpark: Using a concat function to generate columns in a new DataFrame

Updated: 2022-12-12 09:36:09



Use select.

To make the code more compact, first collect the columns to diff in a list:

diff_columns = [c for c in df.columns if c != 'index']
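As a small illustration of this filtering step, the sketch below applies the same list comprehension to a plain Python list standing in for df.columns. The column names are assumed from the output shown in this answer, and `columns` and `exclude` are hypothetical names, not part of the original code.

```python
# Stand-in for df.columns; names are assumed from the example output.
columns = ['index', 'col1', 'col2', 'col3']

# Columns that should not be differenced (hypothetical helper set).
exclude = {'index'}

# Same idea as the one-liner above, generalized to a set of excluded names.
diff_columns = [c for c in columns if c not in exclude]
print(diff_columns)
```

Using a set for the excluded names is just one option; it keeps the comprehension unchanged if more key columns need to be skipped later.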

Next, select the index and iterate over diff_columns to compute each new column, using .alias() to rename the result:

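# Assumes `import pyspark.sql.functions as func` and a window spec `w`
# defined earlier (neither is shown in this excerpt).
# For each column: log(current row) - log(previous row within window w).
df_diff = df.select(
    'index',
    *[(func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
      for c in diff_columns]
)
df_diff.show()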
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#|    3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
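
To check what the expression computes without a Spark session, the following plain-Python sketch mirrors the log(current) - log(previous) calculation per column, with None where no previous row exists (as lag() yields null for the first row). The input values are assumed, since the original DataFrame is not shown here; they are chosen so that each column doubles and then increases by a factor of 1.5, which is consistent with the differences of roughly 0.693 and 0.405 above. `log_diff` is a hypothetical helper, not a PySpark function.

```python
import math

def log_diff(values):
    """Return log(v[i]) - log(v[i-1]) for each row; None for the first row."""
    out = [None]  # the first row has no previous value, like lag() giving null
    for prev, cur in zip(values, values[1:]):
        out.append(math.log(cur) - math.log(prev))
    return out

# Assumed sample data; the original DataFrame is not shown in this answer.
data = {
    'col1': [1.0, 2.0, 3.0],
    'col2': [2.0, 4.0, 6.0],
    'col3': [3.0, 6.0, 9.0],
}

# Build the "_diff" columns the same way the select() above names them.
result = {name + '_diff': log_diff(vals) for name, vals in data.items()}
for name, diffs in result.items():
    print(name, diffs)
```

Because log(a) - log(b) equals log(a / b), every column with the same row-to-row ratios produces the same differences up to floating-point rounding, which is why the three output columns agree so closely.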