Add column sum as new column in PySpark dataframe

Updated: 2021-09-13 07:05:22

This was not obvious. I see no row-based sum of the columns defined in the Spark DataFrame API.

This can be done in a fairly simple way:

newdf = df.withColumn('total', sum(df[col] for col in df.columns))

df.columns is supplied by PySpark as a list of strings giving all of the column names in the Spark DataFrame. For a different sum, you can supply any other list of column names instead. Note that sum here is Python's built-in sum; if it has been shadowed by pyspark.sql.functions.sum (for example through a wildcard import), the line above will not behave this way.
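As a minimal sketch of the "any other list of column names" case, the expression can be wrapped in a small helper. The name sum_columns and the column name 'a_plus_c' are hypothetical, not part of PySpark. The helper relies only on df[name] indexing and on +, so the demonstration runs it on a plain dict standing in for a DataFrame; the commented line shows how it would presumably be used with a real one.

```python
def sum_columns(df, names):
    """Add up df[name] for each name in names (hypothetical helper)."""
    # Built-in sum starts from 0 and applies + to each item in turn.
    return sum(df[name] for name in names)


# Stand-in for a DataFrame: anything that supports df[name] and + will do.
row_like = {"a": 1, "b": 2, "c": 3}

partial = sum_columns(row_like, ["a", "c"])  # only columns a and c
everything = sum_columns(row_like, list(row_like))

# With a real Spark DataFrame df, the same helper would be used as:
# newdf = df.withColumn('a_plus_c', sum_columns(df, ['a', 'c']))
```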

I did not try this as my first solution because I wasn't certain how it would behave. But it works.
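Why the built-in sum works on column objects can be illustrated without Spark. The sketch below uses a hypothetical Expr class (a toy stand-in, not a PySpark class) that overloads + the way a column type would. Built-in sum starts from the integer 0, so the first addition is 0 + Expr; int cannot handle that, and Python falls back to the right operand's __radd__. PySpark's Column defines both __add__ and __radd__, which is what lets the one-liner above go through.

```python
class Expr:
    """Toy column expression that records how it was built (hypothetical)."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Handles Expr + something.
        return Expr(f"({self.text} + {_text(other)})")

    def __radd__(self, other):
        # Handles something + Expr when the left operand cannot,
        # e.g. the initial 0 that built-in sum starts from.
        return Expr(f"({_text(other)} + {self.text})")


def _text(value):
    # Render either an Expr or a plain Python literal.
    return value.text if isinstance(value, Expr) else repr(value)


columns = [Expr(name) for name in ["a", "b", "c"]]
total = sum(columns)
print(total.text)  # (((0 + a) + b) + c)
```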

This is overly complicated, but works as well.

You can:

use df.columns to get a list of the names of the columns
use that list of names to make a list of the columns
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner

With Python's reduce, some knowledge of how operator overloading works, and the PySpark source code for Column, that becomes:

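from functools import reduce  # built in on Python 2, must be imported on Python 3

def column_add(a, b):
    # Column overloads +, so this builds the expression a + b
    return a.__add__(b)

newdf = df.withColumn('total_col',
                      reduce(column_add, (df[col] for col in df.columns)))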

Note this is Python's reduce, not the Spark RDD reduce. The second argument to reduce is a generator expression, and it needs its own parentheses because it is not the only argument in the call.
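The fold itself can be tried without Spark. The sketch below runs the same reduce over plain integers standing in for columns (the dict row is illustrative data only) and shows that operator.add from the standard library is a drop-in alternative to a hand-written column_add. Unlike built-in sum, reduce with no initial value starts from the first item, so no literal 0 enters the expression.

```python
from functools import reduce
import operator


def column_add(a, b):
    # Same helper as above: call the overloaded add directly.
    return a.__add__(b)


# Plain integers stand in for columns; one "row" of illustrative data.
row = {"a": 8, "b": 5, "c": 6}

# The generator expression needs its own parentheses here because it is
# the second of two arguments, not the sole argument of the call.
total_dunder = reduce(column_add, (row[name] for name in row))

# operator.add applies +, so it works for any type that overloads +.
total_op = reduce(operator.add, (row[name] for name in row))

print(total_dunder, total_op)  # 19 19
```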

Tested, works!

$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> from functools import reduce  # needed on Python 3
>>> def column_add(a,b):
...     return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
