How to iterate over each row of a DataFrame in PySpark

Updated: 2023-02-05 10:38:28

To "loop" and take advantage of Spark's parallel computation framework, you can define a custom function and use map.

# Extract the fields of interest from each Row.
def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)

# Equivalent one-liner using a lambda:
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))

The custom function is then applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
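As a concrete illustration, here is a minimal, self-contained sketch. The SparkSession setup, the sample data, and the column names are assumptions added for demonstration, since the original snippet does not show how `sample` is built. Calling .toDF() on the mapped RDD converts the result back into a dataframe.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-map-demo").getOrCreate()

# Hypothetical sample data; the real `sample` dataframe is not shown above.
sample = spark.createDataFrame(
    [("Alice", 30, "NYC"), ("Bob", 25, "LA")],
    ["name", "age", "city"],
)

sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))

# sample2 is an RDD of tuples; convert it back to a dataframe if needed.
sample2_df = sample2.toDF(["name", "age", "city"])
sample2_df.show()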

map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.

sample3 = sample.withColumn('age2', sample.age + 2)
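Because withColumn stays inside the DataFrame API, the new column is defined with column expressions rather than arbitrary Python functions, which Spark can optimize. An equivalent form uses pyspark.sql.functions.col; this sketch runs against the same assumed `sample` dataframe from the earlier example.

from pyspark.sql.functions import col

# Same derived column, written as a column expression.
sample3 = sample.withColumn('age2', col('age') + 2)
sample3.show()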