
PySpark DataFrame: converting two columns into a new tuple column, grouped by a third column's values

Updated: 2022-02-14 22:44:49

Assuming your DataFrame is called df:

from pyspark.sql.functions import struct
from pyspark.sql.functions import collect_list

# Combine the two columns into one struct column, then pivot on category
gdf = (df.select("product_id", "category",
                 struct("purchase_date", "warranty_days").alias("pd_wd"))
         .groupBy("product_id")
         .pivot("category")
         .agg(collect_list("pd_wd")))

Essentially, you have to group purchase_date and warranty_days into a single column using struct(). Then you simply group by product_id, pivot by category, and aggregate with collect_list(). The result has one row per product_id, one column per distinct category value, and each cell holds a list of (purchase_date, warranty_days) structs.