Updated: 2022-02-14 22:44:49
Assuming your DataFrame is called df:
from pyspark.sql.functions import struct, collect_list
gdf = (df.select("product_id", "category", struct("purchase_date", "warranty_days").alias("pd_wd"))
.groupBy("product_id")
.pivot("category")
.agg(collect_list("pd_wd")))
Essentially, you have to combine purchase_date and warranty_days into a single column using struct(). Then you just group by product_id, pivot by category, and aggregate with collect_list().
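To see what the group-pivot-collect pattern produces without a Spark session, here is a plain-Python sketch of the same logic on hypothetical sample rows (the values are made up for illustration):

```python
from collections import defaultdict

# Hypothetical input rows: (product_id, category, purchase_date, warranty_days)
rows = [
    (1, "A", "2020-01-01", 30),
    (1, "B", "2020-02-01", 60),
    (1, "A", "2020-03-01", 90),
    (2, "B", "2020-04-01", 15),
]

# Emulate groupBy("product_id").pivot("category").agg(collect_list(...)):
# one row per product_id, one column per category, and each cell holds
# the list of (purchase_date, warranty_days) "structs" for that pair.
pivoted = defaultdict(lambda: defaultdict(list))
for product_id, category, purchase_date, warranty_days in rows:
    pivoted[product_id][category].append((purchase_date, warranty_days))

result = {pid: dict(cats) for pid, cats in pivoted.items()}
```

After running this, `result[1]["A"]` holds both structs for product 1 in category A, mirroring the array column collect_list() would produce in the pivoted Spark DataFrame.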