
PySpark Transpose

Updated: 2023-11-18 22:08:16

IIUC, you need a wide-to-long transformation, which can be achieved with stack in PySpark.
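As a quick illustration of what stack does, here is a minimal sketch on a made-up two-column dataframe (the names demo, m1, m2 are purely illustrative and not part of the original answer):

from pyspark.sql.functions import expr

# stack(2, ...) turns each input row into 2 output rows: one per (label, value) pair
demo = spark.createDataFrame([(1, 100, 200)], ['id', 'm1', 'm2'])
demo.select('id', expr("stack(2, 'm1', m1, 'm2', m2)").alias('name', 'value')).show()
# +---+----+-----+
# | id|name|value|
# +---+----+-----+
# |  1|  m1|  100|
# |  1|  m2|  200|
# +---+----+-----+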

I created a sample dataframe with 5 months of data:

from pyspark.sql.functions import expr

df = spark.createDataFrame(
    [(1, 10, 20, 30, 40, 50, 10, 20, 30, 40, 50), (2, 10, 20, 30, 40, 50, 10, 20, 30, 40, 50)],
    ['cust', 'Measrue1_month1', 'Measrue1_month2', 'Measrue1_month3', 'Measrue1_month4', 'Measrue1_month5',
     'Measrue2_month1', 'Measrue2_month2', 'Measrue2_month3', 'Measrue2_month4', 'Measrue2_month5'])
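For context, the wide dataframe has one column per measure/month combination (schema abbreviated here):

df.printSchema()
# root
#  |-- cust: long (nullable = true)
#  |-- Measrue1_month1: long (nullable = true)
#  ...
#  |-- Measrue2_month5: long (nullable = true)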

Now generating the clause for the stack operation. This can be done in better ways, but here is the simplest example:

# Collect the Measure1 and Measure2 columns separately
Measure1 = [i for i in df.columns if i.startswith('Measrue1')]
Measure2 = [i for i in df.columns if i.startswith('Measrue2')]

# Pair up the two measures that belong to the same month (same '_monthN' suffix)
final = []
for i in Measure1:
    for j in Measure2:
        if i.split('_')[1] == j.split('_')[1]:
            final.append((i, j))

# One stack row per month, and a flat "'label',col1,col2" argument list for stack()
rows = len(final)
values = ','.join([f"'{i.split('_')[1]}',{i},{j}" for i, j in final])
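Printing the generated pieces shows exactly what will be passed to stack (output truncated here):

print(rows)
# 5
print(values)
# 'month1',Measrue1_month1,Measrue2_month1,'month2',Measrue1_month2,Measrue2_month2,...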

Now actually applying the stack operation:

df.select('cust',expr(f'''stack({rows},{values})''').alias('Month','Measure1','Measure2')).show()

+----+------+--------+--------+
|cust| Month|Measure1|Measure2|
+----+------+--------+--------+
|   1|month1|      10|      10|
|   1|month2|      20|      20|
|   1|month3|      30|      30|
|   1|month4|      40|      40|
|   1|month5|      50|      50|
|   2|month1|      10|      10|
|   2|month2|      20|      20|
|   2|month3|      30|      30|
|   2|month4|      40|      40|
|   2|month5|      50|      50|
+----+------+--------+--------+