且构网

Sharing the everyday work of programmers and development...

Create an index for each group in a Spark DataFrame

Updated: 2023-02-05 16:26:00

You can use Window functions to create a rank column based on value, partitioned by group_id:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank
# Define window
window = Window.partitionBy(df['group_id']).orderBy(df['value'])
# Create column
df.select('*', rank().over(window).alias('index')).show()
+--------+-----+-----+
|group_id|value|index|
+--------+-----+-----+
|       1| -1.7|    1|
|       1|  0.0|    2|
|       1|  1.3|    3|
|       1|  2.7|    4|
|       1|  3.4|    5|
|       2|  0.8|    1|
|       2|  2.3|    2|
|       2|  5.9|    3|
+--------+-----+-----+
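For intuition, what partitionBy('group_id').orderBy('value') combined with rank() does above can be sketched in plain Python (no Spark required; the data below is the hypothetical table from this example): rows are grouped by group_id, sorted by value within each group, and numbered starting from 1.

```python
from collections import defaultdict

# Hypothetical rows matching the (group_id, value) table above
rows = [
    (1, -1.7), (1, 0.0), (1, 1.3), (1, 2.7), (1, 3.4),
    (2, 0.8), (2, 2.3), (2, 5.9),
]

# Partition: collect values per group_id
groups = defaultdict(list)
for gid, value in rows:
    groups[gid].append(value)

# Order within each partition and assign a 1-based index
indexed = []
for gid, values in groups.items():
    for i, value in enumerate(sorted(values), start=1):
        indexed.append((gid, value, i))

print(indexed)
```

This mirrors the output table: within group 1 the smallest value (-1.7) gets index 1, and the numbering restarts at 1 for group 2.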

Because you first select '*', the code above also keeps all the other columns. However, your second example suggests you are looking for the function dense_rank(), which gives a rank column with no gaps:

df.select('*', dense_rank().over(window).alias('index'))
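The difference between rank() and dense_rank() only shows up when there are tied values. A plain-Python sketch (no Spark; the tied values are hypothetical) of the two ranking rules:

```python
# With duplicate values, rank() leaves gaps after ties; dense_rank() does not.

def rank(values):
    # rank of v = 1 + number of elements strictly smaller than v
    return [1 + sum(1 for x in values if x < v) for v in values]

def dense_rank(values):
    # dense rank of v = 1 + position of v among the distinct sorted values
    distinct = sorted(set(values))
    return [1 + distinct.index(v) for v in values]

vals = [0.8, 2.3, 2.3, 5.9]  # hypothetical group with a tie
print(rank(vals))        # [1, 2, 2, 4]  (gap: rank 3 is skipped)
print(dense_rank(vals))  # [1, 2, 2, 3]  (no gap)
```

So with ties, rank() produces 1, 2, 2, 4 while dense_rank() produces the gapless 1, 2, 2, 3; in the sample data above there are no ties, so both functions give the same result.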