Updated: 2023-02-05 16:26:00
You can use Window functions to create a rank column based on value, partitioned by group_id:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank
# Define a window partitioned by group_id, ordered by value within each partition
window = Window.partitionBy(df['group_id']).orderBy(df['value'])
# Create the rank column, keeping all existing columns via '*'
df.select('*', rank().over(window).alias('index')).show()
+--------+-----+-----+
|group_id|value|index|
+--------+-----+-----+
| 1| -1.7| 1|
| 1| 0.0| 2|
| 1| 1.3| 3|
| 1| 2.7| 4|
| 1| 3.4| 5|
| 2| 0.8| 1|
| 2| 2.3| 2|
| 2| 5.9| 3|
+--------+-----+-----+
Because you first select '*', the code above keeps all other columns as well. However, your second example shows that you are looking for the function dense_rank(), which gives a rank column with no gaps:
df.select('*', dense_rank().over(window).alias('index'))
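The two functions only differ when there are ties. As a minimal pure-Python sketch (not using Spark, with hypothetical sample values), rank() leaves a gap after tied values while dense_rank() does not:

```python
# Illustrates rank() vs dense_rank() semantics on tied values.
values = [0.8, 2.3, 2.3, 5.9]  # hypothetical sample with a tie

def rank_with_gaps(vals):
    # rank(): tied values share a rank; the next rank skips ahead
    sorted_vals = sorted(vals)
    return [sorted_vals.index(v) + 1 for v in sorted_vals]

def dense_rank_no_gaps(vals):
    # dense_rank(): tied values share a rank; no gaps afterwards
    distinct = sorted(set(vals))
    return [distinct.index(v) + 1 for v in sorted(vals)]

print(rank_with_gaps(values))      # [1, 2, 2, 4] -- rank 3 is skipped
print(dense_rank_no_gaps(values))  # [1, 2, 2, 3] -- no gap
```

In the example data above there are no duplicate values within a group, so rank() and dense_rank() would produce the same index column; the difference only shows up once a partition contains ties.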