Updated: 2023-11-10 09:37:04
According to SPARK-20969, you should be able to get the expected results by defining adequate bounds for your window, as shown below.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// An explicit frame covering the whole partition, so first()/last()
// see every row in the partition rather than only rows up to the
// current one.
val windowSpec = Window
  .partitionBy("name")
  .orderBy("count")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1),
      ("A", 2),
      ("A", 3),
      ("B", 10),
      ("B", 20),
      ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()
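For context, the behaviour discussed in SPARK-20969 comes from the default frame: when a window has an orderBy but no explicit frame, Spark uses a frame from unboundedPreceding to currentRow, so last("count") simply returns the current row's own value. A minimal sketch of the surprising default, assuming df is the DataFrame built above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// No rowsBetween: an ordered window defaults to the frame
// (unboundedPreceding, currentRow), so last() over this window
// just echoes each row's own count instead of the partition maximum.
val defaultSpec = Window
  .partitionBy("name")
  .orderBy("count")

df
  .withColumn("lastCountOfName", last("count").over(defaultSpec))
  .show()
```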
Alternatively, if you are ordering on the same column whose first and last values you are computing, you can switch to min and max with a non-ordered window, and it should also work properly.