
Getting the last element of a window in Spark 2.1.1

Updated: 2023-11-10 07:56:04

According to the issue SPARK-20969, you should be able to get the expected results by defining adequate bounds on your window, as shown below.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Span the entire partition so last() sees every row,
// not just rows up to the current one (the default frame).
val windowSpec = Window
  .partitionBy("name")
  .orderBy("count")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1),
      ("A", 2),
      ("A", 3),
      ("B", 10),
      ("B", 20),
      ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()

Alternatively, if you are ordering on the same column you are computing `first` and `last` over, you can switch to `min` and `max` with a non-ordered window, and it should also work properly.
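A minimal sketch of that alternative, reusing the same sample data (the `unorderedSpec` name is my own choice). Because `min` and `max` are order-independent aggregates, the window only needs a partition clause and no frame specification:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// No orderBy: min/max consider the whole partition by default.
val unorderedSpec = Window.partitionBy("name")

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1), ("A", 2), ("A", 3),
      ("B", 10), ("B", 20), ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", min("count").over(unorderedSpec))
  .withColumn("lastCountOfName", max("count").over(unorderedSpec))
  .show()
```

This avoids the frame-boundary pitfall entirely: without an `orderBy`, the default frame already covers the full partition, so `max("count")` returns the same value that `last("count")` does with explicit unbounded bounds.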