且构网


Spark 2.1.1 获取窗口的最后一个元素

Updated: 2023-11-10 09:37:04

According to the issue SPARK-20969, you should be able to get the expected results by defining explicit frame bounds for your window, as shown below.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Explicitly cover the whole partition, so first/last see every row,
// not just the rows up to the current one.
val windowSpec = Window
  .partitionBy("name")
  .orderBy("count")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1),
      ("A", 2),
      ("A", 3),
      ("B", 10),
      ("B", 20),
      ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()
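To see why the explicit bounds matter: when a window has an ORDER BY but no frame clause, Spark defaults the frame to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so `last("count")` simply returns the current row's own value. A minimal sketch of the pitfall:

```scala
// Pitfall: ordered window with no explicit frame.
// The default frame only extends up to the current row, so for the
// "A" partition last("count") would yield 1, 2, 3 row by row
// instead of 3, 3, 3.
val defaultSpec = Window
  .partitionBy("name")
  .orderBy("count")
// last("count").over(defaultSpec)  // "last" value == current row's value
```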

Alternatively, if you are ordering on the same column over which you are computing first and last, you can switch to min and max with a non-ordered window; that should also work correctly.
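Assuming the same DataFrame as above is bound to a value `df` (a hypothetical name; the original builds it inline), the min/max variant could look like this. An unordered window's frame defaults to the entire partition, so no `rowsBetween` is needed:

```scala
// Alternative: unordered window, whole-partition frame by default.
// min/max of the ordering column stand in for first/last.
val unorderedSpec = Window.partitionBy("name")

df
  .withColumn("firstCountOfName", min("count").over(unorderedSpec))
  .withColumn("lastCountOfName", max("count").over(unorderedSpec))
  .show()
```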