且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

Spark 和 SparkSQL:如何模仿窗口函数?

更新时间:2023-11-18 15:02:22

您可以使用 RDD 来做到这一点.就个人而言,我发现 RDD 的 API 更有意义 - 我并不总是希望我的数据像数据框一样扁平".

You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.

val df = sqlContext.sql("select 1, '2015-09-01'"
    ).unionAll(sqlContext.sql("select 2, '2015-09-01'")
    ).unionAll(sqlContext.sql("select 1, '2015-09-03'")
    ).unionAll(sqlContext.sql("select 1, '2015-09-04'")
    ).unionAll(sqlContext.sql("select 2, '2015-09-04'"))

// dataframe as an RDD (of Row objects)
df.rdd 
  // grouping by the first column of the row
  .groupBy(r => r(0)) 
  // map each group - an Iterable[Row] - to a list and sort by the second column
  .map(g => g._2.toList.sortBy(row => row(1).toString))     
  .collect()

上面给出的结果如下:

Array[List[org.apache.spark.sql.Row]] = 
Array(
  List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]), 
  List([2,2015-09-01], [2,2015-09-04]))

如果您还想要在组"中的位置,您可以使用 zipWithIndex.

If you want the position within the 'group' as well, you can use zipWithIndex.

df.rdd.groupBy(r => r(0)).map(g => 
    g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()

Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
  List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
  List(([2,2015-09-01],0), ([2,2015-09-04],1)))

可以使用 FlatMap 将其展平为一个简单的 Row 对象列表/数组,但是如果您需要在组"上执行任何不会是个好主意.

You could flatten this back to a simple List/Array of Row objects using FlatMap, but if you need to perform anything on the 'group' that won't be a great idea.

像这样使用 RDD 的缺点是从 DataFrame 转换到 RDD 再转换回来很乏味.

The downside to using RDD like this is that it's tedious to convert from DataFrame to RDD and back again.