且构网


How to iterate over a Spark DataFrame

Updated: 2023-11-18 23:53:16

If the data is small ("but the df is not that big"), I'd just collect it and process it with Scala collections. If the types are as shown below:

df.printSchema
root
 |-- time: integer (nullable = false)
 |-- id: integer (nullable = false)
 |-- direction: boolean (nullable = false)

you can collect it:

// .as[...] needs the session's encoders in scope: import spark.implicits._
val data = df.as[(Int, Int, Boolean)].collect.toSeq

and scanLeft over it:

val result = data.scanLeft((-1, Set[Int]())) {
  // direction == true: the id enters the set; false: it leaves
  case ((_, acc), (time, value, true))  => (time, acc + value)
  case ((_, acc), (time, value, false)) => (time, acc - value)
}.tail  // drop the (-1, Set()) seed
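To see what the fold produces, here is a minimal Spark-free sketch of the same scanLeft step; the sample rows and the `runningSet` helper name are illustrative assumptions, not part of the original answer:

```scala
// Spark-free sketch of the scanLeft fold above (names and data assumed).
object ScanLeftSketch {
  // Each row is (time, id, direction): true = the id enters, false = it leaves.
  def runningSet(data: Seq[(Int, Int, Boolean)]): Seq[(Int, Set[Int])] =
    data.scanLeft((-1, Set[Int]())) {
      case ((_, acc), (time, id, true))  => (time, acc + id)
      case ((_, acc), (time, id, false)) => (time, acc - id)
    }.tail  // drop the (-1, Set()) seed

  def main(args: Array[String]): Unit = {
    val sample = Seq((1, 10, true), (2, 20, true), (3, 10, false))
    println(runningSet(sample))
    // List((1,Set(10)), (2,Set(10, 20)), (3,Set(20)))
  }
}
```

Each output element pairs a time with the set of ids active at that time, which is why the dummy `(-1, Set())` seed is removed with `.tail`.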