更新时间:2023-11-26 11:03:22
使用aggregateByKey
:
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
.aggregateByKey(ListBuffer.empty[String])(
(numList, num) => {numList += num; numList},
(numList1, numList2) => {numList1.appendAll(numList2); numList1})
.mapValues(_.toList)
.collect()
scala> Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))
有关aggregateByKey
的详细信息,请参见此答案, 此链接,以了解使用可变数据集ListBuffer
的背后原理.
See this answer for the details on aggregateByKey
, this link for the rationale behind using a mutable dataset ListBuffer
.
Is there a way to achieve the same result using reduceByKey?
以上实际上是更糟糕的性能,请查看@ zero323的评论以获取详细信息.
The above is actually worse in performance, please see comments by @zero323 for the details.