How to use reduceByKey instead of groupByKey to build a list?

Updated: 2023-09-27 12:36:22

The answer is that you cannot, or at least not in a straightforward, Pythonic way without abusing the language's dynamism. The value type and the return type differ (a single tuple vs. a list of tuples), so the function passed to reduceByKey, which must take two values and return a value of the same type, is not valid here. You could use combineByKey or aggregateByKey instead, for example:

rdd = sc.parallelize([
    ("key1", ("val1_key1", "val2_key1")),
    ("key2", ("val1_key2", "val2_key2"))])

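rdd.aggregateByKey(
    [],                                # zeroValue: start each key with an empty list
    lambda acc, x: acc + [x],          # seqFunc: add one value to a partition-local list
    lambda acc1, acc2: acc1 + acc2)    # combFunc: merge the lists built on different partitions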

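However, this is just a less efficient version of groupByKey: every value still ends up in the shuffled lists, so map-side combining does not reduce the amount of data moved. See also "Is groupByKey ever preferred over reduceByKey?".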
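The type mismatch described in the answer can be seen without a cluster. The following is a minimal plain-Python sketch, not Spark itself: `partitions` and `aggregate_by_key` are hypothetical stand-ins that only mimic the contract of aggregateByKey (one function folds values inside a partition, another merges partial results), and the extra `val3_key1` record is made up for illustration.

```python
from functools import reduce

# Hypothetical stand-in for an RDD spread over two partitions (no Spark involved).
partitions = [
    [("key1", ("val1_key1", "val2_key1")), ("key2", ("val1_key2", "val2_key2"))],
    [("key1", ("val3_key1", "val4_key1"))],
]

# A reduce function must map (V, V) -> V. Returning a list from two tuples
# breaks that contract, so the result nests as soon as there are three values.
values = [v for part in partitions for k, v in part if k == "key1"] + [("x", "y")]
naive = reduce(lambda a, b: [a, b], values)


def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Mimic the aggregateByKey contract in plain Python (illustrative only)."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_op: (U, V) -> U, folds one value into the partition-local result
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            # comb_op: (U, U) -> U, merges results coming from different partitions
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Same three arguments as in the aggregateByKey call above.
result = aggregate_by_key(
    partitions,
    [],
    lambda acc, x: acc + [x],
    lambda acc1, acc2: acc1 + acc2,
)
```

Here `naive` comes out as a nested structure rather than a flat list of tuples, while `result` maps each key to a flat list. Because the accumulator type (a list) is separate from the value type (a tuple), the aggregate-style interface fits where a plain reduce does not.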
