且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

多个 RDD 的 Spark 联合

更新时间:2023-11-18 22:38:46

如果这些是 RDD,你可以使用 SparkContext.union 方法:

If these are RDDs you can use SparkContext.union method:

rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])

rdd = sc.union([rdd1, rdd2, rdd3])
rdd.collect()

## [1, 2, 3, 4, 5, 6, 7, 8, 9]

没有 DataFrame 等价物,但它只是一个简单的单行代码:

There is no DataFrame equivalent but it is just a matter of a simple one-liner:

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))

unionAll(df1, df2, df3).show()

## +---+----+
## |  k|   v|
## +---+----+
## |  1|foo1|
## |  2|bar1|
## |  3|foo2|
## |  4|bar2|
## |  5|foo3|
## |  6|bar3|
## +---+----+

如果 DataFrames 的数量很大,在 RDD 上使用 SparkContext.union 并重新创建 DataFrame 可能是避免 与准备执行计划的成本相关的问题:

If number of DataFrames is large using SparkContext.union on RDDs and recreating DataFrame may be a better choice to avoid issues related to the cost of preparing an execution plan:

def unionAll(*dfs):
    first, *_ = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )