Due to Spark's lazy evaluation, the coalesce results in reduced parallelism of the read operation.
It has nothing to do with laziness. coalesce intentionally doesn't create an analysis barrier:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
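A minimal sketch of the behavior described above, using a synthetic spark.range source standing in for a real parallel read (the output path, partition count, and object name are illustrative, not from the original post). Because coalesce(1) inserts no shuffle boundary, Spark plans the source, the projection, and the write into a single stage, so the whole pipeline runs as one task:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .master("local[8]")
      .getOrCreate()

    // 8 upstream partitions standing in for a parallel read.
    val df = spark.range(0L, 1000000L, 1L, 8)

    // coalesce(1) adds no shuffle boundary, so the projection above it is
    // squeezed into the same single task as the write.
    df.selectExpr("id * 2 AS doubled")
      .coalesce(1)
      .write.mode("overwrite").parquet("/tmp/coalesce-out")

    spark.stop()
  }
}
```

Checking the Spark UI (or df.coalesce(1).rdd.getNumPartitions) confirms the stage runs with a single task rather than the original eight.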
So just follow the documentation and use repartition instead of coalesce.
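For contrast, a sketch of the repartition variant, reusing the df assumed in the previous example: the shuffle that repartition introduces acts as a stage boundary, so the projection still runs with the original parallelism and only the post-shuffle write stage uses a single task.

```scala
// repartition(1) inserts a shuffle: the selectExpr stage keeps all
// 8 upstream tasks, and only the write after the shuffle runs as one task.
df.selectExpr("id * 2 AS doubled")
  .repartition(1)
  .write.mode("overwrite").parquet("/tmp/repartition-out")
```

The trade-off is the cost of the shuffle itself, which is usually worth paying when the upstream work is expensive relative to moving the data once.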