且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

在哪些情况下会跳过DAG的阶段?

更新时间:2021-10-03 00:48:07

实际上,这非常简单.

在您的情况下,不能跳过任何操作,因为每个Action具有不同的JOIN类型.它需要扫描d和d'来计算结果.即使使用.cache(您不使用它,也应该使用它来避免重新计算每个Action的源代码),这没有什么区别.

In your case nothing can be skipped as each Action has a different JOIN type. It needs to scan d and d' to compute the result. Even with .cache (which you do not use and should use to avoid recomputing all the way back to source on each Action), this would make no difference.

看这个简化的版本:

val d = sc.parallelize(0 until 100000).map(i => (i%10000, i)).cache // or not cached, does not matter

val c=d.rightOuterJoin(d.reduceByKey(_+_))
val f=d.leftOuterJoin(d.reduceByKey(_+_))

c.count
c.collect // skipped, shuffled 
f.count
f.collect // skipped, shuffled

显示此应用程序的以下作业:

Shows the following Jobs for this App:

(4) Spark Jobs
Job 116 View(Stages: 3/3)
Job 117 View(Stages: 1/1, 2 skipped)
Job 118 View(Stages: 3/3)
Job 119 View(Stages: 1/1, 2 skipped)

您可以看到,基于相同混洗结果的连续动作会导致val c或val f的第二个Action/Job跳过一个或多个阶段.也就是说,c和f的联接类型是已知的,并且相同联接类型的2个动作从先前的工作中顺次运行,即第二个动作可以依靠直接适用于第二个动作的第一个动作的改组行动.这么简单.

You can see that successive Actions based on same shuffling result cause a skipping of one or more Stages for the second Action / Job for val c or val f. That is to say, the join type for c and f are known and the 2 Actions for the same join type run sequentially profiting from prior work, i.e. the second Action can rely on the shuffling of the first Action that is directly applicable to the 2nd Action. That simple.