
How to limit the number of retries on Spark job failure?

Updated: 2022-10-15 23:25:39


    We are running a Spark job via spark-submit, and I can see that the job will be re-submitted in the case of failure.

    How can I stop it from making attempt #2 in the case of a YARN container failure or whatever the exception may be?

    This happened due to a lack of memory and a "GC overhead limit exceeded" error.

    There are two settings that control the number of retries, i.e. the maximum number of ApplicationMaster registration attempts with YARN before the attempt, and with it the entire Spark application, is considered failed (a short configuration sketch follows the list):

    • spark.yarn.maxAppAttempts - Spark's own setting. See MAX_APP_ATTEMPTS:

        private[spark] val MAX_APP_ATTEMPTS = ConfigBuilder("spark.yarn.maxAppAttempts")
          .doc("Maximum number of AM attempts before failing the app.")
          .intConf
          .createOptional
      

    • yarn.resourcemanager.am.max-attempts - YARN's own setting, with a default of 2.
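
    To answer the question directly, i.e. to prevent attempt #2, set spark.yarn.maxAppAttempts to 1, for example with --conf spark.yarn.maxAppAttempts=1 on spark-submit. The sketch below does the same from code; it assumes YARN client mode (where the YARN application is submitted when the SparkSession is created), and the app name is just a placeholder:

        import org.apache.spark.sql.SparkSession

        // Sketch: cap the application at a single ApplicationMaster attempt so a
        // failed job is not re-submitted by YARN. In cluster mode the same value
        // must be supplied at submission time instead (e.g. via spark-submit --conf).
        val spark = SparkSession.builder()
          .appName("my-job")
          .config("spark.yarn.maxAppAttempts", "1")
          .getOrCreate()
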

    As you can see in YarnRMClient.getMaxRegAttempts, the actual number is the minimum of YARN's and Spark's configuration settings, with YARN's value being the last resort.
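
    For reference, the resolution can be sketched roughly like this (a simplified model of what YarnRMClient.getMaxRegAttempts does, not the exact Spark source; the helper name effectiveMaxAttempts is made up for illustration):

        import org.apache.spark.SparkConf
        import org.apache.hadoop.yarn.conf.YarnConfiguration

        // Simplified model: take Spark's spark.yarn.maxAppAttempts if set, but never
        // exceed YARN's yarn.resourcemanager.am.max-attempts (which defaults to 2).
        def effectiveMaxAttempts(sparkConf: SparkConf, yarnConf: YarnConfiguration): Int = {
          val yarnMax = yarnConf.getInt(
            YarnConfiguration.RM_AM_MAX_ATTEMPTS,
            YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS)
          sparkConf.getOption("spark.yarn.maxAppAttempts")
            .map(_.toInt.min(yarnMax))
            .getOrElse(yarnMax)
        }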