Add JARs to a Spark job - spark-submit

ClassPath:

ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:

  • spark.driver.extraClassPath or its alias --driver-class-path to set extra classpaths on the node running the driver.
  • spark.executor.extraClassPath to set the extra classpath on the Worker nodes.

If you want a certain JAR to be available on both the Master and the Workers, you have to specify it separately in BOTH flags.
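
For example, a minimal sketch (the JAR path and class name below are hypothetical) that makes the same JAR visible on both the driver and the executor classpaths:

    # /opt/libs/mylib.jar is only an example; extraClassPath does not copy anything,
    # so the JAR must already exist at this path on the driver node (first flag)
    # and on every worker node (second flag).
    spark-submit \
      --conf "spark.driver.extraClassPath=/opt/libs/mylib.jar" \
      --conf "spark.executor.extraClassPath=/opt/libs/mylib.jar" \
      --class com.example.MyApp \
      my-application.jar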

The classpath separator follows the same rules as the JVM:

  • Linux: a colon, :
    • e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"
  • Windows: a semicolon, ;
    • e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"

How the files are distributed depends on the mode under which you're running your job:

  • Client mode - Spark fires up a Netty HTTP server which distributes the files on startup to each of the worker nodes. You can see that when you start your Spark job:

    16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
    16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
    16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.mycode.jar with timestamp 1462728552732
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767
    

  • Cluster mode - In cluster mode Spark selects a leader Worker node on which to execute the Driver process. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JARs available to all the worker nodes via HDFS, S3, or other sources which are available to all nodes.
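
As an illustration (the HDFS paths below are hypothetical), a common pattern is to upload the JARs to HDFS first and then reference them by hdfs:// URIs when submitting in cluster mode:

    # Hypothetical locations; any storage that every node can read (HDFS, S3, NFS) works.
    hdfs dfs -mkdir -p /libs
    hdfs dfs -put additional1.jar additional2.jar /libs/

    spark-submit --deploy-mode cluster \
      --jars hdfs:///libs/additional1.jar,hdfs:///libs/additional2.jar \
      --class com.example.MyApp \
      hdfs:///apps/main-application.jar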

Accepted URIs for files

In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:

  • file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
  • hdfs:, http:, https:, ftp: - These pull down files and JARs from the URI as expected.
  • local: - A URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and it works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
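
For instance, a sketch of the local: scheme, assuming /opt/libs/heavy-dep.jar has already been placed at the same path on every worker node:

    # No network IO is incurred: each executor loads the JAR from its own filesystem.
    spark-submit \
      --jars local:/opt/libs/heavy-dep.jar \
      --class com.example.MyApp \
      main-application.jar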

As noted, JARs are copied to the working directory of each Worker node. Where exactly is that? It is usually under /var/run/spark/work, and you'll see them like this:

    drwxr-xr-x    3 spark spark   4096 May 15 06:16 app-20160515061614-0027
    drwxr-xr-x    3 spark spark   4096 May 15 07:04 app-20160515070442-0028
    drwxr-xr-x    3 spark spark   4096 May 15 07:18 app-20160515071819-0029
    drwxr-xr-x    3 spark spark   4096 May 15 07:38 app-20160515073852-0030
    drwxr-xr-x    3 spark spark   4096 May 15 08:13 app-20160515081350-0031
    drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172020-0032
    drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172045-0033
    

And when you look inside, you'll see all the JARs you deployed:

    [*@*]$ cd /var/run/spark/work/app-20160508173423-0014/1/
    [*@*]$ ll
    total 89988
    -rwxr-xr-x 1 spark spark   801117 May  8 17:34 awscala_2.10-0.5.5.jar
    -rwxr-xr-x 1 spark spark 29558264 May  8 17:34 aws-java-sdk-1.10.50.jar
    -rwxr-xr-x 1 spark spark 59466931 May  8 17:34 com.mycode.code.jar
    -rwxr-xr-x 1 spark spark  2308517 May  8 17:34 guava-19.0.jar
    -rw-r--r-- 1 spark spark      457 May  8 17:34 stderr
    -rw-r--r-- 1 spark spark        0 May  8 17:34 stdout
    

Affected options:

The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

So make sure you set those values in the proper places, so you won't be surprised when one takes priority over the other.
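
To make that precedence concrete, here is a hedged sketch (the property values are made up): a --conf flag overrides the same key in spark-defaults.conf, and a value set on SparkConf in the application code would override both:

    # In spark-defaults.conf (lowest of the three precedence levels):
    #   spark.executor.extraClassPath  /opt/libs/old-dep.jar

    # The flag below wins over the defaults file for this submission; a value set
    # directly on SparkConf inside the application would in turn win over the flag.
    spark-submit \
      --conf "spark.executor.extraClassPath=/opt/libs/new-dep.jar" \
      --class com.example.MyApp \
      main-application.jar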

Let's analyze each option:

  • --jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and one via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath; you'll need to explicitly add them using the extraClassPath config on both.
  • SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes which isn't a run-time dependency in your code.
  • --conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases; it doesn't matter which one you choose.
  • --conf spark.driver.extraLibraryPath=... or --driver-library-path: Same as above, aliases.
  • --conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an uber JAR (for example, because there are compile-time conflicts between library versions) and which you need to load at runtime.
  • --conf spark.executor.extraLibraryPath=...: This is passed as the java.library.path option for the JVM. Use this when you need a library path visible to the JVM (see the sketch below).
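
For the library-path options, a sketch (the directory /opt/native/lib is hypothetical) of exposing native libraries to both the driver and the executors:

    # /opt/native/lib would hold native .so files (e.g. JNI dependencies) needed at
    # runtime; it ends up on java.library.path rather than on the classpath.
    spark-submit \
      --conf "spark.driver.extraLibraryPath=/opt/native/lib" \
      --conf "spark.executor.extraLibraryPath=/opt/native/lib" \
      --class com.example.MyApp \
      main-application.jar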

Would it be safe to assume that, for simplicity, I can add additional application JAR files using the 3 main options at the same time?

You can safely assume this only for Client mode, not Cluster mode, as I've previously said. Also, the example you gave has some redundant arguments. For example, passing JARs to --driver-library-path is useless; you need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, what you want to do when you deploy external JARs on both the driver and the workers is:

    spark-submit --jars additional1.jar,additional2.jar \
      --driver-class-path additional1.jar:additional2.jar \
      --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
      --class MyClass main-application.jar