
Running an app jar file with spark-submit on a Google Dataproc cluster instance

Updated: 2021-09-02 17:58:55

As you've found, Dataproc includes Hadoop dependencies on the classpath when invoking Spark. This is done primarily so that using Hadoop input formats, file systems, etc. is fairly straightforward. The downside is that you will end up with Hadoop's Guava version, which is 11.0.2 (see HADOOP-10101).
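
As a minimal sketch of the failure mode, assuming your application is compiled against a newer Guava (15 or later): Stopwatch.createStarted() exists in Guava 15.0+ but not in 11.0.2, so a build like the one below compiles cleanly and then breaks only at runtime on the cluster.

  import com.google.common.base.Stopwatch

  object GuavaConflictDemo {
    def main(args: Array[String]): Unit = {
      // Stopwatch.createStarted() was added in Guava 15.0. It compiles
      // against your declared Guava dependency, but on Dataproc the
      // Hadoop classpath's Guava 11.0.2 is loaded instead, so this
      // line fails with:
      //   java.lang.NoSuchMethodError:
      //     com.google.common.base.Stopwatch.createStarted()
      val watch = Stopwatch.createStarted()
      println(s"elapsed: $watch")
    }
  }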

How to work around this depends on your build system. If using Maven, the maven-shade plugin can be used to relocate your version of Guava under a new package name. An example of this can be seen in the GCS Hadoop Connector's packaging, but the crux of it is the following plugin declaration in your pom.xml build section:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <relocations>
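            <!-- Rewrite com.google.common classes into a private package
                 so they cannot collide with Hadoop's Guava 11.0.2. -->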
            <relocation>
              <pattern>com.google.common</pattern>
              <shadedPattern>your.repackaged.deps.com.google.common</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </execution>
    </executions>
  </plugin>

Similar relocations can be accomplished with the sbt-assembly plugin for sbt, jarjar for Ant, and either jarjar or the Shadow plugin for Gradle. For sbt, the equivalent of the relocation above is sketched below.
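
A minimal build.sbt sketch, assuming sbt-assembly is already enabled in project/plugins.sbt; the shaded package name is arbitrary and simply mirrors the maven-shade example above:

  // In build.sbt. Newer sbt-assembly versions spell this setting as
  // assembly / assemblyShadeRules instead of "in assembly".
  assemblyShadeRules in assembly := Seq(
    // Rewrite every class under com.google.common into a private
    // package inside the fat jar, so Hadoop's Guava 11.0.2 on the
    // Dataproc classpath can no longer shadow your version.
    ShadeRule.rename(
      "com.google.common.**" -> "your.repackaged.deps.com.google.common.@1"
    ).inAll
  )

Either way, the assembled jar you hand to spark-submit carries its own relocated copy of Guava, and Hadoop's copy on the cluster is left untouched.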