How do I use Python UDFs with Pig in Elastic MapReduce?

Updated: 2021-09-20 00:14:01

After quite a few wrong turns, I found that, at least on the Elastic MapReduce implementation of Hadoop, Pig seems to ignore the CLASSPATH environment variable. Instead, I could control the class path using the HADOOP_CLASSPATH variable.

Once I realized that, it was fairly easy to get things set up to use Python UDFs:

  • Install Jython
    • sudo apt-get install jython -y -qq
  • Set the HADOOP_CLASSPATH environment variable
    • export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
    • jython.jar ensures that Hadoop can find the PyException class
    • antlr-runtime-3.2.jar ensures that Hadoop can find the CharStream class
  • Create the cache directory for Jython
    • sudo mkdir /usr/share/java/cachedir/
    • sudo chmod a+rw /usr/share/java/cachedir
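
If Pig still cannot find the Jython classes after these steps, it may help to confirm that every entry on HADOOP_CLASSPATH actually exists on the node and that the cache directory is writable. The following is only a rough sketch of such a check: the function name `missing_classpath_entries` is made up for illustration, and the cache directory path is simply the one used in the steps above.

```python
import os


def missing_classpath_entries(classpath, exists=os.path.exists):
    """Return the entries of a colon-separated classpath that are not on disk.

    `exists` can be swapped out so the function is testable without real files.
    """
    entries = [entry for entry in classpath.split(":") if entry]
    return [entry for entry in entries if not exists(entry)]


if __name__ == "__main__":
    # Report any jar listed in HADOOP_CLASSPATH that is missing on this node.
    for path in missing_classpath_entries(os.environ.get("HADOOP_CLASSPATH", "")):
        print("missing from disk: " + path)

    # Path taken from the setup steps above.
    cachedir = "/usr/share/java/cachedir"
    if not os.access(cachedir, os.W_OK):
        print("cache directory not writable: " + cachedir)
```

Running the script on the master node after the export prints nothing when both jars are present and the cache directory is writable.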

I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem:

  • Setting the CLASSPATH and PIG_CLASSPATH environment variables doesn't seem to do anything.
  • The .py file containing the UDF does not need to be included in the HADOOP_CLASSPATH environment variable.
  • The path to the .py file used in the Pig register statement may be relative or absolute; it doesn't seem to matter.
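
For reference, a UDF file of the kind discussed here might look roughly like the sketch below. The file name `udfs.py`, the namespace `udfs`, the function `shout`, and the relation and field names in the comments are all hypothetical. The sketch assumes, as the Pig documentation's Jython examples suggest, that Pig makes the `outputSchema` decorator available without an import; the fallback definition exists only so the file can also be loaded outside Pig.

```python
# udfs.py (hypothetical file name)
#
# In the Pig script, with illustrative relation and field names:
#   REGISTER 'udfs.py' USING jython AS udfs;
#   shouted = FOREACH lines GENERATE udfs.shout(line);

try:
    outputSchema  # assumed to be supplied by Pig's Jython script engine
except NameError:
    # Fallback so the file can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("shouted:chararray")
def shout(text):
    """Upper-case a chararray, passing nulls through unchanged."""
    if text is None:
        return None
    return text.upper()
```

Because of the fallback, the function can be exercised in a plain Python interpreter before the script is submitted to the cluster.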