
Multiple output files for Hadoop streaming with a Python mapper

Updated: 2021-12-25 09:01:05


You can do something like the following, but it involves a little Java compilation, which I think shouldn't be a problem if you want your use case done with Python anyway. As far as I know, it isn't directly possible from Python to skip the filename in the final output in a single job, as your use case demands. But what's shown below can make it possible with ease!

Here is the Java class that needs to be compiled -

package com.custom;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class CustomMultiOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    /**
     * Use the key as part of the path for the final output file.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        return new Path(key.toString(), leaf).toString();
    }

    /**
     * We discard the key, as per your requirement.
     */
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}

Compilation steps:

  1. Save the text above to a file named exactly (no different name) CustomMultiOutputFormat.java
  2. While you are in the directory where the saved file is, type -

$JAVA_HOME/bin/javac -cp $(hadoop classpath) -d . CustomMultiOutputFormat.java


Make sure JAVA_HOME is set to /path/to/your/SUNJDK before attempting the above command.


Make your custom.jar file using (type exactly) -

$JAVA_HOME/bin/jar cvf custom.jar com/custom/CustomMultiOutputFormat.class


Finally, run your job like -

hadoop jar /path/to/your/hadoop-streaming-*.jar -libjars custom.jar -outputformat com.custom.CustomMultiOutputFormat -file your_script.py -input inputpath -numReduceTasks 0 -output outputpath -mapper your_script.py


After doing this you should see two directories inside your outputpath, one named valid_file_name and the other err_file_name. All records tagged with valid_file_name will go to the valid_file_name directory, and all records tagged with err_file_name will go to the err_file_name directory.
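For completeness, here is a minimal sketch of what your_script.py could look like. The key it emits becomes the output directory via CustomMultiOutputFormat, and the key itself is dropped by generateActualKey(). The validity check (exactly three tab-separated fields) is just an assumption for illustration; substitute your own rule.

```python
#!/usr/bin/env python
# Hypothetical mapper sketch: tag each record with the directory it
# should land in (valid_file_name or err_file_name). The tag is emitted
# as the key; CustomMultiOutputFormat uses it as the path and then
# discards it from the final output.
import sys

def tag_record(line):
    """Return (directory_tag, record) for one input line."""
    record = line.rstrip("\n")
    # Assumed validity rule: a good record has exactly 3 tab-separated fields.
    if len(record.split("\t")) == 3:
        return "valid_file_name", record
    return "err_file_name", record

if __name__ == "__main__":
    for line in sys.stdin:
        tag, record = tag_record(line)
        # Streaming expects key<TAB>value on stdout.
        print("%s\t%s" % (tag, record))
```

Run locally with e.g. `cat input.txt | python your_script.py` to check the tags before submitting the streaming job.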


I hope all this makes sense.