
Hadoop Streaming with Python: Tracking Line Numbers

Updated: 2023-11-21 19:12:10

If your job is just to upper-case a single file, then Hadoop isn't really going to give you anything over streaming the file to a single machine, performing the upper-casing, and writing the contents back up to HDFS. Even with a huge file (say 1 TB), you would still need to funnel everything through a single reducer so that, when it is written back to HDFS, it is stored as one contiguous file.
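For reference, a streaming mapper is just a program that reads lines from stdin and writes lines to stdout. Below is a minimal sketch of such a mapper in Python; the name upper_mapper.py is a placeholder, and the line-number counter is only meaningful under the single-mapper-per-file configuration described next, where one mapper reads the whole file in order.

```python
#!/usr/bin/env python
"""upper_mapper.py -- minimal sketch of a Hadoop Streaming mapper.

Upper-cases every line and prefixes its line number. The counter is
only reliable when a single mapper reads the entire file in order,
as in the single-split setup described below.
"""
import sys

for line_number, line in enumerate(sys.stdin, start=1):
    # Tab-separated output: line number as the key, upper-cased text
    # (which keeps its trailing newline) as the value.
    sys.stdout.write("%d\t%s" % (line_number, line.upper()))
```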

In this case, I would configure your streaming job to have a single mapper per file (set the split minimum and maximum sizes to something huge, larger than the file itself), and run a map-only job.
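As a sketch of what that submission could look like, the command below disables reducers and sets the split minimum and maximum to 1 TiB so that any file smaller than that lands in exactly one mapper; the streaming jar path, split-size value, and HDFS paths are assumptions to adapt to your cluster:

```sh
# Map-only job: zero reducers, and split min/max raised above the file
# size so each input file is processed by exactly one mapper.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -D mapreduce.input.fileinputformat.split.minsize=1099511627776 \
    -D mapreduce.input.fileinputformat.split.maxsize=1099511627776 \
    -files upper_mapper.py \
    -input /user/example/input \
    -output /user/example/output \
    -mapper "python upper_mapper.py"
```

With zero reducers, each mapper's output is written straight to HDFS as its own part-m-NNNNN file, so a single mapper yields a single contiguous output file, which is exactly the property the single-reducer approach was needed for.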