
Hadoop Streaming with Python: Tracking Line Numbers

Updated: 2023-11-21 19:12:10

If your job is just to upper-case a single file, then Hadoop isn't really going to give you anything over streaming the file to a single machine, performing the upper-casing, and writing the contents back up to HDFS. Even with a huge file (say 1 TB), you would still need to funnel everything through a single reducer so that, when it is written back to HDFS, it is stored as one contiguous file.
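For reference, a streaming mapper is just a program that reads lines from stdin and writes lines to stdout. Below is a minimal sketch of such a mapper in Python; the name upper_mapper.py is a placeholder, and the line-number counter is only meaningful under the single-mapper-per-file configuration described next, where one mapper reads the whole file in order.

```python
#!/usr/bin/env python
"""upper_mapper.py -- minimal sketch of a Hadoop Streaming mapper.

Upper-cases every line and prefixes its line number. The counter is
only reliable when a single mapper reads the entire file in order,
as in the single-split setup described below.
"""
import sys

for line_number, line in enumerate(sys.stdin, start=1):
    # Tab-separated output: line number as the key, upper-cased text
    # (which keeps its trailing newline) as the value.
    sys.stdout.write("%d\t%s" % (line_number, line.upper()))
```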

In this case, I would configure your streaming job to have a single mapper per file (set the split minimum and maximum sizes to something huge, larger than the file itself), and run a map-only job.
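As a sketch of what that submission could look like, the command below disables reducers and sets the split minimum and maximum to 1 TiB so that any file smaller than that lands in exactly one mapper; the streaming jar path, split-size value, and HDFS paths are assumptions to adapt to your cluster:

```sh
# Map-only job: zero reducers, and split min/max raised above the file
# size so each input file is processed by exactly one mapper.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -D mapreduce.input.fileinputformat.split.minsize=1099511627776 \
    -D mapreduce.input.fileinputformat.split.maxsize=1099511627776 \
    -files upper_mapper.py \
    -input /user/example/input \
    -output /user/example/output \
    -mapper "python upper_mapper.py"
```

With zero reducers, each mapper's output is written straight to HDFS as its own part-m-NNNNN file, so a single mapper yields a single contiguous output file, which is exactly the property the single-reducer approach was needed for.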