且构网

Sharing things from programmers' development work...
Mapper input key-value pairs in Hadoop

Updated: 2022-01-26 08:42:31

The input to the mapper depends on which InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>.

If you do not change the InputFormat, using a Mapper with a Key-Value type signature other than <LongWritable, Text> will cause this error, because the types the framework hands to the Mapper no longer match the types it declares. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in Job setup:

job.setInputFormatClass(MyInputFormat.class);

And like I said, by default this is set to TextInputFormat.
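To make the type agreement concrete, here is a minimal sketch of a Mapper whose input types match what TextInputFormat delivers: the byte offset of each line as the key and the line itself as the value. The names DefaultInputExample, LineMapper and the job name are made up for illustration, and compiling it assumes the Hadoop MapReduce client libraries are on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical driver class; package-private only so the sketch fits in one file.
class DefaultInputExample {

    // Input types <LongWritable, Text> match TextInputFormat:
    // key = byte offset of the line in the file, value = the line's contents.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit the offset (as text) together with the unchanged line.
            context.write(new Text(offset.toString()), line);
        }
    }

    // Builds a Job whose InputFormat and Mapper agree on the input types.
    static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "default-input-example");
        job.setJarByClass(DefaultInputExample.class);
        // Setting TextInputFormat explicitly is redundant (it is the default)
        // but makes the pairing with LineMapper visible.
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(LineMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        return job;
    }
}
```

Changing LineMapper's first two type parameters to <Text, Text> while leaving the InputFormat untouched would reproduce the mismatch described above.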

Now, let's say your input data is a bunch of newline-separated records, each delimited by a comma:

  • "A,value1"
  • "B,value2"

If you want the input key-value pairs to the mapper to be ("A", "value1"), ("B", "value2"), you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy. There is an example here and probably a few examples floating around *** as well.
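Independent of Hadoop, the intended mapping from records to pairs can be sketched in plain Java. The class RecordPairs and its methods are hypothetical helpers, and splitting at the first comma only (so that later commas stay in the value) is an assumption, not something the text above specifies.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper showing how each record line becomes a (key, value) pair.
final class RecordPairs {

    // Splits one record at its first comma. A line without a comma
    // becomes the key, paired with an empty value.
    static Map.Entry<String, String> toPair(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, comma), line.substring(comma + 1));
    }

    // Splits newline-separated input into pairs, skipping empty lines.
    static List<Map.Entry<String, String>> parse(String input) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String line : input.split("\n")) {
            if (!line.isEmpty()) {
                pairs.add(toPair(line));
            }
        }
        return pairs;
    }
}
```

For the two records listed above, RecordPairs.parse("A,value1\nB,value2\n") yields the pairs ("A", "value1") and ("B", "value2").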

In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method (getRecordReader in the older org.apache.hadoop.mapred API), and have it return an instance of your custom RecordReader.

Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader in your custom RecordReader, and delegate all basic responsibilities to this instance. In the getCurrentKey and getCurrentValue methods you will implement the logic for extracting the comma-delimited Text contents by calling LineRecordReader#getCurrentValue and splitting it on the comma.
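Put together, the two classes might look roughly like the sketch below, written against the org.apache.hadoop.mapreduce API. The class names CommaSeparatedInputFormat and CommaSeparatedRecordReader are made up for illustration, and compiling it assumes the Hadoop MapReduce client libraries are on the classpath. As a small variation on the description above, it splits each line once in nextKeyValue and caches the result, so the two getters simply return the cached Text objects. It also assumes that only the first comma separates key from value.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat producing <Text, Text> pairs from "key,value" lines.
// Package-private only so the sketch fits in one file; typically this is public.
class CommaSeparatedInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
                                                       TaskAttemptContext context) {
        return new CommaSeparatedRecordReader();
    }

    // Delegates line reading to LineRecordReader and splits each line on a comma.
    public static class CommaSeparatedRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!lineReader.nextKeyValue()) {
                return false; // no more lines in this split
            }
            String line = lineReader.getCurrentValue().toString();
            int comma = line.indexOf(',');
            if (comma < 0) {
                // Assumption: a line without a comma becomes a key with an empty value.
                key.set(line);
                value.set("");
            } else {
                key.set(line.substring(0, comma));
                value.set(line.substring(comma + 1));
            }
            return true;
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public Text getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }
}
```

Because the reader reuses the same two Text instances for every record, a Mapper that needs to keep a key or value beyond the current map call should copy it first.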

Finally, set your new InputFormat as the Job's InputFormat, as shown after the second paragraph above.