Using Pig, how do I parse a mixed-format line into tuples and a bag of tuples?

Updated: 2021-10-29 23:29:06

Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag, and as far as I know there is no builtin to do this. It's a simple matter to write one, however.

You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.

The UDF is pretty simple. In Java, you can just do something like this in your exec method:

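// Body of exec(Tuple input): wrap each field of the incoming tuple
// in its own single-field tuple and collect those tuples in a bag.
DataBag b = new DefaultDataBag();
Tuple t = (Tuple) input.get(0);
for (int i = 0; i < t.size(); i++) {
    Object o = t.get(i);
    // TupleFactory provides newTuple(Object); it has no createTuple method.
    Tuple e = TupleFactory.getInstance().newTuple(o);
    b.add(e);
}

return b;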

Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.
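
To make the flatten, group, and count steps concrete, here is a rough plain-Java simulation of what the pipeline computes. It does not use Pig at all: lists stand in for tuples and bags, and the class name, method names, and sample key-value strings are all made up for illustration, since I don't know your actual input format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the Pig pipeline; all names here are hypothetical.
class KeyValueCountSketch {

    // Mirrors the UDF: each field of the "tuple" becomes its own
    // one-field row in the resulting "bag".
    static List<List<Object>> tupleToBag(List<Object> tuple) {
        List<List<Object>> bag = new ArrayList<>();
        for (Object field : tuple) {
            List<Object> row = new ArrayList<>();
            row.add(field);
            bag.add(row);
        }
        return bag;
    }

    // Flattens every bag, then groups on the key-value pair itself
    // and counts how many times each pair occurs.
    static Map<String, Long> countPairs(List<List<Object>> tuples) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<Object> tuple : tuples) {
            for (List<Object> row : tupleToBag(tuple)) {
                counts.merge(String.valueOf(row.get(0)), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample data: one list per input line, holding only the
        // key-value pairs (the fixed fields are ignored, as described above).
        List<List<Object>> tuples = Arrays.asList(
            Arrays.<Object>asList("color=red", "size=L"),
            Arrays.<Object>asList("color=red", "size=M"));
        System.out.println(countPairs(tuples));
    }
}
```

In Pig itself the grouping and counting would be done with GROUP and COUNT after FLATTEN; the loop in countPairs only imitates that result on in-memory data.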