且构网

Using Pig, how do I parse a mixed-format line into a tuple and a bag of tuples?

Updated: 2022-05-22 22:58:08

Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag; as far as I know, there is no built-in function that does this. Writing one is simple, however.

You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.

The UDF is pretty simple. In Java, you can just do something like this in your exec method:

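// Body of exec(Tuple input): wrap each field of the incoming tuple in its
// own one-field tuple and collect those tuples in a bag.
DataBag b = new DefaultDataBag();
Tuple t = (Tuple) input.get(0);  // the tuple to convert, e.g. the STRSPLIT result
for (int i = 0; i < t.size(); i++) {
    Object o = t.get(i);
    // newTuple(Object) builds a single-field tuple holding o
    Tuple e = TupleFactory.getInstance().newTuple(o);
    b.add(e);
}

return b;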

Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.
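
To make the shape of that pipeline concrete, here is a rough plain-Java analogue that runs without Pig. It is only a sketch under assumptions the question does not confirm: lines are taken to be tab-separated, the first fixedFieldCount fields are the fixed ones, and every remaining token is one key-value pair. The class and method names (KeyValueCount, pairsOf, countPairs) and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class KeyValueCount {

    // Rough analogue of the tuple-to-bag UDF followed by FLATTEN: drop the
    // fixed fields and return each remaining token as its own element.
    // Assumption: fields are separated by tabs.
    static List<String> pairsOf(String line, int fixedFieldCount) {
        String[] fields = line.split("\t");
        List<String> pairs = new ArrayList<>();
        for (int i = fixedFieldCount; i < fields.length; i++) {
            pairs.add(fields[i]);
        }
        return pairs;
    }

    // Rough analogue of grouping on the pair and counting each group.
    static Map<String, Long> countPairs(List<String> lines, int fixedFieldCount) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String pair : pairsOf(line, fixedFieldCount)) {
                counts.merge(pair, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two fixed fields, then key-value pairs.
        List<String> lines = Arrays.asList(
            "id1\t2022-01-01\tcolor=red\tsize=L",
            "id2\t2022-01-02\tcolor=red\tsize=M");
        System.out.println(countPairs(lines, 2));
    }
}
```

With the sample lines, color=red is counted twice and each size pair once. The fixed fields never enter the count, which mirrors the point above that only the key-value pairs need to be kept.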