Updated: 2023-01-11 16:29:37
I have a Pig script that needs to load files from a local Hadoop cluster. I can list the files with the Hadoop command hadoop fs -ls /repo/mydata, but when I try to load them in the Pig script, it fails. The LOAD statement looks like this:
in = LOAD '/repo/mydata/2012/02' USING PigStorage() AS (event:chararray, user:chararray)
The error message is:
Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: file:/repo/mydata/2012/02
Any ideas? Thanks.
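My suggestion:
The file: prefix in the error message shows that Pig resolved the path against the local filesystem rather than HDFS, so load the data through a full hdfs:// URI, as follows.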
Create a folder in HDFS: hadoop fs -mkdir /pigdata
Copy the file into that HDFS folder: hadoop fs -put /opt/pig/tutorial/data/excite-small.log /pigdata
(Alternatively, do it from the grunt shell: grunt> copyFromLocal /opt/pig/tutorial/data/excite-small.log /pigdata)
Execute the Pig Latin script:
grunt> set debug on
grunt> set job.name 'first-p2-job'
grunt> log = LOAD 'hdfs://hostname:54310/pigdata/excite-small.log' AS
(user:chararray, time:long, query:chararray);
grunt> grpd = GROUP log BY user;
grunt> cntd = FOREACH grpd GENERATE group, COUNT(log);
grunt> STORE cntd INTO 'output';
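Because 'output' is a relative path, the results are written to a directory named output under your current HDFS working directory (typically your HDFS home directory, /user/<username>). To store them at hdfs://hostname:54310/pigdata/output instead, give that full path in the STORE statement.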
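To sanity-check what the script should produce, it can help to reproduce the GROUP ... BY user / COUNT(log) step on a few lines locally. The following is only a rough Python sketch, not part of Pig: the function name and the sample rows are made up for illustration, and it assumes tab-separated user, time, query records, which is what the LOAD statement above expects with the default PigStorage delimiter.

```python
from collections import Counter


def count_queries_per_user(lines):
    """Approximate GROUP log BY user followed by COUNT(log)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        user = fields[0]
        if not user:
            # Simplification: rows with an empty user are skipped here,
            # whereas Pig would place them in a group with a null key.
            continue
        counts[user] += 1
    return dict(counts)


# Hypothetical sample rows in user<TAB>time<TAB>query layout.
sample = [
    "u1\t970916105432\tyahoo chat\n",
    "u2\t970916001949\tbrazil soccer\n",
    "u1\t970916105500\tyahoo mail\n",
]

print(count_queries_per_user(sample))
```

Each key/value pair corresponds to one (group, count) tuple that STORE cntd would write out, so a small local run like this gives a quick reference for comparing against the part files in the output directory.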