
How to load files on a Hadoop cluster using Apache Pig?

Updated: 2023-01-11 16:29:37


I have a Pig script and need to load files from my local Hadoop cluster. I can list the files using the hadoop command: hadoop fs -ls /repo/mydata, but when I tried to load the files in the Pig script, it failed. The load statement is like this:

in = LOAD '/repo/mydata/2012/02' USING PigStorage() AS (event:chararray, user:chararray)

The error message is:

Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: file:/repo/mydata/2012/02

Any idea? Thanks.
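One thing worth noting before the suggested steps: the file: scheme in that error means Pig resolved the path against the local filesystem rather than HDFS, which is what happens when Pig runs in local mode (pig -x local) or is not pointed at the cluster. Since hadoop fs -ls /repo/mydata works, the data is already in HDFS, so a minimal sketch of the fix is to start Pig in mapreduce mode, or to spell out the HDFS URI in the LOAD (the hostname:54310 namenode address is only a placeholder reused from the example below):

       -- start the shell against the cluster so bare paths resolve in HDFS:
       --   pig -x mapreduce
       -- or make the location explicit (namenode address is a placeholder):
       in = LOAD 'hdfs://hostname:54310/repo/mydata/2012/02'
            USING PigStorage() AS (event:chararray, user:chararray);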

My suggestion:

  1. Create a folder in HDFS: hadoop fs -mkdir /pigdata

  2. Load the file into the created HDFS folder: hadoop fs -put /opt/pig/tutorial/data/excite-small.log /pigdata

(or you can do it from the grunt shell: grunt> copyFromLocal /opt/pig/tutorial/data/excite-small.log /pigdata; a quick check that the upload landed as expected is sketched below)
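For reference, a quick way to confirm the upload landed where the LOAD will look for it (same tutorial paths as above; substitute your own):

       hadoop fs -mkdir /pigdata
       hadoop fs -put /opt/pig/tutorial/data/excite-small.log /pigdata
       # list the folder to verify the file is there before running the script
       hadoop fs -ls /pigdata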

  3. Execute the Pig Latin script:

       grunt> set debug on
    
       grunt> set job.name 'first-p2-job'
    
       grunt> log = LOAD 'hdfs://hostname:54310/pigdata/excite-small.log' AS 
                  (user:chararray, time:long, query:chararray); 
       grunt> grpd = GROUP log BY user; 
       grunt> cntd = FOREACH grpd GENERATE group, COUNT(log); 
       grunt> STORE cntd INTO 'output';
    

  4. The output will be stored under hdfs://hostname:54310/pigdata/output
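Once the script finishes, the result can be inspected straight from HDFS; Pig writes the output as one or more part files inside that directory (names such as part-r-00000 are typical, hence the glob):

       hadoop fs -ls hdfs://hostname:54310/pigdata/output
       hadoop fs -cat hdfs://hostname:54310/pigdata/output/part-*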