Updated: 2023-09-29 22:41:28
I need to run a Hadoop jar job from Python using Luigi. I searched and found examples of writing a mapper and reducer in Luigi, but nothing that runs a Hadoop jar directly.
The jar is already compiled and I need to run it as it is. How can I do that?
You need to use the luigi.contrib.hadoop_jar module. In particular, you need to extend HadoopJarJobTask and override jar(), main() and args(). For example:
from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget

class TextExtractorTask(HadoopJarJobTask):
    def output(self):
        return HdfsTarget('data/processed/')

    def jar(self):
        return 'jobfile.jar'

    def main(self):
        return 'com.ololo.HadoopJob'

    def args(self):
        return ['--param1', '1', '--param2', '2']
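
To make it easier to see how the three overridden methods relate to the job that gets launched, here is a minimal sketch in plain Python (Luigi is not needed to run it). It assumes the task ends up invoking roughly `hadoop jar <jar> <main class> <args...>`. The helper name `build_hadoop_command` and the `hadoop` executable name are illustrative assumptions, not part of Luigi's API, and the real runner also deals with job configuration and path handling.

```python
def build_hadoop_command(jar, main, args, hadoop_bin="hadoop"):
    """Assemble an approximate `hadoop jar` command line (hypothetical helper)."""
    cmd = [hadoop_bin, "jar", jar]
    if main:
        # The main class is only appended when one is given.
        cmd.append(main)
    cmd.extend(str(a) for a in args)
    return cmd


# Same values as returned by jar(), main() and args() in the task above.
command = build_hadoop_command(
    jar="jobfile.jar",
    main="com.ololo.HadoopJob",
    args=["--param1", "1", "--param2", "2"],
)
print(" ".join(command))
```

Printing the command like this can be a handy sanity check before wiring the values into a real task.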
You can also make building the jar file with Maven part of the workflow:
import subprocess

import luigi
from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget

class BuildJobTask(luigi.Task):
    def output(self):
        return luigi.LocalTarget('target/jobfile.jar')

    def run(self):
        # check_call raises if Maven exits with a non-zero status,
        # so a failed build also fails the Luigi task.
        subprocess.check_call(['mvn', 'clean', 'package', '-DskipTests'])

class YourHadoopTask(HadoopJarJobTask):
    def requires(self):
        return BuildJobTask()

    def output(self):
        return HdfsTarget('data/processed/')

    def jar(self):
        # Path of the jar produced by BuildJobTask.
        return self.input().path

    def main(self):
        return 'com.ololo.HadoopJob'

    def args(self):
        return ['--param1', '1', '--param2', '2']
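
One thing to watch with the build step: Luigi decides whether a task is complete by checking that its output exists, so it helps if a failed build raises instead of passing silently. The following is a minimal sketch of that idea using only the standard library. The helper `run_build` is hypothetical, and a small Python one-liner stands in for the `mvn` command so the sketch runs without Maven installed.

```python
import os
import subprocess
import sys
import tempfile


def run_build(command, expected_artifact):
    """Run a build command and verify its artifact exists (hypothetical helper)."""
    # check_call raises CalledProcessError on a non-zero exit status.
    subprocess.check_call(command)
    if not os.path.exists(expected_artifact):
        raise RuntimeError("build finished but %s is missing" % expected_artifact)
    return expected_artifact


# Stand-in for `mvn clean package -DskipTests`: a tiny Python command that
# creates the artifact file at the path passed as its first argument.
workdir = tempfile.mkdtemp()
artifact = os.path.join(workdir, "jobfile.jar")
fake_build = [sys.executable, "-c", "import sys; open(sys.argv[1], 'w').close()", artifact]
built = run_build(fake_build, artifact)
print(built)
```

Inside `BuildJobTask.run`, the same idea would look like `run_build(['mvn', 'clean', 'package', '-DskipTests'], self.output().path)`.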