Updated: 2022-10-23 19:40:27
Could somebody please clarify the expected behavior when using save_main_session with custom modules imported in __main__. My Dataflow pipeline imports two non-standard modules: one via requirements.txt and another via setup_file. Unless I move the imports into the functions where they are used, I keep getting import/pickling errors (sample error below). From the documentation, I assumed that setting save_main_session would solve this problem, but it does not. So I wonder if I missed something or this behavior is by design. The same import works fine when placed inside a function.
Error:
File "/usr/lib/python2.7/pickle.py", line 1130, in find_class __import__(module) ImportError: No module named jmespath
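The import-inside-function workaround the question describes can be sketched like this (a minimal, hypothetical example: `json` stands in for a third-party module such as jmespath, and `extract_name` is a made-up helper, not code from the question's pipeline):

```python
def extract_name(record):
    # Importing inside the function means the worker resolves the module
    # at call time, after its own environment has been set up, rather
    # than when the pickled main session is unpickled. (json stands in
    # here for a third-party module like jmespath.)
    import json
    return json.loads(record)["name"]


print(extract_name('{"name": "alice"}'))  # → alice
```

The trade-off is repeating the import in every function that needs the module, which is exactly what save_main_session is meant to avoid.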
https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
When to use --save_main_session: you can set the --save_main_session pipeline option to True. This will cause the state of the global namespace to be pickled and loaded on the Cloud Dataflow worker.
The setup that works best for me is having a dataflow_launcher.py sitting at the project root next to your setup.py. The only thing it does is import your pipeline file and launch it. Use setup.py to handle all your dependencies. This is the best example I've found so far:
https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset
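A minimal version of that layout might look like the following (a hedged sketch, not the linked example verbatim; the package name `my-pipeline`, the module name `my_pipeline`, and its `run` function are illustrative assumptions):

```
# --- setup.py ---
# Declares the pipeline package and its dependencies so Dataflow
# workers can build and install them.
import setuptools

setuptools.setup(
    name="my-pipeline",
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=["jmespath"],  # third-party deps go here
)

# --- dataflow_launcher.py ---
# Does nothing except import the pipeline module and launch it.
from my_pipeline import run

if __name__ == "__main__":
    run()
```

Keeping the launcher this thin means nothing non-trivial lives in __main__, which sidesteps most of the main-session pickling problems described above.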