
Google DataFlow/Python: import errors with save_main_session and custom modules in __main__

Updated: 2022-10-23 19:40:27

Could somebody please clarify the expected behavior when using save_main_session with custom modules imported in __main__? My Dataflow pipeline imports two non-standard modules: one via requirements.txt and the other via setup_file. Unless I move the imports into the functions where they are used, I keep getting import/pickling errors; a sample error is below. From the documentation I assumed that setting save_main_session would solve this problem, but it does not (see the error below). So I wonder whether I missed something or whether this behavior is by design. The same import works fine when placed inside a function.

Error:

  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named jmespath
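
For comparison, the workaround mentioned in the question (moving the import into the function that uses it) looks roughly like this. This is a minimal sketch; the DoFn and the jmespath expression are illustrative, not from the original pipeline:

    import apache_beam as beam

    class ExtractField(beam.DoFn):
        def process(self, element):
            # Importing inside process() means the import runs on the worker
            # after the code is unpickled, so the module does not need to be
            # importable at pickling time in __main__ on the launcher.
            import jmespath
            yield jmespath.search('foo.bar', element)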

https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

When to use --save_main_session:

You can set the --save_main_session pipeline option to True. This will cause the state of the global namespace to be pickled and loaded on the Cloud Dataflow workers.
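
For reference, setting the option programmatically looks roughly like this (a minimal sketch; the pipeline body is omitted):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions()  # typically built from command-line arguments
    # Programmatic equivalent of passing --save_main_session on the command line.
    options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=options) as p:
        pass  # construct the pipeline here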

The setup that works best for me is having a dataflow_launcher.py sitting at the project root alongside your setup.py. The only thing it does is import your pipeline file and launch it. Use setup.py to handle all your dependencies. This is the best example I've found so far:

https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset
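
A minimal sketch of that layout, assuming the package is named my_pipeline (all names here are illustrative, not from the juliaset example):

    # Project layout:
    #   dataflow_launcher.py   <- entry point, kept free of pipeline code
    #   setup.py               <- declares the package and its dependencies
    #   my_pipeline/
    #       __init__.py
    #       pipeline.py        <- defines run()

    # dataflow_launcher.py
    from my_pipeline import pipeline

    if __name__ == '__main__':
        pipeline.run()

    # setup.py
    import setuptools

    setuptools.setup(
        name='my_pipeline',
        version='0.0.1',
        packages=setuptools.find_packages(),
        install_requires=['jmespath'],  # installed on the Dataflow workers
    )

Launching with --setup_file ./setup.py then ships the package to the workers, so imports resolve there without relying on pickled __main__ state.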