
Apache Airflow or Apache Beam for data processing and job scheduling


The other answers are quite technical and hard to understand. I was in your position before, so I'll explain in simple terms.

Airflow can do anything. It has BashOperator and PythonOperator, which means it can run any bash script or any Python script.
It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, in an easy-to-view-and-use UI.
Also, it is easy to set up, and everything is in familiar Python code.
Building pipelines in an organized manner (i.e., using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts scattered all over the place.
Nowadays (roughly from 2020 onwards), we call it an orchestration tool.
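To make that concrete, here is a minimal sketch of what an Airflow DAG looks like (Airflow 2.x API assumed); the DAG name, task names, and the trivial bash/Python bodies are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Any Python code can run here: pandas, API calls, and so on.
    print("transforming data...")


with DAG(
    dag_id="example_pipeline",        # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",       # Airflow handles the cron-style scheduling
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'any bash script can run here'",
    )
    process = PythonOperator(
        task_id="transform",
        python_callable=transform,
    )
    extract >> process  # dependency: extract must finish before transform starts
```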

Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink, etc.) out there.
The intent is that you learn just Beam and can then run it on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
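To make the "write once, pick a runner" idea concrete, here is a minimal word-count sketch using the Beam Python SDK; the input and output file names are placeholders, and only the runner option changes between backends:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes locally; swap in SparkRunner, FlinkRunner, or
# DataflowRunner to run the identical pipeline on another backend.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.txt")       # placeholder input
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("counts")          # placeholder output
    )
```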

Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
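As a rough illustration of how the runner is selected, the same pipeline code can be pointed at Dataflow just by changing the pipeline options; the project ID, region, and bucket below are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Same pipeline code as before; only the options change.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder GCP project ID
    region="us-central1",                # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
)
```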

GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster on Google Kubernetes Engine (GKE).

So you can have:

  • a manual Airflow implementation, doing the data processing on the instance itself (if your data is small, or your instance is powerful enough, you can process data on the machine running Airflow; this is why many are confused about whether Airflow can process data or not)
  • a manual Airflow implementation calling Beam jobs
  • Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow (see the first sketch after this list)
  • Cloud Composer running data processing containers in Composer's own Kubernetes cluster environment, using Airflow's KubernetesPodOperator (KPO)
  • Cloud Composer running data processing containers with Airflow's KPO, but this time in a better-isolated fashion: create a new node pool and specify that the KPO pods are to run in that node pool (see the second sketch after this list)
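For the Composer-calling-Dataflow option, here is a hedged sketch of an Airflow task submitting a Beam pipeline to Cloud Dataflow via the apache-beam provider package; the GCS paths, project ID, region, and job name are all placeholders:

```python
from airflow.providers.apache.beam.operators.beam import (
    BeamRunPythonPipelineOperator,
)
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowConfiguration,
)

run_on_dataflow = BeamRunPythonPipelineOperator(
    task_id="run_on_dataflow",
    py_file="gs://my-bucket/pipelines/wordcount.py",  # placeholder pipeline file
    runner="DataflowRunner",
    pipeline_options={"temp_location": "gs://my-bucket/tmp"},
    dataflow_config=DataflowConfiguration(
        job_name="wordcount-job",      # placeholder job name
        project_id="my-gcp-project",   # placeholder project
        location="us-central1",
    ),
)
```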
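For the last two options, here is a hedged sketch of the KubernetesPodOperator; the image, namespace, and node-pool name are placeholders, and the node_selector line is what pins the pod to a dedicated node pool for the better-isolated variant:

```python
# Import path can vary with the cncf-kubernetes provider version.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

process = KubernetesPodOperator(
    task_id="process_data",
    name="process-data",
    namespace="default",
    image="gcr.io/my-project/my-processor:latest",  # placeholder image
    cmds=["python", "process.py"],                  # placeholder entrypoint
    # Pin the pod to a dedicated GKE node pool for better isolation;
    # "data-processing-pool" is a placeholder node-pool name.
    node_selector={"cloud.google.com/gke-nodepool": "data-processing-pool"},
)
```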

My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement); you should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience with Airflow, if you want to be a data engineer you should probably learn it.
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself and managing the Airflow webserver and scheduler processes.