Updated: 2022-03-11 19:44:41
The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.
Airflow can do anything. It has BashOperator and PythonOperator, which means it can run any bash script or any Python script.

It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, in an easy-to-view-and-use UI.

Also, it is easy to set up, and everything is in familiar Python code.

Doing pipelines in an organized manner (i.e. using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts scattered all over the place.
Nowadays (roughly year 2020 onwards), we call it an orchestration tool.
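To make the BashOperator/PythonOperator point concrete, here is a minimal sketch of an Airflow DAG definition. The dag_id, task ids, and the print_date callable are illustrative names I've made up, not anything from the original answer; the operator imports are the Airflow 2.x ones.

```python
# Minimal Airflow 2.x DAG sketch: one bash task, one Python task.
# All names (example_pipeline, extract, transform, print_date) are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_date():
    # Any plain Python function can become a task via PythonOperator.
    print(datetime.utcnow().isoformat())


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting...'")
    transform = PythonOperator(task_id="transform", python_callable=print_date)

    extract >> transform  # extract runs first, then transform
```

The `>>` operator is how you wire up task dependencies, which is what makes the pipeline a DAG rather than a pile of standalone cron scripts.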
Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink, etc.) out there.

The intent is that you learn just Beam and can then run your pipeline on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
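A sketch of that "write once, pick a runner" idea, using the Beam Python SDK. The pipeline itself is a made-up toy example; the point is that only the `runner` option changes between backends.

```python
# Beam pipeline sketch: the same code runs locally (DirectRunner) or on
# Spark/Flink/Dataflow by swapping the runner option. The pipeline steps
# here are illustrative, not from the original answer.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for "DataflowRunner" (plus GCP project/region options)
# to run the identical pipeline on Cloud Dataflow.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )
```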
Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).
So you can:
- run your data processing containers in Composer's Kubernetes cluster environment using the KubernetesPodOperator (KPO), or
- use the KPO, but this time in a better isolated fashion, by creating a new node-pool and specifying that the KPO pods are to be run in the new node-pool.
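A rough sketch of the second, better-isolated option. The image name, namespace, and node-pool name are all hypothetical placeholders; the `node_selector` label key is the one GKE uses for node pools, but check your provider version since older releases of the cncf-kubernetes provider spelled this parameter differently.

```python
# KubernetesPodOperator sketch: pin the data-processing pod to a dedicated
# node-pool so it doesn't compete with Airflow's own pods.
# All concrete values (image, namespace, pool name) are illustrative.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

process = KubernetesPodOperator(
    task_id="process_data",
    name="process-data",
    namespace="default",
    image="gcr.io/my-project/my-processing-image:latest",  # hypothetical image
    # GKE labels nodes with their node-pool name; selecting on that label
    # forces the pod onto the dedicated pool.
    node_selector={"cloud.google.com/gke-nodepool": "data-processing-pool"},
)
```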
My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement); you should use it for your data pipelines whenever possible.

Also, since many companies are looking for experience with Airflow, if you're looking to be a data engineer you should probably learn it.

Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself and managing the airflow webserver and scheduler processes.