Apache Airflow is getting pretty popular these days (understatement), so I figured I'd take some time to explain what it is, how to install it, and shed some light on how it all works. It's awesome, trust me.
What is Airflow?
Apache Airflow is a workflow manager similar to Luigi or Oozie. As with any job scheduler, you define a schedule and the work to be done, and Airflow takes care of the rest. Where Airflow shines, though, is in how everything works together.
Airflow workflows are written in Python code. Oh, and another thing: "workflows" in Airflow are known as "DAGs", or "Directed Acyclic Graphs". What does that mean? Directed == the work only flows in one direction, i.e. there's a distinct start and end. Acyclic == there aren't any loops in the workflow. Graph == the workflow can be abstracted into nodes and edges, i.e. the nodes are where the work happens and the edges are the dependencies between the various tasks.
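To make the nodes-and-edges idea concrete, here's a minimal sketch (the DAG and task names are made up, and it uses the same Airflow 1.x-style imports as the example below): three do-nothing tasks are the nodes, and the >> operator draws the directed edges between them.

```python
import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

# A toy graph: three placeholder tasks, wired in one direction with no loops
dag = DAG("toy_graph", start_date=airflow.utils.dates.days_ago(1))

extract = DummyOperator(task_id="extract", dag=dag)
transform = DummyOperator(task_id="transform", dag=dag)
load = DummyOperator(task_id="load", dag=dag)

# Directed edges: extract -> transform -> load
extract >> transform >> load
```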
Assuming you have everything installed, an Airflow DAG is super easy to create and schedule:
```python
#!/usr/bin/env python
from datetime import timedelta

import airflow
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

# 1. define arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@hadoopsters.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# 2. define functionality
def hello_world():
    return "Hello World!"

# 3. create dag
dag = DAG("example_dag", default_args=default_args, schedule_interval="@daily")

# 4. create the task and attach it to the dag
my_first_operator = PythonOperator(
    task_id="first_task",
    python_callable=hello_world,
    dag=dag,
)
```
While it’s not a slick one-liner like some things can be in Python, this example is a full-fledged Airflow DAG. Heck, the majority of those lines is just configuration you don’t strictly need, but I wanted to include it.
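Quick usage note (assuming a default Airflow setup): save this as a .py file in your DAGs folder ($AIRFLOW_HOME/dags, which defaults to ~/airflow/dags), and the scheduler will pick it up the next time it parses that folder. You'll see "example_dag" show up in the web UI once the scheduler and webserver are running.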
Here’s what it does:
- It will be run every day at midnight
- It will execute arbitrary Python code via the “hello_world” function
- It will automatically be retried once in the event of failure, after waiting 5 minutes
- It can send an email in the event of a failure (toggled off here via “email_on_failure”)
- It will keep historical logs of its runs
- It will backfill the “days it’s missed” (two days ago, in this case); if you’d rather it didn’t, see the sketch after this list
All of this functionality in four simple steps, all in Python.
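One note on that backfill behavior: if you don’t want Airflow to catch up on missed runs, the DAG constructor takes a catchup flag. A quick sketch (the DAG name here is made up, and it reuses the default_args dict from the example above):

```python
from airflow.models import DAG

# Same idea as the example above, but catchup=False tells the scheduler to
# skip the automatic backfill and only schedule the most recent interval.
dag = DAG(
    "example_dag_no_backfill",   # hypothetical name, for illustration
    default_args=default_args,   # the same dict defined in the example above
    schedule_interval="@daily",
    catchup=False,
)
```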
Here’s the best part:
Version Control. With Airflow, your workflows are now trackable code! You can check them in, make branches, plug them into CI/CD frameworks, conduct code reviews… you get the idea. This is so powerful and should make any proper Hadoopster very excited.
More to come soon, stay tuned!