Planning an Apache Airflow Deployment

This is part one of a five-part series addressing Airflow at an enterprise scale. I will update these with links as they are published.

  • Airflow: Planning a Deployment

Apache Airflow is a platform for authoring, scheduling, and monitoring ETL workflows using a diverse range of task-based operators. It features well-thought-out secret and variable management, flexible Jinja templating, and enterprise-friendly auth options such as LDAP, OAuth, and OIDC. The KubernetesExecutor, which we will focus on here, executes each task as a Kubernetes Pod.

The Scenario

Let’s assume that our employer has tasked us with estimating the effort for introducing Airflow to our organization. We have a few Kubernetes clusters, and some of our existing ETLs run on those clusters as Python, Spark, and Scala applications. Another team is working on a POC implementation of Snowflake, so we need to be able to support that when the time comes. To be safe, we will assume that there are some zombie bash and SQL scripts that we will find along our journey. We will also need to authenticate and authorize our users and collect logs for both the Airflow application and its tasks.

Architecture

An Airflow cluster consists of three services: the scheduler, the web server, and one or more workers. The web server provides a UI and a resource API. The scheduler polls the database for task state changes and manages the lifecycle of tasks, including execution and retries. The workers perform the actual execution of each task and ensure logs are moved to the scheduler. All three depend on a database, which is required and should be HA/DR capable.
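To make that division of labor concrete, here is a toy, in-memory model of the polling loop. It is not Airflow's scheduler code: the `TaskState` enum, the `Task` dataclass, the dict standing in for the metadata database, and the `run_task` callable standing in for a worker are all names invented for this illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class TaskState(Enum):
    QUEUED = "queued"
    SUCCESS = "success"
    UP_FOR_RETRY = "up_for_retry"
    FAILED = "failed"


@dataclass
class Task:
    task_id: str
    max_retries: int = 1  # retries allowed after the first attempt
    attempts: int = 0
    state: TaskState = TaskState.QUEUED


def scheduler_pass(db: Dict[str, Task], run_task: Callable[[Task], bool]) -> None:
    """One polling pass: read state from the 'database', dispatch, write state back."""
    for task in db.values():
        if task.state not in (TaskState.QUEUED, TaskState.UP_FOR_RETRY):
            continue  # finished tasks need no further scheduling
        task.attempts += 1
        if run_task(task):  # the stand-in worker reports success or failure
            task.state = TaskState.SUCCESS
        elif task.attempts <= task.max_retries:
            task.state = TaskState.UP_FOR_RETRY
        else:
            task.state = TaskState.FAILED


def run_until_settled(
    db: Dict[str, Task],
    run_task: Callable[[Task], bool],
    max_passes: int = 10,
) -> None:
    """Repeat polling passes until every task is SUCCESS or FAILED."""
    terminal = (TaskState.SUCCESS, TaskState.FAILED)
    for _ in range(max_passes):
        if all(task.state in terminal for task in db.values()):
            break
        scheduler_pass(db, run_task)
```

Every decision in the sketch starts by reading state from `db` and ends by writing state back to it, which is why the real database deserves so much attention in the plan.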

Image 1. Airflow Architecture

To configure a full-featured cluster, we need some supporting services. First, we need a rock-solid, HA/DR-capable database. Because our services don’t communicate directly, every operation starts by retrieving state from the database, which makes it a single point of failure for the cluster.

Last, we need to define how the cluster will be hosted. For this, we will select a namespace on one of our Kubernetes clusters and deploy Airflow there using the official Helm chart. Naturally, we plan on configuring the KubernetesExecutor for task execution.

What’s Not Included

  • Log Aggregation and Forwarding
    • Airflow provides an integrated logging configuration for writing and collecting task-level and application-level logs, but it does not provide a log aggregation or forwarding framework. We will need to implement a solution that can forward logs from a directory to our log analytics platform.
  • DAG Management
    • Airflow does not provide a means to move DAGs to the running services in the Airflow cluster. Given the deployment cadence that we expect from our usage of this platform, we will need to sync the DAG directory with one or more GitHub repos.
  • SMTP Relay
    • Airflow requires the SMTP_DEFAULT connection string to contain credentials for an SMTP relay server; without them, the alerting features cannot function properly. SendGrid and basic SMTP are supported.
  • Auth
    • Airflow allows for anonymous access, user/password login, or single sign-on through a variety of providers such as LDAP, OIDC, and GitHub Apps. Our organization uses Active Directory (fancy LDAP) and GitHub Enterprise, so this is a matter of preference rather than a gap we need to fill.
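Of these gaps, log forwarding is the easiest to picture in code: look in a directory, read whatever was appended since the last look, and hand it to a shipper. The sketch below makes several assumptions that are not from Airflow itself: logs are `*.log` text files somewhere under one directory, `send` is a hypothetical callback standing in for the log analytics client, and read offsets live only in memory. In practice a purpose-built log agent would normally fill this role.

```python
from pathlib import Path
from typing import Callable, Dict


def forward_new_lines(
    log_dir: Path,
    offsets: Dict[Path, int],
    send: Callable[[str, str], None],
) -> int:
    """Forward lines appended since the previous call; return how many were sent."""
    sent = 0
    for path in sorted(log_dir.rglob("*.log")):
        start = offsets.get(path, 0)
        if path.stat().st_size < start:
            start = 0  # file was truncated or replaced, so read it from the top
        with path.open("rb") as handle:
            handle.seek(start)
            chunk = handle.read()
        # Forward complete lines only; a partial trailing line waits for the next pass.
        end = chunk.rfind(b"\n") + 1
        source = path.relative_to(log_dir).as_posix()
        for line in chunk[:end].decode("utf-8", errors="replace").splitlines():
            send(source, line)
            sent += 1
        offsets[path] = start + end
    return sent
```

Calling `forward_new_lines` on a timer with the same `offsets` dict forwards each line once. Persisting the offsets across restarts, batching, and retrying failed sends are deliberately left out.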

Moving Forward

As we can see, implementing a useful Airflow cluster requires more than `helm install …`: without a significant amount of configuration, we lack many necessary features. To that end, we have this list:

  • Identify Authentication Method
    • Obtain credentials and other configurations
  • Identify SMTP Relay
    • Obtain credentials and other configurations
  • Identify Kubernetes Namespace
  • Implement Helm configuration for deployment
  • Implement Logging Solution
  • Implement DAG Sync Solution
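As a rough idea of where the Helm and DAG sync items might land, the sketch below builds a values override as a plain dict and formats the matching Helm command. The `executor` and `dags.gitSync.*` keys are taken from the official chart's values file as I understand it, and should be checked against the chart version actually deployed. The function names, the repository URL, the release name, and the namespace in the usage note are placeholders invented for this example.

```python
import json
from typing import Any, Dict


def build_values(repo_url: str, branch: str = "main", sub_path: str = "dags") -> Dict[str, Any]:
    """Return a values override for the Airflow Helm chart as a plain dict."""
    return {
        # Run each task in its own Kubernetes Pod.
        "executor": "KubernetesExecutor",
        # Assumed chart keys for pulling DAGs from a Git repository.
        "dags": {
            "gitSync": {
                "enabled": True,
                "repo": repo_url,
                "branch": branch,
                "subPath": sub_path,
            }
        },
    }


def write_values(values: Dict[str, Any], path: str) -> None:
    """Write the override as JSON, which Helm should accept since JSON is valid YAML."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(values, handle, indent=2)


def helm_command(release: str, namespace: str, values_file: str) -> str:
    """Format the install/upgrade command; nothing is executed here."""
    return (
        f"helm upgrade --install {release} apache-airflow/airflow "
        f"--namespace {namespace} -f {values_file}"
    )
```

For example, `write_values(build_values("https://github.example.com/data/airflow-dags.git"), "values.json")` followed by `helm_command("airflow", "airflow", "values.json")` yields a file and a command that can be reviewed before anything touches the cluster. Authentication, SMTP, and logging settings would be layered onto the same dict once those items are decided.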

With this list in hand, point or time estimates can be applied to each item and aggregated. Now comes the fun part. Thanks for reading!

About the Author

Jacob Nosal

Sr Consultant
