Planning an Apache Airflow Deployment

This is part one of a five-part series addressing Airflow at an enterprise scale. I will update these with links as they are published.

  • Airflow: Planning a Deployment

Apache Airflow is a platform for authoring, scheduling, and monitoring ETL workflows using a diverse range of task-based operators. It features well-thought-out secret and variable management, flexible Jinja templating, and enterprise-friendly auth solutions such as LDAP, OAuth, and OIDC. The KubernetesExecutor, which we will focus on here, executes each task as a Kubernetes Pod.

The Scenario

Let’s assume that our employer has tasked us with estimating the effort of introducing Airflow to our organization. We have a few Kubernetes clusters, and some of our existing ETLs run on those clusters as Python, Spark, and Scala applications. Another team is working on a POC implementation of Snowflake, so we need to be able to support that when the time comes. To be safe, we will assume that there are some zombie bash and SQL scripts that we will find along our journey. We will need to authenticate and authorize our users and collect logs for the Airflow application and its tasks.


An Airflow cluster consists of three services: the scheduler, the web server, and one or more workers. The web server provides the UI and a resource API; the scheduler polls the database for task state changes and manages the lifecycle of tasks through execution and retry; the workers perform the actual execution of tasks and ensure their logs are persisted for collection. Behind all of this sits the database, which is required and should be HA/DR capable.

Image 1. Airflow Architecture

To configure a full-featured cluster, we need some supporting services. First, we need a rock-solid, HA/DR-capable database. Because the Airflow services don’t communicate with each other directly, the database is a single point of failure: every operation starts by retrieving state from it.

Next, we need to define how the cluster will be hosted. For this, we will select a namespace on one of our Kubernetes clusters, and we can deploy the cluster using the official Helm chart for Airflow. Naturally, we plan on configuring the KubernetesExecutor for task execution.
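As a rough sketch, the deployment itself might look something like the following. The release name and namespace here are assumptions for illustration; your values will differ, and a real rollout would pass a full values file rather than a single `--set` flag.

```shell
# Add the official Apache Airflow chart repository.
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# Install (or upgrade) into a dedicated namespace,
# selecting the KubernetesExecutor for task execution.
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set executor=KubernetesExecutor
```

In practice we would commit a `values.yaml` to source control and feed it in with `-f`, keeping the command line free of configuration drift.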

What’s Not Included

  • Log Aggregation and Forwarding
    • Airflow provides an integrated logging configuration for writing and collecting task- and application-level logs, but it does not provide a log aggregation or forwarding framework. We will need to implement a solution that can forward logs from a directory to our log analytics platform.
  • DAG Management
    • Airflow does not provide a means to move DAGs to the running services in the Airflow cluster. Given the deployment cadence we expect from our usage of this platform, we will need to sync the DAG directory with a GitHub repo (or repos).
  • SMTP Relay
    • Airflow requires the `smtp_default` connection to contain credentials for an SMTP relay server, or the email alerting features cannot function properly. SendGrid and basic SMTP are supported.
  • Auth
    • Airflow allows for anonymous access, user/password authentication, or single sign-on through a variety of providers such as LDAP, OIDC, and GitHub. Our organization uses Active Directory (fancy LDAP) and GitHub Enterprise, so this is a matter of preference rather than a missing capability.
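To give a feel for how some of these gaps map onto configuration, the official Helm chart exposes values for a git-sync sidecar and for injecting arbitrary Airflow environment variables. The repo URL, hostname, and values below are placeholders, not working settings, and credentials would come from Kubernetes secrets rather than the command line.

```shell
# Sketch only: repo URL and SMTP host are placeholders.
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow \
  # DAG management: run a git-sync sidecar against our DAG repo.
  --set dags.gitSync.enabled=true \
  --set dags.gitSync.repo=https://github.example.com/data/airflow-dags.git \
  --set dags.gitSync.branch=main \
  # SMTP relay: Airflow reads AIRFLOW__SECTION__KEY environment variables.
  --set-string "env[0].name=AIRFLOW__SMTP__SMTP_HOST" \
  --set-string "env[0].value=smtp-relay.example.com"
```

Log forwarding and auth would be layered on similarly: remote logging via the `[logging]` configuration section, and auth via the web server’s Flask AppBuilder configuration.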

Moving Forward

As we can see, implementing a useful Airflow cluster requires more than `helm install …`; without a significant amount of configuration, we lack many necessary features. To that end, we have this list:

  • Identify Authentication Method
    • Attain credentials and other configurations
  • Identify SMTP Relay
    • Attain credentials and other configurations
  • Identify Kubernetes Namespace
  • Implement Helm configuration for deployment
  • Implement Logging Solution
  • Implement DAG Sync Solution

At this point, story-point or time estimates can be applied and aggregated. Now comes the fun part. Thanks for reading!

About the Author

Object Partners profile.
