Apache Airflow is the most widely used workflow management platform. I have been building a workflow system with Apache Airflow for more than a year now. The system has to support workflows that handle large amounts of data and scale on demand. Even though there is plenty of help and documentation available online, I ran into a few challenges. I did extensive research to find the best possible ways not only to scale but also to spend the least amount of money. This post describes Apache Airflow optimizations based on my experience, research and a workflow system running in a large enterprise environment.
Prerequisites
- Experience using Apache Airflow on a Kubernetes engine
- Experience building workflows using Apache Airflow
- Basic understanding of Kubernetes and Helm charts
- Basic understanding of CI/CD and GitOps
- Basic understanding of IaC (Infrastructure as Code)
- Experience with a cloud provider and cloud services
Why Apache Airflow optimizations?
It is extremely easy nowadays to set up an instance of Apache Airflow, not only on a standalone machine but also on a Kubernetes engine. However, the out-of-the-box instructions and setup mostly help during the development phase. Moving the application to a live/production environment requires many considerations from the scalability, security, maintenance and cost points of view. These considerations are the most basic objectives of any enterprise application, yet they are difficult to achieve at the same time. The Apache Airflow community provides a few guidelines, but they may not be enough in many situations. I ran into a few practical issues and did not find straightforward solutions online. These Apache Airflow optimizations are listed here to help others overcome similar obstacles.
1. GitOps
Enterprise applications involve large teams and agile development cycles. Building a workflow application with Apache Airflow requires a setup that allows the team to develop workflows iteratively. This setup includes GitOps. Everything, including the infrastructure, should be part of the source code, so that multiple environments can be configured seamlessly without compromising security. It also makes it easy to keep the Apache Airflow version current. My previous post describes the steps to configure a project structure that supports these capabilities. My setup uses CDKTF for IaC and GitHub/GitHub Actions for CI/CD. There are typically at least two environments, development and production. Having this setup helps meet the enterprise objectives of scalability, security, low maintenance and low cost.
Apache Airflow Helm chart
Apache Airflow on a Kubernetes engine is the most common setup for an enterprise workflow application to achieve scalability and stability. With CI/CD, GitOps and IaC in place, it makes sense to make the Apache Airflow installation and configuration part of them. A Helm chart is the best way to set up Airflow on a Kubernetes engine, and CDKTF includes a Helm release construct so the chart can be part of your IaC. You can maintain a configuration file in your source code with all the Airflow Helm chart parameters and use it with the CDKTF Helm construct to build the Airflow instance for different environments.
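As a rough illustration, a minimal CDKTF (Python) sketch of such a Helm release might look like the following; the values file path, namespace and provider wiring are assumptions and will differ per project and provider version.

```python
# Minimal CDKTF sketch of an Airflow Helm release.
# The values file path, namespace and provider wiring are illustrative assumptions.
import os

from constructs import Construct
from cdktf import App, TerraformStack
from cdktf_cdktf_provider_helm.provider import HelmProvider
from cdktf_cdktf_provider_helm.release import Release


class AirflowStack(TerraformStack):
    def __init__(self, scope: Construct, id: str, env: str):
        super().__init__(scope, id)

        # Point the Helm provider at the target cluster (wiring differs per setup and provider version).
        HelmProvider(self, "helm", kubernetes={"config_path": "~/.kube/config"})

        # One values file per environment keeps all Airflow chart parameters in git.
        Release(
            self,
            "airflow",
            name="airflow",
            repository="https://airflow.apache.org",
            chart="airflow",
            namespace=f"airflow-{env}",
            create_namespace=True,
            values=[open(f"config/airflow-values-{env}.yaml").read()],
        )


app = App()
AirflowStack(app, "airflow", env=os.environ.get("AIRFLOW_ENV", "dev"))
app.synth()
```

The same stack can then be synthesized for development and production by switching the environment-specific values file, so the two environments differ only in the parameters you have declared.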
Project structure
With CI/CD, GitOps and IaC fully set up, you will see the following components in your project repository
- CI/CD workflows. If you use GitHub Actions, there will be a .github folder with all the CI/CD workflows
- GitHub environments to maintain environment-specific secrets and variables
- The Apache Airflow metadata database, Kubernetes cluster and Airflow Helm release as part of IaC. You can use any major programming language of your choice to write the infrastructure code with a cloud development kit (CDK)
- Unit test cases if you want to follow a TDD approach. IaC allows you to include unit test cases for the infrastructure as well
- DAG folder with all Airflow DAGs for your project
2. Serverless compute
On-demand resource allocation is the next-generation compute model, and all major cloud providers offer serverless infrastructure services. It allows you to optimize the application infrastructure and lower the cost. The following serverless services from Google Cloud Platform (GCP) are options for setting up Apache Airflow on GCP
- Airflow metadata database – Cloud SQL (MySQL or PostgreSQL)
- Airflow webserver, scheduler, workers, etc. – Google Kubernetes Engine Autopilot
- Airflow connections – Secret Manager
- Airflow logging – Google Cloud Storage
I prefer serverless compute on a cloud provider and have been using these services to run Apache Airflow. My other post describes GKE Autopilot features useful for running an enterprise application.
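If the cluster itself is also managed in the same IaC codebase, a GKE Autopilot cluster is roughly a one-attribute resource in CDKTF; the project, region and names below are placeholders.

```python
# Illustrative CDKTF sketch of a GKE Autopilot cluster (project, region and names are placeholders).
from constructs import Construct
from cdktf import App, TerraformStack
from cdktf_cdktf_provider_google.provider import GoogleProvider
from cdktf_cdktf_provider_google.container_cluster import ContainerCluster


class GkeStack(TerraformStack):
    def __init__(self, scope: Construct, id: str):
        super().__init__(scope, id)

        GoogleProvider(self, "google", project="my-gcp-project", region="us-central1")

        # Autopilot hands node provisioning to Google, so pods (including Airflow workers)
        # only consume compute while they run.
        ContainerCluster(
            self,
            "airflow-cluster",
            name="airflow-autopilot",
            location="us-central1",
            enable_autopilot=True,
        )


app = App()
GkeStack(app, "gke")
app.synth()
```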
3. Preemptible compute
Airflow webserver or scheduler deployments on Kubernetes run as pods, and pods run on nodes. In managed Kubernetes services such as GKE, nodes are virtual machine instances. These virtual machines have either standard or preemptible (also known as spot) provisioning. Preemptible (or spot) instances are available at a much lower cost than standard instances, but the cloud provider can preempt them at any time to reclaim the compute capacity. You can take advantage of spot instances to set up lower environments (development) as well as to run fault-tolerant production tasks. If you use serverless Kubernetes such as Autopilot, it allows your pods to run on spot instances on demand. Simply set 'nodeSelector' or 'nodeAffinity' for the Airflow deployment pods or task pods, and the serverless compute will take care of running them on spot instances
```yaml
nodeSelector:
  cloud.google.com/gke-spot: "true"
```
With IaC and the Helm chart for Airflow, you can easily configure lower environments on lower-cost infrastructure compared to production
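The same selector can also be attached to an individual task pod when tasks run through the Kubernetes executor (covered in the next optimization). A rough sketch, where the DAG and task names are placeholders:

```python
# Sketch: route a single task's pod onto GKE spot capacity via pod_override
# (DAG and task names are placeholders; assumes Airflow 2.x with the Kubernetes executor).
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

spot_pod_override = k8s.V1Pod(
    spec=k8s.V1PodSpec(
        containers=[k8s.V1Container(name="base")],
        # Same selector as the Helm values above, scoped to this one task pod.
        node_selector={"cloud.google.com/gke-spot": "true"},
    )
)

with DAG(
    dag_id="spot_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="fault_tolerant_step",
        bash_command="echo running on a spot node",
        executor_config={"pod_override": spot_pod_override},
    )
```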
4. KubernetesExecutor and CeleryKubernetesExecutor
An executor in Airflow is a pluggable module that runs task instances. The Kubernetes executor runs each task in its own pod, while the Celery executor is a task queue that runs each task on a worker. They are the most common executors used in Airflow, and each has advantages and disadvantages. The Kubernetes executor, with serverless Kubernetes, creates workers on demand, which saves cost; however, it is slower precisely because each worker pod is created on demand. The Celery executor is faster because workers are always running, but it incurs cost even when Airflow is idle. Also, the Kubernetes executor is more stable for long-running tasks than a Celery worker. So lower environments (development/test) are more cost efficient with the Kubernetes executor, while the production environment is faster and more stable with the CeleryKubernetesExecutor. Serverless compute with the Kubernetes executor can save a lot in cloud cost, and IaC with the Helm chart allows you to configure the appropriate executor for each environment
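With the CeleryKubernetesExecutor, tasks run on the always-on Celery workers by default, and an individual task can be routed to the Kubernetes executor through the configured Kubernetes queue (named "kubernetes" by default). A short sketch with placeholder names:

```python
# Sketch: with CeleryKubernetesExecutor, most tasks use always-on Celery workers,
# while tasks sent to the "kubernetes" queue get their own on-demand pod.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="mixed_executor_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    quick_task = BashOperator(
        task_id="quick_task",
        bash_command="echo handled by a Celery worker",
    )

    long_running_task = BashOperator(
        task_id="long_running_task",
        bash_command="echo handled in a dedicated pod by the Kubernetes executor",
        queue="kubernetes",  # default value of [celery_kubernetes_executor] kubernetes_queue
    )

    quick_task >> long_running_task
```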
5. KubernetesPodOperator
If you want to launch an Airflow task as a pod in a Kubernetes cluster, KubernetesPodOperator allows you to do that. This helps you not only launch code written in a different language (as a container) but also scale seamlessly. You can launch k8s operator tasks in the same cluster running the Airflow instance or in a separate Kubernetes cluster. Using k8s operator tasks also allows you to utilize Kubernetes resources and features, e.g. persistent disks, config maps, secrets, etc. Any long-running or data-heavy task can be designed to be fault-tolerant and launched using the k8s operator on a preemptible (spot) instance via the k8s executor, which helps with scalability, stability and lowering cost. My previous post describes the important features needed to use k8s resources with k8s operator tasks. The following use case provides more understanding
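A minimal sketch of such a task, shown without the surrounding DAG for brevity; the image, namespace and names are placeholders:

```python
# Minimal KubernetesPodOperator sketch (image, namespace and names are placeholders).
# Newer cncf.kubernetes provider versions expose this operator under
# airflow.providers.cncf.kubernetes.operators.pod instead.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

process_data = KubernetesPodOperator(
    task_id="process_data",
    name="process-data",
    namespace="airflow",
    image="us-docker.pkg.dev/my-project/my-repo/data-processor:1.0.0",
    arguments=["--input", "gs://my-bucket/input/"],
    in_cluster=True,  # run in the same cluster as Airflow; a cluster context can target another cluster
    get_logs=True,    # stream the container logs back into the Airflow task log
)
```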
Use case
You need to build a workflow where a container will process a large amount of data for an AI system. The container and data location are provided to you by another team. You or your team are responsible for building the workflow using Airflow. The data input and size can vary, and the container image version (tag) can vary for different runs
The DAG for this workflow can utilize KubernetesPodOperator. You can create tasks that first dynamically calculate the size of the data, create a persistent disk of that size, and then launch a k8s operator task with the disk and container image, as in the sketch below. This is a very simple use case to give you an idea, but you can expand it to build much more complex, yet stable, workflows that are also cost efficient. You could practically attach source code as a ConfigMap and launch it inside your k8s operator pod
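A hedged sketch of that DAG; the PVC name, image, paths and the size-calculation logic are hypothetical, and the task that actually provisions the disk from the calculated size is left as a comment.

```python
# Sketch of the use case: compute the data size, then run a parameterized image
# with a (separately provisioned) persistent volume attached. Names and paths are hypothetical.
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

data_volume = k8s.V1Volume(
    name="data-volume",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name="ai-data-pvc"),
)
data_mount = k8s.V1VolumeMount(name="data-volume", mount_path="/data")


def calculate_data_size(**context):
    # Placeholder: inspect the input location and decide how large the disk/PVC must be.
    return "500Gi"


with DAG(
    dag_id="ai_data_processing",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    params={"image_tag": "1.0.0"},  # image tag can change per run via DAG params
) as dag:
    size_check = PythonOperator(
        task_id="calculate_data_size",
        python_callable=calculate_data_size,
    )

    # A task that creates the PVC with the calculated size (e.g. via the Kubernetes API) would sit here.

    process = KubernetesPodOperator(
        task_id="process_data",
        name="process-data",
        namespace="airflow",
        image="us-docker.pkg.dev/my-project/my-repo/ai-processor:{{ params.image_tag }}",
        volumes=[data_volume],
        volume_mounts=[data_mount],
        get_logs=True,
    )

    size_check >> process
```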
6. Dynamically launch tasks on preemptible compute
In the previous optimization about preemptible compute, I described the option to configure Airflow deployments or tasks to run on preemptible instances. However, that configuration sets the preemptible instance statically. Imagine a situation in your workflow where you launch a task on the k8s operator on a spot instance but would like to switch to a standard node after a certain number of retries. You can use the pod mutation hook to dynamically modify your k8s pod to launch on a spot or standard instance based on the retry count. This can be very useful in production: launch certain k8s operator tasks on spot instances first, and then dynamically launch them on standard nodes if they have been retried a certain number of times. You can identify whether a task failed because of preemption from the pod error message or the cloud logging for the node
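A hedged sketch of such a hook, placed in airflow_local_settings.py on the scheduler image; the retry threshold is arbitrary, and the exact annotation or label carrying the try number should be verified for your Airflow version.

```python
# airflow_local_settings.py
# Sketch: keep early attempts on cheap spot capacity, move later retries to standard nodes.
# Assumes task pods carry a "try_number" annotation or label; verify for your Airflow version.
from kubernetes.client import models as k8s

MAX_SPOT_TRIES = 2  # arbitrary threshold before falling back to standard nodes


def pod_mutation_hook(pod: k8s.V1Pod) -> None:
    meta = pod.metadata
    try_number = int(
        (meta.annotations or {}).get("try_number")
        or (meta.labels or {}).get("try_number")
        or "1"
    )
    node_selector = dict(pod.spec.node_selector or {})

    if try_number <= MAX_SPOT_TRIES:
        # Early attempts: tolerate possible preemption in exchange for lower cost.
        node_selector["cloud.google.com/gke-spot"] = "true"
    else:
        # Likely preempted before; pin the retry to standard (non-spot) capacity.
        node_selector.pop("cloud.google.com/gke-spot", None)

    pod.spec.node_selector = node_selector
```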