Apache Airflow Database Backend: MySQL vs PostgreSQL

Apache Airflow supports multiple database backends: PostgreSQL, MySQL, and SQLite. However, SQLite is recommended only for development purposes due to its many limitations, which leaves PostgreSQL and MySQL as the two options for a production environment. When I started working on a framework using Apache Airflow, I chose MySQL as the backend database. However, I recently ran into an issue when the workload started increasing, and the Airflow community recommended migrating to PostgreSQL to avoid such issues. Before migrating the database, I wanted to compare both databases on an Apache Airflow instance with a production-like environment and workload. This comparison was important to ensure that the database backend is also considered as part of Apache Airflow optimizations. This post compares the MySQL and PostgreSQL database backends for Apache Airflow by performing extensive load tests on an instance running on Kubernetes.
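For context, the database backend an Airflow instance uses is selected through its SQLAlchemy connection string. The snippet below is a minimal sketch of the relevant airflow.cfg setting for each backend; the hosts, credentials, and database names are placeholders, not values from my setup.

[database]
# MySQL backend (mysqlclient driver)
sql_alchemy_conn = mysql+mysqldb://airflow_user:airflow_pass@mysql-host:3306/airflow
# PostgreSQL backend (psycopg2 driver)
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@postgres-host:5432/airflow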

Prerequisites

  • Experience using Apache Airflow on Kubernetes
  • Experience building workflows using Apache Airflow
  • Basic understanding of Database Backends

Important configuration and settings for the Apache Airflow instance

  • Airflow container image tag: 2.9.2-python3.10
  • Executor: CeleryKubernetesExecutor
  • Celery workers replicas: 4
  • Celery workers autoscaling enabled: true
  • Celery workers autoscaling max replicas: 8
  • Celery workers memory: 30Gi
  • Celery workers CPU: 5
  • Scheduler replicas: 2
  • Scheduler memory: 30Gi
  • Scheduler CPU: 5
  • AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 8
  • AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 96
  • AIRFLOW__CORE__PARALLELISM: 160
  • AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 0
  • AIRFLOW__CELERY__WORKER_CONCURRENCY: 40
  • pgbouncer enabled (for PostgreSQL): true
  • pgbouncer memory (for PostgreSQL): 30Gi
  • pgbouncer CPU (for PostgreSQL): 5
  • MySQL / PostgreSQL memory: 4GB
  • MySQL / PostgreSQL CPU: 1
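These settings map naturally onto the official Apache Airflow Helm chart. The values below are a minimal sketch of how the configuration above could be expressed with that chart; value names can differ between chart versions, so treat this as illustrative rather than a copy of my deployment.

defaultAirflowTag: "2.9.2-python3.10"
executor: "CeleryKubernetesExecutor"

workers:
  replicas: 4
  keda:
    enabled: true           # autoscaling for the Celery workers
    maxReplicaCount: 8
  resources:
    limits:
      cpu: 5
      memory: 30Gi

scheduler:
  replicas: 2
  resources:
    limits:
      cpu: 5
      memory: 30Gi

pgbouncer:
  enabled: true             # PostgreSQL runs only; manages the connection pool

# Rendered by the chart into the AIRFLOW__SECTION__KEY settings listed above
config:
  core:
    max_active_runs_per_dag: 8
    max_active_tasks_per_dag: 96
    parallelism: 160
  scheduler:
    max_tis_per_query: 0
  celery:
    worker_concurrency: 40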

Apache Airflow DAGs to perform the load test

I created two DAGs to perform the load test. The first DAG has a single dynamically mapped task that expands into a number of tasks at run time based on an input parameter. The second DAG has multiple dynamically mapped tasks inside a mapped task group, which likewise expands at run time based on the input parameter.

"""DAG file to load test with Dynamic task mapping"""
from datetime import datetime
from airflow.decorators import task
from airflow.models.dag import dag

@task
def make_list(no_task: str):
    return [i for i in range(int(no_task if no_task else 100))]

@task
def consumer(arg):
    print(arg)

with DAG(
    dag_id="dynamic-map", start_date=datetime(2024, 06, 01), schedule_interval=None
    ) as dag:
    no_of_tasks = "{{ dag_run.conf['no_of_tasks'] }}"
    consumer.expand(arg=make_list(no_of_tasks))Code language: Python (python)
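A single run can be triggered manually to sanity-check the DAG before the load test, for example with the Airflow CLI (the parameter value here is just an illustration):

airflow dags trigger dynamic-map --conf '{"no_of_tasks": "200"}'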
"""DAG file to load test with Dynamic task mapping group"""
from datetime import datetime
from airflow.decorators import task
from airflow.models.dag import dag

@task
def make_list(no_task: str):
    return [i for i in range(int(no_task if no_task else 100))]

@task
def consumer(arg):
    print(arg)

@task
def agent(arg):
    print(arg)

@task
def company(arg):
    print(arg)

@task_group
def supplier(arg):
    consumer(arg)
    agent(arg)
    company(arg)

with DAG(
    dag_id="dynamic-map-group", start_date=datetime(2024, 06, 01), schedule_interval=None
    ) as dag:
    no_of_tasks = "{{ dag_run.conf['no_of_tasks'] }}"
    supplier.expand(arg=make_list(no_of_tasks))Code language: JavaScript (javascript)
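Note that the task group multiplies the load: a run with no_of_tasks set to 400 expands supplier into 400 mapped group instances of three tasks each, i.e. 400 × 3 = 1,200 task instances plus make_list, whereas the same parameter on the first DAG produces roughly 400 task instances.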

I wrote a simple Python script to submit multiple dag runs for these two DAGs, with an increasing value of the 'no_of_tasks' input parameter. This was done first with the MySQL database backend and then with PostgreSQL.

"""Submit dag runs with an increasing number of tasks via the Airflow REST API"""
from json import dumps

from requests import post
from requests.auth import HTTPBasicAuth

for i in range(75, 100):
    # Each iteration submits a larger dag run: 300, 304, ..., 396 tasks
    run = {
        "no_of_tasks": str(i * 4)
    }
    post("https://k8airflowwebserverdomainname/api/v1/dags/dynamic-map/dagRuns",
         data=dumps({"conf": run}),
         headers={"content-type": "application/json"},
         auth=HTTPBasicAuth(username="user", password="password"))
    post("https://k8airflowwebserverdomainname/api/v1/dags/dynamic-map-group/dagRuns",
         data=dumps({"conf": run}),
         headers={"content-type": "application/json"},
         auth=HTTPBasicAuth(username="user", password="password"))
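To compare execution times afterwards, the dag run timestamps can be pulled back from the same REST API. The snippet below is a rough sketch of how per-run durations could be collected; the host and credentials are placeholders as above.

"""Fetch dag runs for a DAG and print the duration of each completed run"""
from datetime import datetime

from requests import get
from requests.auth import HTTPBasicAuth

response = get("https://k8airflowwebserverdomainname/api/v1/dags/dynamic-map/dagRuns",
               params={"limit": 100, "order_by": "start_date"},
               auth=HTTPBasicAuth(username="user", password="password"))

for dag_run in response.json()["dag_runs"]:
    # Skip runs that are still queued or running
    if dag_run["state"] != "success":
        continue
    start = datetime.fromisoformat(dag_run["start_date"])
    end = datetime.fromisoformat(dag_run["end_date"])
    print(dag_run["dag_run_id"], (end - start).total_seconds())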

Comparison between MySQL and PostgreSQL database backends

Apache Airflow DAG runs
Dagruns for the 1st DAG with MySQL database backend
Dagruns for the 1st DAG with PostgreSQL database backend
Dagruns for the 2nd DAG with MySQL database backend
Dagruns for the 2nd DAG with PostgreSQL database backend

Execution times for the MySQL and PostgreSQL database backends were drastically different: the PostgreSQL database backend performed slower than the MySQL database backend.

Database metrics
Top queries causing the most database load
Top longest running queries for MySQL database backend
Top longest running queries for PostgreSQL database backend

Database updates (writes) with PostgreSQL were significantly slower than with MySQL.
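For reference, a 'top queries' view like the one above can also be pulled from the databases' own statistics tables. A minimal sketch, assuming pg_stat_statements is enabled on the PostgreSQL side and the performance schema on the MySQL side (column names vary slightly across versions):

-- PostgreSQL: statements ordered by total execution time
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- MySQL: statement digests ordered by total wait time
SELECT digest_text, count_star, sum_timer_wait
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 10;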

CPU
CPU utilization for MySQL database backend
CPU utilization for PostgreSQL database backend

The MySQL database consumed more CPU than PostgreSQL in my test. However, I did not notice any impact on database performance from this during the load test.

Memory
Memory utilization for MySQL database backend
Memory utilization for PostgreSQL database backend
Data transfer
Data transfer for MySQL database backend
Data transfer for PostgreSQL database backend
Disk read/write
Disk read/write for MySQL database backend
Disk read/write for PostgreSQL database backend

The disk read/write metrics show that PostgreSQL write operations were slower than MySQL's.

Observations

Both database backends were stable during the load test, but PostgreSQL performed slower than MySQL. The PostgreSQL setup also required an additional pod for pgbouncer to manage the database connection pool, which adds a fixed running cost compared to MySQL.