Parallel ETL for Data Warehouse CTU with Apache Airflow
Requirements
- Airflow 2.5.1
- Python 3.7 or later
- Python packages: apache-airflow, celery, cx_Oracle, flower, numpy, oracledb, pandas, psycopg2-binary, SQLAlchemy
- Database: PostgreSQL 11 or later
- Celery 5.2.7
- RabbitMQ 3.11
Installation and First Start
- Create a virtual environment:
$ python3 -m venv airflow_env
- Activate the virtual environment:
$ source airflow_env/bin/activate
- Install Airflow from PyPI:
(airflow_env) pip install 'apache-airflow==2.5.1' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.7.txt"
where 3.7 in the constraints URL is the installed Python version.
- By default, Airflow uses an 'airflow' directory in the user's home directory for its configuration and metadata. The path to this directory must be exported as the AIRFLOW_HOME environment variable:
$ export AIRFLOW_HOME=~/airflow
- Apache Airflow needs a database for proper functioning. By default it creates an SQLite database, which does not support parallel task execution. To run tasks in parallel, create a dedicated database in, e.g., PostgreSQL (a minimal example follows this list).
- When the database is created, modify the airflow.cfg file to connect to it (an example configuration is shown after this list). More information is available in the Airflow documentation.
- For testing and development purposes, change the executor type to executor = LocalExecutor. For production, use CeleryExecutor; this requires installing and configuring Celery and RabbitMQ (see the CeleryExecutor sketch after this list). The required steps are described in the Airflow documentation.
- Before starting Airflow, initialize the database:
(airflow_env) airflow db init
- Create a user to access the Airflow web server:
(airflow_env) airflow users create --username admin --firstname firstname --lastname lastname --role Admin --email admin@domain.com
More info on creating a user:
(airflow_env) airflow users create --help
- Run the web server and the scheduler in different terminals to start working with Apache Airflow:
(airflow_env) airflow webserver
(airflow_env) airflow scheduler
The default endpoint is http://localhost:8080/, but it is configurable (check out (airflow_env) airflow webserver --help).
- Visit http://localhost:8080/.
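The sketches below illustrate the database and configuration steps referenced above; all database names, users, passwords, and hosts are placeholders, not values from the thesis. A minimal way to create a dedicated PostgreSQL metadata database for Airflow:
$ sudo -u postgres psql
-- create the metadata database and a user Airflow will connect as (placeholder names)
CREATE DATABASE airflow_db;
CREATE USER airflow_user WITH PASSWORD 'airflow_pass';
GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow_user;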
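The corresponding airflow.cfg changes might look as follows; in Airflow 2.5 the metadata DB connection lives in the [database] section and the executor in [core], and the connection string reuses the placeholder credentials above:
[database]
# metadata DB connection; points at the placeholder database and user created above
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db

[core]
# LocalExecutor runs tasks in parallel as subprocesses on a single machine
executor = LocalExecutor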
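For the production CeleryExecutor setup, a possible sketch of the additional configuration, assuming RabbitMQ runs on the same host and a RabbitMQ user with the placeholder credentials has been created:
[core]
executor = CeleryExecutor

[celery]
# RabbitMQ as the message broker (placeholder credentials, default vhost)
broker_url = amqp://airflow_user:airflow_pass@localhost:5672//
# task results stored in the same PostgreSQL database
result_backend = db+postgresql://airflow_user:airflow_pass@localhost:5432/airflow_db
With this in place, a worker can be started with (airflow_env) airflow celery worker, and the Flower monitoring UI with (airflow_env) airflow celery flower.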
Run ETL
Since ETLs are built as Airflow tasks within a DAG, they can be triggered automatically, by defining the schedule in code with a cron expression when writing the DAG (see the example DAG after the steps below), or manually. To start loading manually, follow these steps:
- Click on 'DAGs' in the top menu and select the DAG.
- In the upper right corner, there is a Play button. Clicking on it will manually trigger the DAG.
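As an illustration of scheduling with a cron expression, here is a minimal DAG sketch; the DAG id, task, and schedule are hypothetical, not the actual ETL DAGs of this project:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_example_table():
    # placeholder for an actual extract/transform/load step
    pass

with DAG(
    dag_id="example_etl",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule="0 2 * * *",          # cron expression: run every day at 02:00
    catchup=False,
) as dag:
    PythonOperator(task_id="load_example_table", python_callable=load_example_table)
Such a DAG appears under its dag_id in the 'DAGs' list and can still be triggered manually with the Play button described above.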
Logs
- Logs are saved to $AIRFLOW_HOME/logs.
- Logs can also be viewed in the UI (Airflow web server): in the Grid view, the left part contains the statuses of all tasks within a DAG. To access the logs of a task, click on the corresponding box and select the 'Logs' tab.
Screenshots of each step are attached to the bachelor thesis.