Parallel ETL for the CTU Data Warehouse with Apache Airflow


Requirements

  • Airflow 2.5.1
  • Python 3.7 or later
  • Python packages: apache-airflow, celery, cx_Oracle, flower, numpy, oracledb, pandas, psycopg2-binary, SQLAlchemy
  • Database: PostgreSQL 11 or later
  • Celery 5.2.7
  • RabbitMQ 3.11

Installation and First Start

  1. Create a virtual environment: $ python3 -m venv airflow_env
  2. Activate the virtual environment: $ source airflow_env/bin/activate
  3. Install Airflow from PyPI: (airflow_env) pip install 'apache-airflow==2.5.1' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.7.txt", where 3.7 at the end of the constraints URL is the Python version in use.
  4. By default, Airflow creates an 'airflow' directory in your home folder the first time it runs. The path to this directory must be exported as the AIRFLOW_HOME environment variable, e.g., $ export AIRFLOW_HOME=~/airflow.
  5. Apache Airflow needs a metadata database to function properly. After installation, it creates an SQLite database by default, which does not support parallel task execution. To run tasks in parallel, create a database in a server-based DBMS such as PostgreSQL.
  6. Once the database is created, edit the airflow.cfg file so that Airflow connects to it (a sketch of the relevant settings follows this list). Details are in the Airflow documentation.
  7. For testing and development, set executor = LocalExecutor. For production, use CeleryExecutor, which additionally requires installing and configuring Celery and RabbitMQ; the required steps are described in the Airflow documentation.
  8. Before starting Airflow, initialize the database with the (airflow_env) airflow db init command.
  9. Create a user to access the Airflow web server: (airflow_env) airflow users create --username admin --firstname firstname --lastname lastname --role Admin --email admin@domain.com. For more options, run (airflow_env) airflow users create --help.
  10. Run (airflow_env) airflow webserver and (airflow_env) airflow scheduler in different terminals to start working with Apache Airflow. The default endpoint is http://localhost:8080/, but it is configurable (check out (airflow_env) airflow webserver --help).
  11. Visit http://localhost:8080/.
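
For steps 6 and 7, the relevant settings live in $AIRFLOW_HOME/airflow.cfg. Below is a minimal sketch, assuming a local PostgreSQL database named airflow_db and a user airflow_user; the names, passwords, and hosts are placeholders, not values used by this project:

    [database]
    # SQLAlchemy connection string to the metadata database (PostgreSQL via psycopg2)
    sql_alchemy_conn = postgresql+psycopg2://airflow_user:password@localhost:5432/airflow_db

    [core]
    # LocalExecutor for testing and development, CeleryExecutor for production
    executor = LocalExecutor

    [celery]
    # Needed only with CeleryExecutor: RabbitMQ as the message broker,
    # PostgreSQL as the Celery result backend
    broker_url = amqp://guest:guest@localhost:5672//
    result_backend = db+postgresql://airflow_user:password@localhost:5432/airflow_db

When CeleryExecutor is used, one or more workers must also be started, e.g., (airflow_env) airflow celery worker, in addition to the webserver and scheduler.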

Run ETL

Because the ETL jobs are built as Airflow tasks within a DAG, they can be triggered automatically, by defining the schedule in the DAG code with a cron expression (see the sketch after the list below), or manually. To start loading manually, follow these steps:

  • Click on 'DAGs' in the top menu and select the DAG.
  • In the upper right corner, there is a Play button. Clicking on it will manually trigger the DAG.
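
For the scheduled (cron-based) variant mentioned above, a minimal sketch of a DAG follows. The DAG id example_etl, the schedule, and the extract task are illustrative placeholders; the actual DAGs in the airflow directory of this repository define their own names, schedules, and tasks:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract():
        # Placeholder for the real extract/transform/load logic.
        print("Running the ETL step")


    # The cron expression below means "every day at 02:00";
    # with it, the DAG runs without any manual trigger.
    with DAG(
        dag_id="example_etl",
        schedule="0 2 * * *",
        start_date=datetime(2023, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(task_id="extract", python_callable=extract)

The same DAG can also be triggered from the command line: (airflow_env) airflow dags trigger example_etl.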

Logs

  • Logs are saved to $AIRFLOW_HOME/logs.
  • Logs can also be viewed in the UI (Airflow web server): in the Grid view, the left-hand panel shows the status of every task within a DAG. To access the logs of a task, click on the corresponding box and select the 'Logs' tab.

Screenshots of each step are attached to the bachelor thesis.