Building Scalable ETL Pipelines with Python

Extract, Transform, Load (ETL) is a critical component of data management where data is:

  • Extracted from various sources
  • Transformed into a format suitable for analysis, and then
  • Loaded into a data warehouse or other storage system.

It is one of the most important parts of the data pipeline and crucial to the success of any data-driven organization or project.

Extract, Transform, and Load Data Example

In its most basic sense, you may be doing ETL and not even know it. Let's say you have two data sets contained in CSV files that you want to combine.

  1. The first source contains a list of countries with country_code and country_population columns.

    country_code    country_population
    USA             345,426,571
    CAD             39,742,430
  2. The second source contains a list of cities with City, Country, and Population columns, where Country stores the country code and Population is the city's population.

    City         Country  Population
    New York     USA      8,097,282
    Los Angeles  USA      3,795,936
    Toronto      CAD      2,832,718
    Calgary      CAD      1,491,900
    ...          ...      ...
  3. Let's say we want to create a report that aggregates the data into a single table with country_code, city_population, and non_city_population columns and writes it to a separate data source.

    country_code    city_population    non_city_population
    USA             59,849,899         285,576,672
    CAD             31,873,429         7,869,001

To do that, you may:

  1. Load the data into Excel.
  2. Create formulas and a pivot table to process the data: one calculation to sum the city populations for each country, and another to subtract that sum from each country's total population.
  3. Save the final table to a new CSV file.

There you go! That's an ETL process! But as simple as ETL can be, it can also get very complicated, involving many data sources, large amounts of data, and complex computation, where success is critical.
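For comparison, the same three steps can be sketched in pandas, the Python library used later in this article. This is a minimal sketch, not a drop-in script: the file names countries.csv, cities.csv, and report.csv are assumptions, and thousands=',' is used to parse the comma-grouped population numbers shown above.

import pandas as pd

# Extract: read both sources (file names are assumed for this sketch)
countries = pd.read_csv('countries.csv', thousands=',')  # country_code, country_population
cities = pd.read_csv('cities.csv', thousands=',')        # City, Country, Population

# Transform: sum city populations per country code
city_totals = (
    cities.groupby('Country', as_index=False)['Population']
          .sum()
          .rename(columns={'Country': 'country_code', 'Population': 'city_population'})
)

# Join with the country totals and derive the non-city population
report = countries.merge(city_totals, on='country_code')
report['non_city_population'] = report['country_population'] - report['city_population']

# Load: write the report to a separate data source
report[['country_code', 'city_population', 'non_city_population']].to_csv('report.csv', index=False)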

Unfortunately, while Excel is a great tool for getting things done quickly and on a relatively small scale, it lacks the scalability and auditability required in most academic or professional use cases.

Python as an ETL Tool

Enter Python. ETL using Python has become one of the most popular and widely adopted approaches for data management, data science, and big data. Python threads the needle between being high-level enough that it's fun and easy to use, like Excel, and low-level enough that you can confidently accomplish very complicated tasks at scale. Python also has one of the largest and most extensive ecosystems of open source tools available.

  • Ease of Use: Python’s syntax is simple and readable, making it accessible for beginners and powerful for experienced developers.
  • Community: Python has a very large community with a rich ecosystem of libraries and resources that simplify ETL development.
  • Flexibility: Python is a general-purpose programming language that can be used to accomplish almost any task.
  • Scalability: Python can execute compiled C and Rust code, and with tools like Apache Spark, Python can handle large-scale data processing.

ETL Pipelines

From a technical perspective, ETL pipelines are composed of a series of tasks, including planning tasks, that are executed in a specific order.

[Figure: ETL Pipeline]

In general, an ETL Pipeline can be divided into two main categories:

  • Workflow/Task Orchestration: Planning, organizing, and executing a sequence of extract, transform, and load tasks, where a task is just a piece of code that can be executed on a distributed system and implemented in any programming language using any technology (see the sketch after this list).
  • Pipeline Processing: Describing the entire pipeline as a whole and executing it.
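To make the orchestration idea concrete, here is a minimal sketch where each task is plain Python and the pipeline is just the order in which the tasks run. Real orchestrators such as Apache Airflow add scheduling, retries, and distributed execution on top of this idea.

def extract():
    # Task 1: pull raw records from a source (hard-coded here for the sketch)
    return [{'id': 1, 'value': 10}, {'id': 2, 'value': None}, {'id': 3, 'value': 7}]

def transform(records):
    # Task 2: drop incomplete records
    return [r for r in records if r['value'] is not None]

def load(records):
    # Task 3: deliver the cleaned records (printing stands in for a warehouse write)
    print(records)

# The pipeline itself: an ordered sequence of tasks with explicit dependencies
load(transform(extract()))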

ETL Frameworks

An ETL framework is a set of solutions that are packaged or integrated together in such a way that creating, managing, and executing ETL pipelines becomes easier. While an ETL framework is generally made up of a series of available products, libraries, and technologies, the implementation of a specific ETL framework is ultimately self-managed by the end user.

1. Extracting Data

  • Data Sources: Various origins like databases, files, APIs, or cloud storage.
  • Extraction Tools: Methods or tools used to retrieve data from these sources.
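For example, here is a minimal sketch of extracting from two common source types with pandas; the file name source_data.csv, the database source.db, and its orders table are assumptions.

import sqlite3
import pandas as pd

# File source: read a CSV into a DataFrame
df_file = pd.read_csv('source_data.csv')

# Database source: query a table into a DataFrame
conn = sqlite3.connect('source.db')
df_db = pd.read_sql_query('SELECT * FROM orders', conn)
conn.close()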

2. Data Transformation

  • Data Cleaning: Handling missing, incorrect, or irrelevant data.
  • Data Transformation: Converting data into the required format or structure.
  • Data Mapping: Aligning source data with the target schema.
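A minimal sketch of all three transformation steps in pandas, using made-up column names:

import pandas as pd

df = pd.DataFrame({'Qty': ['10', None, '15'],
                   'Date': ['2024-01-15', '2024-02-20', None]})

df = df.dropna()                              # cleaning: drop incomplete rows
df['Qty'] = df['Qty'].astype(int)             # transformation: cast to the required type
df['Date'] = pd.to_datetime(df['Date'])       # transformation: parse dates
df = df.rename(columns={'Qty': 'quantity',    # mapping: align with the target schema
                        'Date': 'date'})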

3. Loading Data

  • Data Destination: Where the transformed data is loaded, such as a data warehouse or database.
  • Loading Tools: Methods or tools used to insert data into the target system.
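A minimal sketch of loading into a SQLite database via pandas; the database name warehouse.db and table name sales are assumptions.

import sqlite3
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'total_amount': [150.0, 135.0]})

conn = sqlite3.connect('warehouse.db')
df.to_sql('sales', conn, if_exists='replace', index=False)  # insert into the target table
conn.close()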

4. Orchestration

  • Workflow Management: Managing ETL task sequences and dependencies.
  • Automation: Scheduling and automating ETL processes.
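As a sketch of automation using only Python's standard library; production pipelines would typically use cron or an orchestrator such as Apache Airflow instead.

import sched
import time

def run_pipeline():
    print('running the ETL pipeline...')

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(5, 1, run_pipeline)  # run the pipeline 5 seconds from now
scheduler.run()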

5. Monitoring and Logging

  • Monitoring Tools: Tracking performance and health of ETL processes.
  • Logging: Recording ETL operations for troubleshooting and analysis.
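A minimal logging sketch using Python's standard logging module; the row count is a placeholder metric for the sketch.

import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('etl')

log.info('extract started')
row_count = 1000  # placeholder metric for the sketch
log.info('extract finished: %d rows', row_count)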

6. Error Handling and Recovery

  • Error Detection: Identifying issues during ETL processes.
  • Error Handling: Managing and resolving errors.
  • Recovery Procedures: Strategies for restoring from failures.
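A minimal sketch of detect, handle, and recover as a retry wrapper; the attempt count and delay are arbitrary choices for the sketch.

import time

def run_with_retry(task, attempts=3, delay=2.0):
    # Retry a flaky task a few times before giving up
    for attempt in range(1, attempts + 1):
        try:
            return task()  # error detection happens via exceptions
        except Exception as exc:
            print(f'attempt {attempt} failed: {exc}')  # error handling
            if attempt == attempts:
                raise  # recovery: escalate after the final attempt
            time.sleep(delay)  # recovery: back off, then retry

run_with_retry(lambda: print('loaded'))  # usage: wrap any flaky task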

7. Security

  • Data Security: Protecting data during ETL operations.
  • Compliance: Adhering to legal and regulatory requirements.

8. Documentation

  • Process Documentation: Descriptions of ETL processes and mappings.
  • Metadata Management: Maintaining information about data sources and transformations.

These components collectively ensure efficient and effective data integration and management.

Open Source Python ETL Libraries and Tools

Fortunately, since the Python community has been doing ETL for a LONG time, there is a fantastic set of Python libraries that can help you with your ETL pipelines, offering various features that facilitate the ETL process, including tools for extraction, transformation, and loading, along with functionality for orchestration, monitoring, and error handling.

See our list of some of the best open source ETL tools available.

ETL Example

In this example, we use pandas, a very popular Python data analysis library, to execute a simple ETL process.

Dataset

source_data.csv

id,date,quantity,price_per_unit
1,2024-01-15,10,5.00
2,2024-02-20,20,7.50
3,2024-03-05,15,6.00
4,2024-04-22,30,4.50
5,2024-05-30,12,8.00
6,,25,6.00
7,2024-07-10,,7.00
8,2024-08-15,18,9.00

Description of Columns

  • id: Unique identifier for each record.
  • date: Date of the transaction or record.
  • quantity: Quantity of items or units involved in the transaction.
  • price_per_unit: Price per unit of the item or service.

Notes

  • This example includes some missing values (NaN), which will be handled by the ETL process (e.g., using dropna()).
  • The date column is in YYYY-MM-DD format.
  • The quantity and price_per_unit columns contain numerical values, with some missing values as well.

Python Code

import pandas as pd

# Extract: Load data from a CSV file
df = pd.read_csv('source_data.csv')

# Transform: Example transformations
df.dropna(inplace=True)  # Drop rows with missing values
df['date'] = pd.to_datetime(df['date'])  # Convert to datetime
df['total_amount'] = df['quantity'] * df['price_per_unit']  # Calculate total amount
df = df[df['total_amount'] > 100]  # Filter rows with total_amount greater than 100

# Load: Save the transformed data to a new CSV file
df.to_csv('transformed_data.csv', index=False)
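Given the sample data above, transformed_data.csv should contain three rows, ids 2, 4, and 8, with total_amount values of 150.0, 135.0, and 162.0. Rows 6 and 7 are dropped for missing values, and the remaining rows fall below the 100 threshold.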

Additional ETL Examples