Building Scalable ETL Pipelines with Python

Extract, Transform, Load (ETL) is a critical component of data management where data is:

  • Extracted from various sources
  • Transformed into a format suitable for analysis, and then
  • Loaded into a data warehouse or other storage system.

It is one of the most important parts of the data pipeline and crucial to the success of any data-driven organization or project.

Extract, Transform, and Load Data Example

In its most basic sense, you may be doing ETL and not even know it. Let's say you have two data sets contained in CSV files that you want to combine.

  1. The first source contains a list of countries with country_code and country_population columns.

    country_code    country_population
    USA             345,426,571
    CAD             39,742,430
  2. The second source contains a list of cities with City, Country, and Population columns, where Country stores the country code and Population is the city's population.

    City         Country  Population
    New York     USA      8,097,282
    Los Angeles  USA      3,795,936
    Toronto      CAD      2,832,718
    Calgary      CAD      1,491,900
    ...          ...      ...
  3. Let's say we want to create a report that aggregates the data into a single table with country_code, city_population, and non_city_population columns and writes it to a separate data source.

    country_code    city_population    non_city_population
    USA             59,849,899         285,576,672
    CAD             31,873,429         7,869,001

To do that, you may:

  1. Load the data into Excel.
  2. Create formulas and a pivot table to process the data: one calculation to sum the city populations for each country, and another to subtract that sum from each country's total population.
  3. Save the final table to a new CSV file.

There you go! That's an ETL process! But as simple as ETL can be, it can also get very complicated, involving many data sources, large amounts of data, and complex computation, where success is critical.
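For comparison, the same three steps can be sketched in pandas, the Python library used later in this article. This is a minimal sketch, not a drop-in script: the file names countries.csv, cities.csv, and report.csv are assumptions, and thousands=',' is used to parse the comma-grouped population numbers shown above.

import pandas as pd

# Extract: read both sources (file names are assumed for this sketch)
countries = pd.read_csv('countries.csv', thousands=',')  # country_code, country_population
cities = pd.read_csv('cities.csv', thousands=',')        # City, Country, Population

# Transform: sum city populations per country code
city_totals = (
    cities.groupby('Country', as_index=False)['Population']
          .sum()
          .rename(columns={'Country': 'country_code', 'Population': 'city_population'})
)

# Join with the country totals and derive the non-city population
report = countries.merge(city_totals, on='country_code')
report['non_city_population'] = report['country_population'] - report['city_population']

# Load: write the report to a separate data source
report[['country_code', 'city_population', 'non_city_population']].to_csv('report.csv', index=False)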

Unfortunately, while Excel is a great tool for getting things done quickly and on a relatively small scale, it lacks the scalability and auditability required in most academic or professional use cases.

Python as an ETL Tool

Enter Python. ETL using Python has become one of the most popular and widely adopted approaches for data management, data science, and big data. Python threads the needle between being high-level enough that it's fun and easy to use, like Excel, and low-level enough that you can confidently accomplish very complicated tasks at scale. Python also has one of the largest and most extensive ecosystems of open source tools available.

  • Ease of Use: Python’s syntax is simple and readable, making it accessible for beginners and powerful for experienced developers.
  • Community: Python has a very large community with a rich ecosystem of libraries and resources that simplify ETL development.
  • Flexibility: Python is a general-purpose programming language that can be used to accomplish almost any task.
  • Scalability: Python can execute compiled C and Rust code, and with tools like Apache Spark, Python can handle large-scale data processing.

ETL Pipelines

From a technical perspective, ETL pipelines are composed of a series of tasks, including planning tasks, that are executed in a specific order.

[Figure: ETL Pipeline]

In general, an ETL Pipeline can be divided into two main categories:

  • Workflow/Task Orchestration: Planning, organizing, and executing a sequence of extract, transform, and load tasks, where a task is just a piece of code that can be executed on a distributed system and implemented in any programming language using any technology (see the sketch after this list).
  • Pipeline Processing: Describing the entire pipeline as a whole and executing it.
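To make the orchestration idea concrete, here is a minimal sketch where each task is plain Python and the pipeline is just the order in which the tasks run. Real orchestrators such as Apache Airflow add scheduling, retries, and distributed execution on top of this idea.

def extract():
    # Task 1: pull raw records from a source (hard-coded here for the sketch)
    return [{'id': 1, 'value': 10}, {'id': 2, 'value': None}, {'id': 3, 'value': 7}]

def transform(records):
    # Task 2: drop incomplete records
    return [r for r in records if r['value'] is not None]

def load(records):
    # Task 3: deliver the cleaned records (printing stands in for a warehouse write)
    print(records)

# The pipeline itself: an ordered sequence of tasks with explicit dependencies
load(transform(extract()))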

ETL Frameworks

An ETL framework is a set of solutions that are packaged or integrated together in such a way that creating, managing, and executing ETL pipelines becomes easier. While an ETL framework is generally made up of a series of available products, libraries, and technologies, the implementation of a specific ETL framework is ultimately self-managed by the end user.

1. Extracting Data

  • Data Sources: Various origins like databases, files, APIs, or cloud storage.
  • Extraction Tools: Methods or tools used to retrieve data from these sources.
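For example, here is a minimal sketch of extracting from two common source types with pandas; the file name source_data.csv, the database source.db, and its orders table are assumptions.

import sqlite3
import pandas as pd

# File source: read a CSV into a DataFrame
df_file = pd.read_csv('source_data.csv')

# Database source: query a table into a DataFrame
conn = sqlite3.connect('source.db')
df_db = pd.read_sql_query('SELECT * FROM orders', conn)
conn.close()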

2. Data Transformation

  • Data Cleaning: Handling missing, incorrect, or irrelevant data.
  • Data Transformation: Converting data into the required format or structure.
  • Data Mapping: Aligning source data with the target schema.
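A minimal sketch of all three transformation steps in pandas, using made-up column names:

import pandas as pd

df = pd.DataFrame({'Qty': ['10', None, '15'],
                   'Date': ['2024-01-15', '2024-02-20', None]})

df = df.dropna()                              # cleaning: drop incomplete rows
df['Qty'] = df['Qty'].astype(int)             # transformation: cast to the required type
df['Date'] = pd.to_datetime(df['Date'])       # transformation: parse dates
df = df.rename(columns={'Qty': 'quantity',    # mapping: align with the target schema
                        'Date': 'date'})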

3. Loading Data

  • Data Destination: Where the transformed data is loaded, such as a data warehouse or database.
  • Loading Tools: Methods or tools used to insert data into the target system.
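A minimal sketch of loading into a SQLite database via pandas; the database name warehouse.db and table name sales are assumptions.

import sqlite3
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'total_amount': [150.0, 135.0]})

conn = sqlite3.connect('warehouse.db')
df.to_sql('sales', conn, if_exists='replace', index=False)  # insert into the target table
conn.close()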

4. Orchestration

  • Workflow Management: Managing ETL task sequences and dependencies.
  • Automation: Scheduling and automating ETL processes.
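As a sketch of automation using only Python's standard library; production pipelines would typically use cron or an orchestrator such as Apache Airflow instead.

import sched
import time

def run_pipeline():
    print('running the ETL pipeline...')

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(5, 1, run_pipeline)  # run the pipeline 5 seconds from now
scheduler.run()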

5. Monitoring and Logging

  • Monitoring Tools: Tracking performance and health of ETL processes.
  • Logging: Recording ETL operations for troubleshooting and analysis.
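A minimal logging sketch using Python's standard logging module; the row count is a placeholder metric for the sketch.

import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('etl')

log.info('extract started')
row_count = 1000  # placeholder metric for the sketch
log.info('extract finished: %d rows', row_count)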

6. Error Handling and Recovery

  • Error Detection: Identifying issues during ETL processes.
  • Error Handling: Managing and resolving errors.
  • Recovery Procedures: Strategies for restoring from failures.
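A minimal sketch of detect, handle, and recover as a retry wrapper; the attempt count and delay are arbitrary choices for the sketch.

import time

def run_with_retry(task, attempts=3, delay=2.0):
    # Retry a flaky task a few times before giving up
    for attempt in range(1, attempts + 1):
        try:
            return task()  # error detection happens via exceptions
        except Exception as exc:
            print(f'attempt {attempt} failed: {exc}')  # error handling
            if attempt == attempts:
                raise  # recovery: escalate after the final attempt
            time.sleep(delay)  # recovery: back off, then retry

run_with_retry(lambda: print('loaded'))  # usage: wrap any flaky task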

7. Security

  • Data Security: Protecting data during ETL operations.
  • Compliance: Adhering to legal and regulatory requirements.

8. Documentation

  • Process Documentation: Descriptions of ETL processes and mappings.
  • Metadata Management: Maintaining information about data sources and transformations.

These components collectively ensure efficient and effective data integration and management.

Open Source Python ETL Libraries and Tools

Fortunately, since the Python community has been doing ETL for a LONG time, there is a fantastic set of Python libraries that can help you with your ETL pipelines, offering various features that facilitate the ETL process, including tools for extraction, transformation, and loading, along with functionality for orchestration, monitoring, and error handling.

See our list of some of the best open source ETL tools available.

ETL Example

In this example, we use pandas, a very popular Python data analysis library, to execute a simple ETL process.

Dataset

source_data.csv

id,date,quantity,price_per_unit
1,2024-01-15,10,5.00
2,2024-02-20,20,7.50
3,2024-03-05,15,6.00
4,2024-04-22,30,4.50
5,2024-05-30,12,8.00
6,,25,6.00
7,2024-07-10,,7.00
8,2024-08-15,18,9.00

Description of Columns

  • id: Unique identifier for each record.
  • date: Date of the transaction or record.
  • quantity: Quantity of items or units involved in the transaction.
  • price_per_unit: Price per unit of the item or service.

Notes

  • This example includes some missing values (NaN), which will be handled by the ETL process (e.g., using dropna()).
  • The date column is in YYYY-MM-DD format.
  • The quantity and price_per_unit columns contain numerical values, with some missing values as well.

Python Code

import pandas as pd

# Extract: Load data from a CSV file
df = pd.read_csv('source_data.csv')

# Transform: Example transformations
df.dropna(inplace=True)  # Drop rows with missing values
df['date'] = pd.to_datetime(df['date'])  # Convert to datetime
df['total_amount'] = df['quantity'] * df['price_per_unit']  # Calculate total amount
df = df[df['total_amount'] > 100]  # Filter rows with total_amount greater than 100

# Load: Save the transformed data to a new CSV file
df.to_csv('transformed_data.csv', index=False)
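Given the sample data above, transformed_data.csv should contain three rows, ids 2, 4, and 8, with total_amount values of 150.0, 135.0, and 162.0. Rows 6 and 7 are dropped for missing values, and the remaining rows fall below the 100 threshold.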

Additional ETL Examples