How to use Pandas pipe() to create data pipelines

The Pandas pipe() function can be used to create data pipelines and works well with the modern Pandas method chaining technique. Here's how to use it.

How to use Pandas pipe() to create data pipelines
Picture by Magda Ehlers, Pexels.
8 minutes to read

The Pandas pipe() function takes a dataframe as its input, transforms or manipulates it, and returns the transformed dataframe. It is a very useful function that can be used to create data pipelines, especially if you’re using the Pandas method chaining technique.

Pipelines and the method chaining approach are both quite similar in that they break Pandas code down into smaller steps that are executed one by one. Each step in the chain or pipeline takes a dataframe, transforms it, and returns the transformed dataframe to the next step.

Pipelines can make complex Pandas projects much easier to maintain, and they’re often more efficient. In this tutorial, I’ll go over the basics of using the pipe() function by creating a step-by-step data pipeline to load and clean a dataset.

Import the packages

To get started, open a Jupyter notebook and import the Pandas library.

import pandas as pd

Import the data

Next, we’ll create a function called load_data() to load our dataset. We’ll use the Internet Retail dataset, which you can download directly from GitHub using the Pandas read_csv() function. We’ll load it using the load_data() function and save it to an object called df and use the Pandas head() function to view the first five rows.

def load_data():
    df = pd.read_csv('https://raw.githubusercontent.com/databricks/'
                     'Spark-The-Definitive-Guide/master/data/retail-data/'
                     'all/online-retail-dataset.csv')
    return df
df = load_data()
df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom

Create a function to clean the column names

Next, we’ll create a series of functions to run on our pipeline. When working with remote data you may need to clean the column names. I tend to slugify the column names so that they are easier to work with in the future. We’ll create a function that takes our dataframe and returns a dataframe with cleaned column names.

def slugify_column_names(df):
    """Slugify the column names.
    
    Args:
        df (pd.DataFrame): The dataframe to slugify the column names.
    
    Returns:
        pd.DataFrame: The dataframe with the slugified column names.
    """

    df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
    return df
df = slugify_column_names(df)
df.head()
invoiceno stockcode description quantity invoicedate unitprice customerid country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom

Create a function to calculate line price

Next, we’ll create a simple function that calculates the line price by multiplying the quantity by the unitprice. This uses the Pandas assign() function to assign the result to a new column called line_price and then returns the dataframe to the next step if there is one.

def calculate_line_price(df):
    """Calculate the line price.
    
    Args:
        df (pd.DataFrame): The dataframe to calculate the line price.
    
    Returns:
        pd.DataFrame: The dataframe with the line price calculated.
    """

    return df.assign(line_price = df['unitprice'] * df['quantity'])

Create a data pipeline using pipe()

Finally, we can use the Pandas pipe() method to create a very tiny data pipeline. Of course, in a proper pipeline, you’d have larger functions and more steps. To run the pipeline, we first call load_data() which returns a dataframe of raw data. We then append .pipe() and pass the argument slugify_column_names which takes the dataframe and runs it through the slugify_column_names() function.

The slugify_column_names() function returns a transformed dataframe that is passed to another pipe() command which then runs calculate_line_price(). The finished pipeline is then saved to df_cleaned. This works fine, but there’s an additional step I’d recommend you take.

df_cleaned = load_data().pipe(slugify_column_names).pipe(calculate_line_price)
df_cleaned.head()
invoiceno stockcode description quantity invoicedate unitprice customerid country line_price
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 20.34

Wrap your pipeline in parentheses

By wrapping your pipeline in parentheses, you get the benefit of being able to incorporate whitespace and commenting that can greatly aid readability. You can also use a range of method chaining debugging techniques to work with the pipeline far more effectively than with a continuous pipe() chain.

df_cleaned = (
      load_data()
      .pipe(slugify_column_names)
      .pipe(calculate_line_price)
)

df_cleaned.head()
invoiceno stockcode description quantity invoicedate unitprice customerid country line_price
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 20.34

Matt Clarke, Sunday, January 08, 2023

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.