The Pandas pipe()
function takes a dataframe as its input, transforms or manipulates it, and returns the transformed dataframe. It is a very useful function that can be used to create data pipelines, especially if you’re using the Pandas method chaining technique.
Pipelines and the method chaining approach are both quite similar in that they break Pandas code down into smaller steps that are executed one by one. Each step in the chain or pipeline takes a dataframe, transforms it, and returns the transformed dataframe to the next step.
Pipelines can make complex Pandas projects much easier to maintain, and they’re often more efficient. In this tutorial, I’ll go over the basics of using the pipe()
function by creating a step-by-step data pipeline to load and clean a dataset.
To get started, open a Jupyter notebook and import the Pandas library.
import pandas as pd
Next, we’ll create a function called load_data()
to load our dataset. We’ll use the Internet Retail dataset, which you can download directly from GitHub using the Pandas read_csv()
function. We’ll load it using the load_data()
function and save it to an object called df
and use the Pandas head()
function to view the first five rows.
def load_data():
df = pd.read_csv('https://raw.githubusercontent.com/databricks/'
'Spark-The-Definitive-Guide/master/data/retail-data/'
'all/online-retail-dataset.csv')
return df
df = load_data()
df.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
Next, we’ll create a series of functions to run on our pipeline. When working with remote data you may need to clean the column names. I tend to slugify the column names so that they are easier to work with in the future. We’ll create a function that takes our dataframe and returns a dataframe with cleaned column names.
def slugify_column_names(df):
"""Slugify the column names.
Args:
df (pd.DataFrame): The dataframe to slugify the column names.
Returns:
pd.DataFrame: The dataframe with the slugified column names.
"""
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
return df
df = slugify_column_names(df)
df.head()
invoiceno | stockcode | description | quantity | invoicedate | unitprice | customerid | country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
Next, we’ll create a simple function that calculates the line price by multiplying the quantity
by the unitprice
. This uses the Pandas assign()
function to assign the result to a new column called line_price
and then returns the dataframe to the next step if there is one.
def calculate_line_price(df):
"""Calculate the line price.
Args:
df (pd.DataFrame): The dataframe to calculate the line price.
Returns:
pd.DataFrame: The dataframe with the line price calculated.
"""
return df.assign(line_price = df['unitprice'] * df['quantity'])
Finally, we can use the Pandas pipe()
method to create a very tiny data pipeline. Of course, in a proper pipeline, you’d have larger functions and more steps. To run the pipeline, we first call load_data()
which returns a dataframe of raw data. We then append .pipe()
and pass the argument slugify_column_names
which takes the dataframe and runs it through the slugify_column_names()
function.
The slugify_column_names()
function returns a transformed dataframe that is passed to another pipe()
command which then runs calculate_line_price()
. The finished pipeline is then saved to df_cleaned
. This works fine, but there’s an additional step I’d recommend you take.
df_cleaned = load_data().pipe(slugify_column_names).pipe(calculate_line_price)
df_cleaned.head()
invoiceno | stockcode | description | quantity | invoicedate | unitprice | customerid | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
By wrapping your pipeline in parentheses, you get the benefit of being able to incorporate whitespace and commenting that can greatly aid readability. You can also use a range of method chaining debugging techniques to work with the pipeline far more effectively than with a continuous pipe()
chain.
df_cleaned = (
load_data()
.pipe(slugify_column_names)
.pipe(calculate_line_price)
)
df_cleaned.head()
invoiceno | stockcode | description | quantity | invoicedate | unitprice | customerid | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
Matt Clarke, Sunday, January 08, 2023