How to use method chaining in Pandas

Pandas method chaining is a modern way to improve the readability of Pandas code and, according to its advocates, its performance, by splitting dataframe operations into simple pipeline steps.

Pandas method chaining, or flow programming, is a modern but sometimes controversial way of organising Pandas code into a chain or series of commands. Conceptually, Pandas chaining is a bit like the scikit-learn pipeline methodology used in machine learning.

Chaining is present in other data science libraries, such as R’s dplyr, but Pandas lends itself to chaining well because so many Pandas methods return a dataframe as their output. Each operation in the chain performs an action on the dataframe and passes the result to the next step. This lets you create a step-by-step series of operations that output a final dataframe, while avoiding temporary intermediate variables that clutter your code and can keep old copies of the data in memory.

I’m a huge fan of Python author Matt Harrison, who is a big advocate of the technique, and I’ve learned a lot from his writing on this topic. As Matt says, it can take a little getting used to, and at first may seem harder to understand, but many data scientists see it as a game changer for writing more readable sequential code once they adopt the technique.

In this tutorial, I’ll explain the advantages and disadvantages of method chaining, and show some simple code examples that convert regular Pandas code to code written using method chaining.

Advantages of method chaining in Pandas

Method chaining advocates say that it makes code clearer to read and quicker to write, avoids errors, and improves performance. That said, this does depend on the length and complexity of the chain you’re creating. It can be useful, but it can also be complex to decipher.

Improves code structure: This technique can force you to write your code in a clear step-by-step manner, allowing it to be easily followed by others, thus aiding readability and making future maintenance easier. It really needs to be written one step at a time too.
Eliminates intermediate variables: Method chaining lets you avoid creating temporary variables used to pass values between intermediate steps, so it can result in clearer, more concise code.
Can improve code readability: For small pipelines without lambda functions, Pandas method chaining can improve code readability and make the steps easier to scan and understand. Comments can be added to further aid readability if needed.
Can improve performance: Method chaining is said to be more efficient than separate steps, since no named intermediate dataframes are kept alive between steps. Each method call still returns a new dataframe, though, so any gain comes mainly from memory no longer held by intermediate variables rather than from raw speed.
Minimises errors: When used with the assign() method, method chaining lets you avoid the SettingWithCopyWarning issue that can occur in Pandas. It also works well in Jupyter notebooks since all the code sits in a single cell, thereby reducing the risk that you'll execute one code cell and not another.
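
As a rough illustration of the assign() point above, here is a minimal sketch. The sales dataframe and its values are made up for this example and are not the dataset used later in the tutorial.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in this tutorial
sales = pd.DataFrame({
    "sessions": [100, 200, 400],
    "transactions": [5, 8, 10],
})

# Filter, then add a column with assign(). assign() returns a new dataframe,
# so nothing is written back into a slice of `sales`, and the original
# dataframe is left untouched.
busy_days = (
    sales
    .query("sessions >= 200")
    .assign(conversion_rate=lambda x: (x["transactions"] / x["sessions"] * 100).round(2))
)

print(busy_days)
```

The filtered rows and the new column arrive in one expression, with no temporary variable holding the filtered subset.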

Disadvantages of method chaining in Pandas

Method chaining is a fairly divisive technique: some people love it while others hate it. I have similar feelings about Object Relational Mapping (ORM) for SQL databases. Both methods are designed to make code quicker and clearer to write, but share similar problems in that they can become complex and can sometimes make debugging much harder.

Splitting up larger chains into smaller steps can help resolve these issues, and there are some neat debugging methods you can employ that overcome many of the criticisms of the technique, so the issues are rarely insurmountable.
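
One debugging approach, sketched here under my own assumptions, is to drop a small pass-through function into the chain with pipe(). The peek() helper is a hypothetical name rather than part of Pandas, and the data is made up for illustration.

```python
import pandas as pd

def peek(frame, label):
    """Hypothetical helper: print the shape at this step and pass the frame on unchanged."""
    print(f"{label}: {frame.shape[0]} rows x {frame.shape[1]} columns")
    return frame

# Made-up data with one missing value
df = pd.DataFrame({"sessions": [100, None, 300], "transactions": [5, 2, 9]})

result = (
    df
    .pipe(peek, "loaded")
    .dropna()
    .pipe(peek, "after dropna")
    .assign(cr=lambda x: (x["transactions"] / x["sessions"] * 100).round(2))
    .pipe(peek, "after assign")
)
```

Because peek() returns the dataframe it receives, the calls can be added or removed without changing the result of the chain.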

Can be harder to read: Complex code that uses many steps in a method chain, or makes use of Pandas lambda functions, can be much harder to read and understand. This is the key reason why some Python developers and data scientists don't like it.
Harder to debug: Code written using method chaining is considered harder to debug because you can't easily view what is happening to the dataframe at each step in the chain. For some data scientists, that's enough to put them off using it. However, you can simply comment out a step in the chain, just as you would for a cell or bunch of cells in a notebook, so there's an easy workaround.
Only works on some functions: Each step in a chain is called on the object returned by the previous step, so if a step doesn't return a dataframe (or another object that has the method you want to call next), the chain can't continue.
Requires lambda functions: In order to do more complex things, such as modifying the values of one or more columns, lambda functions are required, and they can be hard to read. Crucially, lambda functions are needed to ensure that the code works on the latest version of the dataframe, rather than the original dataframe used when the chain is initialised.
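
To make the lambda point concrete, here is a small sketch using made-up data. The column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({"sessions": [100, 200, 300], "transactions": [5, 0, 9]})

chained = (
    df
    .query("transactions > 0")                # keeps the first and last rows
    .rename(columns={"sessions": "visits"})   # column is now called "visits"
    # The lambda receives the dataframe as it exists at this step, so the
    # renamed column is visible. Writing df["visits"] here would raise a
    # KeyError, because `df` still points at the original, unrenamed data.
    .assign(cr=lambda x: (x["transactions"] / x["visits"] * 100).round(2))
)

print(chained)
```

The variable df never changes during the chain, which is why each step that depends on earlier steps has to go through the lambda's argument instead.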

Import the packages

To get started, open a Jupyter notebook and import the Pandas library.

import pandas as pd

Write some code without method chaining

To show how method chaining works, we’ll first write some regular code in Jupyter cells and then refactor it to use method chaining. This will show how method chaining can make code more readable and concise. Ignore the fact that we’re performing some pointless calculations - it’s just for demonstration purposes.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
df = df.fillna('')
df = df.sort_values(by='date', ascending=False)
df['conversion_rate'] = ((df['transactions'] / df['sessions']) * 100).round(2)
df['revenuePerTransaction'] = (df['transactionRevenue'] / df['transactions']).round(2)
df['transactionsPerSession'] = (df['transactions'] / df['sessions']).round(2)
df['date'] = pd.to_datetime(df['date'])
df = df.rename(columns={'date': 'Date', 
                        'sessions': 'Sessions', 
                        'transactions': 'Transactions', 
                        'transactionRevenue': 'Revenue', 
                        'transactionsPerSession': 'Transactions Per Session', 
                        'revenuePerTransaction': 'AOV',
                        'conversion_rate': 'CR'}) 
df = df.drop(columns=['Unnamed: 0', 'Transactions Per Session'])
df.head()
           Date  Sessions  Transactions  Revenue    AOV    CR
364  2021-12-31      4071           105  8241.39  78.49  2.58
363  2021-12-30      4924           117  5155.67  44.07  2.38
362  2021-12-29      4890           108  7868.78  72.86  2.21
361  2021-12-28      5045           112  6729.58  60.09  2.22
360  2021-12-27      4412           105  6929.72  66.00  2.38

Refactor the code to use method chaining

Next we’ll refactor the code above to use method chaining. The chain is wrapped in a pair of parentheses, which lets you include whitespace and line breaks in the expression to aid readability. We’re effectively creating a one-liner that performs all steps, but we’re using the parentheses to split it over multiple lines, with one method call per line, to make it readable and logical.
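
As a quick sketch of what the parentheses buy you, the two expressions below are equivalent. The tiny dataframe is made up for illustration.

```python
import pandas as pd

# Made-up two-row dataframe
df = pd.DataFrame({"date": ["2021-12-30", "2021-12-31"], "sessions": [4924, 4071]})

# One long line...
one_liner = df.sort_values(by="date", ascending=False).rename(columns={"date": "Date"}).head(1)

# ...and the same expression wrapped in parentheses, one method call per line
wrapped = (
    df
    .sort_values(by="date", ascending=False)
    .rename(columns={"date": "Date"})
    .head(1)
)

print(one_liner.equals(wrapped))
```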

We’ll add each part of the chain using a step-by-step process. First, we’ll call pd.read_csv() and load our remote CSV file. This returns a raw dataframe that gets passed to the next step, which uses fillna() to fill any missing values.

The fillna() step returns a dataframe which gets passed to sort_values() and then to assign(), which runs some lambda functions that perform the calculations and use round() to round the values. We’ll then use rename() to rename some columns, use drop() to drop some columns, and use astype() to cast the dtype of the date column.

df_chain = (
    pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
    .fillna('')
    .sort_values(by='date', ascending=False)
    .assign(
        conversion_rate=(lambda x: ((x['transactions'] / x['sessions']) * 100).round(2)),
        revenuePerTransaction=(lambda x: (x['transactionRevenue'] / x['transactions']).round(2)),
        transactionsPerSession=(lambda x: (x['transactions'] / x['sessions']).round(2))
    )
    .rename(columns={'date': 'Date', 
                    'sessions': 'Sessions', 
                    'transactions': 'Transactions', 
                    'transactionRevenue': 'Revenue', 
                    'transactionsPerSession': 'Transactions Per Session', 
                    'revenuePerTransaction': 'AOV',
                    'conversion_rate': 'CR'})        
    .drop(columns=['Unnamed: 0', 'Transactions Per Session'])
    .astype({'Date': 'datetime64[ns]'})  
)

df_chain.head()
           Date  Sessions  Transactions  Revenue    AOV    CR
364  2021-12-31      4071           105  8241.39  78.49  2.58
363  2021-12-30      4924           117  5155.67  44.07  2.38
362  2021-12-29      4890           108  7868.78  72.86  2.21
361  2021-12-28      5045           112  6729.58  60.09  2.22
360  2021-12-27      4412           105  6929.72  66.00  2.38

The code produces the same result, but I find it more readable and easier to understand. In the earlier code, since it could be split up over multiple cells in a Jupyter notebook, there’s always a risk that someone might run the code out of sequence and end up with a different result. Since method chaining groups everything into a single expression in a single cell, the steps within the chain always run in order.
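
If you want to check that a refactor like this really is equivalent, one option is to compare the two results with pd.testing.assert_frame_equal(). The sketch below uses two made-up rows that borrow the column names from the CSV above, and a shortened version of the steps, so it runs without downloading anything.

```python
import pandas as pd

# Two made-up rows with the same column names as the tutorial's CSV
raw = pd.DataFrame({
    "date": ["2021-12-30", "2021-12-31"],
    "sessions": [4924, 4071],
    "transactions": [117, 105],
    "transactionRevenue": [5155.67, 8241.39],
})

# Step-by-step version
step = raw.copy()
step = step.sort_values(by="date", ascending=False)
step["CR"] = ((step["transactions"] / step["sessions"]) * 100).round(2)
step["AOV"] = (step["transactionRevenue"] / step["transactions"]).round(2)
step["date"] = pd.to_datetime(step["date"])

# Chained version of the same steps
chained = (
    raw
    .sort_values(by="date", ascending=False)
    .assign(
        CR=lambda x: ((x["transactions"] / x["sessions"]) * 100).round(2),
        AOV=lambda x: (x["transactionRevenue"] / x["transactions"]).round(2),
        date=lambda x: pd.to_datetime(x["date"]),
    )
)

# Raises an AssertionError if the two dataframes differ
pd.testing.assert_frame_equal(step, chained)
```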

Functions that can be used in method chaining

Method chaining supports any Pandas method that is called on a dataframe and returns a dataframe (or another chainable object), so the result can be passed to the next step in the chain. Here’s a summary of some of the most commonly used Pandas methods in method chaining.

| Method | Usage |
| --- | --- |
| .assign(col1 = value, col2 = value) | The assign() method can be used to create or compute new column values, change dtypes, and a range of other things. Crucially, it should be used with lambda functions to ensure any changes are made to the latest version of the dataframe at the given step within the chain. |
| .pipe(function) | The pipe() method can be used to pass a DataFrame to a custom function. Basically, anything you can't do with a regular Pandas method, or which might make the code harder to read, can be handled easily using pipe() to call a custom function. |
| .drop(columns=['col1', 'col2']) | The drop() method can be used to drop columns. It takes a Python list of column names. |
| .fillna(value) | The fillna() method can be used to fill missing values. It can be used to handle or impute missing values and is quite versatile. |
| .sort_values(by='col1', ascending=False) | The sort_values() method can be used to sort a DataFrame by a column. |
| .groupby('col1') | The groupby() method can be used to group a DataFrame by a column. It returns a GroupBy object rather than a DataFrame, so it is normally followed by an aggregation such as agg(). |
| .agg({'col1': ['sum', 'mean'], 'col2': ['sum', 'mean']}) | The agg() method can be used to aggregate columns, typically after groupby(), applying one or more functions to each column. |
| .where() | The where() method keeps values that meet a condition and replaces the rest (with NaN by default), so the DataFrame keeps its shape. |
| .query() | The query() method can be used to filter the rows of a DataFrame using a boolean expression written as a string. |
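
Here is a short sketch that combines several of the methods above in one chain. The data, the column names, and the add_conversion_rate() function are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "channel": ["email", "email", "search", "search"],
    "sessions": [100, 300, 200, 400],
    "transactions": [5, 9, 0, 10],
})

def add_conversion_rate(frame):
    """Hypothetical custom step for pipe(): add a conversion rate column."""
    return frame.assign(cr=lambda x: (x["transactions"] / x["sessions"] * 100).round(2))

summary = (
    df
    .query("transactions > 0")            # drops the search row with 0 transactions
    .groupby("channel", as_index=False)   # returns a GroupBy object, not a dataframe
    .agg(sessions=("sessions", "sum"), transactions=("transactions", "sum"))
    .pipe(add_conversion_rate)
    .sort_values(by="cr", ascending=False)
)

# where() keeps the shape and replaces non-matching values with NaN
masked = df["transactions"].where(df["transactions"] > 0)

print(summary)
print(masked)
```

Note the difference between the two filters: query() removes the row entirely, while where() leaves it in place with a missing value.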

Matt Clarke, Thursday, January 05, 2023

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.