How to use method chaining in Pandas

Photo by Miguel Á. Padriñán, Pexels.


Pandas method chaining, or flow programming, is a modern but sometimes controversial way of structuring Pandas code as a chain, or series, of commands. Conceptually, Pandas chaining is a bit like the scikit-learn pipeline methodology used in machine learning.

Chaining is present in other data science libraries, such as R’s dplyr, but Pandas lends itself well to chaining because so many Pandas methods return a dataframe as their output. Each operation in the chain performs an action on the dataframe and passes the result to the next step. This lets you create a step-by-step series of operations that outputs a final dataframe, while avoiding temporary intermediate variables that bloat your code and can increase memory usage.

I’m a huge fan of Python author Matt Harrison, who is a big advocate of the technique, and I’ve learned a lot from his writing on this topic. As Matt says, it can take a little getting used to and may seem harder to understand at first, but many data scientists see it as a game changer that helps them write more readable, sequential code once they adopt it.

In this tutorial, I’ll explain the advantages and disadvantages of method chaining, and show some simple code examples that convert regular Pandas code to code written using method chaining.

Advantages of method chaining in Pandas

Method chaining advocates say that it makes code clearer to read and quicker to write, avoids errors, and improves performance. That said, this does depend on the length and complexity of the chain you’re creating. It can be useful, but it can also be complex to decipher.

Improves code structure	This technique encourages you to write your code in a clear step-by-step manner, which makes it easier for others to follow and simpler to maintain. Chains are best built up one step at a time, checking the output as you go.
Eliminates intermediate variables	Method chaining lets you avoid creating temporary variables used to pass values between intermediate steps, so it can result in clearer, more concise code.
Can improve code readability	For small pipelines without lambda functions, Pandas method chaining can improve code readability and make the steps easier to scan and understand. Comments can be added to further aid readability if needed.
May improve performance	Method chaining is said to be more efficient than separate steps. Each method still returns a new dataframe, but because the intermediate results are never bound to a variable name, they can be freed as soon as the next step has finished, which may lower peak memory usage compared with keeping several named copies alive.
Minimises errors	When used with the `assign()` function, method chaining lets you avoid the SettingWithCopyWarning that can occur in Pandas when you set values on a slice of a dataframe. It also works well in Jupyter notebooks since all the code sits in a single cell, thereby reducing the risk that you'll execute one code cell and not another.
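
Here’s a minimal sketch of the `assign()` point above. The dataframe, its column values, and the variable names are made up for illustration and aren’t part of the dataset used later in this tutorial.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
sales = pd.DataFrame({
    'sessions': [100, 200, 0, 400],
    'transactions': [5, 8, 0, 10],
})

# query() returns a new dataframe and assign() adds the column to that
# result, so no column is ever set on a slice of `sales`.
active = (
    sales
    .query('sessions > 0')
    .assign(conversion_rate=lambda x: (x['transactions'] / x['sessions'] * 100).round(2))
)

print(active)
```

The original `sales` dataframe is left untouched, since `assign()` returns a new dataframe instead of writing into the filtered one.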

Disadvantages of method chaining in Pandas

Method chaining is a fairly divisive technique: some people love it, while others hate it. I have similar feelings about object-relational mapping (ORM) for SQL databases. Both approaches are designed to make code quicker and clearer to write, but they share similar problems in that they can become complex and can sometimes make debugging much harder.

Splitting up larger chains into smaller steps can help resolve these issues, and there are some neat debugging methods you can employ that overcome many of the criticisms of the technique, so the issues are rarely insurmountable.
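
One such debugging approach, sketched here under my own assumptions, is to drop a small reporting step into the chain using `pipe()`. The `peek()` function is a hypothetical helper written for this example, not a Pandas function, and the sample data is made up.

```python
import pandas as pd

def peek(df, label):
    """Hypothetical helper: print the shape, then pass the dataframe on unchanged."""
    print(f'{label}: {df.shape[0]} rows x {df.shape[1]} columns')
    return df

# Made-up sample data with one missing value.
sample = pd.DataFrame({
    'sessions': [120, 80, None, 200],
    'transactions': [6, 2, 1, 10],
})

result = (
    sample
    .pipe(peek, 'loaded')
    .dropna()
    .pipe(peek, 'after dropna')
    .assign(conversion_rate=lambda x: (x['transactions'] / x['sessions'] * 100).round(2))
    .pipe(peek, 'after assign')
)
```

Because `peek()` returns the dataframe it was given, you can add or remove these steps without changing the result of the chain.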

Can be harder to read	Complex code that uses many steps in a method chain, or makes use of Pandas lambda functions, can be much harder to read and understand. This is the key reason why some Python developers and data scientists don't like it.
Harder to debug	Code written using method chaining is considered harder to debug because you can't easily view what is happening to the dataframe at each step in the chain. For some data scientists, that's enough to put them off using it. However, you can simply comment out a step in the chain, just as you would for a cell or bunch of cells in a notebook, so there's an easy workaround.
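Only works with some functions	A chain can only continue while each step returns a dataframe, or another object that has the method you want to call next. If a step returns something else, such as `None` from a method called with `inplace=True`, the chain stops working at that point.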
Requires lambda functions	In order to do more complex things, such as modifying the values of one or more columns, lambda functions are required, and they can be hard to read. Crucially, lambda functions are needed to ensure that the code modifies the latest version of the dataframe, rather than the original dataframe used when the chain is initialised.
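
To illustrate the last point, here’s a small sketch of why the lambda matters. The `orders` dataframe and its values are made up for this example.

```python
import pandas as pd

# Made-up sample data, for illustration only.
orders = pd.DataFrame({
    'revenue': [100.0, 250.0, 80.0],
    'transactions': [4, 5, 2],
})

chained = (
    orders
    .rename(columns={'revenue': 'Revenue', 'transactions': 'Transactions'})
    # The lambda receives the dataframe as it exists at this step, so the
    # renamed columns are visible. Writing orders['Revenue'] here instead
    # would raise a KeyError, because `orders` still has the old names.
    .assign(AOV=lambda x: (x['Revenue'] / x['Transactions']).round(2))
)

print(chained)
```

Without the lambda, the only dataframe you could refer to by name is `orders`, which still reflects the state before the chain started.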

Import the packages

To get started, open a Jupyter notebook and import the Pandas library.

import pandas as pd

Write some code without method chaining

To show how method chaining works, we’ll first write some regular code in Jupyter cells and then refactor it to use method chaining. This will show how method chaining can make code more readable and concise. Ignore the fact that we’re performing some pointless calculations - it’s just for demonstration purposes.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
df = df.fillna('')
df = df.sort_values(by='date', ascending=False)
df['conversion_rate'] = ((df['transactions'] / df['sessions']) * 100).round(2)
df['revenuePerTransaction'] = (df['transactionRevenue'] / df['transactions']).round(2)
df['transactionsPerSession'] = (df['transactions'] / df['sessions']).round(2)
df['date'] = pd.to_datetime(df['date'])
df = df.rename(columns={'date': 'Date', 
                        'sessions': 'Sessions', 
                        'transactions': 'Transactions', 
                        'transactionRevenue': 'Revenue', 
                        'transactionsPerSession': 'Transactions Per Session', 
                        'revenuePerTransaction': 'AOV',
                        'conversion_rate': 'CR'}) 
df = df.drop(columns=['Unnamed: 0', 'Transactions Per Session'])
df.head()

	Date	Sessions	Transactions	Revenue	AOV	CR
364	2021-12-31	4071	105	8241.39	78.49	2.58
363	2021-12-30	4924	117	5155.67	44.07	2.38
362	2021-12-29	4890	108	7868.78	72.86	2.21
361	2021-12-28	5045	112	6729.58	60.09	2.22
360	2021-12-27	4412	105	6929.72	66.00	2.38

Refactor the code to use method chaining

Next we’ll refactor the code above to use method chaining. Chaining works because each method returns a dataframe, on which the next method can be called directly. Wrapping the whole expression in a pair of parentheses lets you include whitespace and line breaks to aid readability. We’re effectively creating a one-liner that performs all the steps, but we’re using the parentheses to split it over multiple lines so that it stays readable and logical.
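
As a quick sketch of the parentheses point, the two expressions below should be equivalent. The `visits` dataframe and the `sessions_k` column are made up for this example.

```python
import pandas as pd

# Made-up sample data, for illustration only.
visits = pd.DataFrame({'date': ['2021-01-02', '2021-01-01'], 'sessions': [50, 40]})

# Written as a literal one-liner.
one_liner = visits.sort_values(by='date').assign(sessions_k=lambda x: x['sessions'] / 1000).reset_index(drop=True)

# The same expression wrapped in parentheses, with one method per line.
multi_line = (
    visits
    .sort_values(by='date')
    .assign(sessions_k=lambda x: x['sessions'] / 1000)
    .reset_index(drop=True)
)

print(one_liner.equals(multi_line))
```

Python ignores line breaks inside the parentheses, so the layout is purely for the reader’s benefit.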

We’ll add each part of the chain using a step-by-step process. First, we’ll call pd.read_csv() and load our remote CSV file. This returns a raw dataframe that gets passed to the next step, which uses fillna() to fill any missing values, and then to sort_values(), which sorts the rows by date in descending order.

The sorted dataframe then gets passed to assign(), which runs some lambda functions that perform the calculations and use round() to round the values. Each lambda receives the dataframe as it exists at that point in the chain. We’ll then use rename() to rename some columns, drop() to drop the columns we don’t need, and astype() to cast the Date column to a datetime dtype.

df_chain = (
    pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
    .fillna('')
    .sort_values(by='date', ascending=False)
    .assign(
        conversion_rate=(lambda x: ((x['transactions'] / x['sessions']) * 100).round(2)),
        revenuePerTransaction=(lambda x: (x['transactionRevenue'] / x['transactions']).round(2)),
        transactionsPerSession=(lambda x: (x['transactions'] / x['sessions']).round(2))
    )
    .rename(columns={'date': 'Date', 
                    'sessions': 'Sessions', 
                    'transactions': 'Transactions', 
                    'transactionRevenue': 'Revenue', 
                    'transactionsPerSession': 'Transactions Per Session', 
                    'revenuePerTransaction': 'AOV',
                    'conversion_rate': 'CR'})        
    .drop(columns=['Unnamed: 0', 'Transactions Per Session'])
    .astype({'Date': 'datetime64[ns]'})  
)

df_chain.head()

	Date	Sessions	Transactions	Revenue	AOV	CR
364	2021-12-31	4071	105	8241.39	78.49	2.58
363	2021-12-30	4924	117	5155.67	44.07	2.38
362	2021-12-29	4890	108	7868.78	72.86	2.21
361	2021-12-28	5045	112	6729.58	60.09	2.22
360	2021-12-27	4412	105	6929.72	66.00	2.38

The code does exactly the same thing, but the steps are easier to scan and follow. Since the earlier code could be split over multiple cells in a Jupyter notebook, there’s always a risk that someone might run the cells out of sequence and end up with a different result. Because method chaining groups everything into a single expression in a single cell, the steps always run together and in order, which removes this risk for the code inside the chain.

Functions that can be used in method chaining

Method chaining supports any Pandas method that is called on a dataframe and returns a dataframe, so that the result can be passed to the next step in the chain. Most of these methods return a new dataframe rather than modifying the original. Here’s a summary of some of the most commonly used Pandas methods in method chaining.

Method	Usage
`.assign(col1 = value, col2 = value)`	The `assign()` method can be used to create or compute new column values, change dtypes, and a range of other things. Crucially, it should be used with lambda functions to ensure any changes are done to the latest copy of the dataframe at the given step within the chain.
`.pipe(function)`	The `pipe()` method can be used to pass a DataFrame to a custom function. Basically, anything you can't do with a regular Pandas function, or which might make the code harder to read, can be handled easily using `pipe()` to call a custom function.
`.drop(columns=['col1', 'col2'])`	The `drop()` method can be used to drop columns. It takes a Python list of column names.
`.fillna(value)`	The `fillna()` method can be used to fill missing values. This function can be used to handle or impute missing values and is quite versatile.
`.sort_values(by='col1', ascending=False)`	The `sort_values()` method can be used to sort a DataFrame by a column.
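`.groupby('col1')`	The `groupby()` method can be used to group a DataFrame by one or more columns. It returns a GroupBy object rather than a DataFrame, so it is normally followed by an aggregation step such as `agg()`.
`.agg({'col1': ['sum', 'mean'], 'col2': ['sum', 'mean']})`	The `agg()` method can be used to aggregate columns using one or more functions, such as `sum` and `mean`. When used after `groupby()`, it returns one row per group.
`.where()`	The `where()` method keeps the values for which a condition is True and replaces the rest with NaN, or with a value you provide. It does not drop any rows.
`.query()`	The `query()` method can be used to filter the rows of a DataFrame using a boolean expression written as a string, such as `'col1 > 100'`.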
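
To show how a few of these methods fit together, here’s a short sketch using made-up data. The `traffic` dataframe and the `add_conversion_rate()` function are hypothetical and written only for this example.

```python
import pandas as pd

# Made-up sample data, for illustration only.
traffic = pd.DataFrame({
    'channel': ['email', 'email', 'search', 'search'],
    'sessions': [100, 150, 300, 250],
    'transactions': [5, 6, 9, 10],
})

def add_conversion_rate(df):
    """Hypothetical custom step designed to be called with pipe()."""
    return df.assign(conversion_rate=(df['transactions'] / df['sessions'] * 100).round(2))

summary = (
    traffic
    .groupby('channel')                     # returns a GroupBy object, not a dataframe
    .agg(sessions=('sessions', 'sum'),      # aggregating returns a dataframe again
         transactions=('transactions', 'sum'))
    .pipe(add_conversion_rate)
    .sort_values(by='conversion_rate', ascending=False)
)

# query() drops the rows that fail the condition...
filtered = traffic.query('sessions >= 150')

# ...whereas where() keeps every row and replaces failing values with NaN.
masked = traffic[['sessions']].where(lambda x: x >= 150)

print(summary)
print(filtered)
print(masked)
```

The `agg()` call here uses keyword arguments to name the output columns, which is an alternative to the dictionary form shown in the table above.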