How to use Pandas explode() to split a list column into rows

Picture by Pixabay, Pexels.

8 minutes to read

Data Science Pandas

When dealing with more complex datasets, you’ll often find that Pandas dataframe columns contain data stored as Python lists. While these are easy to store, they do take a little more effort to work with. One thing you’ll probably want to do is split up the list of values over multiple rows of a new Pandas dataframe.

There are various ways to split up dataframe column lists into rows of a new dataframe. However, one of the most elegant ways to do this is via the Pandas explode() function.

The explode() function is very easy to use and simply takes the name of the column (or columns) you want to split into rows, and it returns a new dataframe in which the list values are spread over multiple rows. In this quick tutorial, I’ll show you how to use explode() to split single and multiple list-based columns over rows of a new Pandas dataframe.

Install or upgrade Pandas

To get started, open a Jupyter notebook and install or upgrade the Pandas package to the latest version using the Pip Python package management system. Upgrading to the latest release of Pandas will be useful if your environment is running a version older than 1.3.0, since the explode() function was updated to include new functionality in this release.

!pip3 install --upgrade pandas

import pandas as pd
pd.__version__

'1.5.2'

Create a Pandas dataframe

Next, either import data into a dataframe, or create a dummy dataframe containing some list values stored in the columns. We’ll be using the explode() function to split these list values up into rows in a new dataframe.

The dummy dataframe created includes three rows with three columns: the customer column holds the customer ID, the skus column holds the SKU (or stock keeping unit) of the products purchased, and the prices column holds the prices of the products. The skus and prices columns store these data in Python lists.

df = pd.DataFrame({'customer': [1, 2, 3], 
                   'skus': [[1, 2], [3, 4], [5, 6]], 
                   'prices': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]] 
                }) 
df

	customer	skus	prices
0	1	[1, 2]	[1.0, 2.0]
1	2	[3, 4]	[3.0, 4.0]
2	3	[5, 6]	[5.0, 6.0]

Explode a list column into multiple rows

First, we’ll use explode() to extract the values from the list stored in the skus column and split up the data over multiple rows. We get back a new dataframe that contains a row for each value present in each of the skus lists. Note that the index now contains values that match the original dataframe, but which are out of sync, with some duplicates present.

df_skus = df.explode('skus')
df_skus

	customer	skus	prices
0	1	1	[1.0, 2.0]
0	1	2	[1.0, 2.0]
1	2	3	[3.0, 4.0]
1	2	4	[3.0, 4.0]
2	3	5	[5.0, 6.0]
2	3	6	[5.0, 6.0]

Fix the index of the exploded dataframe

To fix the index of the exploded dataframe, we can use the reset_index method and pass the drop parameter as True. This gives us the same dataframe as the one we got from the explode method but with a sequential index.

df_skus = df_skus.reset_index(drop=True)
df_skus

	customer	skus	prices
0	1	1	[1.0, 2.0]
1	1	2	[1.0, 2.0]
2	2	3	[3.0, 4.0]
3	2	4	[3.0, 4.0]
4	3	5	[5.0, 6.0]
5	3	6	[5.0, 6.0]

Explode multiple list columns into dataframe rows

Providing you’re running version 1.3.0 or greater of Pandas, you can also explode multiple list columns into dataframe rows. If you’re not running Pandas 1.3.0 or more and you attempt this you’ll get an error that says ValueError: column must be a scalar. You can find the Pandas version installed by entering pd.__version__.

pd.__version__

'1.5.2'

Our dataframe contains two columns that contain lists of values, so we can pass their column names to the explode() function as a list. This then explodes the values from the column lists into their own rows in the dataframe, with the customer ID correctly mapped to each one.

df_skus_prices = df.explode(['skus', 'prices'])
df_skus_prices

	customer	skus	prices
0	1	1	1.0
0	1	2	2.0
1	2	3	3.0
1	2	4	4.0
2	3	5	5.0
2	3	6	6.0

As with the previous example, the index is not sequential, so we need to fix that by resetting the index. Once that’s done, we get a perfect dataframe back with the two list based columns separated into rows.

df_skus_prices = df_skus_prices.reset_index(drop=True)
df_skus_prices

	customer	skus	prices
0	1	1	1.0
1	1	2	2.0
2	2	3	3.0
3	2	4	4.0
4	3	5	5.0
5	3	6	6.0

Matt Clarke, Monday, November 28, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.