How to split a Pandas column string or list into separate columns

Picture by Nataliya Vaitkevich, Pexels.

6 minutes to read

Data Science Pandas

When working in Pandas you’ll sometimes encounter data stored in a single column that would actually be better presented when split into separate columns. For example, a Pandas column might contain a comma separated list of values, or a Python list.

Data stored in a single column are much harder to work with, so splitting data up into distinct columns can make analysis or modeling much easier. There are a couple of ways to split a Pandas column string or list into separate columns. The Pandas explode() function is probably the best, but you can also use the str.split() function.

In this simple project, we’ll take a string of comma separated values stored in a single column of a Pandas dataframe, split the data up into a Python list stored in a single column, and then explode or split the data, so it’s present in its own specific column.

Import the packages

First, open a Jupyter notebook and import the Pandas package. Use the set_option() function to increase the maximum column width using max_colwidth so we can easily see longer column values.

import pandas as pd

pd.set_option('max_colwidth', 100)

Create a Pandas dataframe

Next we need to create a Pandas dataframe that contains some string columns that are separated in some way, for example, as a comma separated value list. To make things more challenging for ourselves, we’ll make these strings of differing length, so we can’t take an easy route.

data = {
    'brand': ['Jaguar', 
              'Land Rover', 
              'Lotus'],
    'models_string': ['XE, XF, E Pace, F Pace, I Pace, F Type', 
                      'Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar',
                      'Emira, Evija']
}

df = pd.DataFrame(data, columns=['brand', 'models_string'])
df.head()

	brand	models_string
0	Jaguar	XE, XF, E Pace, F Pace, I Pace, F Type
1	Land Rover	Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar
2	Lotus	Emira, Evija

Use str.split() to split a string into a list

First, we’ll split or “explode” our string of comma separated values into a Python list using the str.split() function. We’ll append the str.split() function to the name of the original column and define the separator as a comma, then we’ll assign the output of the result to a new column called models_list. When executed this will split our string up after each comma and create a Python list of values.

df['models_list'] = df['models_string'].str.split(',')

df.head()

	brand	models_string	models_list
0	Jaguar	XE, XF, E Pace, F Pace, I Pace, F Type	[XE, XF, E Pace, F Pace, I Pace, F Type]
1	Land Rover	Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar	[Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar]
2	Lotus	Emira, Evija	[Emira, Evija]

Split a column list into separate columns

Finally, we can take each of these list values and store them in their own Pandas column. Our lists are of unequal length, which makes this trickier. We’ll first create a new dataframe called df_models and will use tolist() to pass in the data from models_list. This returns a Pandas dataframe with each of the list values in their own column. We’ll then use fillna('') to replace any NaN values with an empty string and prefix each column with model_.

df_models = pd.DataFrame(df['models_list'].tolist()).fillna('').add_prefix('model_')

Next, we’ll use concat() to perform an outer join. We’ll pass in the two dataframes and join them on their index, returning a new dataframe that includes the original brand data with the models each in their own column.

df = pd.concat([df, df_models], axis=1)

df[['brand', 'model_0', 'model_1', 'model_2', 'model_3', 'model_4', 'model_5']]

	brand	model_0	model_1	model_2	model_3	model_4	model_5
0	Jaguar	XE	XF	E Pace	F Pace	I Pace	F Type
1	Land Rover	Defender	Discovery	Discovery Sport	Range Rover	Range Rover Sport	Range Rover Velar
2	Lotus	Emira	Evija

Matt Clarke, Thursday, August 18, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.