When working in Pandas you’ll sometimes encounter data stored in a single column that would actually be better presented when split into separate columns. For example, a Pandas column might contain a comma separated list of values, or a Python list.
Data stored in a single column are much harder to work with, so splitting data up into distinct columns can make analysis or modeling much easier. There are a couple of ways to split a Pandas column string or list into separate columns. The Pandas explode()
function is probably the best, but you can also use the str.split()
function.
In this simple project, we’ll take a string of comma separated values stored in a single column of a Pandas dataframe, split the data up into a Python list stored in a single column, and then explode or split the data, so it’s present in its own specific column.
First, open a Jupyter notebook and import the Pandas package. Use the set_option()
function to increase the maximum column width using max_colwidth
so we can easily see longer column values.
import pandas as pd
pd.set_option('max_colwidth', 100)
Next we need to create a Pandas dataframe that contains some string columns that are separated in some way, for example, as a comma separated value list. To make things more challenging for ourselves, we’ll make these strings of differing length, so we can’t take an easy route.
data = {
'brand': ['Jaguar',
'Land Rover',
'Lotus'],
'models_string': ['XE, XF, E Pace, F Pace, I Pace, F Type',
'Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar',
'Emira, Evija']
}
df = pd.DataFrame(data, columns=['brand', 'models_string'])
df.head()
brand | models_string | |
---|---|---|
0 | Jaguar | XE, XF, E Pace, F Pace, I Pace, F Type |
1 | Land Rover | Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar |
2 | Lotus | Emira, Evija |
First, we’ll split or “explode” our string of comma separated values into a Python list using the str.split()
function. We’ll append the str.split()
function to the name of the original column and define the separator as a comma, then we’ll assign the output of the result to a new column called models_list
. When executed this will split our string up after each comma and create a Python list of values.
df['models_list'] = df['models_string'].str.split(',')
df.head()
brand | models_string | models_list | |
---|---|---|---|
0 | Jaguar | XE, XF, E Pace, F Pace, I Pace, F Type | [XE, XF, E Pace, F Pace, I Pace, F Type] |
1 | Land Rover | Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar | [Defender, Discovery, Discovery Sport, Range Rover, Range Rover Sport, Range Rover Velar] |
2 | Lotus | Emira, Evija | [Emira, Evija] |
Finally, we can take each of these list values and store them in their own Pandas column. Our lists are of unequal length, which makes this trickier. We’ll first create a new dataframe called df_models
and will use tolist()
to pass in the data from models_list
. This returns a Pandas dataframe with each of the list values in their own column. We’ll then use fillna('')
to replace any NaN values with an empty string and prefix each column with model_
.
df_models = pd.DataFrame(df['models_list'].tolist()).fillna('').add_prefix('model_')
Next, we’ll use concat()
to perform an outer join. We’ll pass in the two dataframes and join them on their index, returning a new dataframe that includes the original brand data with the models each in their own column.
df = pd.concat([df, df_models], axis=1)
df[['brand', 'model_0', 'model_1', 'model_2', 'model_3', 'model_4', 'model_5']]
brand | model_0 | model_1 | model_2 | model_3 | model_4 | model_5 | |
---|---|---|---|---|---|---|---|
0 | Jaguar | XE | XF | E Pace | F Pace | I Pace | F Type |
1 | Land Rover | Defender | Discovery | Discovery Sport | Range Rover | Range Rover Sport | Range Rover Velar |
2 | Lotus | Emira | Evija |
Matt Clarke, Thursday, August 18, 2022