How to slugify column names and values in Pandas

Slugification is the process of converting a string to lowercase, removing non-alphanumeric characters, and replacing spaces with underscores. Slugifying data is really useful for data scientists and can be used to reformat both column names and the values they contain.
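To make the definition concrete, here is a minimal sketch of slugification on plain Python strings, before Pandas is involved. The slugify() helper and the sample strings are my own illustrations rather than part of any library, and the stripping of leading and trailing underscores is an assumption about how you might want edge cases handled.

```python
import re


def slugify(text: str) -> str:
    """Lowercase text and replace runs of non-word characters with one underscore."""
    # \W+ matches one or more characters that are not letters, digits, or underscores
    slug = re.sub(r'\W+', '_', text.lower())
    # Assumption: drop underscores left at either end by leading/trailing symbols
    return slug.strip('_')


print(slugify('Web Browser'))      # web_browser
print(slugify('Revenue (USD) %'))  # revenue_usd
```

Note that \W treats the underscore as a word character, so any underscores already in the string are kept as they are.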

When importing third-party datasets into Pandas, I often use slugification to reformat column names, removing capital letters and symbols so they’re all in the same format as any new columns I might create myself. This keeps things consistent and makes it easier to work with the data.

In this simple tutorial I’ll show you how you can use Python regular expressions to slugify column names and values in Pandas to both reformat and clean your data.

Import Pandas and create a dataframe

To get started, open a new Jupyter notebook and import Pandas. Then create a Pandas dataframe with some column names that contain spaces and special characters, or just import your own dataframe.

import pandas as pd

data = [{'Web Browser': 'Google Chrome', 'Country': 'United States'},
        {'Web Browser': 'Mozilla Firefox', 'Country': 'United States'},
        {'Web Browser': 'Internet Explorer', 'Country': 'United States'},
        {'Web Browser': 'Safari', 'Country': 'United States'},
        {'Web Browser': 'Opera', 'Country': 'United States'}]

df = pd.DataFrame(data)
df

	Web Browser	Country
0	Google Chrome	United States
1	Mozilla Firefox	United States
2	Internet Explorer	United States
3	Safari	United States
4	Opera	United States

Slugify Pandas column names using a Python regular expression

To reformat our column names using slugification, we’ll fetch the names of the columns using df.columns, convert them to lowercase with str.lower(), and then use the str.replace() method with the regular expression \W+ to replace each run of non-word characters (anything other than letters, digits, and underscores), including spaces, with a single underscore. We’ll then reassign the new column names to the dataframe using df.columns. The values in the columns are left unchanged at this stage.

# Lowercase the column names, then replace runs of non-word characters with an underscore
df.columns = df.columns.str.lower().str.replace(r'\W+', '_', regex=True)
df

	web_browser	country
0	Google Chrome	United States
1	Mozilla Firefox	United States
2	Internet Explorer	United States
3	Safari	United States
4	Opera	United States
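Third-party column names are often messier than the ones in this example, with stray whitespace or trailing symbols. The following is a rough sketch of how the same approach could be wrapped in a reusable function. The name slugify_columns(), the made-up sample dataframe, and the extra str.strip() calls are assumptions of mine, and the function assumes every column name is a string.

```python
import pandas as pd


def slugify_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with slugified column names (hypothetical helper)."""
    out = df.copy()
    out.columns = (
        out.columns
        .str.strip()                           # remove surrounding whitespace
        .str.lower()                           # remove capital letters
        .str.replace(r'\W+', '_', regex=True)  # symbols and spaces -> underscore
        .str.strip('_')                        # drop leading/trailing underscores
    )
    return out


# Made-up example data with deliberately messy column names
messy = pd.DataFrame({' Web Browser ': ['Safari'], 'Market Share (%)': [18.5]})
clean = slugify_columns(messy)
print(list(clean.columns))  # ['web_browser', 'market_share']
```

Returning a copy rather than modifying the dataframe in place keeps the original column names available if you need to refer back to them.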

Slugify Pandas column values using a Python regular expression

To reformat the values in our dataframe, we’ll fetch the column, convert its values to lowercase with str.lower(), and use the str.replace() method to replace each run of non-word characters, including spaces, with a single underscore. We’ll then reassign the new column values to the dataframe using df['column_name'].

# Lowercase the values, then replace runs of non-word characters with an underscore
df['web_browser'] = df['web_browser'].str.lower().str.replace(r'\W+', '_', regex=True)

df

	web_browser	country
0	google_chrome	United States
1	mozilla_firefox	United States
2	internet_explorer	United States
3	safari	United States
4	opera	United States
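If you want to slugify the values in more than one column, one option is to loop over a list of column names and apply the same chain of string methods to each. This is only a sketch under a few assumptions: the dataframe is rebuilt here with two rows so the example is self-contained, the columns contain only strings, and the final str.strip('_') is an extra step I have added to tidy values that begin or end with a symbol.

```python
import pandas as pd

df = pd.DataFrame({
    'web_browser': ['Google Chrome', 'Mozilla Firefox'],
    'country': ['United States', 'United States'],
})

# Slugify the values in each listed column
for column in ['web_browser', 'country']:
    df[column] = (
        df[column]
        .str.lower()                           # remove capital letters
        .str.replace(r'\W+', '_', regex=True)  # symbols and spaces -> underscore
        .str.strip('_')                        # drop leading/trailing underscores
    )

print(df)
```

Missing values stay as NaN because the .str methods skip them, so the loop should not need special handling for gaps in the data.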

Matt Clarke, Saturday, November 12, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.