How to slugify column names and values in Pandas

Learn how to slugify column names and values in Pandas by removing non-alphanumeric characters and replacing spaces with underscores.


Slugification is the process of converting a string to a consistent, machine-friendly format, typically by lowercasing it, removing non-alphanumeric characters, and replacing spaces with underscores. It's useful for data scientists because it can be used to reformat both column names and the values they contain.

When importing third-party datasets into Pandas, I often use slugification to strip capital letters and symbols from column names, so they follow the same format as any new columns I create myself. This keeps things consistent and makes the data easier to work with.

In this short tutorial, I’ll show you how to use Python regular expressions to slugify column names and values in Pandas, so you can reformat and clean your data.
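Before turning to Pandas, the idea can be sketched on a single string using Python's built-in re module. This is only a minimal illustration: the slugify() function name is hypothetical, and stripping leading and trailing underscores is an extra step that isn't part of the Pandas examples in this tutorial.

```python
import re


def slugify(text):
    """Lowercase text and replace runs of non-word characters with underscores."""
    # \W+ matches one or more characters that are not letters, digits, or underscores
    slug = re.sub(r'\W+', '_', text.lower())
    # Assumption: drop underscores left at the ends by leading or trailing symbols
    return slug.strip('_')


print(slugify('Web Browser'))
print(slugify('Revenue ($)'))
```

Because the pattern matches whole runs of characters, a space followed by symbols becomes a single underscore rather than several.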

Import Pandas and create a dataframe

To get started, open a new Jupyter notebook and import Pandas. Then create a Pandas dataframe with some column names that contain spaces and special characters, or just import your own dataframe.

import pandas as pd

data = [{'Web Browser': 'Google Chrome', 'Country': 'United States'},
        {'Web Browser': 'Mozilla Firefox', 'Country': 'United States'},
        {'Web Browser': 'Internet Explorer', 'Country': 'United States'},
        {'Web Browser': 'Safari', 'Country': 'United States'},
        {'Web Browser': 'Opera', 'Country': 'United States'}]

df = pd.DataFrame(data)
df
         Web Browser        Country
0      Google Chrome  United States
1    Mozilla Firefox  United States
2  Internet Explorer  United States
3             Safari  United States
4              Opera  United States

Slugify Pandas column names using a Python regular expression

To slugify the column names, we’ll fetch them with df.columns, convert them to lowercase with str.lower(), and then use str.replace() with the regular expression \W+ to replace each run of non-word characters (anything other than letters, digits, and underscores, so spaces and symbols are included) with a single underscore. We’ll then assign the result back to df.columns. Only the column names change here; the values are left as they were.

# Lowercase the column names, then replace runs of non-word characters with underscores
df.columns = df.columns.str.lower().str.replace(r'\W+', '_', regex=True)
df
         web_browser        country
0      Google Chrome  United States
1    Mozilla Firefox  United States
2  Internet Explorer  United States
3             Safari  United States
4              Opera  United States
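If you slugify column names often, the same steps could be wrapped in a small helper. The sketch below is one possible approach, not a built-in Pandas feature: the slugify_columns() name and the sample column names are hypothetical, and the final str.strip('_') is an added assumption to tidy names that start or end with symbols.

```python
import pandas as pd


def slugify_columns(df):
    """Return a copy of df with lowercased, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.astype(str)                # make sure every name is a string
        .str.lower()                           # remove capital letters
        .str.replace(r'\W+', '_', regex=True)  # runs of non-word characters -> underscore
        .str.strip('_')                        # assumption: trim underscores at the ends
    )
    return out


# Hypothetical sample data with messy column names
raw = pd.DataFrame({'Web Browser': ['Safari'],
                    'Revenue ($)': [10],
                    ' Country ': ['United States']})

clean = slugify_columns(raw)
print(list(clean.columns))
```

Returning a copy leaves the original dataframe untouched, which makes it easier to compare the names before and after.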

Slugify Pandas column values using a Python regular expression

To slugify the values in a column, we’ll select the column, convert its values to lowercase with str.lower(), and use str.replace() with the same \W+ pattern to replace each run of non-word characters with a single underscore. We’ll then assign the result back to df['column_name']. Only web_browser is reformatted here, so country keeps its original values.

# Lowercase the values, then replace runs of non-word characters with underscores
df['web_browser'] = df['web_browser'].str.lower().str.replace(r'\W+', '_', regex=True)
df
         web_browser        country
0      google_chrome  United States
1    mozilla_firefox  United States
2  internet_explorer  United States
3             safari  United States
4              opera  United States
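To slugify the values in several columns at once, the same approach can be applied in a loop. This is a rough sketch under a few assumptions: the slugify_values() name is hypothetical, the listed columns are assumed to contain strings, and the trailing str.strip('_') is an optional extra.

```python
import pandas as pd


def slugify_values(df, columns):
    """Return a copy of df with the values in the given string columns slugified."""
    out = df.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.lower()                           # remove capital letters
            .str.replace(r'\W+', '_', regex=True)  # runs of non-word characters -> underscore
            .str.strip('_')                        # assumption: trim underscores at the ends
        )
    return out


# Small sample mirroring the dataframe used in this tutorial
sample = pd.DataFrame({'web_browser': ['Google Chrome', 'Safari'],
                       'country': ['United States', 'United States']})

slugged = slugify_values(sample, ['web_browser', 'country'])
print(slugged)
```

Passing the column names explicitly keeps numeric or date columns out of the loop, since the .str accessor only works on string data.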

Matt Clarke, Saturday, November 12, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.