Slugification is the process of removing non-alphanumeric characters from a string and replacing spaces with underscores. Slugifying data is really useful for data scientists and can be used to both reformat column names and the values they contain.
When importing third party datasets into Pandas, I often use slugification to reformat column names to remove capital letters and symbols and ensure they’re all in the same format as any new columns I might create myself. This keeps things consistent and makes it easier to work with the data.
In this simple tutorial I’ll show you how you can use Python regular expressions to slugify column names and values in Pandas to both reformat and clean your data.
To get started, open a new Jupyter notebook and import Pandas. Then create a Pandas dataframe with some column names that contain spaces and special characters, or just import your own dataframe.
import pandas as pd
data = [{'Web Browser': 'Google Chrome', 'Country': 'United States'},
{'Web Browser': 'Mozilla Firefox', 'Country': 'United States'},
{'Web Browser': 'Internet Explorer', 'Country': 'United States'},
{'Web Browser': 'Safari', 'Country': 'United States'},
{'Web Browser': 'Opera', 'Country': 'United States'}]
df = pd.DataFrame(data)a
df
Web Browser | Country | |
---|---|---|
0 | Google Chrome | United States |
1 | Mozilla Firefox | United States |
2 | Internet Explorer | United States |
3 | Safari | United States |
4 | Opera | United States |
To reformat our column names using slugification, we’ll use a Python regular expression to remove all non-alphanumeric characters and replace spaces with underscores. To do this, we’ll fetch the names of the columns using df.columns
and then use the str.replace()
method to replace all non-alphanumeric characters with an empty string and all spaces with underscores. We’ll then reassign the new column names to the dataframe using df.columns
.
df.columns = df.columns.str.replace(r'\W+', '_', regex=True)
df
web_browser | country | |
---|---|---|
0 | google_chrome | United States |
1 | mozilla_firefox | United States |
2 | internet_explorer | United States |
3 | safari | United States |
4 | opera | United States |
To reformat the values in our dataframe, we’ll fetch the column and use the str.replace()
method to replace all non-alphanumeric characters with an empty string and all spaces with underscores. We’ll then reassign the new column values to the dataframe using df['column_name']
.
df['web_browser'] = df['web_browser'].str.replace(r'\W+', '_', regex=True)
df
web_browser | country | |
---|---|---|
0 | google_chrome | United States |
1 | mozilla_firefox | United States |
2 | internet_explorer | United States |
3 | safari | United States |
4 | opera | United States |
Matt Clarke, Saturday, November 12, 2022