How to split strings using the Pandas split() function

The Pandas split() function lets you split a string value up into a list or into separate dataframe columns based on a separator or delimiter value, such as a space or comma. It’s a very useful function to master and includes a number of additional parameters that you can use to customize the output.

The split() function has the following basic structure: Series.str.split(pat=None, *, n=-1, expand=False, regex=None). The pat parameter is the delimiter or separator value, and the n parameter is the maximum number of splits to make. The default of -1 means the string is split at every occurrence of the delimiter.

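The expand parameter is a boolean value that determines whether the output is a Series of lists (expand=False, the default) or a dataframe with one column per split value (expand=True). The regex parameter (added in Pandas 1.4.0) is a boolean value that determines whether pat is treated as a regular expression or as a literal string. When regex is left at its default of None, a single-character pat is treated as a literal string and a longer pat is treated as a regular expression. Let’s go over some code examples to see how these parameters work.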

Create a dataframe

To get started, open a new Jupyter notebook and import the Pandas library, then either import data into a Pandas dataframe, or create a dummy dataframe containing some values to split, like the example below.

import pandas as pd

df = pd.DataFrame({'name': ['John Paul Smith', 'Brian David Jones', 'Harry William Roberts'],
                   'website': ['https://www.johnsmith.com', 'https://www.brianjones.com', 'https://www.harryroberts.com'],
                   'telephone': ['(01234) 5678910', '(05432) 9876543', '(09876) 5432109'],
                   'username': ['john_smith', 'brian_jones', 'harry_roberts']})
df

	name	website	telephone	username
0	John Paul Smith	https://www.johnsmith.com	(01234) 5678910	john_smith
1	Brian David Jones	https://www.brianjones.com	(05432) 9876543	brian_jones
2	Harry William Roberts	https://www.harryroberts.com	(09876) 5432109	harry_roberts

Use split() to split a string to a list

First, we’ll call the split() method using its default arguments. The main argument is called pat, and it can be passed positionally, so the keyword doesn’t need to be written. When pat is omitted, the string is split on whitespace, so just calling str.split() will split a string at each run of spaces, tabs, or newlines.

If we assign its output to a new column, we can see that the results for each row are stored in a list. This is a simple form of tokenization, a common preprocessing step in Natural Language Processing (NLP).

df['split_name'] = df['name'].str.split()
df[['name', 'split_name']]

	name	split_name
0	John Paul Smith	[John, Paul, Smith]
1	Brian David Jones	[Brian, David, Jones]
2	Harry William Roberts	[Harry, William, Roberts]
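
The difference between the default whitespace split and an explicit single-space delimiter only shows up on messy input. Here is a minimal sketch, assuming a hypothetical Series called messy that contains a double space and a tab (neither appears in the dataframe above):

```python
import pandas as pd

# Hypothetical messy names: a double space in the first, a tab in the second
messy = pd.Series(['John  Paul Smith', 'Brian\tDavid Jones'])

# Default behaviour: split on any run of whitespace
default_split = messy.str.split()

# Explicit single space: only a literal space counts as a delimiter
space_split = messy.str.split(pat=' ')

print(default_split.tolist())
print(space_split.tolist())
```

With the default, both names come back as three clean tokens. With pat=' ', the double space produces an empty string and the tab is not treated as a delimiter at all.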

Use split with a custom delimiter using pat

The pat argument defines a custom delimiter or separator to use instead of the default whitespace. For example, the username column contains the first name and last name of each person separated by an underscore, so if we call str.split(pat='_') each string will be split at the underscore and a list of values returned.

df['split_username'] = df['username'].str.split(pat='_')
df[['username', 'split_username']]

	username	split_username
0	john_smith	[john, smith]
1	brian_jones	[brian, jones]
2	harry_roberts	[harry, roberts]

Use split with expand to return a dataframe

The expand argument is used to return a Pandas dataframe instead of a Series of lists. If we call df['username'].str.split(pat='_', expand=True) we get a dataframe with two columns, one for each split value.

df['username'].str.split(pat='_', expand=True)

	0	1
0	john	smith
1	brian	jones
2	harry	roberts
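
The usernames above all split into exactly two parts. When rows split into different numbers of parts, expand=True still returns a rectangular dataframe and fills the gaps with missing values. A small sketch, using a hypothetical Series of names of unequal length:

```python
import pandas as pd

# Hypothetical names: three parts in the first row, two in the second
names = pd.Series(['John Paul Smith', 'Brian Jones'])

# The widest row decides the number of columns
parts = names.str.split(expand=True)

print(parts)
print(parts.shape)
```

The second row has no third part, so its last cell holds a missing value (None or NaN, depending on the Pandas version and dtype).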

Use expand to create new columns after splitting

The really neat thing about expand=True is that it can also be used to add new columns containing split values to the original Pandas dataframe. For example, df[['first_name', 'middle_name', 'last_name']] = df['name'].str.split(pat=' ', expand=True) will add three new columns to the original dataframe containing the first, middle and last names of each person. This works because every name splits into exactly three parts, which matches the three column names on the left-hand side.

df[['first_name', 'middle_name', 'last_name']] = df['name'].str.split(pat=' ', expand=True)
df[['name', 'first_name', 'middle_name', 'last_name']]

	name	first_name	middle_name	last_name
0	John Paul Smith	John	Paul	Smith
1	Brian David Jones	Brian	David	Jones
2	Harry William Roberts	Harry	William	Roberts
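
If you can’t be sure how many parts each value will split into, one option is to inspect the result before naming the columns. The sketch below is only an illustration: the people dataframe and the name_part_ column prefix are hypothetical.

```python
import pandas as pd

# Hypothetical data: the second name has four parts, not three
people = pd.DataFrame({'name': ['John Paul Smith', 'Mary Ann Louise Evans']})

parts = people['name'].str.split(pat=' ', expand=True)
n_parts = parts.shape[1]  # number of columns produced by the widest row

# Label the columns generically, since the count depends on the data
parts.columns = [f'name_part_{i}' for i in range(n_parts)]

# Attach the new columns to the original dataframe by index
people = people.join(parts)
print(people)
```

Here the split produces four columns, so assigning the result to a fixed list of three column names would not fit.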

Extract a specific element from a list after splitting

Let’s say you want to split a string in a Pandas column and extract a specific element, such as the first name from a full name. There are several ways to do this. One of them is the str.get() method, which returns the element at a given position in each list.

df['first_name'] = df['split_name'].str.get(0)
df[['name', 'split_name', 'first_name']]

	name	split_name	first_name
0	John Paul Smith	[John, Paul, Smith]	John
1	Brian David Jones	[Brian, David, Jones]	Brian
2	Harry William Roberts	[Harry, William, Roberts]	Harry
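
One useful property of str.get() is that it does not fail when a list is too short for the requested position. A minimal sketch, assuming a hypothetical Series where one name has only two parts:

```python
import pandas as pd

# Hypothetical names: the second has no third part
names = pd.Series(['John Paul Smith', 'Brian Jones'])
tokens = names.str.split()

# Position 2 exists only for the first row
third = tokens.str.get(2)
print(third)
```

The first row returns 'Smith' and the second returns a missing value rather than raising an error.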

Split the firstname and lastname into two columns

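The second method is to use split() and access the individual elements using str[]. For example, to split the name column into first and last name, we can do the following. Note that str[2] picks the third element, which is the last name here only because every name has exactly three parts.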

df['firstname'] = df['name'].str.split(' ').str[0]
df['lastname'] = df['name'].str.split(' ').str[2]
df[['name', 'firstname', 'lastname']]

	name	firstname	lastname
0	John Paul Smith	John	Smith
1	Brian David Jones	Brian	Jones
2	Harry William Roberts	Harry	Roberts
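
When names can have a different number of parts, a negative index is a safer way to pick the last element. The following sketch uses a hypothetical Series of names and splits only once, reusing the result for both columns:

```python
import pandas as pd

# Hypothetical names with two, three and four parts
names = pd.Series(['Brian Jones', 'John Paul Smith', 'Mary Ann Louise Evans'])

# Split once and reuse the lists
tokens = names.str.split(' ')

first = tokens.str[0]    # first element of each list
last = tokens.str[-1]    # last element, however long the list is

print(pd.DataFrame({'name': names, 'firstname': first, 'lastname': last}))
```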

Use n to limit the number of splits

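The n argument limits the number of splits, counting from the left. For example, with n=1 the name is split only at the first space, which gives two columns: the first name and everything after it. With n=2 at most two splits are made, which gives three columns and, for these three-part names, the same result as an unlimited split.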

df['name'].str.split(pat=' ', n=1, expand=True)

	0	1
0	John	Paul Smith
1	Brian	David Jones
2	Harry	William Roberts

df['name'].str.split(pat=' ', n=2, expand=True)

	0	1	2
0	John	Paul	Smith
1	Brian	David	Jones
2	Harry	William	Roberts
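
Because n counts splits from the left, n=1 separates the first name from the rest. To separate the last name instead, Pandas also provides str.rsplit(), which takes the same pat, n and expand arguments but works from the right. A short sketch with hypothetical names:

```python
import pandas as pd

# Hypothetical names with three and four parts
names = pd.Series(['John Paul Smith', 'Mary Ann Louise Evans'])

# One split from the left: first name | everything else
from_left = names.str.split(pat=' ', n=1, expand=True)

# One split from the right: everything else | last name
from_right = names.str.rsplit(pat=' ', n=1, expand=True)

print(from_left)
print(from_right)
```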

Use regex to split a string

Provided you’re using Pandas 1.4.0 or later, you can also use the regex parameter of the split() function. The regex argument tells split() whether the pat value provided is a Python regular expression or not. If regex=True, then pat is treated as a regular expression.

If regex=False, then pat is treated as a literal string. However, you can still use a regular expression with split() in older versions of Pandas, because any pat longer than one character is treated as a regular expression by default. For example, to extract the digits from our telephone numbers, which are in the format (01234) 5678910, we can set pat to r'\D+', which matches any run of non-digit characters, so the string is split at the brackets and the space and only the numeric parts remain.

df['telephone'].str.split(pat=r'\D+', expand=True)

	0	1	2
0		01234	5678910
1		05432	9876543
2		09876	5432109

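By referencing the individual columns using their labels, e.g. [1], we can then extract the area code and phone number from the longer telephone number string. Column 0 is not used because each telephone number starts with a bracket, so the first split value is an empty string.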

df['area_code'] = df['telephone'].str.split(pat=r'\D+', expand=True)[1]
df['phone_number'] = df['telephone'].str.split(pat=r'\D+', expand=True)[2]
df[['telephone', 'area_code', 'phone_number']]

	telephone	area_code	phone_number
0	(01234) 5678910	01234	5678910
1	(05432) 9876543	05432	9876543
2	(09876) 5432109	09876	5432109
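
Setting regex explicitly matters most when a delimiter happens to contain characters that have a special meaning in regular expressions. The sketch below assumes Pandas 1.4.0 or later and a hypothetical Series that uses ' | ' (space, pipe, space) as its delimiter:

```python
import pandas as pd

# Hypothetical pipe-delimited values
colours = pd.Series(['red | green | blue'])

# Literal match: split on the exact three-character string ' | '
literal = colours.str.split(pat=' | ', regex=False)

# Regex match: '|' means "or", so the pattern matches a single space
as_regex = colours.str.split(pat=' | ', regex=True)

print(literal.iloc[0])
print(as_regex.iloc[0])
```

The literal version returns the three colour names, while the regex version splits at every space and leaves the pipes behind as separate elements. Since a pat longer than one character is treated as a regular expression when regex is left at None, passing regex=False is the safer choice for delimiters like this one.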

Split the domain name from the website

Finally, we can split the domain name from the website using the same technique as above. Here we’re using the str.split() method to split the website column on the // characters and then selecting column 1, which holds the second split value, from the resulting dataframe.

df['domain'] = df['website'].str.split(pat=r'//', expand=True)[1]
df[['website', 'domain']]

	website	domain
0	https://www.johnsmith.com	www.johnsmith.com
1	https://www.brianjones.com	www.brianjones.com
2	https://www.harryroberts.com	www.harryroberts.com

If you want to get rid of the “www.” part, you can use split() again and extract the second element, split the original string at “www.” instead of at “//”, or just use replace() to find and replace the string if it’s present. A simple .str.replace('www.', '') will work, but I’ve added the parameters here so you can see what they all do.

df['domain'] = df['domain'].str.replace(pat=r'www.', repl='', regex=False)
df[['website', 'domain']]

	website	domain
0	https://www.johnsmith.com	johnsmith.com
1	https://www.brianjones.com	brianjones.com
2	https://www.harryroberts.com	harryroberts.com
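
A literal replace removes every occurrence of “www.” wherever it appears in the string. If you only want to strip it from the start, one alternative is a regular expression anchored with ^. This is a sketch on a hypothetical Series of hostnames, not part of the dataframe above:

```python
import pandas as pd

# Hypothetical hostnames: the second contains 'www.' in the middle
hosts = pd.Series(['www.johnsmith.com', 'shop.mywww.net'])

# Literal replace: removes 'www.' anywhere in the string
literal = hosts.str.replace(pat='www.', repl='', regex=False)

# Anchored regex: removes 'www.' only at the start; the dot is escaped
anchored = hosts.str.replace(pat=r'^www\.', repl='', regex=True)

print(literal.tolist())
print(anchored.tolist())
```

Both versions give johnsmith.com for the first value, but only the anchored version leaves shop.mywww.net untouched.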

Matt Clarke, Monday, November 28, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.