How to split strings using the Pandas split() function

Learn how to use the Pandas split() function to split strings into lists or columns, including the use of regex, expand, and n parameters.


The Pandas split() function lets you split a string value up into a list or into separate dataframe columns based on a separator or delimiter value, such as a space or comma. It’s a very useful function to master and includes a number of additional parameters that you can use to customize the output.

The split() function has the following basic structure: Series.str.split(pat=None, *, n=-1, expand=False, regex=None). The pat parameter is the delimiter or separator value, and the n parameter is the maximum number of splits to perform, with the default of -1 meaning no limit.

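The expand parameter is a boolean value that determines whether the output is a Series of lists (expand=False) or a dataframe with separate columns (expand=True). The regex parameter (added in Pandas 1.4.0) determines whether pat is treated as a regular expression. With its default of None, a single-character pat is treated as a literal string and a longer pat is treated as a regular expression. Let’s go over some code examples to see how these parameters work.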
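As a quick orientation before the full examples, here is a minimal sketch of pat, n and expand side by side. The Series s and its comma-separated values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate the parameters
s = pd.Series(['a,b,c', 'd,e,f'])

# pat: the delimiter; the default output is a Series of lists
as_lists = s.str.split(pat=',')

# n: the maximum number of splits, counted from the left
limited = s.str.split(pat=',', n=1)

# expand: return a dataframe with one column per piece
as_columns = s.str.split(pat=',', expand=True)

print(as_lists.tolist())   # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(limited.tolist())    # [['a', 'b,c'], ['d', 'e,f']]
print(as_columns.shape)    # (2, 3)
```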

Create a dataframe

To get started, open a new Jupyter notebook and import the Pandas library, then either import data into a Pandas dataframe, or create a dummy dataframe containing some values to split, like the example below.

import pandas as pd

df = pd.DataFrame({'name': ['John Paul Smith', 'Brian David Jones', 'Harry William Roberts'],
                   'website': ['', '', ''],
                   'telephone': ['(01234) 5678910', '(05432) 9876543', '(09876) 5432109'],
                   'username': ['john_smith', 'brian_jones', 'harry_roberts']})
                    name  website        telephone       username
0        John Paul Smith           (01234) 5678910     john_smith
1      Brian David Jones           (05432) 9876543    brian_jones
2  Harry William Roberts           (09876) 5432109  harry_roberts

Use split() to split a string to a list

First, we’ll call the split() method using its default arguments. The main argument is called pat and can be omitted. When it is left out, the string is split on whitespace, so just calling str.split() will split a string wherever there are one or more spaces, tabs or newlines.

If we assign its output to a new column, we can see the results are stored in a list. This technique is known as tokenization, and it’s a common preprocessing step when working in Natural Language Processing (NLP).

df['split_name'] = df['name'].str.split()
df[['name', 'split_name']]
                    name                 split_name
0        John Paul Smith        [John, Paul, Smith]
1      Brian David Jones      [Brian, David, Jones]
2  Harry William Roberts  [Harry, William, Roberts]
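One detail worth seeing in action is that the default behaves differently from passing pat=' ' explicitly. The sketch below uses a made-up string containing a double space and a tab to show the difference.

```python
import pandas as pd

# Hypothetical messy value: two spaces after "John" and a tab before "Smith"
messy = pd.Series(['John  Paul\tSmith'])

# Default: split on any run of whitespace
default_split = messy.str.split()

# Explicit single space: every individual space is a split point
space_split = messy.str.split(pat=' ')

print(default_split.tolist())  # [['John', 'Paul', 'Smith']]
print(space_split.tolist())    # [['John', '', 'Paul\tSmith']]
```

If your data may contain irregular spacing, leaving pat unset is usually the safer choice.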

Use split with a custom delimiter using pat

The pat argument is used to define a custom delimiter or separator in place of the default whitespace. For example, the username column contains the first name and last name of each person separated by an underscore, so if we set str.split(pat='_') the string will be split at the underscore and a list of values returned.

df['split_username'] = df['username'].str.split(pat='_')
df[['username', 'split_username']]
        username    split_username
0     john_smith     [john, smith]
1    brian_jones    [brian, jones]
2  harry_roberts  [harry, roberts]
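Real data is rarely this tidy, so it helps to know what happens when the delimiter is absent or the value is missing. This is a small sketch using invented usernames; the 'admin' value and the missing entry are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical usernames: one without an underscore and one missing value
usernames = pd.Series(['john_smith', 'admin', None])

result = usernames.str.split(pat='_')

# A value without the delimiter becomes a one-element list,
# and a missing value stays missing rather than raising an error
print(result.iloc[0])       # ['john', 'smith']
print(result.iloc[1])       # ['admin']
print(result.isna().tolist())  # [False, False, True]
```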

Use split with expand to return a dataframe

The expand argument is used to return a Pandas dataframe, instead of a list. If we call df['username'].str.split(pat='_', expand=True) we get a dataframe with two columns, one for each split value.

df['username'].str.split(pat='_', expand=True)
       0        1
0   john    smith
1  brian    jones
2  harry  roberts
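When rows produce different numbers of pieces, expand=True still returns a rectangular dataframe and pads the shorter rows with missing values. A minimal sketch, assuming a made-up username with two underscores:

```python
import pandas as pd

# Hypothetical usernames with a different number of parts
usernames = pd.Series(['john_smith', 'mary_ann_lee'])

parts = usernames.str.split(pat='_', expand=True)

# The widest row decides the number of columns
print(parts.shape)               # (2, 3)

# The shorter row is padded with a missing value in the last column
print(pd.isna(parts.iloc[0, 2]))  # True
```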

Use expand to create new columns after splitting

The really neat thing about expand=True is that it can also be used to add new columns containing split values to the original Pandas dataframe. For example, df[['first_name', 'middle_name', 'last_name']] = df['name'].str.split(pat=' ', expand=True) will add three new columns to the original dataframe containing the first, middle and last names of each person.

df[['first_name', 'middle_name', 'last_name']] = df['name'].str.split(pat=' ', expand=True)
df[['name', 'first_name', 'middle_name', 'last_name']]
                    name  first_name  middle_name  last_name
0        John Paul Smith        John         Paul      Smith
1      Brian David Jones       Brian        David      Jones
2  Harry William Roberts       Harry      William    Roberts
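Assigning to a fixed list of column names only works when the split produces exactly that many columns. If you are not sure how many pieces each name has, one possible alternative is to generate the column names from the split itself. The sketch below uses add_prefix() and join(); the name_part_ prefix and the sample names are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical names with a different number of parts
names = pd.DataFrame({'name': ['John Paul Smith', 'Mary Jones']})

# Let the split decide how many columns there are, then label them
parts = names['name'].str.split(expand=True).add_prefix('name_part_')

# Attach the new columns to the original dataframe by index
result = names.join(parts)

print(list(result.columns))
# ['name', 'name_part_0', 'name_part_1', 'name_part_2']
```

The shorter name simply has a missing value in name_part_2, which you can then handle however suits your data.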

Extract a specific element from a list after splitting

Let’s say you want to split a string in a Pandas column and extract a specific element, such as the first name from a full name. There are several ways to do this. One of them is the str.get() method, which returns the element at a given position in each list.

df['first_name'] = df['split_name'].str.get(0)
df[['name', 'split_name', 'first_name']]
                    name                 split_name  first_name
0        John Paul Smith        [John, Paul, Smith]        John
1      Brian David Jones      [Brian, David, Jones]       Brian
2  Harry William Roberts  [Harry, William, Roberts]       Harry
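A useful property of str.get() is that it does not raise an error when the requested position does not exist in a shorter list. Here is a minimal sketch with made-up token lists of different lengths.

```python
import pandas as pd

# Hypothetical token lists of different lengths
tokens = pd.Series([['John', 'Paul', 'Smith'], ['Mary', 'Jones']])

# Position 2 exists in the first list but not in the second
third = tokens.str.get(2)

print(third.iloc[0])          # Smith
print(pd.isna(third.iloc[1]))  # True
```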

Split the firstname and lastname into two columns

The second method is to use split and access the individual elements using str[]. For example, to split the name column into first and last name, we can do the following:

df['firstname'] = df['name'].str.split(' ').str[0]
df['lastname'] = df['name'].str.split(' ').str[2]
df[['name', 'firstname', 'lastname']]
                    name  firstname  lastname
0        John Paul Smith       John     Smith
1      Brian David Jones      Brian     Jones
2  Harry William Roberts      Harry   Roberts
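Using str[2] for the last name relies on every name having exactly three parts. If that is not guaranteed, a negative index is a more forgiving option. The sketch below assumes some made-up names of different lengths.

```python
import pandas as pd

# Hypothetical names, one without a middle name
names = pd.Series(['John Paul Smith', 'Mary Jones'])

parts = names.str.split(' ')

first = parts.str[0]    # first element of each list
last = parts.str[-1]    # last element, however long the list is

print(first.tolist())   # ['John', 'Mary']
print(last.tolist())    # ['Smith', 'Jones']
```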

Use n to limit the number of splits

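The n argument limits the number of splits. For example, to split the name column into just two columns, you can set n=1 so that each string is split only at the first space. The first column then holds the first name and the second holds the rest of the name. Setting n=2 allows two splits, which gives three columns.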

df['name'].str.split(pat=' ', n=1, expand=True)
       0                1
0   John       Paul Smith
1  Brian      David Jones
2  Harry  William Roberts

df['name'].str.split(pat=' ', n=2, expand=True)
       0        1        2
0   John     Paul    Smith
1  Brian    David    Jones
2  Harry  William  Roberts
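Because n counts splits from the left, n=1 keeps the middle name attached to the last name. If you would rather keep it with the first name, the related str.rsplit() method counts from the right instead. It is not covered elsewhere in this article, so treat this as a side-by-side sketch.

```python
import pandas as pd

names = pd.Series(['John Paul Smith', 'Brian David Jones'])

# split() with n=1 splits at the first space
from_left = names.str.split(pat=' ', n=1, expand=True)

# rsplit() with n=1 splits at the last space
from_right = names.str.rsplit(pat=' ', n=1, expand=True)

print(from_left.values.tolist())
# [['John', 'Paul Smith'], ['Brian', 'David Jones']]
print(from_right.values.tolist())
# [['John Paul', 'Smith'], ['Brian David', 'Jones']]
```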

Use regex to split a string

Provided you’re using Pandas 1.4.0 or later, you can also use the regex parameter of the split() function. The regex argument tells split() whether the pat value provided is a Python regular expression or not. If regex=True, then pat is treated as a regular expression.

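If regex=False, then pat is treated as a literal string. You can still use a regular expression with split() in older versions of Pandas, because any pat longer than one character is treated as a regular expression by default. For example, to extract the digits from our telephone numbers, which are in the format (01234) 5678910, we can set pat to r'\D+', which matches one or more non-digit characters, so that only the runs of digits remain. Column 0 of the output is empty because each number starts with a non-digit character, the opening bracket.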

df['telephone'].str.split(pat=r'\D+', expand=True)
   0      1        2
0     01234  5678910
1     05432  9876543
2     09876  5432109
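To make the effect of the regex parameter explicit, here is a small sketch that runs the same pattern with regex=True and regex=False. It assumes Pandas 1.4.0 or later, since the parameter does not exist before that.

```python
import pandas as pd

phones = pd.Series(['(01234) 5678910'])

# Treated as a regular expression: split on runs of non-digits
as_regex = phones.str.split(pat=r'\D+', regex=True)

# Treated as literal text: the characters \D+ never appear, so nothing is split
as_literal = phones.str.split(pat=r'\D+', regex=False)

print(as_regex.tolist())    # [['', '01234', '5678910']]
print(as_literal.tolist())  # [['(01234) 5678910']]
```

The leading empty string in the first result is the same empty column 0 seen in the dataframe output.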

By selecting the individual columns of the resulting dataframe by their labels, i.e. [1] and [2], we can then extract the area code and phone number from the longer telephone number string.

df['area_code'] = df['telephone'].str.split(pat=r'\D+', expand=True)[1]
df['phone_number'] = df['telephone'].str.split(pat=r'\D+', expand=True)[2]
df[['telephone', 'area_code', 'phone_number']]
         telephone  area_code  phone_number
0  (01234) 5678910      01234       5678910
1  (05432) 9876543      05432       9876543
2  (09876) 5432109      09876       5432109
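The code above runs the same split twice. An equivalent approach is to split once, keep the result in a variable, and pick the columns from it. In this sketch the variable name parts is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({'telephone': ['(01234) 5678910', '(05432) 9876543']})

# Split once and reuse the resulting dataframe
parts = df['telephone'].str.split(pat=r'\D+', expand=True)

df['area_code'] = parts[1]      # column 0 is the empty leading piece
df['phone_number'] = parts[2]

print(df['area_code'].tolist())     # ['01234', '05432']
print(df['phone_number'].tolist())  # ['5678910', '9876543']
```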

Split the domain name from the website

Finally, we can split the domain name from the website using the same technique as above. Here we’re using the str.split() method to split the website column on the // characters and then selecting column 1 of the resulting dataframe, which holds everything after the //.

df['domain'] = df['website'].str.split(pat=r'//', expand=True)[1]
df[['website', 'domain']]
website domain

If you want to get rid of the “www.” part, you can use split() again and extract the second element, split the original string at “www.” instead, or use replace() to remove the string if it’s present. A simple .str.replace('www.', '', regex=False) will work, but I’ve spelled out the parameter names here so you can see what they all do. Setting regex=False makes sure the dot is matched literally rather than as a wildcard.

df['domain'] = df['domain'].str.replace(pat=r'www.', repl='', regex=False)
df[['website', 'domain']]
website domain
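To see both steps end to end, here is a self-contained sketch. The URLs are placeholders invented for this example (they are not the values from the dataframe above), and it assumes every value in the website column contains //.

```python
import pandas as pd

# Hypothetical URLs, used only for illustration
sites = pd.DataFrame({'website': ['https://www.example.com',
                                  'https://example.org']})

# Step 1: keep everything after the //
sites['domain'] = sites['website'].str.split(pat='//', expand=True)[1]

# Step 2: remove a literal "www." where present
sites['domain'] = sites['domain'].str.replace(pat='www.', repl='', regex=False)

print(sites['domain'].tolist())  # ['example.com', 'example.org']
```

Note that replace() removes “www.” wherever it occurs in the string, not only at the start, which is fine for typical domains but worth keeping in mind.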

Matt Clarke, Monday, November 28, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.