The Pandas split() function lets you split a string value into a list or into separate dataframe columns based on a separator or delimiter value, such as a space or comma. It’s a very useful function to master and includes a number of additional parameters that you can use to customize the output.
The split() function has the following basic structure: Series.str.split(pat=None, *, n=-1, expand=False, regex=None). The pat parameter is the delimiter or separator value, and the n parameter is the maximum number of splits to perform (the default of -1 means no limit).
The expand parameter is a boolean value that determines whether the output is a list or separate columns. The regex parameter (added in Pandas 1.4.0) is a boolean value that determines whether the pat parameter is a regular expression or not. Let’s go over some code examples to see how these parameters work.
To get started, open a new Jupyter notebook and import the Pandas library, then either import data into a Pandas dataframe, or create a dummy dataframe containing some values to split, like the example below.
import pandas as pd
df = pd.DataFrame({'name': ['John Paul Smith', 'Brian David Jones', 'Harry William Roberts'],
'website': ['https://www.johnsmith.com', 'https://www.brianjones.com', 'https://www.harryroberts.com'],
'telephone': ['(01234) 5678910', '(05432) 9876543', '(09876) 5432109'],
'username': ['john_smith', 'brian_jones', 'harry_roberts']})
df
| | name | website | telephone | username |
---|---|---|---|---|
0 | John Paul Smith | https://www.johnsmith.com | (01234) 5678910 | john_smith |
1 | Brian David Jones | https://www.brianjones.com | (05432) 9876543 | brian_jones |
2 | Harry William Roberts | https://www.harryroberts.com | (09876) 5432109 | harry_roberts |
First, we’ll call the split() method using its default arguments. The main argument is called pat and doesn’t need to be written. When pat is omitted, str.split() splits on whitespace, treating any run of consecutive whitespace characters as a single separator.
If we assign its output to a new column, we can see the results are stored in a list. This technique is known as tokenization, and it’s a common preprocessing step when working in Natural Language Processing (NLP).
df['split_name'] = df['name'].str.split()
df[['name', 'split_name']]
| | name | split_name |
---|---|---|
0 | John Paul Smith | [John, Paul, Smith] |
1 | Brian David Jones | [Brian, David, Jones] |
2 | Harry William Roberts | [Harry, William, Roberts] |
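The difference between the default behaviour and an explicit pat=' ' only shows up when the spacing is irregular. Here is a minimal sketch, assuming two made-up names with a doubled space and a leading space (they are not part of the dataframe above):

```python
import pandas as pd

# Hypothetical names with irregular spacing, for illustration only
names = pd.Series(['John  Paul Smith', ' Brian David Jones'])

# Default: runs of whitespace act as one separator,
# and leading/trailing whitespace is ignored
default_split = names.str.split()

# Explicit single space: every space is a separator,
# so empty strings appear where spaces are doubled or leading
space_split = names.str.split(pat=' ')

print(default_split.tolist())
# [['John', 'Paul', 'Smith'], ['Brian', 'David', 'Jones']]
print(space_split.tolist())
# [['John', '', 'Paul', 'Smith'], ['', 'Brian', 'David', 'Jones']]
```

For clean, single-spaced values like the ones in this article, both calls give the same result.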
The pat argument is used to define a custom delimiter or custom separator instead of the default whitespace. For example, the username column contains the first name and last name of each person separated by an underscore, so if we set str.split(pat='_') the string will be split at the underscore and a list of values returned.
df['split_username'] = df['username'].str.split(pat='_')
df[['username', 'split_username']]
| | username | split_username |
---|---|---|
0 | john_smith | [john, smith] |
1 | brian_jones | [brian, jones] |
2 | harry_roberts | [harry, roberts] |
The expand argument is used to return a Pandas dataframe instead of a list. If we call df['username'].str.split(pat='_', expand=True) we get a dataframe with two columns, one for each split value.
df['username'].str.split(pat='_', expand=True)
| | 0 | 1 |
---|---|---|
0 | john | smith |
1 | brian | jones |
2 | harry | roberts |
The really neat thing about expand=True is that it can also be used to add new columns containing split values to the original Pandas dataframe. For example, df[['first_name', 'middle_name', 'last_name']] = df['name'].str.split(pat=' ', expand=True) will add three new columns to the original dataframe containing the first, middle and last names of each person.
df[['first_name', 'middle_name', 'last_name']] = df['name'].str.split(pat=' ', expand=True)
df[['name', 'first_name', 'middle_name', 'last_name']]
| | name | first_name | middle_name | last_name |
---|---|---|---|---|
0 | John Paul Smith | John | Paul | Smith |
1 | Brian David Jones | Brian | David | Jones |
2 | Harry William Roberts | Harry | William | Roberts |
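The three-column assignment works here because every name has exactly three parts. As a rough sketch of what happens otherwise, assuming a made-up Series in which one person has no middle name, expand=True pads the shorter rows with missing values:

```python
import pandas as pd

# Hypothetical names: the second one has no middle name
names = pd.Series(['John Paul Smith', 'Brian Jones'])

parts = names.str.split(pat=' ', expand=True)
print(parts)

# The result is as wide as the longest split; shorter rows are padded
print(parts.shape)             # (2, 3)
print(parts.iloc[1].tolist())  # 'Brian', 'Jones', then a missing value
```

Note that 'Jones' lands in column 1, so labelling the columns first, middle and last would put it under the middle name. It's worth checking the number of parts before assigning column names.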
Let’s say you want to split a string in a Pandas column and extract a specific element, such as the first name from a name. There are several ways to do this. One of them is to apply the get() method to the split_name column of lists we created earlier.
df['first_name'] = df['split_name'].str.get(0)
df[['name', 'split_name', 'first_name']]
| | name | split_name | first_name |
---|---|---|---|
0 | John Paul Smith | [John, Paul, Smith] | John |
1 | Brian David Jones | [Brian, David, Jones] | Brian |
2 | Harry William Roberts | [Harry, William, Roberts] | Harry |
The second method is to use split and access the individual elements using str[]. For example, to split the name column into first and last name, we can do the following:
df['firstname'] = df['name'].str.split(' ').str[0]
df['lastname'] = df['name'].str.split(' ').str[2]
df[['name', 'firstname', 'lastname']]
| | name | firstname | lastname |
---|---|---|---|
0 | John Paul Smith | John | Smith |
1 | Brian David Jones | Brian | Jones |
2 | Harry William Roberts | Harry | Roberts |
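Using str[2] for the last name relies on every name having three parts. A small variation, sketched here on a made-up Series where one name has only two parts, is to use a negative index so the last element is picked whatever the length:

```python
import pandas as pd

# Hypothetical names with differing numbers of parts
names = pd.Series(['John Paul Smith', 'Brian Jones'])

split_names = names.str.split()

# str[0] takes the first element and str[-1] the last,
# regardless of how many parts each name has
first = split_names.str[0]
last = split_names.str[-1]

print(first.tolist())  # ['John', 'Brian']
print(last.tolist())   # ['Smith', 'Jones']
```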
The n argument is used to limit the number of splits. For example, n=1 performs at most one split, counting from the left, so the name column is divided into two columns: the first name, and everything after the first space. With n=2, at most two splits are made, which produces three columns for these names.
df['name'].str.split(pat=' ', n=1, expand=True)
| | 0 | 1 |
---|---|---|
0 | John | Paul Smith |
1 | Brian | David Jones |
2 | Harry | William Roberts |
df['name'].str.split(pat=' ', n=2, expand=True)
| | 0 | 1 | 2 |
---|---|---|---|
0 | John | Paul | Smith |
1 | Brian | David | Jones |
2 | Harry | William | Roberts |
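Because n counts splits from the left, n=1 keeps the middle and last names together. If the goal is to isolate the last name instead, one option is the related str.rsplit() method, which counts splits from the right. A minimal sketch using the same three names:

```python
import pandas as pd

names = pd.Series(['John Paul Smith', 'Brian David Jones', 'Harry William Roberts'])

# split() with n=1 cuts at the first space
from_left = names.str.split(pat=' ', n=1, expand=True)

# rsplit() with n=1 cuts at the last space
from_right = names.str.rsplit(pat=' ', n=1, expand=True)

print(from_left[1].tolist())   # ['Paul Smith', 'David Jones', 'William Roberts']
print(from_right[1].tolist())  # ['Smith', 'Jones', 'Roberts']
```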
Provided you’re using Pandas 1.4.0 or later, you can also use the regex parameter of the split() function. The regex argument tells split() whether the pat value provided is a Python regular expression or not. If regex=True, then pat is treated as a regular expression.
If regex=False, then pat is treated as a literal string. If regex is left at its default of None, a pat longer than one character is treated as a regular expression and a single-character pat as a literal string, which is why you can still use a regular expression with split() in older versions of Pandas. For example, to extract the digits from our telephone numbers, which are in the format (01234) 5678910, we can set pat to r'\D+', a regex that matches runs of non-digit characters, so that only the numeric parts are returned.
df['telephone'].str.split(pat=r'\D+', expand=True)
| | 0 | 1 | 2 |
---|---|---|---|
0 | | 01234 | 5678910 |
1 | | 05432 | 9876543 |
2 | | 09876 | 5432109 |
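Column 0 is empty because each telephone number starts with a non-digit character, the opening parenthesis. By referencing the remaining columns by their labels, i.e. [1] and [2], we can then extract the area code and phone number from the longer telephone number string.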
df['area_code'] = df['telephone'].str.split(pat=r'\D+', expand=True)[1]
df['phone_number'] = df['telephone'].str.split(pat=r'\D+', expand=True)[2]
df[['telephone', 'area_code', 'phone_number']]
| | telephone | area_code | phone_number |
---|---|---|---|
0 | (01234) 5678910 | 01234 | 5678910 |
1 | (05432) 9876543 | 05432 | 9876543 |
2 | (09876) 5432109 | 09876 | 5432109 |
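To see what the regex flag changes, it helps to use a separator that is also a regex metacharacter. The sketch below assumes Pandas 1.4.0 or later and a made-up Series of version strings (not part of the dataframe above), split on a dot:

```python
import pandas as pd

# Hypothetical version strings, for illustration only
versions = pd.Series(['1.4.0', '2.1.3'])

# regex=False: the dot is a literal character
literal = versions.str.split(pat='.', regex=False)

# regex=True: an unescaped dot matches any character,
# so every character is consumed as a separator
as_regex = versions.str.split(pat='.', regex=True)

# regex=True with the dot escaped behaves like the literal split
escaped = versions.str.split(pat=r'\.', regex=True)

print(literal.tolist())   # [['1', '4', '0'], ['2', '1', '3']]
print(as_regex.iloc[0])   # ['', '', '', '', '', '']
print(escaped.tolist())   # [['1', '4', '0'], ['2', '1', '3']]
```

Setting regex explicitly makes the intent clear when the separator contains characters such as ., + or (.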
Finally, we can split the domain name from the website using the same technique as above. Here we’re using the str.split() method to split the website column on the // characters and then taking the second column of the result.
df['domain'] = df['website'].str.split(pat=r'//', expand=True)[1]
df[['website', 'domain']]
| | website | domain |
---|---|---|
0 | https://www.johnsmith.com | www.johnsmith.com |
1 | https://www.brianjones.com | www.brianjones.com |
2 | https://www.harryroberts.com | www.harryroberts.com |
If you want to get rid of the “www.” part you can either use split() again and extract the second element, split the string at the “www.” instead, or just use replace() to find and replace the string if it’s present. A simple .str.replace('www.', '') will work on these values, but I’ve added the named parameters here so you can see what they all do. Setting regex=False makes the dot match only a literal dot.
df['domain'] = df['domain'].str.replace(pat=r'www.', repl='', regex=False)
df[['website', 'domain']]
| | website | domain |
---|---|---|
0 | https://www.johnsmith.com | johnsmith.com |
1 | https://www.brianjones.com | brianjones.com |
2 | https://www.harryroberts.com | harryroberts.com |
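One thing to be aware of is that replace() removes “www.” wherever it occurs in the string, not only at the start. If that matters for your data, a possible alternative is str.removeprefix(), which was added in Pandas 1.4.0. A minimal sketch, assuming a few made-up domains:

```python
import pandas as pd

# Hypothetical domains, including one with "www." in the middle
domains = pd.Series(['www.johnsmith.com', 'news.www.example.com', 'brianjones.com'])

# replace() removes every literal occurrence of "www."
replaced = domains.str.replace('www.', '', regex=False)

# removeprefix() only removes "www." at the start of the string
stripped = domains.str.removeprefix('www.')

print(replaced.tolist())
# ['johnsmith.com', 'news.example.com', 'brianjones.com']
print(stripped.tolist())
# ['johnsmith.com', 'news.www.example.com', 'brianjones.com']
```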
Matt Clarke, Monday, November 28, 2022