How to use Python regular expressions to extract information

Regular expressions, or regexes, are widely used in data science for matching specific patterns in text. Here's a quick guide to getting started with them.

Regular expressions are used for pattern matching in programming, allowing you to identify or extract very specific pieces of text from a string or document. They’re powerful and well worth understanding, but they can also be one of the most baffling things to learn in data science. Few of the developers I’ve worked with have been fluent in them, and most happily rely on a cheat sheet rather than memorising the finer points of the syntax.

“Some people, when confronted with a problem, think, ‘I know, I’ll use regular expressions.’ Now they have two problems.”

Jamie Zawinski, Mozilla contributor

Rather than trying to cover everything, let’s just look at some common uses for regular expressions in data science. You’re not expected to memorise these, so do expect to look up the precise syntax whenever you need a regex for a future project.

The basics of Python regular expressions

Regular expressions, or regexes, are part of Python’s standard library, but to use them you need to import the re module by adding import re at the top of your Python file. The regular expression itself is a specially constructed sequence of characters that tells re what pattern to find, or match, in the text. For example, the regex [0-9]+ matches any run of one or more consecutive digits from 0 to 9.

You can pass a regex to a number of different re functions (and Pandas functions) to achieve a range of specific data wrangling goals, from identifying whether a string contains a particular value to returning a list of the matches found, and much more. The most commonly used functions are:

Function Description
findall() Return a Python list of all the matches
search() Return a Match object for the first match found in the string, or None if there is no match
sub() Replace each match with a given string
split() Return a Python list of the string split at each match point
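
To make the differences between these four functions concrete, here’s a minimal sketch that runs the same [0-9]+ regex through each of them. The sample sentence and variable names are made up purely for illustration.

```python
import re

text = "Order 66 shipped on day 12, order 7 is pending"
pattern = r"[0-9]+"  # one or more consecutive digits

# findall() returns every non-overlapping match as a list of strings
all_numbers = re.findall(pattern, text)

# search() returns a Match object for the first match, or None
first = re.search(pattern, text)
first_number = first.group() if first else None

# sub() replaces every match with the replacement string
masked = re.sub(pattern, "#", text)

# split() cuts the string wherever the pattern matches
parts = re.split(pattern, text)

print(all_numbers)   # ['66', '12', '7']
print(first_number)  # 66
print(masked)        # Order # shipped on day #, order # is pending
print(parts)         # ['Order ', ' shipped on day ', ', order ', ' is pending']
```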

Let’s look at some practical examples and turn them into functions you can re-use in your own projects.

Practical examples of commonly used Python regexes

Extracting hashtags

In the example below we’ve loaded the re module and written a simple function called extract_hashtags() which takes a text string as its argument and returns a Python list containing any hashtags it finds using the findall() function of re.

To make the regex itself a bit easier to see, I’ve assigned the regex #(\w+) to a variable and passed it as an argument. It’s that tiny expression which does the magic. It tells re to look for a # followed by one or more word characters (\w+), which means letters, digits or underscores. In Python 3, \w also matches letters and digits from other alphabets, not just A-Z and 0-9. The parentheses form a capture group, so findall() returns only the text after the #.

import re

def extract_hashtags(text):
    # A raw string (r"...") stops Python treating \w as a string escape
    regex = r"#(\w+)"
    return re.findall(regex, str(text))

string = 'Data science is really interesting #datascience #python'
hashtags = extract_hashtags(string)
hashtags
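
One thing worth knowing about \w is that on ordinary Python 3 strings it is Unicode-aware. The sketch below, using a made-up sentence, compares the default behaviour with the re.ASCII flag, which restricts \w to [a-zA-Z0-9_].

```python
import re

text = "Learning #python and #data_science2021 in #東京"

# Default for str patterns: \w matches Unicode letters and digits too
unicode_tags = re.findall(r"#(\w+)", text)

# With re.ASCII, \w only matches a-z, A-Z, 0-9 and underscore
ascii_tags = re.findall(r"#(\w+)", text, flags=re.ASCII)

print(unicode_tags)  # ['python', 'data_science2021', '東京']
print(ascii_tags)    # ['python', 'data_science2021']
```

Which behaviour you want depends on your data, so treat the flag as an option rather than a recommendation.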

Extracting @ mentions

Next, we’re going to extract @ mentions from Tweets with the @(\w+) regex. This is much like the hashtags example, but looks for strings which start with an @ instead.

import re

def extract_mentions(text):
    # Raw string so the backslash in \w reaches the regex engine untouched
    regex = r"@(\w+)"
    return re.findall(regex, str(text))

string = 'Many English people think @realdonaldtrump is a cockwomble'
mentions = extract_mentions(string)
mentions
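
If you’re running the same regex over lots of Tweets, one option is to compile it once with re.compile() and reuse the compiled pattern. This is only a sketch: the MENTION_RE name and the sample Tweets are invented for the example.

```python
import re

# Compile the pattern once and reuse it for every string
MENTION_RE = re.compile(r"@(\w+)")

tweets = [
    "Thanks @alice and @bob_smith for the review",
    "No mentions here",
    "cc @carol99",
]

# One list of mentions per tweet; an empty list means no matches
mentions_per_tweet = [MENTION_RE.findall(tweet) for tweet in tweets]
print(mentions_per_tweet)  # [['alice', 'bob_smith'], [], ['carol99']]
```

Bear in mind that this simple pattern will also pick up the part after the @ in an email address, so bob@example.com would yield example.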

Extracting numbers

To extract numbers you can put values in square brackets, which define a “set” of characters to match. The [0-9]+ regex we’ve used here looks for runs of one or more consecutive digits, where each digit is between 0 and 9.

import re

def extract_numbers(text):
    regex = "[0-9]+"
    return re.findall(regex, str(text))

string = 'The Proclaimers would walk 500 miles and then walk 500 more.'
numbers = extract_numbers(string)
numbers
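
Note that findall() always returns strings, so you’ll usually want to convert the matches before doing any maths. The sketch below does that, and also shows one possible pattern for signed decimals. The prices string and the decimal regex are my own assumptions rather than a standard recipe.

```python
import re

text = "The Proclaimers would walk 500 miles and then walk 500 more."

# Convert each matched string to an int before using it as a number
as_ints = [int(n) for n in re.findall(r"[0-9]+", text)]
total = sum(as_ints)

# Optional minus sign, digits, then an optional decimal part
prices = "Was 12.50, now 9.99, change -2.51, stock 40"
as_floats = [float(n) for n in re.findall(r"-?[0-9]+(?:\.[0-9]+)?", prices)]

print(as_ints, total)  # [500, 500] 1000
print(as_floats)       # [12.5, 9.99, -2.51, 40.0]
```

The decimal pattern treats any hyphen directly before a digit as a minus sign, so it would need adjusting for text containing ranges like 2020-21.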

Extracting telephone numbers

Telephone numbers are a little more challenging to extract as they can vary in length from country to country and can be written in different ways. You’ll need a regex specific to your local number formatting or one sophisticated enough to catch all the possible formatting variants.

UK numbers are typically written as an area code of four or five digits, followed by a phone number of six or seven digits, e.g. 0161 1234567 or 01611 234567. Our regex to catch these uses \d{4} to find a sequence of four digits, the . matches any single character, such as a space, while \d{7} looks for a sequence of seven consecutive digits. The | is an or operator and lets you combine two regular expressions, so our second one then uses \d{5}.\d{6} to look for five digits, followed by any single character, followed by six digits.

import re

def extract_phone_numbers(text):
    # Four digits, any character, seven digits OR five digits, any character, six digits
    regex = r"\d{4}.\d{7}|\d{5}.\d{6}"
    return re.findall(regex, str(text))

string = 'This is a phone number 0161 1234567, so is 01611 234567, but 1234 is not.'
numbers = extract_phone_numbers(string)
numbers
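
Because the . matches any character, the regex above will also accept things like 0161x1234567. If that matters for your data, one possible tightening is to use \s for the separator and \b word boundaries at each end. This is a sketch under those assumptions, not a complete UK phone number validator.

```python
import re

text = "Call 0161 1234567 or 01611 234567, ref 0161x1234567."

# Original pattern: the dot accepts any separator, including letters
loose = re.findall(r"\d{4}.\d{7}|\d{5}.\d{6}", text)

# Stricter variant: whitespace separator, word boundary on both sides
strict = re.findall(r"\b(?:\d{4}\s\d{7}|\d{5}\s\d{6})\b", text)

print(loose)   # ['0161 1234567', '01611 234567', '0161x1234567']
print(strict)  # ['0161 1234567', '01611 234567']
```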

Extracting email addresses

Like telephone numbers, emails are quite challenging to extract with regexes because formats differ so much. There are loads of different ways to do this, some more reliable than others. The example below uses [a-zA-Z0-9_.+-]+ to look for one or more uppercase or lowercase letters, numbers, underscores, dots, plus signs and hyphens, then uses @ to look for the @ symbol, then uses [a-zA-Z0-9-]+\. to look for another string of letters, numbers or hyphens followed by a literal dot, and finally uses [a-zA-Z0-9-.]+ to look for the suffix.

import re

def extract_emails(text):
    # local part, @, domain name, literal dot, suffix
    regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
    return re.findall(regex, str(text))

string = 'You can email me at bob@example.com'
emails = extract_emails(string)
emails
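
One quirk of this regex is that the suffix set includes a dot, so a full stop at the end of a sentence gets swallowed into the match. The sketch below shows the effect and one simple way to tidy it up afterwards. The EMAIL_RE name, the sample text and the rstrip() clean-up are my own illustrative choices.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

text = "Contact bob@example.com. Or try alice.smith+news@mail.example.co.uk!"

# The first match keeps the sentence's trailing full stop
raw = EMAIL_RE.findall(text)

# Strip any trailing dots or hyphens left over from punctuation
cleaned = [email.rstrip(".-") for email in raw]

print(raw)      # ['bob@example.com.', 'alice.smith+news@mail.example.co.uk']
print(cleaned)  # ['bob@example.com', 'alice.smith+news@mail.example.co.uk']
```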

Writing your own regular expressions in Python

To write your own regexes you just need to break down your problem and piece together the required components. There are three main tools for doing this: metacharacters, special sequences and sets. Metacharacters are special characters that you can put before, after or around other characters to change your match query. For example, . will match any character, ^ means starts with and $ means ends with. Special sequences match a character of a particular type, like the \w we used to find letters, numbers and underscores and the \d we used to find digits. Finally, sets match any one character from a group you specify, like the [0-9] example we used to find numbers.
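
As a rough illustration of piecing components together, the sketch below builds a pattern for dates shaped like 2021-03-01 using the re.VERBOSE flag, which lets you spread a regex over several lines with comments. The DATE_RE and is_date_like() names are hypothetical, and the pattern only checks the shape of the text, not whether the date is real.

```python
import re

DATE_RE = re.compile(
    r"""
    ^        # metacharacter: start of the string
    \d{4}    # special sequence plus quantifier: four digits (year)
    [-/]     # set: either a hyphen or a slash
    \d{2}    # two digits (month)
    [-/]     # the same separator set again
    \d{2}    # two digits (day)
    $        # metacharacter: end of the string
    """,
    re.VERBOSE,
)

def is_date_like(text):
    """Return True if text has the shape YYYY-MM-DD or YYYY/MM/DD."""
    return DATE_RE.search(text) is not None

print(is_date_like("2021-03-01"))  # True
print(is_date_like("2021/03/01"))  # True
print(is_date_like("01-03-2021"))  # False
```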

Special sequences

Special sequence Description
\d Any decimal digit, e.g. 1. Equivalent to [0-9]
\D Any character that is not a decimal digit, e.g. H. Equivalent to [^0-9]
\s Any whitespace character, such as a space, tab or line break. Equivalent to [ \t\n\r\f\v]
\S Any non-whitespace character, such as 1, J or £. Equivalent to [^ \t\n\r\f\v]
\w Any word character (a letter, digit or underscore), such as 1 or L. Equivalent to [a-zA-Z0-9_]
\W Any non-word character, such as ; or !. Equivalent to [^a-zA-Z0-9_]

The equivalents shown apply to ASCII text. On Python 3 strings, \d, \s and \w also match Unicode digits, whitespace and letters unless you pass the re.ASCII flag.
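
To see what each special sequence actually picks out, here’s a small sketch that runs them all over the same made-up string.

```python
import re

sample = "Room 42B, floor 3\t(west wing)"

digits = re.findall(r"\d", sample)        # single digits
non_digits = re.findall(r"\D+", sample)   # runs of anything except digits
whitespace = re.findall(r"\s", sample)    # spaces and the tab
tokens = re.findall(r"\S+", sample)       # runs of non-whitespace
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscores
non_words = re.findall(r"\W+", sample)    # runs of everything else

print(digits)           # ['4', '2', '3']
print(non_digits)       # ['Room ', 'B, floor ', '\t(west wing)']
print(len(whitespace))  # 5
print(tokens)           # ['Room', '42B,', 'floor', '3', '(west', 'wing)']
print(words)            # ['Room', '42B', 'floor', '3', 'west', 'wing']
print(non_words)        # [' ', ', ', ' ', '\t(', ' ', ')']
```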

Metacharacters

Metacharacter Description Example
[] A set of characters, e.g. a-z or a-f [a-z]
\ Signals a special sequence (e.g. \d) or is used to escape a special character \d
^ Starts with ^www\.
$ Ends with \.com$
* Zero or more occurrences of the preceding character behaviou*r
+ One or more occurrences of the preceding character mat+hew
{3} A specific number of occurrences, such as three digits \d{3}
| An or operator pandas|numpy
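
Here’s a short sketch that exercises each of these metacharacters. The sample strings are invented for the example. Note that a dot needs escaping as \. when you want it to match a literal full stop rather than any character.

```python
import re

url = "www.example.com"

# ^ anchors to the start, $ to the end; \. is a literal dot
starts_with_www = bool(re.search(r"^www\.", url))
ends_with_com = bool(re.search(r"\.com$", url))

# An unescaped dot matches any character, so this also succeeds
loose_dot = bool(re.search(r"^www.", "wwwxample.com"))

# * means zero or more of the preceding character
spellings = re.findall(r"behaviou*r", "behavior and behaviour")

# + means one or more of the preceding character
names = re.findall(r"mat+hew", "mathew, matthew, mahew")

# {3} means exactly three of the preceding item
codes = re.findall(r"\d{3}", "Dial 555 or 77")

# | means either the pattern on the left or the one on the right
libs = re.findall(r"pandas|numpy", "import pandas, numpy, scipy")

print(starts_with_www, ends_with_com, loose_dot)  # True True True
print(spellings)  # ['behavior', 'behaviour']
print(names)      # ['mathew', 'matthew']
print(codes)      # ['555']
print(libs)       # ['pandas', 'numpy']
```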

Sets

Set Description
[a-zA-Z] Any uppercase or lowercase letter
[0-9] Any digit from 0 to 9
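
Sets can be combined with quantifiers, and negated by putting a ^ straight after the opening bracket. A minimal sketch, using a made-up reference code:

```python
import re

code = "Ref: AB-1234/x9"

letters = re.findall(r"[a-zA-Z]+", code)       # runs of letters
digits = re.findall(r"[0-9]+", code)           # runs of digits
others = re.findall(r"[^a-zA-Z0-9]", code)     # anything NOT in the set

print(letters)  # ['Ref', 'AB', 'x']
print(digits)   # ['1234', '9']
print(others)   # [':', ' ', '-', '/']
```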

Using split()

The split() function is a bit like explode() in PHP in that it splits up a string into chunks using a given match pattern. For example, to split up a sentence into words you might use the \s special sequence to use whitespace characters as your breaking point. Similarly, you can just pass it a comma (,) to break up a comma-separated list.

import re

# Split on each whitespace character
string = "An owl in a paper bag troubles no man"
words = re.split(r"\s", string)
print(words)
['An', 'owl', 'in', 'a', 'paper', 'bag', 'troubles', 'no', 'man']

import re

# Split on each comma
string = "Lots,of,values,separated,by,commas"
words = re.split(",", string)
print(words)
['Lots', 'of', 'values', 'separated', 'by', 'commas']
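
Real text is rarely this tidy. The sketch below, with made-up strings, shows what happens when there are repeated spaces, and how a quantifier or a set in the split pattern can help. The maxsplit argument limits how many splits are made.

```python
import re

messy = "An  owl in\ta paper   bag"

# Splitting on a single whitespace character leaves empty strings behind
naive = re.split(r"\s", messy)

# Splitting on one or more whitespace characters avoids them
tidy = re.split(r"\s+", messy)

# A set lets you split on several delimiters, plus any spaces after them
mixed = re.split(r"[,;]\s*", "pandas, numpy;scipy,  matplotlib")

# maxsplit=1 splits only at the first match
first_only = re.split(",", "a,b,c,d", maxsplit=1)

print(naive)       # ['An', '', 'owl', 'in', 'a', 'paper', '', '', 'bag']
print(tidy)        # ['An', 'owl', 'in', 'a', 'paper', 'bag']
print(mixed)       # ['pandas', 'numpy', 'scipy', 'matplotlib']
print(first_only)  # ['a', 'b,c,d']
```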

Using sub()

The sub() function is used to find and replace characters. In the example below, we’ll find any whitespace (denoted by the \s special sequence) and replace it with an underscore. You can also chain other methods, like lower(), onto the result to give you basic slugifying functionality.

import re

# Replace each whitespace character with an underscore, then lowercase
string = "Regular expressions are often baffling"
slug = re.sub(r"\s", "_", string).lower()
print(slug)
regular_expressions_are_often_baffling
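
For messier input you may want to strip punctuation and collapse repeated spaces too. Here’s one way that could look. The slugify() function is a hypothetical helper written for this example, not part of re.

```python
import re

def slugify(text):
    """Hypothetical helper: lowercase, drop punctuation, join words with underscores."""
    text = text.strip().lower()
    # Remove anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse each run of whitespace into a single underscore
    return re.sub(r"\s+", "_", text)

slug = slugify("  Regular expressions:  often baffling! ")
print(slug)  # regular_expressions_often_baffling
```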

Using search()

The search() function takes a search string or regex and, if it’s found, returns a Match object for the first match, including its position within the string. If nothing is found it returns None. It’s a bit like strpos() in PHP.

import re
string = "https://www.practicaldatascience.co.uk"
match = re.search(r'www', string)
match
<re.Match object; span=(8, 11), match='www'>
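
The Match object has methods for pulling out the matched text and its position, and since search() returns None when nothing matches, it’s worth checking before you use it. A minimal sketch follows. The host-extracting pattern at the end is my own illustrative addition.

```python
import re

url = "https://www.practicaldatascience.co.uk"

match = re.search(r"www", url)
if match:
    print(match.group())               # www
    print(match.start(), match.end())  # 8 11
    print(match.span())                # (8, 11)

# No match returns None rather than raising an error
missing = re.search(r"ftp", url)
print(missing)  # None

# A capture group lets you pull out part of the match
domain = re.search(r"https?://([^/]+)", url)
host = domain.group(1) if domain else None
print(host)  # www.practicaldatascience.co.uk
```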

This just scratches the surface of what you can do with regular expressions and the functions within the re module. You don’t need to know everything. Once you understand the basics, building the right regex for your specific problem is a case of assembling the right components bit by bit and then testing that it works.

Matt Clarke, Monday, March 01, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
