Regular expressions are used for pattern matching in programming, allowing you to identify or extract very specific pieces of text from a string or document. They’re very powerful and extremely useful to understand, but they’re also rather confusing and can be one of the most baffling things to learn in data science. Few of the developers I’ve worked with have ever been fluent in them, with most happily resorting to using a cheat sheet instead of memorising their complex nuances.
“Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems.”
Jamie Zawinski, Mozilla contributor
Rather than blowing your mind by trying to cover everything, let’s just look at some common uses for regular expressions in data science. You’re not expected to memorise these, so expect to look up the precise way to write the ones you need for any future projects.
Regular expressions, or regexes, are part of Python itself, but to use them you need to import the `re` module by typing `import re` at the top of your Python file. The regular expression itself consists of a specially constructed sequence of characters that tells `re` what pattern to find, or match, in the text. For example, the regex `[0-9]+` will match any run of one or more consecutive digits from 0 to 9.
You can pass a regex to a number of different `re` functions (and Pandas functions) to achieve a range of specific data wrangling goals, from identifying whether a string contains a particular value to returning a list of the matches found. The most commonly used functions are:
Function | Description |
---|---|
findall() | Return a Python list of matches |
search() | Return a Match object for the first match found in the string, or None if there is no match |
sub() | Find every match and replace it with a given string |
split() | Return a Python list split at each match point |
Let’s look at some practical examples and turn them into functions you can re-use in your own projects.
In the example below we’ve imported the `re` module and written a simple function called `extract_hashtags()`, which takes a text string as its argument and returns a Python list containing any hashtags it finds using the `findall()` function of `re`.
To make the regex itself a bit easier to see, I’ve assigned the regex `#(\w+)` to a variable and passed it as an argument. It’s that tiny expression which does the magic. It tells `re` to look for a # followed by one or more word characters (`\w+`), which can be letters, digits or underscores. Because `\w+` sits inside parentheses, which form a capturing group, `findall()` returns only the text after the #.
import re
def extract_hashtags(text):
    # Raw string so the backslash in \w reaches the regex engine unchanged
    regex = r"#(\w+)"
    return re.findall(regex, str(text))
string = 'Data science is really interesting #datascience #python'
hashtags = extract_hashtags(string)
hashtags
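To see what the parentheses in the hashtag regex change, here is a minimal sketch that runs the same pattern with and without the capturing group. The variable names `without_hash` and `with_hash` are only illustrative.

```python
import re

text = "Data science is really interesting #datascience #python"

# With a capturing group, findall() returns only the text inside the parentheses.
without_hash = re.findall(r"#(\w+)", text)

# Without a group, findall() returns each whole match, including the '#'.
with_hash = re.findall(r"#\w+", text)

print(without_hash)  # ['datascience', 'python']
print(with_hash)     # ['#datascience', '#python']
```

Which form you want depends on whether the # is useful downstream, for example when the tags are displayed rather than counted.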
Next, we’re going to extract @ mentions from Tweets with the `@(\w+)` regex. This is much like the hashtags example, but looks for strings which start with an @ instead.
import re
def extract_mentions(text):
    # Capture the word characters that follow each '@'
    regex = r"@(\w+)"
    return re.findall(regex, str(text))
string = 'Many English people think @realdonaldtrump is a cockwomble'
mentions = extract_mentions(string)
mentions
To extract numbers you can use square brackets, which define a “set” of characters to match. The `[0-9]+` regex we’ve used here looks for runs of one or more consecutive digits, where each digit is between 0 and 9.
import re
def extract_numbers(text):
    # Match each run of one or more digits
    regex = r"[0-9]+"
    return re.findall(regex, str(text))
string = 'The Proclaimers would walk 500 miles and then walk 500 more.'
numbers = extract_numbers(string)
numbers
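Note that `findall()` returns strings, not numbers, so the matches need converting before you can do arithmetic with them. The sketch below assumes whole numbers only; `extract_integers()` is a hypothetical helper, not part of `re`.

```python
import re

def extract_integers(text):
    """Return every run of digits in text as an int (hypothetical helper)."""
    return [int(match) for match in re.findall(r"[0-9]+", str(text))]

distances = extract_integers(
    "The Proclaimers would walk 500 miles and then walk 500 more."
)
total = sum(distances)

print(distances)  # [500, 500]
print(total)      # 1000
```

A decimal such as 3.14 would come back as two separate integers, 3 and 14, so this pattern would need extending if your text contains decimals.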
Telephone numbers are a little more challenging to extract as they vary in length from country to country and can be written in different ways. You’ll need a regex specific to your local number formatting or one sophisticated enough to catch all the formatting variations you expect.
UK numbers are often written as an area code of four or five digits, followed by a phone number of seven or six digits, e.g. 0161 1234567 or 01611 234567. Our regex to catch these uses `\d{4}` to find a sequence of four digits, then `.` to match any single character, such as a space, while `\d{7}` looks for a sequence of seven consecutive digits. The `|` is an or operator and lets you combine two regular expressions, so our second one then uses `\d{5}.\d{6}` to look for five digits, followed by any single character, followed by six digits.
import re
def extract_phone_numbers(text):
    # Four digits, any character, seven digits OR five digits, any character, six digits
    regex = r"\d{4}.\d{7}|\d{5}.\d{6}"
    return re.findall(regex, str(text))
string = 'This is a phone number 0161 1234567, so is 01611 234567, but 1234 is not.'
numbers = extract_phone_numbers(string)
numbers
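Because `.` matches any character, the regex above will also accept separators other than a space. If you only want space-separated numbers, one possible tightening (a sketch, not the only option) is to swap `.` for `\s` and add `\b` word boundaries so the digits cannot be part of a longer run. The function name `extract_phone_numbers_strict()` is hypothetical.

```python
import re

def extract_phone_numbers_strict(text):
    """Match 4+7 or 5+6 digit groups separated by a single whitespace character."""
    regex = r"\b\d{4}\s\d{7}\b|\b\d{5}\s\d{6}\b"
    return re.findall(regex, str(text))

sample = "Call 0161 1234567 or 01611 234567, not 0161-1234567 or 1234."

# The original pattern, kept here only for comparison.
loose = re.findall(r"\d{4}.\d{7}|\d{5}.\d{6}", sample)
strict = extract_phone_numbers_strict(sample)

print(loose)   # ['0161 1234567', '01611 234567', '0161-1234567']
print(strict)  # ['0161 1234567', '01611 234567']
```

Whether the looser or stricter version is right depends on how the numbers are written in your own data.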
Like telephone numbers, email addresses are quite challenging to extract with regexes because formats differ so much. There are loads of different ways to do this, some more reliable than others. The example below uses `[a-zA-Z0-9_.+-]+` to look for one or more uppercase or lowercase letters, digits, underscores, dots, plus signs or hyphens, then `@` to look for the @ symbol, then `[a-zA-Z0-9-]+\.` to look for another string of letters, digits or hyphens followed by a dot, and finally `[a-zA-Z0-9-.]+` to look for the suffix.
import re
def extract_emails(text):
    # Local part, '@', domain name, dot, then the suffix
    regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
    return re.findall(regex, str(text))
string = 'You can email me at bob@example.com'
emails = extract_emails(string)
emails
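One quirk of this email regex is that the suffix set includes a dot, so a full stop at the end of a sentence gets swallowed into the match. A minimal sketch of one way to handle that is to strip trailing dots and hyphens afterwards; `extract_emails_clean()` is a hypothetical helper built on the same pattern.

```python
import re

EMAIL_REGEX = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

def extract_emails_clean(text):
    """Find emails, then trim trailing dots or hyphens picked up from punctuation."""
    return [match.rstrip(".-") for match in re.findall(EMAIL_REGEX, str(text))]

text = "Contact bob@example.com. Or try alice.smith+news@mail.example.co.uk, please."

raw = re.findall(EMAIL_REGEX, text)
clean = extract_emails_clean(text)

print(raw)    # ['bob@example.com.', 'alice.smith+news@mail.example.co.uk']
print(clean)  # ['bob@example.com', 'alice.smith+news@mail.example.co.uk']
```

This is still only an approximation of what counts as a valid address, so treat the results as candidates rather than verified emails.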
To write your own regexes you just need to break down your problem and piece together the required components. There are three main tools for doing this: metacharacters, special sequences and sets. Metacharacters are special characters that you can put before, after or around other characters to change what is matched. For example, `.` will match any character except a line break, `^` means starts with and `$` means ends with. Special sequences match a character of a particular type, like the `\w` we used to find letters, numbers and underscores and the `\d` we used to find digits. Finally, sets match any one character from a specified group, like the `[0-9]` example we used to find numbers.
Special sequence | Description |
---|---|
\d | Any decimal digit, e.g. 1. Equivalent to [0-9] for ASCII text |
\D | Any character that is not a decimal digit, e.g. H. Equivalent to [^0-9] for ASCII text |
\s | Any whitespace character, such as a space, tab or line break. Equivalent to [ \t\n\r\f\v] for ASCII text |
\S | Any non-whitespace character, such as 1, J or £. Equivalent to [^ \t\n\r\f\v] for ASCII text |
\w | Any word character (a letter, digit or underscore), such as 1, L or _. Equivalent to [a-zA-Z0-9_] for ASCII text |
\W | Any non-word character, such as ; or !. Equivalent to [^a-zA-Z0-9_] for ASCII text |
Metacharacter | Description | Example |
---|---|---|
[] | A set of characters, any one of which can match, e.g. a-z or a-f | [a-z] |
\ | Signals a special sequence (e.g. \d) or is used to escape a special character | \d |
^ | Starts with | ^www\. |
$ | Ends with | \.com$ |
* | Zero or more occurrences of the preceding character | behaviou*r |
+ | One or more occurrences of the preceding character | mat+hew |
{3} | A specific number of occurrences, such as three digits | \d{3} |
\| | An or operator | pandas\|numpy |
Set | Description |
---|---|
[a-zA-Z] | Any upper or lowercase letters |
[0-9] | Any digits between 0 and 9 |
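As a sketch of how these components fit together, suppose you need to check a made-up order reference format of two uppercase letters, a hyphen and four digits. The format, the `ORDER_REF` pattern and the `is_order_ref()` helper are all hypothetical, chosen only to combine a set, a special sequence and a few metacharacters in one regex.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string,
# [A-Z]{2} is a set repeated twice, and \d{4} is a special sequence repeated four times.
ORDER_REF = r"^[A-Z]{2}-\d{4}$"

def is_order_ref(text):
    """Return True if text looks like the hypothetical order reference format."""
    return re.search(ORDER_REF, str(text)) is not None

print(is_order_ref("AB-1234"))   # True
print(is_order_ref("ab-1234"))   # False, lowercase letters are not in the set
print(is_order_ref("AB-12345"))  # False, too many digits before the end anchor
```

Building the pattern one component at a time, and testing after each addition, makes it much easier to see which part is rejecting a given string.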
The `split()` function is a bit like `explode()` in PHP in that it splits up a string into chunks using a given match pattern. For example, to split up a sentence into words you might use the `\s` special sequence to use whitespace characters as your breaking point. Similarly, you can just pass it a `,` to break up a comma-separated list.
import re
string = "An owl in a paper bag troubles no man"
words = re.split(r"\s", string)
print(words)
['An', 'owl', 'in', 'a', 'paper', 'bag', 'troubles', 'no', 'man']
import re
string = "Lots,of,values,separated,by,commas"
words = re.split(",", string)
print(words)
['Lots', 'of', 'values', 'separated', 'by', 'commas']
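Real text is often messier than these examples, with repeated spaces or stray whitespace around commas. Splitting on a single `\s` then leaves empty strings in the result. The sketch below, using made-up input strings, shows how adding the `+` and `*` quantifiers to the split pattern copes with that.

```python
import re

messy = "An  owl in\ta paper   bag"

# Splitting on single whitespace characters leaves empty strings between repeats.
single = re.split(r"\s", messy)

# Splitting on runs of whitespace treats each run as one separator.
runs = re.split(r"\s+", messy)

# Optional whitespace either side of the comma is absorbed into the separator.
values = re.split(r"\s*,\s*", "Lots, of,values ,separated")

print(single)  # ['An', '', 'owl', 'in', 'a', 'paper', '', '', 'bag']
print(runs)    # ['An', 'owl', 'in', 'a', 'paper', 'bag']
print(values)  # ['Lots', 'of', 'values', 'separated']
```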
The `sub()` function is used to find and replace characters. In the example below, we’ll find any whitespace (denoted by the `\s` special sequence) and replace it with an underscore. You can also chain string methods, like `lower()`, onto the result to give you basic slugifying functionality.
import re
string = "Regular expressions are often baffling"
slug = re.sub(r"\s", "_", string).lower()
print(slug)
regular_expressions_are_often_baffling
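For titles that contain punctuation or repeated spaces, a slightly fuller version might chain two `sub()` calls. This is a minimal sketch; `slugify()` is a hypothetical helper, and the choice to delete punctuation rather than replace it is an assumption.

```python
import re

def slugify(text):
    """Lowercase text, drop punctuation and join words with underscores."""
    text = str(text).lower().strip()
    # Remove anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse each run of whitespace into a single underscore.
    return re.sub(r"\s+", "_", text)

slug = slugify("  Regular expressions: often   baffling!  ")
print(slug)  # regular_expressions_often_baffling
```

Because `\w` is Unicode-aware in Python 3, accented letters are kept rather than stripped, which may or may not suit your URLs.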
The `search()` function takes a search string or regex and returns a match object for the first match, containing its position within the string, or `None` if nothing is found. It’s a bit like `strpos()` in PHP.
import re
string = "https://www.practicaldatascience.co.uk"
match = re.search(r'www', string)
match
<re.Match object; span=(8, 11), match='www'>
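The match object is more useful once you pull values out of it. The sketch below uses the standard `span()`, `start()` and `group()` methods, and checks for `None` first, since calling a method on a failed search would raise an error. The variable names are illustrative.

```python
import re

url = "https://www.practicaldatascience.co.uk"

match = re.search(r"www", url)
if match:
    position = match.span()   # (start, end) indexes of the match
    found = match.group()     # the matched text itself
    print(position, match.start(), found)  # (8, 11) 8 www

# search() returns None when the pattern is not found anywhere in the string.
missing = re.search(r"ftp", url)
print(missing)  # None
```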
This just scratches the surface of what you can do with regular expressions and the functions within the `re` module. You don’t need to know everything. Once you understand the basics, building the right regex for your specific problem is a case of assembling the right components bit by bit and then testing that it works.
Matt Clarke, Monday, March 01, 2021