MySQL includes a large number of string functions and operators that you can use in SQL statements to both query data and reformat column values. In this simple example, I’ll show you the main types of MySQL string functions and operators that you can use in SQL queries.
You can, of course, do this on any MySQL database to which you have access, and you can write your SQL statements directly into the MySQL terminal or into a SQL client, such as Beekeeper Studio. However, I’m doing what most data scientists will be doing, and passing my queries to MySQL using Pandas via a Jupyter notebook.
I run my MySQL server in a Docker container and have created a test database based on the widely used Online Retail Dataset. I’ve created a couple of step-by-step guides that you can follow to create your own MySQL Docker container and import a MySQL database so you can follow along on the very same data. It only takes a few minutes.
Once you’ve got your MySQL server setup, open up a Jupyter notebook and install the SQLAlchemy and PyMySQL Python packages. We’ll be using these to connect to MySQL and pass in database queries using Pandas. Once you’ve installed the packages using the Pip package manager, import pandas
and the create_engine
module from sqlalchemy
.
!pip3 install sqlalchemy
!pip3 install pymysql
import pandas as pd
from sqlalchemy import create_engine
Next, we need to configure the connection to our MySQL server by creating a connection string to pass to the create_engine()
function. This will vary according to your own setup. I’m connecting to a MySQL server with the PyMySQL driver, using the username root and the password *****
on port 3306 of the IP address 172.17.0.3. The final argument tells create_engine()
to connect to the ecommerce
database.
engine = create_engine('mysql+pymysql://root:*****@172.17.0.3:3306/ecommerce')
To test the connection to the MySQL server we’ll first write a simple MySQL SELECT
statement to return everything from the order_items
table.
df = pd.read_sql_query("SELECT * FROM order_items", con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
Now we can move onto looking at the MySQL string functions and operators. The first one we’ll try is called LOWER()
, which is a synonym of the LCASE()
function. As the name suggests, it converts a string to lowercase. We’ll use AS
to define the column name, otherwise MySQL will name the column LOWER(description)
.
query = """
SELECT
LOWER(description) AS description
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
description | |
---|---|
0 | white hanging heart t-light holder |
1 | white metal lantern |
2 | cream cupid hearts coat hanger |
3 | knitted union flag hot water bottle |
4 | red woolly hottie white heart. |
As you might imagine, there’s also a corresponding string function in MySQL called UPPER()
which is a synonym of UCASE()
that we can use to convert a string to uppercase characters.
query = """
SELECT
UPPER(country) AS country
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
country | |
---|---|
0 | UNITED KINGDOM |
1 | UNITED KINGDOM |
2 | UNITED KINGDOM |
3 | UNITED KINGDOM |
4 | UNITED KINGDOM |
Unfortunately, there is no MySQL function to convert a string to title case, where each word starts with a capital letter (rather like the ucwords()
function of PHP). However, you can write custom functions to perform this. Converting a sentence to start with an uppercase letter followed by lowercase letters is slightly easier. In the example below, we’re using a mixture of CONCAT()
, LEFT()
, and SUBSTRING()
to do this for us.
query = """
SELECT
CONCAT(UPPER(LEFT(description, 1)),
LOWER(SUBSTRING(description, 2))
) AS description
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
description | |
---|---|
0 | White hanging heart t-light holder |
1 | White metal lantern |
2 | Cream cupid hearts coat hanger |
3 | Knitted union flag hot water bottle |
4 | Red woolly hottie white heart. |
To count the number of characters in a string stored in a MySQL column you can use the CHAR_LENGTH()
function. This will count the characters in the word and return the value in the new column.
query = """
SELECT
description,
CHAR_LENGTH(description) AS characters
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
description | characters | |
---|---|---|
0 | WHITE HANGING HEART T-LIGHT HOLDER | 34.0 |
1 | WHITE METAL LANTERN | 19.0 |
2 | CREAM CUPID HEARTS COAT HANGER | 30.0 |
3 | KNITTED UNION FLAG HOT WATER BOTTLE | 35.0 |
4 | RED WOOLLY HOTTIE WHITE HEART. | 30.0 |
If you want to find the length of a string in bytes, you can use the MySQL LENGTH()
function.
query = """
SELECT
description,
LENGTH(description) AS bytes
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
description | bytes | |
---|---|---|
0 | WHITE HANGING HEART T-LIGHT HOLDER | 34.0 |
1 | WHITE METAL LANTERN | 19.0 |
2 | CREAM CUPID HEARTS COAT HANGER | 30.0 |
3 | KNITTED UNION FLAG HOT WATER BOTTLE | 35.0 |
4 | RED WOOLLY HOTTIE WHITE HEART. | 30.0 |
Another really common thing you’ll want to do in MySQL queries is join words together via a process known as concatenation. The CONCAT()
function and the related CONCAT_WS()
function, both allow you to concatenate one or more values together. These could be different column values that you want to combine in a single column, such as the firstname and lastname of a person. You can also include a separator with the values, so they don’t form a continuous string, if you wish.
query = """
SELECT
sku,
description,
CONCAT(sku, ' ', description) AS sku_description
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
sku | description | sku_description | |
---|---|---|---|
0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 85123A WHITE HANGING HEART T-LIGHT HOLDER |
1 | 71053 | WHITE METAL LANTERN | 71053 WHITE METAL LANTERN |
2 | 84406B | CREAM CUPID HEARTS COAT HANGER | 84406B CREAM CUPID HEARTS COAT HANGER |
3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 84029G KNITTED UNION FLAG HOT WATER BOTTLE |
4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 84029E RED WOOLLY HOTTIE WHITE HEART. |
The REPLACE()
function is used to replace one value with another, using a simple find and replace rule. For example, let’s say you want to replace occurrences of WHITE
with White
. Note that the REPLACE()
function is case sensitive.
query = """
SELECT
sku,
description,
REPLACE(description, 'WHITE', 'White') AS description2
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
sku | description | description2 | |
---|---|---|---|
0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | White HANGING HEART T-LIGHT HOLDER |
1 | 71053 | WHITE METAL LANTERN | White METAL LANTERN |
2 | 84406B | CREAM CUPID HEARTS COAT HANGER | CREAM CUPID HEARTS COAT HANGER |
3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | KNITTED UNION FLAG HOT WATER BOTTLE |
4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | RED WOOLLY HOTTIE White HEART. |
Since the default REPLACE()
function is case sensitive, it’s often better to use the alternative REGEX_REPLACE()
function instead, since this can catch all variations and it’s case insensitive. It has some additional features, but for a simple case insenstive find and replace you can use it just like REPLACE()
. Here, we’re looking for the string white
(which doesn’t occur, as our data is set to WHITE
) and we’re replacing it with XXXXX
.
query = """
SELECT
sku,
description,
REGEXP_REPLACE(description, 'white', 'XXXXX') AS description2
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
sku | description | description2 | |
---|---|---|---|
0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | XXXXX HANGING HEART T-LIGHT HOLDER |
1 | 71053 | WHITE METAL LANTERN | XXXXX METAL LANTERN |
2 | 84406B | CREAM CUPID HEARTS COAT HANGER | CREAM CUPID HEARTS COAT HANGER |
3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | KNITTED UNION FLAG HOT WATER BOTTLE |
4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | RED WOOLLY HOTTIE XXXXX HEART. |
Sometimes, values stored in databases aren’t properly sanitised and may contain trailing or leading whitespace. MySQL has a couple of different string functions for handling this kind of data. The RTRIM()
function removes any trailing whitespace from the right hand side of a string if it is present.
query = """
SELECT
sku,
RTRIM(description) AS description
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
sku | description | |
---|---|---|
0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER |
1 | 71053 | WHITE METAL LANTERN |
2 | 84406B | CREAM CUPID HEARTS COAT HANGER |
3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE |
4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. |
The LTRIM()
function is used to remove leading whitespace from the left hand side of a string if it exists.
query = """
SELECT
sku,
LTRIM(description) AS description
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
sku | description | |
---|---|---|
0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER |
1 | 71053 | WHITE METAL LANTERN |
2 | 84406B | CREAM CUPID HEARTS COAT HANGER |
3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE |
4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. |
The REVERSE()
function isn’t one I use very often, but can reverse the orders of characters in a string.
query = """
SELECT
sku,
REVERSE(description) AS description
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
sku | description | |
---|---|---|
0 | 85123A | REDLOH THGIL-T TRAEH GNIGNAH ETIHW |
1 | 71053 | NRETNAL LATEM ETIHW |
2 | 84406B | REGNAH TAOC STRAEH DIPUC MAERC |
3 | 84029G | ELTTOB RETAW TOH GALF NOINU DETTINK |
4 | 84029E | .TRAEH ETIHW EITTOH YLLOOW DER |
The SOUNDEX()
string function returns the Soundex score or value for a string. Soundex is an algorithm that returns a hash value showing how the string sounds - strings that sound the same have identical Soundex values.
query = """
SELECT
sku,
description,
SOUNDEX(description) AS soundex_value
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
sku | description | soundex_value | |
---|---|---|---|
0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | W3525263423436 |
1 | 71053 | WHITE METAL LANTERN | W35345365 |
2 | 84406B | CREAM CUPID HEARTS COAT HANGER | C652136323526 |
3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | K53514236134 |
4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | R34363 |
Matt Clarke, Thursday, April 14, 2022