The SQL CASE
statement is used for flow control, much like an if, then, else statement. If the statement finds a match on the chosen condition it will return the defined value, otherwise it will continue through the other options until it finds one that matches, or it will fail gracefully and trigger the final ELSE
clause.
The CASE
statement is one of the most useful and powerful features in MySQL, and other SQL dialects. Not only does it allow you to create new column labels or values based on data, but it also allows you to perform calculations based on the presence of values in one or more columns. It’s pretty easy to pick up and is something you’ll soon be using every day. Here’s how it works.
CASE
statement worksIn the simple example below, we’re querying the sku
, and unit_price
for some products, and we’re using the CASE
statement to assign a label to a new column in our result set, based on the value of the unit_price
.
SELECT
sku,
unit_price,
CASE
WHEN unit_price < 5 THEN 'Low'
WHEN unit_price >= 5 AND unit_price <= 10 THEN 'Medium'
ELSE 'High'
END AS price_range
FROM order_items
We place the CASE
statement in the list of columns we want to select, then we create a series of conditions that will be checked in order. First, we’ll check to see if the unit_price
is less than 5, and if it is we use THEN
to return the value Low
. If the price is greater than 5, execution will continue to the next step.
In the next step, we check if the unit_price
is greater than or equal to five, or less than or equal to 10. If it is, we’ll return the value Medium
. If the condition fails, execution will continue. Since we know the price must be greater than 10, we can use ELSE
to exit the flow control section and return the label High
. Finally, we use END AS
to assign the value to a column called price_range
.
It’s pretty easy really, but let’s look at some other examples, using an actual database.
First, you’ll need to create a MySQL database. I run my MySQL server in a Docker container and have created a test database based on the widely used Online Retail Dataset. I’ve created a couple of step-by-step guides that you can follow to create your own MySQL Docker container and import a MySQL database, so you can follow along on the very same data. It only takes a few minutes.
Once you’ve got your MySQL server setup, open up a Jupyter notebook and install the SQLAlchemy and PyMySQL Python packages. We’ll be using these to connect to MySQL and pass in database queries using Pandas. Once you’ve installed the packages using the Pip package manager, import pandas
and the create_engine
module from sqlalchemy
.
!pip3 install sqlalchemy
!pip3 install pymysql
import pandas as pd
from sqlalchemy import create_engine
Next, we need to configure the connection to our MySQL server by creating a connection string to pass to the create_engine()
function. This will vary according to your own setup. I’m connecting to a MySQL server with the PyMySQL driver, using the username root and the password *****
on port 3306 of the IP address 172.17.0.3. The final argument tells create_engine()
to connect to the ecommerce
database.
engine = create_engine('mysql+pymysql://root:*****@172.17.0.2:3306/ecommerce')
To test the connection to the MySQL server we’ll first write a simple MySQL SELECT
statement to return everything from the order_items
table.
df = pd.read_sql_query("SELECT * FROM order_items", con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
The simplest way to use the CASE
statement is to construct a flow control section that works rather like a switch
statement or if elseif else
statement in order to form a new column value. For example, let’s say we wanted to perform some simple binning on our unit_price
data and assign a label of Low
, Medium
, or High
, depending on the unit_price
value.
To do this, we’d create a CASE
statement within our SQL SELECT
with WHEN
rules. If the data match the WHEN
rule, we’d then apply a label using THEN
. If the data doesn’t match the WHEN
rule, execution continues up to the ELSE
at the end of the CASE
statement. We then use END AS
to stop the CASE
statement and assign the output to our new column, called price_range
in this simple example.
query = """
SELECT
sku,
description,
unit_price,
CASE
WHEN unit_price < 5 THEN 'Low'
WHEN unit_price >= 5 AND unit_price <= 10 THEN 'Medium'
ELSE 'High'
END AS price_range
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.sample(5)
sku | description | unit_price | price_range | |
---|---|---|---|---|
240242 | 22759 | SET OF 3 NOTEBOOKS IN PARCEL | 1.65 | Low |
203110 | 21033 | JUMBO BAG CHARLIE AND LOLA TOYS | 4.13 | Low |
78742 | 22482 | BLUE TEA TOWEL CLASSIC DESIGN | 1.25 | Low |
418823 | 21818 | GLITTER HEART DECORATION | 0.39 | Low |
152474 | 22570 | FELTCRAFT CUSHION RABBIT | 3.75 | Low |
CASE
You can also use CASE
to one-hot encode data and record a binary value of 1 or 0 depending on whether the data in a given column or columns matches your rules. This can be quite a useful technique if you want to perform calculations on categorical variables stored within your database. Granted, it’s easier to do in Pandas, but sometimes you’ll need to have the data in SQL.
query = """
SELECT
invoice,
description,
CASE
WHEN description LIKE '%WATER BOTTLE%' THEN 1
ELSE 0
END AS contains_water_bottle
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
invoice | description | contains_water_bottle | |
---|---|---|---|
0 | 536365 | WHITE HANGING HEART T-LIGHT HOLDER | 0 |
1 | 536365 | WHITE METAL LANTERN | 0 |
2 | 536365 | CREAM CUPID HEARTS COAT HANGER | 0 |
3 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE | 1 |
4 | 536365 | RED WOOLLY HOTTIE WHITE HEART. | 0 |
You can also use the CASE
statement to perform calculations on aggregated data. For example, let’s take the simple example above where we one-hot encoded order lines to include a 1 if they contained the phrase “WATER BOTTLE”. If we wrap the calculation in a SUM()
we can aggregate the data and return the total number of order lines that matched the criteria.
query = """
SELECT
COUNT(invoice) AS order_lines,
COUNT(DISTINCT(invoice)) AS orders,
SUM(CASE WHEN description LIKE '%WATER BOTTLE%' THEN 1 ELSE 0 END) AS water_bottles
FROM order_items
"""
df = pd.read_sql_query(query, con=engine)
df.head()
order_lines | orders | water_bottles | |
---|---|---|---|
0 | 541909 | 25900 | 9505.0 |
You can create much more complex CASE
statements incorporating a whole range of different criteria, based on one or more columns, and either assign labels or values to fields, or perform calculations on the data. This makes CASE
one of the most useful and powerful flow control features in SQL.
Matt Clarke, Wednesday, April 20, 2022