The SQL `GROUP BY` clause groups row-level data into aggregated data, reducing the number of rows in the dataset, and is commonly used to perform aggregate calculations.
The `GROUP BY` clause is used as an optional part of an SQL `SELECT` statement and groups the data on one or more columns. In ecommerce, `GROUP BY` is one of the most useful SQL clauses for performing Exploratory Data Analysis (EDA), and is great for performing calculations and analyses on aggregate data.
In this simple tutorial I'll show you how to use `GROUP BY` to analyse some ecommerce sales data from the Online Retail Dataset, so you can investigate product sales and customer behaviour and filter the data in a variety of ways.
Before you can start, you'll need access to a MySQL database that you can query. To make it easier to follow along, I've created a MySQL database dump file of the popular Online Retail Dataset that you can import into a Docker container.
To get started, you’ll first need to create a MySQL Docker container and then import the SQL database dump to create a new database. Although it sounds scary, the whole process should only take a few minutes and I’ve provided the code you need for every step.
Next, open a Jupyter notebook and install the SQLAlchemy and PyMySQL Python packages. We'll be using PyMySQL to connect to the MySQL Docker container, and SQLAlchemy to pass SQL queries to MySQL via Pandas. Once installed, import Pandas and the `create_engine` function from `sqlalchemy`.
!pip3 install sqlalchemy
!pip3 install pymysql
import pandas as pd
from sqlalchemy import create_engine
Now we need to configure our connection to the MySQL Docker container using the `create_engine()` function. You'll need to adjust the username, password, IP address, and database name shown here to match your own setup.
engine = create_engine('mysql+pymysql://root:********@172.17.0.3:3306/ecommerce')
To test your connection to MySQL and look at the data in its unaggregated form, we'll run a simple SQL `SELECT` statement to return all columns and rows from the `order_items` table.
df = pd.read_sql("SELECT * FROM order_items", con=engine)
df.head()
| | invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
To get started using the `GROUP BY` clause, we'll write an SQL `SELECT` statement that returns the `sku`, `description`, and `unit_price` for each `sku`. We first select the columns we want, then, after we've declared the table, we can use `GROUP BY sku` to aggregate the data on the `sku` column.

To get meaningful results from a `GROUP BY` clause, you need to be selective about the columns you return. It won't work properly if a column contains multiple values per group, unless you perform some kind of aggregate calculation on the underlying column values. Note that selecting a non-aggregated column like `description` here relies on MySQL's `ONLY_FULL_GROUP_BY` SQL mode being disabled; when it's enabled (the default from MySQL 5.7), this query raises an error, and when disabled MySQL simply returns a value from an arbitrary row in each group.
query = """
SELECT
sku,
description,
unit_price
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
| | sku | description | unit_price |
|---|---|---|---|
| 0 | 10002 | INFLATABLE POLITICAL GLOBE | 0.85 |
| 1 | 10080 | GROOVY CACTUS INFLATABLE | 0.85 |
| 2 | 10120 | DOGGY RUBBER | 0.21 |
| 3 | 10123C | HEARTS WRAPPING TAPE | 0.65 |
| 4 | 10123G | None | 0.00 |
The `HAVING` clause is often used along with `GROUP BY` to return only those groups that match certain criteria. For example, let's say we want to show only the products with a `unit_price` greater than 50. We can do this using `HAVING`. You can also combine multiple conditions and operators to create more complex query filters.
query = """
SELECT
sku,
description,
unit_price
FROM order_items
GROUP BY sku
HAVING unit_price > 50
"""
df = pd.read_sql(query, con=engine)
df.head()
| | sku | description | unit_price |
|---|---|---|---|
| 0 | 21769 | VINTAGE POST OFFICE CABINET | 79.95 |
| 1 | 22016 | Dotcomgiftshop Gift Voucher £100.00 | 83.33 |
| 2 | 22503 | CABIN BAG VINTAGE PAISLEY | 59.53 |
| 3 | 22655 | VINTAGE RED KITCHEN CABINET | 295.00 |
| 4 | 22656 | VINTAGE BLUE KITCHEN CABINET | 125.00 |
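If you want to experiment with `HAVING` without a MySQL container to hand, the same pattern can be sketched using Python's built-in `sqlite3` module. The table and rows below are illustrative stand-ins rather than the real dataset, and the query shows how two conditions can be combined with `AND` to build a more complex filter:

```python
import sqlite3

# Build a tiny in-memory table with illustrative rows (not the real dataset)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (sku TEXT, description TEXT, unit_price REAL)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [
        ("22503", "CABIN BAG VINTAGE PAISLEY", 59.53),
        ("22655", "VINTAGE RED KITCHEN CABINET", 295.00),
        ("21769", "VINTAGE POST OFFICE CABINET", 79.95),
        ("10002", "INFLATABLE POLITICAL GLOBE", 0.85),
    ],
)

# Combine two conditions in HAVING with AND to keep only mid-priced products
query = """
SELECT
    sku,
    description,
    unit_price
FROM order_items
GROUP BY sku
HAVING unit_price > 50 AND unit_price < 100
"""
rows = conn.execute(query).fetchall()
for sku, description, unit_price in rows:
    print(sku, description, unit_price)
```

SQLite, like MySQL with `ONLY_FULL_GROUP_BY` disabled, lets you select non-aggregated columns alongside `GROUP BY`, so the query shape carries over directly.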
As I mentioned above, `GROUP BY` will only return valid results when you are selective about the columns you return. When a column contains multiple values per group, it's very common to use `GROUP BY` with SQL's aggregate functions, such as `COUNT()`, `MIN()`, `MAX()`, `SUM()`, and numerous others.
As a basic example, let's count the number of values in the `invoice` column for each `sku` in our resultset. We can do this by adding `COUNT(invoice)` to the `SELECT` statement.
query = """
SELECT
sku,
description,
COUNT(invoice)
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
| | sku | description | COUNT(invoice) |
|---|---|---|---|
| 0 | 10002 | INFLATABLE POLITICAL GLOBE | 73 |
| 1 | 10080 | GROOVY CACTUS INFLATABLE | 24 |
| 2 | 10120 | DOGGY RUBBER | 30 |
| 3 | 10123C | HEARTS WRAPPING TAPE | 4 |
| 4 | 10123G | None | 1 |
As you'll have noticed above, just using `COUNT(invoice)` returns that expression as the column name in the resultset, which isn't very tidy. You can alias it with `COUNT(invoice) AS orders` to return the calculated data in a column called `orders` instead.
query = """
SELECT
sku,
description,
COUNT(invoice) AS orders
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
| | sku | description | orders |
|---|---|---|---|
| 0 | 10002 | INFLATABLE POLITICAL GLOBE | 73 |
| 1 | 10080 | GROOVY CACTUS INFLATABLE | 24 |
| 2 | 10120 | DOGGY RUBBER | 30 |
| 3 | 10123C | HEARTS WRAPPING TAPE | 4 |
| 4 | 10123G | None | 1 |
The other aggregate functions work in just the same way as `COUNT()`. For example, we'll now calculate the revenue each product generated by taking the `SUM()` of the `line_price` column.
query = """
SELECT
sku,
description,
SUM(line_price) AS revenue
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
| | sku | description | revenue |
|---|---|---|---|
| 0 | 10002 | INFLATABLE POLITICAL GLOBE | 759.89 |
| 1 | 10080 | GROOVY CACTUS INFLATABLE | 119.09 |
| 2 | 10120 | DOGGY RUBBER | 40.53 |
| 3 | 10123C | HEARTS WRAPPING TAPE | 3.25 |
| 4 | 10123G | None | 0.00 |
You can also calculate values yourself by combining arithmetic operators with the aggregate functions. For example, if we didn't have the `line_price` column, we could calculate it by multiplying the `qty` value by the `unit_price` with `SUM(qty * unit_price) AS revenue`.
query = """
SELECT
sku,
description,
SUM(qty * unit_price) AS revenue
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
| | sku | description | revenue |
|---|---|---|---|
| 0 | 10002 | INFLATABLE POLITICAL GLOBE | 759.89 |
| 1 | 10080 | GROOVY CACTUS INFLATABLE | 119.09 |
| 2 | 10120 | DOGGY RUBBER | 40.53 |
| 3 | 10123C | HEARTS WRAPPING TAPE | 3.25 |
| 4 | 10123G | None | -0.00 |
Similarly, you can calculate averages by performing divisions and multiplications at the same time. To calculate the average amount each customer spent on a given `sku`, we can use `(SUM(qty * unit_price) / COUNT(invoice)) AS avg_revenue`.
query = """
SELECT
sku,
description,
COUNT(invoice) AS orders,
SUM(qty * unit_price) AS revenue,
(SUM(qty * unit_price) / COUNT(invoice)) AS avg_revenue
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
| | sku | description | orders | revenue | avg_revenue |
|---|---|---|---|---|---|
| 0 | 10002 | INFLATABLE POLITICAL GLOBE | 73 | 759.89 | 10.409452 |
| 1 | 10080 | GROOVY CACTUS INFLATABLE | 24 | 119.09 | 4.962083 |
| 2 | 10120 | DOGGY RUBBER | 30 | 40.53 | 1.351000 |
| 3 | 10123C | HEARTS WRAPPING TAPE | 4 | 3.25 | 0.812500 |
| 4 | 10123G | None | 1 | -0.00 | -0.000000 |
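Incidentally, SQL also provides a built-in `AVG()` aggregate function, and when there are no NULLs in the columns involved, `AVG(qty * unit_price)` gives the same per-`sku` result as the manual `SUM()`-over-`COUNT()` division. Here's a minimal sketch using Python's built-in `sqlite3` module with a few made-up rows, so it runs without a database server:

```python
import sqlite3

# Tiny in-memory table with illustrative rows (not real Online Retail data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_items (invoice TEXT, sku TEXT, qty INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?, ?)",
    [
        ("536365", "10002", 6, 0.85),
        ("536370", "10002", 4, 0.85),
        ("536365", "10080", 2, 0.85),
    ],
)

# With no NULLs, the manual SUM/COUNT division and built-in AVG() agree
query = """
SELECT
    sku,
    SUM(qty * unit_price) / COUNT(invoice) AS avg_revenue,
    AVG(qty * unit_price) AS avg_revenue_builtin
FROM order_items
GROUP BY sku
"""
for sku, manual, builtin in conn.execute(query):
    print(sku, round(manual, 2), round(builtin, 2))
```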
There are loads of other SQL aggregate functions you can use to perform calculations on data you've aggregated using `GROUP BY`. They all work in a similar manner and allow you to perform complex calculations directly in the database, which can be efficient and time-saving.
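For instance, the `MIN()` and `MAX()` functions mentioned earlier return the lowest and highest value in each group, which is handy for finding the cheapest and most expensive price a product has sold at. A quick sketch with Python's built-in `sqlite3` module and illustrative prices (not real data):

```python
import sqlite3

# In-memory table where each sku appears at more than one price
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (sku TEXT, unit_price REAL)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [("10002", 0.85), ("10002", 1.25), ("10080", 0.85), ("10080", 0.55)],
)

# MIN() and MAX() slot in alongside GROUP BY just like COUNT() and SUM()
query = """
SELECT
    sku,
    MIN(unit_price) AS lowest_price,
    MAX(unit_price) AS highest_price
FROM order_items
GROUP BY sku
"""
for sku, lowest, highest in conn.execute(query):
    print(sku, lowest, highest)
```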
Matt Clarke, Thursday, April 14, 2022