How to use GROUP BY and HAVING in SQL statements

The GROUP BY and HAVING clauses are used to group the results of a SELECT statement and are extremely useful when performing exploratory data analysis. Here's how to use them.

How to use GROUP BY and HAVING in SQL statements
Picture by Steve Johnson, Pexels.
13 minutes to read

The SQL GROUP BY clause groups row-based data into aggregated data, reducing the number of rows in the dataset, and is commonly used to perform aggregate calculations.

The GROUP BY clause is used as an optional part of an SQL SELECT statement and will group the data on one or more columns. In ecommerce, GROUP BY is one of the most useful SQL clauses for performing Exploratory Data Analysis or EDA, and is great for performing calculations and analyses on aggregate data.

In this simple tutorial I’ll show you how you can use GROUP BY to analyse some ecommerce sales data from the Online Retail Dataset so you can investigate product sales and customer behaviour and filter the data in a variety of ways.

Start your MySQL server

Before you can start, you’ll need to have access to a MySQL database that you can query. To make this easier to follow along, I’ve created a MySQL database dump file of the popular Online Retail Dataset that you can import into a Docker container.

To get started, you’ll first need to create a MySQL Docker container and then import the SQL database dump to create a new database. Although it sounds scary, the whole process should only take a few minutes and I’ve provided the code you need for every step.

Install and load the packages

Next, open a Jupyter notebook and install the SQLAlchemy and PyMYSQL Python packages. We’ll be using PyMYSQL to connect to the MySQL Docker container and then using SQLAlchemy to pass SQL queries to MySQL via Pandas. Once installed, import Pandas and the create_engine module from sqlalchemy.

!pip3 install sqlalchemy
!pip3 install pymysql
import pandas as pd
from sqlalchemy import create_engine

Create a database connection

Now we need to configure our connection to the MySQL Docker container using the create_engine() method. You’ll need to adjust the username, password, IP address, and database name from the values shown here to match your own set up.

engine = create_engine('mysql+pymysql://root:********@172.17.0.3:3306/ecommerce')

SELECT FROM an SQL table

To test your connection to MySQL and look at the data in its unaggregated form, we’ll run a simple SQL SELECT statement to return all columns and rows from the order_items table.

df = pd.read_sql("SELECT * FROM order_items", con=engine)
df.head()
invoice sku description qty order_date unit_price customer_id country line_price
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Use GROUP BY to aggregate data from a column

To get started using the GROUP BY clause, we’ll write an SQL SELECT statement that returns the sku, description, and unit_price for each sku. We first select the columns we want, then after we’ve declared the table, we can use GROUP BY sku to aggregate the data on the sku column.

To get meaningful results from a GROUP BY clause, you need to be selective about the columns you return - it won’t work properly if the columns contain multiple values, unless you perform some kind of calculation upon the underlying column values.

query = """
SELECT 
    sku,
    description, 
    unit_price
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
sku description unit_price
0 10002 INFLATABLE POLITICAL GLOBE 0.85
1 10080 GROOVY CACTUS INFLATABLE 0.85
2 10120 DOGGY RUBBER 0.21
3 10123C HEARTS WRAPPING TAPE 0.65
4 10123G None 0.00

Selecting grouped data using the HAVING clause

The HAVING clause is often used along with GROUP BY to return only those values that match certain criteria. For example, let’s say we want to show only the products that have a unit_price that is greater than 50. We can do this using HAVING. You can also combine multiple arguments and operators to create more complex query filters.

query = """
SELECT 
    sku,
    description, 
    unit_price
FROM order_items
GROUP BY sku
HAVING unit_price > 50
"""
df = pd.read_sql(query, con=engine)
df.head()
sku description unit_price
0 21769 VINTAGE POST OFFICE CABINET 79.95
1 22016 Dotcomgiftshop Gift Voucher £100.00 83.33
2 22503 CABIN BAG VINTAGE PAISLEY 59.53
3 22655 VINTAGE RED KITCHEN CABINET 295.00
4 22656 VINTAGE BLUE KITCHEN CABINET 125.00

COUNT column values with GROUP BY

As I mentioned above, GROUP BY will only return valid results when you are selective about the columns you return. When a column contains multiple values, it’s very common to use GROUP BY with SQL’s mathematical operators, such as COUNT(), MIN(), MAX(), SUM(), and numerous others.

As a basic example, let’s count the number of values in the invoice column for each sku in our resultset. We can do this by adding the COUNT(invoice) line in the SELECT statement.

query = """
SELECT 
    sku,
    description, 
    COUNT(invoice)
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
sku description COUNT(invoice)
0 10002 INFLATABLE POLITICAL GLOBE 73
1 10080 GROOVY CACTUS INFLATABLE 24
2 10120 DOGGY RUBBER 30
3 10123C HEARTS WRAPPING TAPE 4
4 10123G None 1

As you’ll have noticed above, just using COUNT(invoice) will return that clause as the column name in the SQL resultset returned, which isn’t very tidy. You might want to change it to COUNT(invoice) AS orders to return the calculated data in a column called orders instead.

query = """
SELECT 
    sku,
    description, 
    COUNT(invoice) AS orders
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
sku description orders
0 10002 INFLATABLE POLITICAL GLOBE 73
1 10080 GROOVY CACTUS INFLATABLE 24
2 10120 DOGGY RUBBER 30
3 10123C HEARTS WRAPPING TAPE 4
4 10123G None 1

SUM column values with GROUP BY

The other mathematical operators work in just the same way as COUNT(). For example, we’ll now calculate the amount of revenue each product generated by calculating the SUM() of the line_price column.

query = """
SELECT 
    sku,
    description, 
    SUM(line_price) AS revenue
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
sku description revenue
0 10002 INFLATABLE POLITICAL GLOBE 759.89
1 10080 GROOVY CACTUS INFLATABLE 119.09
2 10120 DOGGY RUBBER 40.53
3 10123C HEARTS WRAPPING TAPE 3.25
4 10123G None 0.00

Multiply column values with GROUP BY

You can also calculate values yourself by passing in other mathematical operators. For example, if we didn’t have the line_price column, we could calculate it by multiplying the qty value by the unit_price with SUM(qty * unit_price) AS revenue.

query = """
SELECT 
    sku,
    description, 
    SUM(qty * unit_price) AS revenue
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
sku description revenue
0 10002 INFLATABLE POLITICAL GLOBE 759.89
1 10080 GROOVY CACTUS INFLATABLE 119.09
2 10120 DOGGY RUBBER 40.53
3 10123C HEARTS WRAPPING TAPE 3.25
4 10123G None -0.00

Calculate an average with GROUP BY

Similarly, you can calculate means and averages by performing divisions and multiplications at the same time. To calculate the average amount each customer spent on a given sku we can use (SUM(qty * unit_price) / COUNT(invoice)) AS avg_revenue.

query = """
SELECT 
    sku,
    description, 
    COUNT(invoice) AS orders,
    SUM(qty * unit_price) AS revenue, 
    (SUM(qty * unit_price) / COUNT(invoice)) AS avg_revenue
FROM order_items
GROUP BY sku
"""
df = pd.read_sql(query, con=engine)
df.head()
sku description orders revenue avg_revenue
0 10002 INFLATABLE POLITICAL GLOBE 73 759.89 10.409452
1 10080 GROOVY CACTUS INFLATABLE 24 119.09 4.962083
2 10120 DOGGY RUBBER 30 40.53 1.351000
3 10123C HEARTS WRAPPING TAPE 4 3.25 0.812500
4 10123G None 1 -0.00 -0.000000

There are loads of other SQL mathematical operators you can use to perform calculations on data you’ve aggregated using GROUP BY. They all work in a similar manner and allow you to perform complex calculations directly in the database, which can be efficient and time-saving.

Matt Clarke, Thursday, April 14, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.