In SQL, when you want to SELECT
data that lies between two values, there are a number of different SQL operators you can use to return the correct data. However, the BETWEEN
operator is arguably the most logical one to read, which can make queries easier for data scientists to logically parse when reviewing code at a later date.
In this simple tutorial I’ll show you how you can use the BETWEEN
opeator with AND
to create an expression that returns values and dates that fall between a minimum and a maximum value. Once you know how to use it, you’ll likely find yourself using this SQL expression regularly, especially when selecting data that falls within a given date range.
For this example we’ll be accessing a MySQL Docker container running the Online Retail Dataset database, but you can use any database to which you have access. It only takes a few minutes to spin up a MySQL Docker container and import an SQL database, so if you want to follow along, I’d recommend trying this approach.
You can write your SQL SELECT
statements directly into the terminal or into a MySQL client, such as Beekeeper Studio, if you wish. However, to make it easier for you to follow I’m writing the code in Pandas. To do this you’ll first need to install the sqlalchemy
and pmysql
Python packages used to query a SQL database using Pandas. Then you’ll need to import pandas
and the create_engine
module from sqlalchemy
.
!pip3 install sqlalchemy
!pip3 install pymysql
import pandas as pd
from sqlalchemy import create_engine
Next, we need to create a connection to our SQL server. I’m connecting to the ecommerce
database on my MySQL Docker container which is located at IP address 172.17.0.3
on port 3306
. I’m connecting using the username root
and my password, which I’m hidden with asterisks.
engine = create_engine('mysql+pymysql://root:********@172.17.0.3:3306/ecommerce')
To test your connection and take a look at the data, let’s first write a simple SQL statement to SELECT
all of the columns and rows from the order_items
table. Since the Pandas read_sql_query()
passes the SQL statement to the SQL server and returns a Pandas dataframe, we can run the statement and use head()
to print the first five rows. As you can see, the order_date
column contains the date each order was placed.
df = pd.read_sql_query("SELECT * FROM order_items", con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
We can take a quick look at the data in the order_date
column by using the Pandas min()
and max()
functions to determine the first and last order dates in the dataset.
df['order_date'].min()
Timestamp('2010-12-01 08:26:00')
df['order_date'].max()
Timestamp('2011-12-09 12:50:00')
Next, we’ll select the data for a specific date. Since the data in order_date
are expressed as datetime
, if you attempt to select only those rows where the order_date
is equal to a date, i.e. 2010-12-01
, you’ll get no data back.
query = """
SELECT * FROM order_items
WHERE order_date = '2010-12-01'
"""
df = pd.read_sql(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price |
---|
Instead, you’ll likely want to use LIKE
and query part of the datetime
value. The query below will return any order_date
values that start with our chosen date. If you apply the min()
and max()
functions to the values in the dataframe you can sanity check that your query has returned valid results.
query = """
SELECT * FROM order_items
WHERE order_date LIKE '2010-12-01 %'
"""
df = pd.read_sql(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
df['order_date'].min()
Timestamp('2010-12-01 08:26:00')
df['order_date'].max()
Timestamp('2010-12-01 17:35:00')
To select all data that falls between two dates, or two values, you can use a BETWEEN
… AND
expression. Unlike the queries above, BETWEEN
will work with datetime
values, so there’s no need for a LIKE
here. Instead, we simply define the column we want to query, i.e. order_date
and then use BETWEEN
and AND
to separate the minimum value and the maximum value.
query = """
SELECT * FROM order_items
WHERE order_date BETWEEN '2010-12-01' AND '2010-12-03'
"""
df = pd.read_sql(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
The important caveat with BETWEEN
is that it is not functionally equivalent to the >= AND <=
operator expression. For example, if you print the min()
and max()
values returned by the SQL statement above, you’ll notice that it returns values greater than or equal to the mininum, i.e. >=
, but that are less than i.e. <
the maximum value.
This means our WHERE order_date BETWEEN '2010-12-01' AND '2010-12-03'
expression will return orders from 2010-12-01
to 2010-12-02
, not the 2020-12-03
shown in the query. This is a common mistake when writing SQL statements that select date ranges or other values and can result in inaccurate results.
df['order_date'].min()
Timestamp('2010-12-01 08:26:00')
df['order_date'].max()
Timestamp('2010-12-02 19:59:00')
Here’s an alternative way to select the same data shown above using regular SQL operators. Whenever you’re selecting data between ranges, I’d highly recommend that you sanity check the results using min()
and max()
to double-check that the query is returning the data you expect.
query = """
SELECT * FROM order_items
WHERE order_date >= '2010-12-01'
AND order_date < '2010-12-03'
"""
df = pd.read_sql(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
df['order_date'].min()
Timestamp('2010-12-01 08:26:00')
df['order_date'].max()
Timestamp('2010-12-02 19:59:00')
Matt Clarke, Wednesday, April 13, 2022