When working with customer data or upon time series data science projects, you’ll often find the need to calculate the difference between two dates in your MySQL queries. The MySQL DATEDIFF()
function makes this process very simple and can be used for both dates stored in columns, or dates you’ve created yourself.
In this simple SQL tutorial I’ll show how you can use DATEDIFF()
to calculate the difference between two dates and work out how long it’s been since a customer’s first and last orders using some real ecommerce data.
If you want to run these queries yourself in Python, you’ll first need access to a MySQL server and a database to query. I run my MySQL server in a Docker container and have created a test database based on the widely used Online Retail Dataset.
I’ve created a couple of step-by-step guides that you can follow to create a MySQL Docker container and import a MySQL database into the Docker container, so you can follow along on the very same data. It only takes a few minutes.
Next, you’ll need to install the SQLAlchemy package that is used to query MySQL and other databases via Pandas, and the PyMySQL driver for connecting to a MySQL server from Python. You can install these using the Pip package manager, then import the required modules into a Jupyter notebook or Python script.
!pip3 install sqlalchemy
!pip3 install pymysql
import pandas as pd
from sqlalchemy import create_engine
query = """
SELECT * FROM customers
"""
df = pd.read_sql(query, con=engine)
df.head()
customer_id | revenue | orders | skus | units | first_order_date | last_order_date | |
---|---|---|---|---|---|---|---|
0 | 14646.0 | 279489.02 | 77 | 703 | 196719 | 2010-12-20 10:09:00 | 2011-12-08 12:12:00 |
1 | 18102.0 | 256438.49 | 62 | 151 | 64122 | 2010-12-07 16:42:00 | 2011-12-09 11:50:00 |
2 | 17450.0 | 187482.17 | 55 | 127 | 69029 | 2010-12-07 09:23:00 | 2011-12-01 13:29:00 |
3 | 14911.0 | 132572.62 | 248 | 1794 | 77180 | 2010-12-01 14:05:00 | 2011-12-08 15:54:00 |
4 | 12415.0 | 123725.45 | 26 | 444 | 77242 | 2011-01-06 11:12:00 | 2011-11-15 14:22:00 |
The MySQL DATEDIFF()
function takes two dates and returns the number of days between them. The first date is the later and the second is the earlier date (flipping them the other way around will return a negative value). So, for example, DATEDIFF(last_order_date, first_order_date)
will return the number of days between a customer’s last order and their first order.
You can use DATEDIFF()
in two main ways. You can either use two column values, as shown in the example above, or you can pass in the current date or another date you’ve created that is not currently stored in a MySQL column value.
For example, suppose we want to calculate how many days it’s been since each customer placed their first order. We can do this by passing in the NOW()
function as the first argument, which returns the current MySQL datetime, and then pass first_order_date
as the second argument like this: DATEDIFF(NOW(), first_order_date)
. You can do the same with last_order_date
to find the number of days that have elapsed since a customer last placed an order.
query = """
SELECT
customer_id,
first_order_date,
last_order_date,
DATEDIFF(last_order_date, first_order_date) AS days_between_orders,
DATEDIFF(NOW(), first_order_date) AS days_since_first_order,
DATEDIFF(NOW(), last_order_date) AS days_since_last_order
FROM customers
"""
df = pd.read_sql(query, con=engine)
df.head()
customer_id | first_order_date | last_order_date | days_between_orders | days_since_first_order | days_since_last_order | |
---|---|---|---|---|---|---|
0 | 14646.0 | 2010-12-20 10:09:00 | 2011-12-08 12:12:00 | 353 | 4145 | 3792 |
1 | 18102.0 | 2010-12-07 16:42:00 | 2011-12-09 11:50:00 | 367 | 4158 | 3791 |
2 | 17450.0 | 2010-12-07 09:23:00 | 2011-12-01 13:29:00 | 359 | 4158 | 3799 |
3 | 14911.0 | 2010-12-01 14:05:00 | 2011-12-08 15:54:00 | 372 | 4164 | 3792 |
4 | 12415.0 | 2011-01-06 11:12:00 | 2011-11-15 14:22:00 | 313 | 4128 | 3815 |
Matt Clarke, Friday, April 22, 2022