Transactional item data can be used to create a number of other useful datasets to help you analyse ecommerce products and customers. From the core list of items purchased you can create additional datasets containing order data, customer data, and product data, which can be used for analysis, reporting, or modeling.
Pandas makes it really straightforward to perform this process, thanks to its helpful `groupby()` and `agg()` functions, which allow you to group by a given column and then calculate aggregate statistics on the selected data. As well as standard metrics, such as `sum`, `count`, `min`, `max`, `nunique`, and `std`, you can also pass in NumPy functions or create your own custom metrics. Here’s how it’s done.
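To illustrate that last point, here’s a minimal sketch (using a tiny made-up dataframe, not the Online Retail data) showing how a built-in string metric, a NumPy function, and a custom lambda can all be mixed in a single `agg()` call:

```python
import pandas as pd
import numpy as np

# Hypothetical mini dataset for illustration only
df = pd.DataFrame({
    'invoice': ['A', 'A', 'B'],
    'line_price': [10.0, 20.0, 5.0],
})

summary = df.groupby('invoice').agg(
    revenue=('line_price', 'sum'),           # built-in string metric
    mean_price=('line_price', np.mean),      # NumPy function
    price_range=('line_price', lambda s: s.max() - s.min()),  # custom metric
)
print(summary)
```

Each keyword argument becomes a column in the output, named after the keyword, so you get tidy column names without a separate rename step.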
The dataset I’m using here is the Online Retail dataset from the UCI Machine Learning Repository. It’s typical of the transactional item data you’ll deal with every day if you work in ecommerce: it contains one line per SKU, with details of the order ID, the customer ID, the order date, and the price paid. To tidy it up, I’ve renamed some of the columns and skipped the first row of headers.
import pandas as pd
df = pd.read_csv('online-retail.csv',
names=['invoice', 'sku', 'description', 'qty', 'order_date',
'unit_price', 'customer_id', 'country'],
low_memory=False,
skiprows=1)
df.head()
| | invoice | sku | description | qty | order_date | unit_price | customer_id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
This dataset doesn’t include a line price, so to save hassle in later steps, we’ll first calculate one by multiplying the `qty` field by the `unit_price` field. The `order_date` field is also an object instead of a datetime, so we’ll convert it to the correct type using the Pandas `to_datetime()` function.
df['line_price'] = df['qty'] * df['unit_price']
df['order_date'] = pd.to_datetime(df['order_date'])
df.head()
| | invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
From our single dataframe of transactional items, we can now use grouping and aggregation to create other dataframes containing data on orders, customers, and products. To create the orders dataframe, we first `groupby()` the `invoice` column and then use the `agg()` function to calculate new metrics for each of the columns of interest.
To calculate the `revenue` on each order we use the `revenue=('line_price', 'sum')` argument, which calculates the `sum` of the `line_price` column. We’ll calculate the number of unique SKUs in the order with `nunique`, and the total number of items by calculating the `sum` of the `qty` field.
df_orders = df.groupby('invoice').agg(
revenue=('line_price', 'sum'),
skus=('sku', 'nunique'),
items=('qty', 'sum'),
)
df_orders.head()
| invoice | revenue | skus | items |
|---|---|---|---|
| 536365 | 139.12 | 7 | 40 |
| 536366 | 22.20 | 2 | 12 |
| 536367 | 278.73 | 12 | 83 |
| 536368 | 70.05 | 4 | 15 |
| 536369 | 17.85 | 1 | 3 |
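Once orders are aggregated like this, summary statistics such as average order value fall out directly. Here’s a small sketch using invented figures in a frame shaped like `df_orders` above:

```python
import pandas as pd

# Hypothetical orders frame, shaped like df_orders above (invented numbers)
df_orders = pd.DataFrame({
    'revenue': [139.12, 22.20, 278.73],
    'skus': [7, 2, 12],
    'items': [40, 12, 83],
}, index=pd.Index(['536365', '536366', '536367'], name='invoice'))

aov = df_orders['revenue'].mean()      # average order value
avg_items = df_orders['items'].mean()  # average units per order
print(round(aov, 2), round(avg_items, 2))
```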
This should be pretty self-explanatory, as we’re just repeating the above approach to create a customers dataframe. This groups on `customer_id` and then uses `agg()` to calculate the total revenue, the number of unique orders, the number of unique SKUs, the total number of units, and the first and last dates of the customer’s orders.
df_customers = df.groupby('customer_id').agg(
revenue=('line_price', 'sum'),
orders=('invoice', 'nunique'),
skus=('sku', 'nunique'),
units=('qty', 'sum'),
first_order_date=('order_date', 'min'),
last_order_date=('order_date', 'max')
).reset_index().sort_values(by='revenue', ascending=False)
df_customers.head()
| | customer_id | revenue | orders | skus | units | first_order_date | last_order_date |
|---|---|---|---|---|---|---|---|
| 1703 | 14646.0 | 279489.02 | 77 | 703 | 196719 | 2010-12-20 10:09:00 | 2011-12-08 12:12:00 |
| 4233 | 18102.0 | 256438.49 | 62 | 151 | 64122 | 2010-12-07 16:42:00 | 2011-12-09 11:50:00 |
| 3758 | 17450.0 | 187482.17 | 55 | 127 | 69029 | 2010-12-07 09:23:00 | 2011-12-01 13:29:00 |
| 1895 | 14911.0 | 132572.62 | 248 | 1794 | 77180 | 2010-12-01 14:05:00 | 2011-12-08 15:54:00 |
| 55 | 12415.0 | 123725.45 | 26 | 444 | 77242 | 2011-01-06 11:12:00 | 2011-11-15 14:22:00 |
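The first and last order dates also make it easy to derive a per-customer tenure metric, which is handy for retention analysis. A minimal sketch, using invented dates in a frame shaped like `df_customers` above:

```python
import pandas as pd

# Hypothetical customers frame, shaped like df_customers above (invented data)
df_customers = pd.DataFrame({
    'customer_id': [14646.0, 12345.0],
    'first_order_date': pd.to_datetime(['2010-12-20', '2011-03-01']),
    'last_order_date': pd.to_datetime(['2011-12-08', '2011-03-01']),
})

# Tenure in whole days between each customer's first and last order
df_customers['tenure_days'] = (
    df_customers['last_order_date'] - df_customers['first_order_date']
).dt.days
print(df_customers[['customer_id', 'tenure_days']])
```

One-time purchasers come out with a tenure of zero, which makes them easy to segment out.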
Finally, to create a products dataframe, we can `groupby()` the `sku` column and use the `agg()` function to calculate the total number of units sold, the average number of units per line, the total revenue, the number of unique invoices, and the number of unique customers who’ve purchased the product.
df_products = df.groupby('sku').agg(
units=('qty', 'sum'),
avg_units=('qty', 'mean'),
revenue=('line_price', 'sum'),
orders=('invoice', 'nunique'),
customers=('customer_id', 'nunique'),
)
df_products.head()
| sku | units | avg_units | revenue | orders | customers |
|---|---|---|---|---|---|
| 10002 | 1037 | 14.205479 | 759.89 | 73 | 40 |
| 10080 | 495 | 20.625000 | 119.09 | 24 | 19 |
| 10120 | 193 | 6.433333 | 40.53 | 29 | 25 |
| 10123C | -13 | -3.250000 | 3.25 | 4 | 3 |
| 10123G | -38 | -38.000000 | 0.00 | 1 | 0 |
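From here it’s a short step to ranking products by revenue or deriving an average selling price per unit. A hedged sketch, again using invented figures in a frame shaped like `df_products` above:

```python
import pandas as pd

# Hypothetical products frame, shaped like df_products above (invented numbers)
df_products = pd.DataFrame({
    'units': [1037, 495, 193],
    'revenue': [759.89, 119.09, 40.53],
}, index=pd.Index(['10002', '10080', '10120'], name='sku'))

# Average selling price per unit, then rank products by revenue
df_products['asp'] = df_products['revenue'] / df_products['units']
top = df_products.sort_values(by='revenue', ascending=False)
print(top.head())
```

Note that in the real dataset some SKUs have negative unit totals, so any per-unit ratio like this is worth sanity-checking before you rely on it.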
Matt Clarke, Saturday, March 06, 2021