Customer segmentation can give you huge insights into your business and identify a whole range of different things about your customers, allowing you to change your marketing and improve results.
The downside is that customer segmentation is complicated and takes time and effort to get right. To simplify the process, I created a Python package called EcommerceTools which lets you segment customers using several standardised techniques, and use the data to predict their future behaviour. It’s dead easy to use.
In this project, I’ll show you how you can take a standard dataset of transaction items and segment your customers using RFM, ABC, heterogeneity, latency, and acquisition and purchase cohort, and even predict their CLV, AOV, and how many orders they’ll place in the next few months!
First, open up a Jupyter notebook and install my EcommerceTools Python package if you do not already have it. You can do this by entering !pip3 install --upgrade ecommercetools
into a notebook cell and then executing it. You’ll then need to import pandas
and the utilities
, transactions
, and customers
modules from the ecommercetools
package.
!pip3 install --upgrade ecommercetools
import pandas as pd
from ecommercetools import utilities
from ecommercetools import transactions
from ecommercetools import customers
Next load up an ecommerce transaction items dataset. I’ve used the widely used sample Online Retail dataset from the UCI Machine Learning Repository, which you can download directly from GitHub. If you load up the data, you can view the column names more easily.
filename = 'https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/all/online-retail-dataset.csv'
df = pd.read_csv(filename)
df.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
EcommerceTools requires that you standardise the column names in your transaction items dataframe. There’s a helper function in the utilities
module called load_transaction_items()
which makes it quick and easy to remap the original column names to the ones EcommerceTools requires. Simply pass in the location of the CSV file, and then define the name of each column.
transaction_items_df = utilities.load_transaction_items(
filename,
date_column='InvoiceDate',
order_id_column='InvoiceNo',
customer_id_column='CustomerID',
sku_column='StockCode',
quantity_column='Quantity',
unit_price_column='UnitPrice'
)
transaction_items_df.head()
order_id | sku | Description | quantity | order_date | unit_price | customer_id | Country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
Next, we’ll take our original dataframe of transaction items and create a dataframe of transactions. We can do that by passing the name of our transaction items dataframe to the get_transactions()
function. This aggregates the data on the order_id
, and returns the count of SKUs and items, and sums the total revenue, and identifies whether the item was a replacement and its order number. A value of 1
in the order_number
column denotes an acquisition, and everything higher is a returning customer.
transactions_df = transactions.get_transactions(transaction_items_df)
transactions_df.head()
order_id | order_date | customer_id | skus | items | revenue | replacement | order_number | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | 7 | 40 | 139.12 | 0 | 1 |
1 | 536366 | 2010-12-01 08:28:00 | 17850.0 | 2 | 12 | 22.20 | 0 | 2 |
2 | 536367 | 2010-12-01 08:34:00 | 13047.0 | 12 | 83 | 278.73 | 0 | 1 |
3 | 536368 | 2010-12-01 08:34:00 | 13047.0 | 4 | 15 | 70.05 | 0 | 2 |
4 | 536369 | 2010-12-01 08:35:00 | 13047.0 | 1 | 3 | 17.85 | 0 | 3 |
Next, we can create a dataframe of customers using the get_customers()
function. This also takes the name of the dataframe of transaction items and returns a customer-level dataset containing the total spend, total number of SKUs and items purchased, the first order and last order dates, the tenure, recency, and average order size and value metrics for each customer.
customers_df = customers.get_customers(transaction_items_df)
customers_df.head()
customer_id | revenue | orders | skus | items | first_order_date | last_order_date | avg_items | avg_order_value | tenure | recency | cohort | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12346.0 | 0.00 | 2 | 1 | 0 | 2011-01-18 10:01:00 | 2011-01-18 10:17:00 | 0.00 | 0.00 | 3722 | 3722 | 20111 |
1 | 12347.0 | 4310.00 | 7 | 7 | 2458 | 2010-12-07 14:57:00 | 2011-12-07 15:52:00 | 351.14 | 615.71 | 3763 | 3398 | 20104 |
2 | 12348.0 | 1797.24 | 4 | 4 | 2341 | 2010-12-16 19:09:00 | 2011-09-25 13:13:00 | 585.25 | 449.31 | 3754 | 3471 | 20104 |
3 | 12349.0 | 1757.55 | 1 | 1 | 631 | 2011-11-21 09:51:00 | 2011-11-21 09:51:00 | 631.00 | 1757.55 | 3415 | 3415 | 20114 |
4 | 12350.0 | 334.40 | 1 | 1 | 197 | 2011-02-02 16:01:00 | 2011-02-02 16:01:00 | 197.00 | 334.40 | 3706 | 3706 | 20111 |
Now we have this customer-level dataset containing the raw recency, frequency and monetary value data for each customer, we can pass it to the get_rfm_segments()
function. This segmentation process will create a behavioural segmentation of our customer base.
The function returns a few duplicated columns for reference purposes, plus the individual R, F, and M scores, the combined RFM label (from 111 to 555), the RFM score (i.e. 3 to 15), and a label representing the segment name. The segmentation process creates groups of customers with common characteristics that can be targeted via marketing campaigns aimed at influencing their behaviour.
For example, many B2C and B2B companies trigger marketing campaigns use RFM segments to deliver marketing messages to groups of customers who show high F and M scores, but have R scores that are declining, as this can help win back lapsing or churning customers.
rfm_df = customers.get_rfm_segments(customers_df)
rfm_df.head()
customer_id | acquisition_date | recency_date | recency | frequency | monetary | heterogeneity | tenure | r | f | m | h | rfm | rfm_score | rfm_segment_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12346.0 | 2011-01-18 10:01:00 | 2011-01-18 10:17:00 | 3722 | 2 | 0.00 | 1 | 3722 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
1 | 12350.0 | 2011-02-02 16:01:00 | 2011-02-02 16:01:00 | 3706 | 1 | 334.40 | 1 | 3706 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
2 | 12373.0 | 2011-02-01 13:10:00 | 2011-02-01 13:10:00 | 3707 | 1 | 364.60 | 1 | 3707 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
3 | 12377.0 | 2010-12-20 09:37:00 | 2011-01-28 15:45:00 | 3711 | 2 | 1628.12 | 2 | 3751 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
4 | 12386.0 | 2010-12-08 09:53:00 | 2011-01-06 12:37:00 | 3734 | 2 | 401.90 | 2 | 3763 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
Next, we’ll use the transactions dataframe to calculate the purchase latency for each customer. This segmentation allows the minimum, mean, and maximum gap between each of a customer’s orders to be calculated, and identifies the standard deviation, and the Coefficient of Variation.
It then uses these values to predict the expected number of days to the customer’s next order and identifies whether the customer is due to order or not. This segmentation technique, which is also known as order gap analysis can be very powerful, particularly in email marketing.
latency_df = customers.get_latency(transactions_df)
latency_df.head()
customer_id | frequency | recency_date | recency | avg_latency | min_latency | max_latency | std_latency | cv | days_to_next_order | label | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12680.0 | 4 | 2011-12-09 12:50:00 | 3397 | 28 | 16 | 73 | 30.859898 | 1.102139 | -3338.0 | Order overdue |
1 | 13113.0 | 24 | 2011-12-09 12:49:00 | 3397 | 15 | 0 | 52 | 12.060126 | 0.804008 | -3370.0 | Order overdue |
2 | 15804.0 | 13 | 2011-12-09 12:31:00 | 3397 | 15 | 1 | 39 | 11.008261 | 0.733884 | -3371.0 | Order overdue |
3 | 13777.0 | 33 | 2011-12-09 12:25:00 | 3397 | 11 | 0 | 48 | 12.055274 | 1.095934 | -3374.0 | Order overdue |
4 | 17581.0 | 25 | 2011-12-09 12:21:00 | 3397 | 14 | 0 | 67 | 21.974293 | 1.569592 | -3361.0 | Order overdue |
To examine our customers based on their value, we can use an ABC segmentation model. This calculates the cumulative sum of revenue contributed over the past 12 months (by default) and returns their rank and class. If you use a historic dataset, like this one, you’ll obvously get useless data back out (since the dataset is more than 12 months old). However, ABC is a great way to see which customers are most valuable, especially in B2B ecommerce.
abc_df = customers.get_abc_segments(customers_df)
abc_df.head()
customer_id | abc_class_12m | abc_rank_12m | |
---|---|---|---|
0 | 12346.0 | D | 1.0 |
1 | 12347.0 | D | 1.0 |
2 | 12348.0 | D | 1.0 |
3 | 12349.0 | D | 1.0 |
4 | 12350.0 | D | 1.0 |
Finally, we can use the transactions dataframe to predict the Customer Lifetime Value (CLV), expected number of orders over the next 90 days, and the average order value each customer will generate. These calculations are achieved using the Gamma-Gamma and BG/NBD model from the Lifetimes package.
df_predictions = customers.get_customer_predictions(transactions_df,
observation_period_end='2011-12-09',
days=90)
df_predictions.head(10)
customer_id | predicted_purchases | aov | clv | |
---|---|---|---|---|
0 | 12346.0 | 0.188830 | NaN | NaN |
1 | 12347.0 | 1.408736 | 569.978836 | 836.846896 |
2 | 12348.0 | 0.805907 | 333.784235 | 308.247354 |
3 | 12349.0 | 0.855607 | NaN | NaN |
4 | 12350.0 | 0.196304 | NaN | NaN |
5 | 12352.0 | 1.682277 | 376.175359 | 647.826169 |
6 | 12353.0 | 0.272541 | NaN | NaN |
7 | 12354.0 | 0.247183 | NaN | NaN |
8 | 12355.0 | 0.262909 | NaN | NaN |
9 | 12356.0 | 0.645368 | 324.039419 | 256.855226 |
These simple functions give you a huge amount of customer data to play with. You can use this to predict revenue in future periods, set your Customer Acquisition Cost (CAC) strategy, target customers at the right time without annoying them, and a whole load of other things.
Matt Clarke, Sunday, March 28, 2021