How to segment your customers using EcommerceTools

EcommerceTools makes it quick and easy to segment your customers using a range of powerful techniques and predict their future behaviour and value. Here's how it works.

How to segment your customers using EcommerceTools
Picture by Paul Felberbauer, Unsplash.
14 minutes to read

Customer segmentation can give you huge insights into your business and identify a whole range of different things about your customers, allowing you to change your marketing and improve results.

The downside is that customer segmentation is complicated and takes time and effort to get right. To simplify the process, I created a Python package called EcommerceTools which lets you segment customers using several standardised techniques, and use the data to predict their future behaviour. It’s dead easy to use.

In this project, I’ll show you how you can take a standard dataset of transaction items and segment your customers using RFM, ABC, heterogeneity, latency, and acquisition and purchase cohort, and even predict their CLV, AOV, and how many orders they’ll place in the next few months!

Load the packages

First, open up a Jupyter notebook and install my EcommerceTools Python package if you do not already have it. You can do this by entering !pip3 install --upgrade ecommercetools into a notebook cell and then executing it. You’ll then need to import pandas and the utilities, transactions, and customers modules from the ecommercetools package.

!pip3 install --upgrade ecommercetools
import pandas as pd
from ecommercetools import utilities
from ecommercetools import transactions
from ecommercetools import customers

Load the data

Next load up an ecommerce transaction items dataset. I’ve used the widely used sample Online Retail dataset from the UCI Machine Learning Repository, which you can download directly from GitHub. If you load up the data, you can view the column names more easily.

filename = 'https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/all/online-retail-dataset.csv'
df = pd.read_csv(filename)
df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom

EcommerceTools requires that you standardise the column names in your transaction items dataframe. There’s a helper function in the utilities module called load_transaction_items() which makes it quick and easy to remap the original column names to the ones EcommerceTools requires. Simply pass in the location of the CSV file, and then define the name of each column.

transaction_items_df = utilities.load_transaction_items(
    filename,
    date_column='InvoiceDate',
    order_id_column='InvoiceNo',
    customer_id_column='CustomerID',
    sku_column='StockCode',
    quantity_column='Quantity',
    unit_price_column='UnitPrice'
)
transaction_items_df.head()
order_id sku Description quantity order_date unit_price customer_id Country line_price
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Create a transactions dataset

Next, we’ll take our original dataframe of transaction items and create a dataframe of transactions. We can do that by passing the name of our transaction items dataframe to the get_transactions() function. This aggregates the data on the order_id, and returns the count of SKUs and items, and sums the total revenue, and identifies whether the item was a replacement and its order number. A value of 1 in the order_number column denotes an acquisition, and everything higher is a returning customer.

transactions_df = transactions.get_transactions(transaction_items_df)
transactions_df.head()
order_id order_date customer_id skus items revenue replacement order_number
0 536365 2010-12-01 08:26:00 17850.0 7 40 139.12 0 1
1 536366 2010-12-01 08:28:00 17850.0 2 12 22.20 0 2
2 536367 2010-12-01 08:34:00 13047.0 12 83 278.73 0 1
3 536368 2010-12-01 08:34:00 13047.0 4 15 70.05 0 2
4 536369 2010-12-01 08:35:00 13047.0 1 3 17.85 0 3

Create a customer dataset

Next, we can create a dataframe of customers using the get_customers() function. This also takes the name of the dataframe of transaction items and returns a customer-level dataset containing the total spend, total number of SKUs and items purchased, the first order and last order dates, the tenure, recency, and average order size and value metrics for each customer.

customers_df = customers.get_customers(transaction_items_df)
customers_df.head()
customer_id revenue orders skus items first_order_date last_order_date avg_items avg_order_value tenure recency cohort
0 12346.0 0.00 2 1 0 2011-01-18 10:01:00 2011-01-18 10:17:00 0.00 0.00 3722 3722 20111
1 12347.0 4310.00 7 7 2458 2010-12-07 14:57:00 2011-12-07 15:52:00 351.14 615.71 3763 3398 20104
2 12348.0 1797.24 4 4 2341 2010-12-16 19:09:00 2011-09-25 13:13:00 585.25 449.31 3754 3471 20104
3 12349.0 1757.55 1 1 631 2011-11-21 09:51:00 2011-11-21 09:51:00 631.00 1757.55 3415 3415 20114
4 12350.0 334.40 1 1 197 2011-02-02 16:01:00 2011-02-02 16:01:00 197.00 334.40 3706 3706 20111

Behavioural segmentation using RFM and heterogeneity

Now we have this customer-level dataset containing the raw recency, frequency and monetary value data for each customer, we can pass it to the get_rfm_segments() function. This segmentation process will create a behavioural segmentation of our customer base.

The function returns a few duplicated columns for reference purposes, plus the individual R, F, and M scores, the combined RFM label (from 111 to 555), the RFM score (i.e. 3 to 15), and a label representing the segment name. The segmentation process creates groups of customers with common characteristics that can be targeted via marketing campaigns aimed at influencing their behaviour.

For example, many B2C and B2B companies trigger marketing campaigns use RFM segments to deliver marketing messages to groups of customers who show high F and M scores, but have R scores that are declining, as this can help win back lapsing or churning customers.

rfm_df = customers.get_rfm_segments(customers_df)
rfm_df.head()
customer_id acquisition_date recency_date recency frequency monetary heterogeneity tenure r f m h rfm rfm_score rfm_segment_name
0 12346.0 2011-01-18 10:01:00 2011-01-18 10:17:00 3722 2 0.00 1 3722 1 1 1 1 111 3 Risky
1 12350.0 2011-02-02 16:01:00 2011-02-02 16:01:00 3706 1 334.40 1 3706 1 1 1 1 111 3 Risky
2 12373.0 2011-02-01 13:10:00 2011-02-01 13:10:00 3707 1 364.60 1 3707 1 1 1 1 111 3 Risky
3 12377.0 2010-12-20 09:37:00 2011-01-28 15:45:00 3711 2 1628.12 2 3751 1 1 1 1 111 3 Risky
4 12386.0 2010-12-08 09:53:00 2011-01-06 12:37:00 3734 2 401.90 2 3763 1 1 1 1 111 3 Risky

Segment customers by latency

Next, we’ll use the transactions dataframe to calculate the purchase latency for each customer. This segmentation allows the minimum, mean, and maximum gap between each of a customer’s orders to be calculated, and identifies the standard deviation, and the Coefficient of Variation.

It then uses these values to predict the expected number of days to the customer’s next order and identifies whether the customer is due to order or not. This segmentation technique, which is also known as order gap analysis can be very powerful, particularly in email marketing.

latency_df = customers.get_latency(transactions_df)
latency_df.head()
customer_id frequency recency_date recency avg_latency min_latency max_latency std_latency cv days_to_next_order label
0 12680.0 4 2011-12-09 12:50:00 3397 28 16 73 30.859898 1.102139 -3338.0 Order overdue
1 13113.0 24 2011-12-09 12:49:00 3397 15 0 52 12.060126 0.804008 -3370.0 Order overdue
2 15804.0 13 2011-12-09 12:31:00 3397 15 1 39 11.008261 0.733884 -3371.0 Order overdue
3 13777.0 33 2011-12-09 12:25:00 3397 11 0 48 12.055274 1.095934 -3374.0 Order overdue
4 17581.0 25 2011-12-09 12:21:00 3397 14 0 67 21.974293 1.569592 -3361.0 Order overdue

Segment customers using ABC

To examine our customers based on their value, we can use an ABC segmentation model. This calculates the cumulative sum of revenue contributed over the past 12 months (by default) and returns their rank and class. If you use a historic dataset, like this one, you’ll obvously get useless data back out (since the dataset is more than 12 months old). However, ABC is a great way to see which customers are most valuable, especially in B2B ecommerce.

abc_df = customers.get_abc_segments(customers_df)
abc_df.head()
customer_id abc_class_12m abc_rank_12m
0 12346.0 D 1.0
1 12347.0 D 1.0
2 12348.0 D 1.0
3 12349.0 D 1.0
4 12350.0 D 1.0

Get predicted CLV, order volume, and AOV per customer

Finally, we can use the transactions dataframe to predict the Customer Lifetime Value (CLV), expected number of orders over the next 90 days, and the average order value each customer will generate. These calculations are achieved using the Gamma-Gamma and BG/NBD model from the Lifetimes package.

df_predictions = customers.get_customer_predictions(transactions_df, 
                                          observation_period_end='2011-12-09', 
                                          days=90)
df_predictions.head(10)
customer_id predicted_purchases aov clv
0 12346.0 0.188830 NaN NaN
1 12347.0 1.408736 569.978836 836.846896
2 12348.0 0.805907 333.784235 308.247354
3 12349.0 0.855607 NaN NaN
4 12350.0 0.196304 NaN NaN
5 12352.0 1.682277 376.175359 647.826169
6 12353.0 0.272541 NaN NaN
7 12354.0 0.247183 NaN NaN
8 12355.0 0.262909 NaN NaN
9 12356.0 0.645368 324.039419 256.855226

These simple functions give you a huge amount of customer data to play with. You can use this to predict revenue in future periods, set your Customer Acquisition Cost (CAC) strategy, target customers at the right time without annoying them, and a whole load of other things.

Matt Clarke, Sunday, March 28, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.