While the Recency, Frequency, Monetary value or RFM model for customer segmentation might be old, it’s based on sound science, so no matter what customer model you’re building, it’s generally the RFM metrics which rank mostly highly when you examine model feature importance.
The raw RFM variables can be used to create RFM scores, or you can use them in ABC analysis. This is based upon the ABC inventory classification model used in operations management, and uses the Pareto principle (or 80/20 rule) to group customers together in segments based on their revenue contribution.
While these metrics can be time-consuming and processor intensive to calculate in SQL, they’re quick and easy to create in Python, thanks to Pandas. Here’s how it’s done.
I’m using a transactional item dataset, rather than a transactional dataset, for this project because I want to look at some brand-level metrics. However, you can perform a similar analysis using standard transactional data too. Export the data from your database or data warehouse and load it into a Pandas dataframe.
import pandas as pd
import numpy as np
df = pd.read_csv('transaction_items.csv', low_memory=False)
df.head(3)
order_item_id | order_id | customer_id | transaction_id | variant_id | brand_name | quantity | line_revenue | date_created | |
---|---|---|---|---|---|---|---|---|---|
0 | 319045 | 73764 | 78503 | 12345 | 201992 | brand3 | 1 | 39.99 | 2010-12-09 12:19:19 |
1 | 319059 | 73767 | 78505 | 123456 | 872928 | brand2 | 1 | 2.49 | 2010-12-09 12:45:22 |
2 | 319085 | 73775 | 76070 | 12347 | 536051 | brand1 | 1 | 69.99 | 2010-12-09 16:00:16 |
Since we’ll be doing some calendrical calculations, it’s important to ensure that your dataset includes a date field and that it is recognised as a datetime format. You can check the format of the columns using df.info()
. If you find that the date column is an object, you can convert it to a datetime using the Pandas to_datetime()
function.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74079 entries, 0 to 74078
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_item_id 74079 non-null int64
1 order_id 74079 non-null int64
2 customer_id 74079 non-null int64
3 transaction_id 74079 non-null object
4 variant_id 74079 non-null int64
5 brand_name 74078 non-null object
6 quantity 74079 non-null int64
7 line_revenue 74079 non-null float64
8 date_created 74079 non-null object
dtypes: float64(1), int64(5), object(3)
memory usage: 5.1+ MB
df['date_created'] = pd.to_datetime(df['date_created'])
In order to store out customer segmentation data we’ll first need to create a new dataframe containing all of the unique customers within the dataframe. We’ll use unique()
to select these and assign them to a dataframe called df_customers
, then we’ll set the column name as customer_id
.
df_customers = pd.DataFrame(df['customer_id'].unique())
df_customers.columns = ['customer_id']
df_customers.head()
customer_id | |
---|---|
0 | 78503 |
1 | 78505 |
2 | 76070 |
3 | 56387 |
4 | 46307 |
The recency variable in RFM is a measure of how long it’s been since the customer’s last purchase. Some models require the use of the recency date itself (i.e. the date of the last purchase) so we’ll calculate and store this, as well as the recency metric, or the number of days since the customer’s last order. Customers who have a very high number for recency haven’t been seen for ages and are more likely to be lapsed than those with lower numbers.
The recency_date
represent the max()
value from the date_created
column for each customer, while the recency
metric represents the time delta between the current date and the recency_date
in days. If you cast this to an integer you’ll be able to group the data more easily.
df_recency = df.groupby('customer_id')['date_created'].max().reset_index()
df_recency.columns = ['customer_id','recency_date']
df_customers = df_customers.merge(df_recency, on='customer_id')
df_customers['recency'] = round((pd.to_datetime('today') - df_customers['recency_date'])\
/ np.timedelta64(1, 'D') ).astype(int)
df_customers.head()
customer_id | recency_date | recency | |
---|---|---|---|
0 | 78503 | 2010-12-09 12:19:19 | 3601 |
1 | 78505 | 2010-12-09 12:45:22 | 3601 |
2 | 76070 | 2020-10-14 16:14:27 | 4 |
3 | 56387 | 2010-12-09 17:07:33 | 3601 |
4 | 46307 | 2010-12-09 21:27:09 | 3601 |
Frequency is a measure of the number of orders a customer has placed. To calculate this metric for each customer we can simpply count the number of unique order IDs. Once we’ve obtained the value for each customer, we can store it on the df_customers
dataframe with our other data.
df_frequency = df.groupby('customer_id')['order_id'].nunique().reset_index()
df_frequency.columns = ['customer_id','frequency']
df_customers = df_customers.merge(df_frequency, on='customer_id')
df_customers.head()
customer_id | recency_date | recency | frequency | |
---|---|---|---|---|
0 | 78503 | 2010-12-09 12:19:19 | 3601 | 1 |
1 | 78505 | 2010-12-09 12:45:22 | 3601 | 1 |
2 | 76070 | 2020-10-14 16:14:27 | 4 | 15 |
3 | 56387 | 2010-12-09 17:07:33 | 3601 | 1 |
4 | 46307 | 2010-12-09 21:27:09 | 3601 | 1 |
The monetary metric represents the total revenue generated by each customer. It can be calculated in a similar manner to the frequency. We simply sum()
the line_revenue
column to obtain the total spend and then assign the value to the df_customers
dataframe.
df_monetary = df.groupby('customer_id')['line_revenue'].sum().reset_index()
df_monetary.columns = ['customer_id','monetary']
df_customers = df_customers.merge(df_monetary, on='customer_id')
Tenure, or T, is a measure of how long a given customer has been with the company. The tenure_date
is effectively the opposite of the recency_date
and is calculated from the min()
of the date_created
field rather than the max()
, while the tenure metric is the time delta between the tenure_date
and the current date in days, expressed as an integer. In related models, such as LRIFMQ (Length, Recency, Interval, Frequency, Monetary, Quantity) the T or Tenure value I have used is the same as the L or Length value (Matz and Hermawan, 2020).
df_tenure = df.groupby('customer_id')['date_created'].min().reset_index()
df_tenure.columns = ['customer_id','tenure_date']
df_customers = df_customers.merge(df_tenure, on='customer_id')
df_customers['tenure'] = round((pd.to_datetime('today') - df_customers['tenure_date']) \
/ np.timedelta64(1, 'D') ).astype(int)
df_customers.head()
customer_id | recency_date | recency | frequency | tenure_date | tenure | |
---|---|---|---|---|---|---|
0 | 78503 | 2010-12-09 12:19:19 | 3601 | 1 | 2010-12-09 12:19:19 | 3601 |
1 | 78505 | 2010-12-09 12:45:22 | 3601 | 1 | 2010-12-09 12:45:22 | 3601 |
2 | 76070 | 2020-10-14 16:14:27 | 4 | 15 | 2010-12-09 16:00:16 | 3601 |
3 | 56387 | 2010-12-09 17:07:33 | 3601 | 1 | 2010-12-09 17:07:33 | 3601 |
4 | 46307 | 2010-12-09 21:27:09 | 3601 | 1 | 2010-12-09 21:27:09 | 3601 |
Variety is a measure of the number of unique SKUs purchased. It’s a useful way to understand how much “share of wallet” you might be getting. For example, low numbers could indicate that customers are purchasing some of their products from competitors, perhaps because your prices are too high or you don’t stock what they want. We can calculate variety
by counting the nunique()
values in the variant_id
column for each customer.
df_variety = df.groupby('customer_id')['variant_id'].nunique().reset_index()
df_variety.columns = ['customer_id','variety']
df_customers = df_customers.merge(df_variety, on='customer_id')
df_customers.head()
customer_id | recency_date | recency | frequency | monetary | tenure_date | tenure | variety | |
---|---|---|---|---|---|---|---|---|
0 | 78503 | 2010-12-09 12:19:19 | 3601 | 1 | 39.99 | 2010-12-09 12:19:19 | 3601 | 1 |
1 | 78505 | 2010-12-09 12:45:22 | 3601 | 1 | 75.95 | 2010-12-09 12:45:22 | 3601 | 6 |
2 | 76070 | 2020-10-14 16:14:27 | 4 | 15 | 924.12 | 2010-12-09 16:00:16 | 3601 | 7 |
3 | 56387 | 2010-12-09 17:07:33 | 3601 | 1 | 139.98 | 2010-12-09 17:07:33 | 3601 | 1 |
4 | 46307 | 2010-12-09 21:27:09 | 3601 | 1 | 105.00 | 2010-12-09 21:27:09 | 3601 | 1 |
Like variety, the brands metric is simply a count of the number of unique brands purchased and it can be calculated in the same way and added to the df_customers
dataframe.
df_brands = df.groupby('customer_id')['brand_name'].nunique().reset_index()
df_brands.columns = ['customer_id','brands']
df_customers = df_customers.merge(df_brands, on='customer_id')
df_customers[['customer_id','variety','brands']].head()
customer_id | variety | brands | |
---|---|---|---|
0 | 78503 | 1 | 1 |
1 | 78505 | 6 | 5 |
2 | 76070 | 7 | 3 |
3 | 56387 | 1 | 1 |
4 | 46307 | 1 | 1 |
Average order value (or AOV) is one of the standard metrics in ecommerce. We can calculate this by dividing the total spend for each customer from the monetary
column by the number of orders they’ve placed, which is stored in the frequency
column. Usually, when the site is being managed well, you’d normally expect to see AOV rise as more orders are placed.
df_customers['avg_order_value'] = round(df_customers['monetary'] / df_customers['frequency'],2)
df_customers[['customer_id','variety','brands', 'avg_order_value']].head()
customer_id | variety | brands | avg_order_value | |
---|---|---|---|---|
0 | 78503 | 1 | 1 | 39.99 |
1 | 78505 | 6 | 5 | 75.95 |
2 | 76070 | 7 | 3 | 61.61 |
3 | 56387 | 1 | 1 | 139.98 |
4 | 46307 | 1 | 1 | 105.00 |
Basket size represents the total number of items in the basket. Most ecommerce sites try to encourage customers to purchase more items, so this is a useful KPI. It can sometimes vary quite a lot across customers, so can be useful in segmenting them in order to try and grow basket sizes on those who only make smaller orders.
df_items = df.groupby('customer_id')['quantity'].sum().reset_index()
df_items.columns = ['customer_id','items']
df_customers = df_customers.merge(df_items, on='customer_id')
df_customers['avg_basket_size'] = round(df_customers['items'] / df_customers['frequency'],2)
df_customers[['customer_id','variety','brands', 'avg_order_value','avg_basket_size']].head()
customer_id | variety | brands | avg_order_value | avg_basket_size | |
---|---|---|---|---|---|
0 | 78503 | 1 | 1 | 39.99 | 1.00 |
1 | 78505 | 6 | 5 | 75.95 | 7.00 |
2 | 76070 | 7 | 3 | 61.61 | 1.47 |
3 | 56387 | 1 | 1 | 139.98 | 2.00 |
4 | 46307 | 1 | 1 | 105.00 | 2.00 |
ABC classification is an approach that comes from operations management. It’s usually applied to the classification of product inventories, so procurement managers can better manage their stock and avoid costly stock outs.
It’s based on the Pareto principle, or 80/20 rule and allocates products (or customers) to groups A, B, or C, based on their cumulative contribution to the overall revenue. Customers in Class A contribute 80% of the cumulative running total percentage of revenue, while those in B generate the next 10%, and those in Class C, generate the bottom 10%.
In my experience, when using this approach on customers it works best when you calculate it over two periods, say 12 months and 24 months, as this allows you to see relative change in their contribution. It also makes sense, especially on well established ecommerce sites, to assign the lapsed customers to class D. It’s slightly more complex to calculate than the RFM metrics, so to make it simpler we’ll create a function we can re-use for the two date periods.
def classify_customer(percentage):
"""Apply an ABC classification to each customer based on
its ranked percentage revenue contribution. Any split
can be used to suit your data.
:param percentage: Running percentage of revenue contributed
:return: ABC classification
"""
if percentage > 0 and percentage <= 80:
return 'A'
elif percentage > 80 and percentage <= 90:
return 'B'
else:
return 'C'
To classify the customers we’ll split the customers dataframe based on the number of months specified using the recency
metric and sort the dataframe from highest spend to lowest. Then we’ll calculate the cumulative sum of the monetary
column and calculate the running percentage of revenue contributed.
Using the percentage value, we can then run the classify_customer()
function above and determine which class they’re in, while their rank is obtained using the Pandas rank()
function. Customers with a recency
greater than the months
value didn’t purchase and get assigned to Class D and have a rank at the bottom of the pile.
def abc(df, months, abc_class_name, abc_rank_name):
"""Creates a customer ABC classification based on a given number of months.
:param df: Pandas dataframe containing customer metrics
:param months: Number of months for analysis, i.e. 12 or 24
:param abc_class_name: Column name to assign to hold the ABC class i.e. abc_class_12m
:param abc_rank_name: Column name to assign to hold the ABC rank i.e. abc_rank_12m
"""
# Split data according to time period
df_purchased = df[df['recency'] <= (months * 30)]
df_purchased = df_purchased.sort_values(by='monetary', ascending=False)
# Calculate temporary metrics
df_purchased['revenue_cumsum'] = df_purchased['monetary'].cumsum()
df_purchased['revenue_total'] = df_purchased['monetary'].sum()
df_purchased['revenue_running_percentage'] = \
(df_purchased['revenue_cumsum'] / df_purchased['revenue_total']) * 100
# Assign values and drop temporary metrics
df_purchased[abc_class_name] = df_purchased['revenue_running_percentage'].apply(classify_customer)
df_purchased[abc_rank_name] = df_purchased['revenue_running_percentage'].rank().astype(int)
df_purchased.drop(['revenue_cumsum','revenue_total','revenue_running_percentage'], axis=1, inplace=True)
# Assign lapsed customers to class D
df_lapsed = df[df['recency'] > (months * 30)]
df_all = df_purchased.append(df_lapsed)
df_all[abc_class_name].fillna('D', inplace=True)
df_all[abc_rank_name].fillna(len(df_purchased)+1, inplace=True)
return df_all
Finally, we can run the function on 12 and 24 month periods and assign the values back to our customers dataframe.
df_final = abc(df_customers, 12, 'abc_class_12m', 'abc_rank_12m')
df_final = abc(df_final, 24, 'abc_class_24m', 'abc_rank_24m')
df_final[['customer_id','abc_class_12m','abc_rank_12m','abc_class_24m','abc_rank_24m']].head()
customer_id | abc_class_12m | abc_rank_12m | abc_class_24m | abc_rank_24m | |
---|---|---|---|---|---|
1260 | 81750 | A | 1.0 | A | 1.0 |
482 | 107397 | A | 2.0 | A | 2.0 |
555 | 119446 | A | 3.0 | A | 3.0 |
2512 | 83853 | A | 4.0 | A | 4.0 |
6230 | 124642 | A | 5.0 | A | 5.0 |
df_final[['customer_id','abc_class_12m','abc_rank_12m','abc_class_24m','abc_rank_24m']].tail()
customer_id | abc_class_12m | abc_rank_12m | abc_class_24m | abc_rank_24m | |
---|---|---|---|---|---|
9197 | 128282 | D | 5832.0 | D | 9870.0 |
9198 | 128283 | D | 5832.0 | D | 9870.0 |
9199 | 128284 | D | 5832.0 | D | 9870.0 |
9200 | 128286 | D | 5832.0 | D | 9870.0 |
9201 | 128288 | D | 5832.0 | D | 9870.0 |
As you can see, our customers are now neatly assigned to classes based on their contribution to the cumulative revenue over both 12 and 24 month periods, and they’re ranked across the business and within each class.
df_final.abc_class_24m.value_counts()
D 4784
C 4718
A 3573
B 1578
Name: abc_class_24m, dtype: int64
df_final.abc_class_12m.value_counts()
D 8822
C 2662
A 2326
B 843
Name: abc_class_12m, dtype: int64
Matt Clarke, Sunday, March 14, 2021