How to segment customers using RFM and ABC

Creating a value-based segmentation using RFM and ABC is a great way to tell the good customers from the bad ones. Here's how it's done.

How to segment customers using RFM and ABC
ABC. Brett Jordan, Unsplash.
23 minutes to read

While the Recency, Frequency, Monetary value or RFM approach to customer segmentation might be old, it’s based on sound science, so no matter what customer model you’re building, it’s generally the RFM metrics which rank mostly highly when you examine model feature importance.

The raw RFM variables can be used to create RFM scores, or you can use them in ABC analysis. This is based upon the ABC inventory classification model used in operations management, and uses the Pareto principle (or 80/20 rule) to group customers together in segments based on their revenue contribution.

While these metrics can be time-consuming and processor intensive to calculate in SQL, they’re quick and easy to create in Python, thanks to Pandas. Here’s how it’s done.

Extract and load your data

I’m using a transactional item dataset, rather than a transactional dataset, for this project because I want to look at some brand-level metrics. However, you can perform a similar analysis using standard transactional data too. Export the data from your database or data warehouse and load it into a Pandas dataframe.

import pandas as pd
import numpy as np
df = pd.read_csv('transaction_items.csv', low_memory=False)
df.head(3)
order_item_id order_id customer_id transaction_id variant_id brand_name quantity line_revenue date_created
0 319045 73764 78503 12345 201992 brand3 1 39.99 2010-12-09 12:19:19
1 319059 73767 78505 123456 872928 brand2 1 2.49 2010-12-09 12:45:22
2 319085 73775 76070 12347 536051 brand1 1 69.99 2010-12-09 16:00:16

Ensure the date column is set

Since we’ll be doing some calendrical calculations, it’s important to ensure that your dataset includes a date field and that it is recognised as a datetime format. You can check the format of the columns using df.info(). If you find that the date column is an object, you can convert it to a datetime using the Pandas to_datetime() function.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74079 entries, 0 to 74078
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   order_item_id   74079 non-null  int64  
 1   order_id        74079 non-null  int64  
 2   customer_id     74079 non-null  int64  
 3   transaction_id  74079 non-null  object 
 4   variant_id      74079 non-null  int64  
 5   brand_name      74078 non-null  object 
 6   quantity        74079 non-null  int64  
 7   line_revenue    74079 non-null  float64
 8   date_created    74079 non-null  object 
dtypes: float64(1), int64(5), object(3)
memory usage: 5.1+ MB
df['date_created'] = pd.to_datetime(df['date_created'])

Create a customers dataframe

In order to store out customer segmentation data we’ll first need to create a new dataframe containing all of the unique customers within the dataframe. We’ll use unique() to select these and assign them to a dataframe called df_customers, then we’ll set the column name as customer_id.

df_customers = pd.DataFrame(df['customer_id'].unique())
df_customers.columns = ['customer_id']
df_customers.head()
customer_id
0 78503
1 78505
2 76070
3 56387
4 46307
Calculating recency

The recency variable in RFM is a measure of how long it’s been since the customer’s last purchase. Some models require the use of the recency date itself (i.e. the date of the last purchase) so we’ll calculate and store this, as well as the recency metric, or the number of days since the customer’s last order. Customers who have a very high number for recency haven’t been seen for ages and are more likely to be lapsed than those with lower numbers.

The recency_date represent the max() value from the date_created column for each customer, while the recency metric represents the time delta between the current date and the recency_date in days. If you cast this to an integer you’ll be able to group the data more easily.

df_recency = df.groupby('customer_id')['date_created'].max().reset_index()
df_recency.columns = ['customer_id','recency_date']
df_customers = df_customers.merge(df_recency, on='customer_id')
df_customers['recency'] = round((pd.to_datetime('today') - df_customers['recency_date'])\
                                / np.timedelta64(1, 'D') ).astype(int)
df_customers.head()
customer_id recency_date recency
0 78503 2010-12-09 12:19:19 3601
1 78505 2010-12-09 12:45:22 3601
2 76070 2020-10-14 16:14:27 4
3 56387 2010-12-09 17:07:33 3601
4 46307 2010-12-09 21:27:09 3601
Calculating frequency

Frequency is a measure of the number of orders a customer has placed. To calculate this metric for each customer we can simpply count the number of unique order IDs. Once we’ve obtained the value for each customer, we can store it on the df_customers dataframe with our other data.

df_frequency = df.groupby('customer_id')['order_id'].nunique().reset_index()
df_frequency.columns = ['customer_id','frequency']
df_customers = df_customers.merge(df_frequency, on='customer_id')
df_customers.head()
customer_id recency_date recency frequency
0 78503 2010-12-09 12:19:19 3601 1
1 78505 2010-12-09 12:45:22 3601 1
2 76070 2020-10-14 16:14:27 4 15
3 56387 2010-12-09 17:07:33 3601 1
4 46307 2010-12-09 21:27:09 3601 1
Calculating monetary

The monetary metric represents the total revenue generated by each customer. It can be calculated in a similar manner to the frequency. We simply sum() the line_revenue column to obtain the total spend and then assign the value to the df_customers dataframe.

df_monetary = df.groupby('customer_id')['line_revenue'].sum().reset_index()
df_monetary.columns = ['customer_id','monetary']
df_customers = df_customers.merge(df_monetary, on='customer_id')
Calculating tenure

Tenure, or T, is a measure of how long a given customer has been with the company. The tenure_date is effectively the opposite of the recency_date and is calculated from the min() of the date_created field rather than the max(), while the tenure metric is the time delta between the tenure_date and the current date in days, expressed as an integer. In related models, such as LRIFMQ (Length, Recency, Interval, Frequency, Monetary, Quantity) the T or Tenure value I have used is the same as the L or Length value (Matz and Hermawan, 2020).

df_tenure = df.groupby('customer_id')['date_created'].min().reset_index()
df_tenure.columns = ['customer_id','tenure_date']
df_customers = df_customers.merge(df_tenure, on='customer_id')
df_customers['tenure'] = round((pd.to_datetime('today') - df_customers['tenure_date']) \
                               / np.timedelta64(1, 'D') ).astype(int)
df_customers.head()
customer_id recency_date recency frequency tenure_date tenure
0 78503 2010-12-09 12:19:19 3601 1 2010-12-09 12:19:19 3601
1 78505 2010-12-09 12:45:22 3601 1 2010-12-09 12:45:22 3601
2 76070 2020-10-14 16:14:27 4 15 2010-12-09 16:00:16 3601
3 56387 2010-12-09 17:07:33 3601 1 2010-12-09 17:07:33 3601
4 46307 2010-12-09 21:27:09 3601 1 2010-12-09 21:27:09 3601
Calculating variety

Variety is a measure of the number of unique SKUs purchased. It’s a useful way to understand how much “share of wallet” you might be getting. For example, low numbers could indicate that customers are purchasing some of their products from competitors, perhaps because your prices are too high or you don’t stock what they want. We can calculate variety by counting the nunique() values in the variant_id column for each customer.

df_variety = df.groupby('customer_id')['variant_id'].nunique().reset_index()
df_variety.columns = ['customer_id','variety']
df_customers = df_customers.merge(df_variety, on='customer_id')
df_customers.head()
customer_id recency_date recency frequency monetary tenure_date tenure variety
0 78503 2010-12-09 12:19:19 3601 1 39.99 2010-12-09 12:19:19 3601 1
1 78505 2010-12-09 12:45:22 3601 1 75.95 2010-12-09 12:45:22 3601 6
2 76070 2020-10-14 16:14:27 4 15 924.12 2010-12-09 16:00:16 3601 7
3 56387 2010-12-09 17:07:33 3601 1 139.98 2010-12-09 17:07:33 3601 1
4 46307 2010-12-09 21:27:09 3601 1 105.00 2010-12-09 21:27:09 3601 1
Calculating brands

Like variety, the brands metric is simply a count of the number of unique brands purchased and it can be calculated in the same way and added to the df_customers dataframe.

df_brands = df.groupby('customer_id')['brand_name'].nunique().reset_index()
df_brands.columns = ['customer_id','brands']
df_customers = df_customers.merge(df_brands, on='customer_id')
df_customers[['customer_id','variety','brands']].head()
customer_id variety brands
0 78503 1 1
1 78505 6 5
2 76070 7 3
3 56387 1 1
4 46307 1 1
Calculating AOV

Average order value (or AOV) is one of the standard metrics in ecommerce. We can calculate this by dividing the total spend for each customer from the monetary column by the number of orders they’ve placed, which is stored in the frequency column. Usually, when the site is being managed well, you’d normally expect to see AOV rise as more orders are placed.

df_customers['avg_order_value'] = round(df_customers['monetary'] / df_customers['frequency'],2)
df_customers[['customer_id','variety','brands', 'avg_order_value']].head()
customer_id variety brands avg_order_value
0 78503 1 1 39.99
1 78505 6 5 75.95
2 76070 7 3 61.61
3 56387 1 1 139.98
4 46307 1 1 105.00
Calculating average basket size

Basket size represents the total number of items in the basket. Most ecommerce sites try to encourage customers to purchase more items, so this is a useful KPI. It can sometimes vary quite a lot across customers, so can be useful in segmenting them in order to try and grow basket sizes on those who only make smaller orders.

df_items = df.groupby('customer_id')['quantity'].sum().reset_index()
df_items.columns = ['customer_id','items']
df_customers = df_customers.merge(df_items, on='customer_id')
df_customers['avg_basket_size'] = round(df_customers['items'] / df_customers['frequency'],2)
df_customers[['customer_id','variety','brands', 'avg_order_value','avg_basket_size']].head()
customer_id variety brands avg_order_value avg_basket_size
0 78503 1 1 39.99 1.00
1 78505 6 5 75.95 7.00
2 76070 7 3 61.61 1.47
3 56387 1 1 139.98 2.00
4 46307 1 1 105.00 2.00

Creating an ABC customer classification

ABC classification is an approach that comes from operations management. It’s usually applied to the classification of product inventories, so procurement managers can better manage their stock and avoid costly stock outs.

It’s based on the Pareto principle, or 80/20 rule and allocates products (or customers) to groups A, B, or C, based on their cumulative contribution to the overall revenue. Customers in Class A contribute 80% of the cumulative running total percentage of revenue, while those in B generate the next 10%, and those in Class C, generate the bottom 10%.

In my experience, when using this approach on customers it works best when you calculate it over two periods, say 12 months and 24 months, as this allows you to see relative change in their contribution. It also makes sense, especially on well established ecommerce sites, to assign the lapsed customers to class D. It’s slightly more complex to calculate than the RFM metrics, so to make it simpler we’ll create a function we can re-use for the two date periods.

def classify_customer(percentage):
    """Apply an ABC classification to each customer based on 
    its ranked percentage revenue contribution. Any split 
    can be used to suit your data. 
    
    :param percentage: Running percentage of revenue contributed
    :return: ABC classification
    """
    
    if percentage > 0 and percentage <= 80:
        return 'A'
    elif percentage > 80 and percentage <= 90:
        return 'B'
    else:
        return 'C'

To classify the customers we’ll split the customers dataframe based on the number of months specified using the recency metric and sort the dataframe from highest spend to lowest. Then we’ll calculate the cumulative sum of the monetary column and calculate the running percentage of revenue contributed.

Using the percentage value, we can then run the classify_customer() function above and determine which class they’re in, while their rank is obtained using the Pandas rank() function. Customers with a recency greater than the months value didn’t purchase and get assigned to Class D and have a rank at the bottom of the pile.

def abc(df, months, abc_class_name, abc_rank_name):
    """Creates a customer ABC classification based on a given number of months. 
    
    :param df: Pandas dataframe containing customer metrics
    :param months: Number of months for analysis, i.e. 12 or 24
    :param abc_class_name: Column name to assign to hold the ABC class i.e. abc_class_12m
    :param abc_rank_name: Column name to assign to hold the ABC rank i.e. abc_rank_12m

    """
    
    # Split data according to time period
    df_purchased = df[df['recency'] <= (months * 30)]    
    df_purchased = df_purchased.sort_values(by='monetary', ascending=False)
    
    # Calculate temporary metrics
    df_purchased['revenue_cumsum'] = df_purchased['monetary'].cumsum()
    df_purchased['revenue_total'] = df_purchased['monetary'].sum()
    df_purchased['revenue_running_percentage'] = \
    (df_purchased['revenue_cumsum'] / df_purchased['revenue_total']) * 100
    
    # Assign values and drop temporary metrics
    df_purchased[abc_class_name] = df_purchased['revenue_running_percentage'].apply(classify_customer)
    df_purchased[abc_rank_name] = df_purchased['revenue_running_percentage'].rank().astype(int)
    df_purchased.drop(['revenue_cumsum','revenue_total','revenue_running_percentage'], axis=1, inplace=True)
    
    # Assign lapsed customers to class D
    df_lapsed = df[df['recency'] > (months * 30)]
    df_all = df_purchased.append(df_lapsed)
    
    df_all[abc_class_name].fillna('D', inplace=True)
    df_all[abc_rank_name].fillna(len(df_purchased)+1, inplace=True)

    return df_all

Finally, we can run the function on 12 and 24 month periods and assign the values back to our customers dataframe.

df_final = abc(df_customers, 12, 'abc_class_12m', 'abc_rank_12m')
df_final = abc(df_final, 24, 'abc_class_24m', 'abc_rank_24m')
df_final[['customer_id','abc_class_12m','abc_rank_12m','abc_class_24m','abc_rank_24m']].head()
customer_id abc_class_12m abc_rank_12m abc_class_24m abc_rank_24m
1260 81750 A 1.0 A 1.0
482 107397 A 2.0 A 2.0
555 119446 A 3.0 A 3.0
2512 83853 A 4.0 A 4.0
6230 124642 A 5.0 A 5.0
df_final[['customer_id','abc_class_12m','abc_rank_12m','abc_class_24m','abc_rank_24m']].tail()
customer_id abc_class_12m abc_rank_12m abc_class_24m abc_rank_24m
9197 128282 D 5832.0 D 9870.0
9198 128283 D 5832.0 D 9870.0
9199 128284 D 5832.0 D 9870.0
9200 128286 D 5832.0 D 9870.0
9201 128288 D 5832.0 D 9870.0

As you can see, our customers are now neatly assigned to classes based on their contribution to the cumulative revenue over both 12 and 24 month periods, and they’re ranked across the business and within each class.

df_final.abc_class_24m.value_counts()
D    4784
C    4718
A    3573
B    1578
Name: abc_class_24m, dtype: int64
df_final.abc_class_12m.value_counts()
D    8822
C    2662
A    2326
B     843
Name: abc_class_12m, dtype: int64

Further reading

  • Matz, A. and Hermawan, A.T., 2020. Customer Loyalty Clustering Model Using K-Means Algorithm with LRIFMQ Parameters. Inform, 5(2).

Matt Clarke, Sunday, March 14, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.

Unsupervised Learning in Python

Learn how to cluster, transform, visualize, and extract insights from unlabeled datasets using scikit-learn and scipy.

Start course for FREE

Comments