How to segment customers using RFM and ABC

ABC. Brett Jordan, Unsplash.

23 minutes to read

While the Recency, Frequency, Monetary value or RFM model for customer segmentation might be old, it’s based on sound science, so no matter what customer model you’re building, it’s generally the RFM metrics which rank mostly highly when you examine model feature importance.

The raw RFM variables can be used to create RFM scores, or you can use them in ABC analysis. This is based upon the ABC inventory classification model used in operations management, and uses the Pareto principle (or 80/20 rule) to group customers together in segments based on their revenue contribution.

While these metrics can be time-consuming and processor intensive to calculate in SQL, they’re quick and easy to create in Python, thanks to Pandas. Here’s how it’s done.

Extract and load your data

I’m using a transactional item dataset, rather than a transactional dataset, for this project because I want to look at some brand-level metrics. However, you can perform a similar analysis using standard transactional data too. Export the data from your database or data warehouse and load it into a Pandas dataframe.

import pandas as pd
import numpy as np

df = pd.read_csv('transaction_items.csv', low_memory=False)
df.head(3)

	order_item_id	order_id	customer_id	transaction_id	variant_id	brand_name	quantity	line_revenue	date_created
0	319045	73764	78503	12345	201992	brand3	1	39.99	2010-12-09 12:19:19
1	319059	73767	78505	123456	872928	brand2	1	2.49	2010-12-09 12:45:22
2	319085	73775	76070	12347	536051	brand1	1	69.99	2010-12-09 16:00:16

Ensure the date column is set

Since we’ll be doing some calendrical calculations, it’s important to ensure that your dataset includes a date field and that it is recognised as a datetime format. You can check the format of the columns using df.info(). If you find that the date column is an object, you can convert it to a datetime using the Pandas to_datetime() function.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74079 entries, 0 to 74078
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   order_item_id   74079 non-null  int64  
 1   order_id        74079 non-null  int64  
 2   customer_id     74079 non-null  int64  
 3   transaction_id  74079 non-null  object 
 4   variant_id      74079 non-null  int64  
 5   brand_name      74078 non-null  object 
 6   quantity        74079 non-null  int64  
 7   line_revenue    74079 non-null  float64
 8   date_created    74079 non-null  object 
dtypes: float64(1), int64(5), object(3)
memory usage: 5.1+ MB

df['date_created'] = pd.to_datetime(df['date_created'])

Create a customers dataframe

In order to store out customer segmentation data we’ll first need to create a new dataframe containing all of the unique customers within the dataframe. We’ll use unique() to select these and assign them to a dataframe called df_customers, then we’ll set the column name as customer_id.

df_customers = pd.DataFrame(df['customer_id'].unique())
df_customers.columns = ['customer_id']
df_customers.head()

	customer_id
0	78503
1	78505
2	76070
3	56387
4	46307

Calculating recency

The recency variable in RFM is a measure of how long it’s been since the customer’s last purchase. Some models require the use of the recency date itself (i.e. the date of the last purchase) so we’ll calculate and store this, as well as the recency metric, or the number of days since the customer’s last order. Customers who have a very high number for recency haven’t been seen for ages and are more likely to be lapsed than those with lower numbers.

The recency_date represent the max() value from the date_created column for each customer, while the recency metric represents the time delta between the current date and the recency_date in days. If you cast this to an integer you’ll be able to group the data more easily.

df_recency = df.groupby('customer_id')['date_created'].max().reset_index()
df_recency.columns = ['customer_id','recency_date']
df_customers = df_customers.merge(df_recency, on='customer_id')

df_customers['recency'] = round((pd.to_datetime('today') - df_customers['recency_date'])\
                                / np.timedelta64(1, 'D') ).astype(int)

df_customers.head()

	customer_id	recency_date	recency
0	78503	2010-12-09 12:19:19	3601
1	78505	2010-12-09 12:45:22	3601
2	76070	2020-10-14 16:14:27	4
3	56387	2010-12-09 17:07:33	3601
4	46307	2010-12-09 21:27:09	3601

Calculating frequency

Frequency is a measure of the number of orders a customer has placed. To calculate this metric for each customer we can simpply count the number of unique order IDs. Once we’ve obtained the value for each customer, we can store it on the df_customers dataframe with our other data.

df_frequency = df.groupby('customer_id')['order_id'].nunique().reset_index()
df_frequency.columns = ['customer_id','frequency']
df_customers = df_customers.merge(df_frequency, on='customer_id')

df_customers.head()

	customer_id	recency_date	recency	frequency
0	78503	2010-12-09 12:19:19	3601	1
1	78505	2010-12-09 12:45:22	3601	1
2	76070	2020-10-14 16:14:27	4	15
3	56387	2010-12-09 17:07:33	3601	1
4	46307	2010-12-09 21:27:09	3601	1

Calculating monetary

The monetary metric represents the total revenue generated by each customer. It can be calculated in a similar manner to the frequency. We simply sum() the line_revenue column to obtain the total spend and then assign the value to the df_customers dataframe.

df_monetary = df.groupby('customer_id')['line_revenue'].sum().reset_index()
df_monetary.columns = ['customer_id','monetary']
df_customers = df_customers.merge(df_monetary, on='customer_id')

Calculating tenure

Tenure, or T, is a measure of how long a given customer has been with the company. The tenure_date is effectively the opposite of the recency_date and is calculated from the min() of the date_created field rather than the max(), while the tenure metric is the time delta between the tenure_date and the current date in days, expressed as an integer. In related models, such as LRIFMQ (Length, Recency, Interval, Frequency, Monetary, Quantity) the T or Tenure value I have used is the same as the L or Length value (Matz and Hermawan, 2020).

df_tenure = df.groupby('customer_id')['date_created'].min().reset_index()
df_tenure.columns = ['customer_id','tenure_date']
df_customers = df_customers.merge(df_tenure, on='customer_id')

df_customers['tenure'] = round((pd.to_datetime('today') - df_customers['tenure_date']) \
                               / np.timedelta64(1, 'D') ).astype(int)

df_customers.head()

	customer_id	recency_date	recency	frequency	tenure_date	tenure
0	78503	2010-12-09 12:19:19	3601	1	2010-12-09 12:19:19	3601
1	78505	2010-12-09 12:45:22	3601	1	2010-12-09 12:45:22	3601
2	76070	2020-10-14 16:14:27	4	15	2010-12-09 16:00:16	3601
3	56387	2010-12-09 17:07:33	3601	1	2010-12-09 17:07:33	3601
4	46307	2010-12-09 21:27:09	3601	1	2010-12-09 21:27:09	3601

Calculating variety

Variety is a measure of the number of unique SKUs purchased. It’s a useful way to understand how much “share of wallet” you might be getting. For example, low numbers could indicate that customers are purchasing some of their products from competitors, perhaps because your prices are too high or you don’t stock what they want. We can calculate variety by counting the nunique() values in the variant_id column for each customer.

df_variety = df.groupby('customer_id')['variant_id'].nunique().reset_index()
df_variety.columns = ['customer_id','variety']
df_customers = df_customers.merge(df_variety, on='customer_id')

df_customers.head()

	customer_id	recency_date	recency	frequency	monetary	tenure_date	tenure	variety
0	78503	2010-12-09 12:19:19	3601	1	39.99	2010-12-09 12:19:19	3601	1
1	78505	2010-12-09 12:45:22	3601	1	75.95	2010-12-09 12:45:22	3601	6
2	76070	2020-10-14 16:14:27	4	15	924.12	2010-12-09 16:00:16	3601	7
3	56387	2010-12-09 17:07:33	3601	1	139.98	2010-12-09 17:07:33	3601	1
4	46307	2010-12-09 21:27:09	3601	1	105.00	2010-12-09 21:27:09	3601	1

Calculating brands

Like variety, the brands metric is simply a count of the number of unique brands purchased and it can be calculated in the same way and added to the df_customers dataframe.

df_brands = df.groupby('customer_id')['brand_name'].nunique().reset_index()
df_brands.columns = ['customer_id','brands']
df_customers = df_customers.merge(df_brands, on='customer_id')

df_customers[['customer_id','variety','brands']].head()

	customer_id	variety	brands
0	78503	1	1
1	78505	6	5
2	76070	7	3
3	56387	1	1
4	46307	1	1

Calculating AOV

Average order value (or AOV) is one of the standard metrics in ecommerce. We can calculate this by dividing the total spend for each customer from the monetary column by the number of orders they’ve placed, which is stored in the frequency column. Usually, when the site is being managed well, you’d normally expect to see AOV rise as more orders are placed.

df_customers['avg_order_value'] = round(df_customers['monetary'] / df_customers['frequency'],2)

df_customers[['customer_id','variety','brands', 'avg_order_value']].head()

	customer_id	variety	brands	avg_order_value
0	78503	1	1	39.99
1	78505	6	5	75.95
2	76070	7	3	61.61
3	56387	1	1	139.98
4	46307	1	1	105.00

Calculating average basket size

Basket size represents the total number of items in the basket. Most ecommerce sites try to encourage customers to purchase more items, so this is a useful KPI. It can sometimes vary quite a lot across customers, so can be useful in segmenting them in order to try and grow basket sizes on those who only make smaller orders.

df_items = df.groupby('customer_id')['quantity'].sum().reset_index()
df_items.columns = ['customer_id','items']
df_customers = df_customers.merge(df_items, on='customer_id')

df_customers['avg_basket_size'] = round(df_customers['items'] / df_customers['frequency'],2)

df_customers[['customer_id','variety','brands', 'avg_order_value','avg_basket_size']].head()

	customer_id	variety	brands	avg_order_value	avg_basket_size
0	78503	1	1	39.99	1.00
1	78505	6	5	75.95	7.00
2	76070	7	3	61.61	1.47
3	56387	1	1	139.98	2.00
4	46307	1	1	105.00	2.00

Creating an ABC customer classification

ABC classification is an approach that comes from operations management. It’s usually applied to the classification of product inventories, so procurement managers can better manage their stock and avoid costly stock outs.

It’s based on the Pareto principle, or 80/20 rule and allocates products (or customers) to groups A, B, or C, based on their cumulative contribution to the overall revenue. Customers in Class A contribute 80% of the cumulative running total percentage of revenue, while those in B generate the next 10%, and those in Class C, generate the bottom 10%.

In my experience, when using this approach on customers it works best when you calculate it over two periods, say 12 months and 24 months, as this allows you to see relative change in their contribution. It also makes sense, especially on well established ecommerce sites, to assign the lapsed customers to class D. It’s slightly more complex to calculate than the RFM metrics, so to make it simpler we’ll create a function we can re-use for the two date periods.

def classify_customer(percentage):
    """Apply an ABC classification to each customer based on 
    its ranked percentage revenue contribution. Any split 
    can be used to suit your data. 
    
    :param percentage: Running percentage of revenue contributed
    :return: ABC classification
    """
    
    if percentage > 0 and percentage <= 80:
        return 'A'
    elif percentage > 80 and percentage <= 90:
        return 'B'
    else:
        return 'C'

To classify the customers we’ll split the customers dataframe based on the number of months specified using the recency metric and sort the dataframe from highest spend to lowest. Then we’ll calculate the cumulative sum of the monetary column and calculate the running percentage of revenue contributed.

Using the percentage value, we can then run the classify_customer() function above and determine which class they’re in, while their rank is obtained using the Pandas rank() function. Customers with a recency greater than the months value didn’t purchase and get assigned to Class D and have a rank at the bottom of the pile.

def abc(df, months, abc_class_name, abc_rank_name):
    """Creates a customer ABC classification based on a given number of months. 
    
    :param df: Pandas dataframe containing customer metrics
    :param months: Number of months for analysis, i.e. 12 or 24
    :param abc_class_name: Column name to assign to hold the ABC class i.e. abc_class_12m
    :param abc_rank_name: Column name to assign to hold the ABC rank i.e. abc_rank_12m

    """
    
    # Split data according to time period
    df_purchased = df[df['recency'] <= (months * 30)]    
    df_purchased = df_purchased.sort_values(by='monetary', ascending=False)
    
    # Calculate temporary metrics
    df_purchased['revenue_cumsum'] = df_purchased['monetary'].cumsum()
    df_purchased['revenue_total'] = df_purchased['monetary'].sum()
    df_purchased['revenue_running_percentage'] = \
    (df_purchased['revenue_cumsum'] / df_purchased['revenue_total']) * 100
    
    # Assign values and drop temporary metrics
    df_purchased[abc_class_name] = df_purchased['revenue_running_percentage'].apply(classify_customer)
    df_purchased[abc_rank_name] = df_purchased['revenue_running_percentage'].rank().astype(int)
    df_purchased.drop(['revenue_cumsum','revenue_total','revenue_running_percentage'], axis=1, inplace=True)
    
    # Assign lapsed customers to class D
    df_lapsed = df[df['recency'] > (months * 30)]
    df_all = df_purchased.append(df_lapsed)
    
    df_all[abc_class_name].fillna('D', inplace=True)
    df_all[abc_rank_name].fillna(len(df_purchased)+1, inplace=True)

    return df_all

Finally, we can run the function on 12 and 24 month periods and assign the values back to our customers dataframe.

df_final = abc(df_customers, 12, 'abc_class_12m', 'abc_rank_12m')
df_final = abc(df_final, 24, 'abc_class_24m', 'abc_rank_24m')

df_final[['customer_id','abc_class_12m','abc_rank_12m','abc_class_24m','abc_rank_24m']].head()

	customer_id	abc_class_12m	abc_rank_12m	abc_class_24m	abc_rank_24m
1260	81750	A	1.0	A	1.0
482	107397	A	2.0	A	2.0
555	119446	A	3.0	A	3.0
2512	83853	A	4.0	A	4.0
6230	124642	A	5.0	A	5.0

df_final[['customer_id','abc_class_12m','abc_rank_12m','abc_class_24m','abc_rank_24m']].tail()

	customer_id	abc_class_12m	abc_rank_12m	abc_class_24m	abc_rank_24m
9197	128282	D	5832.0	D	9870.0
9198	128283	D	5832.0	D	9870.0
9199	128284	D	5832.0	D	9870.0
9200	128286	D	5832.0	D	9870.0
9201	128288	D	5832.0	D	9870.0

As you can see, our customers are now neatly assigned to classes based on their contribution to the cumulative revenue over both 12 and 24 month periods, and they’re ranked across the business and within each class.

df_final.abc_class_24m.value_counts()

D    4784
C    4718
A    3573
B    1578
Name: abc_class_24m, dtype: int64

df_final.abc_class_12m.value_counts()

D    8822
C    2662
A    2326
B     843
Name: abc_class_12m, dtype: int64

How to use the Pandas truncate() function

Have you ever needed to chop the top or bottom off a Pandas dataframe, or extract a specific section from the middle? If so, there’s a Pandas function called truncate()...

How to use Spacy for noun phrase extraction

Noun phrase extraction is a Natural Language Processing technique that can be used to identify and extract noun phrases from text. Noun phrases are phrases that function grammatically as nouns...

How to use the Pandas filter() function

The Pandas filter() function is used to filter a dataframe based on the column names, rather than the column values, and is useful in creating a subset dataframe containing only...

How to segment customers using RFM and ABC

Creating a value-based segmentation using RFM and ABC is a great way to tell the good customers from the bad ones. Here's how it's done.

Extract and load your data

Ensure the date column is set

Create a customers dataframe

Calculating recency

Calculating frequency

Calculating monetary

Calculating tenure

Calculating variety

Calculating brands

Calculating AOV

Calculating average basket size

Creating an ABC customer classification

Further reading

How to assign RFM scores with quantile-based discretization

How to engineer customer purchase latency features

How to use sort_values() to sort a Pandas DataFrame

How to bin or bucket customer data using Pandas

How to calculate CLV using BG/NBD and Gamma-Gamma

How to select, filter, and subset data in Pandas dataframes

How to use Pandas from_records() to create a dataframe

How to calculate an exponential moving average in Pandas

How to use the Pandas map() function

How to use Pandas pipe() to create data pipelines

How to use Pandas assign() to create new dataframe columns

How to measure Python code execution times with timeit

How to use Pandas from_records() to create a dataframe

How to calculate an exponential moving average in Pandas

How to use the Pandas map() function

How to use Pandas pipe() to create data pipelines

How to use Pandas assign() to create new dataframe columns

How to measure Python code execution times with timeit

How to segment customers using RFM and ABC

Creating a value-based segmentation using RFM and ABC is a great way to tell the good customers from the bad ones. Here's how it's done.

Extract and load your data

Ensure the date column is set

Create a customers dataframe

Calculating recency

Calculating frequency

Calculating monetary

Calculating tenure

Calculating variety

Calculating brands

Calculating AOV

Calculating average basket size

Creating an ABC customer classification

Further reading

Other posts you might like

Have you ever needed to chop the top or bottom off a Pandas dataframe, or extract a specific section from the middle? If so, there’s a Pandas function called truncate()...

Noun phrase extraction is a Natural Language Processing technique that can be used to identify and extract noun phrases from text. Noun phrases are phrases that function grammatically as nouns...

The Pandas filter() function is used to filter a dataframe based on the column names, rather than the column values, and is useful in creating a subset dataframe containing only...

Get the newsletter