While the Recency, Frequency, Monetary value or RFM approach to customer segmentation might be old, it’s based on sound science, so no matter what customer model you’re building, it’s generally the RFM metrics which rank mostly highly when you examine model feature importance.
The raw RFM variables can be used to create RFM scores, or you can use them in ABC analysis. This is based upon the ABC inventory classification model used in operations management, and uses the Pareto principle (or 80/20 rule) to group customers together in segments based on their revenue contribution.
While these metrics can be time-consuming and processor intensive to calculate in SQL, they’re quick and easy to create in Python, thanks to Pandas. Here’s how it’s done.
I’m using a transactional item dataset, rather than a transactional dataset, for this project because I want to look at some brand-level metrics. However, you can perform a similar analysis using standard transactional data too. Export the data from your database or data warehouse and load it into a Pandas dataframe.
import pandas as pd import numpy as np
df = pd.read_csv('transaction_items.csv', low_memory=False) df.head(3)
Since we’ll be doing some calendrical calculations, it’s important to ensure that your dataset includes a date field and that it is recognised as a datetime format. You can check the format of the columns using
df.info(). If you find that the date column is an object, you can convert it to a datetime using the Pandas
<class 'pandas.core.frame.DataFrame'> RangeIndex: 74079 entries, 0 to 74078 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_item_id 74079 non-null int64 1 order_id 74079 non-null int64 2 customer_id 74079 non-null int64 3 transaction_id 74079 non-null object 4 variant_id 74079 non-null int64 5 brand_name 74078 non-null object 6 quantity 74079 non-null int64 7 line_revenue 74079 non-null float64 8 date_created 74079 non-null object dtypes: float64(1), int64(5), object(3) memory usage: 5.1+ MB
df['date_created'] = pd.to_datetime(df['date_created'])
In order to store out customer segmentation data we’ll first need to create a new dataframe containing all of the unique customers within the dataframe. We’ll use
unique() to select these and assign them to a dataframe called
df_customers, then we’ll set the column name as
df_customers = pd.DataFrame(df['customer_id'].unique()) df_customers.columns = ['customer_id'] df_customers.head()
The recency variable in RFM is a measure of how long it’s been since the customer’s last purchase. Some models require the use of the recency date itself (i.e. the date of the last purchase) so we’ll calculate and store this, as well as the recency metric, or the number of days since the customer’s last order. Customers who have a very high number for recency haven’t been seen for ages and are more likely to be lapsed than those with lower numbers.
recency_date represent the
max() value from the
date_created column for each customer, while the
recency metric represents the time delta between the current date and the
recency_date in days. If you cast this to an integer you’ll be able to group the data more easily.
df_recency = df.groupby('customer_id')['date_created'].max().reset_index() df_recency.columns = ['customer_id','recency_date'] df_customers = df_customers.merge(df_recency, on='customer_id')
df_customers['recency'] = round((pd.to_datetime('today') - df_customers['recency_date'])\ / np.timedelta64(1, 'D') ).astype(int)
Frequency is a measure of the number of orders a customer has placed. To calculate this metric for each customer we can simpply count the number of unique order IDs. Once we’ve obtained the value for each customer, we can store it on the
df_customers dataframe with our other data.
df_frequency = df.groupby('customer_id')['order_id'].nunique().reset_index() df_frequency.columns = ['customer_id','frequency'] df_customers = df_customers.merge(df_frequency, on='customer_id')
The monetary metric represents the total revenue generated by each customer. It can be calculated in a similar manner to the frequency. We simply
line_revenue column to obtain the total spend and then assign the value to the
df_monetary = df.groupby('customer_id')['line_revenue'].sum().reset_index() df_monetary.columns = ['customer_id','monetary'] df_customers = df_customers.merge(df_monetary, on='customer_id')
Tenure, or T, is a measure of how long a given customer has been with the company. The
tenure_date is effectively the opposite of the
recency_date and is calculated from the
min() of the
date_created field rather than the
max(), while the tenure metric is the time delta between the
tenure_date and the current date in days, expressed as an integer. In related models, such as LRIFMQ (Length, Recency, Interval, Frequency, Monetary, Quantity) the T or Tenure value I have used is the same as the L or Length value (Matz and Hermawan, 2020).
df_tenure = df.groupby('customer_id')['date_created'].min().reset_index() df_tenure.columns = ['customer_id','tenure_date'] df_customers = df_customers.merge(df_tenure, on='customer_id')
df_customers['tenure'] = round((pd.to_datetime('today') - df_customers['tenure_date']) \ / np.timedelta64(1, 'D') ).astype(int)
|0||78503||2010-12-09 12:19:19||3601||1||2010-12-09 12:19:19||3601|
|1||78505||2010-12-09 12:45:22||3601||1||2010-12-09 12:45:22||3601|
|2||76070||2020-10-14 16:14:27||4||15||2010-12-09 16:00:16||3601|
|3||56387||2010-12-09 17:07:33||3601||1||2010-12-09 17:07:33||3601|
|4||46307||2010-12-09 21:27:09||3601||1||2010-12-09 21:27:09||3601|
Variety is a measure of the number of unique SKUs purchased. It’s a useful way to understand how much “share of wallet” you might be getting. For example, low numbers could indicate that customers are purchasing some of their products from competitors, perhaps because your prices are too high or you don’t stock what they want. We can calculate
variety by counting the
nunique() values in the
variant_id column for each customer.
df_variety = df.groupby('customer_id')['variant_id'].nunique().reset_index() df_variety.columns = ['customer_id','variety'] df_customers = df_customers.merge(df_variety, on='customer_id')
|0||78503||2010-12-09 12:19:19||3601||1||39.99||2010-12-09 12:19:19||3601||1|
|1||78505||2010-12-09 12:45:22||3601||1||75.95||2010-12-09 12:45:22||3601||6|
|2||76070||2020-10-14 16:14:27||4||15||924.12||2010-12-09 16:00:16||3601||7|
|3||56387||2010-12-09 17:07:33||3601||1||139.98||2010-12-09 17:07:33||3601||1|
|4||46307||2010-12-09 21:27:09||3601||1||105.00||2010-12-09 21:27:09||3601||1|
Like variety, the brands metric is simply a count of the number of unique brands purchased and it can be calculated in the same way and added to the
df_brands = df.groupby('customer_id')['brand_name'].nunique().reset_index() df_brands.columns = ['customer_id','brands'] df_customers = df_customers.merge(df_brands, on='customer_id')
Average order value (or AOV) is one of the standard metrics in ecommerce. We can calculate this by dividing the total spend for each customer from the
monetary column by the number of orders they’ve placed, which is stored in the
frequency column. Usually, when the site is being managed well, you’d normally expect to see AOV rise as more orders are placed.
df_customers['avg_order_value'] = round(df_customers['monetary'] / df_customers['frequency'],2)
Basket size represents the total number of items in the basket. Most ecommerce sites try to encourage customers to purchase more items, so this is a useful KPI. It can sometimes vary quite a lot across customers, so can be useful in segmenting them in order to try and grow basket sizes on those who only make smaller orders.
df_items = df.groupby('customer_id')['quantity'].sum().reset_index() df_items.columns = ['customer_id','items'] df_customers = df_customers.merge(df_items, on='customer_id')
df_customers['avg_basket_size'] = round(df_customers['items'] / df_customers['frequency'],2)
ABC classification is an approach that comes from operations management. It’s usually applied to the classification of product inventories, so procurement managers can better manage their stock and avoid costly stock outs.
It’s based on the Pareto principle, or 80/20 rule and allocates products (or customers) to groups A, B, or C, based on their cumulative contribution to the overall revenue. Customers in Class A contribute 80% of the cumulative running total percentage of revenue, while those in B generate the next 10%, and those in Class C, generate the bottom 10%.
In my experience, when using this approach on customers it works best when you calculate it over two periods, say 12 months and 24 months, as this allows you to see relative change in their contribution. It also makes sense, especially on well established ecommerce sites, to assign the lapsed customers to class D. It’s slightly more complex to calculate than the RFM metrics, so to make it simpler we’ll create a function we can re-use for the two date periods.
def classify_customer(percentage): """Apply an ABC classification to each customer based on its ranked percentage revenue contribution. Any split can be used to suit your data. :param percentage: Running percentage of revenue contributed :return: ABC classification """ if percentage > 0 and percentage <= 80: return 'A' elif percentage > 80 and percentage <= 90: return 'B' else: return 'C'
To classify the customers we’ll split the customers dataframe based on the number of months specified using the
recency metric and sort the dataframe from highest spend to lowest. Then we’ll calculate the cumulative sum of the
monetary column and calculate the running percentage of revenue contributed.
Using the percentage value, we can then run the
classify_customer() function above and determine which class they’re in, while their rank is obtained using the Pandas
rank() function. Customers with a
recency greater than the
months value didn’t purchase and get assigned to Class D and have a rank at the bottom of the pile.
def abc(df, months, abc_class_name, abc_rank_name): """Creates a customer ABC classification based on a given number of months. :param df: Pandas dataframe containing customer metrics :param months: Number of months for analysis, i.e. 12 or 24 :param abc_class_name: Column name to assign to hold the ABC class i.e. abc_class_12m :param abc_rank_name: Column name to assign to hold the ABC rank i.e. abc_rank_12m """ # Split data according to time period df_purchased = df[df['recency'] <= (months * 30)] df_purchased = df_purchased.sort_values(by='monetary', ascending=False) # Calculate temporary metrics df_purchased['revenue_cumsum'] = df_purchased['monetary'].cumsum() df_purchased['revenue_total'] = df_purchased['monetary'].sum() df_purchased['revenue_running_percentage'] = \ (df_purchased['revenue_cumsum'] / df_purchased['revenue_total']) * 100 # Assign values and drop temporary metrics df_purchased[abc_class_name] = df_purchased['revenue_running_percentage'].apply(classify_customer) df_purchased[abc_rank_name] = df_purchased['revenue_running_percentage'].rank().astype(int) df_purchased.drop(['revenue_cumsum','revenue_total','revenue_running_percentage'], axis=1, inplace=True) # Assign lapsed customers to class D df_lapsed = df[df['recency'] > (months * 30)] df_all = df_purchased.append(df_lapsed) df_all[abc_class_name].fillna('D', inplace=True) df_all[abc_rank_name].fillna(len(df_purchased)+1, inplace=True) return df_all
Finally, we can run the function on 12 and 24 month periods and assign the values back to our customers dataframe.
df_final = abc(df_customers, 12, 'abc_class_12m', 'abc_rank_12m') df_final = abc(df_final, 24, 'abc_class_24m', 'abc_rank_24m')
As you can see, our customers are now neatly assigned to classes based on their contribution to the cumulative revenue over both 12 and 24 month periods, and they’re ranked across the business and within each class.
D 4784 C 4718 A 3573 B 1578 Name: abc_class_24m, dtype: int64
D 8822 C 2662 A 2326 B 843 Name: abc_class_12m, dtype: int64
Matt Clarke, Sunday, March 14, 2021