How to assign RFM scores with quantile-based discretization

Pictures by Erik Mclean, Remy Loz, Joel Forestbird, Unsplash.

44 minutes to read

RFM segmentation is one of the oldest and most effective ways to segment customers. RFM models are based on three simple values - recency, frequency, and monetary value - which define for each customer how long it’s been since their last order, how many times they’ve purchased, and how much they’ve spent.

These three simple metrics are extremely powerful in predicting future customer behaviour and allow you to visualise their differences when segmented.

By binning the raw RFM metrics into quintiles from 1-5, you can create individual scores that allow you to compare each customer’s recency, frequency, and monetary value to those of others. Customers who have shopped recently get a 5, while those who’ve not been seen for a long time get a 1. Customers who have made lots of orders get a 5 for frequency, while those who’ve placed a single order are assigned a 1. Customers who’ve spent the most get a 5 for monetary, while those who’ve spent the least get a 1.

In combination, the three digit RFM scores cover 125 different states from 111 (least recency, lowest frequency, lowest monetary value) to 555 (most recent, highest frequency, highest monetary value) allowing marketers and sales staff to understand where a customer sits in relation to others without needing to see or analyse the raw data of potentially tens of thousands of other people.

While it’s an old methodology, RFM is still 100% valid today and is still subject to new research. Just this year a number of new papers on RFM have been published which document some of the processes savvy retailers are using.

We’re going to tackle one of these new approaches to assigning RFM scores here by comparing the use of the standard quantile-based discretization approach to the use of K-mean clustering and K-means clustering with an additional log transformation.

Load the packages

For this project we’ll be using Pandas and Numpy for data manipulation, the KMeans model from the Scikit Learn cluster package, and Seaborn for creating some simple charts to visualise the statistical distribution of our data.

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans

Load your transactional data

You don’t need particularly sophisticated raw data in order to calculate the underlying recency, frequency, and monetary metrics. They’re based on standard transactional data which needs to comprise the customer ID, the order ID, the total order revenue, and the date the order was placed. Load your data into a Pandas dataframe and set your order date field to datetime format.

df = pd.read_csv('orders.csv', low_memory=False)
df = df[['order_id','customer_id','total_revenue','date_created']]
df['date_created'] = pd.to_datetime(df['date_created'])
df.head(3)

	order_id	customer_id	total_revenue	date_created
0	299527	166958	74.01	2017-04-07 05:54:37
1	299528	191708	44.62	2017-04-07 07:32:54
2	299529	199961	16.99	2017-04-07 08:18:45

Create a dataframe of unique customers

As RFM is a customer-level metric, we next need to create a dataframe of unique customers from our initial transactional dataset. We can do this using the Pandas unique() function to select only the unique customer_id values and then rename the column accordingly.

df_customers = pd.DataFrame(df['customer_id'].unique())
df_customers.columns = ['customer_id']
df_customers.head()

	customer_id
0	166958
1	191708
2	199961
3	199962
4	199963

Calculate customer recency

Recency is a measure of the number of days since the customer last placed an order. Therefore, to calculate this we need to first identify the date of their most recent transaction, which we can do using df.groupby('customer_id')['date_created'].max().reset_index(). Once we’ve got that we can calculate the timedelta between the current datetime and the date of their last order and return the number of days that have elapsed.

df_recency = df.groupby('customer_id')['date_created'].max().reset_index()
df_recency.columns = ['customer_id','recency_date']
df_customers = df_customers.merge(df_recency, on='customer_id')
df_customers['recency'] = round((pd.to_datetime('today') - df_customers['recency_date'])\
                                / np.timedelta64(1, 'D') ).astype(int)

Calculate order frequency

Order frequency is simply the number of orders each customer has placed. This is easy to calculate. Just groupby() the customer_id column and return the number of unique values in the order_id column.

df_frequency = df.groupby('customer_id')['order_id'].nunique().reset_index()
df_frequency.columns = ['customer_id','frequency']
df_customers = df_customers.merge(df_frequency, on='customer_id')

Calculate monetary value

The monetary value column is simply the sum() of the total_revenue column for each customer. (Some researchers do favour the mean() of the total_revenue or the AOV, so feel free to try this approach.) Calculate the value and then merge it back onto the customers dataframe. I’ve saved my processed data to a separate file so I can use it elsewhere, but you can skip this if you don’t need it.

df_monetary = df.groupby('customer_id')['total_revenue'].sum().reset_index()
df_monetary.columns = ['customer_id','monetary']
df_customers = df_customers.merge(df_monetary, on='customer_id')

df_customers.head()

	customer_id	recency_date	recency	frequency	monetary
0	166958	2019-12-18 06:01:18	315	5	188.92
1	191708	2020-04-11 07:49:24	199	15	683.79
2	199961	2017-04-07 08:18:45	1299	1	16.99
3	199962	2017-04-07 08:20:00	1299	1	11.99
4	199963	2017-04-07 08:21:34	1299	1	14.49

df_customers.to_csv('raw_rfm_metrics.csv', index=False)

Examine the statistical distributions

Next we’ll use Seaborn to plot the statistical distributions of the recency, frequency, and monetary data. The frequency and monetary values are heavily skewed, but the recency is far less so. As this retailer has strong seasonality, there are some interesting curves in the slope. As you might imagine, this introduces some challenges, which we’ll see later.

ax = sns.distplot(df_customers['recency'])

png

ax = sns.distplot(df_customers['frequency'])

png

ax = sns.distplot(df_customers['monetary'])

png

Calculating RFM scores using quantile-based discretization

Calculate the R score

For recency, customers with a lower recency are more valuable than those with a higher recency. For example, someone who shopped 7 days ago is far more likely to be a customer than someone last seen 1298 days ago. As a result, we score these differently to the frequency and monetary labels. To create five approximately equal segments of customers we’ll apply qcut() to the recency column and assign labels from 5 to 1.

df_customers['r'] = pd.qcut(df_customers['recency'], q=5, labels=[5, 4, 3, 2, 1])

If you run an aggregation and group the data by the r score and count the number of customers and their minimum, maximum, and mean recency, you’ll see some clear splits in the data. The ones with a score of 5 are very recent, last seen between 6 and 122 days ago, while those with a score of 1 were last seen 776 to 1298 days ago.

df_customers.groupby('r').agg(
    count=('customer_id', 'count'),
    min_recency=('recency', min),
    max_recency=('recency', max),
    std_recency=('recency', 'std'),    
    avg_recency=('recency', 'mean')
).sort_values(by='avg_recency')

	count	min_recency	max_recency	std_recency	avg_recency
r
5	36371	7	123	32.914273	66.371367
4	36185	124	189	18.027304	160.338428
3	35768	190	436	82.832877	278.756794
2	36058	437	776	92.764614	562.302263
1	36076	777	1299	163.633135	1013.147078

Calculate the M score

Since a higher monetary value is a better thing for the business, we score customers here from 1 to 5. Those who spend the most get a five and those who spend the least get a 1. We can use the same approach for qcut() and just switch the direction of the labels.

df_customers['m'] = pd.qcut(df_customers['monetary'], q=5, labels=[1, 2, 3, 4, 5])

Generating the grouped statistics for the monetary score shows us that customers with a score of 1 have spent between zero (presumably because they returned or canceled their order) and £16.87, while those with a score of 5 have spent between £114.99 and £1,036,590. Clearly, there’s a big difference between the spends of some of the customers.

df_customers.groupby('m').agg(
    count=('customer_id', 'count'),
    min_monetary=('monetary', min),
    max_monetary=('monetary', max),
    std_monetary=('monetary', 'std'),
    avg_monetary=('monetary', 'mean')
).sort_values(by='avg_monetary')

	count	min_monetary	max_monetary	std_monetary	avg_monetary
m
1	36096	0.00	16.87	3.545414	11.714210
2	36087	16.88	31.35	4.336020	23.365174
3	36095	31.36	54.42	6.821548	41.994196
4	36106	54.43	114.99	17.146486	79.267260
5	36074	114.99	1036590.86	5497.076346	353.171247

This really highlights the potential drawback of using quantile-based discretization on skewed data. While monetary scores from 1 to 4 show low standard deviations, the standard deviation in the 5 segment is so vast that it makes the segment almost unusable by marketers. While the segments have been created in the recommended manner, this shows we sometimes need to use an alternative approach to binning based on the underlying statistical distributions in the data.

Calculate the F score

Using qcut() quantile-based discretization to create frequency scores is more challenging because the typical spread of order volumes means that bin edges may overlap. If you experience this, qcut() will throw an error telling you “ValueError: Bin edges must be unique” and that you can “You can drop duplicate edges by setting the ‘duplicates’ kwarg”. You can avoid this by appending .rank(method='first').

df_customers['f'] = pd.qcut(df_customers['frequency'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5])
df_customers.head()

	customer_id	recency_date	recency	frequency	monetary	r	m	f
0	166958	2019-12-18 06:01:18	315	5	188.92	3	5	5
1	191708	2020-04-11 07:49:24	199	15	683.79	3	5	5
2	199961	2017-04-07 08:18:45	1299	1	16.99	1	2	1
3	199962	2017-04-07 08:20:00	1299	1	11.99	1	1	1
4	199963	2017-04-07 08:21:34	1299	1	14.49	1	1	1

However, as with the monetary data, we encounter another issue with quantile-based discretization for calculating RFM scores on skewed data. It gives you lovely equal group sizes, but the scores don’t make a lot of sense because the frequency is heavily skewed resulting in a really long and quite useless tail.

df_customers.groupby('f').agg(
    count=('customer_id', 'count'),
    min_frequency=('frequency', min),
    max_frequency=('frequency', max),
    std_frequency=('frequency', 'std'),
    avg_frequency=('frequency', 'mean')
).sort_values(by='avg_frequency')

	count	min_frequency	max_frequency	std_frequency	avg_frequency
f
1	36092	1	1	0.000000	1.000000
2	36091	1	1	0.000000	1.000000
3	36092	1	1	0.000000	1.000000
4	36091	1	2	0.364138	1.157352
5	36092	2	392	4.383576	3.697218

Create the RFM segment

Finally, we can concatenate the r, f, and m values in a single column to create the RFM score for each customer. If you print out the head() of your df_customers dataframe you’ll see the raw recency, frequency, and monetary values for each customers alongside their scores.

df_customers['rfm'] = df_customers['r'].astype(str) +\
                      df_customers['f'].astype(str) +\
                      df_customers['m'].astype(str)

Create the aggregate RFM score

You can also calculate additional aggregate metrics, such as the RFM score (recency + frequency + monetary) and use weightings to give certain variables, such as recency more impact than monetary. I tend to find that the individual values get used most, as the rfm_score metric obviously conceals some detail, however, it can be useful for quick sorting.

df_customers['rfm_score'] = df_customers['r'].astype(int) +\
                            df_customers['f'].astype(int) +\
                            df_customers['m'].astype(int)
df_customers.head()

	customer_id	recency_date	recency	frequency	monetary	r	m	f	rfm	rfm_score
0	166958	2019-12-18 06:01:18	315	5	188.92	3	5	5	355	13
1	191708	2020-04-11 07:49:24	199	15	683.79	3	5	5	355	13
2	199961	2017-04-07 08:18:45	1299	1	16.99	1	2	1	112	4
3	199962	2017-04-07 08:20:00	1299	1	11.99	1	1	1	111	3
4	199963	2017-04-07 08:21:34	1299	1	14.49	1	1	1	111	3

Examining the mean_recency, mean_frequency, and mean_monetary for each of the RFM scores from 3 to 15 gives a really nice picture of our customers. The lowest scoring customers with an rfm_score of 3 who are in the rfm_segment 111 have a mean recency of 991 days, have placed a single order and spent an average of £11.

The top customers in the rfm_score 15 group and the rfm_segment 555 have been seen 61 days ago, have placed an average of 5.7 orders and spent £591. Marketers can clearly see who to target with reactivation campaigns and who they should be giving preferential treatment to ensure they don’t churn.

df_customers.groupby('rfm_score').agg(
    customers=('customer_id', 'count'),
    mean_recency=('recency', 'mean'),
    mean_frequency=('frequency', 'mean'),
    mean_monetary=('monetary', 'mean'),
).sort_values(by='rfm_score')

	customers	mean_recency	mean_frequency	mean_monetary
rfm_score
3	7356	991.781403	1.000000	11.002807
4	8195	980.658694	1.000000	22.480073
5	14741	784.199986	1.000000	26.762419
6	15797	701.943913	1.007660	43.556238
7	18184	530.294985	1.020568	70.994995
8	18173	343.364277	1.046828	44.141714
9	17427	327.299420	1.134045	64.289315
10	17807	256.091200	1.222104	63.773739
11	16690	237.216058	1.473697	77.827959
12	15278	222.878125	1.836890	127.955221
13	12732	157.297204	2.385171	168.094980
14	9718	103.189648	2.794814	241.868315
15	8360	61.600478	5.739234	591.236373

Overcoming skewness when calculating RFM scores

As we’ve seen above, using quantile-based discretization for the creation of RFM segments on a highly skewed dataset can cause some issues. There are a few things that you can do to work around this. If it makes sense to do so, you could simply remove the outliers (those massive customers likely need preferential treatment anyway), or you could log transform the data before binning. However, there are two more common approaches:

Option 1: Use a time constraint

The easiest option would be to calculate the RFM metrics over a shorter period so the level of skewness is greatly reduced. For example, we could limit our scoring to just those customers who purchased in the last two years the standard deviation will drop massively. However, while this is commonly used, it does leave you with only a fraction of your customers with scores assigned and makes targeting lapsed customers much trickier.

Option 2: Use unequal binning

The other approach, and the one I prefer, is to use unequal binning. While the standard method of RFM scoring is to create equal bins with up to 125 quintiles from 111 to 555, in most cases they’re only used to get a basic understanding of the customers for targeting purposes.

The scores themselves are not generally used as model features, we use the raw data instead, so the sizes of the bins may not really matter that much. For practical marketing purposes, I think it’s better to have usable bins than full bins. K-means clustering is becoming one of the most popular ways to perform this binning on such datasets, and has been covered in several papers recently.

png

K-means clustering

When you’re using unequal binning on RFM data there are a number of different algorithms you can apply. One of the simplest is to use define the bounds for each binning using cut(), but this would mean we’d need to adjust our code as the customer base changed. To avoid that, we could either manually calculate the bounds using percentiles to automatically create dynamic bin edges, or we could use the Fisher-Jenks algorithm to determine the natural bounds and then pass those values into cut().

However, the easiest option is to use an unsupervised clustering algorithm such as K-means and let it figure things out using each of the raw RFM metrics. K-means clustering is a good choice for solving the problem of skewness we experienced in this data set. However, there’s an important factor to remember when using it for binning RFM scores.

Ordinarily, the labels assigned to K-means clusters aren’t important, but in RFM segmentation they need to denote the correct ordering of the data. As a result, we need to sort them, and ensure the recency sort order gives higher scores to lower values, and the frequency and monetary sorting gives higher scores to higher values. Here’s why:

df_clusters = df_customers[['recency']]

kmeans = KMeans(n_clusters=5)
kmeans.fit(df_clusters)

KMeans(n_clusters=5)

As you can see from the data below, if you fit a KMeans model to the recency column data and don’t sort it, the cluster numbers assigned don’t follow the average recency of each cluster. The cluster with the lowest recency would be R=1 in RFM terms, but occupies position 3 of the index. We need to sort the data like this then reassign the correct segment to each clustered customer.

df_clusters = df_clusters.assign(cluster=kmeans.labels_)
df_clusters.groupby('cluster')['recency'].mean().sort_values(ascending=False).to_frame()

	recency
cluster
2	1172.660823
1	839.627699
4	497.569594
0	196.813082
3	73.539744

Sorting K-means clusters

It’s a bit of a ball-ache to sort the K-Means clusters so they’re in the right order. To achieve this I wrote the function below. It’s a bit complex but it basically lets you fit a K-Means model to a given metric column (i.e. recency, frequency, or monetary), clusters the data and then re-order it so the cluster label (i.e. 1-5) matches the mean distribution of the metric within the dataset.

Since recency is sorted the opposite way to frequency and monetary (because lower recency is better, while higher frequency/monetary are better), it also handles sort order and lets you log+1 transform the data to handle skewed statistical distributions. There’s probably a more elegant and less hacky solution to this quick version, so please let me know in the comments if you know one!

def sorted_kmeans(df, metric_column, cluster_name, ascending=True, log=False):
    """Runs a K-means clustering algorithm on a specific metric column
    in a Pandas dataframe; sorts the data in a specified direction; and
    reassigns cluster numbers to match the data distribution, so they 
    are appropriate for RFM segmentation. Includes a log+1 transformation
    for heavily skewed datasets.

    :param df: Pandas dataframe containing RFM data
    :param metric_column: Column name of metric to cluster, i.e. 'recency'
    :param cluster_name: Name to assign to clustered metric, i.e. 'R'
    :param ascending: Sort ascending (M and F), or descending (R)
    :return
        Original Pandas dataframe with additional column
    """

    if log:
        df[metric_column] = np.log(df[metric_column]+1)

    # Fit the model
    kmeans = KMeans(n_clusters=5)
    kmeans.fit(df[[metric_column]])

    # Assign the initial unsorted cluster
    initial_cluster = 'unsorted_'+cluster_name
    df[initial_cluster] = kmeans.predict(df[[metric_column]])+1
    df[cluster_name] = df[initial_cluster]

    # Group the clusters and re-rank to determine the correct order
    df_sorted = df.groupby(initial_cluster)[metric_column].mean().round(2).reset_index()
    df_sorted = df_sorted.sort_values(by=metric_column, ascending=ascending).reset_index(drop=True)
    df_sorted[cluster_name] = df_sorted[metric_column].rank(method='max', ascending=ascending).astype(int)

    # Merge data and drop redundant columns
    df = df.merge(df_sorted[[cluster_name, initial_cluster]], on=[initial_cluster])
    df = df.drop(initial_cluster, axis=1)
    df = df.drop(cluster_name+'_x', axis=1)
    df = df.rename(columns={cluster_name+'_y':cluster_name})

    return df

To create the RFM segments using K-means clustering we can simply run the function three times, and assign the resulting dataframe back to the function so it adds the next value to the data. As I’ve not defined the log argument, the data here aren’t being log transformed.

df_customers = sorted_kmeans(df_customers, 'recency', 'R', ascending=False)
df_customers = sorted_kmeans(df_customers, 'frequency', 'F', ascending=True)
df_customers = sorted_kmeans(df_customers, 'monetary', 'M', ascending=True)
df_customers.head()

	customer_id	recency_date	recency	frequency	monetary	r	m	f	rfm	rfm_score	R	F	M
0	166958	2019-12-18 06:01:18	315	5	188.92	3	5	5	355	13	4	2	1
1	12052	2020-03-20 18:35:15	221	3	208.59	3	5	5	355	13	4	2	1
2	199983	2020-02-20 14:17:46	250	3	61.20	3	4	5	354	12	4	2	1
3	199994	2020-05-20 10:07:15	160	5	202.69	4	5	5	455	14	4	2	1
4	200003	2020-04-15 14:31:19	195	3	93.87	3	4	5	354	12	4	2	1

Examining the statistics for the R segment shows we’ve got a lot of fairly recent customers, indicating that this company is doing well acquisition and customer retention wise. There’s plenty of data in each segment to allow marketers enough data to make their reactivation campaigns viable.

df_customers.groupby('R').agg(
    customers=('customer_id', 'count'),
    min_recency=('recency', min),
    max_recency=('recency', max),
    std_recency=('recency', 'std'),
    avg_recency=('recency', 'mean')
).sort_values(by='avg_recency')

	customers	min_recency	max_recency	std_recency	avg_recency
R
5	41012	7	136	36.946592	73.539744
4	57836	137	348	47.805535	196.975154
3	39562	349	669	77.355098	498.059653
2	25359	670	1007	82.002497	840.557632
1	16689	1008	1299	77.827297	1173.467374

The statistics for F show that a huge number of customers have only placed one order, which is something that’s very common in ecommerce. The customers with a score of 5 for frequency are very different from the others and have much higher average frequencies.

df_customers.groupby('F').agg(
    customers=('customer_id', 'count'),
    min_frequency=('frequency', min),
    max_frequency=('frequency', max),
    std_frequency=('frequency', 'std'),
    avg_frequency=('frequency', 'mean')
).sort_values(by='avg_frequency')

	customers	min_frequency	max_frequency	std_frequency	avg_frequency
F
1	161526	1	2	0.348430	1.141395
2	15263	3	6	0.979677	3.787067
3	3190	7	15	2.274094	9.277743
4	473	16	87	8.262115	22.012685
5	6	145	392	96.099254	218.333333

Finally, the M score shows a similar picture to F, due to all those single-order customers, while at the opposite end of the spectrum we’ve got one massively lucrative customer who generates a metric shit-tonne of revenue. We end up with segments that better indicate the value of the customers to the business, but which are obviously rather unequal in size.

df_customers.groupby('M').agg(
    customers=('customer_id', 'count'),
    min_monetary=('monetary', min),
    max_monetary=('monetary', max),
    std_monetary=('monetary', 'std'),
    avg_monetary=('monetary', 'mean')
).sort_values(by='avg_monetary')

	customers	min_monetary	max_monetary	std_monetary	avg_monetary
M
1	172489	0.00	374.42	68.218285	6.431079e+01
2	7766	374.46	2227.00	351.353277	6.868829e+02
3	199	2238.25	20096.58	2601.816439	3.791054e+03
4	3	41940.85	78970.38	20562.845756	5.529054e+04
5	1	1036590.86	1036590.86	NaN	1.036591e+06

png

Applying log transformation

The final approach is to use K-means clustering for the creation of the segments, but to also apply a log+1 transformation to the metrics to handle the skewness in the dataset. This gives us segments which fall somewhere between the quantile-based discretization approach and the regular non-transformed K-means, so that looks to be the best approach for this company. You can also try using StandardScaler() to see if this improves performance, as K-means does expect equal variance.

Recency via K-means with log+1 transformation

clusters_log = sorted_kmeans(df_customers, 'recency', 'R', ascending=False, log=True)
clusters_log.head()

	customer_id	recency_date	recency	frequency	monetary	r	m	f	rfm	rfm_score	F	M	R
0	166958	2019-12-18 06:01:18	5.755742	5	188.92	3	5	5	355	13	2	1	2
1	2019	2019-12-20 17:18:29	5.746203	4	187.54	3	5	5	355	13	2	1	2
2	200130	2019-12-06 14:36:49	5.789960	3	139.48	3	5	5	355	13	2	1	2
3	200395	2019-12-24 11:47:37	5.733341	4	147.78	3	5	5	355	13	2	1	2
4	196515	2019-12-29 17:07:22	5.717028	4	310.19	3	5	5	355	13	2	1	2

ax = sns.distplot(clusters_log['recency'])

png

clusters_log.groupby('R').agg(
    count=('customer_id', 'count'),
    min_recency=('recency', min),
    max_recency=('recency', max),
    std_recency=('recency', 'std'),
    avg_recency=('recency', 'mean')
).sort_values(by='avg_recency')

	count	min_recency	max_recency	std_recency	avg_recency
R
5	9008	2.079442	3.663562	0.444930	3.084151
4	23923	3.688879	4.727388	0.291616	4.293835
3	61241	4.736198	5.659482	0.202653	5.167734
2	44296	5.662960	6.508769	0.200576	6.155100
1	41990	6.510258	7.170120	0.185897	6.864239

Frequency via K-means with log+1 transformation

clusters_log = sorted_kmeans(clusters_log, 'frequency', 'F', ascending=True, log=True)
clusters_log.head()

	customer_id	recency_date	recency	frequency	monetary	r	m	f	rfm	rfm_score	M	R	F
0	166958	2019-12-18 06:01:18	5.755742	1.791759	188.92	3	5	5	355	13	1	2	4
1	173791	2019-12-18 22:04:08	5.752573	1.791759	111.31	3	4	5	354	12	1	2	4
2	201608	2020-01-01 13:28:54	5.707110	1.791759	167.23	3	5	5	355	13	1	2	4
3	185842	2019-12-27 14:11:35	5.723585	1.945910	240.22	3	5	5	355	13	1	2	4
4	180212	2019-11-30 13:37:00	5.808142	1.791759	259.74	3	5	5	355	13	1	2	4

ax = sns.distplot(clusters_log['frequency'])

png

clusters_log.groupby('F').agg(
    count=('customer_id', 'count'),
    min_frequency=('frequency', min),
    max_frequency=('frequency', max),
    std_frequency=('frequency', 'std'),
    avg_frequency=('frequency', 'mean')
).sort_values(by='avg_frequency')

	count	min_frequency	max_frequency	std_frequency	avg_frequency
F
1	138687	0.693147	0.693147	0.000000	0.693147
2	22839	1.098612	1.098612	0.000000	1.098612
3	11790	1.386294	1.609438	0.104155	1.457855
4	5445	1.791759	2.302585	0.171952	1.963479
5	1697	2.397895	5.973810	0.343545	2.702249

Monetary via K-means with log+1 transformation

clusters_log = sorted_kmeans(clusters_log, 'monetary', 'M', ascending=True, log=True)
clusters_log.head()

	customer_id	recency_date	recency	frequency	monetary	r	m	f	rfm	rfm_score	R	F	M
0	166958	2019-12-18 06:01:18	5.755742	1.791759	5.246603	3	5	5	355	13	2	4	4
1	173791	2019-12-18 22:04:08	5.752573	1.791759	4.721263	3	4	5	354	12	2	4	4
2	201608	2020-01-01 13:28:54	5.707110	1.791759	5.125332	3	5	5	355	13	2	4	4
3	185842	2019-12-27 14:11:35	5.723585	1.945910	5.485709	3	5	5	355	13	2	4	4
4	203929	2019-12-23 22:40:07	5.736572	1.791759	5.444709	3	5	5	355	13	2	4	4

ax = sns.distplot(clusters_log['monetary'])

png

clusters_log.groupby('M').agg(
    count=('customer_id', 'count'),
    min_monetary=('monetary', min),
    max_monetary=('monetary', max),
    std_monetary=('monetary', 'std'),
    avg_monetary=('monetary', 'mean')
).sort_values(by='avg_monetary')

	count	min_monetary	max_monetary	std_monetary	avg_monetary
M
1	38717	0.000000	2.908539	0.385274	2.514048
2	46879	2.909084	3.676807	0.221237	3.306364
3	47001	3.677060	4.484696	0.227218	4.053475
4	34202	4.484809	5.539654	0.291592	4.924446
5	13659	5.539890	13.851449	0.541797	6.165789