You might think human behaviour would be hard to predict but, in ecommerce data science, it's surprisingly feasible to predict whether or not a customer will purchase in the next period.
Once they've placed a few orders, customers behave in quite predictable ways, and the science behind it is really quite logical. All you need to make these predictions is some transactional data, comprising one order per line, and the ability to write some Python code to manipulate the data and feed it through a model called the Beta Geometric/Negative Binomial Distribution, or BG/NBD.
The BG/NBD model is an improvement on the earlier Pareto/NBD model and uses the same "Buy 'Til You Die" approach, which allows you to calculate the probability of a customer being "alive" (or still a customer) at a given point in the future.
Here we'll take a transactional dataset, use Lifetimes to calculate some RFM metrics, and then predict the probability that each customer is alive and estimate the number of orders each one will place in the next period.
First, we’ll load our transactional data. This is standard stuff in ecommerce and comprises the unique order ID, the customer ID, the total value of the order and the date on which it was placed. In our dataset we also have the channel and country fields, but we don’t need those. We’ve also got a redundant column called “Unnamed: 0”, so we’ll drop this to tidy things up.
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

df_orders = pd.read_csv('data/orders.csv')
df_orders.drop(['Unnamed: 0'], axis=1, inplace=True)
df_orders.head()
In ecommerce data science almost everything revolves around four or five key metrics, which are really all derivatives of recency (R), frequency (F) and monetary value (M). These form the basis of the popular RFM model, which has been used in marketing for decades. The other two metrics that matter in ecommerce are tenure (T) - how long the customer has been a customer - and latency - the number of days between their orders. It's pretty easy to calculate these manually; however, we're going to use Cameron Davidson-Pilon's superb Lifetimes package, as it does this easily and gives you access to some models for analysing the data.
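To give a feel for how these metrics are derived, here is a minimal sketch that computes frequency, recency, and tenure by hand with a Pandas groupby. The customer IDs, dates, and observation end date are all hypothetical, made up purely for illustration:

```python
import pandas as pd

# Hypothetical transactions: one row per order
orders = pd.DataFrame({
    'customer_id': [1, 1, 1, 2],
    'order_date': pd.to_datetime(
        ['2020-01-01', '2020-02-01', '2020-03-01', '2020-03-01'])
})
observation_end = pd.Timestamp('2020-09-07')

# Aggregate to one row per customer
grouped = orders.groupby('customer_id')['order_date'].agg(['min', 'max', 'nunique'])

rfm = pd.DataFrame({
    # Frequency: number of repeat purchase dates (the first order doesn't count)
    'frequency': grouped['nunique'] - 1,
    # Recency: the customer's age in days at the time of their last order
    'recency': (grouped['max'] - grouped['min']).dt.days,
    # T (tenure): days between the first order and the end of observation
    'T': (observation_end - grouped['min']).dt.days,
})
print(rfm)
```

Customer 1 placed three orders, so their frequency is 2 (two repeat orders), while customer 2's single order gives a frequency and recency of zero, matching the definitions Lifetimes uses.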
First, you will need to install Lifetimes by entering pip3 install lifetimes, and then load the summary_data_from_transaction_data function from lifetimes.utils. By passing this function your Pandas dataframe of transactions and defining the columns that contain the customer ID and the order date, the Lifetimes helper will calculate the frequency, recency, and tenure (or age) for you.
from lifetimes.utils import summary_data_from_transaction_data

data = summary_data_from_transaction_data(df_orders,
                                          'customer_id',
                                          'order_date',
                                          observation_period_end='2020-09-07')
If you print the head() of the data dataframe returned by summary_data_from_transaction_data(), you'll see that it has identified each unique customer and calculated their recency, frequency and T. There are lots of different ways to calculate similar metrics, so it's worth getting to grips with exactly what Lifetimes does.
Recency represents the age of the customer in days when they made their most recent purchase and is calculated from their tenure minus the number of days since their last order. A recency of zero indicates a newly acquired customer. Frequency measures the number of repeat orders a customer has placed, so a value of zero indicates a new customer who has placed a single order, a value of 1 indicates a customer placing their second order, and so on. T measures the tenure of the customer in days - that is, how many days have elapsed since their first order.
Now that we have our basic customer data set up, we can fit a model. Lifetimes includes several models. First, we will use the BetaGeoFitter model, which provides the Beta Geometric/Negative Binomial Distribution model that is common to the so-called "Buy 'Til You Die" customer lifetime models. To fit the model, we simply pass in the dataframe columns containing the frequency, recency, and tenure data.
from lifetimes import BetaGeoFitter

bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(data['frequency'], data['recency'], data['T'])
bgf.summary
The summary shows the coef, se(coef), and lower and upper 95% confidence bounds for each of the model's fitted parameters.
To examine the output of the model, we can pass the bgf model to plot_frequency_recency_matrix(). A recency/frequency matrix can show you the probability that a customer is still a customer, or is "alive", based on their inter-purchase latency, or the gap between orders. If a customer usually orders every week and hasn't been seen for a few months, their probability of being alive is low. However, if a customer orders every few months and bought a couple of months ago, then they're probably still alive.
A typical recency/frequency matrix shows a long tail at the bottom of the matrix. In the example below, a customer who has a frequency of 200+ and had been a customer for 1200+ days when they placed their last order is likely to be alive.
from lifetimes.plotting import plot_frequency_recency_matrix

plot_frequency_recency_matrix(bgf)
To identify the probability of whether customers are alive, you can use the plot_probability_alive_matrix() method and pass it the bgf model. Here, we can see that customers who have ordered very recently are likely to be alive, and this probability goes up with the number of orders placed.
from lifetimes.plotting import plot_probability_alive_matrix

plot_probability_alive_matrix(bgf)
The other powerful thing you can do with the BG/NBD model is predict the number of purchases each customer is likely to make over the next period. Here, we set t to 30 so the model predicts the number of purchases each customer will make in the next 30 days, then we output the predictions and sort the results.
t = 30
data['predicted_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(
    t, data['frequency'], data['recency'], data['T'])
data.sort_values(by='predicted_purchases').tail(10)
To determine the fit of our BG/NBD model, and the probable accuracy of its predictions, we can plot the data to assess it. As with the other Lifetimes modules, we just need to pass the bgf output to plot_period_transactions() and it will return a Matplotlib chart showing the predictions against the actuals. If the Actual and Model bars are similar, our model is pretty good at making predictions.
from lifetimes.plotting import plot_period_transactions

plot_period_transactions(bgf)
To properly test the BG/NBD model it’s best to create a partitioned dataset. Here we will create a calibration period in which to train the model and then create a holdout period to validate our model. The model never gets to see the data in the holdout group, but we can compare the accuracy of the prediction with the known number of purchases after the model has run.
The calibration_and_holdout_data() function creates that partitioned dataset for you. It's much like the one we created above, but includes columns for the frequency, recency, and tenure during the calibration period, plus the duration of the holdout period in days and the number of purchases actually observed within the holdout period.
from lifetimes.utils import calibration_and_holdout_data

summary_cal_holdout = calibration_and_holdout_data(df_orders,
                                                   'customer_id',
                                                   'order_date',
                                                   calibration_period_end='2020-06-01',
                                                   observation_period_end='2020-08-31')
summary_cal_holdout.head()
By plotting the actual number of purchases in the holdout period against the model’s predictions we can eyeball the model’s accuracy. For customers who placed three purchases in the calibration period, we’d expect to see about 0.3 in the holdout period.
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases

bgf.fit(summary_cal_holdout['frequency_cal'],
        summary_cal_holdout['recency_cal'],
        summary_cal_holdout['T_cal'])
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
You may need to set a longer holdout period to get more useful predictions, but that depends on the dataset and the typical frequency with which your customers shop. If we re-run the model with the holdout period set to a year, instead of a few months, we can see that the predictions are pretty close to the actuals. You can, of course, output the data itself in Pandas and re-join it to your original data to double-check it at an individual customer level.
from lifetimes.utils import calibration_and_holdout_data
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases

summary_cal_holdout = calibration_and_holdout_data(df_orders,
                                                   'customer_id',
                                                   'order_date',
                                                   calibration_period_end='2019-08-31',
                                                   observation_period_end='2020-08-31')
bgf.fit(summary_cal_holdout['frequency_cal'],
        summary_cal_holdout['recency_cal'],
        summary_cal_holdout['T_cal'])
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
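The re-joining step mentioned above is just a standard Pandas merge on the customer ID. Here is a self-contained sketch of the pattern; the per-customer predictions and order rows are mocked up with hypothetical values, since in practice they would come from the model output and your orders dataframe:

```python
import pandas as pd

# Hypothetical per-customer predictions, indexed by customer_id as
# Lifetimes produces them (values mocked up for illustration)
data = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'predicted_purchases': [0.8, 0.1, 1.4],
}).set_index('customer_id')

# Hypothetical original transactional data
df_orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 1, 2, 3],
    'total_value': [20.0, 35.0, 15.0, 50.0],
})

# Re-join the model output to the transactions on customer_id
merged = df_orders.merge(data.reset_index(), on='customer_id', how='left')
print(merged.head())
```

Using a left join keeps every order row and simply annotates it with the customer-level prediction, which makes it easy to spot-check individual customers against their purchase history.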
Matt Clarke, Wednesday, March 03, 2021