How to analyse product replenishment

Picture by Andrea Piacquadio, Pexels.

21 minutes to read

Subscription commerce was all the rage for a while, but it’s not really become as popular as many in ecommerce perhaps envisaged. While we may have subscriptions for certain things, most of us have continued to buy consumable products in the same way - by replenishing them whenever we run out.

Online retailers are wise to this common behaviour, and many create specific product replenishment email campaigns to target such customers. According to some marketing studies, these product replenishment emails can have the highest click through rates of any marketing automation emails, so they’re worth adding to your email marketing arsenal.

Product replenishment email campaigns

Product replenishment emails aim to contact each customer just before they are due to replenish a given product and give them with a gentle reminder to buy some more before they run out. They work because they’re timely and relevant. However, replenishment campaigns are harder to get right than you might first imagine, because customers all behave in slightly different ways.

Volume purchased	Customers who buy a product in bulk may well need to stock up again less frequently than those who purchase smaller quantities more frequently. Similarly, an individual customer who usually purchases, say, a 2kg pack of protein powder which lasts them a week, might not need to purchase again at their usual interval if their last order was for a 10kg sack.
Consumption rate	Different people have different rates of product consumption, so they won't all replenish their products at the same rates. The mean time to replenish a given product might be 30 days, but some customers will use the product more quickly and need to replenish weekly, while for others their pack could last them months.
Seasonality	Replenishment rates can also be seasonal. Certain products get replenished only during certain seasons. For example, you might mince pies several times in the run up to Christmas, but try finding them in the summer...
Product type	Some products are just not replenished. For example, electrical appliances and gadgets tend to have a longer lifetime and customers rarely need more than one. By the time the product requires replacement, it's usually obsolete and has to be replaced with an alternative of the same general type.

In this project, I’ll show you how you can use Pandas to calculate product replenishment rates for each of your products and calculate the average time it takes customers to replenish them. For simple products, the data could be used as-is to power basic product replenishment campaigns, but they also serve as the raw features for product replenishment campaign models.

Load the packages

First, open a Jupyter notebook and import the pandas, numpy, and seaborn packages. You’ll likely already have these installed, but if you don’t you can install them via the PyPi package manager by entering pip3 install package-name in your terminal or !pip3 install package-name in a cell in your Jupyter notebook.

import pandas as pd
import numpy as np
import seaborn as sns

%config InlineBackend.figure_format = 'retina'
sns.set_context('notebook')
sns.set(rc={'figure.figsize':(15, 6)})

Load the data

Next, load up your dataset of transaction items. I’m using the Online Retail dataset from the UCI Machine Learning Repository. This is widely used for demonstrations in ecommerce data science and includes a wide range of regularly purchased items from an online retailer selling party supplies. This includes a number of items that regularly get repurchased.

To tidy up the data, I have renamed the columns, and importantly, I have ordered the data in descending order of the order_date column. If you miss this vital step, you’ll get nonsensical results, since everything we’re doing relies on the data being ordered chronologically.

df = pd.read_csv('online_retail.csv', 
                 names=['order_id', 'sku', 'description', 'quantity', 'order_date', 'unit_price', 'customer_id', 'country'], 
                 skiprows=1).sort_values(by=['order_date'], ascending=False)
df.head()

	order_id	sku	description	quantity	order_date	unit_price	customer_id	country
541908	581587	22138	BAKING SET 9 PIECE RETROSPOT	3	2011-12-09 12:50:00	4.95	12680.0	France
541901	581587	22367	CHILDRENS APRON SPACEBOY DESIGN	8	2011-12-09 12:50:00	1.95	12680.0	France
541895	581587	22556	PLASTERS IN TIN CIRCUS PARADE	12	2011-12-09 12:50:00	1.65	12680.0	France
541896	581587	22555	PLASTERS IN TIN STRONGMAN	12	2011-12-09 12:50:00	1.65	12680.0	France
541897	581587	22728	ALARM CLOCK BAKELIKE PINK	4	2011-12-09 12:50:00	3.75	12680.0	France

Remove zero-value order lines

The second important thing we must do is ensure that any zero-value order lines, such as returns or free gifts, are removed from the dataset. An item that arrived damaged or was faulty and has been replaced free of charge shouldn’t be confused with a genuine replenishment of the product, so this step is critical. I’ve done this twice - once using unit_price and once using quantity and then have checked the aggregate statistics using describe().

df = df[df['unit_price'] > 0]
df = df[df['quantity'] > 0]

df.describe()

	quantity	unit_price	customer_id
count	530104.000000	530104.000000	397884.000000
mean	10.542037	3.907625	15294.423453
std	155.524124	35.915681	1713.141560
min	1.000000	0.001000	12346.000000
25%	1.000000	1.250000	13969.000000
50%	3.000000	2.080000	15159.000000
75%	10.000000	4.130000	16795.000000
max	80995.000000	13541.330000	18287.000000

Calculate each customer’s previous order date for each SKU

In order to calculate the average replenishment time for each SKU, our first step is to calculate the replenishment date for each customer-SKU combination. To do this, we’ll use the Pandas groupby() function and will pass it both the customer_id and the sku. We’ll then append the order_date column and use the shift() function to get the previous value from the order_date column using -1.

df['previous_sku_order_date'] = df.groupby(['customer_id', 'sku'])['order_date'].shift(-1)

To check that the date of each customer’s previous order for each SKU has been calculated correctly, we can filter our dataframe to look at a specific customer-SKU combination. Customer 12680 has purchased the SKU 22728 three times, first on August 18th 2011, then on September 27th 2011, and finally on December 9th 2011. For each purchase, we’ve correctly assigned the date of the previous order to the previous_sku_order_date column, and have a NaN when the purchase was a customer’s first for the given SKU.

df[(df['customer_id']==12680) & (df['sku']=='22728')].sort_values(by='order_date', ascending=True)

	order_id	sku	description	quantity	order_date	unit_price	customer_id	country	previous_sku_order_date
305794	563712	22728	ALARM CLOCK BAKELIKE PINK	4	2011-08-18 15:44:00	3.75	12680.0	France	NaN
362787	568518	22728	ALARM CLOCK BAKELIKE PINK	4	2011-09-27 12:53:00	3.75	12680.0	France	2011-08-18 15:44:00
541897	581587	22728	ALARM CLOCK BAKELIKE PINK	4	2011-12-09 12:50:00	3.75	12680.0	France	2011-09-27 12:53:00

Calculate the number of days to replenishment

Next, we’ll create a function to calculate the number of days between each of a customer’s orders of a given SKU. This takes three values: the original dataframe, the “before” date, and the “after” date. It then calculates the time delta - or the time difference between the two dates - and returns the number of days as an integer, that we can then assign to a new column.

def get_days_since_date(df, before_datetime, after_datetime):
    """Return a new column containing the difference between two dates in days.
    Args:
        df: Pandas DataFrame.
        before_datetime: Earliest datetime (will convert value)
        after_datetime: Latest datetime (will convert value)
    Returns:
        Days since previous date. 
    """
    df = df.copy()
    df[before_datetime] = pd.to_datetime(df[before_datetime])
    df[after_datetime] = pd.to_datetime(df[after_datetime])

    diff = df[after_datetime] - df[before_datetime]
    return round(diff / np.timedelta64(1, 'D')).fillna(0).astype(int)

To run the function we simply pass in the values and assign the output to a new column in our original dataframe called days_since_previous_sku_order. Running the sample() function returns a sample set of rows, so we can check that the function has worked correctly.

df['days_since_previous_sku_order'] = get_days_since_date(df, 'previous_sku_order_date', 'order_date')
df[['order_id','sku','previous_sku_order_date','days_since_previous_sku_order']].sample(5)

	order_id	sku	previous_sku_order_date
192073	553390	22705	NaN
487136	577768	90165B	NaN
359404	568188	23326	NaN
382126	569898	21787	NaN
116800	546306	22308	NaN

Calculate aggregate statistics on product replenishment

Since certain items are rarely, if ever, replenished, we also need to calculate some product-level data to identify replenishment rates across the product catalogue. We wouldn’t want to inadvertently contact a customer to tell them their TV is running out, when it’s a non-replenished item… Instead, we’d likely want to mark these as non-replenished items that we ignore from our campaigns.

Arguably, there’s also a difference between something that is replenished because it runs out (i.e., protein powder, coffee, or dog food) and something that is repurchased because it’s needed again for a completely different purpose (i.e., batteries for a new product). However, figuring this out is likely to be something that may require human input.

To calculate these data we’ll create a helper function called get_replenishments() to which we’ll pass our dataframe of transactional items. We’ll first calculate the number of times each SKU has been purchased by each customer and store it in the dataframe. Then, we’ll count the number of times it’s been purchased once only.

Next, we’ll create an aggregation containing the SKUs from the dataframe and calculate a range of metrics for each one. Then we’ll join the selected fields from our original transaction items dataframe back to the SKUs dataframe and return the output.

def get_replenishments(df):
    
    df['times_purchased'] = df.groupby(['sku', 'customer_id'])['order_id'].transform('count')
    df['times_purchased'] = df['times_purchased'].fillna(0).astype(int)
    
    df['purchased_once'] = df[df['times_purchased']==1].groupby('sku')['order_id'].transform('count')
    df['purchased_once'] = df['purchased_once'].fillna(0).astype(int)
    
    df_skus = df.groupby('sku').agg(
        total_customers=('customer_id', 'nunique'),
        total_orders=('order_id', 'nunique'),
        total_quantity=('quantity', 'sum'),
        avg_quantity=('quantity', 'mean'),
        avg_price=('unit_price', 'mean'),
        avg_replenishment_days=('days_since_previous_sku_order', 'mean'),
        std_replenishment_days=('days_since_previous_sku_order', 'std'),
        max_replenishment_days=('days_since_previous_sku_order', 'max'),
    )
    
    df_skus['orders_per_customer'] = df_skus['total_orders'] / df_skus['total_customers']
    df_skus['items_per_customer'] = df_skus['total_quantity'] / df_skus['total_customers']

    df_subset = df[['sku', 'purchased_once']].sort_values(by='purchased_once', ascending=False)
    df_subset = df_subset.drop_duplicates('sku', keep='first')
    df_skus = df_skus.merge(df_subset, on='sku', how='left')
    
    df_skus['total_replenishments'] = df_skus['total_orders'] - df_skus['purchased_once']
    df_skus['replenishment_rate'] = df_skus['total_replenishments'] / df_skus['total_orders']
    
    return df_skus

Running the function and sorting the data in descending order of total_replenishments shows us a range of products that are regularly purchased repeatedly by customers. While this is useful for powering our replenishment campaign emails, it’s potentially also worth looking at how we might be able to acquire customers more likely to purchase these SKUs, since they’re likely to have good customer lifetime values with this level of repeat purchasing.

df_skus = get_replenishments(df)
df_skus.sort_values(by='total_replenishments', ascending=False).head(20)

	sku	total_customers	total_orders	total_quantity	avg_quantity	avg_price	avg_replenishment_days	std_replenishment_days	max_replenishment_days	orders_per_customer	items_per_customer	purchased_once	total_replenishments	replenishment_rate
3387	85099B	635	2089	48474	22.951705	2.485459	27.729640	47.914234	326	3.289764	76.337008	321	1768	0.846338
3407	85123A	856	2198	37660	16.626932	3.116949	32.940397	55.590776	365	2.567757	43.995327	432	1766	0.803458
1310	22423	881	1988	13879	6.881011	13.976926	28.187407	53.463092	355	2.256527	15.753689	547	1441	0.724849
175	20725	532	1565	19553	12.258934	2.127373	29.196238	47.193996	305	2.941729	36.753759	253	1312	0.838339
2670	47566	708	1685	18295	10.723916	5.794273	21.480657	39.548225	307	2.379944	25.840395	420	1265	0.750742
1109	22197	407	1392	56921	39.916550	1.042468	24.587658	47.259088	357	3.420147	139.855037	194	1198	0.860632
1276	22383	434	1284	12552	9.466063	2.160392	27.868024	45.716550	301	2.958525	28.921659	199	1085	0.845016
3194	84879	678	1455	36461	24.486904	1.722290	33.314305	56.257717	364	2.146018	53.777286	373	1082	0.743643
177	20727	458	1273	12240	9.216867	2.098479	28.236446	49.529678	351	2.779476	26.724891	220	1053	0.827180
1279	22386	372	1218	21465	17.338449	2.596115	23.790792	43.689494	287	3.274194	57.701613	205	1013	0.831691
908	21931	333	1184	13654	11.406850	2.744837	22.206349	43.616718	297	3.555556	41.003003	186	998	0.842905
3914	POST	331	1126	3150	2.797513	31.076581	32.924512	44.244217	322	3.401813	9.516616	142	984	0.873890
1593	22720	640	1385	7493	5.355969	5.829292	26.709793	54.632458	334	2.164062	11.707812	412	973	0.702527
2056	23203	505	1231	20503	16.415532	2.274668	21.392314	33.670773	208	2.437624	40.600000	264	967	0.785540
1298	22411	375	1175	12602	10.589916	2.683807	21.716807	44.582990	337	3.133333	33.605333	224	951	0.809362
439	21212	635	1320	36419	26.583212	0.758431	23.559124	53.346187	339	2.078740	57.352756	429	891	0.675000
178	20728	479	1150	11817	10.065588	2.059353	26.731687	46.744654	294	2.400835	24.670146	269	881	0.766087
1275	22382	490	1157	10464	8.875318	2.008473	28.381679	51.915599	357	2.361224	21.355102	283	874	0.755402
2866	82482	410	1100	9006	8.106211	3.101656	27.913591	51.779763	356	2.682927	21.965854	230	870	0.790909
1599	22727	388	1051	7815	7.276536	4.389050	31.699255	56.129841	350	2.708763	20.141753	193	858	0.816365

Examine correlations with replenishment time

Finally, just to sanity check our data and see if the assumptions we made earlier are correct, we can run a Pearson correlation on the dataframe of SKU-level data. Before doing this, it’s worth zero-filling any inf values. You can also plot the outputs to see how the data are correlated.

df_skus = df_skus.replace([np.inf, -np.inf], np.nan).fillna(value=0)

df_skus[df_skus.columns[1:]].corr()['avg_replenishment_days'][:].sort_values(ascending=False)

avg_replenishment_days    1.000000
std_replenishment_days    0.919889
max_replenishment_days    0.800298
total_customers           0.597277
total_orders              0.587021
purchased_once            0.563462
total_replenishments      0.560351
total_quantity            0.385017
replenishment_rate        0.193493
items_per_customer       -0.009193
avg_quantity             -0.012034
orders_per_customer      -0.020139
avg_price                -0.022023
Name: avg_replenishment_days, dtype: float64

sns.heatmap(df_skus.corr(), annot=True, cmap="Blues")

<AxesSubplot:>

Product replenishment campaign correlation analysis

We now have a SKU-level dataset that identifies the products that are regularly replenished, and those that are not, so we can select or exclude them from our email marketing campaigns. Some retailers do use a metric such as avg_replenishment_days to estimate when a customer is likely to purchase a product again. However, the most accurate solution is likely to come from the creation of a product replenishment campaign model.

Replenishment campaign models typically use multiple linear regression and require a range of features to generate more accurate, personalised predictions. These features could include the SKU-level data above, plus the details on each individual customer-SKU relationship. With these data, they can then be trained to predict the number of days until each customer is likely to replenish their product.

Matt Clarke, Saturday, June 05, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.