ABC classification is a simple technique commonly used in inventory management, based on the Pareto principle, or 80/20 rule, which says that 80% of consequences come from 20% of causes. In ABC inventory classification, this means that 80% of product sales typically come from 20% of products, so focusing inventory management effort on the most important 20% of products can lead to better efficiency and profits, through lower effort and fewer stockouts.
The Pareto principle applies to many things in ecommerce and marketing, including customer behaviour. As a result, ABC classification can also be used in customer segmentation, since 80% of revenue or orders often come from 20% of customers. Using ABC for customer segmentation means marketing and sales team efforts can be concentrated on the most important customers, saving labour and marketing costs compared to treating all customers equally.
In this project, I’ll show you how you can use Pandas and Python to create a customer dataset from transactional data and then apply ABC classification to create an ABC customer segmentation. Let’s get started.
We only need one package for this project: Pandas. You’ll likely already have Pandas installed in your Jupyter notebook environment, but if you don’t, you can install it by running pip3 install pandas in your terminal or a Jupyter code cell.
import pandas as pd
To create an ABC segmentation of customers you first need a transaction item dataset. Transaction item data is readily available from most ecommerce platforms or finance systems, but we’ll use the Online Retail Dataset so you can follow along. You can download the Online Retail Dataset from my GitHub repository, then unzip the file, and load the CSV into a Pandas dataframe using read_csv().
df = pd.read_csv('online-retail.csv')
df.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
To classify the customers based on their cumulative contribution to the overall revenue of the business, we need to do some preprocessing to put the dataset in the right format. The dataset lists every line item in every order, but we need the line price, so we’ll first calculate this by multiplying the quantity of units by the unit price.
df['LinePrice'] = df['Quantity'] * df['UnitPrice']
df.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | LinePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
Next, we need to aggregate the raw transaction item data to the customer level. The new dataset needs to include the customer ID and the total revenue. We don’t need it for the actual ABC analysis, but while we’re here we may as well calculate the total number of orders, items, and SKUs purchased by each customer. To do this we’ll aggregate the Pandas data using the agg() function.
Since we’re including all the data in the dataset, our ABC analysis will examine purchases from all time. If a customer spent a lot in the past but has since lapsed, they could still appear as a Class A customer. If this isn’t what you want, filter your Pandas dataframe to include only orders from the previous 12-24 months before running this step.
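If you do want to restrict the analysis to recent purchases, you can filter on InvoiceDate before aggregating. Here’s a minimal sketch on a toy dataframe (the column names match the Online Retail dataset, but the dates and the 24-month window are assumptions; adjust them to your objective):

```python
import pandas as pd

# Toy transaction data with the same column names as the Online Retail dataset
df = pd.DataFrame({
    'CustomerID': [12346.0, 12346.0, 12347.0],
    'InvoiceDate': ['2008-01-15 10:00:00', '2010-06-01 09:30:00', '2010-12-01 08:26:00'],
    'LinePrice': [100.0, 50.0, 25.0],
})
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Keep only orders placed within 24 months of the most recent transaction
cutoff = df['InvoiceDate'].max() - pd.DateOffset(months=24)
df_recent = df[df['InvoiceDate'] >= cutoff]
```

The first row falls outside the 24-month window, so only the two recent transactions survive the filter and would feed into the customer aggregation.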
df_customers = df.groupby('CustomerID').agg(
    orders=('InvoiceNo', 'nunique'),
    skus=('StockCode', 'nunique'),
    quantity=('Quantity', 'sum'),
    revenue=('LinePrice', 'sum'),
).reset_index()
df_customers.head()
| | CustomerID | orders | skus | quantity | revenue |
|---|---|---|---|---|---|
| 0 | 12346.0 | 2 | 1 | 0 | 0.00 |
| 1 | 12347.0 | 7 | 103 | 2458 | 4310.00 |
| 2 | 12348.0 | 4 | 22 | 2341 | 1797.24 |
| 3 | 12349.0 | 1 | 73 | 631 | 1757.55 |
| 4 | 12350.0 | 1 | 17 | 197 | 334.40 |
Now that we have a customer dataset that includes total revenue, we can create the customer ABC classification itself using Pandas. We’ll construct a Python function to handle this for us. Since ABC classification can be used for lots of things, this also gives us a useful function we can re-use for other projects.
The function works by sorting the Pandas dataframe of customers by the metric column (i.e. revenue) so that the customers who spent the most are at the top of the list. It then calculates the cumulative sum of the total revenue and the running percentage of the total. It assigns Class A to all customers whose cumulative percentage is up to 80%, Class B to those whose cumulative percentage contribution falls between 80% and 90%, and Class C to the rest.
The function returns both the ABC class for each customer and their rank, with the customer who spent the most ranked 1 and ranks increasing as the cumulative contribution to revenue decreases. It then removes the columns used for the intermediate calculations and returns the original dataframe with the class and rank columns added.
def abc(df, metric_column, abc_class_name='class'):
    """Assign an ABC class and rank to a metric based on cumulative percentage contribution.

    Args:
        df: Pandas dataframe containing data.
        metric_column (string): Name of column containing the metric to classify.
        abc_class_name (string, optional): Name to assign to the class column.

    Returns:
        Pandas dataframe containing the original data, plus the metric class and rank.
    """

    def _abc_segment(percentage):
        """Assign an ABC segment based on cumulative percentage contribution.

        Args:
            percentage (float): Cumulative percentage of the ranked metric.

        Returns:
            string: ABC class ('A', 'B', or 'C').
        """

        if 0 < percentage <= 80:
            return 'A'
        elif 80 < percentage <= 90:
            return 'B'
        else:
            return 'C'

    data = df.sort_values(by=metric_column, ascending=False)
    data[metric_column + '_sum'] = data[metric_column].sum()
    data[metric_column + '_cumsum'] = data[metric_column].cumsum()
    data[metric_column + '_running_pc'] = (data[metric_column + '_cumsum'] / data[metric_column + '_sum']) * 100
    data[abc_class_name] = data[metric_column + '_running_pc'].apply(_abc_segment)
    data[abc_class_name + '_rank'] = data[metric_column + '_running_pc'].rank().astype(int)
    data.drop([metric_column + '_sum', metric_column + '_cumsum', metric_column + '_running_pc'], axis=1, inplace=True)
    return data
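To see the thresholds in action before running the function on the real data, here’s a standalone sketch of the same cumulative-percentage logic on a made-up revenue series (the values are chosen purely for illustration):

```python
import pandas as pd

# Hypothetical revenue figures, already sorted from highest to lowest
revenue = pd.Series([500.0, 300.0, 100.0, 60.0, 40.0])

# Running percentage of total revenue, as calculated inside the function
running_pc = revenue.cumsum() / revenue.sum() * 100  # 50, 80, 90, 96, 100

# Apply the same 80/90 thresholds the inner helper uses
classes = running_pc.apply(lambda p: 'A' if p <= 80 else ('B' if p <= 90 else 'C'))
```

The first two customers are Class A because they take the running total up to the 80% boundary, the third is Class B at 90%, and the remaining two fall into Class C.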
Finally, we can use the function to segment our customers. We’ll pass in the Pandas dataframe containing the customers (remember, you may want to filter this in Pandas so it only includes those who purchased in the preceding 12-24 months, depending on your objective), define the metric column as revenue, and name the ABC class column abc_class.
df_segments = abc(df_customers, 'revenue', 'abc_class')
df_segments.head()
| | CustomerID | orders | skus | quantity | revenue | abc_class | abc_class_rank |
|---|---|---|---|---|---|---|---|
| 1703 | 14646.0 | 77 | 703 | 196719 | 279489.02 | A | 1 |
| 4233 | 18102.0 | 62 | 151 | 64122 | 256438.49 | A | 2 |
| 3758 | 17450.0 | 55 | 127 | 69029 | 187482.17 | A | 3 |
| 1895 | 14911.0 | 248 | 1794 | 77180 | 132572.62 | A | 4 |
| 55 | 12415.0 | 26 | 444 | 77242 | 123725.45 | A | 5 |
To examine the customer segments we’ll use the Pandas groupby() and agg() functions again to create a summary of statistics from the dataframe. We’ll group by abc_class and then calculate the number of unique customers and the sum of orders, SKUs, quantity, and revenue for each ABC class.
Next we’ll calculate the average order value (AOV) for each ABC Class, as well as the average customer spend, the average number of orders, the average number of units purchased, and the percentage of revenue and customers in each Class.
The data show that the 80/20 “rule” doesn’t apply perfectly to customers in this dataset, which is pretty common. Despite the name, the Pareto principle rarely gives you an exact 80/20 split, but it’s often close for many metrics and businesses. Here we find that roughly 27% of customers generate 80% of the revenue in this business.
df_summary = df_segments.groupby('abc_class').agg(
    customers=('CustomerID', 'nunique'),
    orders=('orders', 'sum'),
    skus=('skus', 'sum'),
    quantity=('quantity', 'sum'),
    revenue=('revenue', 'sum')
).reset_index()
df_summary['avg_order_value'] = df_summary['revenue'] / df_summary['orders']
df_summary['avg_orders'] = df_summary['orders'] / df_summary['customers']
df_summary['avg_quantity'] = df_summary['quantity'] / df_summary['orders']
df_summary['avg_revenue'] = df_summary['revenue'] / df_summary['customers']
df_summary['pc_revenue'] = round((df_summary['revenue'] / df_summary['revenue'].sum()) * 100, 2)
df_summary['pc_customers'] = round((df_summary['customers'] / df_summary['customers'].sum()) * 100, 2)
df_summary
| | abc_class | customers | orders | skus | quantity | revenue | avg_order_value | avg_orders | avg_quantity | avg_revenue | pc_revenue | pc_customers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 1170 | 14252 | 154764 | 3858321 | 6639188.701 | 465.842598 | 12.181197 | 270.721372 | 5674.520257 | 79.99 | 26.76 |
| 1 | B | 763 | 3255 | 47706 | 507175 | 830242.280 | 255.066753 | 4.266055 | 155.814132 | 1088.128807 | 10.00 | 17.45 |
| 2 | C | 2439 | 4683 | 65145 | 541392 | 830634.833 | 177.372375 | 1.920049 | 115.607944 | 340.563687 | 10.01 | 55.79 |
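If you want to check the Pareto split directly rather than reading it off the summary table, you can compute the share of customers needed to reach 80% of revenue. Here’s a minimal standalone sketch on made-up figures (on the real data you’d use df_customers['revenue'] instead of the toy series):

```python
import pandas as pd

# Hypothetical per-customer revenue; substitute df_customers['revenue'] in practice
revenue = pd.Series([400.0, 250.0, 150.0, 80.0, 50.0, 40.0, 20.0, 10.0])

# Sort the biggest spenders first and compute the running revenue share
running_pc = revenue.sort_values(ascending=False).cumsum() / revenue.sum() * 100

# How many customers it takes to reach 80% of revenue, as a share of all customers
n_top = int((running_pc <= 80).sum())
pc_customers = n_top / len(revenue) * 100
```

In this toy example, three of the eight customers (37.5%) account for 80% of revenue; on the Online Retail data the equivalent figure is the 26.76% shown in the pc_customers column.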
Matt Clarke, Saturday, June 18, 2022