How to create an ABC customer segmentation in Pandas

Picture by Dom J, Pexels.

14 minutes to read

ABC classification is a simple technique that is commonly used in inventory management and is based on the Pareto principle or 80/20 rule. This says that 80% of consequences come from 20% of causes. In ABC inventory classification, that means that 80% of product sales typically come from 20% of products, so managing the inventory on the most important 20% of products can lead to better efficiency and profits, through lower effort and fewer stockouts.

The Pareto principle applies to many things in ecommerce and marketing, including customer behaviour. As a result, ABC classification can also be used in customer segmentation, since 80% of revenue or orders often come from 20% of customers. Using ABC for customer segmentation means marketing efforts, or sales team efforts can be concentrated on the most important customers, saving labour and marketing costs by trying to treat all customers equally.

In this project, I’ll show you how you can use Pandas and Python to create a customer dataset from transactional data and then apply ABC classification to create an ABC customer segmentation. Let’s get started.

Import the packages

We only need one package for this project - Pandas. You’ll likely already have Pandas installed in your Jupyter notebook environment, but if you don’t you can install it by entering pip3 install pandas in your terminal or Jupyter code cell.

import pandas as pd

Load the data

To create an ABC segmentation of customers you first need a transaction item dataset. Transaction item data is readily available from most ecommerce platforms or finance systems, but we’ll use the Online Retail Dataset so you can follow along. You can download the Online Retail Dataset from my GitHub repository, then unzip the file, and load the CSV into a Pandas dataframe using read_csv().

df = pd.read_csv('online-retail.csv')
df.head()

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom

Calculate the line revenue

To classify the customers based on their cumulative contribution to the overall revenue of the business we need to do some preprocessing work on the dataset to put things in the right format. The dataset lists every line in every order, but we need the line price, so we’ll first calculate this by multiplying the quantity of units by the unit price.

df['LinePrice'] = df['Quantity'] * df['UnitPrice']
df.head()

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	LinePrice
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom	15.30
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom	20.34
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom	22.00
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom	20.34
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom	20.34

Create a customer dataset

Next, we need to take the raw transaction items dataset and create a new dataset based on the customer level. This needs to include the customer ID and the total revenue. We don’t need it for the actual ABC analysis, but while we’re here we may as well calculate the total number of items and SKUs purchased by each customer. To do this we’ll aggregate the Pandas data using the agg() function.

Since we’re including all the data in the dataset, our ABC analysis is going to examine purchases from all time. If a customer spent lots in the past and has subsequently lapsed, they’ll potentially appear as a Class A customer. If this isn’t what you want, you may wish to filter your Pandas dataframe to include only orders from the previous 12-24 months before running this step.

df_customers = df.groupby('CustomerID').agg(
    orders=('InvoiceNo', 'nunique'),
    skus=('StockCode', 'nunique'),
    quantity=('Quantity', 'sum'),
    revenue=('LinePrice', 'sum'),
).reset_index()

df_customers.head()

	CustomerID	orders	skus	quantity	revenue
0	12346.0	2	1	0	0.00
1	12347.0	7	103	2458	4310.00
2	12348.0	4	22	2341	1797.24
3	12349.0	1	73	631	1757.55
4	12350.0	1	17	197	334.40

Create an ABC customer segmentation

Now we have a customer dataset that includes total revenue we can create the customer ABC classification itself using Pandas. We’ll construct a Python function to handle this for us. Since ABC classification can be used for lots of things, it also gives us a useful function we can re-use for other projects.

The function works by sorting the Pandas dataframe of customers by the metric column (i.e. revenue) so that the customers who spent the most are at the top of the list. It then calculates the cumulative sum of the total revenue and the running percentage total. It assigns the Class A to all customers where the cumulative percentage is up to 80%, Class B to those where the cumulative percentage contribution is between 80% and 90%, and assigns Class C to the rest.

The function returns both the ABC class for each customer and its rank - with the customer who spent the most being ranked 1 and the number declining as cumulative contribution to revenue decreases. It then removes some columns used for the calculations and returns the original dataframe with the class and rank remaining.

def abc(df, metric_column, abc_class_name='class'):
    """Assign an ABC class and rank to a metric based on cumulative percentage contribution. 
    
    Args:
        df: Pandas dataframe containing data. 
        metric_column (string): Name of column containing metric to calculate. 
        abc_class_name (string, optional): Name to assign to class column. 
    
    Return:
        Pandas dataframe containing original data, plus the metric class and rank. 
    """
    
    def _abc_segment(percentage):
        """Assign an ABC segment based on cumulative percentage contribution.
        Args:
            percentage (float): Cumulative percentage of ranked metric.
        Returns:
            segments: Pandas DataFrame
        """

        if 0 < percentage <= 80:
            return 'A'
        elif 80 < percentage <= 90:
            return 'B'
        else:
            return 'C'    
    
    data = df.sort_values(by=metric_column, ascending=False)
    data[metric_column+'_sum'] = data[metric_column].sum()
    data[metric_column+'_cumsum'] = data[metric_column].cumsum()
    data[metric_column+'_running_pc'] = (data[metric_column+'_cumsum'] / data[metric_column+'_sum']) * 100
    data[abc_class_name] = data[metric_column+'_running_pc'].apply(_abc_segment)
    data[abc_class_name+'_rank'] = data[metric_column+'_running_pc'].rank().astype(int)
    data.drop([metric_column+'_sum', metric_column+'_cumsum', metric_column+'_running_pc'], axis=1, inplace=True)
    return data

Segment the customers using ABC

Finally, we can use the function to segment our customers. We’ll pass in the Pandas dataframe containing the customers (remember, you may want to filter this in Pandas, so it only includes those who purchased in the preceding 12-24 months, depending on your objective), and we’ll define the metric column as the revenue, and we’ll call the ABC class abc_class.

df_segments = abc(df_customers, 'revenue', 'abc_class')
df_segments.head()

	CustomerID	orders	skus	quantity	revenue	abc_class	abc_class_rank
1703	14646.0	77	703	196719	279489.02	A	1
4233	18102.0	62	151	64122	256438.49	A	2
3758	17450.0	55	127	69029	187482.17	A	3
1895	14911.0	248	1794	77180	132572.62	A	4
55	12415.0	26	444	77242	123725.45	A	5

Use ABC analysis to examine the segments

To examine the customer segments we’ll use the Pandas groupby() function and agg() again to create a summary of statistics from the dataframe. We’ll group by the abc_class and then calculate the number of unique customers, and the sum of orders, SKUs, quantity, and revenue for each ABC Class.

Next we’ll calculate the average order value (AOV) for each ABC Class, as well as the average customer spend, the average number of orders, the average number of units purchased, and the percentage of revenue and customers in each Class.

The data show that the 80/20 “rule” does not perfectly apply to customers in this dataset. That’s something which is pretty common. Despite the name, the Pareto rule rarely gives you an exact 80/20 split, but it’s often very close for many metrics and businesses. We find that 26% of customers generate 80% of the revenue in this business.

df_summary = df_segments.groupby('abc_class').agg(
    customers=('CustomerID', 'nunique'),
    orders=('orders', 'sum'),
    skus=('skus', 'sum'),
    quantity=('quantity', 'sum'),
    revenue=('revenue', 'sum')
).reset_index()

df_summary['avg_order_value'] = df_summary['revenue'] / df_summary['orders']
df_summary['avg_orders'] = df_summary['orders'] / df_summary['customers']
df_summary['avg_quantity'] = df_summary['quantity'] / df_summary['orders']
df_summary['avg_revenue'] = df_summary['revenue'] / df_summary['customers']
df_summary['pc_revenue'] = round((df_summary['revenue'] / df_summary['revenue'].sum()) * 100, 2)
df_summary['pc_customers'] = round((df_summary['customers'] / df_summary['customers'].sum()) * 100, 2)

df_summary

	abc_class	customers	orders	skus	quantity	revenue	avg_order_value	avg_orders	avg_quantity	avg_revenue	pc_revenue	pc_customers
0	A	1170	14252	154764	3858321	6639188.701	465.842598	12.181197	270.721372	5674.520257	79.99	26.76
1	B	763	3255	47706	507175	830242.280	255.066753	4.266055	155.814132	1088.128807	10.00	17.45
2	C	2439	4683	65145	541392	830634.833	177.372375	1.920049	115.607944	340.563687	10.01	55.79

Matt Clarke, Saturday, June 18, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.