How to create an ABC customer segmentation in Pandas

ABC customer segmentation using the Pareto principle or 80/20 rule to segment customers based on their cumulative revenue contribution. Here's how to create this customer segmentation using Pandas.

How to create an ABC customer segmentation in Pandas
Picture by Dom J, Pexels.
14 minutes to read

ABC classification is a simple technique that is commonly used in inventory management and is based on the Pareto principle or 80/20 rule. This says that 80% of consequences come from 20% of causes. In ABC inventory classification, that means that 80% of product sales typically come from 20% of products, so managing the inventory on the most important 20% of products can lead to better efficiency and profits, through lower effort and fewer stockouts.

The Pareto principle applies to many things in ecommerce and marketing, including customer behaviour. As a result, ABC classification can also be used in customer segmentation, since 80% of revenue or orders often come from 20% of customers. Using ABC for customer segmentation means marketing efforts, or sales team efforts can be concentrated on the most important customers, saving labour and marketing costs by trying to treat all customers equally.

In this project, I’ll show you how you can use Pandas and Python to create a customer dataset from transactional data and then apply ABC classification to create an ABC customer segmentation. Let’s get started.

Import the packages

We only need one package for this project - Pandas. You’ll likely already have Pandas installed in your Jupyter notebook environment, but if you don’t you can install it by entering pip3 install pandas in your terminal or Jupyter code cell.

import pandas as pd

Load the data

To create an ABC segmentation of customers you first need a transaction item dataset. Transaction item data is readily available from most ecommerce platforms or finance systems, but we’ll use the Online Retail Dataset so you can follow along. You can download the Online Retail Dataset from my GitHub repository, then unzip the file, and load the CSV into a Pandas dataframe using read_csv().

df = pd.read_csv('online-retail.csv')
df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom

Calculate the line revenue

To classify the customers based on their cumulative contribution to the overall revenue of the business we need to do some preprocessing work on the dataset to put things in the right format. The dataset lists every line in every order, but we need the line price, so we’ll first calculate this by multiplying the quantity of units by the unit price.

df['LinePrice'] = df['Quantity'] * df['UnitPrice']
df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country LinePrice
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Create a customer dataset

Next, we need to take the raw transaction items dataset and create a new dataset based on the customer level. This needs to include the customer ID and the total revenue. We don’t need it for the actual ABC analysis, but while we’re here we may as well calculate the total number of items and SKUs purchased by each customer. To do this we’ll aggregate the Pandas data using the agg() function.

Since we’re including all the data in the dataset, our ABC analysis is going to examine purchases from all time. If a customer spent lots in the past and has subsequently lapsed, they’ll potentially appear as a Class A customer. If this isn’t what you want, you may wish to filter your Pandas dataframe to include only orders from the previous 12-24 months before running this step.

df_customers = df.groupby('CustomerID').agg(
    orders=('InvoiceNo', 'nunique'),
    skus=('StockCode', 'nunique'),
    quantity=('Quantity', 'sum'),
    revenue=('LinePrice', 'sum'),
).reset_index()
df_customers.head()
CustomerID orders skus quantity revenue
0 12346.0 2 1 0 0.00
1 12347.0 7 103 2458 4310.00
2 12348.0 4 22 2341 1797.24
3 12349.0 1 73 631 1757.55
4 12350.0 1 17 197 334.40

Create an ABC customer segmentation

Now we have a customer dataset that includes total revenue we can create the customer ABC classification itself using Pandas. We’ll construct a Python function to handle this for us. Since ABC classification can be used for lots of things, it also gives us a useful function we can re-use for other projects.

The function works by sorting the Pandas dataframe of customers by the metric column (i.e. revenue) so that the customers who spent the most are at the top of the list. It then calculates the cumulative sum of the total revenue and the running percentage total. It assigns the Class A to all customers where the cumulative percentage is up to 80%, Class B to those where the cumulative percentage contribution is between 80% and 90%, and assigns Class C to the rest.

The function returns both the ABC class for each customer and its rank - with the customer who spent the most being ranked 1 and the number declining as cumulative contribution to revenue decreases. It then removes some columns used for the calculations and returns the original dataframe with the class and rank remaining.

def abc(df, metric_column, abc_class_name='class'):
    """Assign an ABC class and rank to a metric based on cumulative percentage contribution. 
    
    Args:
        df: Pandas dataframe containing data. 
        metric_column (string): Name of column containing metric to calculate. 
        abc_class_name (string, optional): Name to assign to class column. 
    
    Return:
        Pandas dataframe containing original data, plus the metric class and rank. 
    """
    
    def _abc_segment(percentage):
        """Assign an ABC segment based on cumulative percentage contribution.
        Args:
            percentage (float): Cumulative percentage of ranked metric.
        Returns:
            segments: Pandas DataFrame
        """

        if 0 < percentage <= 80:
            return 'A'
        elif 80 < percentage <= 90:
            return 'B'
        else:
            return 'C'    
    
    data = df.sort_values(by=metric_column, ascending=False)
    data[metric_column+'_sum'] = data[metric_column].sum()
    data[metric_column+'_cumsum'] = data[metric_column].cumsum()
    data[metric_column+'_running_pc'] = (data[metric_column+'_cumsum'] / data[metric_column+'_sum']) * 100
    data[abc_class_name] = data[metric_column+'_running_pc'].apply(_abc_segment)
    data[abc_class_name+'_rank'] = data[metric_column+'_running_pc'].rank().astype(int)
    data.drop([metric_column+'_sum', metric_column+'_cumsum', metric_column+'_running_pc'], axis=1, inplace=True)
    return data

Segment the customers using ABC

Finally, we can use the function to segment our customers. We’ll pass in the Pandas dataframe containing the customers (remember, you may want to filter this in Pandas, so it only includes those who purchased in the preceding 12-24 months, depending on your objective), and we’ll define the metric column as the revenue, and we’ll call the ABC class abc_class.

df_segments = abc(df_customers, 'revenue', 'abc_class')
df_segments.head()
CustomerID orders skus quantity revenue abc_class abc_class_rank
1703 14646.0 77 703 196719 279489.02 A 1
4233 18102.0 62 151 64122 256438.49 A 2
3758 17450.0 55 127 69029 187482.17 A 3
1895 14911.0 248 1794 77180 132572.62 A 4
55 12415.0 26 444 77242 123725.45 A 5

Use ABC analysis to examine the segments

To examine the customer segments we’ll use the Pandas groupby() function and agg() again to create a summary of statistics from the dataframe. We’ll group by the abc_class and then calculate the number of unique customers, and the sum of orders, SKUs, quantity, and revenue for each ABC Class.

Next we’ll calculate the average order value (AOV) for each ABC Class, as well as the average customer spend, the average number of orders, the average number of units purchased, and the percentage of revenue and customers in each Class.

The data show that the 80/20 “rule” does not perfectly apply to customers in this dataset. That’s something which is pretty common. Despite the name, the Pareto rule rarely gives you an exact 80/20 split, but it’s often very close for many metrics and businesses. We find that 26% of customers generate 80% of the revenue in this business.

df_summary = df_segments.groupby('abc_class').agg(
    customers=('CustomerID', 'nunique'),
    orders=('orders', 'sum'),
    skus=('skus', 'sum'),
    quantity=('quantity', 'sum'),
    revenue=('revenue', 'sum')
).reset_index()
df_summary['avg_order_value'] = df_summary['revenue'] / df_summary['orders']
df_summary['avg_orders'] = df_summary['orders'] / df_summary['customers']
df_summary['avg_quantity'] = df_summary['quantity'] / df_summary['orders']
df_summary['avg_revenue'] = df_summary['revenue'] / df_summary['customers']
df_summary['pc_revenue'] = round((df_summary['revenue'] / df_summary['revenue'].sum()) * 100, 2)
df_summary['pc_customers'] = round((df_summary['customers'] / df_summary['customers'].sum()) * 100, 2)
df_summary
abc_class customers orders skus quantity revenue avg_order_value avg_orders avg_quantity avg_revenue pc_revenue pc_customers
0 A 1170 14252 154764 3858321 6639188.701 465.842598 12.181197 270.721372 5674.520257 79.99 26.76
1 B 763 3255 47706 507175 830242.280 255.066753 4.266055 155.814132 1088.128807 10.00 17.45
2 C 2439 4683 65145 541392 830634.833 177.372375 1.920049 115.607944 340.563687 10.01 55.79

Matt Clarke, Saturday, June 18, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.