How to bin or bucket customer data using Pandas

Data binning or bucketing is a very useful technique for both preprocessing and understanding or visualising complex data. Here’s how to use it.


Data binning, bucketing, or discretization is a very useful technique for both preprocessing and understanding or visualising complex data, especially during the customer segmentation process. It’s applied to continuous variables, such as revenue, orders, or recency, and groups wide ranges of numbers together into a small number of discrete “bins” or “buckets”. Effectively, it allows you to create categorical variables from continuous ones.

Two really common applications of data binning in ecommerce are [ABC inventory classification](https://practicaldatascience.co.uk/data-science/how-to-create-an-abc-inventory-classification-model) and RFM segmentation. ABC inventory classification uses the cumulative percentage of revenue generated by each product SKU to allocate it to either class A, B, or C, based on its contribution to the business, thus helping procurement managers better control a broad product catalog.
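The mechanics of that cumulative-percentage approach can be sketched in a few lines. This uses made-up SKU revenues and the common (but arbitrary) 80/95 per cent thresholds, not figures from any real catalogue:

```python
import pandas as pd

# Illustrative revenue per SKU; in practice this comes from your transactions
skus = pd.DataFrame({'sku': ['S1', 'S2', 'S3', 'S4', 'S5'],
                     'revenue': [5000, 3000, 1200, 500, 300]})

# Rank SKUs by revenue and compute each one's cumulative share of the total
skus = skus.sort_values('revenue', ascending=False)
skus['cum_pct'] = skus['revenue'].cumsum() / skus['revenue'].sum() * 100

# Typical (though arbitrary) cut-offs: A up to 80%, B up to 95%, C the rest
skus['class'] = pd.cut(skus['cum_pct'], bins=[0, 80, 95, 100], labels=['A', 'B', 'C'])
```

Here the top two SKUs account for 80 per cent of revenue and land in class A, with the long tail falling into B and C.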

RFM segmentation takes the powerful RFM variables - recency, frequency, and monetary value - and bins them into quintiles from 1 to 5, resulting in 125 scores from 111 to 555. These can show marketers the approximate value of each customer to the business and whether they’re at risk of lapsing, or have just purchased, something they’d not be able to do from the raw values alone.
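A minimal sketch of that scoring process, on a made-up customer-level dataframe rather than the real dataset (the rank(method='first') step is a common workaround for the duplicate bin edges that tied frequency values can cause):

```python
import pandas as pd

# Illustrative customer-level values; in practice these come from your orders
rfm = pd.DataFrame({
    'recency': [5, 40, 90, 200, 365, 10, 30, 150, 60, 20],
    'frequency': [12, 3, 1, 2, 1, 8, 5, 1, 4, 7],
    'monetary': [1500, 300, 50, 120, 20, 900, 450, 80, 600, 1100],
})

# Lower recency is better, so its labels run from 5 (most recent) down to 1
rfm['r'] = pd.qcut(rfm['recency'], q=5, labels=[5, 4, 3, 2, 1])
rfm['f'] = pd.qcut(rfm['frequency'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5])
rfm['m'] = pd.qcut(rfm['monetary'], q=5, labels=[1, 2, 3, 4, 5])

# Concatenate the three quintile digits into a score from 111 to 555
rfm['rfm'] = rfm['r'].astype(str) + rfm['f'].astype(str) + rfm['m'].astype(str)
```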

Statistical binning can be performed quickly and easily in Python using Pandas, scikit-learn, or custom functions. Here we’re going to use a variety of binning techniques to better understand a typical ecommerce dataset.

Load your dataset

In this project we’re going to use the UCI Machine Learning Repository’s Online Retail dataset, which you can download from the repository’s website. It’s a regular transactional dataset, so you’ll easily be able to create a similar version from your own data if you wish. Load up Pandas and NumPy and import the dataset.

import pandas as pd
import numpy as np
import jenkspy  # used later for Fisher-Jenks natural breaks
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('uci_online_retail.csv')

Pandas can be quite slow at opening Excel files, so as I use this dataset often, I’ve re-saved it as a CSV file so it opens up more quickly. I’ve also dropped the redundant columns and renamed the remaining ones so they’re a bit tidier and more consistent with my usual variable naming style.

df.drop(['Unnamed: 0', 'Country'], axis=1, inplace=True) 
df.rename(columns={'InvoiceNo': 'order_id',
                   'StockCode': 'variant_id', 
                   'Description': 'name',
                   'InvoiceDate': 'date_created', 
                   'UnitPrice': 'item_price', 
                   'Quantity': 'quantity',
                   'CustomerID': 'customer_id'}, inplace=True)
df.head()
order_id variant_id name quantity date_created item_price customer_id
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0

Engineer some additional features

Since the Online Retail dataset doesn’t contain the line revenue, and the datetime field isn’t stored as a datetime type, we’ll first use the Pandas to_datetime() function to set the field to the correct type, then create a new line_revenue column, which we can calculate by multiplying the quantity of units by the item_price.

df['date_created'] = pd.to_datetime(df['date_created'])
df['line_revenue'] = df['quantity'] * df['item_price']
df.head()
order_id variant_id name quantity date_created item_price customer_id line_revenue
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 20.34

Create a customer-level dataset

Since we’re going to be binning customer-level data, we need to take our transactional dataset and turn it into a customer-level one. The first step in this process is to create a new dataframe based on the unique customers within the data.

df_customers = pd.DataFrame(df['customer_id'].unique())
df_customers.columns = ['customer_id']
df_customers.head()
customer_id
0 17850.0
1 13047.0
2 12583.0
3 13748.0
4 15100.0

Next, we’ll calculate the recency_date for each customer using the date of their most recent order, then we’ll calculate their recency or the number of days since their last order was placed. As this dataset is from 2011, the recencies are all very high.

df_recency = df.groupby('customer_id')['date_created'].max().reset_index()
df_recency.columns = ['customer_id','recency_date']
df_customers = df_customers.merge(df_recency, on='customer_id')
df_customers['recency'] = round((pd.to_datetime('today') - df_customers['recency_date'])\
                                / np.timedelta64(1, 'D') ).astype(int)
df_customers.head()
customer_id recency_date recency
0 17850.0 2011-02-10 14:38:00 3546
1 13047.0 2011-11-08 12:10:00 3275
2 12583.0 2011-12-07 08:07:00 3246
3 13748.0 2011-09-05 09:45:00 3339
4 15100.0 2011-01-13 17:09:00 3574

To examine the number of orders placed, we can calculate the number of unique order_id values using the nunique() function, then we can assign the value to the customer dataframe.

df_frequency = df.groupby('customer_id')['order_id'].nunique().reset_index()
df_frequency.columns = ['customer_id','frequency']
df_customers = df_customers.merge(df_frequency, on='customer_id')
df_customers.head()
customer_id recency_date recency frequency
0 17850.0 2011-02-10 14:38:00 3546 35
1 13047.0 2011-11-08 12:10:00 3275 18
2 12583.0 2011-12-07 08:07:00 3246 18
3 13748.0 2011-09-05 09:45:00 3339 5
4 15100.0 2011-01-13 17:09:00 3574 6

To calculate the monetary value for each customer we simply sum() the values in the line_revenue column we created earlier, then assign the value back to the dataframe.

df_monetary = df.groupby('customer_id')['line_revenue'].sum().reset_index()
df_monetary.columns = ['customer_id','monetary']
df_customers = df_customers.merge(df_monetary, on='customer_id')
df_customers.head()
customer_id recency_date recency frequency monetary
0 17850.0 2011-02-10 14:38:00 3546 35 5288.63
1 13047.0 2011-11-08 12:10:00 3275 18 3079.10
2 12583.0 2011-12-07 08:07:00 3246 18 7187.34
3 13748.0 2011-09-05 09:45:00 3339 5 948.25
4 15100.0 2011-01-13 17:09:00 3574 6 635.10

Finally, to calculate the variety metric we can count the number of unique products purchased by using nunique() on the variant_id column. This will give us a broad range of values, ranging from customers who bought a single SKU in their lifetime to some who’ve purchased over a hundred. We’ve now got a nice customer-level dataset that we can use for binning.

df_variety = df.groupby('customer_id')['variant_id'].nunique().reset_index()
df_variety.columns = ['customer_id','variety']
df_customers = df_customers.merge(df_variety, on='customer_id')
df_customers.head()
customer_id recency_date recency frequency monetary variety
0 17850.0 2011-02-10 14:38:00 3546 35 5288.63 24
1 13047.0 2011-11-08 12:10:00 3275 18 3079.10 106
2 12583.0 2011-12-07 08:07:00 3246 18 7187.34 115
3 13748.0 2011-09-05 09:45:00 3339 5 948.25 24
4 15100.0 2011-01-13 17:09:00 3574 6 635.10 1


Quantile-based discretization

The first binning method we’ll apply is Pandas’ quantile-based discretization function, qcut(). This method creates bins containing equal (or approximately equal) numbers of observations, based on their rank or on specific quantile boundaries. It’s the standard technique for creating RFM quintiles.

First, we’ll bin the data in the variety column. We’ll split the data up into five equal bins based on the number of unique SKUs each customer has purchased and we’ll assign them the appropriate label from our list.

For demonstration purposes, I’ve specifically defined the bin boundaries, so values in the 0-0.2 quantile go into Very low, 0.2-0.4 into Low, 0.4-0.6 into Moderate, 0.6-0.8 into High, and 0.8-1 into Very high. However, you don’t need to provide these if you’re spacing the bins out evenly: you can simply pass the number of bins (here, q=5) along with your labels, and qcut() will calculate the quantile boundaries for you.

variety_labels = ['Very low', 'Low', 'Moderate', 'High', 'Very high']

df_customers['variety_bin'] = pd.qcut(df_customers['variety'], 
                            q=[0, .2, .4, .6, .8, 1], 
                            labels=variety_labels)

df_customers['variety_bin'].value_counts()
Low          890
Very low     883
High         870
Very high    868
Moderate     861
Name: variety_bin, dtype: int64
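As noted above, listing the quantile edges explicitly is optional. On a small made-up series, passing q=5 produces exactly the same bins as spelling out q=[0, .2, .4, .6, .8, 1]:

```python
import pandas as pd

values = pd.Series(range(1, 101))  # 100 distinct values
labels = ['Very low', 'Low', 'Moderate', 'High', 'Very high']

explicit = pd.qcut(values, q=[0, .2, .4, .6, .8, 1], labels=labels)
shorthand = pd.qcut(values, q=5, labels=labels)
```

Both calls produce five bins of 20 values each.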

If you take a look at the contents of the bins by using an aggregation to calculate the minimum, maximum, and mean values of the variety column, you’ll get a feel for how the data are distributed. The bins hold approximately equal (but not perfectly equal) numbers of customers, and their value ranges increase as defined.

df_customers.groupby('variety_bin').agg(
    min_variety=('variety', min),
    max_variety=('variety', max),
    avg_variety=('variety', 'mean') 
).sort_values(by='avg_variety')
min_variety max_variety avg_variety
variety_bin
Very low 1 12 6.472254
Low 13 26 19.279775
Moderate 27 48 36.109175
High 49 92 67.258621
Very high 93 1794 178.728111

Next, we’ll create some tercile bins labeled A, B, and C, which will segment the customers based on the total monetary value they’ve generated for the business. Since we want even boundaries, we can just pass 3 to the q argument and the data will be split equally. We get three bins with 1458, 1457, and 1457 customers respectively.

monetary_labels = ['C', 'B', 'A']

df_customers['monetary_bin'] = pd.qcut(df_customers['monetary'], 
                            q=3, 
                            labels=monetary_labels)

df_customers['monetary_bin'].value_counts()
C    1458
A    1457
B    1457
Name: monetary_bin, dtype: int64

Using the same aggregation approach we can again examine the minimum, maximum and average values in each bin to understand how the customers are segmented. There’s a clear difference between the customers in the bins, so the ABC labels work well in helping people understand the contribution of each customer to the business at a glance.

df_customers.groupby('monetary_bin').agg(   
    min_monetary=('monetary', min),
    max_monetary=('monetary', max),
    avg_monetary=('monetary', 'mean'),    
).sort_values(by='avg_monetary', ascending=False)
min_monetary max_monetary avg_monetary
monetary_bin
A 1141.34 279489.02 4815.372952
B 374.70 1141.24 685.161594
C -4287.63 374.57 196.013018

To examine both binned columns together we can use describe() with its optional include='category' argument, which gives us the counts, unique values, top values, and frequencies for each of them.

df_customers.describe(include='category')
variety_bin monetary_bin
count 4372 4372
unique 5 3
top Low C
freq 890 1458

Cut

While binning via quantile-based discretization with qcut() creates bins of equal size (or as close to equal as possible), there are times when you may wish to use a different approach and have bins of unequal size. You can do this in Pandas using the cut() function.

This can sometimes be very useful when the statistical distribution of the data is very skewed. For example, maybe you’re analysing 20 years’ worth of customer data and most of the customers are lapsed, with the current revenue coming from a smaller portion who are active.

Using qcut() could mean your binned data contains a mixture of lapsed and active customers, but using cut() could ensure they’re better separated based on the bounds you define. While RFM segmentation is normally done using quantile-based discretization, this alternative approach can be more practical on heavily skewed datasets.

monetary_labels = ['C', 'B', 'A']

df_customers['monetary_bin_cut'] = pd.cut(df_customers['monetary'], 
                            bins=3, 
                            labels=monetary_labels)

df_customers['monetary_bin_cut'].value_counts()
C    4366
A       3
B       3
Name: monetary_bin_cut, dtype: int64

As you can see from the above output, running the cut() function with only a number of bins splits the full value range into three equal-width intervals, so on a heavily skewed distribution like this one nearly every customer lands in the lowest bucket. To separate customers meaningfully, you’d pass your own list of boundaries to the bins argument instead.
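Passing cut() your own list of edges gives you that control. A minimal sketch with made-up monetary values and purely illustrative thresholds:

```python
import pandas as pd

# Illustrative monetary values; replace with your own customer-level column
monetary = pd.Series([25.0, 180.0, 640.0, 950.0, 2100.0, 15000.0])

# Unequal, analyst-chosen bin edges rather than quantile- or width-based ones
bounds = [0, 500, 2000, float('inf')]
classes = pd.cut(monetary, bins=bounds, labels=['C', 'B', 'A'])
```

Each value falls into the interval whose upper edge it doesn’t exceed, so the edges can be placed wherever the business logic demands.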

K-means clusters

Another common approach to creating categorical data from continuous and other variables is to use unsupervised learning techniques, such as clustering. The K-means clustering algorithm is probably the most widely used for this purpose and can quickly and easily be applied to one or more columns in a dataframe to return a dynamically generated cluster.

from sklearn.cluster import KMeans

data = df_customers[['variety']].copy()

kmeans = KMeans(n_clusters=3)
kmeans.fit(data)

data['cluster'] = kmeans.labels_
class_map = {0: "C", 1: "B", 2: "A"}
data['class_kmeans'] = data['cluster'].map(class_map)
data.head()
variety cluster class_kmeans
0 24 0 C
1 106 0 C
2 115 0 C
3 24 0 C
4 1 0 C
data.class_kmeans.value_counts()
C    3781
B     582
A       9
Name: class_kmeans, dtype: int64
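One caveat worth adding: K-means assigns its cluster numbers arbitrarily, so mapping 0, 1, and 2 straight to C, B, and A won’t necessarily match the magnitude of each cluster. A safer sketch (on made-up one-dimensional data, not the Online Retail dataset) orders the letters by cluster centre first:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative one-dimensional data with three well-separated groups
values = np.array([1, 2, 3, 50, 55, 60, 200, 210, 220], dtype=float).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

# Sort cluster centres so the class letters follow ascending value
order = np.argsort(kmeans.cluster_centers_.ravel())
label_map = {cluster: letter for cluster, letter in zip(order, ['C', 'B', 'A'])}
classes = [label_map[c] for c in kmeans.labels_]
```

This way the smallest-valued cluster always gets C and the largest always gets A, whichever numbers K-means happens to assign.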

Unique value reduction

Another great application of data binning is the reduction of unique values within a column of categorical variables, which can really help your model see the wood for the trees. For example, our name column in the original transactional dataset contains 4223 unique values, which we can see by running df.name.nunique(). It has very high cardinality, so it wouldn’t be suitable for one-hot encoding as a categorical variable.

However, let’s say we wanted to use it as a categorical variable (it’s a stupid idea, but you’ll hopefully get the picture). To do this we’d need to significantly reduce its cardinality, which we can do by binning the data based on a threshold of occurrences. Here’s a function which does that for us.

def cols_to_reduce_uniques(df, column_threshold_dict):
    """Reduce the number of unique values in a column by keeping the values that
    occur above a threshold and marking the rest as "Others".

    Args:
        df: Pandas DataFrame.
        column_threshold_dict: Dictionary of column names and thresholds,
            i.e. {'col1': 1000, 'col2': 3000}.

    Returns:
        Original DataFrame with additional prefixed columns. The most dominant values
        in each column keep their original value, while the less dominant ones are
        assigned to Others, which can help visualise and model data in some cases.
    """

    for key, value in column_threshold_dict.items():
        counts = df[key].value_counts()
        others = set(counts[counts < value].index)
        df['reduce_' + key] = df[key].replace(list(others), 'Others')

    return df

Now, if we run the function on the name column with the threshold set to 2000, then re-examine df.reduce_name using the Pandas value_counts() function, we’ll see that the three top-selling products keep their own class name and everything else is lumped into the Others class.

df = cols_to_reduce_uniques(df, {'name': 2000})
df.head()
order_id variant_id name quantity date_created item_price customer_id line_revenue reduce_name
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 15.30 WHITE HANGING HEART T-LIGHT HOLDER
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 20.34 Others
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 22.00 Others
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 20.34 Others
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 20.34 Others
df.reduce_name.value_counts()
Others                                533727
WHITE HANGING HEART T-LIGHT HOLDER      2369
REGENCY CAKESTAND 3 TIER                2200
JUMBO BAG RED RETROSPOT                 2159
Name: reduce_name, dtype: int64
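With the cardinality cut down to four values, one-hot encoding becomes practical. A minimal sketch on a few made-up rows of the reduced column:

```python
import pandas as pd

# Illustrative reduced column; the real one comes from cols_to_reduce_uniques()
reduce_name = pd.Series(['Others', 'WHITE HANGING HEART T-LIGHT HOLDER',
                         'Others', 'JUMBO BAG RED RETROSPOT'])

# One indicator column per remaining category
dummies = pd.get_dummies(reduce_name, prefix='name')
```

Instead of thousands of indicator columns, the model now sees just one per retained product plus one for Others.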

Fisher-Jenks natural splits

The final approach we could apply is called the Fisher-Jenks algorithm. This was a new one to me, until Chris Moffitt at Practical Business Python covered it, but it looks really handy. The Fisher-Jenks algorithm, or Jenks optimisation method, is used for the identification of natural splits in one-dimensional datasets.

It seems to give results that are most similar to K-means clustering, but is worth considering if there’s a specific natural break you wish to identify within a series. Like K-means clustering, you need to define the number of breaks. If we run this on the variety column the Fisher-Jenks algorithm finds the following natural breaks.

breaks = jenkspy.jenks_breaks(df_customers['variety'], nb_class=3)
breaks
[1.0, 118.0, 636.0, 1794.0]

Now we know the natural positions of the breaks, we can use them as the bounds for the cut() function, which will then bucket the data according to the natural breaks found, rather than our predefined boundaries. Since the first break sits at the minimum value of the series, we pass the optional include_lowest=True argument so the lowest value isn’t excluded and left as NaN.

df_customers['jenks_breaks'] = pd.cut(df_customers['variety'], 
                                      bins=breaks, 
                                      labels=['V1', 'V2', 'V3'],
                                      include_lowest=True)
df_customers.head()

Matt Clarke, Thursday, March 04, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
