Whether you call it data binning, data bucketing, or data discretization, the technique of grouping numeric data together is an exceptionally powerful one in data science, statistics, and machine learning. The same concept is also used in data visualisations such as histograms to make it clearer how data are distributed by grouping them into bins or buckets.
Data binning is used on continuous numeric variables to group or partition them into smaller chunks. For example, if you applied data binning to the amounts your customers spend you could see those who spent £0-100, £100-200, or £200+, making your data easier to visualise, analyse, and understand.
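As a quick illustration of that spend example, here's a minimal sketch using the pandas `cut()` function we'll cover in detail later. The spend figures and bin labels are invented for illustration:

```python
import pandas as pd

# Invented customer spend values for illustration
spend = pd.Series([25.00, 150.00, 99.99, 310.00, 175.00])

# Right-closed bins: (0, 100], (100, 200], and anything above 200
spend_bin = pd.cut(spend,
                   bins=[0, 100, 200, float('inf')],
                   labels=['£0-100', '£100-200', '£200+'])
print(spend_bin.tolist())  # ['£0-100', '£100-200', '£0-100', '£200+', '£100-200']
```

Each raw spend value is replaced by the label of the bucket it falls into, which is exactly what makes the data easier to summarise.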
Data binning is widely used in ecommerce and marketing and forms the basis of the RFM model, which bins customers on the Recency, Frequency, and Monetary value of their orders.
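To make the RFM connection concrete, here's a hedged sketch of quintile-based RFM scoring using the `qcut()` function covered below. The data are synthetic and the column names and 1-5 scoring scheme are illustrative assumptions, not from a real dataset:

```python
import pandas as pd
import numpy as np

# Synthetic customer data; column names and score scheme are illustrative
rng = np.random.default_rng(42)
orders = pd.DataFrame({
    'recency_days': rng.uniform(1, 365, 200),
    'frequency': rng.uniform(1, 50, 200),
    'monetary': rng.uniform(10, 1000, 200),
})

# Score each dimension into quintiles; recency labels are reversed because
# fewer days since the last order is better
orders['R'] = pd.qcut(orders['recency_days'], q=5, labels=[5, 4, 3, 2, 1])
orders['F'] = pd.qcut(orders['frequency'], q=5, labels=[1, 2, 3, 4, 5])
orders['M'] = pd.qcut(orders['monetary'], q=5, labels=[1, 2, 3, 4, 5])

# Combine into a three-digit RFM segment, e.g. '555' for the best customers
orders['RFM'] = (orders['R'].astype(str) + orders['F'].astype(str)
                 + orders['M'].astype(str))
```

Each customer ends up with a compact segment code that can be used for targeting or reporting.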
There are several ways to bin or bucket data in Python, but the Pandas `cut()` and `qcut()` functions are the ones most widely used in data science. In this tutorial, I'll show you how you can use `cut()` and `qcut()` to bin data in Python using both discrete intervals of your own choosing and quantile-based discretization, which creates buckets holding roughly equal numbers of values. As a bonus, we'll also cover a third way of binning data using the Fisher-Jenks natural breaks classification.
To get started, open a Jupyter notebook and run the command `!pip3 install jenkspy` to install the Jenkspy package, then import Pandas, Numpy, and Jenkspy, and hide any warnings to keep things neat and tidy.
```python
!pip3 install jenkspy

import pandas as pd
import numpy as np
import jenkspy
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
```
Next we’ll import a dataset into a Pandas dataframe. The one I’m using is a customer churn dataset and includes a mixture of categorical and numeric variables. We only need a few columns, so I’ll filter it down to the essential ones only.
```python
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/telco.csv')
df = df[['customerID', 'tenure', 'MonthlyCharges']]
df.head()
```
| | customerID | tenure | MonthlyCharges |
|---|---|---|---|
| 0 | 7590-VHVEG | 1 | 29.85 |
| 1 | 5575-GNVDE | 34 | 56.95 |
| 2 | 3668-QPYBK | 2 | 53.85 |
| 3 | 7795-CFOCW | 45 | 42.30 |
| 4 | 9237-HQITU | 2 | 70.70 |
If you run `df.info()`, Pandas will tell you what data types are present. To use the `qcut()` or `cut()` functions in Pandas to bin data, the data type must be numeric, so if the column you want to bin is stored as an `object` you'll need to "cast" or convert that data to the correct numeric data type first.
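When a column contains numbers stored as strings, `pd.to_numeric()` is a common way to do that cast; its `errors='coerce'` option turns anything unparseable into `NaN` instead of raising an exception. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical object column containing numbers stored as strings
raw = pd.Series(['1', '34', '2', ' 45', 'n/a'])

# errors='coerce' converts unparseable entries to NaN instead of raising
tenure = pd.to_numeric(raw, errors='coerce')
print(tenure.dtype)  # float64, because the NaN forces a float dtype
```

In our churn dataset the `tenure` column is already numeric, so a simple `astype(int)` is enough.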
```python
df['tenure'] = df['tenure'].astype(int)
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customerID      7043 non-null   object 
 1   tenure          7043 non-null   int64  
 2   MonthlyCharges  7043 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 165.2+ KB
```
The first approach to data binning we'll apply is called quantile-based discretization, which Pandas handles via the `qcut()` function. Despite the fancy name, quantile-based discretization is actually fairly simple: it assigns values to bins of approximately (though rarely exactly) equal size. As with all other binning techniques, it converts our continuous numeric variable into a categorical one.

To use `qcut()` we'll pass in the Pandas series we want to bin, `df['tenure']`, a Python list (or array) of quantile positions, `[0, .2, .4, .6, .8, 1]`, and a Python list of bin labels to apply to the quantiles, `['Very low', 'Low', 'Moderate', 'High', 'Very high']`. We'll assign the output of `qcut()` to a new Pandas column called `tenure_qcut_bin`.
```python
df['tenure_qcut_bin'] = pd.qcut(df['tenure'],
                                q=[0, .2, .4, .6, .8, 1],
                                labels=['Very low', 'Low', 'Moderate', 'High', 'Very high'])
```
If you run the `value_counts()` function on the newly created `tenure_qcut_bin` column, you'll see that we get back five unique sets of data representing our binned values. They're roughly the same size, ranging from 1350 to 1481 customers, but are not perfectly equal.
```python
df['tenure_qcut_bin'].value_counts()
```

```
Very low     1481
Moderate     1408
Very high    1407
Low          1397
High         1350
Name: tenure_qcut_bin, dtype: int64
```
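If you want to see where those quantile boundaries actually fall, `qcut()` accepts a `retbins=True` argument that returns the computed bin edges alongside the binned series. A quick sketch on synthetic tenure-like data (randomly generated, not from the churn dataset):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for a tenure column (0-72 months)
rng = np.random.default_rng(0)
tenure = pd.Series(rng.integers(0, 73, 1000))

# retbins=True also returns the quantile edges qcut computed
binned, edges = pd.qcut(tenure,
                        q=[0, .2, .4, .6, .8, 1],
                        labels=['Very low', 'Low', 'Moderate', 'High', 'Very high'],
                        retbins=True)
print(edges)  # the min, the 20th/40th/60th/80th percentiles, and the max
```

Inspecting the edges is a useful sanity check that the quantile positions match what you intended.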
To examine the customers in each `tenure_qcut_bin` we can use the Pandas `groupby()` and `agg()` functions to group the data on the `tenure_qcut_bin` column, count the number of unique customers using `nunique`, and calculate the mean tenure using `mean`. This shows us that our data are correctly binned, with the "Very low" tenure customers having a mean tenure of 2.51, compared to 68.11 in the "Very high" tenure bin.
```python
df.groupby('tenure_qcut_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean')
)
```
| tenure_qcut_bin | customers | mean_tenure |
|---|---|---|
| Very low | 1481 | 2.510466 |
| Low | 1397 | 12.815319 |
| Moderate | 1408 | 29.933949 |
| High | 1350 | 50.660000 |
| Very high | 1407 | 68.110163 |
The Pandas `cut()` function bins data at discrete intervals and converts a continuous variable to a categorical one, which can be useful for analysis and modeling. If you provide a number of bins to the `bins` argument, the `cut()` function will slice the data into that many equal-width intervals for you, like this.
```python
df['tenure_cut_bin'] = pd.cut(df['tenure'],
                              bins=3,
                              labels=['A', 'B', 'C'])

df['tenure_cut_bin'].value_counts()
```

```
A    3210
C    2239
B    1594
Name: tenure_cut_bin, dtype: int64
```
```python
df.groupby('tenure_cut_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean')
)
```
| tenure_cut_bin | customers | mean_tenure |
|---|---|---|
| A | 3210 | 9.076324 |
| B | 1594 | 36.191343 |
| C | 2239 | 63.048682 |
However, the more commonly used approach with `cut()` is to define the bin edges or bounds yourself. This lets you use your own logic to define what goes into each bin. Let's say that as a business we define anyone with a `tenure` of 12 or less as an A, anyone with a `tenure` of 12 to 24 as a B, and anyone with a `tenure` of over 24 as a C.

To do this we need to pass four values, not three, otherwise we'll get back an error that says "Bin labels must be one fewer than the number of bin edges". To fix it we put 0 as the first value, then 12 to define the top of the first bin, then 24 to define the top of the second bin, and finally a high value to define the top of the third bin. One neat trick is to use `df['tenure'].max()` to dynamically calculate the upper bound of the top bin.
```python
df['tenure_cut_defined_bin'] = pd.cut(df['tenure'],
                                      bins=[0, 12, 24, df['tenure'].max()],
                                      labels=['A', 'B', 'C'])

df['tenure_cut_defined_bin'].value_counts()
```

```
C    3833
A    2175
B    1024
Name: tenure_cut_defined_bin, dtype: int64
```
```python
df.groupby('tenure_cut_defined_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean')
)
```
| tenure_cut_defined_bin | customers | mean_tenure |
|---|---|---|
| A | 2175 | 4.755402 |
| B | 1024 | 18.351562 |
| C | 3833 | 51.879729 |
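One caveat when defining edges yourself: by default `cut()` builds right-closed intervals like (0, 12], so a value of exactly 0 falls outside the first bin and ends up as `NaN`. Passing `include_lowest=True` closes the first interval so zero values are kept. A small sketch with made-up tenure values:

```python
import pandas as pd

# A few illustrative tenure values, including an exact zero
tenure = pd.Series([0, 5, 12, 13, 24, 40])

# Without include_lowest, 0 falls outside (0, 12] and becomes NaN
default_bins = pd.cut(tenure, bins=[0, 12, 24, 72], labels=['A', 'B', 'C'])
print(default_bins.isna().sum())  # 1

# include_lowest=True closes the first interval to [0, 12]
fixed_bins = pd.cut(tenure, bins=[0, 12, 24, 72],
                    labels=['A', 'B', 'C'], include_lowest=True)
print(fixed_bins.isna().sum())  # 0
```

It's worth checking `value_counts()` totals against the length of your dataframe to make sure no rows silently fell outside the bins.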
The final approach we'll use is a combination of the Fisher-Jenks natural breaks classification algorithm and the Pandas `cut()` function. Like the earlier approaches, this can also be used to bin data and convert continuous variables into categorical ones, but it works slightly differently.

The Fisher-Jenks algorithm returns a series of numbers that represent the edges or bounds at which natural breaks or groupings occur in the data. It's implemented in the Python Jenkspy package, so we'll use Jenkspy first to identify where the bounds lie and then pass those values to `cut()` to bin the data at the natural breaks.
```python
breaks = jenkspy.jenks_breaks(df['tenure'], nb_class=3)
breaks
```

```
[0.0, 20.0, 48.0, 72.0]
```
```python
df['tenure_jenks_bin'] = pd.cut(df['tenure'],
                                bins=breaks,
                                labels=['A', 'B', 'C'])

df.groupby('tenure_jenks_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean'),
)
```
| tenure_jenks_bin | customers | mean_tenure |
|---|---|---|
| A | 2867 | 7.541332 |
| B | 1926 | 33.854102 |
| C | 2239 | 63.048682 |
Matt Clarke, Thursday, August 18, 2022