Whether you call it data binning, data bucketing, or data discretization, the technique of grouping numeric data together is an exceptionally powerful one in data science, statistics, and machine learning. The same concept is also used in data visualisations such as histograms to make it clearer how data are distributed by grouping them into bins or buckets.
Data binning is used on continuous numeric variables to group or partition them into smaller chunks. For example, if you applied data binning to the amounts your customers spend you could see those who spent £0-100, £100-200, or £200+, making your data easier to visualise, analyse, and understand.
Data binning is widely used in ecommerce and marketing and forms the basis of the RFM model which bins customers on the Recency, Frequency, or Monetary value of their orders.
There are several ways to bin or bucket data in Python, but the Pandas
qcut() functions are the ones most widely used in data science. In this tutorial, I’ll show you how you can use
qcut() to bin data in Python using both discrete intervals of your own choosing, and through equal bucketing or quantile-based discretization. As a bonus, we’ll also cover a third way of binning data using the Fisher-Jenks natural classification.
To get started open a Jupyter notebook and run the command
!pip3 install jenkspy to install the Jenkspy package, then import Pandas, Numpy,and Jenkspy, and hide any warnings to keep things neat and tidy.
!pip3 install jenkspy
import pandas as pd import numpy as np import jenkspy import warnings warnings.filterwarnings('ignore')
Next we’ll import a dataset into a Pandas dataframe. The one I’m using is a customer churn dataset and includes a mixture of categorical and numeric variables. We only need a few columns, so I’ll filter it down to the essential ones only.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/telco.csv') df = df[['customerID', 'tenure', 'MonthlyCharges']] df.head()
If you run
df.info() Pandas will tell you what data types are present. To use the
cut() functions in Pandas to bin data, the data type must be numeric, so if the column you want to bin contains an
object then you’ll need to “cast” or convert that data to the correct numeric data type first.
df['tenure'] = df['tenure'].astype(int)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null object 1 tenure 7043 non-null int64 2 MonthlyCharges 7043 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 165.2+ KB
The first approach to data binning we’ll apply is called quantile-based discretization, which Pandas handles via the
qcut() function. Despite the fancy name, quantile-based discretization is actually fairly simple - it assigns values to bins of approximately (but never exactly) equal size. As with all other binning techniques, it converts our continuous numeric variable into a categorical one.
qcut() we’ll pass in the Pandas series we want to bin
df['tenure'], a Python list (or array) of quantile positions
[0, .2, .4, .6, .8, 1], and a Python list of bin labels to apply to the quantiles
['Very low', 'Low', 'Moderate', 'High', 'Very high']. We’ll assign the output of
qcut() to a new Pandas column called
df['tenure_qcut_bin'] = pd.qcut(df['tenure'], q=[0, .2, .4, .6, .8, 1], labels=['Very low', 'Low', 'Moderate', 'High', 'Very high'])
If you run the
value_counts() function on the newly created
tenure_qcut_bin column, you’ll see that we get back five unique sets of data representing our binned values. They’re roughly the same size, ranging from 1350 to 1481 customers, but are never perfectly equal.
Very low 1481 Moderate 1408 Very high 1407 Low 1397 High 1350 Name: tenure_qcut_bin, dtype: int64
To examine the customers in the
tenure_qcut_bin we can use the Pandas
agg() functions to group the data on the
tenure_qcut_bin column and then count the number of unique customers using
nunique and the mean tenure using
mean. This shows us that our data are correctly binned, with the “Very low” tenure customers have a mean tenure of 2.51 compared to 68.11 in the “Very high” tenure bin.
df.groupby('tenure_qcut_bin').agg( customers=('customerID', 'nunique'), mean_tenure=('tenure', 'mean') )
cut() function bins data at discrete intervals and converts a continuous variable to a categorical one, which can be useful for analysis and modeling. If you provide a number of bins to the
bins argument, the
cut() function will slice up the data for you, like this.
df['tenure_cut_bin'] = pd.cut(df['tenure'], bins=3, labels=['A', 'B', 'C'])
A 3210 C 2239 B 1594 Name: tenure_cut_bin, dtype: int64
df.groupby('tenure_cut_bin').agg( customers=('customerID', 'nunique'), mean_tenure=('tenure', 'mean') )
However, the more commonly used approach with
cut() is to define the bin edges or bounds yourself. This lets you use your own logic in defining what goes into each bin. Let’s say that as a business we define anyone with a
tenure of 12 or less as an A, anyone with a
tenure of 12 to 24 as a B, and anyone with a tenure of over 24 as a C.
To do this we need to pass four values, not three, otherwise we’ll get back an error that says “Bin labels must be one fewer than the number of bin edges”. To fix it we need to put 0 as the first value, then 12 to define the top of the first bin, then 24 to define the top of the second bin, and then put in a high value to define the top of the third bin. One neat trick for this is to use
df['tenure'].max() to dynamically calculate the upper bound of the top bin.
df['tenure_cut_defined_bin'] = pd.cut(df['tenure'], bins=[0, 12, 24, df['tenure'].max()], labels=['A', 'B', 'C'])
C 3833 A 2175 B 1024 Name: tenure_cut_defined_bin, dtype: int64
df.groupby('tenure_cut_defined_bin').agg( customers=('customerID', 'nunique'), mean_tenure=('tenure', 'mean') )
The final approach we’ll use is a combination of the Fisher-Jenks natural classification algorithm and the Pandas
cut() function. Like the earlier approaches, this can also be used to bin data and convert continuous variables into categorical ones, but it works slightly differently.
The Fisher-Jenks natural classification algorithm returns a series of numbers that represent the edges or bounds of data where natural breaks or groupings occur. It’s implemented in the Python Jenkspy package, so we’ll use this first to identify where the bounds lie and then pass in those values to
cut() to bin the data to the defined natural breaks.
breaks = jenkspy.jenks_breaks(df['tenure'], nb_class=3) breaks
[0.0, 20.0, 48.0, 72.0]
df['tenure_jenks_bin'] = pd.cut(df['tenure'], bins=breaks, labels=['A', 'B', 'C'])
df.groupby('tenure_jenks_bin').agg( customers=('customerID', 'nunique'), mean_tenure=('tenure', 'mean'), )
Matt Clarke, Thursday, August 18, 2022