Whether you call it data binning, data bucketing, or data discretization, the technique of grouping numeric data together is an exceptionally powerful one in data science, statistics, and machine learning. The same concept is also used in data visualisations such as histograms to make it clearer how data are distributed by grouping them into bins or buckets.
Data binning is used on continuous numeric variables to group or partition them into smaller chunks. For example, if you applied data binning to the amounts your customers spend you could see those who spent £0-100, £100-200, or £200+, making your data easier to visualise, analyse, and understand.
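As a quick illustration of that spend example, here's a minimal sketch using the pandas `cut()` function we'll cover in detail later. The spend figures and bin labels are invented for illustration:

```python
import pandas as pd

# Invented customer spend values for illustration
spend = pd.Series([25.00, 150.00, 99.99, 310.00, 175.00])

# Right-closed bins: (0, 100], (100, 200], and anything above 200
spend_bin = pd.cut(spend,
                   bins=[0, 100, 200, float('inf')],
                   labels=['£0-100', '£100-200', '£200+'])
print(spend_bin.tolist())  # ['£0-100', '£100-200', '£0-100', '£200+', '£100-200']
```

Each raw spend value is replaced by the label of the bucket it falls into, which is exactly what makes the data easier to summarise.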
Data binning is widely used in ecommerce and marketing and forms the basis of the RFM model, which bins customers on the Recency, Frequency, and Monetary value of their orders.
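To make the RFM connection concrete, here's a hedged sketch of quintile-based RFM scoring using the `qcut()` function covered below. The data are synthetic and the column names and 1-5 scoring scheme are illustrative assumptions, not from a real dataset:

```python
import pandas as pd
import numpy as np

# Synthetic customer data; column names and score scheme are illustrative
rng = np.random.default_rng(42)
orders = pd.DataFrame({
    'recency_days': rng.uniform(1, 365, 200),
    'frequency': rng.uniform(1, 50, 200),
    'monetary': rng.uniform(10, 1000, 200),
})

# Score each dimension into quintiles; recency labels are reversed because
# fewer days since the last order is better
orders['R'] = pd.qcut(orders['recency_days'], q=5, labels=[5, 4, 3, 2, 1])
orders['F'] = pd.qcut(orders['frequency'], q=5, labels=[1, 2, 3, 4, 5])
orders['M'] = pd.qcut(orders['monetary'], q=5, labels=[1, 2, 3, 4, 5])

# Combine into a three-digit RFM segment, e.g. '555' for the best customers
orders['RFM'] = (orders['R'].astype(str) + orders['F'].astype(str)
                 + orders['M'].astype(str))
```

Each customer ends up with a compact segment code that can be used for targeting or reporting.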
There are several ways to bin or bucket data in Python, but the Pandas `cut()` and `qcut()` functions are the ones most widely used in data science. In this tutorial, I'll show you how you can use `cut()` and `qcut()` to bin data in Python using both discrete intervals of your own choosing and quantile-based discretization, which creates buckets holding roughly equal numbers of values. As a bonus, we'll also cover a third way of binning data using the Fisher-Jenks natural breaks classification.
To get started, open a Jupyter notebook and run the command `!pip3 install jenkspy` to install the Jenkspy package, then import Pandas, Numpy, and Jenkspy, and hide any warnings to keep things neat and tidy.
```python
!pip3 install jenkspy

import pandas as pd
import numpy as np
import jenkspy
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
```
Next we’ll import a dataset into a Pandas dataframe. The one I’m using is a customer churn dataset and includes a mixture of categorical and numeric variables. We only need a few columns, so I’ll filter it down to the essential ones only.
```python
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/telco.csv')
df = df[['customerID', 'tenure', 'MonthlyCharges']]
df.head()
```
| | customerID | tenure | MonthlyCharges |
|---|---|---|---|
| 0 | 7590-VHVEG | 1 | 29.85 |
| 1 | 5575-GNVDE | 34 | 56.95 |
| 2 | 3668-QPYBK | 2 | 53.85 |
| 3 | 7795-CFOCW | 45 | 42.30 |
| 4 | 9237-HQITU | 2 | 70.70 |
If you run `df.info()`, Pandas will tell you what data types are present. To use the `qcut()` or `cut()` functions in Pandas to bin data, the data type must be numeric, so if the column you want to bin is stored as an `object` you'll need to "cast" or convert that data to the correct numeric data type first.
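When a column contains numbers stored as strings, `pd.to_numeric()` is a common way to do that cast; its `errors='coerce'` option turns anything unparseable into `NaN` instead of raising an exception. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical object column containing numbers stored as strings
raw = pd.Series(['1', '34', '2', ' 45', 'n/a'])

# errors='coerce' converts unparseable entries to NaN instead of raising
tenure = pd.to_numeric(raw, errors='coerce')
print(tenure.dtype)  # float64, because the NaN forces a float dtype
```

In our churn dataset the `tenure` column is already numeric, so a simple `astype(int)` is enough.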
```python
df['tenure'] = df['tenure'].astype(int)
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customerID      7043 non-null   object 
 1   tenure          7043 non-null   int64  
 2   MonthlyCharges  7043 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 165.2+ KB
```
The first approach to data binning we'll apply is called quantile-based discretization, which Pandas handles via the `qcut()` function. Despite the fancy name, quantile-based discretization is actually fairly simple: it assigns values to bins of approximately (though rarely exactly) equal size. As with all other binning techniques, it converts our continuous numeric variable into a categorical one.

To use `qcut()` we'll pass in the Pandas series we want to bin, `df['tenure']`, a Python list (or array) of quantile positions, `[0, .2, .4, .6, .8, 1]`, and a Python list of bin labels to apply to the quantiles, `['Very low', 'Low', 'Moderate', 'High', 'Very high']`. We'll assign the output of `qcut()` to a new Pandas column called `tenure_qcut_bin`.
```python
df['tenure_qcut_bin'] = pd.qcut(df['tenure'],
                                q=[0, .2, .4, .6, .8, 1],
                                labels=['Very low', 'Low', 'Moderate', 'High', 'Very high'])
```
If you run the `value_counts()` function on the newly created `tenure_qcut_bin` column, you'll see that we get back five unique sets of data representing our binned values. They're roughly the same size, ranging from 1350 to 1481 customers, but are not perfectly equal.
```python
df['tenure_qcut_bin'].value_counts()
```

```
Very low     1481
Moderate     1408
Very high    1407
Low          1397
High         1350
Name: tenure_qcut_bin, dtype: int64
```
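If you want to see where those quantile boundaries actually fall, `qcut()` accepts a `retbins=True` argument that returns the computed bin edges alongside the binned series. A quick sketch on synthetic tenure-like data (randomly generated, not from the churn dataset):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for a tenure column (0-72 months)
rng = np.random.default_rng(0)
tenure = pd.Series(rng.integers(0, 73, 1000))

# retbins=True also returns the quantile edges qcut computed
binned, edges = pd.qcut(tenure,
                        q=[0, .2, .4, .6, .8, 1],
                        labels=['Very low', 'Low', 'Moderate', 'High', 'Very high'],
                        retbins=True)
print(edges)  # the min, the 20th/40th/60th/80th percentiles, and the max
```

Inspecting the edges is a useful sanity check that the quantile positions match what you intended.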
To examine the customers in each `tenure_qcut_bin` we can use the Pandas `groupby()` and `agg()` functions to group the data on the `tenure_qcut_bin` column, count the number of unique customers using `nunique`, and calculate the mean tenure using `mean`. This shows us that our data are correctly binned, with the "Very low" tenure customers having a mean tenure of 2.51, compared to 68.11 in the "Very high" tenure bin.
```python
df.groupby('tenure_qcut_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean')
)
```
| tenure_qcut_bin | customers | mean_tenure |
|---|---|---|
| Very low | 1481 | 2.510466 |
| Low | 1397 | 12.815319 |
| Moderate | 1408 | 29.933949 |
| High | 1350 | 50.660000 |
| Very high | 1407 | 68.110163 |
The Pandas `cut()` function bins data at discrete intervals and converts a continuous variable to a categorical one, which can be useful for analysis and modeling. If you provide a number of bins to the `bins` argument, the `cut()` function will slice the data into that many equal-width intervals for you, like this.
```python
df['tenure_cut_bin'] = pd.cut(df['tenure'],
                              bins=3,
                              labels=['A', 'B', 'C'])

df['tenure_cut_bin'].value_counts()
```

```
A    3210
C    2239
B    1594
Name: tenure_cut_bin, dtype: int64
```
```python
df.groupby('tenure_cut_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean')
)
```
| tenure_cut_bin | customers | mean_tenure |
|---|---|---|
| A | 3210 | 9.076324 |
| B | 1594 | 36.191343 |
| C | 2239 | 63.048682 |
However, the more commonly used approach with `cut()` is to define the bin edges or bounds yourself. This lets you use your own logic to define what goes into each bin. Let's say that as a business we define anyone with a `tenure` of 12 or less as an A, anyone with a `tenure` of 12 to 24 as a B, and anyone with a `tenure` of over 24 as a C.

To do this we need to pass four values, not three, otherwise we'll get back an error that says "Bin labels must be one fewer than the number of bin edges". To fix it we put 0 as the first value, then 12 to define the top of the first bin, then 24 to define the top of the second bin, and finally a high value to define the top of the third bin. One neat trick is to use `df['tenure'].max()` to dynamically calculate the upper bound of the top bin.
```python
df['tenure_cut_defined_bin'] = pd.cut(df['tenure'],
                                      bins=[0, 12, 24, df['tenure'].max()],
                                      labels=['A', 'B', 'C'])

df['tenure_cut_defined_bin'].value_counts()
```

```
C    3833
A    2175
B    1024
Name: tenure_cut_defined_bin, dtype: int64
```
```python
df.groupby('tenure_cut_defined_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean')
)
```
| tenure_cut_defined_bin | customers | mean_tenure |
|---|---|---|
| A | 2175 | 4.755402 |
| B | 1024 | 18.351562 |
| C | 3833 | 51.879729 |
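One caveat when defining edges yourself: by default `cut()` builds right-closed intervals like (0, 12], so a value of exactly 0 falls outside the first bin and ends up as `NaN`. Passing `include_lowest=True` closes the first interval so zero values are kept. A small sketch with made-up tenure values:

```python
import pandas as pd

# A few illustrative tenure values, including an exact zero
tenure = pd.Series([0, 5, 12, 13, 24, 40])

# Without include_lowest, 0 falls outside (0, 12] and becomes NaN
default_bins = pd.cut(tenure, bins=[0, 12, 24, 72], labels=['A', 'B', 'C'])
print(default_bins.isna().sum())  # 1

# include_lowest=True closes the first interval to [0, 12]
fixed_bins = pd.cut(tenure, bins=[0, 12, 24, 72],
                    labels=['A', 'B', 'C'], include_lowest=True)
print(fixed_bins.isna().sum())  # 0
```

It's worth checking `value_counts()` totals against the length of your dataframe to make sure no rows silently fell outside the bins.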
The final approach we'll use is a combination of the Fisher-Jenks natural breaks classification algorithm and the Pandas `cut()` function. Like the earlier approaches, this can also be used to bin data and convert continuous variables into categorical ones, but it works slightly differently.

The Fisher-Jenks algorithm returns a series of numbers that represent the edges or bounds at which natural breaks or groupings occur in the data. It's implemented in the Python Jenkspy package, so we'll use Jenkspy first to identify where the bounds lie and then pass those values to `cut()` to bin the data at the natural breaks.
```python
breaks = jenkspy.jenks_breaks(df['tenure'], nb_class=3)
breaks
```

```
[0.0, 20.0, 48.0, 72.0]
```
```python
df['tenure_jenks_bin'] = pd.cut(df['tenure'],
                                bins=breaks,
                                labels=['A', 'B', 'C'])

df.groupby('tenure_jenks_bin').agg(
    customers=('customerID', 'nunique'),
    mean_tenure=('tenure', 'mean'),
)
```
| tenure_jenks_bin | customers | mean_tenure |
|---|---|---|
| A | 2867 | 7.541332 |
| B | 1926 | 33.854102 |
| C | 2239 | 63.048682 |
Matt Clarke, Thursday, August 18, 2022