Whether you call it data binning, data bucketing, or data discretization, the technique of grouping numeric data together is an exceptionally powerful one in data science, statistics, and machine learning. The same concept is also used in data visualisations such as histograms to make it clearer how data are distributed by grouping them into bins or buckets.

Data binning is used on continuous numeric variables to group or partition them into smaller chunks. For example, if you applied data binning to the amounts your customers spend you could see those who spent £0-100, £100-200, or £200+, making your data easier to visualise, analyse, and understand.

Data binning is widely used in ecommerce and marketing and forms the basis of the RFM model which bins customers on the Recency, Frequency, or Monetary value of their orders.

There are several ways to bin or bucket data in Python, but the Pandas `cut()`

and `qcut()`

functions are the ones most widely used in data science. In this tutorial, I’ll show you how you can use `cut()`

and `qcut()`

to bin data in Python using both discrete intervals of your own choosing, and through equal bucketing or quantile-based discretization. As a bonus, we’ll also cover a third way of binning data using the Fisher-Jenks natural classification.

To get started open a Jupyter notebook and run the command `!pip3 install jenkspy`

to install the Jenkspy package, then import Pandas, Numpy,and Jenkspy, and hide any warnings to keep things neat and tidy.

```
!pip3 install jenkspy
```

```
import pandas as pd
import numpy as np
import jenkspy
import warnings
warnings.filterwarnings('ignore')
```

```
pd.set_option('max_columns', 100)
```

Next we’ll import a dataset into a Pandas dataframe. The one I’m using is a customer churn dataset and includes a mixture of categorical and numeric variables. We only need a few columns, so I’ll filter it down to the essential ones only.

```
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/telco.csv')
df = df[['customerID', 'tenure', 'MonthlyCharges']]
df.head()
```

customerID | tenure | MonthlyCharges | |
---|---|---|---|

0 | 7590-VHVEG | 1 | 29.85 |

1 | 5575-GNVDE | 34 | 56.95 |

2 | 3668-QPYBK | 2 | 53.85 |

3 | 7795-CFOCW | 45 | 42.30 |

4 | 9237-HQITU | 2 | 70.70 |

If you run `df.info()`

Pandas will tell you what data types are present. To use the `qcut()`

or `cut()`

functions in Pandas to bin data, the data type must be numeric, so if the column you want to bin contains an `object`

then you’ll need to “cast” or convert that data to the correct numeric data type first.

```
df['tenure'] = df['tenure'].astype(int)
```

```
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 tenure 7043 non-null int64
2 MonthlyCharges 7043 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 165.2+ KB
```

The first approach to data binning we’ll apply is called quantile-based discretization, which Pandas handles via the `qcut()`

function. Despite the fancy name, quantile-based discretization is actually fairly simple - it assigns values to bins of approximately (but never exactly) equal size. As with all other binning techniques, it converts our continuous numeric variable into a categorical one.

To use `qcut()`

we’ll pass in the Pandas series we want to bin `df['tenure']`

, a Python list (or array) of quantile positions `[0, .2, .4, .6, .8, 1]`

, and a Python list of bin labels to apply to the quantiles `['Very low', 'Low', 'Moderate', 'High', 'Very high']`

. We’ll assign the output of `qcut()`

to a new Pandas column called `df['tenure_qcut_bin']`

.

```
df['tenure_qcut_bin'] = pd.qcut(df['tenure'],
q=[0, .2, .4, .6, .8, 1],
labels=['Very low', 'Low', 'Moderate', 'High', 'Very high'])
```

If you run the `value_counts()`

function on the newly created `tenure_qcut_bin`

column, you’ll see that we get back five unique sets of data representing our binned values. They’re roughly the same size, ranging from 1350 to 1481 customers, but are never perfectly equal.

```
df['tenure_qcut_bin'].value_counts()
```

```
Very low 1481
Moderate 1408
Very high 1407
Low 1397
High 1350
Name: tenure_qcut_bin, dtype: int64
```

To examine the customers in the `tenure_qcut_bin`

we can use the Pandas `groupby()`

and `agg()`

functions to group the data on the `tenure_qcut_bin`

column and then count the number of unique customers using `nunique`

and the mean tenure using `mean`

. This shows us that our data are correctly binned, with the “Very low” tenure customers have a mean tenure of 2.51 compared to 68.11 in the “Very high” tenure bin.

```
df.groupby('tenure_qcut_bin').agg(
customers=('customerID', 'nunique'),
mean_tenure=('tenure', 'mean')
)
```

customers | mean_tenure | |
---|---|---|

tenure_qcut_bin | ||

Very low | 1481 | 2.510466 |

Low | 1397 | 12.815319 |

Moderate | 1408 | 29.933949 |

High | 1350 | 50.660000 |

Very high | 1407 | 68.110163 |

The Pandas `cut()`

function bins data at discrete intervals and converts a continuous variable to a categorical one, which can be useful for analysis and modeling. If you provide a number of bins to the `bins`

argument, the `cut()`

function will slice up the data for you, like this.

```
df['tenure_cut_bin'] = pd.cut(df['tenure'],
bins=3,
labels=['A', 'B', 'C'])
```

```
df['tenure_cut_bin'].value_counts()
```

```
A 3210
C 2239
B 1594
Name: tenure_cut_bin, dtype: int64
```

```
df.groupby('tenure_cut_bin').agg(
customers=('customerID', 'nunique'),
mean_tenure=('tenure', 'mean')
)
```

customers | mean_tenure | |
---|---|---|

tenure_cut_bin | ||

A | 3210 | 9.076324 |

B | 1594 | 36.191343 |

C | 2239 | 63.048682 |

However, the more commonly used approach with `cut()`

is to define the bin edges or bounds yourself. This lets you use your own logic in defining what goes into each bin. Let’s say that as a business we define anyone with a `tenure`

of 12 or less as an A, anyone with a `tenure`

of 12 to 24 as a B, and anyone with a tenure of over 24 as a C.

To do this we need to pass four values, not three, otherwise we’ll get back an error that says “Bin labels must be one fewer than the number of bin edges”. To fix it we need to put 0 as the first value, then 12 to define the top of the first bin, then 24 to define the top of the second bin, and then put in a high value to define the top of the third bin. One neat trick for this is to use `df['tenure'].max()`

to dynamically calculate the upper bound of the top bin.

```
df['tenure_cut_defined_bin'] = pd.cut(df['tenure'],
bins=[0, 12, 24, df['tenure'].max()],
labels=['A', 'B', 'C'])
```

```
df['tenure_cut_defined_bin'].value_counts()
```

```
C 3833
A 2175
B 1024
Name: tenure_cut_defined_bin, dtype: int64
```

```
df.groupby('tenure_cut_defined_bin').agg(
customers=('customerID', 'nunique'),
mean_tenure=('tenure', 'mean')
)
```

customers | mean_tenure | |
---|---|---|

tenure_cut_defined_bin | ||

A | 2175 | 4.755402 |

B | 1024 | 18.351562 |

C | 3833 | 51.879729 |

The final approach we’ll use is a combination of the Fisher-Jenks natural classification algorithm and the Pandas `cut()`

function. Like the earlier approaches, this can also be used to bin data and convert continuous variables into categorical ones, but it works slightly differently.

The Fisher-Jenks natural classification algorithm returns a series of numbers that represent the edges or bounds of data where natural breaks or groupings occur. It’s implemented in the Python Jenkspy package, so we’ll use this first to identify where the bounds lie and then pass in those values to `cut()`

to bin the data to the defined natural breaks.

```
breaks = jenkspy.jenks_breaks(df['tenure'], nb_class=3)
breaks
```

```
[0.0, 20.0, 48.0, 72.0]
```

```
df['tenure_jenks_bin'] = pd.cut(df['tenure'],
bins=breaks,
labels=['A', 'B', 'C'])
```

```
df.groupby('tenure_jenks_bin').agg(
customers=('customerID', 'nunique'),
mean_tenure=('tenure', 'mean'),
)
```

customers | mean_tenure | |
---|---|---|

tenure_jenks_bin | ||

A | 2867 | 7.541332 |

B | 1926 | 33.854102 |

C | 2239 | 63.048682 |

Matt Clarke, Thursday, August 18, 2022