When you’re building a machine learning model, the feature engineering step is often the most important. From your initial small batch of features, the clever use of maths and stats can turn these into dozens or hundreds of new features, which make it much easier for your model to make its predictions.
There are loads of different ways to generate new features from your raw source data but one of the most effective is mean encoding. Unusually, this uses the target variable - or y - the one you’re trying to predict.
Ordinarily, you wouldn’t build a feature from the target variable because it introduces a strong likelihood of “data leakage”: the feature can effectively reveal the answer to your model, giving you perfect accuracy in testing but a model that fails to generalise once it predicts upon your validation dataset or is deployed into production.
If your model’s results look too good to be true, it’s worth checking that you’ve not introduced any features that might reveal the answer in the data provided.
Let’s assume we’re creating a classification model which predicts whether someone will attend a doctor’s appointment or not. Our target variable might be called attended, and it will have a value of 1 if the patient attended and 0 if they failed to show up.
While you’d have this y variable in your y_train and y_test datasets, you wouldn’t include it in either X_train or X_test, as it would simply reveal the answer to the model and give you perfect accuracy, but a model that didn’t work in the real world.
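As a quick sketch of that split (the column names here are illustrative, and it assumes scikit-learn’s train_test_split), the target goes into y while everything else stays in X:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy appointments data; column names and values are illustrative only
df = pd.DataFrame({
    'age': [34, 51, 22, 48, 29, 63],
    'postcode_area': ['A', 'B', 'A', 'B', 'A', 'B'],
    'attended': [1, 0, 1, 1, 0, 1],
})

# y holds only the target; X holds every column except the target
X = df.drop(columns=['attended'])
y = df['attended']

# The attended column never appears in X_train or X_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```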
However, you can still utilise the target variable to pass your model additional information about its possible relationship to other aspects of the dataset.
Here, we’d group the data by a given column in the DataFrame and then calculate the mean for that grouping and add it to the DataFrame as a new column. For example, perhaps we want to investigate postcode area as a feature impacting attending an appointment.
We’d simply group by postcode area and return the mean for that column to a new Pandas DataFrame column.
df['mean_encoding_postcode_area'] = df.groupby('postcode_area')['attended'].transform('mean')
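For instance, on a toy dataset (the values below are illustrative), each row receives the mean attendance rate for its postcode area; transform('mean') broadcasts the group mean back onto the rows so the result stays aligned with the DataFrame’s index:

```python
import pandas as pd

# Illustrative toy data
df = pd.DataFrame({
    'postcode_area': ['A', 'A', 'B', 'B', 'B'],
    'attended':      [1,   0,   1,   1,   0],
})

# Each row gets the mean of 'attended' for its postcode area
df['mean_encoding_postcode_area'] = (
    df.groupby('postcode_area')['attended'].transform('mean')
)

# Area A attended 1 of 2 appointments (0.5); area B attended 2 of 3 (~0.667)
print(df)
```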
To make it easier to use this technique, you might want to wrap that in a simple function to allow you to perform mean encodings for various groupings across your dataset.
def get_mean_encoding(df, group, target):
    """Group a Pandas DataFrame by a given column and return the mean
    encoding of the target variable for that grouping.

    Args:
        df: Pandas DataFrame.
        group: Column to group by.
        target: Target variable column.

    Returns:
        Mean for the target variable across the group, aligned to each row.

    Example:
        df['mean_encoding_postcode_area'] = get_mean_encoding(df, 'postcode_area', 'attended')
    """
    mean_encoded = df.groupby(group)[target].mean()
    return df[group].map(mean_encoded)
To find out which features matter most, you can examine each feature’s Pearson correlation coefficient with the target variable.
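A minimal sketch of that check (assuming a DataFrame that already contains your mean-encoded columns; the data here is illustrative) uses pandas’ corr() and sorts the coefficients:

```python
import pandas as pd

# Illustrative toy data with a mean-encoded feature already added
df = pd.DataFrame({
    'attended': [1, 0, 1, 1, 0, 1],
    'age': [34, 51, 22, 48, 29, 63],
    'mean_encoding_postcode_area': [0.8, 0.2, 0.8, 0.8, 0.2, 0.8],
})

# Pearson correlation of every numeric feature with the target,
# sorted so the strongest positive relationships appear first
correlations = (
    df.corr(numeric_only=True)['attended']
      .drop('attended')
      .sort_values(ascending=False)
)
print(correlations)
```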
Mean encoding can work really, really well. However, it doesn’t come without its own problems. The main one is that it carries a risk of overfitting.
Overfitting happens when your model learns from the noise in your dataset, effectively focusing on the specifics instead of the general trends. This generates seemingly great results in testing, but fails to deliver the same results once deployed.
There are a few things you can do to minimise overfitting with mean encoding. The most obvious thing to be careful about is the size of the groupings. If you group by something very specific, such as a patient_id, then your mean encoding does carry some risk of data leakage and could tell the model the answer, giving you accuracy scores that look (and are) too good to be true.
Stick to lower cardinality columns and utilise regularisation techniques, such as computing the encodings out-of-fold with cross-validation, to help avoid this happening to your model.
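One way to sketch the out-of-fold idea (assuming scikit-learn’s KFold; the function name and data below are illustrative) is to compute each row’s encoding from the other folds only, so a row’s own target value never contributes to its feature:

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_mean_encoding(df, group, target, n_splits=5, seed=42):
    """Leakage-resistant mean encoding: each row's value is the group mean
    of the target calculated from the *other* folds only."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, transform_idx in kf.split(df):
        # Means computed only from the fitting folds
        fold_means = df.iloc[fit_idx].groupby(group)[target].mean()
        # Groups unseen in the fitting folds fall back to the global mean
        encoded.iloc[transform_idx] = (
            df.iloc[transform_idx][group].map(fold_means)
              .fillna(global_mean)
              .values
        )
    return encoded

# Illustrative toy data
df = pd.DataFrame({
    'postcode_area': list('AABBBAAB') * 3,
    'attended':      [1, 0, 1, 1, 0, 1, 0, 1] * 3,
})
df['mean_enc'] = kfold_mean_encoding(df, 'postcode_area', 'attended', n_splits=3)
```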
As you’ll likely be creating smaller, lower cardinality grouped features using mean encoding, the other benefit is that (after feature selection and dimensionality reduction) you end up with shallower trees. For this reason, gradient boosting algorithms, like XGBoost, can often work particularly well with this technique.
Matt Clarke, Monday, March 01, 2021