How to transform categorical variables using encoders

Learn how to use Category Encoders to transform and convert categorical variables to numeric data that can be used within machine learning models.


There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on a per-feature basis, it’s often quicker and easier to make use of transformers.

Transformers are special classes designed for performing bulk operations on data. Some are built into Scikit-Learn, and you can create your own custom ones by inheriting from Scikit-Learn’s TransformerMixin. However, a much easier approach is to make use of pre-built transformers that plug into the Scikit-Learn architecture via a package such as Category Encoders.
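
To give you an idea of what that looks like, here’s a minimal sketch of a custom transformer (the class name and the date column are hypothetical, purely for illustration):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class IsWeekendTransformer(BaseEstimator, TransformerMixin):
    # Hypothetical transformer that flags weekend dates in a 'date' column

    def fit(self, X, y=None):
        # Nothing to learn for this simple transformer
        return self

    def transform(self, X):
        X = X.copy()
        X['is_weekend'] = pd.to_datetime(X['date']).dt.dayofweek >= 5
        return X

Because it follows the fit() and transform() convention, a class like this can be dropped straight into a Scikit-Learn Pipeline.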

Category Encoders is a diverse set of Scikit-Learn style transformers designed for converting categorical data into numeric forms. It works with Pandas (as an input or an output), it’s configurable, compatible with Scikit-Learn, and includes a transformer for pretty much every common categorical data encoding problem you’re likely to encounter.

There are other benefits to using Scikit-Learn transformers for preprocessing instead of Pandas. You can cross-validate the whole workflow, grid search over both the model and preprocessing hyperparameters, avoid manually adding new columns to your dataframe, and reduce the risk of data leakage.

The package includes the following encoders:

Backward Difference Encoder: Backward Difference Encoding compares the mean of the dependent variable for a level to the mean of the dependent variable for the prior level. It's a Contrast Encoder (along with Reverse Helmert and Polynomial).
BaseN Encoder: BaseN Encoding converts the numeric index of a categorical variable to a numeric form in a given base. It can work with a range of different base values to produce encodings. For example, passing the argument `base=2` to the encoder creates binary values, while larger base values can be used on higher cardinality data.
Binary Encoder: Binary Encoding sits somewhere between One Hot Encoding and Hashing, as it converts categorical data into binary digits. It is more concise than One Hot Encoding and adds fewer columns, so it is better suited to higher cardinality data than OHE.
CatBoost Encoder: CatBoost encoding, from the model of the same name, uses an ordering principle to try to reduce target leakage. It's a bit like Leave One Out encoding and works on continuous and binomial targets.
Count Encoder: Count Encoding (which is like the Count Vectorization used in NLP models) converts each categorical value to a numeric value representing its frequency within the dataset, so common categories get high values and rare categories get low values.
Generalized Linear Mixed Model Encoder: The Generalized Linear Mixed Model or GLMM Encoder is a bit like Target or M-Estimate encoding and can be used on continuous or binomial targets.
Hashing Encoder: The Hashing Encoder applies the popular "hashing trick" to convert categorical variables into a high dimensional space. It's a popular choice for high cardinality data, where One Hot Encoding wouldn't work. It works on nominal and ordinal data, but can cause information loss due to hash collisions.
Helmert Encoder: Helmert Encoding is another of the mean encoding transformers, like Target Encoding, James-Stein, and others. The version implemented in this package is reverse Helmert Encoding, which compares the mean of the target against the mean over all previous levels. Along with Backward Difference and Polynomial Encoding, it's one of the Contrast Encoders.
James-Stein Encoder: The James-Stein estimator is another type of target encoder. It combines the mean target value for the observed category with the overall mean target value to obtain a weighted average. It's designed for use with normally distributed targets.
Leave One Out Encoder: Leave One Out or LOO encoding is another target encoding technique; however, it leaves out the target for the current row when calculating the mean, which can help with data containing outliers.
M-estimate Encoder: M-estimate encoding is a bit like a simplified version of Target Encoding. Along with Target, WoE, James-Stein and LOO, it's one of the Bayesian encoders. All of them are generally good for high cardinality data.
One Hot Encoder: One Hot Encoding or OHE is one of the most widely used techniques for encoding categorical variables. Best suited to low cardinality data, it can be used to binarise values, but needs to be used with caution. It works on nominal and ordinal data.
Ordinal Encoder: The Ordinal Encoder (OE) is essentially the same as the Label Encoder (LE). It takes each unique categorical value and maps it to a number. As the name suggests, it's best for ordinal data with a rank order, as it implies ordinality to models and can therefore mislead them if used on non-ordinal data.
Polynomial Encoder: Polynomial Encoding is another of the Contrast Encoders (along with Backward Difference and Helmert). It's intended for ordered categorical variables whose levels are equally spaced.
Sum Encoder: Sum Encoding compares the mean of the target variable for a given level against the overall mean of the target across all levels.
Target Encoder: Target Encoding, or Mean Encoding as it's also known, is a powerful Bayesian encoding technique. Data are grouped by category and the mean of the target is calculated for each group. Mean encoded data often make very important features.
Weight of Evidence Encoder: Weight of Evidence or WoE encoding came from the world of finance, where it was used to assess credit risk. It's a Bayesian encoding technique (along with Target Encoding, James-Stein, M-Estimate and LOO) and can be effective on high cardinality data.
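
Before wiring any of these into a pipeline, it can help to see one in isolation. Here's a minimal sketch using the Ordinal Encoder on a toy dataframe (the colour column is made up purely for illustration):

import pandas as pd
import category_encoders as ce

# Toy data purely for illustration
df_toy = pd.DataFrame({'colour': ['red', 'green', 'blue', 'green']})

# Fit the encoder and replace each category with a numeric code
encoder = ce.OrdinalEncoder(cols=['colour'])
print(encoder.fit_transform(df_toy))

The same fit_transform() pattern applies to every encoder in the list above, which is what makes it easy to swap them in and out later.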

Load the libraries

We need quite a few libraries for this project. We’ll be using Pandas to load and display our data, Numpy for some filtering, XGBoost for our classification model, and various packages from Scikit-Learn to build the transformers and pipelines and assess the model’s accuracy. Finally, we’re using the Category Encoders package to perform our categorical variable transformations.

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
import category_encoders as ce

import warnings
warnings.filterwarnings('ignore')

Load data

The data set I’ve used is the Census Income data set from the UCI Machine Learning Repository. It contains a range of numeric and categorical features for us to encode, with the aim of predicting a person’s income from features such as their age, education, occupation, and ethnicity. The data file doesn’t include column headers, so I’ve passed in the column names using the names argument of the Pandas read_csv() function.

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 names=['age','employment_type','final_weight','education','education_score',
                       'marital_status','occupation','relationship_status','ethnicity','gender',
                       'capital_gain','capital_loss','weekly_hours','native_country','income'])
df.head()
age employment_type final_weight education education_score marital_status income
0 39 State-gov 77516 Bachelors 13 Never-married <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse <=50K
2 38 Private 215646 HS-grad 9 Divorced <=50K
3 53 Private 234721 11th 7 Married-civ-spouse <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse <=50K
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   age                  32561 non-null  int64 
 1   employment_type      32561 non-null  object
 2   final_weight         32561 non-null  int64 
 3   education            32561 non-null  object
 4   education_score      32561 non-null  int64 
 5   marital_status       32561 non-null  object
 6   occupation           32561 non-null  object
 7   relationship_status  32561 non-null  object
 8   ethnicity            32561 non-null  object
 9   gender               32561 non-null  object
 10  capital_gain         32561 non-null  int64 
 11  capital_loss         32561 non-null  int64 
 12  weekly_hours         32561 non-null  int64 
 13  native_country       32561 non-null  object
 14  income               32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

Separate features by dtype

Next we’ll separate the features in the dataframe by their datatype. There are a few different ways to achieve this. I’ve used the select_dtypes() function, passing in np.number to obtain the numeric data and exclude=[np.number] to return the categorical data. Appending .columns to the end returns an Index containing the column names. For the categorical features, we don’t want to include the target income column, so I’ve dropped that.

numeric_features = df.select_dtypes([np.number]).columns
numeric_features
Index(['age', 'final_weight', 'education_score', 'capital_gain',
       'capital_loss', 'weekly_hours'],
      dtype='object')
categorical_features = df.select_dtypes(exclude=[np.number]).drop(['income'], axis=1).columns
categorical_features
Index(['employment_type', 'education', 'marital_status', 'occupation',
       'relationship_status', 'ethnicity', 'gender', 'native_country'],
      dtype='object')

Define the model features and target

As usual, we’ll define the X feature set to include our fields, minus the target income column, then we’ll set our y data to the income column containing each person’s salary band.

X = df.drop('income', axis=1)
y = df['income']
y.head()
0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: income, dtype: object

If you print the unique values of the target column by entering y.unique() you’ll see that it contains two strings stating whether the person’s income is above or below 50K. Since the model requires a numeric label, we need to convert this to an integer. Again, you can do that in a number of ways (such as using np.where()), but the quickest is to use the preprocessing package’s LabelEncoder() class. Running fit_transform() on the y data will fit the label encoder and then return the encoded labels (this avoids the need to run fit() and then transform() separately).

y = preprocessing.LabelEncoder().fit_transform(y)
y
array([0, 0, 0, ..., 0, 0, 1])
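
If you’d rather use the np.where() approach mentioned above, a sketch like this should produce the same 0/1 labels (assuming the raw labels are '<=50K' and '>50K'; str.strip() guards against any stray whitespace in the raw file):

# Alternative to LabelEncoder: map the income strings to 0/1 manually
y_alt = np.where(df['income'].str.strip() == '>50K', 1, 0)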

Create test and train groups

Now we’ve got our dataframe ready we can split it up into the train and test datasets for our model to use. We’ll use the Scikit-Learn train_test_split() function for this. By passing in the X dataframe of raw features, the y series containing the target, and the size of the test group (i.e. 0.3 for 30%), we get back the X_train, X_test, y_train and y_test data to use in the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
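
As an aside, I haven’t set these here, but you can optionally pass random_state for a reproducible split and stratify to preserve the class balance between the two groups:

# Optional: reproducible, stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)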

Define the model

You can, of course, use any classification model for this. I’ve used XGBClassifier from XGBoost because it’s generally really effective and pretty quick to run. In normal circumstances, you’d obviously go through a careful model selection process, but we’ll skip that here for demonstration purposes.

selected_model = XGBClassifier(random_state=0)

Define the encoders

As we want to assess all of the encoders provided with the Category Encoders package, we’ll put them all into a dictionary. The dictionary key is the name of the encoder (e.g. HashingEncoder), while the value is the corresponding encoder class from the category_encoders package. To store the results of each test, I’ll create a dataframe to hold the name of the encoder and the metrics obtained.

encoders = {
    'BackwardDifferenceEncoder': ce.backward_difference.BackwardDifferenceEncoder,
    'BaseNEncoder': ce.basen.BaseNEncoder,
    'BinaryEncoder': ce.binary.BinaryEncoder,
    'CatBoostEncoder': ce.cat_boost.CatBoostEncoder,
    'HashingEncoder': ce.hashing.HashingEncoder,
    'HelmertEncoder': ce.helmert.HelmertEncoder,
    'JamesSteinEncoder': ce.james_stein.JamesSteinEncoder,
    'OneHotEncoder': ce.one_hot.OneHotEncoder,
    'LeaveOneOutEncoder': ce.leave_one_out.LeaveOneOutEncoder,
    'MEstimateEncoder': ce.m_estimate.MEstimateEncoder,
    'OrdinalEncoder': ce.ordinal.OrdinalEncoder,
    'PolynomialEncoder': ce.polynomial.PolynomialEncoder,
    'SumEncoder': ce.sum_coding.SumEncoder,
    'TargetEncoder': ce.target_encoder.TargetEncoder,
    'WOEEncoder': ce.woe.WOEEncoder
}
df_results = pd.DataFrame(columns=['encoder', 'f1', 'accuracy', 'roc'])

Create and run the pipeline

Next we’re going to loop through all of the encoders in the dictionary above and process the data using a Pipeline. While you don’t need to use a pipeline, it’s a good idea: it makes the code cleaner and easier to maintain, and reduces repetition.

For our categorical variables (which we stored in categorical_features) we’re going to create a Pipeline called categorical_transformer which uses SimpleImputer() to fill in the missing values and then uses the selected encoder from Category Encoders. For the numeric data, we’ll fill in any missing values with the mean using SimpleImputer(), then we’ll scale the data using StandardScaler(). Then we’ll use the ColumnTransformer() to run our numeric and categorical transformer pipelines to preprocess our data.

Finally, we can define another Pipeline to describe the preprocessor step above, and pass in the details on the model we selected. We then fit() the model and preprocessor on the X_train and y_train data and it runs everything for us. Once that’s done, we can then use the fitted model to predict against X_test and return the data in y_pred. Then, it’s simply a case of calculating some performance metrics and appending the output to the dataframe of results we created above.

for key in encoders:

    categorical_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('encoder', encoders[key]())
        ]
    )    

    numeric_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)
        ]
    )

    pipe = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', selected_model)
        ]
    )

    model = pipe.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    row = {
        'encoder': key,
        'f1': f1_score(y_test, y_pred, average='macro'),
        'accuracy': accuracy_score(y_test, y_pred),
        'roc': roc_auc_score(y_test, y_pred)
    }

    # DataFrame.append() has been removed in recent Pandas versions, so use pd.concat() instead
    df_results = pd.concat([df_results, pd.DataFrame([row])], ignore_index=True)

If you print out the results and rank them by the AUC ROC you’ll be able to see which approach worked best on the data set. As with hyperparameter tuning, you may find that you can improve the results by tweaking the parameters used.

df_results.head(20).sort_values(by='roc')
encoder f1 accuracy roc
8 LeaveOneOutEncoder 0.431043 0.757601 0.500000
4 HashingEncoder 0.792972 0.860784 0.770418
1 BaseNEncoder 0.812706 0.871021 0.794404
2 BinaryEncoder 0.812706 0.871021 0.794404
3 CatBoostEncoder 0.811457 0.869280 0.794979
0 BackwardDifferenceEncoder 0.814305 0.872249 0.795646
10 OrdinalEncoder 0.814305 0.872249 0.795646
9 MEstimateEncoder 0.813458 0.870816 0.796567
14 WOEEncoder 0.814858 0.872249 0.796938
11 PolynomialEncoder 0.814049 0.871225 0.797124
13 TargetEncoder 0.814175 0.871123 0.797631
5 HelmertEncoder 0.815163 0.872249 0.797656
6 JamesSteinEncoder 0.815577 0.872556 0.798002
7 OneHotEncoder 0.815463 0.872351 0.798154
12 SumEncoder 0.815463 0.872351 0.798154

Transforming specific columns

Although I didn’t use this approach in the intentionally simple example above, you can (and should) use the transformers on specific columns, rather than applying them in a less targeted fashion.

By default, if you don’t pass in any arguments to an encoder it will run on every non-numeric column. However, if you pass in a list of specific column names, you can apply the encoding to specific fields.
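
For example, most of the encoders accept a cols argument, so a sketch like this (with a couple of columns picked arbitrarily from this data set) would encode only those fields and leave everything else untouched:

# Encode only the specified columns; other columns pass through as-is
encoder = ce.TargetEncoder(cols=['employment_type', 'occupation'])
X_encoded = encoder.fit_transform(X, y)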


Matt Clarke, Saturday, March 06, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.