How to use Category Encoders to encode categorical variables

Category Encoders make it much easier to encode categorical variables during the machine learning process. Here's how to use them.


Most datasets you’ll encounter will probably contain categorical variables. They are often highly informative, but the downside is that they’re stored as object or datetime data types, such as text strings and dates, which most machine learning models can’t use directly.

Thankfully, there are a number of useful pre-modeling techniques you can employ to encode categorical variables and engineer many new and useful features that can increase the performance of your models substantially. Here we’ll go over six different pre-modeling steps you can apply for encoding categorical data prior to inputting the data into your machine learning models.

Load some data

To work through some practical examples showing how we can deal with categorical variables, load up the test dataset I’ve created. This is based on a snapshot of some of my Google Analytics data and includes a selection of mostly categorical variables which need to be handled in slightly different ways.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()
     User Type    Source  Medium      Browser  Device Category        Date  Pageviews
0  New Visitor  (direct)  (none)  Amazon Silk           mobile  2020-07-31          3
1  New Visitor  (direct)  (none)  Amazon Silk           mobile  2020-07-14          1
2  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-07-14          1
3  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-08-07          1
4  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-08-12          1

Check and correct data types

The first thing you’ll want to do upon loading up any new dataset into a Pandas DataFrame is to see what you’ve got by entering df.dtypes in a Jupyter notebook cell. The Google Analytics API doesn’t always correctly set the data types in the data returned, so whenever you’re dealing with data from this source you may need to change the odd one.

We can see here that the Date column has been identified as an object, but the others are all fine. We can use the Pandas function to_datetime() to change Date to a datetime64 data type and then re-run df.dtypes to check it’s changed it.

df.dtypes
User Type          object
Source             object
Medium             object
Browser            object
Device Category    object
Date               object
Pageviews           int64
dtype: object
df['Date'] = pd.to_datetime(df['Date'])
df.dtypes
User Type                  object
Source                     object
Medium                     object
Browser                    object
Device Category            object
Date               datetime64[ns]
Pageviews                   int64
dtype: object

Examine data cardinality

There are several ways to encode categorical data and you may need to use different approaches for different columns in your dataset. To understand which one is best, you first need to examine the “cardinality” of your data, which is just a fancy way of saying the number of unique values in each column.

You can examine cardinality by using the nunique() function on your Pandas DataFrame to count the number of unique items in each column. As you can see from the data below, the User Type column has only one value, while Medium and Device Category are low cardinality with 3 or 4 unique values. The Date, Source, and Browser columns have much higher cardinalities and we’ll need to handle their data differently. Very low cardinality columns with only a single value are of no use to models and can usually be dropped if this is a consistent pattern across the data.

df.nunique()
User Type           1
Source             19
Medium              4
Browser            17
Device Category     3
Date               30
Pageviews          13
dtype: int64
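
As a rough sketch of how the nunique() output can drive the choice of encoding, the snippet below sorts columns into constant, low and high cardinality groups. The small DataFrame is made up for illustration, and the cut-off of 4 unique values is an arbitrary assumption rather than a rule.

```python
import pandas as pd

# Small made-up frame standing in for the Google Analytics data
example = pd.DataFrame({
    'User Type': ['New Visitor'] * 6,
    'Medium': ['organic', '(none)', 'organic', 'referral', 'organic', '(none)'],
    'Browser': ['Chrome', 'Safari', 'Edge', 'Firefox', 'Opera', 'Amazon Silk'],
})

cardinality = example.nunique()

# Hypothetical cut-off: columns with this many unique values or fewer count as "low"
LOW_CARDINALITY_MAX = 4

constant_cols = cardinality[cardinality == 1].index.tolist()
low_cols = cardinality[(cardinality > 1) & (cardinality <= LOW_CARDINALITY_MAX)].index.tolist()
high_cols = cardinality[cardinality > LOW_CARDINALITY_MAX].index.tolist()

print(constant_cols)  # ['User Type']
print(low_cols)       # ['Medium']
print(high_cols)      # ['Browser']
```

The right cut-off depends on how many rows you have and which model you’re using, so treat the number as something to tune.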

Different approaches to dealing with categorical data

Option 1: Dropping categorical variables

To use the above data in most models, we’ll first need to convert it all into a numeric representation, as we can’t just add text or date fields. Alternatively, we could simply choose to just drop or remove any categorical data from the dataset so we can use the numeric values.

While there are times when dropping categorical variables might be valid, for example when the data are almost entirely unique and the cardinality is exceptionally high, just dropping the columns obviously discards data that might be potentially useful to your model.

That said, it does actually apply to our User Type column, as that only contains one value New Visitor, which is of no use to any model (assuming it’s always that way, which it obviously wouldn’t be had I not created a dataset consisting solely of new users). To drop the User Type column we can use the drop() function and axis=1 to set it to columns and inplace=True to write the changes back to the DataFrame.

df_drop_numeric = df.copy()

object_cols = ['User Type']

for col in object_cols:
    df_drop_numeric.drop([col], axis=1, inplace=True)

df_drop_numeric.head()
     Source  Medium      Browser  Device Category        Date  Pageviews
0  (direct)  (none)  Amazon Silk           mobile  2020-07-31          3
1  (direct)  (none)  Amazon Silk           mobile  2020-07-14          1
2  (direct)  (none)  Amazon Silk           tablet  2020-07-14          1
3  (direct)  (none)  Amazon Silk           tablet  2020-08-07          1
4  (direct)  (none)  Amazon Silk           tablet  2020-08-12          1
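
If you’d rather not list the columns by hand, a minimal sketch like the one below finds single-value columns automatically. The example DataFrame is made up for illustration, and nunique() ignores missing values by default, so a column containing one value plus NaNs would also be treated as constant.

```python
import pandas as pd

example = pd.DataFrame({
    'User Type': ['New Visitor', 'New Visitor', 'New Visitor'],
    'Medium': ['organic', '(none)', 'referral'],
    'Pageviews': [2, 3, 1],
})

# Columns where every row holds the same value carry no signal
constant_cols = [col for col in example.columns if example[col].nunique() == 1]

# drop() returns a new DataFrame here, leaving the original untouched
example_reduced = example.drop(columns=constant_cols)

print(constant_cols)                     # ['User Type']
print(example_reduced.columns.tolist())  # ['Medium', 'Pageviews']
```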

Option 2: One-hot encoding

The second way to deal with categorical variables is called one-hot encoding (OHE). One-hot encoding is basically a way of “binarising” categorical data and turning it into simple ones and zeros that are better suited as model features.

When given a column containing categorical data, one-hot encoding functions identify each unique variation, create a new feature column for each one and then record a 1 or 0 for every row in the DataFrame to indicate the presence or absence of the value.

Depending on the one-hot encoding technique you use, this will either return additional columns containing ones and zeros or it will return a single column containing an array of values.

When to use one-hot encoding

One-hot encoding is obviously fine on such low cardinality columns, with few unique values, but it can create issues on columns with high cardinality. If you tried it on the Date column, which contains 30 unique values for a 30-day dataset, you would end up adding 30 new columns to your DataFrame. Not only does that make your DataFrame much larger and slower to process but, crucially, it causes something commonly called “The Curse of Dimensionality”, especially when you extend the period of time you’re examining.

To cut a long story short, the Curse of Dimensionality makes your data “sparse” and high-dimensional, which creates a problem for models and can lead to poor accuracy and overfitting. Therefore, you need to carefully consider when you apply one-hot encoding, or you need to reduce the cardinality of the column before you encode (more on that later).

Perfect multicollinearity

The other thing to look out for is one-hot encoded columns that are “perfectly multicollinear” in stats terminology, meaning one can be predicted exactly from the others. For example, if your column is called sex and contains two values male or female, then one-hot encoding will create two columns sex_male and sex_female. As these are perfectly multicollinear, you can predict the value of sex_male from the 1 or 0 in sex_female. This can cause problems for some models, particularly linear models without regularisation. Since the two columns are opposites, one of them adds no information and can safely be removed.
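
One way to avoid the redundant column is the drop_first argument of Pandas’ get_dummies(), which drops the first category of each encoded column. The tiny sex column below is made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# Full encoding: the two columns always sum to 1, so one is redundant
full = pd.get_dummies(people['sex'], prefix='sex', dtype=int)

# drop_first=True removes the first category (alphabetically, "female")
reduced = pd.get_dummies(people['sex'], prefix='sex', drop_first=True, dtype=int)

print(full.columns.tolist())     # ['sex_female', 'sex_male']
print(reduced.columns.tolist())  # ['sex_male']
```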

How to one-hot encode your data

Most data science packages (including Pandas, NumPy, scikit-learn, and Keras) provide ways to one-hot encode data. The two you’ll encounter most often are the scikit-learn OneHotEncoder() class and Pandas’ get_dummies() function. They both do very similar things, but they are applied in different ways and one has a distinct advantage over the other.

get_dummies()

The Pandas get_dummies() function is the easiest of the two to use. At the most basic level, you pass get_dummies() two values - the column of the DataFrame you wish to one-hot encode and a prefix to add to the new column. This creates a new column for each unique feature found, using the prefix at the beginning of the new column name and the column value as the suffix. You can use pd.concat() to add the columns to the end of your original DataFrame.

Below we’ll copy the original DataFrame and use get_dummies() to one-hot encode the values in the Device Category column, then merge the data back to our DataFrame. Then we’ll repeat the process for the Medium column, adding a handy prefix to each of the new columns created. As there were 4 unique values in the Medium column and 3 in the Device Category column, one-hot encoding adds 7 new columns to our dataset.

df_get_dummies = df.copy()

device_encodings = pd.get_dummies(df_get_dummies['Device Category'], prefix='device')
df_get_dummies = pd.concat([df_get_dummies, device_encodings], axis=1)

medium_encodings = pd.get_dummies(df_get_dummies['Medium'], prefix='Medium')
df_get_dummies = pd.concat([df_get_dummies, medium_encodings], axis=1)

df_get_dummies.head(5)
     User Type    Source  Medium      Browser  Device Category        Date  Pageviews  device_desktop  device_mobile
0  New Visitor  (direct)  (none)  Amazon Silk           mobile  2020-07-31          3               0              1
1  New Visitor  (direct)  (none)  Amazon Silk           mobile  2020-07-14          1               0              1
2  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-07-14          1               0              0
3  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-08-07          1               0              0
4  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-08-12          1               0              0
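
As an alternative to encoding one column at a time and concatenating, get_dummies() also accepts the whole DataFrame along with a columns argument. The sketch below uses a small made-up DataFrame; note that in this mode the original categorical columns are replaced by their encodings instead of being kept alongside them.

```python
import pandas as pd

sessions = pd.DataFrame({
    'Medium': ['organic', '(none)', 'referral'],
    'Device Category': ['desktop', 'mobile', 'tablet'],
    'Pageviews': [2, 3, 1],
})

# Encode both columns in one call, giving each its own prefix
encoded = pd.get_dummies(
    sessions,
    columns=['Medium', 'Device Category'],
    prefix=['medium', 'device'],
    dtype=int,
)

print(encoded.columns.tolist())
# ['Pageviews', 'medium_(none)', 'medium_organic', 'medium_referral',
#  'device_desktop', 'device_mobile', 'device_tablet']
```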

Using OneHotEncoder()

The scikit-learn OneHotEncoder() is rather different. It encodes data as a one-hot numeric array via a transform() method. By default this returns a sparse matrix, while setting sparse_output=False (called sparse=False in scikit-learn versions before 1.2) returns a dense 2D array. The encoder expects two-dimensional input with one row per sample, so you need to pass it a DataFrame containing the column or columns you want to encode, not a single Pandas series, and then deal with the array returned.

from sklearn.preprocessing import OneHotEncoder

df_ohe = df.copy()

# Double brackets keep the selection as a one-column DataFrame (2D input)
device = df_ohe[['Device Category']]

# On scikit-learn older than 1.2, use sparse=False instead
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(device)

# Columns follow ohe.categories_: desktop, mobile, tablet
encoded[:5]
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

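The benefit of OneHotEncoder() over get_dummies() is that it returns consistent results. Once fitted, the encoder remembers the categories it saw and always produces the same columns in the same order when you transform new data. get_dummies() builds its columns from whatever values happen to be present, so re-running it on different data at a later date could produce a different set of columns, which could break the model or reduce its accuracy.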
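
The difference is easiest to see on new data that doesn’t contain every category. The sketch below uses two small made-up DataFrames, and handle_unknown='ignore' is an optional setting that encodes unseen categories as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Device Category': ['desktop', 'mobile', 'tablet', 'mobile']})
new = pd.DataFrame({'Device Category': ['mobile', 'mobile']})

# get_dummies() only sees the values present in the data it is given
print(pd.get_dummies(train['Device Category']).shape[1])  # 3
print(pd.get_dummies(new['Device Category']).shape[1])    # 1

# A fitted encoder remembers the categories from the training data
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)

# The default output is a sparse matrix, so convert it for display
new_encoded = ohe.transform(new).toarray()
print(new_encoded.shape)  # (2, 3)
```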

Option 3: Label encoding

The third approach is to identify all of the unique values within a column and assign a unique number to each one - a technique known as label encoding or integer encoding. For example, if the Browser column contained five values Chrome, Firefox, Edge, Internet Explorer and Android Webview, these might be assigned values from 0 to 4 (because counts of numbers in programming start at zero, instead of one).

Although this makes label encoding better suited to higher cardinality columns, since it uses a single column to store the data and doesn’t increase the overall size of the DataFrame, it does come with its own issues.

The main disadvantage of label encoding is that some models can misinterpret those labels as indicating an order (or ordinality) to the data, which isn’t actually the case at all. For this reason, one-hot encoding’s binary approach often gives better results. Since label encoding implies ordinality, it’s obviously great for data in which there is ordinality. For example, “small, medium, large” or “primary, secondary, tertiary”. For such simple data, mapping via dictionary works fine.
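
For ordinal data, a dictionary mapping might look like the sketch below. The size column and its ordering are made up for illustration. Spelling out the mapping matters because an automatic encoder that sorts alphabetically would number these values large, medium, small, which isn’t the real order.

```python
import pandas as pd

shirts = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Explicit mapping so the integers follow the real order of the categories
size_order = {'small': 0, 'medium': 1, 'large': 2}

# Any value missing from the dictionary would become NaN
shirts['size_encoded'] = shirts['size'].map(size_order)

print(shirts['size_encoded'].tolist())  # [0, 2, 1, 0]
```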

Using Label encoding

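Scikit-learn’s LabelEncoder() class also runs using fit_transform(). You simply pass it a column of categorical data and it assigns each unique value a number, in sorted order. You can easily add the encoding to a Pandas column in your dataframe. Note that scikit-learn intends LabelEncoder() for target labels; its OrdinalEncoder() does the same job for feature columns.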

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_le = df.copy()
labelencoder = LabelEncoder()

df_le['Device Label'] = labelencoder.fit_transform(df_le['Device Category'])
df_le[['Device Label', 'Device Category']].sample(5)
      Device Label  Device Category
7731             1           mobile
2464             2           tablet
2013             2           tablet
1853             1           mobile
4497             0          desktop

Option 4: Cardinality reduction

The other useful technique you can apply is cardinality reduction - a method for reducing the number of unique values in a column. Let’s say you also have a column called Page which includes the URL of over 1000 pages on your site. However, maybe just 10 of these generate 80% of your page views. To reduce the cardinality from 1000 to 11 you could simply take the top 10 and assign the rest a value of other. This then gives you 11 unique values to play with and allows you to apply one-hot encoding without massively increasing the sparseness of your dataset and incurring the Curse of Dimensionality.
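
The “keep the top N, call the rest other” idea can be sketched in a few lines. The keep_top_n() helper and the example page paths below are hypothetical names made up for illustration, and ties in the counts are broken arbitrarily.

```python
import pandas as pd

def keep_top_n(series, n, other_label='Other'):
    """Keep the n most frequent values and label everything else other_label."""
    top_values = series.value_counts().nlargest(n).index
    return series.where(series.isin(top_values), other_label)

pages = pd.Series(['/home', '/home', '/home', '/blog', '/blog', '/about', '/contact'])
reduced = keep_top_n(pages, n=2)

print(reduced.tolist())
# ['/home', '/home', '/home', '/blog', '/blog', 'Other', 'Other']
print(reduced.nunique())  # 3
```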

To make it easier to use this technique, I use the below function, which can apply cardinality reduction to multiple columns within a dataframe using defined thresholds based on the column cardinality.

def cols_to_reduce_cardinality(df, thresholds):
    """Reduce the number of unique values by creating a column of 
    X values and the rest marked "Other".

    Args: 
        df: Pandas DataFrame.
        thresholds: Dictionary of column and threshold, i.e. 
        {'col1' : 10, 'col2' : 200}

    Returns: 
        Original DataFrame with additional prefixed columns. The most 
        dominant values in the column will be assigned their original 
        value. The less dominant results will be assigned to Other, 
        which can help visualise and model data in some cases.
    """

    for key, value in thresholds.items():
        counts = df[key].value_counts()
        others = set(counts[counts < value].index)
        df['reduce_' + key] = df[key].replace(list(others), 'Other')

    return df

First, let’s create a copy of the dataframe and use df.nunique() to check the number of unique values in each column. Source and Browser both have high cardinality, so for demonstration purposes, let’s apply the method to these two.

df_reducing_cardinality = df.copy()
df.nunique()
User Type           1
Source             19
Medium              4
Browser            17
Device Category     3
Date               30
Pageviews          13
dtype: int64

When we run the cols_to_reduce_cardinality() function, we pass in the Pandas dataframe and a dictionary. The dictionary holds a key-value pair for each column you want to reduce: the name of the column and a threshold for the minimum number of occurrences. Any value that occurs fewer times than the threshold is replaced with Other. The function adds two new columns called reduce_Source and reduce_Browser. These now contain 7 values in the reduce_Source column and 12 in the reduce_Browser column.

df_reducing_cardinality = cols_to_reduce_cardinality(df_reducing_cardinality,
                                                     {'Source': 10, 'Browser': 5})
df_reducing_cardinality.nunique()
User Type           1
Source             19
Medium              4
Browser            17
Device Category     3
Date               30
Pageviews          13
reduce_Source       7
reduce_Browser     12
dtype: int64
df_reducing_cardinality.sample(5)
        User Type    Source   Medium  Browser  Device Category        Date  Pageviews  reduce_Source  reduce_Browser
2856  New Visitor      bing  organic     Edge          desktop  2020-08-08          2           bing            Edge
4182  New Visitor    google  organic   Chrome          desktop  2020-07-26          2         google          Chrome
4577  New Visitor    google  organic   Chrome          desktop  2020-07-23          2         google          Chrome
 145  New Visitor  (direct)   (none)   Chrome          desktop  2020-07-19          2       (direct)          Chrome
1433  New Visitor  (direct)   (none)   Safari           mobile  2020-07-21          2       (direct)          Safari
browser_encodings = pd.get_dummies(df_reducing_cardinality['reduce_Browser'], prefix='browser')
df_reducing_cardinality = pd.concat([df_reducing_cardinality, browser_encodings], axis=1)

df_reducing_cardinality.sample(5)
User Type Source Medium Browser Device Category Date Pageviews reduce_Source reduce_Browser browser_Amazon Silk
3062 New Visitor bing organic Edge desktop 2020-08-08 1 bing Edge 0
4297 New Visitor google organic Chrome desktop 2020-08-12 2 google Chrome 0
3064 New Visitor bing organic Edge desktop 2020-08-08 1 bing Edge 0
8348 New Visitor google organic Chrome mobile 2020-07-17 2 google Chrome 0
9266 New Visitor google organic Chrome mobile 2020-08-04 1 google Chrome 0

5 rows × 21 columns

Option 4: Vectorization

Vectorization is a technique for taking a categorical variable, typically a string or piece of text, and returning it as a vector of numbers. There are a number of different algorithms you can use for this, depending on the data you’re trying to encode. Two of the most commonly seen are Count Vectorization and Term-Frequency Inverse Document Frequency (TF-IDF) Vectorization.

Count Vectorization

First, we’ll create a new Pandas dataframe containing some text data, then we’ll instantiate CountVectorizer() and use fit_transform() on the column containing the text. This returns a sparse matrix of data.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = {
    'id': [1,2,3,4,5,6],
    'review': ['Your courier is terrible. My order was late and arrived damaged.',
            'I am not impressed with your courier. My product was broken.',
            'Great service. Thanks.',
            'OK service. Terrible courier.',
            'I will never use your business again.', 
            'Superb service. Thank you.']
}

df = pd.DataFrame(data, columns=['id', 'review'])

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(df['review'])
features
<6x29 sparse matrix of type '<class 'numpy.int64'>'
    with 38 stored elements in Compressed Sparse Row format>

To see what’s in the sparse matrix of data returned by fit_transform() we can use get_feature_names_out() (called get_feature_names() in scikit-learn versions before 1.0) and pass the output into a new dataframe. If you print the dataframe, you’ll see that we get an encoding for each word. The word “terrible” appeared in two reviews, so we see a 1 for each review that included this word and zero on the others, giving you a matrix showing the distribution of words across reviews.

feature_names = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())
feature_names
again am and arrived broken business courier damaged great impressed ... superb terrible thank thanks
0 0 0 1 1 0 0 1 1 0 0 ... 0 1 0 0
1 0 1 0 0 1 0 1 0 0 1 ... 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 1
3 0 0 0 0 0 0 1 0 0 0 ... 0 1 0 0
4 1 0 0 0 0 1 0 0 0 0 ... 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 1 0 1 0

6 rows × 29 columns

TF-IDF Vectorization

TF-IDF Vectorization has an advantage over Count Vectorization. Instead of simply counting how many times a word appears in a piece of text, it weights that count by how rare the word is across the “documents” in a “corpus”, so words that appear in many documents count for less - hence the name Term Frequency Inverse Document Frequency.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

data = {
    'id': [1,2,3,4,5,6],
    'review': ['Your courier is terrible. My order was late and arrived damaged.',
            'I am not impressed with your courier. My product was broken.',
            'Great service. Thanks.',
            'OK service. Terrible courier.',
            'I will never use your business again.', 
            'Superb service. Thank you.']
}

df = pd.DataFrame(data, columns=['id', 'review'])

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(df['review'])
feature_names = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())
feature_names
again am and arrived broken business courier damaged great impressed
0 0.000000 0.000000 0.333781 0.333781 0.000000 0.000000 0.231081 0.333781 0.000000 0.000000
1 0.000000 0.347033 0.000000 0.000000 0.347033 0.000000 0.240255 0.000000 0.000000 0.347033
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.635091 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.426816 0.000000 0.000000 0.000000
4 0.427206 0.000000 0.000000 0.000000 0.000000 0.427206 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

6 rows × 29 columns
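
To see where a value such as the 0.635091 for “great” in the third review comes from, the sketch below recomputes it by hand in plain Python. It assumes TfidfVectorizer()’s default settings: lowercasing, tokens of two or more word characters, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The idf() and tfidf() helpers are names made up for this example.

```python
import math
import re

reviews = [
    'Your courier is terrible. My order was late and arrived damaged.',
    'I am not impressed with your courier. My product was broken.',
    'Great service. Thanks.',
    'OK service. Terrible courier.',
    'I will never use your business again.',
    'Superb service. Thank you.',
]

# Lowercase and keep tokens of two or more word characters
docs = [re.findall(r'\b\w\w+\b', review.lower()) for review in reviews]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: rarer terms get a higher weight."""
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(doc):
    """Term count times idf, scaled so the row has unit length."""
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}

third_review = tfidf(docs[2])
print(round(third_review['great'], 6))    # 0.635091
print(round(third_review['service'], 6))  # lower, as "service" is in three reviews
```

“Great” only appears in one review, so it receives a higher weight than “service”, which appears in three.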

Option 5: Feature hashing

Feature hashing, or feature vectorization, is a more complex way of encoding categorical features and applies a technique commonly known in machine learning as “The Hashing Trick”. It’s quite complicated to understand, but it basically takes string features and returns a sparse matrix based on a hash of the value. The matrices returned are quick to generate and can be stored in a more space-efficient manner than CountVectorizer() or the similar DictVectorizer(), so the technique is fairly practical. The technique has become very popular in recent years and is now used in TensorFlow, scikit-learn, Apache Mahout, Apache Spark and various other systems.

To make it a bit easier to use, I’ve created a little helper function called feature_hashing(). This takes a dataframe and column, and returns a defined number of hash values in prefixed columns and appends them to the dataframe.

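def feature_hashing(df, column, prefix, n_features):
    """Append n_features hashed columns for a string column to the dataframe."""

    from sklearn.feature_extraction import FeatureHasher

    # Set up FeatureHasher
    fh = FeatureHasher(n_features=n_features, input_type='string')

    # FeatureHasher expects each sample to be an iterable of strings. Passing
    # raw strings made older scikit-learn versions hash each character, and
    # recent versions raise a ValueError, so split each value into characters
    # explicitly. For word-level hashing, use str(value).split() instead.
    samples = [list(str(value)) for value in df[column]]

    # Obtain sparse matrix for column
    sparse_matrix = fh.transform(samples)

    # Create feature column names, e.g. fh1 to fh8
    features = [prefix + str(i) for i in range(1, n_features + 1)]

    # Assign sparse matrix to dataframe, keeping the original index for alignment
    hashed_codes_df = pd.DataFrame(sparse_matrix.toarray(), columns=features, index=df.index)

    # Concatenate the hashed features to the original dataframe
    df = pd.concat([df, hashed_codes_df], axis=1)

    return df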

Running the function on our dataframe of customer reviews returns the requested 8 values and assigns values to them.

import pandas as pd

data = {
    'id': [1,2,3,4,5,6],
    'review': ['Your courier is terrible. My order was late and arrived damaged.',
            'I am not impressed with your courier. My product was broken.',
            'Great service. Thanks.',
            'OK service. Terrible courier.',
            'I will never use your business again.', 
            'Superb service. Thank you.']
}

df = pd.DataFrame(data, columns=['id', 'review'])
df = feature_hashing(df, 'review', 'fh', 8)
df
id review fh1 fh2 fh3 fh4 fh5 fh6 fh7 fh8
0 1 Your courier is terrible. My order was late an... 12.0 7.0 6.0 5.0 2.0 -7.0 1.0 4.0
1 2 I am not impressed with your courier. My produ... 13.0 2.0 4.0 4.0 0.0 -7.0 0.0 0.0
2 3 Great service. Thanks. 2.0 2.0 2.0 -2.0 -1.0 -5.0 0.0 0.0
3 4 OK service. Terrible courier. 4.0 3.0 0.0 0.0 1.0 -7.0 0.0 4.0
4 5 I will never use your business again. 9.0 1.0 2.0 0.0 -1.0 -5.0 1.0 0.0
5 6 Superb service. Thank you. 5.0 1.0 1.0 0.0 -2.0 -6.0 0.0 1.0
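
To get a feel for what the hashing trick does under the hood, here is a minimal pure-Python sketch. It is an illustration only and not scikit-learn’s implementation: it uses MD5 from the standard library as the hash function and simply counts tokens per bucket, while FeatureHasher() uses a different hash and can also flip signs, which is why negative values can appear in its output. The hash_tokens() helper is a made-up name.

```python
import hashlib

def hash_tokens(tokens, n_features):
    """Map tokens into a fixed-length count vector using a hash of each token."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        # The hash decides which of the n_features buckets the token lands in
        bucket = int(digest, 16) % n_features
        vector[bucket] += 1
    return vector

review = 'OK service. Terrible courier.'
tokens = review.lower().replace('.', '').split()

vector = hash_tokens(tokens, n_features=8)
print(tokens)  # ['ok', 'service', 'terrible', 'courier']
print(vector)  # always 8 values, however many distinct words there are
```

Because the vector length is fixed in advance, no vocabulary needs to be stored, but different tokens can collide in the same bucket.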

Option 6: Date encoding

Although they’re partly numeric, dates are also a type of categorical variable and they do require some work prior to use in models. Don’t be tempted simply to remove them as they’re often extremely powerful features. Instead, you can parse the date and encode its component parts.

By using the Pandas .dt accessor on a datetime64 column (no separate import of Python’s datetime module is needed), you can create new columns to hold numeric representations of information stored in the date, such as the year, month number, day number, day of week or day of year, or whether the day is a weekday or weekend. For time series datasets these nearly always hold significant value for models. Don’t feel limited to just these obvious date encodings. You can also apply this technique to working days, national holidays, and school holidays.

import numpy as np
import pandas as pd

# df was reused for the review examples above, so reload the Google Analytics data
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df['Date'] = pd.to_datetime(df['Date'])

df_date_encoding = df.copy()

df_date_encoding['year'] = df_date_encoding['Date'].dt.year
df_date_encoding['month'] = df_date_encoding['Date'].dt.month
df_date_encoding['day_of_year'] = df_date_encoding['Date'].dt.dayofyear
df_date_encoding['day_of_month'] = df_date_encoding['Date'].dt.day
# .dt.week was removed in pandas 2.0; isocalendar() is the replacement
df_date_encoding['week_of_year'] = df_date_encoding['Date'].dt.isocalendar().week
df_date_encoding['day_of_week'] = df_date_encoding['Date'].dt.dayofweek
# dayofweek runs from 0 (Monday) to 6 (Sunday), so values below 5 are weekdays
df_date_encoding['is_weekday'] = np.where(df_date_encoding['Date'].dt.dayofweek < 5, 1, 0)

df_date_encoding.head()
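
Holiday and working-day flags can be added in much the same way. The sketch below uses a small made-up DataFrame and a hypothetical one-item holiday list; in practice you would supply the dates that apply to your own market.

```python
import pandas as pd

visits = pd.DataFrame({
    'Date': pd.to_datetime(['2020-08-28', '2020-08-29', '2020-08-31', '2020-09-01'])
})

# Hypothetical holiday list for illustration only
holidays = pd.to_datetime(['2020-08-31'])

# dayofweek runs from 0 (Monday) to 6 (Sunday)
visits['is_weekend'] = (visits['Date'].dt.dayofweek >= 5).astype(int)
visits['is_holiday'] = visits['Date'].isin(holidays).astype(int)

# A working day is neither a weekend day nor a holiday
visits['is_working_day'] = (
    (visits['is_weekend'] == 0) & (visits['is_holiday'] == 0)
).astype(int)

print(visits['is_working_day'].tolist())  # [1, 0, 0, 1]
```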

Matt Clarke, Wednesday, March 03, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
