How to predict employee churn using CatBoost

Use CatBoost to create an employee churn model that predicts which of your staff are going to quit before they hand in their resignation letters.

In the field of HR analytics, data scientists are now using data from their human resources departments to predict employee churn. The techniques are fairly similar to those retailers use for predicting customer churn.

In this project, I’ll show you how to use the CatBoost algorithm to create a simple employee churn model that predicts which of your staff are most likely to leave, and to identify what could be causing them to quit.

Load the packages

First, open a Jupyter notebook and import the packages below. I’m using the NVIDIA Data Science Stack Docker container, which comes with most common data science packages pre-installed. The only one I needed to install was CatBoost, which you can obtain via pip using the command below.

!pip3 install catboost

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

Load the data

For this project, we’re using an HR analytics dataset from Kaggle. This includes a mixture of numeric and categorical data, including the satisfaction level, last evaluation, the number of projects the person has worked on, their average monthly hours (note the dataset misspells this column as average_montly_hours), the time they’ve spent with the company, and whether they’ve been promoted or had an accident at work.

The two categorical variables hold the name of their department and their salary band. Unfortunately, this dataset doesn’t include any potentially useful features to do with office culture, or how annoying the person’s colleagues or line manager are, but we’ll see how we do with what we’ve got.

df = pd.read_csv('HR_comma_sep.csv')
df.head()
   satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years Department  salary
0                0.38             0.53               2                   157                   3              0     1                      0      sales     low
1                0.80             0.86               5                   262                   6              0     1                      0      sales  medium
2                0.11             0.88               7                   272                   4              0     1                      0      sales  medium
3                0.72             0.87               5                   223                   5              0     1                      0      sales     low
4                0.37             0.52               2                   159                   3              0     1                      0      sales     low
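
Before we go any further, it’s also worth a quick look at the class balance, since churn datasets are usually skewed towards stayers. A minimal check on the left column:

# Proportion of leavers (1) versus stayers (0)
df['left'].value_counts(normalize=True)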

Feature engineering

We’re going to skip the feature engineering step in this simple example, because CatBoost includes a neat feature that encodes categorical variables itself, making it very quick and easy to use. All we need to do is create a list of the categorical column names from our dataframe. You can do this manually, but it’s also possible to automate, which saves loads of typing on larger datasets.

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
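
If you print the list, you should see just the two string columns described earlier, the department name and the salary band:

print(categorical_columns)  # expect something like ['Department', 'salary']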

Create the test and training data

Next, we’ll define the X feature set and the y column that will serve as our target variable. We’re going to include all the features in X apart from left, which is to be our target variable. We obviously need to drop this from the X feature set to avoid giving the answer to the model. We’ll then use the train_test_split() function to create the test and training data, allocating 30% to the test group.

X = df.drop(columns=['left'])
y = df['left']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
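
As an aside, churn targets are often imbalanced, so if you want the leaver rate preserved in both splits you can pass scikit-learn’s stratify parameter. A variant of the split above, not used in the rest of this project:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)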

Create the model

Creating a model with CatBoost is fairly easy. There are loads of options and parameters you can set, but we’re going to leave these on the defaults, adding only silent=True to suppress the verbose training output.

When we fit the model, we’ll pass in the training data and supply the test data via the eval_set argument, along with use_best_model=True to help reduce overfitting. The cat_features argument takes the list of categorical columns we created above. The model will take a few seconds to fit and can then be used to generate predictions, which we’ll store in y_pred.

model = CatBoostClassifier(silent=True)
model.fit(X_train, y_train, 
          eval_set=(X_test, y_test),
          use_best_model=True, 
          cat_features=categorical_columns)
y_pred = model.predict(X_test)
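
While we’re at it, the fitted model can also hint at what might be driving people to quit, which was our second aim. Here’s a minimal sketch using CatBoost’s get_feature_importance(), which returns one score per feature in the order of the training columns:

# Rank the features by CatBoost's default importance scores
importances = pd.Series(model.get_feature_importance(), index=X_train.columns)
print(importances.sort_values(ascending=False))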

Assess the model’s performance

To see how well our simple model did, we’ll assess its performance using the accuracy score, precision, recall, and the area under the receiver operating characteristic curve, or ROC/AUC for short. This shows us that our model is 98.55% accurate on the test data, and generates a ROC/AUC score of 0.974, which is pretty decent for a first attempt.

# Assign to new names so we don't shadow the imported metric functions
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('ROC/AUC:', roc_auc)
Accuracy: 0.9855555555555555
Precision: 0.9875598086124402
Recall: 0.9520295202952029
ROC/AUC: 0.9741119498431517
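
One caveat worth noting: we passed the hard 0/1 predictions to roc_auc_score(), which scores the predicted labels rather than a ranking. For a probability-based ROC/AUC, you would pass the positive-class scores from predict_proba() instead, for example:

# Use the predicted probability of leaving rather than the 0/1 label
y_prob = model.predict_proba(X_test)[:, 1]
print('ROC/AUC (probabilities):', roc_auc_score(y_test, y_prob))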

Matt Clarke, Friday, May 28, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science in his work. Matt has a Master’s degree in Internet Retailing (plus two other Master’s degrees in different fields) and specialises in the technical side of ecommerce and marketing.