In the field of HR analytics, data scientists are now using employee data from their human resources department to predict employee churn. The techniques for predicting employee churn are fairly similar to those retailers use for predicting customer churn.
In this project, I’ll show you how you can use the CatBoost algorithm to create a simple employee churn model to predict which of your staff are most likely to leave, and identify what could be causing them to quit.
First, open a Jupyter notebook and import the packages below. I’m using the NVIDIA Data Science Stack Docker container, which comes with most common data science packages pre-installed. The only one I needed to add was CatBoost, which can be installed via pip using the command below.
!pip3 install catboost
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
For this project we’re using an HR analytics dataset from Kaggle. This includes a mixture of numeric and categorical data, including the satisfaction level, last evaluation, the number of projects the person has worked on, their average monthly hours, the time they’ve been with the company, and whether they’ve been promoted or had an accident at work.
The two categorical variables hold the name of their department, and their salary band. Unfortunately, this dataset doesn’t include any potentially useful features to do with office culture, and how annoying the person’s colleagues or line manager are, but we’ll see how we do with what we’ve got.
df = pd.read_csv('HR_comma_sep.csv')
df.head()
We’re going to skip the feature engineering step in this simple example, because CatBoost includes a neat feature that encodes categorical variables itself. This makes it very quick and easy to use. All we need to do is create a list of the categorical variable column names from our dataframe. You can do this manually, but it’s also possible to automate, which saves loads of typing on larger datasets.
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
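As a quick illustration of the automated approach, here is a minimal sketch on a tiny made-up dataframe. The column names and values are hypothetical and not taken from the Kaggle file. Instead of select_dtypes(), it treats every non-numeric column as categorical using pandas' is_numeric_dtype() helper, which should give the same list whenever the categorical columns hold plain strings.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data: column names and values are illustrative only
toy_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "number_project": [2, 5, 7],
    "department": ["sales", "hr", "technical"],
    "salary": ["low", "medium", "high"],
})

# Treat any column that is not numeric as categorical
toy_categorical_columns = [
    col for col in toy_df.columns if not is_numeric_dtype(toy_df[col])
]

print(toy_categorical_columns)
```

It is worth printing the list and checking it by eye, since numeric columns that are really categories (such as ID codes) will not be picked up by either approach.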
Next, we’ll define the X feature set and the y column that will serve as our target variable. We’re going to include all the features in X apart from left, which is to be our target variable. We obviously need to drop this from the X feature set to avoid giving the answer to the model. We’ll then use the train_test_split() function to create the test and training data, allocating 30% to the test group.
X = df.drop(columns=['left'])
y = df['left']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
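Churn datasets are usually imbalanced, with far fewer leavers than stayers, so it can help to keep the class ratio the same in both groups. The sketch below is an optional variation rather than what this project does: it passes stratify to train_test_split() on a small made-up dataframe (the data and the toy_ variable names are assumptions for illustration) and prints the resulting sizes and leaver rates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 of which are leavers (left == 1)
toy_df = pd.DataFrame({
    "satisfaction_level": [i / 100 for i in range(100)],
    "left": [1] * 20 + [0] * 80,
})
toy_X = toy_df.drop(columns=["left"])
toy_y = toy_df["left"]

# stratify=toy_y keeps the proportion of leavers equal in train and test
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1, stratify=toy_y
)

print("Rows:", len(toy_X_train), len(toy_X_test))
print("Leaver rate:", toy_y_train.mean(), toy_y_test.mean())
```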
Creating a model with CatBoost is fairly easy. There are loads of options and parameters you can set, but we’re going to leave these on the defaults, adding only silent=True to suppress the verbose output enabled by default.
When we fit the model, we’ll pass in the training data, and give it the test data in the eval_set argument, along with the use_best_model=True argument to help reduce model overfitting. The cat_features argument takes the list we created above. The model will take a few seconds to fit, and can then be used to generate predictions, which we’ll store in y_pred.
model = CatBoostClassifier(silent=True)
model.fit(X_train, y_train,
          eval_set=(X_test, y_test),
          use_best_model=True,
          cat_features=categorical_columns)
y_pred = model.predict(X_test)
To see how well our simple model did, we’ll assess its performance using the accuracy score, precision, recall, and the area under the receiver operating characteristic curve, or ROC/AUC for short. This shows us that our model is 98.55% accurate on the test data, and generates a ROC/AUC score of 0.974, which is pretty decent for a first attempt.
# Store the results under new names so the imported metric functions are not overwritten
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('ROC/AUC:', roc_auc)
Accuracy: 0.9855555555555555
Precision: 0.9875598086124402
Recall: 0.9520295202952029
ROC/AUC: 0.9741119498431517
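The ROC/AUC figure above is calculated from the hard 0/1 predictions. The metric is more commonly calculated from predicted probabilities, because it measures how well the model ranks leavers above stayers across every possible threshold. The minimal sketch below uses four made-up employees to show that the two approaches can give different scores. With the model from this project, the probabilities would presumably come from model.predict_proba(X_test)[:, 1], which I haven't run here.

```python
from sklearn.metrics import roc_auc_score

# Made-up example: two stayers (0) followed by two leavers (1)
y_true = [0, 0, 1, 1]

# Hypothetical predicted probabilities of leaving
y_prob = [0.1, 0.3, 0.4, 0.8]

# Hard labels using the usual 0.5 threshold
y_label = [int(p >= 0.5) for p in y_prob]

auc_from_probabilities = roc_auc_score(y_true, y_prob)
auc_from_labels = roc_auc_score(y_true, y_label)

print('ROC/AUC from probabilities:', auc_from_probabilities)
print('ROC/AUC from hard labels:', auc_from_labels)
```

Here the probabilities rank both leavers above both stayers, but thresholding at 0.5 throws some of that ranking information away, so the label-based score is lower.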
Matt Clarke, Friday, May 28, 2021