A growing proportion of what we buy regularly is purchased via a subscription, or some other kind of contract. Most people have contracts for their internet, mobile phone, car insurance, tax, iCloud, Spotify, and pretty much anything else you can think of. Even their milk.
To help maximise customer retention, contractual or subscription-based businesses, identify customers who are reaching the ends of their contracts and assess their probability to churn, or not renew, using a customer churn model. Rather than simply using the contract renewal date, these churn models assess other features that are associated with a customer’s propensity to renew their contract, or go elsewhere. The same kind of approach can also be applied for predicting employee churn.
In this project, we’ll build a contractual churn model for contractual settings to show how it’s done. This won’t cover every step, but will be plenty to get decent results, and should get you started with a customer churn model you can refine and improve.
Non contractual churn models for ecommerce businesses are totally different to those used in contractual businesses. This is because in ecommerce you do not get to see, in CLV terms, when a customer stops being a customer. In ecommerce, you need to predict the probability of a customer being “alive”.
In subscription-based businesses and other contractual settings, you get the benefit of knowing their renewal date, so you can use much simpler models. In general, contractual businesses are able to use less complex models and make predictions with greater accuracy.
The results allow them to react accordingly with their marketing, and try and identify customers who might not renew, so they can make early contact and attempt to retain them before they defect. If a business thinks you’re unlikely to defect, they may even increase the price you pay to try and recover acquisition costs.
In subscription commerce or contractual businesses customer churn models utilise the supervised learning branch of machine learning. They’re classification models that aim to predict the probability that a customer will or won’t churn based on a given set of features, such as the number of months left on the contract, the number of complaints made, or the amount of time the customer has been with the company.
The basic approach is to collect all your historical data, examine it to identify what correlates with churn or retention, engineer new features, and train your model. Once constructed, it can then predict the probability of whether a customer will churn or not, irrespective of whether it’s seen that customer’s data before.
In this project, we’ll create a really simple customer churn model to show the basic approach you can apply. Obviously, in a business setting, you’d spend weeks fine-tuning the code and applying additional techniques to further improve the accuracy, but this basic one shows how simple it is to get started.
Open a new Jupyter notebook or Python script and import the packages below. We’ll be using the usual Pandas and Numpy, plus the scikit-learn machine learning package, and a range of models of different types. Any packages you don’t have can be installed by typing pip3 install package-name in your terminal.
import time
import pandas as pd
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import VotingClassifier
from catboost import CatBoostClassifier
As I use the NVIDIA Data Science Stack Docker container, most of these packages were already present in my data science environment, so I only needed to install a few.
!pip3 install lightgbm
!pip3 install imblearn
!pip3 install catboost
For this project I’ve used a telecommunications churn dataset from Kaggle. Each row represents a customer, and the columns include a range of metrics which state which plan the customer is on, and what customer service issues they may have experienced during their tenure. The customer churn dataset is ideal for getting to grips with how the modeling process works in contractual settings.
pd.set_option('max_rows', 100)
pd.set_option('max_colwidth', 100)
df = pd.read_csv('train.csv')
df = df.rename(columns=str.lower)
df.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
state | OH | NJ | OH | OK | MA |
account_length | 107 | 137 | 84 | 75 | 121 |
area_code | area_code_415 | area_code_415 | area_code_408 | area_code_415 | area_code_510 |
international_plan | no | no | yes | yes | no |
voice_mail_plan | yes | no | no | no | yes |
number_vmail_messages | 26 | 0 | 0 | 0 | 24 |
total_day_minutes | 161.6 | 243.4 | 299.4 | 166.7 | 218.2 |
total_day_calls | 123 | 114 | 71 | 113 | 88 |
total_day_charge | 27.47 | 41.38 | 50.9 | 28.34 | 37.09 |
total_eve_minutes | 195.5 | 121.2 | 61.9 | 148.3 | 348.5 |
total_eve_calls | 103 | 110 | 88 | 122 | 108 |
total_eve_charge | 16.62 | 10.3 | 5.26 | 12.61 | 29.62 |
total_night_minutes | 254.4 | 162.6 | 196.9 | 186.9 | 212.6 |
total_night_calls | 103 | 104 | 89 | 121 | 118 |
total_night_charge | 11.45 | 7.32 | 8.86 | 8.41 | 9.57 |
total_intl_minutes | 13.7 | 12.2 | 6.6 | 10.1 | 7.5 |
total_intl_calls | 3 | 5 | 7 | 3 | 7 |
total_intl_charge | 3.7 | 3.29 | 1.78 | 2.73 | 2.03 |
number_customer_service_calls | 1 | 0 | 2 | 3 | 3 |
churn | no | no | no | no | no |
Since we’re aiming for our model to predict customer churn, we need to set this as our target variable for the model to predict. At the moment, the churn
column contains a boolean yes
or no
value, but we need to “binarise” this to turn it into a numeric value the model can use. A simple replace()
is one of several ways to do this.
df['churn'] = df['churn'].replace(('yes', 'no'), (1, 0))
As you’ll see from examining the value_counts()
of the target variable column, this dataset is imbalanced. The positive class (customers who churned) comprise about 14% of the dataset. This is the norm for customer churn datasets in non contractual and contractual settings, but does introduce some challenges we need to handle, otherwise the model may fail to make accurate predictions.
df['churn'].value_counts()
0 3652
1 598
Name: churn, dtype: int64
Next we’ll take a look at the data and create or “engineer” some new features. These all need to be numeric so the model can use them to calculate correlation coefficients, from which it can generate predictions. We’re actually going to use an automated approach to handle some of these, but for demonstration purposes, we’ll also do a few manually. For the international_plan
and voice_mail_plan
columns, we need to use the binarising approach to encode the data from boolean values to numeric.
df['international_plan'] = df['international_plan'].replace(('yes', 'no'), (1, 0))
df['voice_mail_plan'] = df['voice_mail_plan'].replace(('yes', 'no'), (1, 0))
We can also do some maths and calculate the call charge rate for each customer to see if this can help improve the model’s predictions. Maybe customers faced with higher charges are more likely to churn? To calculate the call charge rate, I’ve created a little function that divides the charges by the number of minutes and then assigns the values to a new column. We’ll use this approach for day, night, international, and evening call charges.
def call_charge_rate(df, minutes_column, charges_column):
return df[charges_column] / df[minutes_column]
df['charge_rate_day'] = call_charge_rate(df, 'total_day_minutes', 'total_day_charge')
df['charge_rate_night'] = call_charge_rate(df, 'total_night_minutes', 'total_night_charge')
df['charge_rate_intl'] = call_charge_rate(df, 'total_intl_minutes', 'total_intl_charge')
df['charge_rate_eve'] = call_charge_rate(df, 'total_eve_minutes', 'total_eve_charge')
Next, we’ll use a clever technique called mean encoding. This is often very powerful. It works by calculating the mean of the target variable for a given group. For example, what is the mean churn rate for customers by state? Are those in some states more likely to churn that others? We can use the same technique for other columns where there’s a natural grouping, such as the customers on the international plan and the voice mail plan.
def get_mean_encoding(df, group, target):
"""Group a Pandas DataFrame via a given column and return
the mean of the target variable for that grouping.
Args:
:param df: Pandas DataFrame.
:param group: Column to group by.
:param target: Target variable column.
Returns:
Mean for the target variable across the group.
Example:
df['sector_mean_encoded'] = get_mean_encoding(df, 'sector', 'converted')
"""
mean_encoded = df.groupby(group)[target].mean()
return df[group].map(mean_encoded)
df['mean_encoded_state'] = get_mean_encoding(df, 'state', 'churn')
df['mean_encoded_international_plan'] = get_mean_encoding(df, 'international_plan', 'churn')
df['mean_encoded_voice_mail_plan'] = get_mean_encoding(df, 'voice_mail_plan', 'churn')
Finally, we’ve got two columns - state
and area_code
- that contain a wider variety of values. There are a few ways you can handle these, but one-hot encoding (via the Pandas get_dummies()
function) is often as good as anything, at least when the number of unique values isn’t too high.
df = pd.get_dummies(df, columns=['state','area_code'])
Now we’ve converted most of our text-based features into numeric ones, and created some new features using our analysis of the data, we can examine which ones are most correlated with customer churn. To do this, I’ve used the Pandas corr()
function, which creates a Pearson correlation and returns the coefficient for each column against our target column.
The columns nearest the top, which have values closer to 1, are those that are most highly correlated with churn. Customers on the international plan are most likely to lapse, so are those who have made more calls to the customer service team, or who have had more day charges. The churn rate also differs by state and the plan customers are on.
One important thing to note from the data below is that there’s also some collinearity. Some of the features, such as mean_encoded_international_plan
and international_plan
have equal correlations. That means one of them is redundant and could be removed with no impact to the model. I’ve skipped this, but I would advise that you use an approach such as Recursive Feature Elimination to select only the features required, as this usually gives an additional improvement.
df[df.columns[1:]].corr()['churn'][:].sort_values(ascending=False).to_frame()
churn | |
---|---|
churn | 1.000000 |
mean_encoded_international_plan | 0.259053 |
international_plan | 0.259053 |
number_customer_service_calls | 0.221220 |
total_day_minutes | 0.215272 |
total_day_charge | 0.215263 |
mean_encoded_state | 0.142246 |
mean_encoded_voice_mail_plan | 0.114643 |
total_eve_minutes | 0.078855 |
total_eve_charge | 0.078852 |
state_NJ | 0.056891 |
total_intl_minutes | 0.055186 |
total_intl_charge | 0.055177 |
total_night_minutes | 0.046647 |
total_night_charge | 0.046641 |
state_WA | 0.033577 |
state_MD | 0.033157 |
state_CA | 0.032023 |
state_MT | 0.028598 |
state_NV | 0.026023 |
state_OK | 0.025333 |
state_TX | 0.023493 |
state_SC | 0.020288 |
state_MS | 0.017031 |
state_ME | 0.016433 |
state_MN | 0.016356 |
area_code_area_code_510 | 0.016309 |
state_KS | 0.013182 |
state_MI | 0.013182 |
state_CT | 0.012440 |
total_day_calls | 0.011640 |
state_NY | 0.011350 |
state_DE | 0.008681 |
state_MA | 0.006981 |
state_IN | 0.006462 |
state_AR | 0.005332 |
state_OR | 0.004801 |
state_TN | 0.004430 |
state_SD | 0.002298 |
state_KY | 0.000193 |
area_code_area_code_408 | -0.001251 |
state_WV | -0.002123 |
charge_rate_intl | -0.002965 |
state_NH | -0.004916 |
state_NM | -0.004916 |
state_AL | -0.005381 |
state_MO | -0.006256 |
state_OH | -0.006258 |
total_eve_calls | -0.006817 |
charge_rate_eve | -0.006853 |
state_UT | -0.007469 |
state_PA | -0.007754 |
charge_rate_night | -0.008185 |
state_FL | -0.008648 |
state_IA | -0.009729 |
state_VT | -0.010096 |
state_GA | -0.011141 |
state_DC | -0.011173 |
state_CO | -0.011235 |
state_ID | -0.012648 |
total_night_calls | -0.012699 |
area_code_area_code_415 | -0.013004 |
state_LA | -0.014503 |
state_WY | -0.015413 |
state_ND | -0.018618 |
state_AZ | -0.019453 |
state_NC | -0.021194 |
charge_rate_day | -0.021843 |
state_IL | -0.025630 |
state_AK | -0.026074 |
state_NE | -0.027455 |
state_WI | -0.028649 |
state_RI | -0.029826 |
total_intl_calls | -0.034334 |
state_HI | -0.034674 |
state_VA | -0.040493 |
number_vmail_messages | -0.100347 |
voice_mail_plan | -0.114643 |
The next step is to take our original dataframe and create the X
dataset of features and the y
dataset containing the target variable. We will be training the model on X
so it must not include anything that causes “data leakage” and tells the model the answer, so we definitely need to drop the target variable column. We’ll also drop the international_plan
column which is collinear with mean_encoded_international_plan
.
X = df.drop(['churn', 'international_plan'], axis=1)
y = df['churn']
To divide X
and y
into the train and test datasets we need to train the model we will use the train_test_split()
function from scikit-learn. We’ll assign 30% of the data to the test groups using the argument test_size=0.3
, and we’ll use the stratify=y
option to ensure the target variable is present in the test and train data in equal proportions. The random_state=0
argument means we get reproducible results each time we run the code, rather than a random mix, which may give us different results.
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
random_state=0,
stratify=y
)
The next bits of code are a little more complicated, but on the plus side, they’re written in such a way that you
can re-use them across projects. What we’re doing here is creating a helper function called get_pipeline()
that creates a scikit-learn pipeline. This
takes our X
data and a scikit-learn model object (i.e. XGBClassier()
), and then performs a series of consistent
tasks on the data passed to each model via a scikit-learn pipeline.
First, we’re extracting the numeric columns from X
and returning a list, then we’re doing the same with the
categorical data columns. We then define a pipeline for the numeric data which uses SimpleImputer()
to
intelligently impute missing values with realistic data. For the categorical data, we’re applying the
OneHotEncoder()
to ensure everything is numeric. These settings get passed to the ColumnTransformer()
.
Next, we’re using the pipeline from imblearn
, which is specifically designed for working with imbalanced datasets
like this one. The pipeline first runs our preprocessor
we just created, then uses the Synthetic Minority
Oversampling Technique (SMOTE) to handle the class imbalance. Finally, we use the MinMaxScaler()
to put our data
on a consistent scale, then pass in the model
and return a bundled_pipeline
.
def get_pipeline(X, model):
"""Return a pipeline to preprocess data and bundle with a model.
Args:
X (object): X_train data.
model (object): scikit-learn model object, i.e. XGBClassifier
Returns:
Pipeline (object): Pipeline steps.
"""
numeric_columns = list(X.select_dtypes(exclude=['object']).columns.values.tolist())
categorical_columns = list(X.select_dtypes(include=['object']).columns.values.tolist())
numeric_pipeline = SimpleImputer(strategy='constant')
categorical_pipeline = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
transformers=[
('numeric', numeric_pipeline, numeric_columns),
('categorical', categorical_pipeline, categorical_columns),
], remainder='passthrough'
)
bundled_pipeline = imbpipeline(steps=[
('preprocessor', preprocessor),
('smote', SMOTE(random_state=1)),
('scaler', MinMaxScaler()),
('model', model)
])
return bundled_pipeline
To give you a rough idea of what this does, here’s the output when we run it on a base XGBClassifier()
model. As you can see, we get back a Pipeline()
with each of the steps we created above, in which our original data gets preprocessed, oversampled, scaled, and returned.
model = XGBClassifier()
example_pipeline = get_pipeline(X_train, model)
example_pipeline
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('numeric',
SimpleImputer(strategy='constant'),
['account_length',
'voice_mail_plan',
'number_vmail_messages',
'total_day_minutes',
'total_day_calls',
'total_day_charge',
'total_eve_minutes',
'total_eve_calls',
'total_eve_charge',
'total_night_minutes',
'total_night_calls',...
importance_type='gain',
interaction_constraints=None, learning_rate=None,
max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))])
Now we’ve got our nifty reusable pipeline all set up, we can create another helper function to handle the model selection step. What we’re doing here is defining a big dictionary of classifier models, and then looping through them, running the data through our pipeline above, and using cross validation to assess which model is best.
We’ll use a bunch of classifiers of different types to see which is most effective, and we’ll also use an approach called model stacking to combine the predictions of several models together in a new model using a VotingClassifier()
. This can often give you better results than a single model alone. Once the model has run, its output gets added to a Pandas dataframe so we can compare the results.
def select_model(X, y, pipeline=None):
"""Test a range of classifiers and return their performance metrics on training data.
Args:
X (object): Pandas dataframe containing X_train data.
y (object): Pandas dataframe containing y_train data.
pipeline (object): Pipeline from get_pipeline().
Return:
df (object): Pandas dataframe containing model performance data.
"""
classifiers = {}
classifiers.update({"DummyClassifier": DummyClassifier(strategy='most_frequent')})
classifiers.update({"XGBClassifier": XGBClassifier(use_label_encoder=False,
eval_metric='logloss',
objective='binary:logistic',
)})
classifiers.update({"LGBMClassifier": LGBMClassifier()})
classifiers.update({"RandomForestClassifier": RandomForestClassifier()})
classifiers.update({"DecisionTreeClassifier": DecisionTreeClassifier()})
classifiers.update({"ExtraTreeClassifier": ExtraTreeClassifier()})
classifiers.update({"ExtraTreesClassifier": ExtraTreeClassifier()})
classifiers.update({"AdaBoostClassifier": AdaBoostClassifier()})
classifiers.update({"KNeighborsClassifier": KNeighborsClassifier()})
classifiers.update({"RidgeClassifier": RidgeClassifier()})
classifiers.update({"SGDClassifier": SGDClassifier()})
classifiers.update({"BaggingClassifier": BaggingClassifier()})
classifiers.update({"BernoulliNB": BernoulliNB()})
classifiers.update({"SVC": SVC()})
classifiers.update({"CatBoostClassifier":CatBoostClassifier(silent=True)})
# Stacking
models = []
models = []
models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
models.append(('BaggingClassifier', BaggingClassifier()))
classifiers.update({"VotingClassifier (XGBClassifier, CatBoostClassifier, BaggingClassifier)": VotingClassifier(models)})
models = []
models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
models.append(('LGBMClassifier', LGBMClassifier()))
models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
classifiers.update({"VotingClassifier (XGBClassifier, LGBMClassifier, CatBoostClassifier)": VotingClassifier(models)})
models = []
models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
models.append(('RandomForestClassifier', RandomForestClassifier()))
models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
classifiers.update({"VotingClassifier (XGBClassifier, RandomForestClassifier, DecisionTreeClassifier)": VotingClassifier(models)})
models = []
models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('ExtraTreeClassifier', ExtraTreeClassifier()))
classifiers.update({"VotingClassifier (XGBClassifier, AdaBoostClassifier, ExtraTreeClassifier)": VotingClassifier(models)})
models = []
models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
models.append(('ExtraTreesClassifier', ExtraTreesClassifier()))
classifiers.update({"VotingClassifier (XGBClassifier, ExtraTreesClassifier)": VotingClassifier(models)})
df_models = pd.DataFrame(columns=['model', 'run_time', 'accuracy'])
for key in classifiers:
start_time = time.time()
pipeline = get_pipeline(X_train, classifiers[key])
cv = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
row = {'model': key,
'run_time': format(round((time.time() - start_time)/60,2)),
'accuracy': cv.mean(),
}
df_models = df_models.append(row, ignore_index=True)
df_models = df_models.sort_values(by='accuracy', ascending=False)
return df_models
Running the select_model()
function on our training data takes a minute or so. The best independent models were CatBoostClassifier
, XGBClassifier
, and LGBMClassifier
. However, it was the VotingClassifier
which includes a combination of these models that generated the top result overall. The extra performance is minimal, and you may be able to achieve better results by tuning an additional model, but it’s not bad.
models = select_model(X_train, y_train)
models.head(20)
model | run_time | accuracy | |
---|---|---|---|
15 | VotingClassifier (XGBClassifier, CatBoostClassifier, BaggingClassifier) | 0.28 | 0.957647 |
16 | VotingClassifier (XGBClassifier, LGBMClassifier, CatBoostClassifier) | 0.24 | 0.957311 |
14 | CatBoostClassifier | 0.21 | 0.956975 |
1 | XGBClassifier | 0.02 | 0.955630 |
2 | LGBMClassifier | 0.03 | 0.954286 |
17 | VotingClassifier (XGBClassifier, RandomForestClassifier, DecisionTreeClassifier) | 0.08 | 0.950252 |
11 | BaggingClassifier | 0.04 | 0.938151 |
18 | VotingClassifier (XGBClassifier, AdaBoostClassifier, ExtraTreeClassifier) | 0.06 | 0.926723 |
3 | RandomForestClassifier | 0.05 | 0.926387 |
19 | VotingClassifier (XGBClassifier, ExtraTreesClassifier) | 0.06 | 0.915966 |
4 | DecisionTreeClassifier | 0.01 | 0.891765 |
7 | AdaBoostClassifier | 0.04 | 0.868235 |
0 | DummyClassifier | 0.0 | 0.859160 |
13 | SVC | 0.04 | 0.844706 |
6 | ExtraTreesClassifier | 0.0 | 0.813445 |
5 | ExtraTreeClassifier | 0.0 | 0.806723 |
9 | RidgeClassifier | 0.0 | 0.765042 |
12 | BernoulliNB | 0.0 | 0.737479 |
10 | SGDClassifier | 0.0 | 0.735126 |
8 | KNeighborsClassifier | 0.0 | 0.732773 |
Finally, we can take our best model - the VotingClassifier
comprising XGBClassifier
, LGBMClassifier
and CatBoostClassifier
- and fit the data on this. To do this step, we’ll first define our stacked model, then we’ll pass its configuration to get_pipeline()
with our training data. Then, we’ll fit()
the training data and use predict()
to return our predictions from the newly trained model.
stacked_models = []
stacked_models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
stacked_models.append(('LGBMClassifier', LGBMClassifier()))
stacked_models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
stacked_model = VotingClassifier(stacked_models)
bundled_pipeline = get_pipeline(X_train, stacked_model)
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)
There are a few ways we can assess the performance of a classification model. I’ve gone with a few of the most commonly used metrics: accuracy score, ROC/AUC, and the F1 score. Our VotingClassifier()
model returns a solid accuracy score of 0.9592 (95.92%). This is enough to have taken the 13th rank in the Kaggle competition upon which this dataset is based. The top score achieved was 97.33%, so we’re not too far behind.
roc_auc = roc_auc_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1_score = f1_score(y_test, y_pred)
print('ROC/AUC:', roc_auc)
print('Accuracy:', accuracy)
print('F1 score:', f1_score)
ROC/AUC: 0.8874678872894833
Accuracy: 0.9592156862745098
F1 score: 0.844311377245509
To examine how well the model performed in a little more detail we can make use of the classification_report()
and confusion_matrix()
functions. The classification report shows us the precision, recall, and F1 score for our predictions.
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.97 0.99 0.98 1096
1 0.91 0.79 0.84 179
accuracy 0.96 1275
macro avg 0.94 0.89 0.91 1275
weighted avg 0.96 0.96 0.96 1275
The confusion matrix (possibly so-named because everyone finds it so confusing to interpret) shows us that:
Out of 1275 predictions, we got it right 1223 times, and we got it wrong just 52 times.
confusion_matrix(y_test, y_pred)
array([[1082, 14],
[ 38, 141]])
Our final score of 95.92% accuracy is pretty good (just outside the top 10 on Kaggle). However, with a bit more effort we should be able to improve upon this, because we’ve skipped a few steps for brevity. If you were deploying this in the real world, you’d put in more hours to further improve the scores.
The two main things we skipped were feature selection and model hyperparameter tuning. As we saw earlier, there are some collinear features present, so adopting a technique such as Recursive Feature Elimination could see the score go up.
Similarly, tuning the model parameters should also get us a little further improvement. Finally, there are a bunch of other feature engineering methods that could be applied to see if these aid the accuracy score.
Matt Clarke, Saturday, August 14, 2021