How to create a contractual churn model in scikit-learn

Whether you sell magazine subscriptions, mobile phone contracts, broadband, or car insurance, contractual churn models can help your business. Here's how to build one.


A growing proportion of what we buy regularly is purchased via a subscription, or some other kind of contract. Most people have contracts for their internet, mobile phone, car insurance, tax, iCloud, Spotify, and pretty much anything else you can think of. Even their milk.

To help maximise customer retention, contractual or subscription-based businesses identify customers who are reaching the end of their contracts and assess their probability of churning, or not renewing, using a customer churn model. Rather than simply using the contract renewal date, these churn models assess other features associated with a customer’s propensity to renew their contract or go elsewhere. The same kind of approach can also be applied to predicting employee churn.

In this project, we’ll build a churn model for a contractual setting to show how it’s done. This won’t cover every step, but it will be plenty to get decent results, and should get you started with a customer churn model you can refine and improve.

Contractual churn vs. non-contractual churn

Non-contractual churn models for ecommerce businesses are quite different from those used in contractual businesses. In ecommerce there’s no contract end date, so, in customer lifetime value (CLV) terms, you never directly observe when a customer stops being a customer. Instead, you need to predict the probability of a customer still being “alive”.

In subscription-based businesses and other contractual settings, you get the benefit of knowing each customer’s renewal date, so contractual businesses are generally able to use much simpler models and make predictions with greater accuracy.

The results allow them to react accordingly with their marketing: identifying customers who might not renew, so they can make early contact and attempt to retain them before they defect. If a business thinks you’re unlikely to defect, it may even increase the price you pay to try to recover acquisition costs.

How contractual churn models work

In subscription commerce and other contractual businesses, customer churn models use the supervised learning branch of machine learning. They’re classification models that aim to predict the probability that a customer will or won’t churn based on a given set of features, such as the number of months left on the contract, the number of complaints made, or the length of time the customer has been with the company.

The basic approach is to collect all your historical data, examine it to identify what correlates with churn or retention, engineer new features, and train your model. Once constructed, it can then predict the probability of whether a customer will churn or not, irrespective of whether it’s seen that customer’s data before.

In this project, we’ll create a really simple customer churn model to show the basic approach you can apply. Obviously, in a business setting, you’d spend weeks fine-tuning the code and applying additional techniques to further improve the accuracy, but this basic one shows how simple it is to get started.

Load the packages

Open a new Jupyter notebook or Python script and import the packages below. We’ll be using the usual Pandas and Numpy, plus the scikit-learn machine learning package, and a range of models of different types. Any packages you don’t have can be installed by typing pip3 install package-name in your terminal.

import time
import pandas as pd
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import VotingClassifier
from catboost import CatBoostClassifier

As I use the NVIDIA Data Science Stack Docker container, most of these packages were already present in my data science environment, so I only needed to install a few.

!pip3 install lightgbm
!pip3 install imbalanced-learn
!pip3 install catboost

Load the data

For this project I’ve used a telecommunications churn dataset from Kaggle. Each row represents a customer, and the columns include a range of metrics covering which plan the customer is on, how they used the service, and what customer service issues they may have experienced during their tenure. The customer churn dataset is ideal for getting to grips with how the modeling process works in contractual settings.

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)
df = pd.read_csv('train.csv')
df = df.rename(columns=str.lower)
df.head().T
0 1 2 3 4
state OH NJ OH OK MA
account_length 107 137 84 75 121
area_code area_code_415 area_code_415 area_code_408 area_code_415 area_code_510
international_plan no no yes yes no
voice_mail_plan yes no no no yes
number_vmail_messages 26 0 0 0 24
total_day_minutes 161.6 243.4 299.4 166.7 218.2
total_day_calls 123 114 71 113 88
total_day_charge 27.47 41.38 50.9 28.34 37.09
total_eve_minutes 195.5 121.2 61.9 148.3 348.5
total_eve_calls 103 110 88 122 108
total_eve_charge 16.62 10.3 5.26 12.61 29.62
total_night_minutes 254.4 162.6 196.9 186.9 212.6
total_night_calls 103 104 89 121 118
total_night_charge 11.45 7.32 8.86 8.41 9.57
total_intl_minutes 13.7 12.2 6.6 10.1 7.5
total_intl_calls 3 5 7 3 7
total_intl_charge 3.7 3.29 1.78 2.73 2.03
number_customer_service_calls 1 0 2 3 3
churn no no no no no


Define the target variable

Since we’re aiming for our model to predict customer churn, we need to set this as our target variable for the model to predict. At the moment, the churn column contains a yes or no string, but we need to “binarise” this to turn it into a numeric value the model can use. A simple replace() is one of several ways to do this.

df['churn'] = df['churn'].replace(('yes', 'no'), (1, 0))
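
Equivalently, you could map the values with a dictionary; use one approach or the other, not both.

df['churn'] = df['churn'].map({'yes': 1, 'no': 0})  # alternative to the replace() call above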

As you’ll see from examining the value_counts() of the target variable column, this dataset is imbalanced. The positive class (customers who churned) comprises about 14% of the dataset. This is the norm for customer churn datasets in both contractual and non-contractual settings, but it does introduce some challenges we need to handle, otherwise the model may fail to make accurate predictions.

df['churn'].value_counts()
0    3652
1     598
Name: churn, dtype: int64
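
To see the split as proportions rather than raw counts, value_counts() also accepts a normalize argument, which confirms roughly 86% retained versus 14% churned.

df['churn'].value_counts(normalize=True)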

Engineer some features

Next we’ll take a look at the data and create or “engineer” some new features. These all need to be numeric before the model can learn from them and generate predictions. We’re actually going to use an automated approach to handle some of these, but for demonstration purposes, we’ll also do a few manually. For the international_plan and voice_mail_plan columns, we’ll use the same binarising approach to encode the data from yes/no strings to numeric values.

df['international_plan'] = df['international_plan'].replace(('yes', 'no'), (1, 0))
df['voice_mail_plan'] = df['voice_mail_plan'].replace(('yes', 'no'), (1, 0))

We can also do some maths and calculate the call charge rate for each customer to see if this can help improve the model’s predictions. Maybe customers faced with higher charges are more likely to churn? To calculate the call charge rate, I’ve created a little function that divides the charges by the number of minutes and then assigns the values to a new column. We’ll use this approach for day, night, international, and evening call charges.

def call_charge_rate(df, minutes_column, charges_column):
    return df[charges_column] / df[minutes_column]
df['charge_rate_day'] = call_charge_rate(df, 'total_day_minutes', 'total_day_charge')
df['charge_rate_night'] = call_charge_rate(df, 'total_night_minutes', 'total_night_charge')
df['charge_rate_intl'] = call_charge_rate(df, 'total_intl_minutes', 'total_intl_charge')
df['charge_rate_eve'] = call_charge_rate(df, 'total_eve_minutes', 'total_eve_charge')
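
One thing to watch: if any customer has zero minutes in a period, this division will produce inf or NaN values, which some models can’t handle. If that occurs in your data, a hedged variant of the function below defaults such customers to a rate of zero (that default is my assumption, not something in the original dataset).

def safe_call_charge_rate(df, minutes_column, charges_column):
    # Guard against division by zero when a customer has no minutes in a period.
    rate = df[charges_column] / df[minutes_column]
    return rate.replace([np.inf, -np.inf], np.nan).fillna(0)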

Next, we’ll use a technique called mean encoding, which is often very powerful. It works by calculating the mean of the target variable for a given group. For example, what is the mean churn rate for customers by state? Are those in some states more likely to churn than others? We can use the same technique for other columns where there’s a natural grouping, such as the customers on the international plan and the voice mail plan.

def get_mean_encoding(df, group, target):
    """Group a Pandas DataFrame via a given column and return
    the mean of the target variable for that grouping.
    Args:
        :param df: Pandas DataFrame.
        :param group: Column to group by.
        :param target: Target variable column.
    Returns:
        Mean for the target variable across the group.
    Example:
        df['sector_mean_encoded'] = get_mean_encoding(df, 'sector', 'converted')
    """

    mean_encoded = df.groupby(group)[target].mean()
    return df[group].map(mean_encoded)
df['mean_encoded_state'] = get_mean_encoding(df, 'state', 'churn')
df['mean_encoded_international_plan'] = get_mean_encoding(df, 'international_plan', 'churn')
df['mean_encoded_voice_mail_plan'] = get_mean_encoding(df, 'voice_mail_plan', 'churn')
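
One caveat worth flagging: because the mapping above is computed on the full dataset, a little of the target leaks into the features. A more cautious sketch, using hypothetical df_train and df_test splits, fits the mapping on training rows only and applies it to both sets.

# Hedged sketch: df_train and df_test are hypothetical pre-split DataFrames.
# Fit the encoding on the training rows only, then map it onto both sets,
# falling back to the overall training churn rate for unseen groups.
state_means = df_train.groupby('state')['churn'].mean()
df_train['mean_encoded_state'] = df_train['state'].map(state_means)
df_test['mean_encoded_state'] = df_test['state'].map(state_means).fillna(df_train['churn'].mean())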

Finally, we’ve got two columns - state and area_code - that contain a wider variety of values. There are a few ways you can handle these, but one-hot encoding (via the Pandas get_dummies() function) is often as good as anything, at least when the number of unique values isn’t too high.

df = pd.get_dummies(df, columns=['state','area_code'])

Examine correlations with the target variable

Now that we’ve converted most of our text-based features into numeric ones, and created some new features using our analysis of the data, we can examine which are most correlated with customer churn. To do this, I’ve used the Pandas corr() function, which calculates Pearson correlation coefficients and returns the coefficient for each column against our target column.

The columns nearest the top, which have values closer to 1, are those that are most highly correlated with churn. Customers on the international plan are most likely to lapse, as are those who have made more calls to the customer service team, or who have higher day charges. The churn rate also differs by state and by the plan customers are on.

One important thing to note from the data below is that there’s also some collinearity. Some of the features, such as mean_encoded_international_plan and international_plan, have identical correlations, which means one of them is redundant and could be removed with no impact on the model. I’ve skipped a full feature selection step here, but I would advise using an approach such as Recursive Feature Elimination to select only the features required, as this usually gives an additional improvement.

df[df.columns[1:]].corr()['churn'][:].sort_values(ascending=False).to_frame()
churn
churn 1.000000
mean_encoded_international_plan 0.259053
international_plan 0.259053
number_customer_service_calls 0.221220
total_day_minutes 0.215272
total_day_charge 0.215263
mean_encoded_state 0.142246
mean_encoded_voice_mail_plan 0.114643
total_eve_minutes 0.078855
total_eve_charge 0.078852
state_NJ 0.056891
total_intl_minutes 0.055186
total_intl_charge 0.055177
total_night_minutes 0.046647
total_night_charge 0.046641
state_WA 0.033577
state_MD 0.033157
state_CA 0.032023
state_MT 0.028598
state_NV 0.026023
state_OK 0.025333
state_TX 0.023493
state_SC 0.020288
state_MS 0.017031
state_ME 0.016433
state_MN 0.016356
area_code_area_code_510 0.016309
state_KS 0.013182
state_MI 0.013182
state_CT 0.012440
total_day_calls 0.011640
state_NY 0.011350
state_DE 0.008681
state_MA 0.006981
state_IN 0.006462
state_AR 0.005332
state_OR 0.004801
state_TN 0.004430
state_SD 0.002298
state_KY 0.000193
area_code_area_code_408 -0.001251
state_WV -0.002123
charge_rate_intl -0.002965
state_NH -0.004916
state_NM -0.004916
state_AL -0.005381
state_MO -0.006256
state_OH -0.006258
total_eve_calls -0.006817
charge_rate_eve -0.006853
state_UT -0.007469
state_PA -0.007754
charge_rate_night -0.008185
state_FL -0.008648
state_IA -0.009729
state_VT -0.010096
state_GA -0.011141
state_DC -0.011173
state_CO -0.011235
state_ID -0.012648
total_night_calls -0.012699
area_code_area_code_415 -0.013004
state_LA -0.014503
state_WY -0.015413
state_ND -0.018618
state_AZ -0.019453
state_NC -0.021194
charge_rate_day -0.021843
state_IL -0.025630
state_AK -0.026074
state_NE -0.027455
state_WI -0.028649
state_RI -0.029826
total_intl_calls -0.034334
state_HI -0.034674
state_VA -0.040493
number_vmail_messages -0.100347
voice_mail_plan -0.114643

Split the data into a train and test group

The next step is to take our original dataframe and create the X dataset of features and the y dataset containing the target variable. We will be training the model on X, so it must not include anything that causes “data leakage” and tells the model the answer, which means we definitely need to drop the target variable column. We’ll also drop the international_plan column, which is collinear with mean_encoded_international_plan.

X = df.drop(['churn', 'international_plan'], axis=1)
y = df['churn']

To divide X and y into the train and test datasets we need to train and evaluate the model, we’ll use the train_test_split() function from scikit-learn. We’ll assign 30% of the data to the test group using the argument test_size=0.3, and we’ll use the stratify=y option to ensure the target variable is present in the train and test data in equal proportions. The random_state=0 argument means we get reproducible results each time we run the code, rather than a random split, which may give us different results.

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=0,
                                                    stratify=y
                                                   )

Create a pipeline

The next bits of code are a little more complicated, but on the plus side, they’re written in such a way that you can re-use them across projects. What we’re doing here is creating a helper function called get_pipeline() that builds a scikit-learn pipeline. It takes our X data and a scikit-learn model object (e.g. XGBClassifier()), and performs a consistent series of preprocessing steps on the data before it reaches each model.

First, we extract the numeric columns from X and return a list, then we do the same with the categorical data columns. We then define a step for the numeric data which uses SimpleImputer() to fill any missing values (with strategy='constant' it simply fills numeric gaps with a constant, zero by default). For the categorical data, we apply the OneHotEncoder() to ensure everything is numeric. These settings get passed to the ColumnTransformer().

Next, we use the pipeline from imblearn, which is specifically designed for working with imbalanced datasets like this one. The pipeline first runs the preprocessor we just created, then uses the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class and handle the class imbalance. Finally, we use the MinMaxScaler() to put our data on a consistent scale, then pass in the model and return a bundled_pipeline.

def get_pipeline(X, model):
    """Return a pipeline to preprocess data and bundle with a model.
    
    Args:
        X (object): X_train data. 
        model (object): scikit-learn model object, e.g. XGBClassifier
    
    Returns: 
        Pipeline (object): Pipeline steps. 
    """
    
    numeric_columns = X.select_dtypes(exclude=['object']).columns.tolist()
    categorical_columns = X.select_dtypes(include=['object']).columns.tolist()
    numeric_pipeline = SimpleImputer(strategy='constant')
    categorical_pipeline = OneHotEncoder(handle_unknown='ignore')

    preprocessor = ColumnTransformer(
        transformers=[
            ('numeric', numeric_pipeline, numeric_columns),
            ('categorical', categorical_pipeline, categorical_columns),
            ], remainder='passthrough'
    )

    bundled_pipeline = imbpipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=1)),
        ('scaler', MinMaxScaler()),
        ('model', model)
    ])
    
    return bundled_pipeline

To give you a rough idea of what this does, here’s the output when we run it on a base XGBClassifier() model. As you can see, we get back a Pipeline() with each of the steps we created above, in which our original data gets preprocessed, oversampled, scaled, and returned.

model = XGBClassifier()
example_pipeline = get_pipeline(X_train, model)
example_pipeline
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  SimpleImputer(strategy='constant'),
                                                  ['account_length',
                                                   'voice_mail_plan',
                                                   'number_vmail_messages',
                                                   'total_day_minutes',
                                                   'total_day_calls',
                                                   'total_day_charge',
                                                   'total_eve_minutes',
                                                   'total_eve_calls',
                                                   'total_eve_charge',
                                                   'total_night_minutes',
                                                   'total_night_calls',...
                               importance_type='gain',
                               interaction_constraints=None, learning_rate=None,
                               max_delta_step=None, max_depth=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, n_estimators=100,
                               n_jobs=None, num_parallel_tree=None,
                               random_state=None, reg_alpha=None,
                               reg_lambda=None, scale_pos_weight=None,
                               subsample=None, tree_method=None,
                               validate_parameters=None, verbosity=None))])

Select the best model for the job

Now that we’ve got our reusable pipeline set up, we can create another helper function to handle the model selection step. What we’re doing here is defining a big dictionary of classifier models, then looping through them, running the data through our pipeline above, and using cross-validation to assess which model is best.

We’ll use a bunch of classifiers of different types to see which is most effective, and we’ll also combine the predictions of several models in a voting ensemble using a VotingClassifier(). This can often give you better results than a single model alone. Once each model has run, its output gets added to a Pandas dataframe so we can compare the results.

def select_model(X, y, pipeline=None):
    """Test a range of classifiers and return their performance metrics on training data.
    
    Args:
        X (object): Pandas dataframe containing X_train data. 
        y (object): Pandas dataframe containing y_train data. 
        pipeline (object): Pipeline from get_pipeline().

    Return:
        df (object): Pandas dataframe containing model performance data. 
    """
    
    classifiers = {}
    classifiers.update({"DummyClassifier": DummyClassifier(strategy='most_frequent')})
    classifiers.update({"XGBClassifier": XGBClassifier(use_label_encoder=False, 
                                                       eval_metric='logloss',
                                                       objective='binary:logistic',
                                                      )})
    classifiers.update({"LGBMClassifier": LGBMClassifier()})
    classifiers.update({"RandomForestClassifier": RandomForestClassifier()})
    classifiers.update({"DecisionTreeClassifier": DecisionTreeClassifier()})
    classifiers.update({"ExtraTreeClassifier": ExtraTreeClassifier()})
    classifiers.update({"ExtraTreesClassifier": ExtraTreeClassifier()})    
    classifiers.update({"AdaBoostClassifier": AdaBoostClassifier()})
    classifiers.update({"KNeighborsClassifier": KNeighborsClassifier()})
    classifiers.update({"RidgeClassifier": RidgeClassifier()})
    classifiers.update({"SGDClassifier": SGDClassifier()})
    classifiers.update({"BaggingClassifier": BaggingClassifier()})
    classifiers.update({"BernoulliNB": BernoulliNB()})
    classifiers.update({"SVC": SVC()})
    classifiers.update({"CatBoostClassifier":CatBoostClassifier(silent=True)})
    
    # Voting ensembles
    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
    models.append(('BaggingClassifier', BaggingClassifier()))
    classifiers.update({"VotingClassifier (XGBClassifier, CatBoostClassifier, BaggingClassifier)": VotingClassifier(models)})

    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('LGBMClassifier', LGBMClassifier()))
    models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
    classifiers.update({"VotingClassifier (XGBClassifier, LGBMClassifier, CatBoostClassifier)": VotingClassifier(models)})
    
    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('RandomForestClassifier', RandomForestClassifier()))
    models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
    classifiers.update({"VotingClassifier (XGBClassifier, RandomForestClassifier, DecisionTreeClassifier)": VotingClassifier(models)})

    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('AdaBoostClassifier', AdaBoostClassifier()))
    models.append(('ExtraTreeClassifier', ExtraTreeClassifier()))
    classifiers.update({"VotingClassifier (XGBClassifier, AdaBoostClassifier, ExtraTreeClassifier)": VotingClassifier(models)})
    
    models = []
    models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
    models.append(('ExtraTreesClassifier', ExtraTreesClassifier()))
    classifiers.update({"VotingClassifier (XGBClassifier, ExtraTreesClassifier)": VotingClassifier(models)})    
    
    rows = []

    for key in classifiers:

        start_time = time.time()

        pipeline = get_pipeline(X, classifiers[key])

        cv = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

        rows.append({'model': key,
                     'run_time': format(round((time.time() - start_time) / 60, 2)),
                     'accuracy': cv.mean(),
                     })

    df_models = pd.DataFrame(rows).sort_values(by='accuracy', ascending=False)
    return df_models

Running the select_model() function on our training data takes a minute or so. The best independent models were CatBoostClassifier, XGBClassifier, and LGBMClassifier. However, it was a VotingClassifier combining several of these models that generated the top result overall. The extra performance over the best single model is minimal, and you may be able to achieve better results by tuning an individual model, but it’s not bad.

models = select_model(X_train, y_train)
models.head(20)
model run_time accuracy
15 VotingClassifier (XGBClassifier, CatBoostClassifier, BaggingClassifier) 0.28 0.957647
16 VotingClassifier (XGBClassifier, LGBMClassifier, CatBoostClassifier) 0.24 0.957311
14 CatBoostClassifier 0.21 0.956975
1 XGBClassifier 0.02 0.955630
2 LGBMClassifier 0.03 0.954286
17 VotingClassifier (XGBClassifier, RandomForestClassifier, DecisionTreeClassifier) 0.08 0.950252
11 BaggingClassifier 0.04 0.938151
18 VotingClassifier (XGBClassifier, AdaBoostClassifier, ExtraTreeClassifier) 0.06 0.926723
3 RandomForestClassifier 0.05 0.926387
19 VotingClassifier (XGBClassifier, ExtraTreesClassifier) 0.06 0.915966
4 DecisionTreeClassifier 0.01 0.891765
7 AdaBoostClassifier 0.04 0.868235
0 DummyClassifier 0.0 0.859160
13 SVC 0.04 0.844706
6 ExtraTreesClassifier 0.0 0.813445
5 ExtraTreeClassifier 0.0 0.806723
9 RidgeClassifier 0.0 0.765042
12 BernoulliNB 0.0 0.737479
10 SGDClassifier 0.0 0.735126
8 KNeighborsClassifier 0.0 0.732773

Fit the best model

Finally, we can take one of our best-performing models - the VotingClassifier comprising XGBClassifier, LGBMClassifier, and CatBoostClassifier, which came within a fraction of a percent of the top result - and fit it to the data. To do this step, we’ll first define our stacked model, then pass its configuration to get_pipeline() with our training data. Then, we’ll fit() the pipeline on the training data and use predict() to return our predictions from the newly trained model.

stacked_models = []
stacked_models.append(('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', objective='binary:logistic')))
stacked_models.append(('LGBMClassifier', LGBMClassifier()))
stacked_models.append(('CatBoostClassifier', CatBoostClassifier(silent=True)))
stacked_model = VotingClassifier(stacked_models)
bundled_pipeline = get_pipeline(X_train, stacked_model)
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)
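
The predict() call returns hard 0 or 1 labels. Since the aim we set out with was to assess each customer’s probability of churning, you can also retrieve probabilities. Note that a VotingClassifier only exposes predict_proba() when built with voting='soft', so the sketch below refits a soft-voting variant of the same ensemble; this is an optional extra rather than part of the workflow above.

# Hedged sketch: a hard-voting VotingClassifier has no predict_proba(), so refit
# the same ensemble with soft voting to obtain churn probabilities per customer.
soft_model = VotingClassifier(stacked_models, voting='soft')
soft_pipeline = get_pipeline(X_train, soft_model)
soft_pipeline.fit(X_train, y_train)
churn_probabilities = soft_pipeline.predict_proba(X_test)[:, 1]  # P(churn) for each test customer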

Assess the performance of the model

There are a few ways we can assess the performance of a classification model. I’ve gone with a few of the most commonly used metrics: accuracy score, ROC/AUC, and the F1 score. Our VotingClassifier() model returns a solid accuracy score of 0.9592 (95.92%). This is enough to have taken the 13th rank in the Kaggle competition upon which this dataset is based. The top score achieved was 97.33%, so we’re not too far behind.

roc_auc = roc_auc_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)  # use a new name so we don't overwrite the imported f1_score function
print('ROC/AUC:', roc_auc)
print('Accuracy:', accuracy)
print('F1 score:', f1)
ROC/AUC: 0.8874678872894833
Accuracy: 0.9592156862745098
F1 score: 0.844311377245509
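
As an aside, roc_auc_score() calculated on hard 0/1 predictions understates the model’s ability to rank customers by risk. If you generated churn_probabilities with the soft-voting sketch earlier, scoring those instead gives a fairer ROC/AUC figure.

print('ROC/AUC (probabilities):', roc_auc_score(y_test, churn_probabilities))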

To examine how well the model performed in a little more detail we can make use of the classification_report() and confusion_matrix() functions. The classification report shows us the precision, recall, and F1 score for our predictions.

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1096
           1       0.91      0.79      0.84       179

    accuracy                           0.96      1275
   macro avg       0.94      0.89      0.91      1275
weighted avg       0.96      0.96      0.96      1275

The confusion matrix (possibly so-named because everyone finds it so confusing to interpret) shows us that there were:

  • 1082 true negatives (the customers didn’t churn, and we predicted this correctly)
  • 141 true positives (the customers did churn, and we predicted this correctly)
  • 14 false positives (the customers didn’t churn, but we wrongly predicted that they would)
  • 38 false negatives (the customers did churn, but we wrongly predicted that they wouldn’t)

Out of 1275 predictions, we got it right 1223 times, and we got it wrong just 52 times.

confusion_matrix(y_test, y_pred)
array([[1082,   14],
       [  38,  141]])
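
If you find the raw array hard to read, the seaborn and matplotlib packages we imported earlier can render it as an annotated heatmap; here’s a minimal sketch.

# Minimal sketch: visualise the confusion matrix as an annotated heatmap.
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                 xticklabels=['retained', 'churned'],
                 yticklabels=['retained', 'churned'])
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()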

Next steps

Our final score of 95.92% accuracy is pretty good (just outside the top 10 on Kaggle). However, with a bit more effort we should be able to improve upon this, because we’ve skipped a few steps for brevity. If you were deploying this in the real world, you’d put in more hours to further improve the scores.

The two main things we skipped were feature selection and model hyperparameter tuning. As we saw earlier, there are some collinear features present, so adopting a technique such as Recursive Feature Elimination could see the score go up.
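
As a rough illustration, here’s a hedged sketch using scikit-learn’s RFECV to pick a feature subset with cross-validation. It assumes X_train contains only numeric columns with no missing or infinite values, which you’d want to verify first.

# Hedged sketch: recursive feature elimination with cross-validation.
from sklearn.feature_selection import RFECV

selector = RFECV(estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
                 step=1, cv=5, scoring='accuracy')
selector.fit(X_train, y_train)
selected_features = X_train.columns[selector.support_]
print(len(selected_features), 'features retained')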

Similarly, tuning the model parameters should also get us a little further improvement. Finally, there are a bunch of other feature engineering methods that could be applied to see if these aid the accuracy score.
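
For the tuning step, GridSearchCV, which we imported at the start but never used, can search parameters straight through the pipeline. The grid below is purely illustrative rather than a tuned recommendation.

# Hedged sketch: tune the XGBClassifier inside the voting ensemble via the pipeline.
# Parameter names are prefixed with the pipeline step ('model') and the estimator
# name given in the VotingClassifier ('XGBClassifier'); the values are illustrative.
param_grid = {
    'model__XGBClassifier__max_depth': [3, 5, 7],
    'model__XGBClassifier__n_estimators': [100, 200],
}

search = GridSearchCV(bundled_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))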

Matt Clarke, Saturday, August 14, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master’s degree in Internet Retailing (plus two other Master’s degrees in different fields) and specialises in the technical side of ecommerce and marketing.