Although all business know the importance of retaining customers, few companies are actually able to measure customer retention accurately, and fewer still can predict which ones will churn or be retained. This partly stems from the fact that customer retention is a complex problem, and how you measure it is completely dependent on the transactional nature of your business.
Retention models, or churn models as they’re more commonly known, set out to predict which customers are likely to churn, and which are likely to be retained. However, they need to be created in different ways, depending on whether this business is contractual or non-contractual. Measuring retention, churn, or attrition, is easiest in contractual settings, such as a gym membership, phone contract, or insurance policy where the customer has a contract with the business. Since contracts have a clear end date or renewal date, it’s easy to see who is a customer now, and when they’re due to stop being a customer, unless they renew.
You can also examine the behaviour of the customer, to see what sort of experience they’ve had, such as whether they have been in touch with customer service, or whether they’ve been using the product or service. This is a good way to predict whether they’re likely to renew or not. There are countless reasons why customers defect or churn. I’ve explained how you can identify the reasons why your customers churn in a previous post.
In non-contractual settings, such as a supermarket or ecommerce business, it’s much harder to measure churn. You could try to predict whether a customer will order in a future period by using RFM features or purchase latency as model features, but usually the most effective way to build a non-contractual churn model is to use the Beta Geometric Negative Binomial Bayes Distribution or BG/NBD model. It’s generally wrong to assume that if a customer doesn’t order within X days that they’re likely to churn, as they may have simply forgotten to order, or they may have been away on holiday, or they might just behave differently to other customers.
In this project we’ll be building a customer retention model for a contractual setting and will be aiming to predict which customers are retained and which will churn based on the experience they’ve had as customers of a phone provider. Our contractual churn model will examine how much they’ve used the service, when they use it, how much they’ve been charged, and whether they’ve been in touch with customer service. We’ll also be using XGBoost and Optuna for hyperparameter tuning.
To get started, open a new Jupyter Notebook and import the packages below. We’ll be using the XGBClassifier from XGBoost to create our retention model, and we’ll be using a range of scikit-learn packages to create pipelines and evaluate our model. Finally, we’ll be using the Optuna Bayesian optimization package to tune the hyperparameters of our model.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
import optuna
Next, load up your dataset. I’m using the Kaggle Customer Churn 2020 dataset for this demonstration. The dataset is based on a telecoms provider and includes details on the location of the customer, the telephone plan they have, how they use it, how much they’re being charged, and how many times they’ve had to call the customer service department.
df = pd.read_csv('churn_train.csv')
df = df.rename(columns=str.lower)
df.T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 4240 | 4241 | 4242 | 4243 | 4244 | 4245 | 4246 | 4247 | 4248 | 4249 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
state | OH | NJ | OH | OK | MA | MO | LA | WV | IN | RI | ... | AR | WA | MN | ND | AZ | MT | WV | NC | HI | VT |
account_length | 107 | 137 | 84 | 75 | 121 | 147 | 117 | 141 | 65 | 74 | ... | 127 | 80 | 150 | 140 | 97 | 83 | 73 | 75 | 50 | 86 |
area_code | area_code_415 | area_code_415 | area_code_408 | area_code_415 | area_code_510 | area_code_415 | area_code_408 | area_code_415 | area_code_415 | area_code_415 | ... | area_code_415 | area_code_510 | area_code_408 | area_code_510 | area_code_510 | area_code_415 | area_code_408 | area_code_408 | area_code_408 | area_code_415 |
international_plan | no | no | yes | yes | no | yes | no | yes | no | no | ... | no | no | no | no | no | no | no | no | no | no |
voice_mail_plan | yes | no | no | no | yes | no | no | yes | no | no | ... | yes | no | no | no | no | no | no | no | yes | yes |
number_vmail_messages | 26 | 0 | 0 | 0 | 24 | 0 | 0 | 37 | 0 | 0 | ... | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 34 |
total_day_minutes | 161.6 | 243.4 | 299.4 | 166.7 | 218.2 | 157.0 | 184.5 | 258.6 | 129.1 | 187.7 | ... | 157.6 | 157.0 | 170.0 | 244.7 | 252.6 | 188.3 | 177.9 | 170.7 | 235.7 | 129.4 |
total_day_calls | 123 | 114 | 71 | 113 | 88 | 79 | 97 | 84 | 137 | 127 | ... | 107 | 101 | 115 | 115 | 89 | 70 | 89 | 101 | 127 | 102 |
total_day_charge | 27.47 | 41.38 | 50.9 | 28.34 | 37.09 | 26.69 | 31.37 | 43.96 | 21.95 | 31.91 | ... | 26.79 | 26.69 | 28.9 | 41.6 | 42.94 | 32.01 | 30.24 | 29.02 | 40.07 | 22.0 |
total_eve_minutes | 195.5 | 121.2 | 61.9 | 148.3 | 348.5 | 103.1 | 351.6 | 222.0 | 228.5 | 163.4 | ... | 280.6 | 208.8 | 162.7 | 258.6 | 340.3 | 243.8 | 131.2 | 193.1 | 223.0 | 267.1 |
total_eve_calls | 103 | 110 | 88 | 122 | 108 | 94 | 80 | 111 | 83 | 148 | ... | 49 | 127 | 138 | 101 | 91 | 88 | 82 | 126 | 126 | 104 |
total_eve_charge | 16.62 | 10.3 | 5.26 | 12.61 | 29.62 | 8.76 | 29.89 | 18.87 | 19.42 | 13.89 | ... | 23.85 | 17.75 | 13.83 | 21.98 | 28.93 | 20.72 | 11.15 | 16.41 | 18.96 | 22.7 |
total_night_minutes | 254.4 | 162.6 | 196.9 | 186.9 | 212.6 | 211.8 | 215.8 | 326.4 | 208.8 | 196.0 | ... | 75.1 | 113.3 | 267.2 | 231.3 | 256.5 | 213.7 | 186.2 | 129.1 | 297.5 | 154.8 |
total_night_calls | 103 | 104 | 89 | 121 | 118 | 96 | 90 | 97 | 111 | 94 | ... | 77 | 109 | 77 | 112 | 67 | 79 | 89 | 104 | 116 | 100 |
total_night_charge | 11.45 | 7.32 | 8.86 | 8.41 | 9.57 | 9.53 | 9.71 | 14.69 | 9.4 | 8.82 | ... | 3.38 | 5.1 | 12.02 | 10.41 | 11.54 | 9.62 | 8.38 | 5.81 | 13.39 | 6.97 |
total_intl_minutes | 13.7 | 12.2 | 6.6 | 10.1 | 7.5 | 7.1 | 8.7 | 11.2 | 12.7 | 9.1 | ... | 8.0 | 16.2 | 8.3 | 7.5 | 8.8 | 10.3 | 11.5 | 6.9 | 9.9 | 9.3 |
total_intl_calls | 3 | 5 | 7 | 3 | 7 | 6 | 4 | 5 | 6 | 5 | ... | 4 | 2 | 2 | 6 | 5 | 6 | 6 | 7 | 5 | 16 |
total_intl_charge | 3.7 | 3.29 | 1.78 | 2.73 | 2.03 | 1.92 | 2.35 | 3.02 | 3.43 | 2.46 | ... | 2.16 | 4.37 | 2.24 | 2.03 | 2.38 | 2.78 | 3.11 | 1.86 | 2.67 | 2.51 |
number_customer_service_calls | 1 | 0 | 2 | 3 | 3 | 0 | 1 | 0 | 4 | 0 | ... | 1 | 2 | 0 | 1 | 1 | 0 | 3 | 1 | 2 | 0 |
churn | no | no | no | no | no | no | no | no | yes | no | ... | no | no | no | yes | yes | no | no | no | no | no |
20 rows × 4250 columns
Since this is a contractual setting, the dataset includes a churn column that indicates whether the customer left the company or not. We will encode this column as a binary variable, where 1 indicates that the customer left the company and 0 indicates that the customer stayed with the company. We’ll then define the X features set to include all the columns apart from the target variable and the y target variable to be the churn column.
df['churn'] = df['churn'].replace(('yes', 'no'), (1, 0))
X = df.drop(['churn'], axis=1)
y = df['churn']
Next, we’ll use the train_test_split
function to split the data into train and test sets. We’ll set the test_size to 0.3, which means that 30% of the data will be used for testing and 80% for training. We’ll also set the random_state to 0, which means that the data will be split in the same way every time we run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
random_state=0)
To make things easier and more maintainable in the future, we’ll be using scikit-learn pipelines for this project and specifically we’ll be creating a custom feature engineering pipeline. Since we may need to treat categorical and numerical features differently, we’ll first define the categorical and numerical features using the select_dtypes()
function to select the categorical and numerical features.
categorical_columns = list(X_train.select_dtypes(include=['object']).columns.values.tolist())
numeric_columns = list(X_train.select_dtypes(exclude=['object']).columns.values.tolist())
Next, we’ll create some custom feature engineering functions to incorporate into our feature engineering pipeline. These are the sorts of things you’d ordinarily handle in Pandas, but it’s much cleaner to do it in a pipeline. For each customer we’ll create new columns to hold the total_net_minutes
, total_net_calls
, total_net_charge
and cs_calls_per_month
. Previous exploratory data analysis for things that were correlated with churn suggested these would be important features for the retention model.
def get_total_net_minutes(df):
df['total_net_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
return df
def get_total_net_calls(df):
df['total_net_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
return df
def get_total_net_charge(df):
df['total_net_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']
return df
def cs_calls_per_month(df):
df['cs_calls_per_month'] = (df['number_customer_service_calls'] + df['number_vmail_messages']) / df['account_length']
return df
Now we can use ColumnTransformer
to create a feature engineering pipeline that will apply the feature engineering functions to the appropriate columns. Since we’ve got a pipeline to handle this, when we move to predicting churn in a production environment we can just use the pipeline to transform the data without having to worry about doing the feature engineering in Pandas.
feature_engineering = ColumnTransformer([
('total_net_minutes', FunctionTransformer(get_total_net_minutes, validate=False), ['total_day_minutes', 'total_eve_minutes', 'total_night_minutes']),
('total_net_calls', FunctionTransformer(get_total_net_calls, validate=False), ['total_day_calls', 'total_eve_calls', 'total_night_calls']),
('total_net_charge', FunctionTransformer(get_total_net_charge, validate=False), ['total_day_charge', 'total_eve_charge', 'total_night_charge']),
('cs_calls_per_month', FunctionTransformer(cs_calls_per_month, validate=False), ['account_length', 'number_customer_service_calls', 'number_vmail_messages']),
])
Next we’ll create a transformer for the numerical and categorical features. We’ll use the SimpleImputer
to fill in missing values and we’ll use one hot encoding to convert categorical variables to binary ones.
numeric_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(
steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
Since we’ve just created a whole array of new features, we’ve increased the dimensionality of the data and we could find that this throws our model’s predictions off. To select only the best features, we can create a feature selection pipeline. We’ll first scale the data using StandardScaler
and then use SelectFromModel
to select the best features when they’re fed into a LogisticRegression
model.
feature_selection = Pipeline(
steps=[
('scaler', StandardScaler()),
('feature_selection', SelectFromModel(estimator=LogisticRegression(), threshold='median'))
])
Finally, we can bundle all the pipeline steps together into a single preprocessor. This will apply the feature engineering pipeline, the numerical and categorical pipelines and the feature selection pipeline to the data.
preprocessor = ColumnTransformer(
transformers=[
('feature_engineering', feature_engineering, numeric_columns),
('numeric_transformers', numeric_transformer, numeric_columns),
('categorical_transformers', categorical_transformer, categorical_columns),
])
Now that’s sorted, we can build the model itself. We’ll fit a base model to start with using XGBoost’s XGBClassifier machine learning classification algorithm. This will be run after we’ve run the data through our preprocessor pipeline and then selected the features we want to keep. We can then fit()
the model to the training data and predict()
the churn for the test data.
model = XGBClassifier(random_state=0)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('feature_selection', feature_selection),
('model', model)])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
To evaluate the model’s performance, we’ll use the accuracy_score()
function to calculate the accuracy of the model, as well as the roc_auc_score()
function to calculate the area under the ROC curve. We get a decent result of 97.64% accuracy with an AUC of 0.915, which is not bad for a first attempt.
```python
print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))
Accuracy: 0.9764705882352941
AUC: 0.9157312505900991
To see whether we can better our model’s performance by tuning the model’s underlying hyperparameters, we’ll use Optuna to tune the hyperparameters of the XGBClassifier. We’ll use the objective()
function to define the objective of the hyperparameter tuning. We’ll use the XGBClassifier
’s fit()
method to fit the model to the training data and then use the predict()
method to predict churn for the test data. We’ll then use the roc_auc_score()
function to calculate the area under the ROC curve and return that as the objective. Everything gets run through the pipeline we created above.
def objective(trial):
params = {
'model__n_estimators': trial.suggest_int('model__n_estimators', 100, 1000),
'model__learning_rate': trial.suggest_float('model__learning_rate', 0.01, 0.1),
'model__max_depth': trial.suggest_int('model__max_depth', 3, 10),
'model__min_child_weight': trial.suggest_int('model__min_child_weight', 1, 10),
'model__gamma': trial.suggest_float('model__gamma', 0.01, 0.1),
'model__subsample': trial.suggest_float('model__subsample', 0.01, 1.0),
'model__colsample_bytree': trial.suggest_float('model__colsample_bytree', 0.01, 1.0),
'model__reg_alpha': trial.suggest_float('model__reg_alpha', 1e-5, 10.0),
'model__reg_lambda': trial.suggest_float('model__reg_lambda', 1e-5, 10.0),
'model__scale_pos_weight': trial.suggest_float('model__scale_pos_weight', 1e-5, 10.0),
'model__n_jobs': 4
}
pipeline.set_params(**params)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
return roc_auc_score(y_test, predictions)
Now the optimization can be run. We’ll use the study.optimize()
method to run the optimization. We’ll run the optimization for 100 trials, then we’ll print the best parameters and the best score.
study = optuna.create_study(study_name='churn model', direction='maximize')
study.optimize(objective, n_trials=100)
print('Best parameters', study.best_params)
print('Best score', study.best_value)
print('Best model', study.best_trial)
Best parameters {'model__n_estimators': 566, 'model__learning_rate': 0.04828391312657425, 'model__max_depth': 5, 'model__min_child_weight': 9, 'model__gamma': 0.0469148769911664, 'model__subsample': 0.739256945411171, 'model__colsample_bytree': 0.37277618781404215, 'model__reg_alpha': 6.786248475422331, 'model__reg_lambda': 5.651643235753842, 'model__scale_pos_weight': 3.180336721524608}
Best score 0.9283672356094542
Best model FrozenTrial(number=20, values=[0.9283672356094542], datetime_start=datetime.datetime(2022, 11, 11, 8, 21, 1, 718426), datetime_complete=datetime.datetime(2022, 11, 11, 8, 21, 2, 505067), params={'model__n_estimators': 566, 'model__learning_rate': 0.04828391312657425, 'model__max_depth': 5, 'model__min_child_weight': 9, 'model__gamma': 0.0469148769911664, 'model__subsample': 0.739256945411171, 'model__colsample_bytree': 0.37277618781404215, 'model__reg_alpha': 6.786248475422331, 'model__reg_lambda': 5.651643235753842, 'model__scale_pos_weight': 3.180336721524608}, distributions={'model__n_estimators': IntDistribution(high=1000, log=False, low=100, step=1), 'model__learning_rate': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__max_depth': IntDistribution(high=10, log=False, low=3, step=1), 'model__min_child_weight': IntDistribution(high=10, log=False, low=1, step=1), 'model__gamma': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__subsample': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__reg_alpha': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__reg_lambda': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__scale_pos_weight': FloatDistribution(high=10.0, log=False, low=1e-05, step=None)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=20, state=TrialState.COMPLETE, value=None)
Optuna found some hyperparameters that should give us a further increase in performance. We can access these directly using study.best_params
. We can then use these to re-fit the model.
pipeline.set_params(**study.best_params)
pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('feature_engineering',
ColumnTransformer(transformers=[('total_net_minutes',
FunctionTransformer(func=<function get_total_net_minutes at 0x7f210399da60>),
['total_day_minutes',
'total_eve_minutes',
'total_night_minutes']),
('total_net_calls',
FunctionTransformer(func=<function get_total_net_call...
grow_policy='depthwise', importance_type=None,
interaction_constraints='',
learning_rate=0.04828391312657425, max_bin=256,
max_cat_to_onehot=4, max_delta_step=0,
max_depth=5, max_leaves=0, min_child_weight=9,
missing=nan, monotone_constraints='()',
n_estimators=566, n_jobs=4, num_parallel_tree=1,
predictor='auto', random_state=0,
reg_alpha=6.786248475422331,
reg_lambda=5.651643235753842, ...))])
Finally, we can run everything again and see how the tuned retention model performs on the test dataset. We get back an accuracy score of 97.7% on the test dataset, with an AUC score of 0.928. Therefore, the tuned model is performing better than the untuned model. We’re now able to predict with very high accuracy which of the companies customers are going to churn and which will be retained, purely based on their tenure, call plan, call charges and usage, and how often they needed to contact customer service.
predictions = pipeline.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))
print(classification_report(y_test, predictions))
Accuracy: 0.9772549019607844
AUC: 0.9283672356094542
precision recall f1-score support
0 0.98 1.00 0.99 1102
1 0.97 0.86 0.91 173
accuracy 0.98 1275
macro avg 0.97 0.93 0.95 1275
weighted avg 0.98 0.98 0.98 1275
Not only could a business use this in its marketing to try to prevent the attrition of customers most likely to churn (using predict_proba()
instead of predict will give you the probability to churn), but you could also analyse the feature importance or Shapley values from the model to guide the business strategy by identifying the most common drivers of churn and seeking to address these.
Matt Clarke, Thursday, November 10, 2022