How to create a customer retention model with XGBoost

Picture by Pixabay, Pexels.

23 minutes to read

Although all business know the importance of retaining customers, few companies are actually able to measure customer retention accurately, and fewer still can predict which ones will churn or be retained. This partly stems from the fact that customer retention is a complex problem, and how you measure it is completely dependent on the transactional nature of your business.

Retention models, or churn models as they’re more commonly known, set out to predict which customers are likely to churn, and which are likely to be retained. However, they need to be created in different ways, depending on whether this business is contractual or non-contractual. Measuring retention, churn, or attrition, is easiest in contractual settings, such as a gym membership, phone contract, or insurance policy where the customer has a contract with the business. Since contracts have a clear end date or renewal date, it’s easy to see who is a customer now, and when they’re due to stop being a customer, unless they renew.

You can also examine the behaviour of the customer, to see what sort of experience they’ve had, such as whether they have been in touch with customer service, or whether they’ve been using the product or service. This is a good way to predict whether they’re likely to renew or not. There are countless reasons why customers defect or churn. I’ve explained how you can identify the reasons why your customers churn in a previous post.

In non-contractual settings, such as a supermarket or ecommerce business, it’s much harder to measure churn. You could try to predict whether a customer will order in a future period by using RFM features or purchase latency as model features, but usually the most effective way to build a non-contractual churn model is to use the Beta Geometric Negative Binomial Bayes Distribution or BG/NBD model. It’s generally wrong to assume that if a customer doesn’t order within X days that they’re likely to churn, as they may have simply forgotten to order, or they may have been away on holiday, or they might just behave differently to other customers.

In this project we’ll be building a customer retention model for a contractual setting and will be aiming to predict which customers are retained and which will churn based on the experience they’ve had as customers of a phone provider. Our contractual churn model will examine how much they’ve used the service, when they use it, how much they’ve been charged, and whether they’ve been in touch with customer service. We’ll also be using XGBoost and Optuna for hyperparameter tuning.

Import the packages

To get started, open a new Jupyter Notebook and import the packages below. We’ll be using the XGBClassifier from XGBoost to create our retention model, and we’ll be using a range of scikit-learn packages to create pipelines and evaluate our model. Finally, we’ll be using the Optuna Bayesian optimization package to tune the hyperparameters of our model.

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
import optuna

Load the data

Next, load up your dataset. I’m using the Kaggle Customer Churn 2020 dataset for this demonstration. The dataset is based on a telecoms provider and includes details on the location of the customer, the telephone plan they have, how they use it, how much they’re being charged, and how many times they’ve had to call the customer service department.

df = pd.read_csv('churn_train.csv')
df = df.rename(columns=str.lower)
df.T

	0	1	2	3	4	5	6	7	8	9	...	4240	4241	4242	4243	4244	4245	4246	4247	4248	4249
state	OH	NJ	OH	OK	MA	MO	LA	WV	IN	RI	...	AR	WA	MN	ND	AZ	MT	WV	NC	HI	VT
account_length	107	137	84	75	121	147	117	141	65	74	...	127	80	150	140	97	83	73	75	50	86
area_code	area_code_415	area_code_415	area_code_408	area_code_415	area_code_510	area_code_415	area_code_408	area_code_415	area_code_415	area_code_415	...	area_code_415	area_code_510	area_code_408	area_code_510	area_code_510	area_code_415	area_code_408	area_code_408	area_code_408	area_code_415
international_plan	no	no	yes	yes	no	yes	no	yes	no	no	...	no	no	no	no	no	no	no	no	no	no
voice_mail_plan	yes	no	no	no	yes	no	no	yes	no	no	...	yes	no	no	no	no	no	no	no	yes	yes
number_vmail_messages	26	0	0	0	24	0	0	37	0	0	...	27	0	0	0	0	0	0	0	40	34
total_day_minutes	161.6	243.4	299.4	166.7	218.2	157.0	184.5	258.6	129.1	187.7	...	157.6	157.0	170.0	244.7	252.6	188.3	177.9	170.7	235.7	129.4
total_day_calls	123	114	71	113	88	79	97	84	137	127	...	107	101	115	115	89	70	89	101	127	102
total_day_charge	27.47	41.38	50.9	28.34	37.09	26.69	31.37	43.96	21.95	31.91	...	26.79	26.69	28.9	41.6	42.94	32.01	30.24	29.02	40.07	22.0
total_eve_minutes	195.5	121.2	61.9	148.3	348.5	103.1	351.6	222.0	228.5	163.4	...	280.6	208.8	162.7	258.6	340.3	243.8	131.2	193.1	223.0	267.1
total_eve_calls	103	110	88	122	108	94	80	111	83	148	...	49	127	138	101	91	88	82	126	126	104
total_eve_charge	16.62	10.3	5.26	12.61	29.62	8.76	29.89	18.87	19.42	13.89	...	23.85	17.75	13.83	21.98	28.93	20.72	11.15	16.41	18.96	22.7
total_night_minutes	254.4	162.6	196.9	186.9	212.6	211.8	215.8	326.4	208.8	196.0	...	75.1	113.3	267.2	231.3	256.5	213.7	186.2	129.1	297.5	154.8
total_night_calls	103	104	89	121	118	96	90	97	111	94	...	77	109	77	112	67	79	89	104	116	100
total_night_charge	11.45	7.32	8.86	8.41	9.57	9.53	9.71	14.69	9.4	8.82	...	3.38	5.1	12.02	10.41	11.54	9.62	8.38	5.81	13.39	6.97
total_intl_minutes	13.7	12.2	6.6	10.1	7.5	7.1	8.7	11.2	12.7	9.1	...	8.0	16.2	8.3	7.5	8.8	10.3	11.5	6.9	9.9	9.3
total_intl_calls	3	5	7	3	7	6	4	5	6	5	...	4	2	2	6	5	6	6	7	5	16
total_intl_charge	3.7	3.29	1.78	2.73	2.03	1.92	2.35	3.02	3.43	2.46	...	2.16	4.37	2.24	2.03	2.38	2.78	3.11	1.86	2.67	2.51
number_customer_service_calls	1	0	2	3	3	0	1	0	4	0	...	1	2	0	1	1	0	3	1	2	0
churn	no	no	no	no	no	no	no	no	yes	no	...	no	no	no	yes	yes	no	no	no	no	no

20 rows × 4250 columns

Encode the target variable

Since this is a contractual setting, the dataset includes a churn column that indicates whether the customer left the company or not. We will encode this column as a binary variable, where 1 indicates that the customer left the company and 0 indicates that the customer stayed with the company. We’ll then define the X features set to include all the columns apart from the target variable and the y target variable to be the churn column.

df['churn'] = df['churn'].replace(('yes', 'no'), (1, 0))
X = df.drop(['churn'], axis=1)
y = df['churn']

Split the data into train and test sets

Next, we’ll use the train_test_split function to split the data into train and test sets. We’ll set the test_size to 0.3, which means that 30% of the data will be used for testing and 80% for training. We’ll also set the random_state to 0, which means that the data will be split in the same way every time we run the code.

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0)

Define the categorical and numerical features

To make things easier and more maintainable in the future, we’ll be using scikit-learn pipelines for this project and specifically we’ll be creating a custom feature engineering pipeline. Since we may need to treat categorical and numerical features differently, we’ll first define the categorical and numerical features using the select_dtypes() function to select the categorical and numerical features.

categorical_columns = list(X_train.select_dtypes(include=['object']).columns.values.tolist())
numeric_columns = list(X_train.select_dtypes(exclude=['object']).columns.values.tolist())

Create feature engineering pipeline functions

Next, we’ll create some custom feature engineering functions to incorporate into our feature engineering pipeline. These are the sorts of things you’d ordinarily handle in Pandas, but it’s much cleaner to do it in a pipeline. For each customer we’ll create new columns to hold the total_net_minutes, total_net_calls, total_net_charge and cs_calls_per_month. Previous exploratory data analysis for things that were correlated with churn suggested these would be important features for the retention model.

def get_total_net_minutes(df):
    df['total_net_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
    return df

def get_total_net_calls(df):
    df['total_net_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
    return df

def get_total_net_charge(df):
    df['total_net_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']
    return df

def cs_calls_per_month(df):
    df['cs_calls_per_month'] = (df['number_customer_service_calls'] + df['number_vmail_messages']) / df['account_length']
    return df

Create a feature engineering pipeline with ColumnTransformer

Now we can use ColumnTransformer to create a feature engineering pipeline that will apply the feature engineering functions to the appropriate columns. Since we’ve got a pipeline to handle this, when we move to predicting churn in a production environment we can just use the pipeline to transform the data without having to worry about doing the feature engineering in Pandas.

feature_engineering = ColumnTransformer([
    ('total_net_minutes', FunctionTransformer(get_total_net_minutes, validate=False), ['total_day_minutes', 'total_eve_minutes', 'total_night_minutes']),
    ('total_net_calls', FunctionTransformer(get_total_net_calls, validate=False), ['total_day_calls', 'total_eve_calls', 'total_night_calls']),
    ('total_net_charge', FunctionTransformer(get_total_net_charge, validate=False), ['total_day_charge', 'total_eve_charge', 'total_night_charge']),
    ('cs_calls_per_month', FunctionTransformer(cs_calls_per_month, validate=False), ['account_length', 'number_customer_service_calls', 'number_vmail_messages']),
])

Create a numerical and categorical pipeline

Next we’ll create a transformer for the numerical and categorical features. We’ll use the SimpleImputer to fill in missing values and we’ll use one hot encoding to convert categorical variables to binary ones.

numeric_transformer = SimpleImputer(strategy='constant')

categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

Create a feature selection and scaling pipeline

Since we’ve just created a whole array of new features, we’ve increased the dimensionality of the data and we could find that this throws our model’s predictions off. To select only the best features, we can create a feature selection pipeline. We’ll first scale the data using StandardScaler and then use SelectFromModel to select the best features when they’re fed into a LogisticRegression model.

feature_selection = Pipeline(
    steps=[
        ('scaler', StandardScaler()),
        ('feature_selection', SelectFromModel(estimator=LogisticRegression(), threshold='median'))
    ])

Create a preprocessing pipeline

Finally, we can bundle all the pipeline steps together into a single preprocessor. This will apply the feature engineering pipeline, the numerical and categorical pipelines and the feature selection pipeline to the data.

preprocessor = ColumnTransformer(
    transformers=[
        ('feature_engineering', feature_engineering, numeric_columns),
        ('numeric_transformers', numeric_transformer, numeric_columns),
        ('categorical_transformers', categorical_transformer, categorical_columns),
    ])

Create the customer retention model using XGBClassifier

Now that’s sorted, we can build the model itself. We’ll fit a base model to start with using XGBoost’s XGBClassifier machine learning classification algorithm. This will be run after we’ve run the data through our preprocessor pipeline and then selected the features we want to keep. We can then fit() the model to the training data and predict() the churn for the test data.

model = XGBClassifier(random_state=0)

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('feature_selection', feature_selection),
                           ('model', model)])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

Evaluate the model’s performance

To evaluate the model’s performance, we’ll use the accuracy_score() function to calculate the accuracy of the model, as well as the roc_auc_score() function to calculate the area under the ROC curve. We get a decent result of 97.64% accuracy with an AUC of 0.915, which is not bad for a first attempt.

```python
print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))

Accuracy:  0.9764705882352941
AUC:  0.9157312505900991

Use Optuna to tune the XGBClassifier hyperparameters

To see whether we can better our model’s performance by tuning the model’s underlying hyperparameters, we’ll use Optuna to tune the hyperparameters of the XGBClassifier. We’ll use the objective() function to define the objective of the hyperparameter tuning. We’ll use the XGBClassifier’s fit() method to fit the model to the training data and then use the predict() method to predict churn for the test data. We’ll then use the roc_auc_score() function to calculate the area under the ROC curve and return that as the objective. Everything gets run through the pipeline we created above.

def objective(trial):
    params = {
        'model__n_estimators': trial.suggest_int('model__n_estimators', 100, 1000),
        'model__learning_rate': trial.suggest_float('model__learning_rate', 0.01, 0.1),
        'model__max_depth': trial.suggest_int('model__max_depth', 3, 10),
        'model__min_child_weight': trial.suggest_int('model__min_child_weight', 1, 10),
        'model__gamma': trial.suggest_float('model__gamma', 0.01, 0.1),
        'model__subsample': trial.suggest_float('model__subsample', 0.01, 1.0),
        'model__colsample_bytree': trial.suggest_float('model__colsample_bytree', 0.01, 1.0),
        'model__reg_alpha': trial.suggest_float('model__reg_alpha', 1e-5, 10.0),
        'model__reg_lambda': trial.suggest_float('model__reg_lambda', 1e-5, 10.0),
        'model__scale_pos_weight': trial.suggest_float('model__scale_pos_weight', 1e-5, 10.0),
        'model__n_jobs': 4
    }
    pipeline.set_params(**params)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    return roc_auc_score(y_test, predictions)

Run the Optuna optimization

Now the optimization can be run. We’ll use the study.optimize() method to run the optimization. We’ll run the optimization for 100 trials, then we’ll print the best parameters and the best score.

study = optuna.create_study(study_name='churn model', direction='maximize')
study.optimize(objective, n_trials=100)

print('Best parameters', study.best_params)
print('Best score', study.best_value)
print('Best model', study.best_trial)

Best parameters {'model__n_estimators': 566, 'model__learning_rate': 0.04828391312657425, 'model__max_depth': 5, 'model__min_child_weight': 9, 'model__gamma': 0.0469148769911664, 'model__subsample': 0.739256945411171, 'model__colsample_bytree': 0.37277618781404215, 'model__reg_alpha': 6.786248475422331, 'model__reg_lambda': 5.651643235753842, 'model__scale_pos_weight': 3.180336721524608}
Best score 0.9283672356094542
Best model FrozenTrial(number=20, values=[0.9283672356094542], datetime_start=datetime.datetime(2022, 11, 11, 8, 21, 1, 718426), datetime_complete=datetime.datetime(2022, 11, 11, 8, 21, 2, 505067), params={'model__n_estimators': 566, 'model__learning_rate': 0.04828391312657425, 'model__max_depth': 5, 'model__min_child_weight': 9, 'model__gamma': 0.0469148769911664, 'model__subsample': 0.739256945411171, 'model__colsample_bytree': 0.37277618781404215, 'model__reg_alpha': 6.786248475422331, 'model__reg_lambda': 5.651643235753842, 'model__scale_pos_weight': 3.180336721524608}, distributions={'model__n_estimators': IntDistribution(high=1000, log=False, low=100, step=1), 'model__learning_rate': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__max_depth': IntDistribution(high=10, log=False, low=3, step=1), 'model__min_child_weight': IntDistribution(high=10, log=False, low=1, step=1), 'model__gamma': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__subsample': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__reg_alpha': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__reg_lambda': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__scale_pos_weight': FloatDistribution(high=10.0, log=False, low=1e-05, step=None)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=20, state=TrialState.COMPLETE, value=None)

Re-fit the model with the best parameters

Optuna found some hyperparameters that should give us a further increase in performance. We can access these directly using study.best_params. We can then use these to re-fit the model.

pipeline.set_params(**study.best_params)
pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('feature_engineering',
                                                  ColumnTransformer(transformers=[('total_net_minutes',
                                                                                   FunctionTransformer(func=<function get_total_net_minutes at 0x7f210399da60>),
                                                                                   ['total_day_minutes',
                                                                                    'total_eve_minutes',
                                                                                    'total_night_minutes']),
                                                                                  ('total_net_calls',
                                                                                   FunctionTransformer(func=<function get_total_net_call...
                               grow_policy='depthwise', importance_type=None,
                               interaction_constraints='',
                               learning_rate=0.04828391312657425, max_bin=256,
                               max_cat_to_onehot=4, max_delta_step=0,
                               max_depth=5, max_leaves=0, min_child_weight=9,
                               missing=nan, monotone_constraints='()',
                               n_estimators=566, n_jobs=4, num_parallel_tree=1,
                               predictor='auto', random_state=0,
                               reg_alpha=6.786248475422331,
                               reg_lambda=5.651643235753842, ...))])

Evaluate the performance of the final model

Finally, we can run everything again and see how the tuned retention model performs on the test dataset. We get back an accuracy score of 97.7% on the test dataset, with an AUC score of 0.928. Therefore, the tuned model is performing better than the untuned model. We’re now able to predict with very high accuracy which of the companies customers are going to churn and which will be retained, purely based on their tenure, call plan, call charges and usage, and how often they needed to contact customer service.

predictions = pipeline.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))
print(classification_report(y_test, predictions))

Accuracy:  0.9772549019607844
AUC:  0.9283672356094542
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1102
           1       0.97      0.86      0.91       173

    accuracy                           0.98      1275
   macro avg       0.97      0.93      0.95      1275
weighted avg       0.98      0.98      0.98      1275

Not only could a business use this in its marketing to try to prevent the attrition of customers most likely to churn (using predict_proba() instead of predict will give you the probability to churn), but you could also analyse the feature importance or Shapley values from the model to guide the business strategy by identifying the most common drivers of churn and seeking to address these.

Matt Clarke, Thursday, November 10, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.