Understanding what drives customer churn is critical to business success. While there is always going to be natural churn that you can’t prevent, the most common reasons for churn are usually linked to your operational service quality, and how your customer service team handles problems when they arise.
If you can identify the specific causes of churn, you can try to improve in those areas, and your retention rate should go up. One way to achieve this is through survival analysis, which looks at the factors associated with the loss of customers over time.
In this project, we’ll use Cox’s Proportional Hazards model to examine a contractual churn dataset to quickly identify the factors associated with the loss of customers. This only scratches the surface of what you can do, but it’s often all that’s needed to find what needs fixing. Here’s how it’s done.
I’m using a contractual churn dataset for this project. These are easier to work with than non-contractual churn datasets, because the time of a customer’s “death” (in CLV terms, not literally) is measurable. If you’re in a non-contractual business, such as online retail, you will need to model the time of death using a model such as BG/NBD.
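To make the non-contractual case a little more concrete: under BG/NBD, once the model's four parameters have been fitted, the probability that a customer is still "alive" can be computed directly from their purchase history. The sketch below implements that formula by hand; the parameter values and the customer's history are purely hypothetical, and in practice you'd fit the parameters with a package rather than picking them.

```python
def bgnbd_p_alive(x, t_x, T, r, alpha, a, b):
    """P(customer is still 'alive') under the BG/NBD model.

    x   - repeat purchase count (frequency)
    t_x - time of the last purchase (recency)
    T   - customer age, i.e. time since first purchase
    r, alpha, a, b - fitted BG/NBD parameters (hypothetical here)
    """
    if x == 0:
        # Customers with no repeat purchases are treated as alive
        return 1.0
    ratio = (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + ratio)

# Hypothetical parameters, and a customer with 5 repeat purchases
# who was last seen in week 30 of a 52-week observation window
p = bgnbd_p_alive(x=5, t_x=30, T=52, r=0.25, alpha=4.4, a=0.79, b=2.43)
```

The longer the silence between a customer's last purchase and the end of the observation window, the lower this probability gets, which is exactly the "time of death" signal a non-contractual business is missing.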
Each row in your dataset needs to represent a customer. You’ll require a column containing your target variable (i.e. `churn`) and a time-based metric, such as tenure based on the number of days between “birth” and “death”, plus a wide range of features. Certain features are commonly associated with churn, so try to extract these if you can.
**Service quality**: Customers who experience operational service quality issues, such as damages, delays, incorrect or incomplete orders, and even returns or service that was just "meh", are commonly associated with churn. Capturing data on these factors can show you how they're linked.

**Service contact**: Customers generally only make contact with your customer service team when something goes wrong (usually for one of the reasons above), so it makes sense that the number of service contacts, and their type, are associated with churn. In settings where customers interact online, they generally expect to have issues resolved online, so making them ring a call centre is often a turn-off. Measure how they made contact and how many times it happened.

**Customer service quality**: Next there's the quality of customer service they received when they made contact. Online retailers often love to knock Amazon, but they're actually bloody brilliant when it comes to this. Staff are polite, they actually apologise for the issue (even when it's not their fault), they give refunds almost instantaneously, and rarely quibble. If customers complain about your policies, you definitely need to record it, as it's usually highly correlated with churn. I've yet to run a service recovery test where it hasn't made a massive contribution towards reversing the trend, so record this too.

**Pricing**: Finally, there are price-related issues. These often go unnoticed because the people who set prices are generally thinking about margin and not customer retention, but putting up product or delivery prices can mean customers don't come back. They are, however, much harder to capture, especially when multiple products are purchased.
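To make the service contact idea concrete, here's a minimal sketch of turning a raw contact log into per-customer features. The log, its column names, and the values are all made up for illustration; the `groupby`/`value_counts`/`unstack` pattern is the reusable part.

```python
import pandas as pd

# Hypothetical raw contact log: one row per customer service interaction
contacts = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'channel': ['email', 'phone', 'email', 'phone', 'phone', 'chat'],
})

# Per-customer features: one count column per contact channel,
# plus an overall total
features = (contacts
            .groupby('customer_id')['channel']
            .value_counts()
            .unstack(fill_value=0)
            .add_prefix('contacts_'))
features['contacts_total'] = features.sum(axis=1)
```

The resulting frame has one row per customer, ready to be joined onto the modelling dataset.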
Customers who call are more likely to churn. Picture by Pixabay, Pexels.
Open a new Python script or Jupyter notebook and import the `pandas`, `numpy`, `lifelines`, and `matplotlib` packages. Any packages you don’t have can be installed by entering `pip3 install package-name` in your terminal. Lifelines will be doing the bulk of the work here.
```python
import pandas as pd
import numpy as np
from lifelines import CoxPHFitter
import matplotlib.pyplot as plt
```
You can use your own dataset for this if you like. I’m using a churn dataset from a telecoms provider. This comprises one row per customer and includes a boolean `churn` metric, and a range of features that we can examine to see what influences churn and retention the most.
```python
df = pd.read_csv('train.csv')
df.head().T
```
To prepare the data for use by the model we need to binarise the boolean values stored in the `churn`, `international_plan`, and `voice_mail_plan` columns. Rather than one-hot encoding them, which would introduce multicollinearity (since yes is the opposite of no), I’ve used `np.where()` to replace each `yes` with a `1` and each `no` with a `0`.
```python
df['churn'] = np.where(df['churn'].str.contains('yes'), 1, 0)
df['international_plan'] = np.where(df['international_plan'].str.contains('yes'), 1, 0)
df['voice_mail_plan'] = np.where(df['voice_mail_plan'].str.contains('yes'), 1, 0)
```
To simplify the output, I have dropped the `state` column. There is actually useful data in here; it simply makes the dataframes and plots too big to see easily. Finally, I’ve used `get_dummies()` to one-hot encode the area code field, which is quite low cardinality.
```python
df = df.drop(columns=['state'])
df = pd.get_dummies(df)
```
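As an aside, if you're worried about the same multicollinearity issue with the one-hot encoded area codes, `get_dummies()` accepts a `drop_first=True` argument that drops one dummy level per field. A toy example with made-up values:

```python
import pandas as pd

# Toy frame standing in for the area code field
toy = pd.DataFrame({'area_code': ['408', '415', '510', '415']})

# drop_first=True drops the first dummy level (area_code_408), so the
# remaining columns can't sum to a constant across every row
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['area_code_415', 'area_code_510']
```

The dropped level becomes the implicit baseline that the remaining dummies are compared against.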
Next we can fit Cox’s Proportional Hazards model to our data using `CoxPHFitter()`. When you fit the model, you’ll need to pass in your dataframe, the duration column (i.e. `account_length`), and the event column (i.e. `churn`). If you run into issues with model convergence, you may need to pass in a `penalizer` value as a workaround. The Lifelines documentation provides guidance on resolving this.
```python
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, 'account_length', event_col='churn')
```
```
<lifelines.CoxPHFitter: fitted with 4250 total observations, 3652 right-censored observations>
```
If you use `print_summary()` on your fitted `CoxPHFitter` model object you’ll get a detailed breakdown of the model configuration used and the coefficients identified. As you can see, `international_plan` has a high coefficient of 0.95 and is strongly associated with attrition. It’s easier to examine these factors in a plot, so we’ll tackle that next.
| | |
|---|---|
| number of observations | 4250 |
| number of events observed | 598 |
| time fit was run | 2021-01-19 16:56:16 UTC |
| log-likelihood ratio test | 395.89 on 19 df |
| -log2(p) of ll-ratio test | 237.53 |

The summary also includes a per-feature table with the columns `coef`, `exp(coef)`, `se(coef)`, `coef lower 95%`, `coef upper 95%`, `exp(coef) lower 95%`, `exp(coef) upper 95%`, `z`, `p`, and `-log2(p)`.
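To interpret the `coef` column: Cox coefficients are on the log-hazard scale, so exponentiating a coefficient gives the hazard ratio (which is what the `exp(coef)` column reports). For the 0.95 coefficient on `international_plan` mentioned above:

```python
import numpy as np

# A coefficient of 0.95 corresponds to a hazard ratio of exp(0.95),
# i.e. at any given moment, customers on the international plan are
# churning at roughly 2.6x the rate of those who aren't
hazard_ratio = np.exp(0.95)
print(round(hazard_ratio, 2))  # 2.59
```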
By appending the `plot()` function to our `cph` object containing the fitted model we can visualise the coefficients.
```python
plt.figure(figsize=(10, 10))
cox = cph.plot()
```
This is only a very basic example of what you can do (partly limited by the lack of time series data in this dataset), but it also shows you how easy it is to spot what’s going wrong, so you can investigate and fix the issues. Here’s what we found and what action needs to be taken.
**International calls**: Customers on the international plan are significantly more likely to churn. Obviously, there's also a link with the total international charges metric, so perhaps they're leaving because the prices are too high.

**Customer service calls**: The next big issue is with the number of customer service calls placed. This is generally one of the top reasons for churn anywhere, so it's vital you try to identify what went wrong and caused these customers to call. Could customers have resolved their issue online? Were they on hold for ages? Did it take several calls to get an adequate resolution? Did staff actually apologise for the problem? Was service recovery used to try and turn things around?

**Voice mail plan**: Customers on the voice mail plan are least likely to churn, so there might be something the company can do here. Perhaps aiming to acquire voice mail customers, encouraging others to move to this plan, or examining why the customers here are less likely to churn than the rest would help.
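A natural follow-up (not shown in the original analysis) would be to compare retention curves between groups, for example voice mail plan holders versus everyone else. Lifelines provides a Kaplan-Meier fitter for this; the sketch below implements the same product-limit estimate by hand on made-up tenures, so you can see the mechanics of handling right-censored customers who haven't churned yet.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    durations: follow-up time per customer (e.g. account_length)
    events: 1 if the customer churned, 0 if right-censored (still active)
    Returns (event_times, survival_probabilities).
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events)
    event_times = np.sort(np.unique(durations[events == 1]))
    survival, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)               # still observed at t
        churned = np.sum((durations == t) & (events == 1))
        s *= 1.0 - churned / at_risk                   # product-limit update
        survival.append(s)
    return event_times, np.array(survival)

# Made-up tenures (months) for a handful of customers; zeros in the
# second list mark customers who are still active at last observation
times, surv = kaplan_meier([3, 5, 5, 8, 12, 12], [1, 1, 0, 1, 0, 0])
```

Plotting one curve per group makes a difference in retention between, say, voice mail and non-voice mail customers immediately visible.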
Matt Clarke, Sunday, March 14, 2021