A quick guide to machine learning uplift models

Pictures by Daniel Frank, Unsplash.


Uplift modeling is a machine learning technique used in marketing and ecommerce to predict which customers are likely to respond to a particular marketing campaign. However, rather than simply predicting which customers are likely to purchase, an uplift model predicts which customers will purchase because they’re given a piece of marketing. It’s also known as true-lift or net lift modeling.

By constructing your own uplift model, you can predict which of your customers you should target your promotions at, even if they’ve not been targeted before. This can help you improve the customer experience, by cutting down on unwanted emails, help you increase the profitability of your marketing activity, and help avoid the risk of getting your customers (or your marketing team) hooked on coupons.

Coupons can be a bit “moreish” for both customers and marketers. Much in the same way that crack cocaine and crystal meth are moreish…

While uplift modeling is most commonly applied to promotions, such as coupons, the technique is equally applicable to other kinds of marketing campaign, such as emails, so it’s a great approach to learn if you work in the ecommerce or digital marketing field. It can also show whether your direct marketing activity, such as catalogues, works or not.

How are coupons usually targeted in ecommerce?

In most ecommerce businesses, there’s generally little science involved in the selection of customers targeted using coupons. In some cases, companies might utilise RFM models to target their customers in a more strategic manner via customer segmentation, perhaps by trying to “reactivate” those who have lapsed, or encourage new acquisitions to place their second order, but generally it’s an un-targeted “spray and pray” approach that is used.

When more sophisticated machine learning approaches are used, they tend to be response models or propensity models (also known as likelihood-to-buy models), targeting those who will be “responsive” or more likely to purchase, rather than the smaller number who will respond because they receive a coupon promotion.

Where did propensity models originate?

Response or propensity models originated from the budget constraints that came with sending direct mail, such as catalogues and brochures, which are expensive to produce, print, and post to customers. In catalogue marketing, profit optimisation is as much about reducing wastage as it is about maximising the performance of the collateral sent.

Marketers would be assigned a campaign budget and would then allocate this to targeting the customers most likely to respond if mailed, usually using the powerful RFM model variables (Recency, Frequency, and Monetary), which remain just as valid today.

However, as the data retailers capture has increased in depth and breadth, more sophisticated modeling methods have been applied to it to predict who marketers should target to maximise campaign profits.

Cable car ski lift.

How do response or propensity models work?

Response models or propensity models use supervised learning classification algorithms to estimate whether a customer will purchase or not purchase, based on a feature set comprising a wide range of readily available or easily engineered data. For example, how long they’ve been a customer, how many orders they’ve placed, how much they’ve spent, how many days they typically have between orders, etc.

The trained response model can then be used to predict which customers have the greatest probability of responding. Customers are ranked by their probability to purchase and then the top K customers are contacted, according to the campaign budget.
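The ranking step described above can be sketched in a few lines. This is only an illustration: the customer ids, the probabilities, and the `select_top_k` helper are all made up, and in practice the scores would come from your trained response model.

```python
def select_top_k(scores, k):
    """Return the k customer ids with the highest predicted purchase probability."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Hypothetical model output: customer id -> predicted probability of purchase
predicted = {"c1": 0.82, "c2": 0.15, "c3": 0.64, "c4": 0.91, "c5": 0.33}

# Suppose the campaign budget covers three contacts
targets = select_top_k(predicted, k=3)
print(targets)  # ['c4', 'c1', 'c3']
```

Note that nothing in this ranking asks whether c4 would have purchased without being contacted.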

Although these propensity models can generate impressive predictions of which customers are going to purchase, they ignore the causal link between the campaign and the customer response. As a consequence, they target the customers with the greatest likelihood of purchasing, which inherently includes some who would have purchased anyway.

That said, response models are still really useful, and are great for certain applications, particularly catalogue marketing and direct mail, where you want to allocate a given number of brochures to maximise your return. The drawback is that they can waste resources, which can reduce profit.

Why is uplift modeling “better” than response modeling?

The aim of uplift modeling is not simply to predict who will purchase, but who will purchase because they’re given a promotional coupon. Rather than giving you a broader list of target customers who are likely to purchase, it gives you a tighter list of customers who probably won’t order unless you send them a promotion.

Compared to response modeling, uplift modeling can allow you to identify customers who respond because they’re targeted, customers who respond irrespective of targeting, customers who didn’t respond and where targeting had no impact, and customers who didn’t respond and where targeting may have a negative impact. It’s much more powerful.

The downside of uplift modeling, and the reason why it’s less commonly used, is that it requires additional data and uses machine learning techniques that are rather more complex than those used for regular propensity or response modeling, with some approaches even incorporating multiple simultaneous models…

How does uplift modeling work?

Where response or propensity models look at all of your historical customer data in a single data set, uplift models are trained using two separate training data sets: a “treatment” and a “control.” The customers in the treatment data set were targeted with a marketing promotion, while the customers in the control were not and were “held out.”

Treatment	Customers who were sent a promotion
Control	Customers who were held out of the promotion

Instead of calculating the “class probability” (i.e. response or no response), the uplift model aims to calculate the difference between the class probabilities of the treatment and the control, which yields the causal influence.

As the name implies, the model predicts the uplift you get from running the promotion, not simply who is going to purchase when a campaign goes out. This approach allows you to identify the persuadable customers whose conversion probability will increase when they’re given a coupon.

Since an uplift model requires data from two groups (one which received a marketing action, such as a coupon, email, or catalogue, and a control group which did not), you need to be incorporating A/B tests, hold-out groups, and test campaigns into your marketing activity, and collecting plenty of data, in order to use the method.


What data do I need to construct a model?

The actual raw data you need will likely be readily available, assuming you already run and capture data on marketing tests. Importantly, customers need to have been assigned at random to a treatment group and a control group, and the data for both groups need to cover the same time period.
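If you’re setting up a test from scratch, the random split might look something like the sketch below. The `assign_groups` helper, the 20% hold-out fraction, and the fixed seed are all assumptions for illustration, not a recommendation on how big your control should be.

```python
import random

def assign_groups(customer_ids, holdout_fraction=0.2, seed=42):
    """Randomly assign each customer to 'treatment' or 'control' (the hold-out)."""
    ids = list(customer_ids)
    shuffled = ids[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_control = int(round(len(shuffled) * holdout_fraction))
    control = set(shuffled[:n_control])
    return {cid: ("control" if cid in control else "treatment") for cid in ids}

# Hypothetical customer ids 0..999
groups = assign_groups(range(1000))
print(sum(1 for g in groups.values() if g == "control"))  # 200
```

Because the assignment ignores everything about the customer, any difference in response between the two groups can be attributed to the campaign rather than to who was picked.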

You’re completely free to use your feature engineering skills to come up with any interesting metrics you think might help your model improve its predictions. I’d certainly include RFM variables in there, as well as any additional segmentation or engagement data you can add.

Crucially, since these data are time-linked, you need to calculate any customer metrics based on how they were before the campaign, with the result metrics (i.e. visits, conversions, revenue) based on the period after the campaign was sent.
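Here’s a rough sketch of that before/after split for a single customer. The `build_row` function, the 30-day outcome window, and the order history are all hypothetical; the point is simply that features only see orders placed before the campaign date, and outcomes only see orders placed after it.

```python
from datetime import date, timedelta

def build_row(orders, campaign_date, outcome_days=30):
    """orders: list of (order_date, order_value) tuples for one customer."""
    before = [(d, v) for d, v in orders if d < campaign_date]
    window_end = campaign_date + timedelta(days=outcome_days)
    after = [(d, v) for d, v in orders if campaign_date <= d < window_end]

    # Features: calculated only from what was known before the campaign
    features = {
        "recency_days": (campaign_date - max(d for d, _ in before)).days if before else None,
        "frequency": len(before),
        "monetary": sum(v for _, v in before),
    }
    # Outcomes: calculated only from the fixed window after the campaign
    outcome = {"converted": int(bool(after)), "revenue": sum(v for _, v in after)}
    return features, outcome

# Hypothetical order history for one customer
orders = [(date(2023, 1, 10), 40.0), (date(2023, 2, 20), 60.0), (date(2023, 3, 5), 25.0)]
features, outcome = build_row(orders, campaign_date=date(2023, 3, 1))
print(features)  # {'recency_days': 9, 'frequency': 2, 'monetary': 100.0}
print(outcome)   # {'converted': 1, 'revenue': 25.0}
```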

What target variable is used?

Uplift modeling is only aiming to measure the incremental impact of a treatment on an individual. You’re free to select your own target variable. For example, you could just as easily apply it to measuring the impact a campaign has upon traffic, purchase frequency, or average order value, as you could overall campaign revenue, or the number of orders placed. It’s really up to you.

Is a test data set available?

Yes, if you’re building an uplift model and want to benchmark it against a standardised data set, Criteo (the digital remarketing company) has produced one specifically for uplift modeling. The data set was created from Criteo’s incrementality tests, whereby a random part of the population was held out from advertising. It includes 25M rows, with each one representing a user with 11 features, a treatment indicator and labels on whether they visited or converted.

Another test data set that’s featured in research papers was provided by marketing science blogger Kevin Hillstrom. His is more appropriate to typical data you’ll encounter in ecommerce as it’s based on 64,000 customers who purchased in the past 12 months, who were subject to an email marketing campaign. Customers were divided into three random sets: one third received an email campaign on men’s merchandise, one third received an email campaign on women’s merchandise, and one third didn’t get any email.

A sample of the data in the Hillstrom dataset is shown below. I’d highly recommend reading this paper by Nicholas Radcliffe which explains how uplift modeling was used to analyse this data set and create a state-of-the-art uplift model. It’s a brilliant example of using the approach to examine email marketing performance from one of the leading authorities on uplift modeling.

	recency	history_segment	history	mens	womens	newbie	channel	segment	visit	conversion	spend
7819	9	1) $0 - $100	50.84	1	0	0	Web	No E-Mail	0	0	0.00
29335	2	4) $350 - $500	488.98	1	1	0	Web	Mens E-Mail	0	0	0.00
25876	1	2) $100 - $200	127.55	0	1	0	Web	No E-Mail	0	0	0.00
19077	4	4) $350 - $500	416.46	0	1	0	Phone	Mens E-Mail	1	1	190.74
23106	1	5) $500 - $750	748.82	1	1	1	Multichannel	Womens E-Mail	1	0	0.00
3028	1	3) $200 - $350	332.70	1	1	0	Multichannel	Womens E-Mail	0	0	0.00
34341	12	2) $100 - $200	123.23	1	0	1	Web	Mens E-Mail	0	0	0.00
18554	3	4) $350 - $500	370.41	0	1	0	Web	Womens E-Mail	1	0	0.00
23729	4	1) $0 - $100	38.62	0	1	0	Web	Womens E-Mail	0	0	0.00
8948	11	3) $200 - $350	211.40	1	1	1	Web	Mens E-Mail	0	0	0.00
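To show how the `segment` column turns into a treatment indicator, here’s a small sketch using just the ten sample rows above (index, segment, and visit). Collapsing both email campaigns into a single “treated” group is my assumption here; you could equally model the men’s and women’s emails separately. Ten rows is obviously far too few to conclude anything, so treat the numbers as a mechanical illustration only.

```python
# (index, segment, visit) taken from the ten sample rows shown above
sample = [
    (7819, "No E-Mail", 0), (29335, "Mens E-Mail", 0), (25876, "No E-Mail", 0),
    (19077, "Mens E-Mail", 1), (23106, "Womens E-Mail", 1), (3028, "Womens E-Mail", 0),
    (34341, "Mens E-Mail", 0), (18554, "Womens E-Mail", 1), (23729, "Womens E-Mail", 0),
    (8948, "Mens E-Mail", 0),
]

def visit_rate(rows):
    """Share of rows where the customer visited the site."""
    return sum(visit for _, _, visit in rows) / len(rows)

# Anyone who received either email counts as treated; "No E-Mail" is the control
treated = [row for row in sample if row[1] != "No E-Mail"]
control = [row for row in sample if row[1] == "No E-Mail"]

print(len(treated), visit_rate(treated))  # 8 0.375
print(len(control), visit_rate(control))  # 2 0.0
```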

Uplift modeling.

How is uplift calculated?

The paper I mentioned above by Nicholas Radcliffe is worth reading for a basic and easy to understand grounding on the mathematics of uplift modeling. As a mediocre mathematician, even I was able to understand this.

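For a binary outcome, such as an order or visit, uplift U can be defined using the equation below, where P (purchase | A) is the probability of a purchase given that the customer was in group A (your treatment), and P (purchase | B) is the probability of a purchase given that they were in group B (your control).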

U = P (purchase | A) – P (purchase | B)
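As a quick worked example of the equation, here’s the calculation on campaign-level counts. The numbers are invented and the `uplift` helper is just for illustration.

```python
def uplift(treated_responders, treated_total, control_responders, control_total):
    """Difference between the treatment and control response rates."""
    p_treatment = treated_responders / treated_total
    p_control = control_responders / control_total
    return p_treatment - p_control

# Hypothetical test: 600 of 10,000 treated customers purchased,
# versus 400 of 10,000 customers who were held out
u = uplift(600, 10_000, 400, 10_000)
print(round(u, 4))  # 0.02
```

In other words, the campaign converted 6% of the treatment group, but only two percentage points of that were incremental.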

If you want to measure a continuous outcome, such as revenue, you can substitute the expected (average) revenue per customer, E, for the purchase probability.

U = E (revenue | A) – E (revenue | B)

If your ecommerce business has a high standard deviation in average order value, with mostly small orders and the odd massive one, using a continuous revenue metric might give you misleading results.
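To see why, here’s a small sketch that takes revenue uplift as the difference in average revenue per customer between the two groups. All of the figures are made up, and `revenue_uplift` is a hypothetical helper.

```python
from statistics import mean

def revenue_uplift(treatment_revenue, control_revenue):
    """Difference in average revenue per customer between treatment and control."""
    return mean(treatment_revenue) - mean(control_revenue)

# Hypothetical revenue per customer in each group (0 means no order)
treatment = [0, 0, 30, 0, 25, 0, 0, 40, 0, 0]
control = [0, 0, 0, 20, 0, 0, 35, 0, 0, 0]
print(revenue_uplift(treatment, control))  # 4.0

# Now suppose one control customer happened to place a single massive order
control_with_outlier = control[:-1] + [900]
print(revenue_uplift(treatment, control_with_outlier))  # -86.0
```

One freak order flips the result from a modest gain to a large apparent loss, which is why a binary outcome (or a capped revenue figure) can be the safer choice.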

Which classification models are used?

Data science researchers have applied a whole range of different machine learning techniques to uplift modeling. These include tree-based algorithms, neural networks, logistic regression, k nearest neighbours, and support vector machines, as well as custom-designed algorithms that use weighting to derive “pessimistic” uplift scores. Stacking, bagging, or ensemble modeling are also popular.

One of the most effective techniques is the two model uplift method, also called the two model approach or double classifier approach. This involves the creation of two separate classification models from the treatment data and the control data.

Along with a more sophisticated and complex technique called the interaction term method (ITM), the two model uplift method is also one of the current state-of-the-art techniques. For an in-depth guide to the various models, check out the paper by Robin Gubela and his co-authors.

How do I apply the two model approach?

For any uplift model, the first step is always to run a pilot marketing test in which every customer is included in either the campaign or the hold-out group. A range of features (e.g. RFM and other variables) are calculated for each customer at the start of the campaign, and their response is calculated for a fixed period afterwards (e.g. one week or one month after the campaign was delivered).

Then, rather than constructing a single model on the combined data set of treatment and control, two separate models are created - one for the treatment group and one for the control.

The observant among you will spot that this is exactly the same as a standard response model, but with two steps and a bit of extra maths, and a significantly more challenging problem around the interpretation of results.

Step 1	The first model looks at the treatment or test group which received the marketing promotion. It estimates the probability of response and is the same as a conventional response or propensity model.
Step 2	The second model looks at the control or hold out group which didn't receive the marketing promotion. It estimates the probability of response among the untreated population.
Step 3	The lift score is calculated by subtracting the estimate of the second model (control) from the estimate of the first model (treatment). Lift score (x) = (Estimate of P (R | T, x)) - (Estimate of P (R | C, x))
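The three steps can be sketched as follows. To keep the example dependency-free, I’ve used a toy stand-in for the classifier: `SegmentRateModel` simply predicts the observed response rate for a customer’s segment. That class, the two segments, and the data are all assumptions for illustration; in practice each model would be a proper classifier (logistic regression, gradient-boosted trees, and so on) trained on your full feature set.

```python
from collections import defaultdict

class SegmentRateModel:
    """Toy 'classifier': predicts the observed response rate of a customer's segment."""

    def fit(self, segments, responses):
        totals, hits = defaultdict(int), defaultdict(int)
        for segment, responded in zip(segments, responses):
            totals[segment] += 1
            hits[segment] += responded
        self.rates = {s: hits[s] / totals[s] for s in totals}
        self.default = sum(hits.values()) / sum(totals.values())  # for unseen segments
        return self

    def predict_proba(self, segment):
        return self.rates.get(segment, self.default)

# Hypothetical test data: (segment, treated, responded)
data = [
    ("recent", 1, 1), ("recent", 1, 1), ("recent", 1, 0), ("recent", 1, 1),
    ("recent", 0, 1), ("recent", 0, 1), ("recent", 0, 0), ("recent", 0, 1),
    ("lapsed", 1, 1), ("lapsed", 1, 0), ("lapsed", 1, 1), ("lapsed", 1, 0),
    ("lapsed", 0, 0), ("lapsed", 0, 0), ("lapsed", 0, 1), ("lapsed", 0, 0),
]

treatment_rows = [(s, r) for s, t, r in data if t == 1]
control_rows = [(s, r) for s, t, r in data if t == 0]

# Step 1: model trained on the treatment group only
model_t = SegmentRateModel().fit(*zip(*treatment_rows))
# Step 2: model trained on the control group only
model_c = SegmentRateModel().fit(*zip(*control_rows))

# Step 3: lift score = treatment estimate minus control estimate
def lift_score(segment):
    return model_t.predict_proba(segment) - model_c.predict_proba(segment)

print(lift_score("recent"))  # 0.0
print(lift_score("lapsed"))  # 0.25
```

In this made-up data, a response model would rank the “recent” customers top because they purchase most often, yet the campaign makes no difference to them. The “lapsed” customers are the ones where the promotion earns its keep.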

How do you measure model performance?

Normally, you might use something like R-squared, classification error, or the Gini coefficient. However, these metrics compare each prediction against an actual outcome, and with uplift there is no actual outcome to compare against: a person can’t simultaneously be in both the test and the control (unless your test has gone wrong), so their true uplift is never observed. Instead, uplift models are evaluated at group level, and a method called the Qini curve (which is a bit like Gini, hence the name) is sometimes used.
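Here’s a minimal sketch of one common way to build a Qini-style curve, as I understand it: rank customers by predicted uplift, then walk down the list tracking treatment responders minus control responders, with the control count scaled to the size of the treatment group seen so far. The `qini_curve` function and the scored records are hypothetical, and published definitions differ in the details, so check this against whichever paper or library you’re following.

```python
def qini_curve(records):
    """records: list of (predicted_uplift, treated, responded) with 0/1 flags.

    Returns the cumulative incremental responders after each customer,
    with customers ranked from highest to lowest predicted uplift.
    """
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    n_t = n_c = r_t = r_c = 0
    curve = []
    for _, treated, responded in ranked:
        if treated:
            n_t += 1
            r_t += responded
        else:
            n_c += 1
            r_c += responded
        # Scale control responders to the treatment group size seen so far
        scaled_control = r_c * n_t / n_c if n_c else 0.0
        curve.append(r_t - scaled_control)
    return curve

# Hypothetical scored customers: (predicted uplift, treated, responded)
scored = [
    (0.30, 1, 1), (0.25, 0, 0), (0.20, 1, 1), (0.10, 0, 0),
    (0.05, 1, 0), (0.00, 0, 1), (-0.05, 1, 0), (-0.10, 0, 1),
]
curve = qini_curve(scored)
print([round(value, 2) for value in curve])
```

A useful model produces a curve that rises quickly and sits above the straight line you’d get from targeting customers at random; the bigger the area between the two, the better the ranking.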


How do you analyse responses?

There are four possible outcomes when you’re running a test and control, which you can calculate from the raw data and place into a 2x2 matrix. In the treatment or test group, customers will either respond or not respond, and in the control, customers will either respond or not respond. These are known as Treatment Responders (TR), Treatment Non-Responders (TN), Control Responders (CR), and Control Non-Responders (CN).

	Responded	Didn't respond
Control	Control Responders (CR): not in the campaign and purchased	Control Non-Responders (CN): not in the campaign and didn't purchase
Treatment	Treatment Responders (TR): were in the campaign and purchased	Treatment Non-Responders (TN): were in the campaign and didn't purchase
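Counting customers into the four cells is straightforward once each row has a treated flag and a responded flag. The sketch below assumes both are stored as 0 or 1; the `response_matrix` helper and the rows are made up.

```python
from collections import Counter

# Map (treated, responded) flags to the four cell labels
LABELS = {(1, 1): "TR", (1, 0): "TN", (0, 1): "CR", (0, 0): "CN"}

def response_matrix(rows):
    """rows: iterable of (treated, responded) pairs, each flag 0 or 1."""
    counts = Counter(LABELS[(treated, responded)] for treated, responded in rows)
    return {label: counts.get(label, 0) for label in ("TR", "TN", "CR", "CN")}

# Hypothetical test results
rows = [(1, 1), (1, 0), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1)]
print(response_matrix(rows))  # {'TR': 2, 'TN': 2, 'CR': 1, 'CN': 3}
```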

Since the customers in the Control Responders group purchased anyway, targeting them would be a waste of resources. Control Non-Responders didn’t get a coupon, and didn’t purchase, but some of them might do if they were given one, though this pot will also include loads of lapsed customers.

Treatment Responders did get a coupon and did respond to it. However, we don’t know if that was because they got a coupon or because they’d have ordered anyway. Treatment Non-Responders, meanwhile, were given a coupon but it failed to generate an order.

Which customers should I target?

Quite a few authors in the field of uplift modeling refer to customers as Sure Things, Lost Causes, Do Not Disturbs, or Persuadables to make it easier for them to refer to specific subsets of the customer population based on their likelihood to generate uplift. They’re quite useful definitions, much like RFM scores, in that they’re easy for people to understand.

Sure Things	Sure Things include Control Responders and Treatment Responders who purchase whether they're in a campaign or not. As the marketing to these seemingly makes no difference, excluding them can increase profit and allow resources to be allocated where they'll give a better return.
Lost Causes	Lost Causes are mostly your lapsed customers, who are found in the Treatment Non-Responders and Control Non-Responders. They're not going to buy whether you send them a campaign or not, so excluding them will save you money and lose you little.
Do Not Disturbs	Do Not Disturbs (or Sleeping Dogs) hate marketing. They're found in the Treatment Non-Responders and Control Responders. They won't purchase if you bombard them with offers, but they might purchase if you leave them alone. Not only is marketing to them a waste of money, it could lose you these customers and reduce the size of your email database through unsubscribes. Marketing to them will increase attrition and increase costs.
Persuadables	Persuadables are the only ones you should really focus upon, as they're the only ones that provide incremental sales. They'll only purchase (or spend more, or purchase earlier) if you contact them and they react positively to marketing. They fall into the Treatment Responders and Control Non-Responders groups.
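One way to keep these four definitions straight is to think of each customer as having two hypothetical outcomes: what they’d do if contacted, and what they’d do if left alone. The sketch below encodes that idea; the function names are mine, and of course in real data you only ever observe one of the two outcomes for any customer, which is exactly why a model is needed.

```python
def customer_type(buys_if_treated, buys_if_not_treated):
    """Classify a customer from their two (hypothetical) outcomes."""
    if buys_if_treated and buys_if_not_treated:
        return "Sure Thing"
    if buys_if_treated:
        return "Persuadable"
    if buys_if_not_treated:
        return "Do Not Disturb"
    return "Lost Cause"

def observed_cell(buys_if_treated, buys_if_not_treated, treated):
    """Which of the four cells (TR, TN, CR, CN) this customer would be seen in."""
    responded = buys_if_treated if treated else buys_if_not_treated
    return ("T" if treated else "C") + ("R" if responded else "N")

print(customer_type(True, False))               # Persuadable
print(observed_cell(True, False, treated=True))   # TR
print(observed_cell(True, False, treated=False))  # CN
```

So a Persuadable shows up as a Treatment Responder if they happened to be in the campaign and as a Control Non-Responder if they were held out, and each observed cell contains a mix of two customer types.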

How can I identify Persuadables?

The concept of identifying the Persuadables was the subject of The Great Hack, an award-winning Netflix documentary on the Cambridge Analytica scandal.

Here, the Persuadables were people who’d vote for Trump or Brexit if shown controversial fake news posts on social media, rather than people who’d use a coupon. However, the approach used for so-called “persuasion modeling” is fairly similar to that used for uplift modeling.

The objective of the uplift model is to identify the Sure Things, Do Not Disturbs, Lost Causes, and Persuadables present in the Control Responders, Control Non-Responders, Treatment Responders, and Treatment Non-Responders pots, and then target the Persuadables. However, in practice, it’s not actually that easy to get your head around how this is achieved.

One method I’ve seen used is to predict the probability of each customer appearing in either the Control Responders, Control Non-Responders, Treatment Responders, or Treatment Non-Responders groups, using a supervised learning model that predicts multiple nominal outcomes, such as multinomial logistic regression. (See the Kane et al. paper for a better explanation of this.)
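To give a flavour of how the four class probabilities might be combined into a single score, here’s a rough sketch. The idea, as I understand it, is to add the probabilities of the two cells where Persuadables can appear (TR and CN), subtract the other two, and divide each by the share of customers in its group so that unequal treatment and control sizes don’t skew the result. The exact weighting, the `four_class_score` name, and the probabilities are my assumptions, so check them against the paper before relying on this.

```python
def four_class_score(probs, treatment_share):
    """Combine predicted class probabilities for one customer into a lift-style score.

    probs: dict with keys 'TR', 'TN', 'CR', 'CN' (summing to 1).
    treatment_share: fraction of all test customers who were in the treatment group.
    """
    control_share = 1.0 - treatment_share
    return (
        probs["TR"] / treatment_share
        + probs["CN"] / control_share
        - probs["TN"] / treatment_share
        - probs["CR"] / control_share
    )

# Hypothetical output of a multinomial model for one customer,
# from a test with a 50/50 treatment/control split
probs = {"TR": 0.10, "TN": 0.40, "CR": 0.05, "CN": 0.45}
score = four_class_score(probs, treatment_share=0.5)
print(round(score, 4))  # 0.2
```

With random assignment, this customer’s implied response rate is 20% if treated and 10% if not, so the score works out at twice the uplift; the ranking of customers is the same either way.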


What’s the issue with coupon marketing in ecommerce?

Used correctly, coupon marketing can be extremely effective in ecommerce. It can aid customer acquisition, reduce customer acquisition costs, improve conversion rates, reduce advertising costs, and increase average order values.

The downside is that, if it’s handled incorrectly, it can encourage customers to become deal prone or coupon prone so they only purchase when coupons are provided, which can obviously erode your margin. With care, it’s an exceptionally useful tool to have at your disposal.

Is coupon-proneness a problem in internet retail?

Yep. Coupon-proneness was already fairly well-studied in the days before ecommerce, but several studies have shown that it can be just as prevalent online as it can be offline. Indeed, my own postgraduate research examined the use of coupons for customer acquisition in specialist ecommerce markets. Here, I studied a retailer who was using aggressive coupon discounts to acquire customers shortly after launch. This looked like it was working, but it wasn’t really…

My research project demonstrated that the customers acquired were behaving very differently to those acquired via channels, such as paid or organic search, where a deep discount wasn’t offered to persuade the customer to place their first order.

The customers acquired via deep discounts from marketing coupons were not only less likely to be retained, they also spent less when they ordered, and they were significantly more likely to only purchase again if they were given another deep discount to persuade them.

Unfortunately, by using coupon code websites such as Wowcher and Groupon, the strategy inadvertently targeted customers who were already deal prone. It set the price for the product too low, and made them reluctant to buy the product again at full price.

As the company was also paying a fee to the affiliate promoting the coupon offer, as well as offering customers a hefty discount, it took several orders to break even on each acquisition.

The commercial expectation was that the customers acquired would be like the rest, but they were so different that it made the activity unprofitable and the acquisition strategy was dropped, despite its apparent initial success as a way of rapidly gaining customers. Using voucher code websites to acquire customers is not, therefore, a strategy I would recommend…