Market Basket Analysis, or MBA, is a subset of affinity analysis and has been used in the retail sector for many years. It provides a computational method for identifying common associations between items - usually products - from which strategy can be formed.
The most common application of MBA is to aid cross-selling and up-selling. For example, if you know that customers who buy nappies also buy baby wipes, you can promote the baby wipes on the same page or offer the two as a bundle to try to increase sales. Or, you could promote one product and see an uplift in the other.
Market Basket Analysis isn’t solely limited to cross-sells and up-sells. It’s also used in category management, store layout planning, and within marketing for loyalty programs, promotions, and discount plans.
It’s remarkably versatile. In fact, it can be applied to almost any kind of data where there are natural and repetitive groupings - not just in the world of retail data science. Similar techniques are also applied in web analytics, online security and intrusion detection, bioinformatics, and product manufacturing.
Here, we’re going to load up some real Google Analytics transactional data from an ecommerce business, calculate the product association rules, and look at how you could use the output to help you create a strategy to increase sales and site performance, or use them to serve product recommendations to customers.
For this project we’ll be using the GAPandas package, which lets you pull Google Analytics data into a Pandas DataFrame for analysis, then we’ll be using the Apriori algorithm provided via the mlxtend package to calculate the association rules.
Apriori is one of a number of similar association rule mining algorithms, such as Eclat and FP-Growth. As association rule algorithms go, Apriori isn’t hugely complicated and you can write it from scratch with relative ease (I did this in PHP and MySQL a decade ago). However, the mlxtend package makes it loads quicker to deploy and saves you reinventing the wheel. It’s also very efficient, so doesn’t take long to run at all.
The Apriori algorithm is designed to find “frequently occurring itemsets”. An itemset is basically a group of items that occur together (such as products in a basket), while their frequency of co-occurrence depends on a user-defined “support” threshold.
The process involves two key steps. First you find all frequent itemsets that meet a minimum support threshold, then you create rules by applying a confidence constraint.
To get started, install GAPandas and mlxtend if you don’t have them already. You can do this by entering `pip3 install gapandas` and `pip3 install mlxtend`. Once they’re installed, load up the packages below.
import pandas as pd
from gapandas import connect, query
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
As some of the lines in our DataFrames could be quite large, I’d advise using the Pandas `set_option()` function to set `display.max_colwidth` to a value that allows you to see the full un-truncated data, and `display.max_rows` so you can scroll through a good sample of your data.
pd.set_option('display.max_rows', 1500)
pd.set_option('display.max_colwidth', 1000)
Next, you’ll need to set up GAPandas to query the Google Analytics API using Python and extract some ecommerce basket data to analyse. Obtain your client_secrets.json
file from the Google API Console and set up your connection. For this demo, rather than revealing any secrets, I’ve obfuscated the data.
from gapandas import connect, query
service = connect.get_service('client_secrets.json')
Now that GAPandas is set up, you’ll need to write a simple query to pass to the Google Analytics API. We do this by defining the view ID for your Google Analytics account and creating a “payload” to send to the API, which will return the basic fields we need for the Apriori algorithm: the transaction ID, the product SKU, the product name, and the quantity of items.
Think carefully about the duration you use. A short duration has the benefit of ignoring products you may no longer sell, but contains less information for the algorithm. However, a longer duration provides the extra data but could include seasonal products or those which may have been discontinued.
view = '00000000'
payload = {
'start_date': '30daysAgo',
'end_date': 'yesterday',
'metrics': 'ga:itemQuantity',
'dimensions': 'ga:transactionId,ga:productSku,ga:productName'
}
df = query.run_query(service, view, payload)
If you print out the resulting Pandas DataFrame, you’ll see that we get one row per product, with items neatly ordered by their transaction ID. We don’t actually need every field, but depending on how your data are stored, the extra fields can be useful in aiding interpretation.
df.head()
| | transactionId | productSku | productName | itemQuantity |
|---|---|---|---|---|
| 0 | 123456 | RANGE1-PROD1 | RANGE1 | 1 |
| 1 | 123457 | RANGE3-PROD2 | RANGE3 | 1 |
| 2 | 123458 | RANGE19-PROD8 | RANGE19 | 12 |
| 3 | 123459 | RANGE1-PROD1 | RANGE1 | 100 |
| 4 | 123459 | RANGE2-PROD3 | RANGE2 | 5 |
The data we’ve extracted from Google Analytics using GAPandas are in the usual form you’ll see in most ecommerce systems, with one unique product SKU per line. However, the Apriori algorithm needs the data in a slightly different format, so we’ll need to re-shape the data.
For this implementation of Apriori to work, the data need to be reformatted so that each transaction is on one line, and each SKU in its own column, with a one-hot encoded representation denoting whether the SKU appeared in the basket or not.
To achieve this, we’ll first create a new DataFrame called `baskets` and assign the columns we need: `transactionId`, `productSku`, and `itemQuantity`. Then we’ll use the Pandas `groupby()` function to group on both the `transactionId` and `productSku` columns. Next, we’ll calculate the `sum()` of the `itemQuantity` column, `unstack()` the data, reset the index, fill any `NaN` values and set the `transactionId` to be the index.
baskets = df[['transactionId', 'productSku', 'itemQuantity']].copy()
baskets['itemQuantity'] = baskets['itemQuantity'].astype('int')
baskets = (baskets.groupby(['transactionId', 'productSku'])['itemQuantity']
           .sum().unstack().reset_index().fillna(0).set_index('transactionId'))
At the moment, we have a sparse matrix containing mostly `0.0` values, where any positive value denotes the number of units of that SKU in the basket. However, what we need is a one-hot encoded representation, with `0` representing the item’s absence from the basket and `1` denoting any quantity over zero - the actual number of units doesn’t matter to the algorithm (it can also use `True` or `False` Boolean values). As we want to do this to every column in the DataFrame (our transaction ID has been set as the index), we can use the `applymap()` function to apply a custom function to every cell.
def one_hot_encode(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

baskets = baskets.applymap(one_hot_encode)
baskets.head()
| productSku | PROD1 | PROD2 | PROD3 | PROD4 |
|---|---|---|---|---|
| transactionId | | | | |
| 123456 | 0 | 0 | 0 | 0 |
| 123457 | 0 | 0 | 0 | 0 |
| 123458 | 0 | 0 | 1 | 0 |
| 123459 | 1 | 0 | 0 | 0 |
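As an aside, the same one-hot encoding can be done without a custom function by comparing the whole DataFrame against zero, which is faster on wide matrices and sidesteps `applymap()` (deprecated in recent pandas releases in favour of `DataFrame.map()`). A minimal sketch, using a made-up quantity matrix in the same shape as our reshaped data:

```python
import pandas as pd

# Toy quantity matrix shaped like the reshaped baskets DataFrame
baskets = pd.DataFrame(
    {'PROD1': [0.0, 100.0], 'PROD2': [12.0, 0.0]},
    index=pd.Index(['123458', '123459'], name='transactionId')
)

# Any quantity above zero becomes 1, everything else 0
baskets = (baskets > 0).astype(int)
print(baskets)
```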
Now that our DataFrame contains one-hot encoded `0` and `1` data (or Boolean `True` or `False` values), we can run it through the `apriori()` function. First, we pass in the `baskets` DataFrame containing our one-hot encoded data, then we’ll set `use_colnames` to `True` so we see the product name in the `itemsets` rather than an indecipherable index. We’ll set `verbose` to `1` so we can see how many combinations are processed and how many itemsets are found according to the `min_support` value.
Setting the right `min_support` value will take some tweaking, depending on the nature of your dataset. The value needs to be a float between `0` and `1` and is calculated as “transactions containing the itemset” divided by “total transactions”. So, a `min_support` value of `0.5` requires an itemset to appear in at least half of all transactions.
That’s unlikely to happen, unless you have a very small product range. In many businesses, there’s actually not a huge amount of correlation between some products, so making accurate recommendations isn’t always easy. You’ll probably need a very low `min_support` value in practice. However, do be aware that the lower the support threshold you set, the more memory you’ll use and the longer the model will take to calculate the associations.
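To put a number on it, support is just a fraction, so you can sanity-check a candidate threshold against your own order volumes. The figures below are made up for illustration:

```python
# Support = transactions containing the itemset / total transactions
total_transactions = 20_000
transactions_with_itemset = 15  # e.g. orders containing a given product pair

support = transactions_with_itemset / total_transactions
print(support)  # 0.00075, so min_support=0.0005 would keep this itemset
```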
itemsets = apriori(baskets,
use_colnames=True,
verbose=1,
low_memory=False,
min_support=0.0005
)
itemsets.head(5)
| | support | itemsets |
|---|---|---|
| 0 | 0.000761 | PROD1 |
| 1 | 0.000507 | PROD2 |
| 2 | 0.000507 | PROD3 |
| 3 | 0.001014 | PROD4 |
| 4 | 0.000761 | PROD5 |
You’ll probably notice that most of the `itemsets` returned above only contain one item. However, if you dig a bit deeper into the data and apply a `lambda` function that calls `len()`, you can calculate how many items are present in each itemset. The single-item entries are frequent on their own, while those with two or more items represent groups of products that frequently appear together in baskets.
itemsets['length'] = itemsets['itemsets'].apply(lambda x: len(x))
itemsets.head(4)
| | support | itemsets | length |
|---|---|---|---|
| 0 | 0.000761 | PROD1 | 1 |
| 1 | 0.000507 | PROD8 | 1 |
| 2 | 0.000507 | PROD8 | 1 |
| 3 | 0.001014 | PROD20 | 1 |
If you print the `value_counts()` for the `length` column, you’ll see how many of the returned `itemsets` contain multiple items and how many contain a single item.
itemsets['length'].value_counts()
1 811
2 459
3 83
4 12
5 1
Name: length, dtype: int64
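Once the `length` column exists, it’s also easy to filter out the single-item entries and focus on the multi-product itemsets. A minimal sketch, using invented data in the same shape as the `apriori()` output:

```python
import pandas as pd

# Hypothetical itemsets DataFrame shaped like apriori()'s output
itemsets = pd.DataFrame({
    'support': [0.0008, 0.0005, 0.0006],
    'itemsets': [frozenset({'PROD1'}),
                 frozenset({'PROD1', 'PROD9'}),
                 frozenset({'PROD1', 'PROD9', 'PROD12'})],
})
itemsets['length'] = itemsets['itemsets'].apply(len)

# Keep only itemsets containing two or more products
multi = itemsets[itemsets['length'] >= 2]
print(multi)
```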
Next, we can use the `itemsets` DataFrame, containing each itemset and its `support` value, to calculate the association rules and examine how strong the associations are.
rules = association_rules(itemsets, metric="lift", min_threshold=0.5)
rules.sort_values(by='confidence').head(5)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 3 | PROD1 | PROD9 | 0.011920 | 0.006594 | 0.000507 | 0.042553 | 6.453355 | 0.000429 | 1.037557 |
| 2 | PROD2 | PROD93 | 0.006594 | 0.011920 | 0.000507 | 0.076923 | 6.453355 | 0.000429 | 1.070420 |
| 4 | PROD192 | PROD823 | 0.006594 | 0.005072 | 0.000507 | 0.076923 | 15.165385 | 0.000474 | 1.077838 |
| 0 | PROD99 | PROD98 | 0.000507 | 0.000507 | 0.000507 | 1.000000 | 1971.500000 | 0.000507 | inf |
| 1 | PROD34 | PROD35 | 0.000507 | 0.000507 | 0.000507 | 1.000000 | 1971.500000 | 0.000507 | inf |
To interpret the results from association rule analysis you’ll need to understand the technical jargon and metrics it uses, as they’re different to those you’ll have seen in packages such as scikit-learn. One common thing you’ll see when reading about association rule learning is that the rules are expressed in shorthand like this: {baby wipes, nappies} ⇒ {nappy bags}, i.e. if someone buys baby wipes and nappies, they’ll often buy nappy bags too.
Association rule mining uses special terminology to refer to the items on either side of the rule. The antecedent represents the items on the left-hand side of our rule, i.e. {baby wipes, nappies}, while the value on the right-hand side (i.e. {nappy bags}) is called the consequent. The antecedent is therefore the `X` value and the consequent is the `y`.
The support metric indicates how frequently the itemset occurs within the dataset. If your itemset {baby wipes, nappies} appears in 10% of transactions in the whole dataset, it will have a support of `0.1`. If you sell thousands of SKUs, you’ll probably find that your support values are very low due to the breadth of your range.
The confidence value tells you how often the rule proves to be true. A value of `1` indicates that the consequent appears in every transaction containing the antecedent, while `0.1` shows that it appears in 10% of them. For example, {baby wipes, nappies} ⇒ {nappy bags} might have a confidence value of 0.71, indicating that 71% of customers who buy baby wipes and nappies also buy nappy bags at the same time.
Lift is essentially the support (or the probability of all the items in a rule occurring together) divided by the product of the probabilities of the items on either side appearing if there was no association. The higher the lift, the stronger the association between the antecedent and the consequent.
For example, let’s say that we’re looking at the {baby wipes, nappies} ⇒ {nappy bags} association rule. The support for this hypothetical association might be 2% (i.e. 2% of orders include all three items in the association). The {baby wipes, nappies} appear in 10% of transactions, and the {nappy bags} appear in 5% of transactions. Therefore, the lift can be calculated as 0.02 / (0.1 * 0.05) = 4.
A lift of over 1 indicates a positive association, i.e. a promotion that increases sales of {baby wipes} or {nappies} should also aid sales of {nappy bags}. Conversely, running out of one of these products, or setting its price too high, could impact sales of the others.
Leverage is a bit like lift. Both measure the relationship between the observed probability of an association rule occurring and the probability expected if the items were independent. Lift takes the ratio of the two, while leverage takes the difference. Lift can identify seemingly strong associations between items that don’t occur very frequently, whereas leverage prioritises rules that occur more often and have more support.
Conviction compares how often the antecedent would be expected to occur without the consequent, if the two were independent, with how often the rule actually fails. It’s a bit hard to intuit, but broadly, a high conviction means that the consequent is highly dependent on the antecedent.
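These metrics can be sanity-checked against the worked nappy-bag example with a few lines of arithmetic. All of the figures here are hypothetical, and the formulas are the standard textbook definitions rather than anything specific to mlxtend:

```python
# Hypothetical probabilities from the {baby wipes, nappies} => {nappy bags} example
antecedent_support = 0.10   # share of baskets containing baby wipes and nappies
consequent_support = 0.05   # share of baskets containing nappy bags
rule_support = 0.02         # share of baskets containing all three items

confidence = rule_support / antecedent_support                     # ≈ 0.2
lift = rule_support / (antecedent_support * consequent_support)    # ≈ 4.0
leverage = rule_support - antecedent_support * consequent_support  # ≈ 0.015
conviction = (1 - consequent_support) / (1 - confidence)           # ≈ 1.1875

print(confidence, lift, leverage, conviction)
```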
The data we’ve looked at here are very granular and consist of the “child” or “simple product” level SKUs, rather than the “parent” or “range” SKU. This can make things a bit too specific to be useful. For example, going back to our {baby wipes, nappies} ⇒ {nappy bags} itemset, most shops sell different brands of each and the nappies come in different sizes.
This extra granularity introduces noise and will make it harder to identify broader associations. For example, this approach would tell us that people who buy “Pampers Premium Protection Size 1, Jumbo Pack 72 Nappies, 2-5Kg” also buy “Pampers Fresh Clean Baby Wipes 15x80 = 1200 Wipes”, when it may be more useful to know that people who buy nappies also buy baby wipes.
Grouping them by their parent range SKU, or the general product type, can help make the results a bit less granular. When planning your Google Analytics implementation, I’d highly recommend using the productName field to hold the parent or range SKU, and not the product name. Unlike product names, which often get changed or optimised for SEO and may not be entered consistently, parent or range SKUs are static and usually more accurate.
If you pull these in and re-run the query, you’ll generate less granular and broader, more actionable data. If your site uses configurable or grouped product detail pages, in which one page shows a number of simple product variants, this approach also makes sense to help you analyse their performance and optimise them.
import pandas as pd
from gapandas import connect, query
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Fetch data using the Google Analytics API
service = connect.get_service('client_secrets.json')
view = '00000000'
payload = {
'start_date': '30daysAgo',
'end_date': 'yesterday',
'metrics': 'ga:itemQuantity',
'dimensions': 'ga:transactionId,ga:productSku,ga:productName'
}
df = query.run_query(service, view, payload)
# Re-shape the data and use the productName field
baskets = df[['transactionId', 'productName', 'itemQuantity']]
baskets = baskets.copy()
baskets['itemQuantity'] = baskets['itemQuantity'].astype('int')
baskets = baskets.groupby(['transactionId', 'productName'])['itemQuantity'].sum().unstack().reset_index().fillna(0).set_index('transactionId')
# Apply one-hot encoding
def one_hot_encode(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

baskets = baskets.applymap(one_hot_encode)
# Generate frequent itemsets with Apriori
itemsets = apriori(baskets,
use_colnames=True,
verbose=1,
low_memory=False,
min_support=0.0005
)
# Calculate association rules
rules = association_rules(itemsets, metric="lift", min_threshold=0.5)
# Examine the results
rules.sort_values(by='support', ascending=False).head(5)
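Before handing the output to merchandisers, it’s usually worth filtering the rules down to the strongest ones. A sketch of one way to do that, against a hypothetical rules DataFrame with the same columns `association_rules()` produces (the thresholds here are arbitrary, so tune them to your data):

```python
import pandas as pd

# Invented rules shaped like association_rules() output
rules = pd.DataFrame({
    'antecedents': [frozenset({'PROD1'}), frozenset({'PROD2'}), frozenset({'PROD99'})],
    'consequents': [frozenset({'PROD9'}), frozenset({'PROD93'}), frozenset({'PROD98'})],
    'confidence': [0.04, 0.08, 1.0],
    'lift': [6.45, 6.45, 1971.5],
})

# Keep rules where the association is positive and reasonably reliable
strong = rules[(rules['lift'] > 1) & (rules['confidence'] >= 0.05)]
print(strong.sort_values(by='confidence', ascending=False))
```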
The lift metric can help you identify the products whose sales are correlated with the sales of other items. Suppose you wanted to increase sales of high-margin own-brand nappy bags, and both the {baby wipes, nappies} ⇒ {nappy bags} itemset and the {baby wipes} ⇒ {nappies} itemset showed a lift well above 1. If you ran a promotion on the baby wipes or the nappies, you could expect to see an uplift in the sales of your nappy bags.
Some items might appear frequently within the dataset and have a positive impact upon the sales of other lines. These are often called Key Value Items or KVIs. For example, maybe customers who buy nappies are very cost conscious and shop at your store because your nappies are very reasonably priced. The data might indicate that when you sell nappies you also sell a number of other products, like baby wipes, nappy bags, baby oil, talcum powder, headache tablets, and beer. Knowing this, and ensuring your nappies always remain competitively priced and in stock, can help prevent poor pricing strategy or stock-outs from having a knock-on impact on sales of other products.
Many platforms, such as Magento, let you manually create your own static product recommendations. While this is a bit of an old school technique, it can work very well and does have some advantages in certain ecommerce markets. You can get your ecommerce staff to use their domain knowledge to manually create lists of recommended extras that go with the product, even if they’ve not been bought before, and you can serve up-sells using a Good, Better, Best or GBB technique.
This gives you maximum control and lets you introduce customers to new products, even if they’ve got no sales data behind them when first launched. The downside is that it can be very laborious, unless you build internal tools to help your team maintain these manual recommendations. Apriori can be a great way for your ecommerce merchandisers to identify the associations between products, so that they can configure the ecommerce platform to show the right ones. This can result in an increase in the number of units per order and an increase in average order value.
By examining the antecedents and consequents with support, you can also use the data to help your content writers ensure they plug the right products in their blog posts, which both improves the customer experience and aids your ecommerce conversion rate.
As the support metric tells you how often an itemset appears in the dataset, you can use it as a way of checking whether any manual product recommendations you’re serving are actually any good. If your ecommerce merchandisers have used their domain knowledge to recommend a given product from another product detail page, you can check how often that itemset occurs in your dataset and see if there’s a hidden itemset that would make a better replacement. Again, that can improve basket sizes and increase AOV and revenue.
One thing you’ll often see on ecommerce sites are bundled deals that combine two or more commonly associated products, sometimes with a discount, but often just for convenience. For example, let’s say you’re buying a mobile phone. The association rule data might show that you’ve got associations with screen protectors, phone cases and additional charging cables. Creating a bundle which incorporates all three of these into a single product, and allows the customer to purchase using a single click, can help you increase AOV. If you’re providing a discount, you do need to check that the profit you’re making with the promotion exceeds what you would have generated anyway, however, this isn’t very difficult to do.
If you have access to customer IDs (which aren’t present in Google Analytics) you can also use this technique to examine customers instead of baskets. Here, you’d add the customer ID to the order lines DataFrame and assign it, rather than the transaction ID, to the index in the re-shaped DataFrame. This gives you a one-hot encoded DataFrame containing all of the products purchased by each customer. From this, you can then create association rules to identify other products a customer might like based on their previous purchases, which you can use in email marketing.
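The reshape looks much the same as before, just indexed on the customer rather than the transaction. The column names below are assumptions, so adjust them to match your order database’s schema:

```python
import pandas as pd

# Hypothetical order lines that include a customer identifier
orders = pd.DataFrame({
    'customerId': ['C1', 'C1', 'C2', 'C2', 'C2'],
    'productSku': ['PROD1', 'PROD2', 'PROD1', 'PROD3', 'PROD3'],
    'itemQuantity': [1, 2, 1, 5, 1],
})

# One row per customer, one column per SKU, one-hot encoded
customers = (orders.groupby(['customerId', 'productSku'])['itemQuantity']
             .sum().unstack().fillna(0))
customers = (customers > 0).astype(int)
print(customers)
```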
If you know what each customer buys and you can identify association rules that suggest they’d also like other products, you can use this to your advantage when targeting promotions. Instead of blanket untargeted promotions, you can appeal directly to their tastes and buying behaviour by promoting the sales of either products they like now, or products other similar customers buy. When that’s combined with products which contribute to lift, a promotion on one line can impact one or more other lines.
Many customers are brand conscious and often prefer to buy items from a handful of brands. Maybe they’ll only drink Corona beer, and buy Pampers nappies for their children. If you re-run the analysis using the brand name instead of the SKU or product name, you’ll end up with a useful set of data that shows the associations between certain brands. This could reveal useful data you can use in segmentation and promotional targeting. For example, someone who only ever buys a supermarket’s cheapest own-brand lines isn’t likely to go for an ultra premium brand, but might be persuaded to switch to a mid-range one if the offer was right.
Matt Clarke, Tuesday, March 02, 2021