Product recommender systems, or recommendation systems as they’re also known, are ubiquitous on e-commerce websites these days. They’re relatively simple to create and even fairly basic ones can give striking results. In this project, I’ll show you how you can knock up a simple product recommender system in Python, solely using Pandas.
Product recommendation systems can be created using a wide variety of data science techniques and can make use of user data to provide personalised recommendations, make content based recommendations using Natural Language Processing (NLP), utilise data on the user’s browsing history, adopt hybrid approaches, or simply make generalised recommendations based on anonymised user data via an approach known as collaborative filtering.
Collaborative filtering is probably the most widely used algorithm for creating product recommender systems in online retailing. This algorithm is comparatively simple to implement and generates relevant and accurate recommendations. This improves the user experience by presenting products popular with other customers, which can help boost basket sizes, average order value, sales, and revenue.
The name “collaborative filtering” comes from the two main parts of the algorithm. The “filtering” part refers to the generation of predictions, while the “collaborative” part refers to the fact that the data are sourced by the “collaboration” of many users.
In our context, collaborative filtering looks at each product and examines the data from all customers, using a simple correlation score such as Pearson’s r, to identify the other products most correlated with the target, based on whether they were purchased together in the same basket.
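To make the idea of a correlation score concrete, here is a minimal sketch on a made-up basket matrix. The product names and quantities are invented for illustration; only `Series.corr()`, which computes Pearson’s r by default, is real Pandas.

```python
import pandas as pd

# Toy basket matrix: one row per order, one column per product,
# values are units purchased (all data here is made up for illustration).
baskets = pd.DataFrame(
    {
        "TEAPOT": [1, 0, 2, 0, 1],
        "TEA CUP": [2, 0, 4, 0, 2],
        "DOORMAT": [0, 1, 0, 1, 0],
    },
    index=["order_1", "order_2", "order_3", "order_4", "order_5"],
)

# Pearson's r between the target product and two other products.
teapot_vs_cup = baskets["TEAPOT"].corr(baskets["TEA CUP"])
teapot_vs_doormat = baskets["TEAPOT"].corr(baskets["DOORMAT"])

print(round(teapot_vs_cup, 3))      # perfectly correlated: TEA CUP is always 2x TEAPOT
print(round(teapot_vs_doormat, 3))  # negative: never bought in the same order
```

Products that tend to rise and fall together across baskets score close to 1, while products that rarely share a basket score near zero or below.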
In this project, we’ll utilise Pandas to create a simple collaborative filtering model to generate product recommendations for an e-commerce store. This uses user purchase data on baskets from an e-commerce site and makes recommendations by identifying correlations between different items that commonly occur together within baskets.
It’s quick and easy to build and generates decent results - often as good as those from many paid tools or website plugins or extensions. If you’re using the manually generated product recs shown in platforms such as Magento, this can be a great way to quickly generate recommendations and is great value for money, since unlike the commercial recommender systems it costs nothing to set up and you won’t have to pay any commission.
First, open a Jupyter notebook and import the Pandas package using `import pandas as pd`. If you don’t have Pandas, you can install it by entering `pip3 install pandas` in your terminal.

```python
import pandas as pd
```
Next, load up a product transaction items dataset. I’ve used the famous Online Retail dataset from the UCI Machine Learning Repository. Load this up into Pandas and display the first few rows using the `head()` function.
As the example data below show, we have a line for each product within each customer’s basket. The first five lines of the data show that customer 17850 has purchased various products in their basket on this e-commerce site. To help this business generate relevant recommendations, we’re going to look at each product purchased and identify which other items are most commonly purchased alongside it, effectively using the past behaviour of customers to predict the future.

```python
df = pd.read_csv('online_retail.csv')
df.head()
```

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |

We only require a small subset of the fields - the order or invoice number, the stock code or description, and the quantity of units purchased. We can filter the dataframe down to these columns using the code below.

```python
df_baskets = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity']]
df_baskets.head()
```

| | InvoiceNo | StockCode | Description | Quantity |
|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 |

To get a handle on the most popular products, we can use `agg()` to calculate some summary statistics for the products in this dataset. This shows us that the line appearing in the most orders is the WHITE HANGING HEART T-LIGHT HOLDER, although the JUMBO BAG RED RETROSPOT sells more units.

```python
df.groupby('Description').agg(
    orders=('InvoiceNo', 'nunique'),
    quantity=('Quantity', 'sum')
).sort_values(by='orders', ascending=False).head(10)
```

| Description | orders | quantity |
|---|---|---|
| WHITE HANGING HEART T-LIGHT HOLDER | 2302 | 35317 |
| REGENCY CAKESTAND 3 TIER | 2169 | 13033 |
| JUMBO BAG RED RETROSPOT | 2135 | 47363 |
| LUNCH BAG RED RETROSPOT | 1607 | 18779 |
| ASSORTED COLOUR BIRD ORNAMENT | 1467 | 36381 |
| SET OF 3 CAKE TINS PANTRY DESIGN | 1458 | 7336 |
| PACK OF 72 RETROSPOT CAKE CASES | 1334 | 36039 |
| LUNCH BAG BLACK SKULL. | 1295 | 12112 |
| NATURAL SLATE HEART CHALKBOARD | 1266 | 9120 |

The primary component of our recommendation engine is a matrix that states the quantity of units of each item present in each basket. We can create this matrix easily using the `pivot_table()` function. We’ll set `InvoiceNo` as the index column and place each product name in a column, with the quantity of units stored in each cell. Any `NaN` values will be replaced by zeros. Note that `pivot_table()` aggregates with the mean by default, so if a product appears on more than one line of the same invoice, the cell holds the average of those lines rather than their total.

```python
df_items = df_baskets.pivot_table(index='InvoiceNo', columns=['Description'], values='Quantity').fillna(0)
df_items.head(3)
```

| Description | 4 PURPLE FLOCK DINNER CANDLES | 50'S CHRISTMAS GIFT BAG LARGE | DOLLY GIRL BEAKER | I LOVE LONDON MINI BACKPACK | I LOVE LONDON MINI RUCKSACK | NINE DRAWER OFFICE TIDY | OVAL WALL MIRROR DIAMANTE | RED SPOT GIFT BAG LARGE | SET 2 TEA TOWELS I LOVE LONDON | SPACEBOY BABY GIFT SET | ... | wrongly coded 20713 | wrongly coded 23343 | wrongly coded-23343 | wrongly marked | wrongly marked 23343 | wrongly marked carton 22804 | wrongly marked. 23343 in box | wrongly sold (22719) barcode | wrongly sold as sets | wrongly sold sets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

3 rows × 4223 columns
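The column list above includes entries such as “wrongly coded 20713”, which look like stock adjustments rather than real products. Below is a minimal sketch of one way to tidy the lines before pivoting. The helper name `build_item_matrix` and the sample rows are hypothetical, and the idea that adjustment rows carry non-positive quantities is an assumption worth checking against your own data. Cancelled invoices in this dataset have numbers starting with “C”.

```python
import pandas as pd


def build_item_matrix(df):
    """Hypothetical helper: clean the transaction lines, then pivot them."""
    cleaned = df.dropna(subset=["Description"])
    # Assumption: returns and stock adjustments carry non-positive quantities.
    cleaned = cleaned[cleaned["Quantity"] > 0]
    # Cancelled invoices in this dataset have numbers starting with "C".
    is_cancelled = cleaned["InvoiceNo"].astype(str).str.startswith("C")
    cleaned = cleaned[~is_cancelled]
    # aggfunc="sum" adds up repeated lines for the same product on one
    # invoice; pivot_table() would otherwise average them.
    return cleaned.pivot_table(
        index="InvoiceNo",
        columns="Description",
        values="Quantity",
        aggfunc="sum",
        fill_value=0,
    )


# Made-up rows that mimic the layout of the dataset.
sample = pd.DataFrame(
    {
        "InvoiceNo": ["1001", "1001", "1001", "C1002", "1003", "1003"],
        "Description": ["TEAPOT", "TEAPOT", "TEA CUP", "TEAPOT", None, "wrongly coded 20713"],
        "Quantity": [1, 2, 4, -1, 3, -5],
    }
)

matrix = build_item_matrix(sample)
print(matrix)
```

Whether to clean this aggressively is a judgment call: it removes noise from the matrix, but the correlations it produces will differ from those shown in this article.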
Finally, we can create a little helper function for our recommendation system to make it quick and easy to identify which products are associated with others. First, we use the `corrwith()` function to calculate the Pearson correlation coefficient between the target product and every other product. We then drop the `NaN` values and place the results in a dataframe sorted by descending correlation.
When we run the `get_recommendations()` function, we pass in our item matrix dataframe, which holds the quantity of each product in each basket, as well as the column name for our target product. The function then calculates the Pearson correlation between the target and every other product and returns the products ranked by correlation, giving us product recs to display to the user on the product detail page or present in an email.

```python
def get_recommendations(df, item):
    """Generate a set of product recommendations using item-based collaborative filtering.

    Args:
        df (dataframe): Pandas dataframe containing matrix of items purchased.
        item (string): Column name for target item.

    Returns:
        recommendations (dataframe): Pandas dataframe containing product recommendations.
    """

    recommendations = df.corrwith(df[item])
    recommendations.dropna(inplace=True)
    recommendations = pd.DataFrame(recommendations, columns=['correlation']).reset_index()
    recommendations = recommendations.sort_values(by='correlation', ascending=False)

    return recommendations
```

To run the function we pass in the dataframe containing our matrix of baskets and items, and the name of the target product. The function returns a dataframe of products ranked by how strongly they correlate with that item.
For the “White Hanging Heart T-Light Holder”, the most correlated item is the “Gin + Tonic Diet Metal Sign”, so recommending this on the same page or in the same email might boost sales.

```python
recommendations = get_recommendations(df_items, 'WHITE HANGING HEART T-LIGHT HOLDER')
recommendations.head()
```

| | Description | correlation |
|---|---|---|
| 3918 | WHITE HANGING HEART T-LIGHT HOLDER | 1.000000 |
| 1478 | GIN + TONIC DIET METAL SIGN | 0.824987 |
| 1241 | FAIRY CAKE FLANNEL ASSORTED COLOUR | 0.820905 |
| 1072 | DOORMAT FAIRY CAKE | 0.483524 |
| 3627 | TEA TIME PARTY BUNTING | 0.469207 |

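The first row of the output above is the target product itself, which always correlates perfectly with itself and isn’t much use as a recommendation. Here is a sketch of a variant that drops the target and keeps only the top few rows. The function name `get_top_recommendations`, the parameter `n`, and the toy matrix are hypothetical and only there to illustrate the idea.

```python
import pandas as pd


def get_top_recommendations(df, item, n=3):
    """Hypothetical variant: return the n products most correlated with item."""
    correlations = df.corrwith(df[item]).dropna()
    # A product always correlates perfectly with itself, so remove it.
    correlations = correlations.drop(labels=item, errors="ignore")
    top = correlations.sort_values(ascending=False).head(n)
    return top.rename("correlation").rename_axis("Description").reset_index()


# Made-up basket matrix: rows are invoices, columns are products.
toy_items = pd.DataFrame(
    {
        "TEAPOT": [1, 0, 2, 0, 1],
        "TEA CUP": [2, 0, 4, 0, 2],
        "SUGAR BOWL": [1, 0, 1, 0, 0],
        "DOORMAT": [0, 1, 0, 1, 0],
    }
)

top = get_top_recommendations(toy_items, "TEAPOT", n=2)
print(top)
```

With the real data you would call it as `get_top_recommendations(df_items, 'WHITE HANGING HEART T-LIGHT HOLDER', n=5)`.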
For “Party Bunting”, we see that “Spotty Bunting” also commonly appears in the same baskets.

```python
recommendations = get_recommendations(df_items, 'PARTY BUNTING')
recommendations.head()
```

| | Description | correlation |
|---|---|---|
| 3301 | SET/20 FRUIT SALAD PAPER NAPKINS | 0.207832 |
| 726 | CHARLOTTE BAG SUKI DESIGN | 0.181390 |
| 120 | 75 GREEN FAIRY CAKE CASES | 0.176897 |

One drawback with collaborative filtering is that it suffers from the “cold start problem”. Recommender systems - whether they’re using content based, item based, or user based filtering methods - all have one requirement in common: their underlying algorithms require a good amount of information for them to generate a relevant product recommendation.
When products are new, and few customers have purchased them, the amount of information available to recommender systems can be too low to calculate a meaningful correlation and the products may not appear within recommendations. This is known as the “cold start”. It takes time for products to warm up and generate enough data to allow recommender systems to produce relevant results.
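One way to handle thinly sold products explicitly is to leave them out of the matrix until they have been ordered a minimum number of times. This is a sketch under assumptions rather than part of the method above: the helper `filter_rare_items`, its `min_orders` threshold, and the toy data are all hypothetical, and the right threshold depends on your catalogue.

```python
import pandas as pd


def filter_rare_items(df_items, min_orders=3):
    """Hypothetical helper: keep products bought in at least min_orders baskets."""
    # Count the baskets in which each product has a non-zero quantity.
    orders_per_item = (df_items > 0).sum(axis=0)
    keep = orders_per_item[orders_per_item >= min_orders].index
    return df_items[keep]


# Made-up basket matrix: rows are invoices, columns are products.
toy_items = pd.DataFrame(
    {
        "TEAPOT": [1, 0, 2, 0, 1, 1],
        "TEA CUP": [2, 0, 4, 0, 2, 0],
        "NEW PRODUCT": [0, 0, 0, 0, 0, 1],  # only one order so far
    }
)

filtered = filter_rare_items(toy_items, min_orders=3)
print(list(filtered.columns))  # NEW PRODUCT is excluded
```

A product seen in only one or two baskets can otherwise pick up a very high correlation from a single shared order, so a filter like this trades coverage for more stable recommendations.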
Some product recommendation algorithms avoid the cold start problem by using different approaches, or a mixture of algorithms or filtering methods, so they are less reliant on customer information. For example, they might use a hybrid strategy of utilising content based recommendations, such as the commonly used TF-IDF algorithm, to allow them to make a product recommendation even if that item has not yet been purchased.
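To show roughly what a content based fallback could look like, here is a hand-rolled TF-IDF sketch in Pandas and NumPy that compares product descriptions by the words they share. It is illustrative only: the four descriptions are a tiny sample, the `log(N / df) + 1` weighting is just one common variant, and in practice a library implementation such as scikit-learn’s `TfidfVectorizer` would be the more usual choice.

```python
import numpy as np
import pandas as pd

descriptions = [
    "WHITE HANGING HEART T-LIGHT HOLDER",
    "RED HANGING HEART T-LIGHT HOLDER",
    "WHITE METAL LANTERN",
    "PARTY BUNTING",
]

# Term counts: one row per product, one column per word.
counts = pd.DataFrame(
    {desc: pd.Series(desc.lower().split()).value_counts() for desc in descriptions}
).T.fillna(0)

# Inverse document frequency: words found in fewer descriptions weigh more.
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(len(counts) / doc_freq) + 1
tfidf = counts * idf

# Cosine similarity between every pair of descriptions.
norms = np.sqrt((tfidf ** 2).sum(axis=1))
normalised = tfidf.div(norms, axis=0)
similarity = normalised @ normalised.T

target = "WHITE HANGING HEART T-LIGHT HOLDER"
ranked = similarity[target].drop(target).sort_values(ascending=False)
print(ranked)
```

Because this relies only on the product text, it can suggest similar items for a product that has never been purchased.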
This is just one of a number of ways in which you can use collaborative filtering to generate recommendations. In this example, we’ve used sales data to generate our correlations; however, you can also calculate these correlations based on other metrics.
For example, on content based sites such as blogs, you can serve content based recommendations by calculating the similarity of the text using Natural Language Processing. On movie review sites, you can calculate correlations and generate predictions by using the scores customers give to movies they’ve reviewed. Page views and various engagement metrics can also work well.
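As a rough sketch of the movie ratings case, the same correlation approach can be applied to a user-by-movie matrix of scores. The ratings below are made up, and keeping unrated movies as `NaN` with a `min_periods` threshold is my own choice for this example, not something the basket example above does.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies, NaN means "not rated".
ratings = pd.DataFrame(
    {
        "MOVIE A": [5, 4, 1, np.nan, 2],
        "MOVIE B": [4, 5, 2, 1, np.nan],
        "MOVIE C": [1, np.nan, 5, 4, 5],
    },
    index=["user_1", "user_2", "user_3", "user_4", "user_5"],
)

# Unlike the basket matrix, missing ratings stay as NaN instead of zero,
# so each correlation uses only the users who rated both movies.
# min_periods requires at least three shared raters per pair.
movie_correlations = ratings.corr(method="pearson", min_periods=3)

similar_to_a = (
    movie_correlations["MOVIE A"].drop("MOVIE A").sort_values(ascending=False)
)
print(similar_to_a)
```

Filling gaps with zeros would treat “not rated” as the lowest possible score, which is why this sketch leaves them missing.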
Matt Clarke, Sunday, May 02, 2021