How to create a simple product recommender system in Pandas



Product recommender systems, or recommendation systems as they’re also known, are ubiquitous on e-commerce websites these days. They’re relatively simple to create, and even fairly basic ones can give striking results. In this project, I’ll show you how to build a simple product recommender system in Python using only Pandas.

Product recommendation systems can be created using a wide variety of data science techniques. They can use user data to provide personalised recommendations, make content-based recommendations using Natural Language Processing (NLP), draw on the user’s browsing history, adopt hybrid approaches, or simply make generalised recommendations based on anonymised user data via an approach known as collaborative filtering.

Collaborative filtering for product recommender systems

Collaborative filtering is probably the most widely used algorithm for creating product recommender systems in online retailing. It is comparatively simple to implement and can generate relevant recommendations. It improves the user experience by presenting products popular with other customers, which can help boost basket sizes, average order value, sales, and revenue.

The name “collaborative filtering” comes from the two main parts of the algorithm. The “filtering” part refers to the generation of predictions, while the “collaborative” part refers to the fact that the data are sourced by the “collaboration” of many users.

In our context, collaborative filtering takes each product in turn and examines the purchase data from all customers. Using a simple correlation score, such as Pearson’s r, it identifies the other products most correlated with the target, based on how often they were purchased together in the same basket.
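To make the correlation score concrete, here is a minimal sketch of Pearson’s r written in plain Python. The product names and basket quantities are made up for illustration, and the pearson_r() helper is hypothetical rather than part of Pandas.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of deviations from each mean (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A product with constant quantities has no defined correlation
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Hypothetical quantities of three products across the same five baskets
candles = [2, 0, 1, 0, 3]
holders = [4, 0, 2, 0, 6]  # always bought alongside candles
napkins = [0, 5, 0, 1, 0]  # only bought in baskets without candles

r_candle_holder = pearson_r(candles, holders)
r_candle_napkin = pearson_r(candles, napkins)
print(r_candle_holder, r_candle_napkin)
```

Products that rise and fall together across baskets score close to 1, while products that rarely share a basket score below zero, so the holders would be recommended alongside the candles and the napkins would not.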

In this project, we’ll utilise Pandas to create a simple collaborative filtering model to generate product recommendations for an e-commerce store. This uses user purchase data on baskets from an e-commerce site and makes recommendations by identifying correlations between different items that commonly occur together within baskets.

It’s quick and easy to build and generates decent results - often as good as those from many paid tools or website plugins or extensions. If you’re using the manually generated product recs shown in platforms such as Magento, this can be a great way to quickly generate recommendations and is great value for money, since unlike the commercial recommender systems it costs nothing to set up and you won’t have to pay any commission.

Import the packages

First, open a Jupyter notebook and import the Pandas package using import pandas as pd. If you don’t have Pandas, you can install it by entering pip3 install pandas in your terminal.

import pandas as pd

Load the data

Next, load up a product transaction items dataset. I’ve used the famous Online Retail dataset from the UCI Machine Learning Repository. Load this up into Pandas and display the first few rows using the head() function.

As the example data below show, we have a line for each product within each customer’s basket. The first five lines of the data show that customer 17850 has purchased various products in their basket on this e-commerce site. To help this business generate relevant recommendations, we’re going to look at each product purchased and identify which other items are most commonly purchased alongside it, effectively using the past behaviour of customers to predict the future.

df = pd.read_csv('online_retail.csv')
df.head()

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom

We only require a small subset of the fields - the order or invoice number, the stock code or description, and the quantity of units purchased. We can filter the dataframe down to these columns using the code below.

df_baskets = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity']]
df_baskets.head()

	InvoiceNo	StockCode	Description	Quantity
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6
1	536365	71053	WHITE METAL LANTERN	6
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6

Examine the most popular products

To get a handle on which products are the most popular, we can use groupby() and agg() to calculate some summary statistics for the products in this dataset. This shows us that the WHITE HANGING HEART T-LIGHT HOLDER appears in the most orders, although the JUMBO BAG RED RETROSPOT sells more units overall.

df.groupby('Description').agg(
    orders=('InvoiceNo', 'nunique'),
    quantity=('Quantity', 'sum')
).sort_values(by='orders', ascending=False).head(10)

	orders	quantity
Description
WHITE HANGING HEART T-LIGHT HOLDER	2302	35317
REGENCY CAKESTAND 3 TIER	2169	13033
JUMBO BAG RED RETROSPOT	2135	47363
PARTY BUNTING	1706	18022
LUNCH BAG RED RETROSPOT	1607	18779
ASSORTED COLOUR BIRD ORNAMENT	1467	36381
SET OF 3 CAKE TINS PANTRY DESIGN	1458	7336
PACK OF 72 RETROSPOT CAKE CASES	1334	36039
LUNCH BAG BLACK SKULL.	1295	12112
NATURAL SLATE HEART CHALKBOARD	1266	9120

Create an item matrix

The primary component of our recommendation engine is a matrix holding the quantity of each item present in each basket. We can create this matrix easily using the pivot_table() function. We’ll set the InvoiceNo as the index, and we’ll place each product name in a column, with the quantity of units stored in each cell. Note that pivot_table() aggregates with the mean by default, so if the same product appears on more than one line of an invoice, the cell holds the average quantity per line; pass aggfunc='sum' if you want the total instead. Any NaN values will be replaced by zeros.

df_items = df_baskets.pivot_table(index='InvoiceNo', columns=['Description'], values='Quantity').fillna(0)
df_items.head(3)

Description	4 PURPLE FLOCK DINNER CANDLES	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	I LOVE LONDON MINI RUCKSACK	NINE DRAWER OFFICE TIDY	OVAL WALL MIRROR DIAMANTE	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	...	wrongly coded 20713	wrongly coded 23343	wrongly coded-23343	wrongly marked	wrongly marked 23343	wrongly marked carton 22804	wrongly marked. 23343 in box	wrongly sold (22719) barcode	wrongly sold as sets	wrongly sold sets
InvoiceNo
536365	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536366	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536367	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

3 rows × 4223 columns
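The effect of the aggregation function is easiest to see on a tiny example. The sketch below uses a made-up three-line dataset, in which one invoice lists the same product twice, and compares the default pivot_table() behaviour with aggfunc='sum'. The toy dataframe and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical transaction lines: invoice A1 lists MUG on two separate lines
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "A2"],
    "Description": ["MUG", "MUG", "PLATE", "MUG"],
    "Quantity": [2, 4, 1, 3],
})

# Default aggregation is the mean, so A1/MUG becomes (2 + 4) / 2
mean_matrix = toy.pivot_table(
    index="InvoiceNo", columns="Description", values="Quantity"
).fillna(0)

# Summing instead gives the total units per basket, so A1/MUG becomes 2 + 4
sum_matrix = toy.pivot_table(
    index="InvoiceNo",
    columns="Description",
    values="Quantity",
    aggfunc="sum",
    fill_value=0,
)

print(mean_matrix)
print(sum_matrix)
```

Which version you want depends on what a cell should mean. For a correlation-based recommender either can work, but the summed matrix matches the "units in the basket" reading more directly.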

Create product recommendations

Finally, we can create a little helper function for our recommendation system to make it quick and easy to identify which products are associated with others. First, we use the corrwith() function to calculate the Pearson correlation coefficient between the target product’s column and every other product column. We then drop the NaN values, and place the results in a dataframe sorted by descending correlation.

When we run the get_recommendations() function, we pass in our item matrix dataframe, which holds the quantity of each product in each basket, along with the column name of our target product. The function then calculates the Pearson correlation for the item and returns the most correlated products, giving us product recs to display to the user on the product detail page, or present in an email to the user.

def get_recommendations(df, item):
    """Generate a set of product recommendations using item-based collaborative filtering.
    
    Args:
        df (dataframe): Pandas dataframe containing matrix of items purchased.
        item (string): Column name for target item. 
        
    Returns: 
        recommendations (dataframe): Pandas dataframe containing product recommendations. 
    """
    
    recommendations = df.corrwith(df[item])
    recommendations.dropna(inplace=True)
    recommendations = pd.DataFrame(recommendations, columns=['correlation']).reset_index()
    recommendations = recommendations.sort_values(by='correlation', ascending=False)
    
    return recommendations

To run the function we pass in the dataframe containing our matrix of baskets and items, and the name of the target product. The function will return a dataframe of products ranked by how strongly they are associated with that item.

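The first row is always the target item itself, since every product correlates perfectly with itself. Beyond that, for the “White Hanging Heart T-Light Holder”, the most correlated item is the “Gin + Tonic Diet Metal Sign”, so recommending this on the same page or in the same email might boost sales.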

recommendations = get_recommendations(df_items, 'WHITE HANGING HEART T-LIGHT HOLDER')
recommendations.head()

	Description	correlation
3918	WHITE HANGING HEART T-LIGHT HOLDER	1.000000
1478	GIN + TONIC DIET METAL SIGN	0.824987
1241	FAIRY CAKE FLANNEL ASSORTED COLOUR	0.820905
1072	DOORMAT FAIRY CAKE	0.483524
3627	TEA TIME PARTY BUNTING	0.469207

For “Party Bunting”, we see that “Spotty Bunting” also commonly appears in the same baskets.

recommendations = get_recommendations(df_items, 'PARTY BUNTING')
recommendations.head()

	Description	correlation
2471	PARTY BUNTING	1.000000
3524	SPOTTY BUNTING	0.254707
3301	SET/20 FRUIT SALAD PAPER NAPKINS	0.207832
726	CHARLOTTE BAG SUKI DESIGN	0.181390
120	75 GREEN FAIRY CAKE CASES	0.176897
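In both outputs the target product sits at the top of its own list, which you would not want to show to a customer. One possible variation, sketched below on a small made-up item matrix, drops the target and returns only the top n matches. The top_recommendations() name, the n parameter, and the toy quantities are my own assumptions rather than part of the function above.

```python
import pandas as pd


def top_recommendations(item_matrix, item, n=3):
    """Return the n products most correlated with item, excluding item itself."""
    correlations = item_matrix.corrwith(item_matrix[item]).dropna()
    # Remove the target product, which always correlates 1.0 with itself
    correlations = correlations.drop(labels=item, errors="ignore")
    return correlations.sort_values(ascending=False).head(n)


# Hypothetical basket quantities for four products across six invoices
toy_items = pd.DataFrame(
    {
        "BUNTING": [1, 0, 2, 0, 1, 0],
        "NAPKINS": [1, 0, 1, 0, 1, 0],
        "LANTERN": [0, 3, 0, 1, 0, 2],
        "MUG": [1, 1, 1, 0, 1, 0],
    },
    index=["inv1", "inv2", "inv3", "inv4", "inv5", "inv6"],
)

result = top_recommendations(toy_items, "BUNTING", n=2)
print(result)
```

Returning a Series keyed by product name keeps the helper short. If you need a dataframe for display, calling reset_index() on the result would give you one.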

The cold start problem

One drawback with collaborative filtering is that it suffers from the “cold start problem”. Recommender systems, whether they use content-based, item-based, or user-based filtering methods, all have one requirement in common: their underlying algorithms need a good amount of information to generate a relevant product recommendation.

When products are new, and few customers have purchased them, the amount of information available to recommender systems can be too low to calculate a meaningful correlation and the products may not appear within recommendations. This is known as the “cold start”. It takes time for products to warm up and generate enough data to allow recommender systems to produce relevant results.
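One simple way to stop very new products from producing unreliable correlations is to leave them out of the item matrix until they have appeared in enough baskets. The sketch below assumes an item matrix like the one built earlier. The filter_sparse_items() helper, the min_baskets threshold, and the toy data are hypothetical choices for illustration.

```python
import pandas as pd


def filter_sparse_items(item_matrix, min_baskets=2):
    """Keep only the product columns that appear in at least min_baskets baskets."""
    # Count the baskets in which each product has a positive quantity
    baskets_per_item = (item_matrix > 0).sum(axis=0)
    keep = baskets_per_item[baskets_per_item >= min_baskets].index
    return item_matrix.loc[:, keep]


# Hypothetical item matrix: NEW PRODUCT has only been bought once
toy_items = pd.DataFrame(
    {
        "ESTABLISHED PRODUCT": [1, 2, 0, 1],
        "NEW PRODUCT": [0, 0, 1, 0],
        "STEADY SELLER": [0, 1, 1, 0],
    },
    index=["inv1", "inv2", "inv3", "inv4"],
)

filtered = filter_sparse_items(toy_items, min_baskets=2)
print(list(filtered.columns))
```

The right threshold depends on your order volume, so treat the value of 2 here as a placeholder rather than a recommendation.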

Some product recommendation algorithms avoid the cold start problem by using different approaches, or a mixture of algorithms or filtering methods, so they are less reliant on customer information. For example, they might adopt a hybrid strategy that adds content-based recommendations, built on techniques such as the commonly used TF-IDF weighting, so they can recommend a product even if it has not yet been purchased.
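As a rough illustration of the content-based idea, the sketch below computes a basic TF-IDF weighting over three product descriptions from the dataset and compares them with cosine similarity, using only the Python standard library. The helper names and the simple whitespace tokenisation are my own assumptions, and a production system would more likely use a dedicated library.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a basic TF-IDF weighting."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


descriptions = ["PARTY BUNTING", "SPOTTY BUNTING", "WHITE METAL LANTERN"]
vectors = tfidf_vectors(descriptions)

sim_bunting = cosine_similarity(vectors[0], vectors[1])
sim_lantern = cosine_similarity(vectors[0], vectors[2])
print(sim_bunting, sim_lantern)
```

Because this relies only on the text of the descriptions, a brand new product can be matched to similar items before anyone has bought it.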

Other recommendation systems

This is just one of a number of ways in which you can use collaborative filtering to generate recommendations. In this example, we’ve used sales data to generate our correlations, but you can also calculate these correlations based on other metrics.

For example, on content-based sites such as blogs, you can serve content-based recommendations by calculating the similarity of the text using Natural Language Processing. On movie review sites, you can calculate correlations and generate predictions by using the scores customers give to movies they’ve reviewed. Page views and various engagement metrics can also work well.
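The same corrwith() approach carries over to review scores. The sketch below uses an invented ratings matrix with users as rows and films as columns, where NaN marks a film the user has not rated. The film names and scores are placeholders, not real data.

```python
import pandas as pd

nan = float("nan")

# Hypothetical ratings out of 5; NaN means the user has not reviewed that film
ratings = pd.DataFrame(
    {
        "Film A": [5, 4, nan, 1, 2],
        "Film B": [4, 5, 5, 2, 1],
        "Film C": [1, 2, 1, 5, nan],
    },
    index=["user1", "user2", "user3", "user4", "user5"],
)

# Correlate every film with Film A, using only users who rated both films
similar = (
    ratings.corrwith(ratings["Film A"])
    .drop("Film A")
    .sort_values(ascending=False)
)
print(similar)
```

Unlike the basket matrix, the gaps here are left as NaN rather than filled with zeros, since a missing rating is not the same as a score of zero.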

Matt Clarke, Sunday, May 02, 2021
