How to create a simple product recommender system in Pandas

Learn how to create a product recommender or product recommendation system in Python using Pandas to implement the collaborative filtering algorithm.


Product recommender systems, or recommendation systems as they’re also known, are ubiquitous on e-commerce websites these days. They’re relatively simple to create, and even fairly basic ones can give striking results. In this project, I’ll show you how you can knock up a simple product recommender system in Python, using nothing but Pandas.

Product recommendation systems can be created using a wide variety of data science techniques. They can make use of user data to provide personalised recommendations, make content based recommendations using Natural Language Processing (NLP), utilise data on the user’s browsing history, adopt hybrid approaches, or simply make generalised recommendations based on anonymised user data via an approach known as collaborative filtering.

Collaborative filtering for product recommender systems

Collaborative filtering is probably the most widely used algorithm for creating product recommender systems in online retailing. It is comparatively simple to implement and can generate relevant recommendations. Presenting products that are popular with other customers can improve the user experience, which can help boost basket sizes, average order value, sales, and revenue.

The name “collaborative filtering” comes from the two main parts of the algorithm. The “filtering” part refers to the generation of predictions, while the “collaborative” part refers to the fact that the data are sourced by the “collaboration” of many users.

Basically, in our context, collaborative filtering looks at each product and then examines the data from all customers to identify the other products most correlated with the target. It does this using a simple correlation score, such as Pearson’s r, based on whether (and in what quantities) the products were purchased together in the same basket.
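To make the idea concrete, here’s a minimal sketch on a made-up basket matrix. The product names and quantities are invented purely for illustration; the only Pandas call involved is corrwith(), which calculates Pearson’s r between one column and every column in the dataframe.

```python
import pandas as pd

# Toy basket matrix (made-up data): one row per basket, one column per product,
# with each cell holding the quantity purchased.
toy_baskets = pd.DataFrame(
    {
        "TEAPOT": [1, 0, 2, 0, 1],
        "TEA CUPS": [2, 0, 4, 0, 2],
        "DOORMAT": [0, 1, 0, 1, 0],
    },
    index=["basket_1", "basket_2", "basket_3", "basket_4", "basket_5"],
)

# Pearson's r between the target product and every column (including itself).
toy_scores = toy_baskets.corrwith(toy_baskets["TEAPOT"]).sort_values(ascending=False)
print(toy_scores)
```

In this toy data the tea cups always appear alongside the teapot, so they correlate strongly and positively, while the doormat only appears in baskets without a teapot and gets a negative score.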

In this project, we’ll utilise Pandas to create a simple collaborative filtering model to generate product recommendations for an e-commerce store. This uses user purchase data on baskets from an e-commerce site and makes recommendations by identifying correlations between different items that commonly occur together within baskets.

It’s quick and easy to build and generates decent results, often as good as those from many paid tools, website plugins, or extensions. If you’re using the manually generated product recs shown in platforms such as Magento, this can be a quick way to generate recommendations automatically. It’s also good value for money since, unlike commercial recommender systems, it costs nothing to set up and you won’t have to pay any commission.

Import the packages

First, open a Jupyter notebook and import the Pandas package using import pandas as pd. If you don’t have Pandas, you can install it by entering pip3 install pandas in your terminal.

import pandas as pd

Load the data

Next, load up a product transaction items dataset. I’ve used the famous Online Retail dataset from the UCI Machine Learning Repository. Load this up into Pandas and display the first few rows using the head() function.

As the example data below show, we have a line for each product within each customer’s basket. The first five lines of the data show that customer 17850 has purchased various products in their basket on this e-commerce site. To help this business generate relevant recommendations, we’re going to look at each product purchased and identify which other items are most commonly purchased alongside it, effectively using the past behaviour of customers to predict the future.

df = pd.read_csv('online_retail.csv')
df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom

We only require a small subset of the fields - the order or invoice number, the stock code or description, and the quantity of units purchased. We can filter the dataframe down to these columns using the code below.

df_baskets = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity']]
df_baskets.head()
InvoiceNo StockCode Description Quantity
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
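Real transaction data often contains rows that aren’t genuine purchases. In the UCI Online Retail dataset, invoice numbers starting with “C” denote cancellations, and some rows carry negative quantities or no description. The sketch below shows one way you might filter these out before building the matrix. The clean_baskets() helper is hypothetical and isn’t used elsewhere in this article, so applying it would change the numbers shown in the outputs here.

```python
import pandas as pd


def clean_baskets(df):
    """Hypothetical helper: drop rows unlikely to represent genuine purchases."""
    # Remove rows with no product description.
    cleaned = df.dropna(subset=["Description"])
    # Remove returns and stock adjustments (zero or negative quantities).
    cleaned = cleaned[cleaned["Quantity"] > 0]
    # Remove cancelled invoices, whose numbers start with "C".
    is_cancelled = cleaned["InvoiceNo"].astype(str).str.startswith("C")
    return cleaned[~is_cancelled]


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "InvoiceNo": ["536365", "536365", "C536379", "536380"],
        "Description": ["WHITE METAL LANTERN", None, "WHITE METAL LANTERN", "PARTY BUNTING"],
        "Quantity": [6, 2, -1, 3],
    }
)
toy_clean = clean_baskets(toy)
print(toy_clean)
```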

To get a feel for which products are most popular, we can use groupby() and agg() to calculate some summary statistics for the products in this dataset. This shows us that the line appearing in the most orders is the WHITE HANGING HEART T-LIGHT HOLDER, although the JUMBO BAG RED RETROSPOT sells more units in total.

df.groupby('Description').agg(
    orders=('InvoiceNo', 'nunique'),
    quantity=('Quantity', 'sum')
).sort_values(by='orders', ascending=False).head(10)
orders quantity
Description
WHITE HANGING HEART T-LIGHT HOLDER 2302 35317
REGENCY CAKESTAND 3 TIER 2169 13033
JUMBO BAG RED RETROSPOT 2135 47363
PARTY BUNTING 1706 18022
LUNCH BAG RED RETROSPOT 1607 18779
ASSORTED COLOUR BIRD ORNAMENT 1467 36381
SET OF 3 CAKE TINS PANTRY DESIGN 1458 7336
PACK OF 72 RETROSPOT CAKE CASES 1334 36039
LUNCH BAG BLACK SKULL. 1295 12112
NATURAL SLATE HEART CHALKBOARD 1266 9120

Create an item matrix

The primary component of our recommendation engine is a matrix that records the quantity of each item present in each basket. We can create this matrix easily using the pivot_table() function. We’ll set the InvoiceNo as the index column and place each product name in a column, with the quantity stored in each cell. Note that pivot_table() aggregates with the mean by default, so if a product appears on more than one line of the same invoice the cell holds the average quantity per line; pass aggfunc='sum' if you want the total units instead. Any NaN values will be replaced by zeros.

df_items = df_baskets.pivot_table(index='InvoiceNo', columns=['Description'], values='Quantity').fillna(0)
df_items.head(3)
Description 4 PURPLE FLOCK DINNER CANDLES 50'S CHRISTMAS GIFT BAG LARGE DOLLY GIRL BEAKER I LOVE LONDON MINI BACKPACK I LOVE LONDON MINI RUCKSACK NINE DRAWER OFFICE TIDY OVAL WALL MIRROR DIAMANTE RED SPOT GIFT BAG LARGE SET 2 TEA TOWELS I LOVE LONDON SPACEBOY BABY GIFT SET ... wrongly coded 20713 wrongly coded 23343 wrongly coded-23343 wrongly marked wrongly marked 23343 wrongly marked carton 22804 wrongly marked. 23343 in box wrongly sold (22719) barcode wrongly sold as sets wrongly sold sets
InvoiceNo
536365 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
536366 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
536367 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 4223 columns
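To see how the choice of aggregation function affects the matrix, here’s a small sketch on made-up transaction lines in which one invoice lists the same product twice. Left to its default, pivot_table() takes the mean of those lines, whereas aggfunc='sum' adds them up.

```python
import pandas as pd

# Made-up transaction lines: invoice 1 lists MUG on two separate lines.
toy_lines = pd.DataFrame(
    {
        "InvoiceNo": [1, 1, 1, 2],
        "Description": ["MUG", "MUG", "PLATE", "MUG"],
        "Quantity": [2, 6, 1, 3],
    }
)

# Default aggregation: the mean of the duplicate lines.
mean_matrix = toy_lines.pivot_table(
    index="InvoiceNo", columns="Description", values="Quantity"
).fillna(0)

# Explicit sum: the total units per product per invoice.
sum_matrix = toy_lines.pivot_table(
    index="InvoiceNo", columns="Description", values="Quantity", aggfunc="sum"
).fillna(0)

print(mean_matrix)
print(sum_matrix)
```

For invoice 1, the MUG cell holds the average of 2 and 6 in the first matrix and their total in the second. Invoice 2 contains no PLATE, so that cell is filled with zero in both.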

Create product recommendations

Finally, we can create a little helper function for our recommendation system to make it quick and easy to identify which products are associated with others. First, we use the corrwith() function to calculate the Pearson correlation coefficient between the target product’s column and every column in the matrix. We then drop the NaN values and place the results in a dataframe sorted by descending correlation.

When we run the get_recommendations() function we pass in our item matrix dataframe, containing the quantity of each product in each basket, along with the column name of our target product. The function then calculates the Pearson correlation for the item and returns the most correlated products first, giving us product recs to display on the product detail page or present in an email to the user.

def get_recommendations(df, item):
    """Generate a set of product recommendations using item-based collaborative filtering.
    
    Args:
        df (dataframe): Pandas dataframe containing matrix of items purchased.
        item (string): Column name for target item. 
        
    Returns: 
        recommendations (dataframe): Pandas dataframe containing product recommendations. 
    """
    
    recommendations = df.corrwith(df[item])
    recommendations.dropna(inplace=True)
    recommendations = pd.DataFrame(recommendations, columns=['correlation']).reset_index()
    recommendations = recommendations.sort_values(by='correlation', ascending=False)
    
    return recommendations

To run the function we pass in the dataframe containing our matrix of baskets and items, and the name of the target product. The function returns a dataframe of products ranked by how strongly they are correlated with that item. The target product itself appears at the top with a correlation of 1.0, so the recommendations start from the second row.

For the “White Hanging Heart T-Light Holder”, the most correlated other item is the “Gin + Tonic Diet Metal Sign”, so recommending this on the same page or in the same email might boost sales.

recommendations = get_recommendations(df_items, 'WHITE HANGING HEART T-LIGHT HOLDER')
recommendations.head()
Description correlation
3918 WHITE HANGING HEART T-LIGHT HOLDER 1.000000
1478 GIN + TONIC DIET METAL SIGN 0.824987
1241 FAIRY CAKE FLANNEL ASSORTED COLOUR 0.820905
1072 DOORMAT FAIRY CAKE 0.483524
3627 TEA TIME PARTY BUNTING 0.469207

For “Party Bunting”, we see that “Spotty Bunting” also commonly appears in the same baskets.

recommendations = get_recommendations(df_items, 'PARTY BUNTING')
recommendations.head()
Description correlation
2471 PARTY BUNTING 1.000000
3524 SPOTTY BUNTING 0.254707
3301 SET/20 FRUIT SALAD PAPER NAPKINS 0.207832
726 CHARLOTTE BAG SUKI DESIGN 0.181390
120 75 GREEN FAIRY CAKE CASES 0.176897
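Since the results above always include the target product itself, you may prefer a variant that removes it and returns only the top few matches. The sketch below is one way to do that. The get_top_recommendations() name and its n parameter are my own additions for illustration, and the toy matrix is made up.

```python
import pandas as pd


def get_top_recommendations(df, item, n=5):
    """Hypothetical variant: exclude the target item and return the n best matches."""
    scores = df.corrwith(df[item]).dropna()
    # Drop the target's correlation with itself, which is always 1.0.
    scores = scores.drop(labels=item, errors="ignore")
    return (
        scores.sort_values(ascending=False)
        .head(n)
        .rename("correlation")
        .rename_axis("Description")
        .reset_index()
    )


# Made-up basket matrix for illustration only.
toy_items = pd.DataFrame(
    {
        "TEAPOT": [1, 0, 2, 0, 1],
        "TEA CUPS": [2, 0, 3, 0, 2],
        "DOORMAT": [0, 1, 0, 1, 0],
    }
)
top = get_top_recommendations(toy_items, "TEAPOT", n=2)
print(top)
```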

The cold start problem

One drawback with collaborative filtering is that it suffers from the “cold start problem”. Recommender systems - whether they’re using content based, item based, or user based filtering methods - all have one requirement in common: their underlying algorithms require a good amount of information for them to generate a relevant product recommendation.

When products are new, and few customers have purchased them, the amount of information available to recommender systems can be too low to calculate a meaningful correlation and the products may not appear within recommendations. This is known as the “cold start”. It takes time for products to warm up and generate enough data to allow recommender systems to produce relevant results.
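One simple way to spot products that are still “cold” is to count how many baskets each one appears in and set a minimum threshold. The sketch below assumes an item matrix shaped like the one built earlier. The split_by_support() helper and its min_baskets threshold are hypothetical, and the right cut-off will depend on your own data.

```python
import pandas as pd


def split_by_support(df_items, min_baskets=3):
    """Hypothetical helper: split product columns into warm and cold lists
    according to the number of baskets each product appears in."""
    # Count the baskets in which each product has a positive quantity.
    basket_counts = (df_items > 0).sum(axis=0)
    warm = basket_counts[basket_counts >= min_baskets].index.tolist()
    cold = basket_counts[basket_counts < min_baskets].index.tolist()
    return warm, cold


# Made-up basket matrix for illustration only.
toy_items = pd.DataFrame(
    {
        "PARTY BUNTING": [1, 2, 0, 1, 3],
        "SPOTTY BUNTING": [1, 0, 2, 1, 0],
        "BRAND NEW PRODUCT": [0, 0, 0, 0, 1],
    }
)
warm, cold = split_by_support(toy_items, min_baskets=3)
print("Warm:", warm)
print("Cold:", cold)
```

Products in the cold list could then be skipped or handled by a different method until they have accumulated enough sales.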

Some product recommendation algorithms avoid the cold start problem by using different approaches, or a mixture of algorithms or filtering methods, so they are less reliant on customer information. For example, they might use a hybrid strategy of utilising content based recommendations, such as the commonly used TF-IDF algorithm, to allow them to make a product recommendation even if that item has not yet been purchased.
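As a rough illustration of the content based idea, the sketch below scores products by the similarity of their descriptions using a deliberately simplified TF-IDF (no smoothing or stop words) and cosine similarity, written with Pandas and the standard library only. The tfidf_vectors() and cosine_similarity_matrix() helpers are hypothetical, and a production system would more likely use a dedicated library.

```python
import math
from collections import Counter

import pandas as pd


def tfidf_vectors(texts):
    """Simplified TF-IDF: term frequency multiplied by log(N / document frequency)."""
    tokenised = texts.str.lower().str.split()
    n_docs = len(tokenised)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    rows = {}
    for key, tokens in tokenised.items():
        counts = Counter(tokens)
        rows[key] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return pd.DataFrame.from_dict(rows, orient="index").fillna(0.0)


def cosine_similarity_matrix(vectors):
    """Cosine similarity between every pair of rows."""
    norms = (vectors ** 2).sum(axis=1) ** 0.5
    normalised = vectors.div(norms, axis=0)
    return normalised.dot(normalised.T)


# Descriptions taken from the dataset; no purchase data is needed.
names = [
    "SPOTTY BUNTING",
    "PARTY BUNTING",
    "WHITE METAL LANTERN",
    "WHITE HANGING HEART T-LIGHT HOLDER",
]
descriptions = pd.Series(names, index=names)

similarity = cosine_similarity_matrix(tfidf_vectors(descriptions))
print(similarity["SPOTTY BUNTING"].sort_values(ascending=False))
```

Because this relies only on the text, a product with no sales at all can still be matched to similar items, which here pairs the two bunting products via their shared word.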

Other recommendation systems

This is just one of a number of ways in which you can use collaborative filtering to generate recommendations. In this example, we’ve used sales data to generate our correlations, but you can also calculate them from other metrics.

For example, on content based sites such as blogs, you can serve content based recommendations by calculating the similarity of the text using Natural Language Processing. On movie review sites, you can calculate correlations and generate predictions by using the scores customers give to movies they’ve reviewed. Page views and various engagement metrics can also work well.
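The same correlation approach can be sketched for review scores. The example below uses made-up ratings and film names. Unlike the basket matrix, missing ratings are left as NaN rather than filled with zeros, and the min_periods argument of corr() is used so that only pairs of films rated by at least three of the same users get a score. That threshold is an arbitrary choice for this toy data.

```python
import pandas as pd

# Made-up ratings: one row per user and film.
ratings = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u2", "u2", "u2", "u3", "u3", "u3", "u4", "u4"],
        "film": ["Film A", "Film B", "Film C"] * 3 + ["Film A", "Film B"],
        "rating": [5, 5, 1, 4, 5, 2, 1, 2, 5, 2, 1],
    }
)

# Users in rows, films in columns; unrated films stay as NaN.
ratings_matrix = ratings.pivot_table(index="user", columns="film", values="rating")

# Pearson correlation between films, using only users who rated both.
film_corr = ratings_matrix.corr(method="pearson", min_periods=3)

similar_to_a = film_corr["Film A"].drop("Film A").dropna().sort_values(ascending=False)
print(similar_to_a)
```

In this toy data, users who liked Film A also tended to like Film B, while Film C was rated highly by the users who disliked Film A.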

Matt Clarke, Sunday, May 02, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
