How to create content recommendations using TF-IDF

Learn how to use Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity to generate content recommendations for a classic car site.

Picture by Jason Leung, Unsplash.

After work, when I’m not learning about data science, practising data science, or writing about data science, I like to browse classic car auction sites looking for cars I can’t afford to buy, don’t have enough room to house, and whose purchase would lead to divorce and bankruptcy.

One of my favourites is The Market, as it includes well-written product copy that other car auction sites don't have. However, while its inventory is small, it currently lacks a recommendation engine that serves up other cars I might like to imagine I could afford to buy.

I totally get why The Market doesn’t have recommendations. The number of cars sold is very low, there are a limited number of concurrent auctions, and most people make a single purchase, so a regular “customers who bought this also bought” model would be useless.

Content-based recommendations

However, despite the lack of sales data normally required to generate product recommendations, there’s still a way that these could be added. We could generate recommendations based on content similarity instead.

For example, if you’re looking at a listing for a Ferrari 308 GTB, you might also be interested in checking out the 308 GTS. We can do this via two Natural Language Processing (NLP) techniques: Term-Frequency Inverse Document Frequency or TF-IDF, and cosine similarity.

Term Frequency Inverse Document Frequency (TF-IDF)

TF-IDF is a statistic which shows the importance of specific words in a document versus the other documents in a collection of documents, or "corpus". Basically, TF-IDF counts the number of times a given phrase occurs within a document and compares that frequency to its frequency across the other documents.

If a page contains the words “Ferrari 308” numerous times, and other documents in the corpus do not, then it’s probable that the document is about the “Ferrari 308”. Simply find all the documents where the scores for a phrase are high and you’ve got your matches.
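To make this concrete, here's a minimal sketch using scikit-learn's TfidfVectorizer on three invented one-line "documents". Only the first mentions "308", so its TF-IDF score for that term is high, while the others score zero:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny invented documents; only the first mentions the Ferrari 308
docs = [
    "ferrari 308 gtb ferrari 308",
    "land rover series ii",
    "mercedes benz sl",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Score for the term '308' in each document: high in doc 0, zero elsewhere
col = vectorizer.vocabulary_['308']
print([round(matrix[i, col], 3) for i in range(3)])  # -> [0.667, 0.0, 0.0]
```

The more often a rare term appears in one document and not the others, the higher its score climbs for that document.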

Cosine similarity

Cosine similarity measures the similarity between two vectors. Since TF-IDF returns vectors showing the score a document gets versus the corpus, we can use cosine similarity to identify the closest matches after we’ve used TF-IDF to generate the vectors.

I’ll skip the complicated maths, but basically we first generate the TF-IDF vectors containing the raw numbers, and then use cosine similarity to check these across all documents. We can then sort the output and identify the closest matches based on their text similarity.
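For the curious, the "complicated maths" boils down to the dot product of two vectors divided by the product of their magnitudes. A minimal NumPy sketch, using made-up term-score vectors:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up term-score vectors: a and b point the same way, c is unrelated
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
c = np.array([0.0, 0.0, 3.0])

print(cosine(a, b))  # same direction, so approximately 1.0
print(cosine(a, c))  # no terms in common, so 0.0
```

Because it measures the angle between vectors rather than their length, cosine similarity ignores how long each document is and focuses purely on which terms it emphasises.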

Picture by Sid Ramirez, Unsplash.

Import the packages

To get started, open up a Jupyter notebook and import pandas and numpy, plus the TfidfVectorizer class and the cosine_similarity and linear_kernel functions from scikit-learn.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

Load the data

Next, load up your dataset. I’m using some product descriptions I scraped from The Market, but you can use product page content, blog posts, or anything else you have which is similar.

df = pd.read_csv('themarket_pages.csv')
url title description h1 html image text
732 1969 MG MGC GT AUTOMATIC For Sale by Auction ['This MGC is originally a Channel Islands car... 1969 MG MGC GT AUTOMATIC <!doctype html>\n<html class="no-js" lang="en"... [' 1969 MG MGC GT AUTOMATIC\nBackground\nOnly pro...
684 2004 Mercedes-Benz SL65 AMG For Sale by Auction ['With just 25,500 miles on the odometer, this... 2004 Mercedes-Benz SL65 AMG <!doctype html>\n<html class="no-js" lang="en"... [' 2004 Mercedes-Benz SL65 AMG\nBackground\nFollo...
530 1959 LAND ROVER SERIES II LWB For Sale by Auction ['Spending the first third of its life oversea... 1959 LAND ROVER SERIES II LWB <!doctype html>\n<html class="no-js" lang="en"... [' 1959 LAND ROVER SERIES II LWB\nBackground\nFro...
736 2000 MG MGF VVC 1.8 For Sale by Auction ['This delightful and honest little 1.8-litre ... 2000 MG MGF VVC 1.8 <!doctype html>\n<html class="no-js" lang="en"... [' 2000 MG MGF VVC 1.8\nBackground\nThe MG F and ...
854 1989 Peugeot 205 GTi 1.9 For Sale by Auction ['First registered in August 1989, the vendor ... 1989 Peugeot 205 GTi 1.9 <!doctype html>\n<html class="no-js" lang="en"... [' 1989 Peugeot 205 GTi 1.9\nBackground\nLaunched...
44 2008 Alpina BMW D3 For Sale by Auction ['One of only 614 ever produced, this lovely A... 2008 Alpina BMW D3 <!doctype html>\n<html class="no-js" lang="en"... [' 2008 Alpina BMW D3\nBackground\nFollowing the ...
797 1963 MGB Roadster For Sale by Auction ['With just one previous keeper, a Dr Chapman ... 1963 MGB Roadster <!doctype html>\n<html class="no-js" lang="en"... [' 1963 MGB Roadster\nBackground\nIntroduced in 1...
149 2010 BENTLEY Flying Spur Speed For Sale by Auc... ['First registered on the 5th of November 2010... 2010 BENTLEY Flying Spur Speed <!doctype html>\n<html class="no-js" lang="en"... [' 2010 BENTLEY Flying Spur Speed\nBackground\nEs...
691 1990 Mercedes 190E 2.0 For Sale by Auction ['This is a five-owner-from new example finish... 1990 Mercedes 190E 2.0 <!doctype html>\n<html class="no-js" lang="en"... [' 1990 Mercedes 190E 2.0\nBackground\nThe W201 1...
1201 1995 MERCEDES-BENZ SL60 AMG For Sale by Auction ['1995 MERCEDES-BENZ SL60 AMG 43k Miles - Imma... 1995 MERCEDES-BENZ SL60 AMG <!doctype html>\n<html class="no-js" lang="en"... [' 1995 MERCEDES-BENZ SL60 AMG\nBackground\nMuch ...
A few values from the text column, which is what we'll feed to TF-IDF:

598     1972 MERCEDES-BENZ 250CE W114\nBackground\nThe...
1026    1965 SUNBEAM ALPINE Series V\nBackground\nFoll...
265     1988 DAIMLER Double Six\nBackground\nJaguar's ...
766     1961 MGA Roadster 1600 Mk 1\nBackground\nThe M...
521     1955 LAND ROVER Series 1 Soft Top. 86 Inch\nBa...
Name: text, dtype: object

Prepare the data

Next, we'll tidy up the data a little. We'll build a Series that maps each page title to its row index, dropping any duplicates, so we can use it to look up rows by title. We'll also fill any NaN values with empty strings to stop TF-IDF complaining.

indices = pd.Series(df.index, index=df['title']).drop_duplicates()
content = df['text'].fillna('')
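To see what that title-to-index mapping does, here's a toy version using a two-row stand-in dataframe (the titles are invented, not from the scraped data):

```python
import pandas as pd

# Invented stand-in for the scraped dataframe
df_demo = pd.DataFrame({
    'title': ['1963 MGB Roadster', '1977 Ferrari 308GTB'],
    'text': ['MGB product copy', 'Ferrari product copy'],
})

# Map each title to its row index so rows can be looked up by title later
indices = pd.Series(df_demo.index, index=df_demo['title']).drop_duplicates()
print(indices['1977 Ferrari 308GTB'])  # -> 1
```

Passing a title returns the row position, which is exactly what we'll need to pull the matching row out of the cosine similarity matrix later on.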

Create TF-IDF model

First, we’ll set up TfidfVectorizer and tell it to use English stop words. This will remove common words like “the” and “of” to leave the more important ones. TF-IDF will additionally down-weight common words that appear across documents.

tfidf = TfidfVectorizer(stop_words='english')

Next, we’ll create a TF-IDF matrix by passing the text column to the fit_transform() function. That will give us the numbers from which we can calculate similarities.

tfidf_matrix = tfidf.fit_transform(content)

Now we have our matrix of TF-IDF vectors, we can use linear_kernel() to calculate a cosine similarity matrix for the vectors. Since TfidfVectorizer L2-normalises its rows by default, the plain dot product that linear_kernel() computes is equivalent to cosine similarity here, just quicker to calculate.

cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
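If you want to convince yourself that linear_kernel() and cosine_similarity() agree on normalised TF-IDF vectors, this quick sketch with made-up documents shows the two matrices are identical:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up documents standing in for the scraped product copy
docs = ['ferrari 308 gtb', 'ferrari 308 gts', 'land rover series ii']
matrix = TfidfVectorizer().fit_transform(docs)

# Because each TF-IDF row is unit length, the dot product equals cosine similarity
assert np.allclose(linear_kernel(matrix, matrix),
                   cosine_similarity(matrix, matrix))
```

Either function would work in this tutorial; linear_kernel() simply skips the redundant re-normalisation step.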

Get recommendations based on text similarity

Now the model is built, and we have our TF-IDF matrix and a cosine similarity matrix covering all the documents, we can create a helper function to generate content recommendations. The code in this is a bit fiddly, so I’ve annotated it at each step.

Basically, it takes the dataframe of text, the name of the column being used to search from, the value to search for, the cosine similarity matrix, and the number of recommendations to return. It then looks up the title and returns the documents with the closest cosine similarity.

def get_recommendations(df, column, value, cosine_similarities, limit=10):
    """Return a dataframe of content recommendations based on TF-IDF cosine similarity.

    Args:
        df (object): Pandas dataframe containing the text data.
        column (string): Name of the column to search, i.e. 'title'.
        value (string): Title to get recommendations for, i.e. '1982 Ferrari 308 GTSi For Sale by Auction'.
        cosine_similarities (array): Cosine similarity matrix from linear_kernel().
        limit (int, optional): Number of recommendations to return.

    Returns:
        Pandas dataframe of recommendations and their cosine similarity scores.
    """

    # Map each value in the target column to its row index, dropping any duplicates
    indices = pd.Series(df.index, index=df[column]).drop_duplicates()

    # Get the index for the target value
    target_index = indices[value]

    # Pair each document index with its cosine similarity to the target
    cosine_similarity_scores = list(enumerate(cosine_similarities[target_index]))

    # Sort the pairs in order of closest similarity
    cosine_similarity_scores = sorted(cosine_similarity_scores, key=lambda x: x[1], reverse=True)

    # Keep the requested number of closest scores, excluding the target item itself
    cosine_similarity_scores = cosine_similarity_scores[1:limit + 1]

    # Split the (index, score) tuples into separate lists
    recommendation_indices = [i[0] for i in cosine_similarity_scores]
    scores = [i[1] for i in cosine_similarity_scores]

    # Look up the actual recommendations
    recommendations = df[column].iloc[recommendation_indices]

    # Return a dataframe of recommendations and their scores
    return pd.DataFrame(list(zip(recommendation_indices, recommendations, scores)),
                        columns=['index', 'recommendation', 'cosine_similarity_score'])

Generate the recommendations

Finally, we can put it into action and see how it works. First, we'll take the title of the "1982 Ferrari 308 GTSi For Sale by Auction" listing and see what we get back. It works perfectly: the closest matches are the 308 GTB, the 308 GTS, and another 308 GTB, followed by more Ferraris.

recommendations = get_recommendations(df, 
                                      'title', 
                                      '1982 Ferrari 308 GTSi For Sale by Auction', 
                                      cosine_similarities)
index recommendation cosine_similarity_score
0 284 1976 FERRARI 308GTB VETRORESINA For Sale by Au... 0.554754
1 282 1985 FERRARI 308 GTS QV For Sale by Auction 0.424918
2 285 1977 Ferrari 308GTB For Sale by Auction 0.384198
3 296 1999 FERRARI F355 F1 GTS For Sale by Auction 0.335060
4 295 1996 FERRARI F355 GTS - Manual For Sale by Auc... 0.309254
5 293 2006 FERRARI 612 SCAGLIETTI For Sale by Auction 0.302505
6 288 1992 FERRARI 348tb For Sale by Auction 0.302221
7 297 1998 FERRARI F355 Spider For Sale by Auction 0.300773
8 281 1973 Ferrari 246GT Dino For Sale by Auction 0.298583
9 294 1999 FERRARI F355 F1 Berlinetta For Sale by Au... 0.294583

The “1959 LAND ROVER SERIES II LWB For Sale by Auction” search was a bit tougher, but all the Series II Land Rovers do appear at the top, along with a Range Rover, which seems fair enough. The approach seems to work really well on this content.

recommendations = get_recommendations(df, 
                                      'title', 
                                      '1959 LAND ROVER SERIES II LWB For Sale by Auction', 
                                      cosine_similarities)
index recommendation cosine_similarity_score
0 527 1968 LAND ROVER SERIES II A Pick up For Sale b... 0.434031
1 521 1955 LAND ROVER Series 1 Soft Top. 86 Inch For... 0.425604
2 528 1958 Land Rover SERIES II SWB For Sale by Auction 0.415383
3 535 1967 LAND ROVER SERIES IIa 88inch For Sale by ... 0.408842
4 523 1968 LAND ROVER Series 2A For Sale by Auction 0.401876
5 529 1963 LAND ROVER SERIES II 88" For Sale by Auction 0.398268
6 525 1979 LAND ROVER Series 3 88 For Sale by Auction 0.392146
7 957 1999 RANGE ROVER P38 TReK Expedition For Sale ... 0.390698
8 499 1970 Land Rover 1/2 ton Lightweight V8 Series ... 0.389819
9 539 1969 Land Rover SWB For Sale by Auction 0.384898

Matt Clarke, Saturday, August 14, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
