After work, when I’m not learning about data science, practising data science, or writing about data science, I like to browse classic car auction sites looking for cars I can’t afford to buy, don’t have enough room to house, and whose purchase would lead to divorce and bankruptcy.
One of my favourite such sites is The Market, as it includes well-written product copy that other car auction sites don’t have. However, its inventory is small, and it currently lacks a recommendation engine that serves up other cars I might like to imagine I could afford to buy.
I totally get why The Market doesn’t have recommendations. The number of cars sold is very low, there is a limited number of concurrent auctions, and most people make a single purchase, so a regular “customers who bought this also bought” model would be useless.
However, despite the lack of sales data normally required to generate product recommendations, there’s still a way that these could be added. We could generate recommendations based on content similarity instead.
For example, if you’re looking at a listing for a Ferrari 308 GTB, you might also be interested in checking out the 308 GTS. We can do this via two Natural Language Processing (NLP) techniques: Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity.
TF-IDF is a statistic which shows the importance of specific words in a document versus the other documents in a collection of documents, or “corpus”. Basically, TF-IDF counts up the number of times a given term occurs within a document and weighs that against how many other documents in the corpus also contain it.
If a page contains the words “Ferrari 308” numerous times, and other documents in the corpus do not, then it’s probable that the document is about the “Ferrari 308”. Simply find all the documents where the scores for a phrase are high and you’ve got your matches.
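To make the idea concrete, here is a minimal sketch of the textbook TF-IDF calculation in plain Python. The three-document corpus is invented for illustration, and the `tf_idf` helper is hypothetical. scikit-learn’s `TfidfVectorizer` uses a smoothed IDF and normalises each vector by default, so its numbers will differ, but the ranking logic is the same.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from The Market)
docs = [
    "ferrari 308 gtb red ferrari",
    "ferrari f355 spider red",
    "land rover series ii green",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: how many documents contain each term at least once
doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# Score every distinct term in the first document
scores = {term: round(tf_idf(term, tokenised[0]), 3) for term in set(tokenised[0])}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus, “308” and “gtb” score highest for the first document because no other document contains them, while “ferrari” and “red” are pulled down because they also appear elsewhere.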
Cosine similarity measures the similarity between two vectors. Since TF-IDF returns vectors showing the score a document gets versus the corpus, we can use cosine similarity to identify the closest matches after we’ve used TF-IDF to generate the vectors.
I’ll skip the complicated maths, but basically we first generate the TF-IDF vectors containing the raw numbers, and then use cosine similarity to check these across all documents. We can then sort the output and identify the closest matches based on their text similarity.
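For anyone who does want a peek at the maths, cosine similarity is just the dot product of two vectors divided by the product of their lengths. The sketch below assumes NumPy is available; the `cosine_sim` helper and the three four-term vectors are made up for illustration, not real TF-IDF output.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical TF-IDF vectors over a shared four-term vocabulary
ferrari_308_gtb = [0.6, 0.5, 0.0, 0.1]
ferrari_308_gts = [0.5, 0.6, 0.0, 0.2]
land_rover = [0.0, 0.0, 0.9, 0.1]

sim_ferraris = cosine_sim(ferrari_308_gtb, ferrari_308_gts)
sim_mixed = cosine_sim(ferrari_308_gtb, land_rover)
print(sim_ferraris, sim_mixed)
```

The two Ferrari vectors point in nearly the same direction, so their score is close to 1, while the Land Rover vector shares almost no weight with them and scores close to 0.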
Picture by Sid Ramirez, Unsplash.
To get started, open up a Jupyter notebook and import pandas and NumPy, along with the TfidfVectorizer, cosine_similarity, and linear_kernel modules from scikit-learn.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
Next, load up your dataset. I’m using some product descriptions I scraped from The Market, but you can use product page content, blog posts, or anything else you have which is similar.
df = pd.read_csv('themarket_pages.csv')
df.sample(10)
|732||https://themarket.co.uk/listings/mg/mgc/63e0fd...||1969 MG MGC GT AUTOMATIC For Sale by Auction||['This MGC is originally a Channel Islands car...||1969 MG MGC GT AUTOMATIC||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||1969 MG MGC GT AUTOMATIC\nBackground\nOnly pro...|
|684||https://themarket.co.uk/listings/mercedes-benz...||2004 Mercedes-Benz SL65 AMG For Sale by Auction||['With just 25,500 miles on the odometer, this...||2004 Mercedes-Benz SL65 AMG||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||2004 Mercedes-Benz SL65 AMG\nBackground\nFollo...|
|530||https://themarket.co.uk/listings/land-rover/se...||1959 LAND ROVER SERIES II LWB For Sale by Auction||['Spending the first third of its life oversea...||1959 LAND ROVER SERIES II LWB||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||1959 LAND ROVER SERIES II LWB\nBackground\nFro...|
|736||https://themarket.co.uk/listings/mg/mgf-vvc-18...||2000 MG MGF VVC 1.8 For Sale by Auction||['This delightful and honest little 1.8-litre ...||2000 MG MGF VVC 1.8||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||2000 MG MGF VVC 1.8\nBackground\nThe MG F and ...|
|854||https://themarket.co.uk/listings/peugeot/205-g...||1989 Peugeot 205 GTi 1.9 For Sale by Auction||['First registered in August 1989, the vendor ...||1989 Peugeot 205 GTi 1.9||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||1989 Peugeot 205 GTi 1.9\nBackground\nLaunched...|
|44||https://themarket.co.uk/listings/alpina-bmw/d3...||2008 Alpina BMW D3 For Sale by Auction||['One of only 614 ever produced, this lovely A...||2008 Alpina BMW D3||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||2008 Alpina BMW D3\nBackground\nFollowing the ...|
|797||https://themarket.co.uk/listings/mgb/roadster/...||1963 MGB Roadster For Sale by Auction||['With just one previous keeper, a Dr Chapman ...||1963 MGB Roadster||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||1963 MGB Roadster\nBackground\nIntroduced in 1...|
|149||https://themarket.co.uk/listings/bentley/flyin...||2010 BENTLEY Flying Spur Speed For Sale by Auc...||['First registered on the 5th of November 2010...||2010 BENTLEY Flying Spur Speed||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||2010 BENTLEY Flying Spur Speed\nBackground\nEs...|
|691||https://themarket.co.uk/listings/mercedes/190e...||1990 Mercedes 190E 2.0 For Sale by Auction||['This is a five-owner-from new example finish...||1990 Mercedes 190E 2.0||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||1990 Mercedes 190E 2.0\nBackground\nThe W201 1...|
|1201||https://themarket.co.uk/listings/mercedes-benz...||1995 MERCEDES-BENZ SL60 AMG For Sale by Auction||['1995 MERCEDES-BENZ SL60 AMG 43k Miles - Imma...||1995 MERCEDES-BENZ SL60 AMG||<!doctype html>\n<html class="no-js" lang="en"...||['https://patina-media.s3.amazonaws.com/previe...||1995 MERCEDES-BENZ SL60 AMG\nBackground\nMuch ...|
598 1972 MERCEDES-BENZ 250CE W114\nBackground\nThe...
1026 1965 SUNBEAM ALPINE Series V\nBackground\nFoll...
265 1988 DAIMLER Double Six\nBackground\nJaguar's ...
766 1961 MGA Roadster 1600 Mk 1\nBackground\nThe M...
521 1955 LAND ROVER Series 1 Soft Top. 86 Inch\nBa...
Name: text, dtype: object
Next, we’ll tidy up the data a little. There are some duplicate page titles in here, so we’ll build a title-to-index lookup with the duplicates dropped, so we can use it for looking up values. We’ll also fill in some NaN values with blanks, since TfidfVectorizer can’t process missing values.
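# Map each title to its row index
indices = pd.Series(df.index, index=df['title'])
# Keep only the first row for any duplicated title
# (drop_duplicates() on this Series compares the row numbers, which are already unique)
indices = indices[~indices.index.duplicated(keep='first')]
# Replace missing text with empty strings
content = df['text'].fillna('')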
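If you want to check how a title lookup behaves when titles repeat, the sketch below builds a tiny stand-in dataframe (the `toy` data is hypothetical) with one duplicated title and one missing description. It filters on the lookup’s index, which is one way to make sure each title maps to a single row number.

```python
import pandas as pd

# Hypothetical miniature of the scraped dataframe:
# one duplicated title and one missing text value
toy = pd.DataFrame({
    "title": ["Car A", "Car B", "Car A"],
    "text": ["first listing", None, "relisted car"],
})

# Title -> row index, keeping only the first row for each repeated title
lookup = pd.Series(toy.index, index=toy["title"])
lookup = lookup[~lookup.index.duplicated(keep="first")]

# Missing text becomes an empty string
clean_text = toy["text"].fillna("")

print(lookup["Car A"])        # a single row number rather than a Series
print(clean_text.tolist())
```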
First, we’ll set up TfidfVectorizer and tell it to use English stop words. This will remove common words like “the” and “of” to leave the more important ones. TF-IDF will additionally down-weight common words that appear across documents.
tfidf = TfidfVectorizer(stop_words='english')
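To see what the stop word setting actually changes, you can compare the vocabulary learned with and without it. This is a small sketch on a single made-up sentence; the variable names are my own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One made-up sentence, just to compare the two vocabularies
sample = ["The history of the Ferrari 308 and the engine of the car"]

keep_all = TfidfVectorizer().fit(sample)
filtered = TfidfVectorizer(stop_words="english").fit(sample)

# vocabulary_ maps each learned term to its column position
print(sorted(keep_all.vocabulary_))
print(sorted(filtered.vocabulary_))
```

The second list should be shorter, with filler words such as “the”, “of”, and “and” gone while “ferrari” and “308” remain.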
Next, we’ll create a TF-IDF matrix by passing the text column to the fit_transform() method. That will give us the numbers from which we can calculate similarities.
tfidf_matrix = tfidf.fit_transform(content)
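It can help to look inside the matrix before moving on. The sketch below uses three invented descriptions rather than the scraped data, prints the matrix shape (one row per document, one column per term), and lists the highest-weighted terms for the first document by mapping column positions back through the fitted `vocabulary_`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented descriptions standing in for the scraped listings
toy_docs = [
    "ferrari 308 gtb with a v8 engine",
    "ferrari f355 spider with a v8 engine",
    "land rover series ii with a diesel engine",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_docs)  # sparse matrix, one row per document
print(toy_matrix.shape)

# Map column positions back to terms via the fitted vocabulary
terms = {col: term for term, col in toy_tfidf.vocabulary_.items()}

# Three highest-weighted terms for the first document
row = toy_matrix[0].toarray().ravel()
top_columns = np.argsort(row)[::-1][:3]
top_terms = [(terms[c], round(float(row[c]), 3)) for c in top_columns]
print(top_terms)
```

Terms unique to the first document, such as “308” and “gtb”, should come out on top, while “engine”, which appears in every document, gets the lowest weight.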
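Now we have our matrix of TF-IDF vectors, we can use linear_kernel() to calculate a cosine similarity matrix for the vectors. TfidfVectorizer L2-normalises each vector by default, so the plain dot product that linear_kernel() computes is the same as the cosine similarity, without the extra normalisation step. There are several ways to do this, but the below approach worked for me.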
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
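If you would like to convince yourself that `linear_kernel()` really does give cosine similarities here, a quick check on a toy corpus (invented for illustration) is to compare it with `cosine_similarity()`. This relies on `TfidfVectorizer` keeping its default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented descriptions standing in for the scraped listings
toy_docs = [
    "ferrari 308 gtb with a v8 engine",
    "ferrari f355 spider with a v8 engine",
    "land rover series ii with a diesel engine",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

# Both checks should print True when the rows are unit length
print(np.allclose(dot_products, cosines))
print(np.allclose(np.diag(dot_products), 1.0))
```

The diagonal holds each document’s similarity with itself, which is why it comes out as 1. If you pass `norm=None` to the vectoriser, the two functions no longer agree and you would need `cosine_similarity()` instead.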
Now the model is built, and we have our TF-IDF matrix and a cosine similarity matrix covering all the documents, we can create a helper function to generate content recommendations. The code in this is a bit fiddly, so I’ve annotated it at each step.
Basically, it takes the dataframe of text, the name of the column being used to search from, the value to search for, the cosine similarity matrix, and the number of recommendations to return. It then looks up the title and returns the documents with the closest cosine similarity.
def get_recommendations(df, column, value, cosine_similarities, limit=10):
    """Return a dataframe of content recommendations based on TF-IDF cosine similarity.

    Args:
        df (object): Pandas dataframe containing the text data.
        column (string): Name of column used, e.g. 'title'.
        value (string): Name of title to get recommendations for, e.g. 1982 Ferrari 308 GTSi For Sale by Auction
        cosine_similarities (array): Cosine similarities matrix from linear_kernel
        limit (int, optional): Optional limit on number of recommendations to return.

    Returns:
        Pandas dataframe.
    """

    # Map each value in the target column to its row index, keeping the first of any duplicates
    indices = pd.Series(df.index, index=df[column])
    indices = indices[~indices.index.duplicated(keep='first')]

    # Get the index for the target value
    target_index = indices[value]

    # Get the cosine similarity scores for the target value as (index, score) tuples
    cosine_similarity_scores = list(enumerate(cosine_similarities[target_index]))

    # Sort the tuples by score, in order of closest similarity
    cosine_similarity_scores = sorted(cosine_similarity_scores, key=lambda x: x[1], reverse=True)

    # Keep the requested number of closest scores, skipping the first entry (the target item itself)
    cosine_similarity_scores = cosine_similarity_scores[1:limit+1]

    # Extract the tuple values
    index = (x[0] for x in cosine_similarity_scores)
    scores = (x[1] for x in cosine_similarity_scores)

    # Get the indices for the closest items
    recommendation_indices = [i[0] for i in cosine_similarity_scores]

    # Get the actual recommendations
    recommendations = df[column].iloc[recommendation_indices]

    # Return a dataframe
    df = pd.DataFrame(list(zip(index, recommendations, scores)),
                      columns=['index', 'recommendation', 'cosine_similarity_score'])

    return df
Finally, we can put it in action and see how it works. First, we’ll take the title of the “1982 Ferrari 308 GTSi For Sale by Auction” auction and see what we get back. It works perfectly. The closest matches are the 308 GTB, the 308 GTS, and another 308 GTB, followed by more Ferraris.
recommendations = get_recommendations(df, 'title', '1982 Ferrari 308 GTSi For Sale by Auction', cosine_similarities)
| ||index||recommendation||cosine_similarity_score|
|0||284||1976 FERRARI 308GTB VETRORESINA For Sale by Au...||0.554754|
|1||282||1985 FERRARI 308 GTS QV For Sale by Auction||0.424918|
|2||285||1977 Ferrari 308GTB For Sale by Auction||0.384198|
|3||296||1999 FERRARI F355 F1 GTS For Sale by Auction||0.335060|
|4||295||1996 FERRARI F355 GTS - Manual For Sale by Auc...||0.309254|
|5||293||2006 FERRARI 612 SCAGLIETTI For Sale by Auction||0.302505|
|6||288||1992 FERRARI 348tb For Sale by Auction||0.302221|
|7||297||1998 FERRARI F355 Spider For Sale by Auction||0.300773|
|8||281||1973 Ferrari 246GT Dino For Sale by Auction||0.298583|
|9||294||1999 FERRARI F355 F1 Berlinetta For Sale by Au...||0.294583|
The “1959 LAND ROVER SERIES II LWB For Sale by Auction” search was a bit tougher, but Series II Land Rovers dominate the top of the list, joined by a few other Series models and a Range Rover, which seems fair enough. The approach seems to work really well on this content.
recommendations = get_recommendations(df, 'title', '1959 LAND ROVER SERIES II LWB For Sale by Auction', cosine_similarities)
| ||index||recommendation||cosine_similarity_score|
|0||527||1968 LAND ROVER SERIES II A Pick up For Sale b...||0.434031|
|1||521||1955 LAND ROVER Series 1 Soft Top. 86 Inch For...||0.425604|
|2||528||1958 Land Rover SERIES II SWB For Sale by Auction||0.415383|
|3||535||1967 LAND ROVER SERIES IIa 88inch For Sale by ...||0.408842|
|4||523||1968 LAND ROVER Series 2A For Sale by Auction||0.401876|
|5||529||1963 LAND ROVER SERIES II 88" For Sale by Auction||0.398268|
|6||525||1979 LAND ROVER Series 3 88 For Sale by Auction||0.392146|
|7||957||1999 RANGE ROVER P38 TReK Expedition For Sale ...||0.390698|
|8||499||1970 Land Rover 1/2 ton Lightweight V8 Series ...||0.389819|
|9||539||1969 Land Rover SWB For Sale by Auction||0.384898|
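If you want to experiment without the scraped data, the whole pipeline fits in a few lines. The sketch below is a hypothetical, more compact variant of the helper (the `top_matches` name and the toy listings are my own), which ranks with NumPy’s `argsort` and excludes the target by its position rather than assuming it sorts first. It assumes the similarity matrix rows are in the same order as the dataframe rows.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_matches(df, column, value, similarities, limit=10):
    """Hypothetical helper: rank rows by similarity to the first row where column == value."""
    # Position of the first matching row (raises IndexError if the value is absent)
    target = int(np.flatnonzero((df[column] == value).to_numpy())[0])
    order = np.argsort(similarities[target])[::-1]   # most similar first
    order = order[order != target][:limit]           # drop the target by position
    return pd.DataFrame({
        "index": df.index[order],
        "recommendation": df[column].to_numpy()[order],
        "cosine_similarity_score": similarities[target][order],
    })

# Invented listings standing in for the scraped data
toy = pd.DataFrame({
    "title": ["Ferrari 308 GTB", "Ferrari 308 GTS", "Land Rover Series II"],
    "text": [
        "ferrari 308 gtb coupe with a v8 engine",
        "ferrari 308 gts targa with a v8 engine",
        "land rover series ii with a diesel engine",
    ],
})
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["text"])
toy_similarities = linear_kernel(toy_matrix, toy_matrix)

result = top_matches(toy, "title", "Ferrari 308 GTB", toy_similarities, limit=2)
print(result)
```

On this toy data the 308 GTS should rank above the Land Rover, since it shares far more vocabulary with the 308 GTB.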
Matt Clarke, Saturday, August 14, 2021