How to create a collaborative filtering recommender system

Recommender systems, or recommendation engines as they’re also known, are everywhere these days. Whether you’re looking for books on Amazon, tracks on Spotify, movies on Netflix or a date on Tinder or Grindr, the suggestions you’re served will come from a system of this kind.

As with everything in data science, there are many different ways you can generate recommendations, but probably the most widely used method is collaborative filtering. It can be done in any language, including SQL, which makes it straightforward to implement.

This fairly simple system was popularised by Amazon and has become widespread on content and ecommerce sites across the internet. Collaborative filtering actually comes in two different forms - item-based collaborative filtering and user-based collaborative filtering. They work in the same way, but their source data differs, allowing them to serve a slightly different style of date, movie, or product recommendation.

Some commonly seen examples of this are:

Customers who bought this also bought…
Customers who liked this also liked…
Customers who read this also read…
Customers who engaged with this also engaged with…
Customers who watched this also watched…
Customers who swiped right on him/her also swiped right on…

User-based collaborative filtering

In user-based collaborative filtering you build up a matrix of every user and record all of their interactions, like the tracks they’ve listened to, the movies they’ve watched, the scores they’ve given or the articles they’ve read. This gives you one row per user with an item (like a track or movie) per column with the metric inside.

Using this matrix, we can compute the similarity between users based on what they liked, watched or rated, and identify users who have similar tastes. If a user has similar tastes to another and has read, watched or rated some of the same items, you can recommend them other things they might like.
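To make the idea concrete, here is a minimal sketch of user-based similarity on a tiny, made-up ratings matrix. The user names, film names and ratings are all hypothetical, and Pearson correlation is just one of several similarity measures you could use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per user, one column per item, NaN = not rated
ratings = pd.DataFrame(
    {
        "Film A": [5.0, 4.0, 1.0],
        "Film B": [4.0, 5.0, 2.0],
        "Film C": [1.0, 1.0, 5.0],
        "Film D": [4.0, np.nan, 2.0],
    },
    index=["alice", "bob", "carol"],
)

# Transpose so that users become columns, then correlate users pairwise
# over the items both of them have rated
user_similarity = ratings.T.corr(method="pearson", min_periods=2)

# Find the user whose tastes are closest to bob's
target = "bob"
neighbour = user_similarity[target].drop(target).idxmax()

# Suggest items the neighbour has rated that bob has not rated yet
unseen = ratings.loc[target].isna()
suggestions = ratings.loc[neighbour][unseen].dropna().sort_values(ascending=False)
print(neighbour, suggestions.to_dict())
```

In this toy data alice is bob’s nearest neighbour, so bob is offered the one film alice has rated that he hasn’t. A real system would usually combine several neighbours rather than rely on one.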

User-based collaborative filtering works, but it does have some issues. Firstly, people’s tastes change over time, so someone who liked Justin Bieber when they were 13 may now have moved on in life. Secondly, as you need to create one row per user, you end up with a truly enormous matrix on larger sites making this approach harder to scale. For example, if you’ve got 200K users on your platform and only 50K active, you’ll still need a row for each one.

The other issue with user-based collaborative filtering is that it can be rigged by users. In a shilling attack, someone creates a bot to generate fake user accounts and uses them to give their own item glowing ratings, gaming the recommendations.

Item-based collaborative filtering

Item-based collaborative filtering is much the same as user-based collaborative filtering, but instead of looking at relationships between users it looks at relationships between the items, so it’s less sensitive to shilling attacks (particularly as it’s often based on items purchased, making it expensive to rig). Amazon has used this technique for years and it works well. As you only need one row per item, instead of per user, it’s also much more scalable, and it’s less sensitive to changes in people’s tastes over time.
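As a rough illustration of the item-based variant, the sketch below correlates the item columns of a small, made-up ratings matrix instead of the users. The names and ratings are hypothetical; the point is that the resulting similarity matrix is items by items, so its size depends on the catalogue rather than on the number of users.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are items, NaN = not rated
ratings = pd.DataFrame(
    {
        "Film A": [5.0, 4.0, 1.0, 2.0],
        "Film B": [4.0, 5.0, 2.0, 1.0],
        "Film C": [1.0, np.nan, 5.0, 4.0],
    },
    index=["alice", "bob", "carol", "dave"],
)

# Correlate the item columns pairwise, using only users who rated both items
item_similarity = ratings.corr(method="pearson", min_periods=2)

# Rank the other items by their similarity to "Film A"
similar_to_a = item_similarity["Film A"].drop("Film A").sort_values(ascending=False)
print(item_similarity.shape)
print(similar_to_a)
```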

You can perform item-based collaborative filtering in various ways, depending on your source data and the recommendation you wish to make. We’ll be using item-based collaborative filtering to create a movie recommender system so we’ll be making a matrix of every film, identifying which movies are similar to other movies and then making recommendations to users.

1. Load the data

We are going to make our movie recommendations using the item-based collaborative filtering technique on the excellent MovieLens dataset, which is available in various sizes. The “ml-latest-small” dataset is a subset of the full data based on 100,000 ratings by 600 users across 9000 different movies, and comes in at around 1MB, so it’s ideal for practicing.

Download a copy of the zip file and place the files into a folder called data, then load up Pandas and read the CSV files into dataframes. We don’t actually need all the files, so just load up the movies.csv and ratings.csv files.

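import warnings

import numpy as np
import pandas as pd

# Silence library warnings to keep the notebook output tidy
warnings.filterwarnings('ignore')

# Use the fully qualified option names, as the short forms can be ambiguous in recent Pandas versions
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 1000)

# Load the two MovieLens files we need
df_movies = pd.read_csv('data/movies.csv')
df_ratings = pd.read_csv('data/ratings.csv')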

Next, we’ll use sample() to view a few random rows from each dataframe and check out the content within. As you can see from below, df_movies contains the movieId, title and a pipe-separated list of genres. The df_ratings dataframe contains the userId, the movieId, their rating and a timestamp.

df_movies.sample(3)

	movieId	title	genres
5687	27741	Twilight Samurai, The (Tasogare Seibei) (2002)	Drama|Romance
4684	6994	Hard Way, The (1991)	Action|Comedy
6707	58425	Heima (2007)	Documentary

df_ratings.sample(3)

	userId	movieId	rating	timestamp
16085	104	4308	5.0	1048586839
27702	187	6870	3.0	1161849787
8481	57	3552	4.0	965798049
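The timestamp column is stored as seconds since the Unix epoch, which isn’t very readable. We don’t need it for the recommender, but if you want to inspect when ratings were made, a small sketch like this converts it. The dataframe here is just the three sample rows above typed in by hand, and rated_at is a column name chosen for illustration.

```python
import pandas as pd

# The three sample rows shown above, entered by hand for illustration
sample = pd.DataFrame(
    {
        "userId": [104, 187, 57],
        "movieId": [4308, 6870, 3552],
        "rating": [5.0, 3.0, 4.0],
        "timestamp": [1048586839, 1161849787, 965798049],
    }
)

# Interpret the integers as seconds since 1970-01-01 (UTC)
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")
print(sample[["userId", "movieId", "rating", "rated_at"]])
```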

2. Re-shape the data and create a matrix

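As the first step we need to merge the movies dataframe with the ratings dataframe. This gives us a much larger dataframe containing one row per rating, along with the userId and the movie details. Because this is a left join, any movie without ratings is kept with NaN values in the ratings columns, which is why userId is shown as a float.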

df_movies_ratings = df_movies.merge(df_ratings, on='movieId', how='left')
df_movies_ratings.head(3)

	movieId	title	genres	userId	rating	timestamp
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	1.0	4.0	9.649827e+08
1	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	5.0	4.0	8.474350e+08
2	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	7.0	4.5	1.106636e+09
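If you’re wondering what the left join does to movies nobody has rated, this small sketch on made-up data shows it. The movies and ratings here are hypothetical; the indicator=True argument adds a _merge column that flags rows which only exist on the left side.

```python
import pandas as pd

# Hypothetical toy data: "Film C" has no ratings at all
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["Film A", "Film B", "Film C"]})
ratings = pd.DataFrame(
    {"userId": [10, 11, 10], "movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.5]}
)

# A left join keeps every movie, even those without a matching rating
merged = movies.merge(ratings, on="movieId", how="left", indicator=True)

# Unrated movies come back with NaN, which forces userId to become a float column
unrated = merged[merged["_merge"] == "left_only"]
print(merged)
print(merged["userId"].dtype)
```

Using how='inner' instead would drop the unrated movies at this point, which is a reasonable alternative if you never need them later.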

Next, we need to take df_movies_ratings and reshape the dataframe using Pandas so that each row represents a unique user and each column represents a movie, with the user’s rating as the value. This changes our dataframe from long format data to wide format data. With one column created for every rated movie title in the dataset, this gives us loads of columns (9,719 in fact) and most of them will contain NaN (not a number) values because no person is ever going to review every movie in the dataset.

df_user_ratings = df_movies_ratings.pivot_table(index='userId', columns=['title'], values='rating')
df_user_ratings.head(3)

title	'71 (2014)	'Hellboy': The Seeds of Creation (2004)	'Round Midnight (1986)	'Salem's Lot (2004)	'Til There Was You (1997)	...	eXistenZ (1999)	xXx (2002)	xXx: State of the Union (2005)	¡Three Amigos! (1986)	À nous la liberté (Freedom for Us) (1931)
userId
1.0	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	4.0	NaN
2.0	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN
3.0	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN

3 rows × 9719 columns
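The long to wide reshape is easier to follow on a handful of rows. The sketch below uses made-up ratings and also measures how sparse the resulting matrix is. Note that pivot_table() averages the values by default, so if the same user and title appeared more than once the ratings would be combined into their mean.

```python
import pandas as pd

# Hypothetical long-format data: one row per rating
long_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "title": ["Film A", "Film B", "Film A", "Film B", "Film C"],
        "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
    }
)

# Wide format: one row per user, one column per title, NaN where there is no rating
wide = long_ratings.pivot_table(index="userId", columns="title", values="rating")

# Share of cells that are empty
sparsity = wide.isna().to_numpy().mean()
print(wide)
print(f"{sparsity:.0%} of the matrix is empty")
```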

3. Identify users who rated a given film

Next we’re going to fetch all the ratings by userId on “Star Wars: Episode IV - A New Hope (1977)” from the df_user_ratings dataframe. Although I’ve never seen this film, I’m fairly sure that people who watch it will also have watched and liked other movies in the franchise, so the results will be easy to interpret. This gives us a series containing the userId and rating for every person in the dataset, whether they’ve rated the film or not. The users who did not rate the film will have a NaN value in place of a rating.

rated_movie = df_user_ratings['Star Wars: Episode IV - A New Hope (1977)']
rated_movie.head(3)

userId
1.0    5.0
2.0    NaN
3.0    NaN
Name: Star Wars: Episode IV - A New Hope (1977), dtype: float64

4. Identify correlations

Next, we’re going to find the correlations of all movies against our target film using the Pandas corrwith() function. This computes the pairwise correlation of the rated movie series (or vector) with each of the other movie columns, using only the users who rated both films, and returns one correlation score per movie. Movies that share too few ratings with the target film get NaN, so by dropping the NaN values we see only the movies for which a correlation could be calculated.

similar_movies = df_user_ratings.corrwith(rated_movie)
similar_movies.dropna(inplace=True)
similar_movies = pd.DataFrame(similar_movies, columns=['correlation'])
similar_movies.head(5)

	correlation
title
'burbs, The (1989)	0.155161
(500) Days of Summer (2009)	0.024299
*batteries not included (1987)	-0.269069
10 Cent Pistol (2015)	1.000000
10 Cloverfield Lane (2016)	0.360885

If we sort the movies in descending order of correlation with our target film, we get back a list of the ones which are highly correlated. However, as you’ll see from the list below, although they are highly correlated, they actually look like pretty weak recommendations, which shows us that we still have work to do.

similar_movies.sort_values(by='correlation', ascending=False).head(5)

	correlation
title
Lakeview Terrace (2008)	1.0
Cry_Wolf (a.k.a. Cry Wolf) (2005)	1.0
Creep (2014)	1.0
Non-Stop (2014)	1.0
Not Without My Daughter (1991)	1.0
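A quick way to see why these perfect scores are suspicious is to count how many users rated both films. The sketch below uses a made-up matrix (the film names and ratings are hypothetical) to show that a film sharing only two ratings with the target can score a perfect 1.0. It also shows one possible guard, using the min_periods argument of Series.corr(), since corrwith() itself doesn’t offer one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Obscure" shares only two raters with "Target"
toy = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0, 3.0],
        "Popular": [5.0, 4.0, 1.0, 2.0, 3.0],
        "Obscure": [np.nan, 4.5, 4.0, np.nan, np.nan],
    }
)
target = toy["Target"]

# Same call as in the article: correlation of every column with the target
correlation = toy.corrwith(target)

# Number of users who rated both the target and each other film
overlap = toy[target.notna()].notna().sum()

# One possible guard: require at least three shared ratings, otherwise return NaN
guarded = toy.apply(lambda column: column.corr(target, min_periods=3))

print(pd.DataFrame({"correlation": correlation, "overlap": overlap, "guarded": guarded}))
```

With only two shared ratings, any two films will correlate at exactly 1.0 or -1.0, so "Obscure" outranks "Popular" here despite being backed by far less evidence.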

5. Engineer features to improve performance

The problem with the above approach is that it doesn’t take the number of ratings into consideration. People who like Star Wars might not want to watch Lakeview Terrace, even if the correlation score seems perfect. This has happened because only a handful of people rated both films, and a correlation calculated from so few shared ratings can easily come out as a perfect 1.0 by chance. To overcome this we need to engineer some additional features that let us filter on the volume of ratings, so that movies are only considered similar if there’s a reasonable number of ratings behind them. We first need to calculate the total number of ratings and the mean rating for each movie.

df_movies_ratings['total_ratings'] = df_movies_ratings.groupby('movieId')['rating'].transform('count')
df_movies_ratings['mean_rating'] = df_movies_ratings.groupby('movieId')['rating'].transform('mean')
df_movies_ratings.head(3)

	movieId	title	genres	userId	rating	timestamp	total_ratings	mean_rating
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	1.0	4.0	9.649827e+08	215	3.92093
1	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	5.0	4.0	8.474350e+08	215	3.92093
2	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	7.0	4.5	1.106636e+09	215	3.92093

Now we’ve calculated the total ratings and mean rating for each movie, we’ll drop the duplicate rows and create a new dataframe called df_movie_statistics which includes the core information we need.

# Keep the core columns and one row per movie; chaining avoids modifying a slice in place
df_movie_statistics = (
    df_movies_ratings[['movieId', 'title', 'total_ratings', 'mean_rating']]
    .drop_duplicates('movieId', keep='first')
)

df_movie_statistics.head(3)

	movieId	title	total_ratings	mean_rating
0	1	Toy Story (1995)	215	3.920930
215	2	Jumanji (1995)	110	3.431818
325	3	Grumpier Old Men (1995)	52	3.259615
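The transform-then-deduplicate approach works fine, but the same per-movie statistics can also be produced in one step with a grouped aggregation. This is an alternative sketch on made-up data rather than the method used above, and the column names simply mirror the ones in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical merged data: one row per rating, "Film C" has no ratings
merged = pd.DataFrame(
    {
        "movieId": [1, 1, 1, 2, 2, 3],
        "title": ["Film A", "Film A", "Film A", "Film B", "Film B", "Film C"],
        "rating": [4.0, 5.0, 3.0, 2.0, 3.0, np.nan],
    }
)

# Named aggregation returns one row per movie with both statistics
stats = merged.groupby(["movieId", "title"], as_index=False).agg(
    total_ratings=("rating", "count"),  # count() ignores NaN
    mean_rating=("rating", "mean"),
)
print(stats)
```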

Now we have a count of total ratings, we can filter our df_movie_statistics dataframe to create a new df_popular_movies dataframe which contains only those movies with 50 or more ratings. If we use this, we should find that the quality of the results goes up, because they’re less impacted by noise and should provide a better indication of consistency. Then, if we print the top five results we can see the five most frequently rated films, all of which comfortably exceed 50 ratings. This looks spot on.

# Boolean mask of movies with at least 50 ratings
has_enough_ratings = df_movie_statistics['total_ratings'] >= 50
df_popular_movies = df_movie_statistics[has_enough_ratings].sort_values(['total_ratings', 
                                                    'mean_rating'], ascending=False)
df_popular_movies.head()

	movieId	title	total_ratings	mean_rating
10019	356	Forrest Gump (1994)	329	4.164134
8652	318	Shawshank Redemption, The (1994)	317	4.429022
7860	296	Pulp Fiction (1994)	307	4.197068
16228	593	Silence of the Lambs, The (1991)	279	4.161290
45015	2571	Matrix, The (1999)	278	4.192446

If we sort the df_popular_movies dataframe in ascending order of the total_ratings we see that they start from 50, which is exactly what we need.

df_popular_movies.sort_values(by='total_ratings', ascending=True).head()

	movieId	title	total_ratings	mean_rating
57198	3785	Scary Movie (2000)	50	2.92
97724	116797	The Imitation Game (2014)	50	4.02
20520	910	Some Like It Hot (1959)	50	4.01
80434	34405	Serenity (2005)	50	3.94
93100	88125	Harry Potter and the Deathly Hallows: Part 2 (...	50	3.91

By merging the df_popular_movies dataframe above, which contains the mean ratings for films with 50 or more ratings, with our similar_movies dataframe we get a better picture. Because this is a left join, films that fell below the threshold come back with NaN statistics, so if we then drop the NaN values and sort in descending order of correlation, we get back a list of popular films that were also rated by people who rated our target film.

Now the results look pretty good. The first row is the target film itself, which always correlates perfectly with itself, followed by the other two original Star Wars films and some other films that are highly rated in general. That gives us a “people who watched this also watched” style recommendation.

similar_movies = similar_movies.reset_index()
popular_similar_movies = similar_movies.merge(df_popular_movies, on='title', how='left')
popular_similar_movies = popular_similar_movies.dropna()
popular_similar_movies.sort_values(by='correlation', ascending=False).head(10)

	title	correlation	movieId	total_ratings	mean_rating
4209	Star Wars: Episode IV - A New Hope (1977)	1.000000	260.0	251.0	4.231076
4210	Star Wars: Episode V - The Empire Strikes Back...	0.777970	1196.0	211.0	4.215640
4211	Star Wars: Episode VI - Return of the Jedi (1983)	0.734230	1210.0	196.0	4.137755
1689	Fugitive, The (1993)	0.482078	457.0	190.0	3.992105
4077	Slumdog Millionaire (2008)	0.479859	63082.0	71.0	3.809859
653	Bowling for Columbine (2002)	0.464610	5669.0	58.0	3.775862
46	28 Days Later (2002)	0.451605	6502.0	58.0	3.974138
2247	Inglourious Basterds (2009)	0.448799	68157.0	88.0	4.136364
2146	Hunt for Red October, The (1990)	0.421778	1610.0	90.0	3.872222
1177	Desperado (1995)	0.420516	163.0	66.0	3.560606
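If you want to reuse these steps for any film, they can be wrapped up in a small helper. The sketch below is one way to do it, assuming a wide user-by-title matrix like df_user_ratings. The recommend() function, its parameters and the toy data are all hypothetical, and unlike the output above it removes the target film from its own results.

```python
import numpy as np
import pandas as pd


def recommend(user_ratings, title, min_ratings=50, top_n=10):
    """Rank other titles by correlation with `title` (hypothetical helper)."""
    target = user_ratings[title]
    stats = pd.DataFrame(
        {
            "correlation": user_ratings.corrwith(target),
            "total_ratings": user_ratings.count(),
            "mean_rating": user_ratings.mean(),
        }
    )
    # Remove the target itself, then apply the popularity filter
    stats = stats.drop(index=title)
    stats = stats[stats["total_ratings"] >= min_ratings]
    stats = stats.dropna(subset=["correlation"])
    return stats.sort_values("correlation", ascending=False).head(top_n)


# Hypothetical toy matrix: rows are users, columns are titles
toy = pd.DataFrame(
    {
        "Space Saga I": [5.0, 4.0, 2.0, 1.0, np.nan],
        "Space Saga II": [5.0, 5.0, 1.0, 2.0, 4.0],
        "Courtroom Drama": [1.0, 2.0, 5.0, 4.0, 3.0],
        "Rare Short": [np.nan, 5.0, np.nan, np.nan, np.nan],
    },
    index=[1, 2, 3, 4, 5],
)

# A low threshold suits the toy data; the article uses 50 on MovieLens
result = recommend(toy, "Space Saga I", min_ratings=2, top_n=5)
print(result)
```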

We can modify this very easily to give results for “people who liked this film also liked” simply by filtering against the mean rating. If we assume that a mean rating of 4 or above is a good indication of a film people really liked, by adding this additional filter we get a tweaked set of results with “better” films.

popular_similar_liked_movies = popular_similar_movies[popular_similar_movies['mean_rating'] >= 4]
popular_similar_liked_movies.sort_values(by='correlation', ascending=False).head(10)

	title	correlation	movieId	total_ratings	mean_rating
4209	Star Wars: Episode IV - A New Hope (1977)	1.000000	260.0	251.0	4.231076
4210	Star Wars: Episode V - The Empire Strikes Back...	0.777970	1196.0	211.0	4.215640
4211	Star Wars: Episode VI - Return of the Jedi (1983)	0.734230	1210.0	196.0	4.137755
2247	Inglourious Basterds (2009)	0.448799	68157.0	88.0	4.136364
2240	Indiana Jones and the Last Crusade (1989)	0.410916	1291.0	140.0	4.046429
2675	Lord of the Rings: The Return of the King, The...	0.406602	7153.0	185.0	4.118919
3622	Raiders of the Lost Ark (Indiana Jones and the...	0.384779	1198.0	200.0	4.207500
2337	Jaws (1975)	0.372132	1387.0	91.0	4.005495
1793	Godfather, The (1972)	0.365920	858.0	192.0	4.289062
2227	Inception (2010)	0.356304	79132.0	143.0	4.066434

The collaborative filtering approach is just one of a number of different algorithms you can use for recommender systems. In the next part, we’ll try another approach which makes use of some other data within the MovieLens dataset.

Matt Clarke, Tuesday, March 02, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.