Recommender systems, or recommendation engines as they’re also known, are everywhere these days. Whether you’re looking for books on Amazon, tracks on Spotify, movies on Netflix or a date on Tinder or Grindr, the suggestions you’re served come from a recommender system.
As with everything in data science, there are many different ways you can generate recommendations, but probably the most widely used method is collaborative filtering. It can be done in any language, including SQL, which makes it straightforward to implement.
This fairly simple system was popularised by Amazon and has become widespread on content and ecommerce sites across the internet. Collaborative filtering actually comes in two different forms - item-based collaborative filtering and user-based collaborative filtering. They work in the same way, but their source data differs, allowing them to serve a slightly different style of date, movie, or product recommendation.
Some commonly seen examples of this are:
In user-based collaborative filtering you build up a matrix of every user and record all of their interactions, like the tracks they’ve listened to, the movies they’ve watched, the scores they’ve given or the articles they’ve read. This gives you one row per user with an item (like a track or movie) per column with the metric inside.
Using this matrix, we can compute the similarity between users based on what they liked, watched or rated, and identify users who have similar tastes. If a user has similar tastes to another and has read, watched or rated some of the same items, you can recommend them other things they might like.
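As a rough illustration of the idea above, here is a minimal sketch on a tiny made-up matrix. The user names, film titles and ratings are all hypothetical, and Pearson correlation is just one possible similarity measure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: one row per user, one column per item.
# NaN means the user has not rated that item.
toy_user_item = pd.DataFrame(
    {
        "Alien": [5.0, 4.0, 1.0, np.nan],
        "Blade Runner": [4.0, 5.0, 2.0, 1.0],
        "Clueless": [1.0, 2.0, 5.0, 4.0],
        "Dune": [5.0, np.nan, 1.0, 2.0],
    },
    index=["ann", "bob", "cat", "dan"],
)

# DataFrame.corr() correlates columns, so transpose to compare users.
user_similarity = toy_user_item.T.corr(method="pearson")

# Users most similar to "bob", excluding bob himself.
neighbours = user_similarity["bob"].drop("bob").sort_values(ascending=False)
print(neighbours)

# Items the nearest neighbour has rated that bob has not.
nearest = neighbours.index[0]
unseen = toy_user_item.loc[nearest][toy_user_item.loc["bob"].isna()].dropna()
print(unseen)
```

Here the nearest neighbour’s rating for the one film bob has not seen becomes the candidate recommendation.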
User-based collaborative filtering works, but it does have some issues. Firstly, people’s tastes change over time, so someone who liked Justin Bieber when they were 13 may now have moved on in life. Secondly, as you need to create one row per user, you end up with a truly enormous matrix on larger sites making this approach harder to scale. For example, if you’ve got 200K users on your platform and only 50K active, you’ll still need a row for each one.
The other issue with user-based collaborative filtering is that it can be rigged by users. In a shilling attack, someone creates a bot to generate fake user accounts and then uses them to give their own item glowing ratings, gaming the recommendations.
Item-based collaborative filtering is much the same as user-based collaborative filtering, but instead of looking at relationships between users it looks at relationships between the items, so it’s less sensitive to shilling attacks (particularly as it’s often based on items purchased, making it expensive to rig). Amazon has used this technique for years and it works well. As you only need one row per item, instead of per user, it’s also much more scalable, and it’s less sensitive to changes in people’s tastes over time.
You can perform item-based collaborative filtering in various ways, depending on your source data and the recommendation you wish to make. We’ll be using item-based collaborative filtering to create a movie recommender system so we’ll be making a matrix of every film, identifying which movies are similar to other movies and then making recommendations to users.
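Before moving to the real data, the item-based idea can be sketched on a toy matrix. The film names and ratings below are made up for illustration, and the snippet only shows the general shape of the technique.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are films, NaN means "not rated".
toy_ratings = pd.DataFrame(
    {
        "Space Film": [5.0, 4.0, 1.0, 2.0, 5.0],
        "Space Sequel": [5.0, 5.0, 2.0, 1.0, 4.0],
        "Rom Com": [1.0, 2.0, 5.0, 4.0, np.nan],
    }
)

# Item-based: correlate the columns (films) with each other.
item_similarity = toy_ratings.corr(method="pearson")

# Films most similar to "Space Film", excluding the film itself.
most_similar = (
    item_similarity["Space Film"].drop("Space Film").sort_values(ascending=False)
)
print(most_similar)
```

The only difference from the user-based version is which axis gets compared: here it is the film columns rather than the user rows.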
We are going to make our movie recommendations using the item-based collaborative filtering technique on the excellent MovieLens dataset, which is available in various sizes. The “ml-latest-small” dataset is a subset of the full data based on 100,000 ratings by 600 users across 9000 different movies, and comes in at around 1MB, so it’s ideal for practicing.
Download a copy of the zip file and place the files into a folder called `data`, then load up Pandas and read the CSV files into dataframes. We don’t actually need all the files, so just load up the `movies.csv` and `ratings.csv` files.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
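# Use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 1000)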
df_movies = pd.read_csv('data/movies.csv')
df_ratings = pd.read_csv('data/ratings.csv')
Next, we’ll view a random sample of each dataframe using `sample()` to check out the content within. As you can see below, `df_movies` contains the `movieId`, `title` and a list of genres. The `df_ratings` dataframe contains the `userId`, the `movieId`, the `rating` and a `timestamp`.
df_movies.sample(3)
| | movieId | title | genres |
|---|---|---|---|
| 5687 | 27741 | Twilight Samurai, The (Tasogare Seibei) (2002) | Drama\|Romance |
| 4684 | 6994 | Hard Way, The (1991) | Action\|Comedy |
| 6707 | 58425 | Heima (2007) | Documentary |
df_ratings.sample(3)
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 16085 | 104 | 4308 | 5.0 | 1048586839 |
| 27702 | 187 | 6870 | 3.0 | 1161849787 |
| 8481 | 57 | 3552 | 4.0 | 965798049 |
As the first step we need to merge the movies dataframe to the ratings dataframe. This gives us a massive dataframe containing one row per movie review, along with the `userId` and `rating`.
df_movies_ratings = df_movies.merge(df_ratings, on='movieId', how='left')
df_movies_ratings.head(3)
| | movieId | title | genres | userId | rating | timestamp |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1.0 | 4.0 | 9.649827e+08 |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 5.0 | 4.0 | 8.474350e+08 |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 7.0 | 4.5 | 1.106636e+09 |
Next, we need to take `df_movies_ratings` and reshape it using Pandas so that each row represents a unique user and each column represents a movie, with the user’s rating of that movie as the value. This changes our dataframe from long format data to wide format data. With one column created for every movie title in the dataset, this gives us loads of columns (9719 in fact) and most of them will contain `NaN` (not a number) values, because no person is ever going to review every movie in the dataset.
df_user_ratings = df_movies_ratings.pivot_table(index='userId', columns=['title'], values='rating')
df_user_ratings.head(3)
| userId | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | ... | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 4.0 | NaN |
| 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN |
| 3.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN |

3 rows × 9719 columns
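To make the long-to-wide reshape easier to picture, here is a small sketch on made-up data. The `toy_long` and `toy_wide` names and all values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, film) rating.
toy_long = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "title": ["Film A", "Film B", "Film A", "Film B", "Film C"],
        "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
    }
)

# Wide format: one row per user, one column per film, NaN where unrated.
toy_wide = toy_long.pivot_table(index="userId", columns="title", values="rating")
print(toy_wide)

# Share of empty cells, a rough measure of how sparse the matrix is.
sparsity = toy_wide.isna().to_numpy().mean()
print(f"{sparsity:.0%} of cells are empty")
```

Note that `pivot_table()` averages by default, so if the same user and title pair appeared more than once the cell would hold the mean of those ratings.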
First, we’re going to fetch all the ratings by `userId` for “Star Wars: Episode IV - A New Hope (1977)” from the `df_user_ratings` dataframe. Although I’ve never seen this film, I’m fairly sure that people who watch it will also have watched and liked other movies in the franchise, so the results will be easy to interpret. This gives us a series containing the `userId` and `rating` for every person in the dataset, whether they’ve rated the film or not. The users who did not rate the film will have a `NaN` value in the rating column.
rated_movie = df_user_ratings['Star Wars: Episode IV - A New Hope (1977)']
rated_movie.head(3)
userId
1.0 5.0
2.0 NaN
3.0 NaN
Name: Star Wars: Episode IV - A New Hope (1977), dtype: float64
Next, we’re going to find the correlations of all movies against our target film using the Pandas `corrwith()` method. This computes the pairwise correlation (Pearson by default) between the rated movie series (or vector) and every other movie column, using only the users who rated both films, and returns one correlation score per movie. Movies with too little overlap to compute a score get `NaN`, so by dropping the `NaN` values we keep only the movies that can be compared.
similar_movies = df_user_ratings.corrwith(rated_movie)
similar_movies.dropna(inplace=True)
similar_movies = pd.DataFrame(similar_movies, columns=['correlation'])
similar_movies.head(5)
| title | correlation |
|---|---|
| 'burbs, The (1989) | 0.155161 |
| (500) Days of Summer (2009) | 0.024299 |
| *batteries not included (1987) | -0.269069 |
| 10 Cent Pistol (2015) | 1.000000 |
| 10 Cloverfield Lane (2016) | 0.360885 |
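To see how `corrwith()` treats missing ratings, here is a small sketch on made-up data. The column names and values are hypothetical, and the last column shares no raters with the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are users, columns are films.
toy = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 1.0, 2.0, np.nan],
        "Similar": [4.0, 5.0, 2.0, 1.0, 3.0],
        "Opposite": [1.0, 2.0, 5.0, 4.0, 3.0],
        "No Overlap": [np.nan, np.nan, np.nan, np.nan, 4.0],
    }
)

# Correlate every column with the target column. Only rows where both
# films have a rating contribute to each score.
correlations = toy.corrwith(toy["Target"])
print(correlations)

# Films that cannot be compared come back as NaN and are dropped.
comparable = correlations.dropna()
print(comparable.sort_values(ascending=False))
```

The film with no raters in common gets `NaN` rather than a score, which is why the `dropna()` step removes it.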
If we sort the movies in descending order of correlation with our target film, we get back a list of the ones which are highly correlated. However, as you’ll see from the list below, although they are highly correlated, they actually look like pretty weak recommendations, which shows us that we still have work to do.
similar_movies.sort_values(by='correlation', ascending=False).head(5)
| title | correlation |
|---|---|
| Lakeview Terrace (2008) | 1.0 |
| Cry_Wolf (a.k.a. Cry Wolf) (2005) | 1.0 |
| Creep (2014) | 1.0 |
| Non-Stop (2014) | 1.0 |
| Not Without My Daughter (1991) | 1.0 |
The problem with the above approach is that it doesn’t take the number of ratings into consideration. People who like Star Wars might not want to watch Shrek the Halls, even if the correlation score seems perfect. This has happened because only a handful of people rated both Star Wars and the other film, and with just two or three ratings in common a perfect correlation can easily arise by chance. To overcome this we need to engineer some additional features that let us filter on the volume of reviews, so that films are only considered similar if there’s a reasonable number of ratings behind them. We first need to calculate the total number of ratings and the mean rating for each movie.
df_movies_ratings['total_ratings'] = df_movies_ratings.groupby('movieId')['rating'].transform('count')
df_movies_ratings['mean_rating'] = df_movies_ratings.groupby('movieId')['rating'].transform('mean')
df_movies_ratings.head(3)
| | movieId | title | genres | userId | rating | timestamp | total_ratings | mean_rating |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1.0 | 4.0 | 9.649827e+08 | 215 | 3.92093 |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 5.0 | 4.0 | 8.474350e+08 | 215 | 3.92093 |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 7.0 | 4.5 | 1.106636e+09 | 215 | 3.92093 |
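The reason rating volume matters can be shown on a toy matrix. This sketch counts the ratings each film shares with the target, which is a closely related check rather than the total-ratings filter used in this article. The film names, values and the threshold of 3 are all made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are users, columns are films.
toy = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 3.0, 2.0, 1.0, 4.0],
        "Obscure": [np.nan, np.nan, 3.5, 2.0, np.nan, np.nan],
        "Well Rated": [4.0, 4.0, 3.0, 1.0, 2.0, 5.0],
    }
)

correlation = toy.corrwith(toy["Target"])

# Number of users who rated both the target and each film:
# keep rows where the target is rated, then count non-null cells per column.
shared = toy[toy["Target"].notna()].notna().sum()

summary = pd.DataFrame({"correlation": correlation, "shared_ratings": shared})
print(summary)

# Illustrative threshold: require at least 3 shared ratings.
reliable = summary[summary["shared_ratings"] >= 3]
print(reliable)
```

Two points always lie on a straight line, so the film with only two shared ratings scores a perfect correlation that says very little about real similarity.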
Now we’ve calculated the total ratings and mean rating for each movie, we’ll drop the duplicate rows and create a new dataframe called `df_movie_statistics` which includes the core information we need.
# Keep one row per movie; assigning the result avoids modifying a slice in place
df_movie_statistics = df_movies_ratings[['movieId', 'title', 'total_ratings', 'mean_rating']]
df_movie_statistics = df_movie_statistics.drop_duplicates('movieId', keep='first')
df_movie_statistics.head(3)
| | movieId | title | total_ratings | mean_rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 215 | 3.920930 |
| 215 | 2 | Jumanji (1995) | 110 | 3.431818 |
| 325 | 3 | Grumpier Old Men (1995) | 52 | 3.259615 |
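As an aside, the same per-movie statistics can be built in a single step with a grouped aggregation instead of `transform()` followed by `drop_duplicates()`. This is an alternative sketch on made-up data, not the approach used in this article, and the `toy` and `toy_statistics` names are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per rating.
toy = pd.DataFrame(
    {
        "movieId": [1, 1, 1, 2, 2],
        "title": ["Film A", "Film A", "Film A", "Film B", "Film B"],
        "rating": [4.0, 5.0, 3.0, 2.0, 3.0],
    }
)

# One row per movie with the rating count and mean, via named aggregation.
toy_statistics = (
    toy.groupby(["movieId", "title"], as_index=False)
    .agg(total_ratings=("rating", "count"), mean_rating=("rating", "mean"))
)
print(toy_statistics)
```

The result has one row per movie from the start, so there is nothing to de-duplicate afterwards.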
Now we have a count of total ratings, we can filter our `df_movie_statistics` dataframe to create a new `df_popular_movies` dataframe which contains only those movies with 50 or more ratings. If we use this, we should find that the quality of the results goes up, because they’re less impacted by noise and should provide a better indication of consistency. Then, if we print the top five results we can see the five most frequently rated films, each with far more than 50 ratings. This looks spot on.
# Boolean mask selecting the movies with at least 50 ratings
has_enough_ratings = df_movie_statistics['total_ratings'] >= 50
df_popular_movies = df_movie_statistics[has_enough_ratings].sort_values(['total_ratings', 'mean_rating'], ascending=False)
df_popular_movies.head()
| | movieId | title | total_ratings | mean_rating |
|---|---|---|---|---|
| 10019 | 356 | Forrest Gump (1994) | 329 | 4.164134 |
| 8652 | 318 | Shawshank Redemption, The (1994) | 317 | 4.429022 |
| 7860 | 296 | Pulp Fiction (1994) | 307 | 4.197068 |
| 16228 | 593 | Silence of the Lambs, The (1991) | 279 | 4.161290 |
| 45015 | 2571 | Matrix, The (1999) | 278 | 4.192446 |
If we sort the `df_popular_movies` dataframe in ascending order of `total_ratings`, we see that they start from 50, which is exactly what we need.
df_popular_movies.sort_values(by='total_ratings', ascending=True).head()
| | movieId | title | total_ratings | mean_rating |
|---|---|---|---|---|
| 57198 | 3785 | Scary Movie (2000) | 50 | 2.92 |
| 97724 | 116797 | The Imitation Game (2014) | 50 | 4.02 |
| 20520 | 910 | Some Like It Hot (1959) | 50 | 4.01 |
| 80434 | 34405 | Serenity (2005) | 50 | 3.94 |
| 93100 | 88125 | Harry Potter and the Deathly Hallows: Part 2 (... | 50 | 3.91 |
By merging the `df_popular_movies` dataframe above, which contains the mean ratings for films with 50 or more ratings, with our `similar_movies` dataframe we get a better picture. If we then drop the rows with `NaN` values, which are the films that didn’t make it into `df_popular_movies`, and sort in descending order of correlation, we get back a list of films that were also rated by people who watched our target film.
Now the results look pretty good. The target film itself comes first with a correlation of 1.0, followed by the other two original Star Wars films and then some other widely rated films. That gives us a “people who watched this also watched” style recommendation.
similar_movies = similar_movies.reset_index()
popular_similar_movies = similar_movies.merge(df_popular_movies, on='title', how='left')
popular_similar_movies = popular_similar_movies.dropna()
popular_similar_movies.sort_values(by='correlation', ascending=False).head(10)
| | title | correlation | movieId | total_ratings | mean_rating |
|---|---|---|---|---|---|
| 4209 | Star Wars: Episode IV - A New Hope (1977) | 1.000000 | 260.0 | 251.0 | 4.231076 |
| 4210 | Star Wars: Episode V - The Empire Strikes Back... | 0.777970 | 1196.0 | 211.0 | 4.215640 |
| 4211 | Star Wars: Episode VI - Return of the Jedi (1983) | 0.734230 | 1210.0 | 196.0 | 4.137755 |
| 1689 | Fugitive, The (1993) | 0.482078 | 457.0 | 190.0 | 3.992105 |
| 4077 | Slumdog Millionaire (2008) | 0.479859 | 63082.0 | 71.0 | 3.809859 |
| 653 | Bowling for Columbine (2002) | 0.464610 | 5669.0 | 58.0 | 3.775862 |
| 46 | 28 Days Later (2002) | 0.451605 | 6502.0 | 58.0 | 3.974138 |
| 2247 | Inglourious Basterds (2009) | 0.448799 | 68157.0 | 88.0 | 4.136364 |
| 2146 | Hunt for Red October, The (1990) | 0.421778 | 1610.0 | 90.0 | 3.872222 |
| 1177 | Desperado (1995) | 0.420516 | 163.0 | 66.0 | 3.560606 |
We can modify this very easily to give results for “people who liked this film also liked” simply by filtering against the mean rating. If we assume that a mean rating of 4 or more is a good indication of a film people really liked, by adding this additional filter we get a tweaked set of results with “better” films.
popular_similar_liked_movies = popular_similar_movies[popular_similar_movies['mean_rating'] >= 4]
popular_similar_liked_movies.sort_values(by='correlation', ascending=False).head(10)
| | title | correlation | movieId | total_ratings | mean_rating |
|---|---|---|---|---|---|
| 4209 | Star Wars: Episode IV - A New Hope (1977) | 1.000000 | 260.0 | 251.0 | 4.231076 |
| 4210 | Star Wars: Episode V - The Empire Strikes Back... | 0.777970 | 1196.0 | 211.0 | 4.215640 |
| 4211 | Star Wars: Episode VI - Return of the Jedi (1983) | 0.734230 | 1210.0 | 196.0 | 4.137755 |
| 2247 | Inglourious Basterds (2009) | 0.448799 | 68157.0 | 88.0 | 4.136364 |
| 2240 | Indiana Jones and the Last Crusade (1989) | 0.410916 | 1291.0 | 140.0 | 4.046429 |
| 2675 | Lord of the Rings: The Return of the King, The... | 0.406602 | 7153.0 | 185.0 | 4.118919 |
| 3622 | Raiders of the Lost Ark (Indiana Jones and the... | 0.384779 | 1198.0 | 200.0 | 4.207500 |
| 2337 | Jaws (1975) | 0.372132 | 1387.0 | 91.0 | 4.005495 |
| 1793 | Godfather, The (1972) | 0.365920 | 858.0 | 192.0 | 4.289062 |
| 2227 | Inception (2010) | 0.356304 | 79132.0 | 143.0 | 4.066434 |
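If you want to reuse these steps for other films, they can be wrapped in a small helper. The sketch below is one possible way to do it: the `recommend_similar()` name and its parameters are hypothetical, it computes the statistics directly from the wide ratings matrix rather than via a merge, and unlike the tables above it leaves the target film out of its own results. It’s demonstrated on made-up data.

```python
import numpy as np
import pandas as pd


def recommend_similar(user_ratings, title, min_ratings=50,
                      min_mean_rating=None, top_n=10):
    """Rank films by correlation with `title`, filtered by rating volume.

    `user_ratings` is a wide matrix: one row per user, one column per film.
    """
    stats = pd.DataFrame(
        {
            "correlation": user_ratings.corrwith(user_ratings[title]),
            "total_ratings": user_ratings.count(),
            "mean_rating": user_ratings.mean(),
        }
    )
    # Drop films that could not be compared, and the target film itself.
    stats = stats.dropna(subset=["correlation"]).drop(index=title, errors="ignore")
    # Keep only films with enough ratings to be trusted.
    stats = stats[stats["total_ratings"] >= min_ratings]
    # Optionally keep only films people rated highly on average.
    if min_mean_rating is not None:
        stats = stats[stats["mean_rating"] >= min_mean_rating]
    return stats.sort_values("correlation", ascending=False).head(top_n)


# Hypothetical toy matrix: rows are users, columns are films.
toy = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 1.0, 2.0, 5.0, 4.0],
        "Sequel": [5.0, 5.0, 2.0, 1.0, 4.0, np.nan],
        "Obscure": [np.nan, np.nan, 1.0, 2.0, np.nan, np.nan],
        "Opposite": [1.0, 2.0, 5.0, 4.0, 1.0, 2.0],
    }
)

also_watched = recommend_similar(toy, "Target", min_ratings=3)
also_liked = recommend_similar(toy, "Target", min_ratings=3, min_mean_rating=3.0)
print(also_watched)
print(also_liked)
```

On the real data, a call such as `recommend_similar(df_user_ratings, 'Star Wars: Episode IV - A New Hope (1977)', min_ratings=50, min_mean_rating=4)` should follow the same filtering logic as the steps above, though the counts are grouped by title here rather than by `movieId`.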
The collaborative filtering approach is just one of a number of different algorithms you can use for recommender systems. In the next part, we’ll try another approach which makes use of some other data within the MovieLens dataset.
Matt Clarke, Tuesday, March 02, 2021