How to create a collaborative filtering recommender system

Learn how to use item-based and user-based collaborative filtering to create a powerful recommender system in Python.


Recommender systems, or recommendation engines as they’re also known, are everywhere these days. Whether you’re looking for books on Amazon, tracks on Spotify, movies on Netflix or a date on Tinder or Grindr, the suggestions you’re served are generated by a recommender system.

As with everything in data science, there are many different ways you can generate recommendations, but probably the most widely used method is collaborative filtering. It can be done in any language, including SQL, which makes it straightforward to implement.

This fairly simple system was popularised by Amazon and has become widespread on content and ecommerce sites across the internet. Collaborative filtering actually comes in two different forms - item-based collaborative filtering and user-based collaborative filtering. They work in the same way, but their source data differs, allowing them to serve a slightly different style of date, movie, or product recommendation.

Some commonly seen examples of this are:

  • Customers who bought this also bought…
  • Customers who liked this also liked…
  • Customers who read this also read…
  • Customers who engaged with this also engaged with…
  • Customers who watched this also watched…
  • Customers who swiped right on him/her also swiped right on…

Many movie recommendation engines use collaborative filtering

User-based collaborative filtering

In user-based collaborative filtering you build up a matrix of every user and record all of their interactions, like the tracks they’ve listened to, the movies they’ve watched, the scores they’ve given or the articles they’ve read. This gives you one row per user and one column per item (like a track or movie), with each cell holding the metric for that user and item.

Using this matrix, we can compute the similarity between users based on what they liked, watched or rated, and identify users who have similar tastes. If a user has similar tastes to another and has read, watched or rated some of the same items, you can recommend them other things they might like.
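As a minimal sketch of the idea, here is user-to-user similarity on a tiny made-up matrix. The user names, film names and ratings are all hypothetical, and Pearson correlation is just one possible similarity measure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are items, NaN = not rated.
toy_user_item = pd.DataFrame(
    {
        "Film A": [5.0, 4.0, 1.0],
        "Film B": [4.0, 5.0, 2.0],
        "Film C": [1.0, 2.0, 5.0],
        "Film D": [np.nan, 4.0, 1.0],
    },
    index=["ann", "bob", "cat"],
)

# Transpose so each user becomes a column, then correlate users pairwise.
# Each correlation only uses the items that both users have rated.
user_similarity = toy_user_item.T.corr(method="pearson")

# The most similar other user to "ann".
nearest_to_ann = user_similarity["ann"].drop("ann").idxmax()

# Items the neighbour has rated that ann has not are candidate recommendations.
ann_unrated = toy_user_item.loc["ann"].isna()
candidates = toy_user_item.loc[nearest_to_ann][ann_unrated]

print(nearest_to_ann)
print(candidates)
```

Here ann and bob rate the first three films in a similar way, so bob’s rating for Film D becomes a candidate recommendation for ann.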

User-based collaborative filtering works, but it does have some issues. Firstly, people’s tastes change over time, so someone who liked Justin Bieber when they were 13 may now have moved on in life. Secondly, as you need to create one row per user, you end up with a truly enormous matrix on larger sites making this approach harder to scale. For example, if you’ve got 200K users on your platform and only 50K active, you’ll still need a row for each one.

The other issue with user-based collaborative filtering is that it can be rigged by users. In a shilling attack, someone creates a bot to generate fake user accounts that give their own item glowing ratings, gaming the recommendations.

Item-based collaborative filtering

Item-based collaborative filtering is much the same as user-based collaborative filtering, but instead of looking at relationships between users it looks at relationships between the items. This makes it less sensitive to shilling attacks, particularly as it’s often based on items purchased, which makes it expensive to rig. Amazon has used this technique for years and it works well. As you only need one row per item, instead of per user, it’s also much more scalable, and it’s less sensitive to changes in users’ tastes over time.

You can perform item-based collaborative filtering in various ways, depending on your source data and the recommendation you wish to make. We’ll be using item-based collaborative filtering to create a movie recommender system so we’ll be making a matrix of every film, identifying which movies are similar to other movies and then making recommendations to users.
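Before moving to the real data, here is a small sketch of the item-based idea on a made-up matrix. The film names and ratings are hypothetical. Correlating the columns of a user-by-film matrix gives a film-by-film similarity matrix.

```python
import pandas as pd

# Hypothetical toy ratings: one row per user, one column per film.
toy_ratings = pd.DataFrame(
    {
        "Space Saga I": [5.0, 4.0, 2.0, 1.0],
        "Space Saga II": [5.0, 5.0, 1.0, 2.0],
        "Rom Com": [1.0, 2.0, 5.0, 4.0],
    },
    index=[1, 2, 3, 4],
)

# Correlating the columns gives an item-by-item similarity matrix.
item_similarity = toy_ratings.corr(method="pearson")

# Rank the other films by their similarity to the target film.
target = "Space Saga I"
ranked = item_similarity[target].drop(target).sort_values(ascending=False)
print(ranked)
```

The sequel ranks first because the same users rated both films highly, while the rom com is negatively correlated with the target.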


1. Load the data

We are going to make our movie recommendations using the item-based collaborative filtering technique on the excellent MovieLens dataset, which is available in various sizes. The “ml-latest-small” dataset is a subset of the full data based on 100,000 ratings by 600 users across 9000 different movies, and comes in at around 1MB, so it’s ideal for practicing.

Download a copy of the zip file and place the files into a folder called data, then load up Pandas and read the CSV files into dataframes. We don’t actually need all the files, so just load up the movies.csv and ratings.csv files.

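import pandas as pd
import numpy as np
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Use the full option names; the short forms can be ambiguous in recent Pandas versions
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 1000)

# Load only the two files we need
df_movies = pd.read_csv('data/movies.csv')
df_ratings = pd.read_csv('data/ratings.csv')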

Next, we’ll view a random sample() of each dataframe to check out the content within. As you can see below, df_movies contains the movieId, title and a pipe-separated list of genres. The df_ratings dataframe contains the userId, the movieId, the rating and a timestamp.

df_movies.sample(3)

      movieId  title                                           genres
5687    27741  Twilight Samurai, The (Tasogare Seibei) (2002)  Drama|Romance
4684     6994  Hard Way, The (1991)                            Action|Comedy
6707    58425  Heima (2007)                                    Documentary

df_ratings.sample(3)

       userId  movieId  rating   timestamp
16085     104     4308     5.0  1048586839
27702     187     6870     3.0  1161849787
 8481      57     3552     4.0   965798049

2. Re-shape the data and create a matrix

As the first step we need to merge the movies dataframe to the ratings dataframe. This gives us a massive dataframe containing one row per movie review, along with the userId and rating.

df_movies_ratings = df_movies.merge(df_ratings, on='movieId', how='left')
df_movies_ratings.head(3)

   movieId  title             genres                                       userId  rating     timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     1.0     4.0  9.649827e+08
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     5.0     4.0  8.474350e+08
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     7.0     4.5  1.106636e+09
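You might notice that userId now shows as 1.0 rather than 1. A small sketch with hypothetical miniature versions of the two files suggests why: a left join keeps movies that nobody has rated, and the resulting NaN values force the integer column to become a float.

```python
import pandas as pd

# Hypothetical miniature versions of movies.csv and ratings.csv.
toy_movies = pd.DataFrame(
    {"movieId": [1, 2, 3], "title": ["Film A", "Film B", "Film C"]}
)
toy_ratings = pd.DataFrame(
    {"userId": [10, 11, 10], "movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.0]}
)

# A left join keeps every movie, even those with no ratings at all.
toy_merged = toy_movies.merge(toy_ratings, on="movieId", how="left")
print(toy_merged)

# Film C has no ratings, so its userId and rating are NaN, and the
# userId column is upcast from int to float to hold that NaN.
print(toy_merged["userId"].dtype)
```

If you only care about movies that have at least one rating, how='inner' would avoid these empty rows.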

Next, we need to take df_movies_ratings and reshape the dataframe using Pandas so that each row represents a unique user and each column represents a movie, with the user’s rating in the cell. This changes our dataframe from long format data to wide format data. With one column created for every movie title in the dataset, this gives us loads of columns (9719 in fact) and most of them will contain NaN (not a number) values because no person is ever going to review every movie in the dataset.

df_user_ratings = df_movies_ratings.pivot_table(index='userId', columns=['title'], values='rating')
df_user_ratings.head(3)
title '71 (2014) 'Hellboy': The Seeds of Creation (2004) 'Round Midnight (1986) 'Salem's Lot (2004) 'Til There Was You (1997) ... eXistenZ (1999) xXx (2002) xXx: State of the Union (2005) ¡Three Amigos! (1986) À nous la liberté (Freedom for Us) (1931)
userId
1.0 NaN NaN NaN NaN NaN ... NaN NaN NaN 4.0 NaN
2.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
3.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN

3 rows × 9719 columns
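To make the long-to-wide reshape easier to see, here is the same pivot_table() call on a few made-up rows. The data is hypothetical, and the sparsity calculation is an extra not used elsewhere in this article.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, film) rating.
toy_long = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "title": ["Film A", "Film B", "Film A", "Film B", "Film C"],
        "rating": [4.0, 5.0, 3.0, 2.0, 4.5],
    }
)

# Wide format: one row per user, one column per film, NaN where unrated.
toy_wide = toy_long.pivot_table(index="userId", columns="title", values="rating")
print(toy_wide)

# Fraction of user/film cells that hold no rating.
sparsity = toy_wide.isna().to_numpy().mean()
print(f"{sparsity:.0%} of cells are empty")
```

Note that pivot_table() averages duplicate entries by default, so if two different movies share exactly the same title their ratings end up in a single column.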

3. Identify users who rated a given film

Next, we’re going to fetch all the ratings by userId for “Star Wars: Episode IV - A New Hope (1977)” from the df_user_ratings dataframe. Although I’ve never seen this film, I’m fairly sure that people who watch it will also have watched and liked other movies in the franchise, so the results will be easy to interpret. This gives us a series containing the userId and rating for every person in the dataset, whether they’ve rated the film or not. The users who did not rate the film will have a NaN value in the rating column.

rated_movie = df_user_ratings['Star Wars: Episode IV - A New Hope (1977)']
rated_movie.head(3)
userId
1.0    5.0
2.0    NaN
3.0    NaN
Name: Star Wars: Episode IV - A New Hope (1977), dtype: float64
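If you want to see just the users who actually rated the film, a sketch like this works on any such series. The series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one film, indexed by userId (NaN = not rated).
toy_rated_movie = pd.Series(
    [5.0, np.nan, np.nan, 4.0, 3.5], index=[1, 2, 3, 4, 5], name="Film A"
)

# Keep only the users who actually rated the film.
toy_raters = toy_rated_movie.dropna()
print(toy_raters.index.tolist())

# count() ignores NaN, so it gives the number of ratings directly.
print(toy_rated_movie.count())
```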

4. Identify correlations

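Next, we’re going to find the correlations of all movies against our target film using the Pandas corrwith() function. This computes the pairwise correlation (Pearson by default) between the rated movie series (or vector) and the column for every other movie, using only the users who rated both films, and returns one correlation score per movie. Movies for which no correlation can be computed (for example, because too few users rated both films) come back as NaN, so by dropping the NaN values we keep only the movies that have a score.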

similar_movies = df_user_ratings.corrwith(rated_movie)
similar_movies.dropna(inplace=True)
similar_movies = pd.DataFrame(similar_movies, columns=['correlation'])
similar_movies.head(5)

title                           correlation
'burbs, The (1989)                 0.155161
(500) Days of Summer (2009)        0.024299
*batteries not included (1987)    -0.269069
10 Cent Pistol (2015)              1.000000
10 Cloverfield Lane (2016)         0.360885

If we sort the movies in descending order of correlation with our target film, we get back a list of the ones which are highly correlated. However, as you’ll see from the list below, although they are highly correlated, they actually look like pretty weak recommendations, which shows us that we still have work to do.

similar_movies.sort_values(by='correlation', ascending=False).head(5)

title                              correlation
Lakeview Terrace (2008)                    1.0
Cry_Wolf (a.k.a. Cry Wolf) (2005)          1.0
Creep (2014)                               1.0
Non-Stop (2014)                            1.0
Not Without My Daughter (1991)             1.0
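A perfect score of 1.0 for an unrelated film is a hint that very few users rated both films. This sketch, using made-up films and ratings, shows how just two shared ratings are enough to produce a perfect correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide ratings: one row per user, one column per film.
toy_wide = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0, 3.0],
        "Well Rated Sequel": [5.0, 4.0, 1.0, 2.0, 3.0],
        "Obscure Film": [np.nan, np.nan, np.nan, 3.0, 4.0],
    }
)

# Correlate every film with the target, using only users who rated both.
toy_corr = toy_wide.corrwith(toy_wide["Target"])

# Number of users who rated both the target and each film.
both_rated = toy_wide[toy_wide["Target"].notna()].count()

print(pd.DataFrame({"correlation": toy_corr, "both_rated": both_rated}))
```

The obscure film gets a correlation of 1.0 from only two shared ratings, while the sequel scores lower despite agreeing with the target across all five users.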

5. Engineer features to improve performance

The problem with the above approach is that it doesn’t take the number of ratings into consideration. People who like Star Wars might not want to watch Lakeview Terrace, even if the correlation score seems perfect. This has most likely happened because only a very small number of people rated both films, and a correlation calculated from a handful of shared ratings can easily come out as a perfect 1.0 by chance. To overcome this we need to engineer some additional features that let us filter on the volume of ratings, so that movies are only considered similar if there’s a reasonable number of ratings behind the score. We first need to calculate the total number of ratings and the mean rating for each movie.

df_movies_ratings['total_ratings'] = df_movies_ratings.groupby('movieId')['rating'].transform('count')
df_movies_ratings['mean_rating'] = df_movies_ratings.groupby('movieId')['rating'].transform('mean')
df_movies_ratings.head(3)

   movieId  title             genres                                       userId  rating     timestamp  total_ratings  mean_rating
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     1.0     4.0  9.649827e+08            215      3.92093
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     5.0     4.0  8.474350e+08            215      3.92093
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     7.0     4.5  1.106636e+09            215      3.92093

Now we’ve calculated the total ratings and mean rating for each movie, we’ll drop the duplicate rows and create a new dataframe called df_movie_statistics which includes the core information we need.

df_movie_statistics = df_movies_ratings[['movieId', 'title', 'total_ratings', 'mean_rating']]
# Assign the result to get a new dataframe rather than modifying a slice in place
df_movie_statistics = df_movie_statistics.drop_duplicates('movieId', keep='first')
df_movie_statistics.head(3)

     movieId  title                    total_ratings  mean_rating
  0        1  Toy Story (1995)                   215     3.920930
215        2  Jumanji (1995)                     110     3.431818
325        3  Grumpier Old Men (1995)             52     3.259615
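The transform-then-deduplicate steps above work fine, but the same per-movie statistics can also be calculated in one go. This is a sketch on made-up data comparing the two routes; the toy_ names are hypothetical.

```python
import pandas as pd

# Hypothetical long-format ratings.
toy_long = pd.DataFrame(
    {"movieId": [1, 1, 1, 2], "rating": [4.0, 5.0, 3.0, 2.0]}
)

# transform() broadcasts each group's statistic back onto every row...
toy_long["total_ratings"] = toy_long.groupby("movieId")["rating"].transform("count")
toy_long["mean_rating"] = toy_long.groupby("movieId")["rating"].transform("mean")
print(toy_long)

# ...whereas agg() returns one row per movie directly, with no duplicates to drop.
toy_stats = toy_long.groupby("movieId")["rating"].agg(
    total_ratings="count", mean_rating="mean"
)
print(toy_stats)
```

The agg() route leaves movieId as the index, so you would need to reset the index and merge the titles back in if you want them alongside the statistics.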

Now we have a count of total ratings, we can filter our df_movie_statistics dataframe to create a new df_popular_movies dataframe which contains only those movies with 50 or more ratings. If we use this, we should find that the quality of the results goes up, because they’re less impacted by noise and should provide a better indication of consistency. Then, if we print the top five results, we can see the five most frequently rated films in the dataset, along with their mean ratings. This looks spot on.

# Boolean mask selecting the movies with at least 50 ratings
is_popular = df_movie_statistics['total_ratings'] >= 50
df_popular_movies = df_movie_statistics[is_popular].sort_values(
    ['total_ratings', 'mean_rating'], ascending=False)
df_popular_movies.head()

       movieId  title                             total_ratings  mean_rating
10019      356  Forrest Gump (1994)                         329     4.164134
 8652      318  Shawshank Redemption, The (1994)            317     4.429022
 7860      296  Pulp Fiction (1994)                         307     4.197068
16228      593  Silence of the Lambs, The (1991)            279     4.161290
45015     2571  Matrix, The (1999)                          278     4.192446

If we sort the df_popular_movies dataframe in ascending order of the total_ratings we see that they start from 50, which is exactly what we need.

df_popular_movies.sort_values(by='total_ratings', ascending=True).head()

       movieId  title                                              total_ratings  mean_rating
57198     3785  Scary Movie (2000)                                            50         2.92
97724   116797  The Imitation Game (2014)                                     50         4.02
20520      910  Some Like It Hot (1959)                                       50         4.01
80434    34405  Serenity (2005)                                               50         3.94
93100    88125  Harry Potter and the Deathly Hallows: Part 2 (...              50         3.91

By merging the df_popular_movies dataframe above, which contains the mean ratings for films that 50 or more people have rated, with our similar_movies dataframe, we get a better picture. If we then drop the rows with NaN values, which are the films that did not make it into df_popular_movies, and sort in descending order of correlation, we get back a list of films that were also rated by people who watched our target film.

Now the results look pretty good. We’ve got the other Star Wars films in the top results, along with some other films that are highly rated in general. That gives us a “people who watched this also watched” style recommendation.

similar_movies = similar_movies.reset_index()
popular_similar_movies = similar_movies.merge(df_popular_movies, on='title', how='left')
popular_similar_movies = popular_similar_movies.dropna()
popular_similar_movies.sort_values(by='correlation', ascending=False).head(10)

      title                                              correlation  movieId  total_ratings  mean_rating
4209  Star Wars: Episode IV - A New Hope (1977)             1.000000    260.0          251.0     4.231076
4210  Star Wars: Episode V - The Empire Strikes Back...     0.777970   1196.0          211.0     4.215640
4211  Star Wars: Episode VI - Return of the Jedi (1983)     0.734230   1210.0          196.0     4.137755
1689  Fugitive, The (1993)                                  0.482078    457.0          190.0     3.992105
4077  Slumdog Millionaire (2008)                            0.479859  63082.0           71.0     3.809859
 653  Bowling for Columbine (2002)                          0.464610   5669.0           58.0     3.775862
  46  28 Days Later (2002)                                  0.451605   6502.0           58.0     3.974138
2247  Inglourious Basterds (2009)                           0.448799  68157.0           88.0     4.136364
2146  Hunt for Red October, The (1990)                      0.421778   1610.0           90.0     3.872222
1177  Desperado (1995)                                      0.420516    163.0           66.0     3.560606
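The filter above uses each film’s total number of ratings as a stand-in for how much it overlaps with the target. A stricter variant, sketched here on made-up data and not the method used in this article, is to require a minimum number of users who rated both films. The min_periods argument of the Pandas Series.corr() function does this.

```python
import numpy as np
import pandas as pd

# Hypothetical wide ratings: one row per user, one column per film.
toy_wide = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0, 3.0],
        "Sequel": [5.0, 4.0, 1.0, 2.0, 3.0],
        "Obscure": [np.nan, np.nan, np.nan, 3.0, 4.0],
    }
)
target = toy_wide["Target"]

# min_periods is the minimum number of users who must have rated both films.
# Pairs with fewer shared raters return NaN instead of a correlation.
overlap_corr = toy_wide.apply(lambda col: col.corr(target, min_periods=3))
overlap_corr = overlap_corr.dropna().sort_values(ascending=False)
print(overlap_corr)
```

The obscure film shares only two raters with the target, so it drops out, regardless of how many ratings it has in total. The threshold of 3 only suits this toy data; on the real dataset you would want something much higher.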

We can modify this very easily to give results for “people who liked this film also liked” simply by filtering against the mean rating. If we assume that a mean rating of 4 or more is a good indication of a film people really liked, adding this additional filter gives us a tweaked set of results with “better” films.

popular_similar_liked_movies = popular_similar_movies[popular_similar_movies['mean_rating'] >= 4]
popular_similar_liked_movies.sort_values(by='correlation', ascending=False).head(10)

      title                                              correlation  movieId  total_ratings  mean_rating
4209  Star Wars: Episode IV - A New Hope (1977)             1.000000    260.0          251.0     4.231076
4210  Star Wars: Episode V - The Empire Strikes Back...     0.777970   1196.0          211.0     4.215640
4211  Star Wars: Episode VI - Return of the Jedi (1983)     0.734230   1210.0          196.0     4.137755
2247  Inglourious Basterds (2009)                           0.448799  68157.0           88.0     4.136364
2240  Indiana Jones and the Last Crusade (1989)             0.410916   1291.0          140.0     4.046429
2675  Lord of the Rings: The Return of the King, The...     0.406602   7153.0          185.0     4.118919
3622  Raiders of the Lost Ark (Indiana Jones and the...     0.384779   1198.0          200.0     4.207500
2337  Jaws (1975)                                           0.372132   1387.0           91.0     4.005495
1793  Godfather, The (1972)                                 0.365920    858.0          192.0     4.289062
2227  Inception (2010)                                      0.356304  79132.0          143.0     4.066434
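To reuse these steps for any film, they can be wrapped in a single function. This is a rough sketch under a few assumptions: the function name and parameters are hypothetical, the statistics are calculated per title column of the wide matrix rather than per movieId, and the target film is dropped from its own results.

```python
import numpy as np
import pandas as pd


def recommend_similar(wide, target, min_ratings=50, min_mean_rating=None, top_n=10):
    """Rank films by rating correlation with `target` (hypothetical helper).

    `wide` has one row per user and one column per film, NaN where unrated.
    """
    stats = pd.DataFrame(
        {
            "correlation": wide.corrwith(wide[target]),
            "total_ratings": wide.count(),
            "mean_rating": wide.mean(),
        }
    )
    # Drop films with no computable correlation, and the target itself.
    stats = stats.dropna(subset=["correlation"]).drop(index=target)
    # Keep only films with enough ratings to be trustworthy.
    stats = stats[stats["total_ratings"] >= min_ratings]
    # Optionally keep only well-liked films.
    if min_mean_rating is not None:
        stats = stats[stats["mean_rating"] >= min_mean_rating]
    return stats.sort_values("correlation", ascending=False).head(top_n)


# Hypothetical toy data, so the thresholds are far lower than for MovieLens.
toy_wide = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0, 3.0],
        "Sequel": [5.0, 4.0, 1.0, 2.0, 3.0],
        "Obscure": [np.nan, np.nan, np.nan, 3.0, 4.0],
        "Rom Com": [1.0, 2.0, 5.0, 4.0, 3.0],
    }
)
result = recommend_similar(toy_wide, "Target", min_ratings=3)
print(result)
```

On the real data you would pass in df_user_ratings and a film title, leaving min_ratings at 50 and setting min_mean_rating to 4 for the “people who liked this also liked” variant.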

The collaborative filtering approach is just one of a number of different algorithms you can use for recommender systems. In the next part, we’ll try another approach which makes use of some other data within the MovieLens dataset.

Matt Clarke, Tuesday, March 02, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.