Build a Collaborative Filtering and Graph-based Movie Recommendation System using Streamlit

Darshan Kanade
15 min read · Feb 25, 2023

A movie recommender system using Collaborative Filtering and Pinterest’s graph-based recommendation algorithm, Pixie

Photo by Jens Kreuter on Unsplash

With the growing competition between OTT platforms, new schemes and strategies keep emerging to increase retention. And while big, attractive offers help, it is the content itself that is ultimately responsible for retaining (or losing) customers.

And so enters the “Recommender System (RS)”, an information filtering system that suggests the most relevant content to the user. The idea may sound simple, but it is one of the best ways to keep a customer engaged with the product: if users don’t see relevant content on your platform, they are likely to turn to your competitor.

Now, this relevance of an item to a user may be determined in a few ways, which brings us to the three basic types of Recommender Systems:

  1. Content based Recommender Systems: These treat the problem as a typical Machine Learning classification or regression task, using the meta-information about both users and items as the features of the dataset; the class label is the rating given by the user to that movie.
  2. Similarity based Recommender Systems: These are either user-user or item-item similarity based, where similarity is computed using any similarity metric like cosine similarity. So to recommend movies to a user ‘u’ we can either look at the user-user similarity matrix and recommend movies watched by users similar to ‘u’, or look at the item-item similarity matrix and recommend movies similar to those which ‘u’ has already watched.
  3. Collaborative Filtering Recommender Systems: Intuitively, these are very similar to similarity based RS and are often considered the same. However, here I’m differentiating the two on account of the mathematical approach behind them. Mathematically, collaborative filtering solves a matrix completion task for a user-item matrix (A) whose elements (Aᵤᵢ) are the ratings given by a user ‘u’ to an item ‘i’. This is achieved using Matrix Factorization (MF) techniques like SVD, NNMF, etc., owing to which it is also called “Matrix Factorization based Recommender Systems” (a toy sketch follows below).
Collaborative Filtering Recommender System
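To make the matrix completion idea concrete, here is a toy sketch with entirely made-up ratings and a bare-bones gradient descent. This is not the approach we use later (the real model will come from the Surprise library), just an illustration of factorizing A ≈ PQᵀ on the observed entries only:

import numpy as np

# Toy user-item ratings matrix: rows = users, columns = movies, 0 = unrated
A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = A > 0                             # fit only the observed ratings

k = 2                                    # number of latent factors
rng = np.random.default_rng(42)
P = rng.normal(scale=0.1, size=(4, k))   # user factors
Q = rng.normal(scale=0.1, size=(4, k))   # movie factors

lr, reg = 0.01, 0.02
for _ in range(5000):
    E = (A - P @ Q.T) * mask             # error on observed entries only
    P, Q = P + lr*(E @ Q - reg*P), Q + lr*(E.T @ P - reg*Q)

print(np.round(P @ Q.T, 2))              # the zeros are now filled with predictions

After fitting, the reconstruction P @ Q.T fills in the missing (zero) entries with predicted ratings, which is exactly the “completion” part of matrix completion.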

Phew! That was a lot of information to pack into a short intro. However, of these we are only going to implement the Collaborative Filtering RS. So if it’s not clear yet, fret not; we will encounter it again later.

But before that, here’s one more addition to the above three techniques, Graph based Recommendation System:

So, when I read about Pixie (no, not that haircut) in this article, I was determined to try it out for movie recommendations. If you haven’t heard about Pixie, it is Pinterest’s graph based algorithm built for suggesting ‘Related Pins’ in real-time with low latency. Unlike, say, Netflix, Pinterest has to power recommendations over more than 100 billion ideas saved by 150 million people, and that too in real time, jeez! The way Pixie works (oh and such a cute name by the way (^-^)) is as follows:

It creates a bipartite graph of Pins and Boards. When a user saves a Pin ‘q’ to a board, Pixie starts a biased random walk from ‘q’ for 100,000 steps, the path alternating Pins-Boards-Pins. From this it determines the 1000 Pins that were hit the most during the walk and recommends them to the user. And oh well, the walk is “biased” because we set the probability of returning to the source node ‘q’ to 0.5, which ensures that we don’t stray too far from the query Pin ‘q’. A toy sketch of this walk follows the figure below.

Pins V/s Boards Bipartite Graph
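Here is a minimal, hypothetical sketch of such a Pixie-style walk on a toy Pin-Board graph. The graph, names and parameters are all made up for illustration; the real Pixie adds many more optimizations on top:

import random
from collections import Counter

def pixie_walk(graph, q, n_steps=100_000, alpha=0.5, seed=42):
    # graph: dict mapping each node (Pin or Board) to a list of its neighbours
    # alpha: probability of jumping back to the source Pin q at each step
    rng = random.Random(seed)
    counts = Counter()
    node = q
    for _ in range(n_steps):
        if rng.random() < alpha:
            node = q                        # biased: return to the query Pin
        board = rng.choice(graph[node])     # Pin -> Board
        node = rng.choice(graph[board])     # Board -> Pin
        counts[node] += 1
    counts.pop(q, None)                     # don't recommend the query Pin itself
    return counts

# Toy graph: Pins p1..p4, Boards b1..b2
g = {'p1': ['b1'], 'p2': ['b1', 'b2'], 'p3': ['b2'], 'p4': ['b2'],
     'b1': ['p1', 'p2'], 'b2': ['p2', 'p3', 'p4']}
print(pixie_walk(g, 'p1', n_steps=10_000).most_common(3))

The most frequently visited Pins then play the role of Pinterest’s 1000 ‘Related Pins’.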

For our movie recommendations, we will create a bipartite graph of movies on one side and genre + plot keywords on the other.

Now that the theory is behind us, let’s get going…

Like any Machine Learning problem, we follow the steps of the data science life cycle.

Problem Statement:

Given the movie metadata and user ratings on different movies, we have to recommend new movies to the user.

Data Overview:

The dataset is downloaded from here. It is the redistributed form of the original MovieLens¹ dataset ml-latest-small. You can download and read more about it at the link provided. These are the files that we care about:

  • movies_metadata.csv
  • keywords.csv
  • links_small.csv
  • ratings_small.csv

If you have enough RAM on your machine you can download the larger files for ratings and links.

import pandas as pd

path_ = './data/'   # folder containing the downloaded csv files (adjust to your setup)
movies = pd.read_csv(path_+'movies_metadata.csv')
ratings = pd.read_csv(path_+'ratings_small.csv')
keywords = pd.read_csv(path_+'keywords.csv')
links = pd.read_csv(path_+'links_small.csv')

Exploratory Data Analysis:

In this stage we perform data cleaning, preprocessing, data analysis, etc., which I won’t go into in much detail.

  • The ratings dataframe consists of the columns: movieId, userId, rating and timestamp.
  • The links dataframe consists of the columns: movieId, imdbId and tmdbId.
  • The movies dataframe consists of many columns, most of which are unnecessary; we get rid of all but a few.
movies.columns
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
'imdb_id', 'original_language', 'original_title', 'overview',
'popularity', 'poster_path', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'video',
'vote_average', 'vote_count'],
dtype='object')

We merge these three dataframes along with the keywords dataframe to get movie_ratings with the columns: movieId, title, genres, keywords, tmdbId, userId, rating and timestamp (a sketch of the merge follows the note below).

Note: movieId is a simple running series 1, 2, 3, … and so on, while we need tmdbId for fetching movie details from the TMDB API during deployment of the model.
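Here is a rough sketch of that merge. The exact calls are my assumption; the ids are assumed to be already cleaned and cast to a common integer dtype, and in movies_metadata.csv and keywords.csv the TMDB id lives in a column named id, hence the renames:

# sketch of the merge (column selections/renames are assumptions)
movies_small = movies.rename(columns={'id': 'tmdbId'})[['tmdbId', 'title', 'genres']]
keywords_small = keywords.rename(columns={'id': 'tmdbId'})

movie_ratings = (ratings
                 .merge(links[['movieId', 'tmdbId']], on='movieId')
                 .merge(movies_small, on='tmdbId')
                 .merge(keywords_small, on='tmdbId'))
# columns: userId, movieId, rating, timestamp, tmdbId, title, genres, keywords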

Then we perform the train-test split as follows:

import os

if not (os.path.isfile(path_+'xtrain.csv') and os.path.isfile(path_+'xtest.csv')):
    # 80-20 split by row order
    movie_ratings[:int(len(movie_ratings)*0.8)].to_csv(path_+'xtrain.csv', index=False)
    movie_ratings[int(len(movie_ratings)*0.8):].to_csv(path_+'xtest.csv', index=False)

X_train = pd.read_csv(path_+'xtrain.csv')
X_test = pd.read_csv(path_+'xtest.csv')

This is the right time to introduce you to a classic problem with recommender systems: the Cold Start Problem! It arises when you encounter a new user, one who hasn’t watched any movies yet. Without any knowledge of their likes and dislikes, how do we recommend items to them? The same goes for a new item: to whom do we recommend a movie that nobody has rated yet?

train_movies = set(X_train.movieId.values)
test_movies = set(X_test.movieId.values)
total_movies = train_movies.union(test_movies)
print('No. of movies in train set:', len(train_movies))
print('No. of movies in test set:', len(test_movies))
print('No. of movies present in test set but not in train set:', len(test_movies-train_movies))
print('Percentage of movies present in test but not in train set of all the movies: {}%'.format(len(test_movies-train_movies)/len(total_movies)*100))
No. of movies in train set: 7328
No. of movies in test set: 4732
No. of movies present in test set but not in train set: 1697
Percentage of movies present in test but not in train set of all the movies: 18.803324099722992%
train_users = set(X_train.userId.values)
test_users = set(X_test.userId.values)
total_users = train_users.union(test_users)
print('No. of users in train set:', len(train_users))
print('No. of users in test set:', len(test_users))
print('No. of users present in test set but not in train set:', len(test_users-train_users))
print('Percentage of users present in test but not in train set of all the users: {}%'.format(len(test_users-train_users)/len(total_users)*100))
No. of users in train set: 546
No. of users in test set: 148
No. of users present in test set but not in train set: 125
Percentage of users present in test but not in train set of all the users: 18.628912071535023%

In our case, roughly the same percentage of movies and users occur in the test set but not in the train set. One way to tackle this issue is to use the meta-data of the user/item, like the user’s age or location, or the item’s category, genre, actors, director, etc., to recommend it to the relevant users. However, if we don’t have this information, as is our case, we can simply recommend the most popular movies on the platform to a new user: a Popularity Based Recommender System (also known as Demographic Filtering), a generalized system. Let’s build one!

But hang on! At this point I want you to visualize the workflow of the final product. So, keeping aside the traditional data science life cycle stages for now, I’ll talk about the deployment stage as well.

So, we will design a website/app that represents the homepage of a random user from the dataset. For this we use numpy’s random sampling method numpy.random.choice() to randomly sample a userId from the dataset. We then get the list of movies this user has watched. If this list is empty, we know the user is a ‘new user’, in which case we display the 20 most popular movies. If the list is not empty, we display the top 10 recommendations obtained using the Collaborative Filtering approach. For an old user we also display their top 5 favourite movies, as a way for us to judge whether the top 10 Collaborative Filtering recommendations are similar to what they already like.

Flowchart for app workflow

We use Streamlit for its ease of use to build our app. We set the page configuration as shown below.

import streamlit as st
import numpy as np

st.set_page_config(
    page_title="Movie App",
    page_icon=":film_projector:",
    layout="centered",
    initial_sidebar_state="expanded")

user_no = np.random.choice(range(0, 671))   # randomly pick a user for this session

1. Popularity Based Recommender System

For popularity, we choose only those movies which are rated by more than 100 users. top20_ids contains the 20 most popular movies in our dataset, ranked by average rating.

top_bool = X_train.groupby('movieId').count()['rating']    # number of ratings per movie
top_ind = top_bool[top_bool>100].index                     # movies rated by more than 100 users
top_movies = X_train[X_train.movieId.isin(top_ind)]
top20_ids = top_movies.groupby('movieId').rating.mean().sort_values(ascending=False)[:20].index   # top 20 by average rating

2. Collaborative Filtering Recommender System

And we are back to it! As mentioned earlier, it uses a Matrix Factorization technique at its core, which helps capture the implicit interactions of a user with a movie. Getting into the inner workings of the algorithm would be a whole discussion of its own, which is not the focus here. Also, although we could write the entire algorithm from scratch, we will resort to the more powerful implementation in the Surprise library, which also contains several variations of the algorithm that I tried out.

The trainset for the Surprise library is an object of the surprise.trainset.Trainset class, which has to be built as shown below. The testset, on the other hand, is just a list of (user, item, rating) tuples.

import surprise
from surprise import Reader, Dataset

train_ratings = X_train[['userId','movieId','rating']]
test_ratings = X_test[['userId','movieId','rating']]

# Trainset
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(train_ratings, reader)
trainset = data.build_full_trainset()

# Testset: a list of (user, item, rating) tuples
testset = [tuple(row) for row in test_ratings.values]

We will test three models (BaselineOnly, SVD, SVD++) available in the Surprise library and choose the one that performs best. The performance metric will be RMSE. The output of the algorithm’s test/predict method is a list of Predictions; an example is shown below.

[Prediction(uid=383, iid=47, r_ui=5.0, est=3.614847468612193, details={'was_impossible': False}),
Prediction(uid=383, iid=1079, r_ui=3.0, est=3.5756706411528865, details={'was_impossible': False}),
...]

where uid = userId, iid = movieId, r_ui = true rating and est = predicted rating. To extract the relevant data we define the following two helper functions:

import numpy as np

def get_predictions(results):
    # split a list of surprise Predictions into true and predicted rating arrays
    true = np.array([i.r_ui for i in results])
    pred = np.array([i.est for i in results])
    return true, pred

def rmse(true, pred):
    return np.sqrt(np.mean((true-pred)**2))

Now we can train our models. You can read the documentation for each. SVD, for example, is the famous SVD algorithm, the one popularized by Simon Funk during the Netflix Prize. TUDUM!

It tries to minimize the squared error between the true and predicted rating using SGD (Stochastic Gradient Descent); what the variables below mean can be understood from the Netflix Prize link provided above.

$$\min \sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2\right), \qquad \hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u$$

equation for minimizing the error between true and predicted rating
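For intuition, one SGD epoch for this objective looks roughly like the sketch below. This is my own hedged rendition of the standard Funk-SVD updates, not Surprise’s actual code (Surprise implements the equivalent internally):

def sgd_epoch(ratings, mu, bu, bi, P, Q, gamma=0.005, lam=0.02):
    # ratings: iterable of (u, i, r) triples; mu: global mean rating
    # bu, bi: per-user/per-item bias arrays; P, Q: factor matrices (one row each)
    for u, i, r in ratings:
        err = r - (mu + bu[u] + bi[i] + Q[i] @ P[u])   # e_ui = r_ui - r̂_ui
        bu[u] += gamma * (err - lam * bu[u])
        bi[i] += gamma * (err - lam * bi[i])
        # update both factor vectors simultaneously
        P[u], Q[i] = (P[u] + gamma * (err * Q[i] - lam * P[u]),
                      Q[i] + gamma * (err * P[u] - lam * Q[i]))

Back to training with Surprise: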
from surprise import BaselineOnly, SVD, SVDpp

if not os.path.isfile(path_+'svdalgo.pkl'):
    algo = SVD(n_factors=100, biased=True, random_state=42)
    algo.fit(trainset)
    surprise.dump.dump(path_+'svdalgo.pkl', algo=algo)
    print('Done Dumping...!')

algo = surprise.dump.load(path_+'svdalgo.pkl')[1]   # load returns a (predictions, algo) tuple

The trainset object that we built previously is required only for fitting the data (the fit method); for testing we need the same format as our testset. So, we do this:

train_test = trainset.build_testset()
train_results = algo.test(train_test)#, verbose=True)
test_results = algo.test(testset)#, verbose=True)

true,pred = get_predictions(train_results)
print('Train RMSE:', rmse(true,pred))
true,pred = get_predictions(test_results)
print('Test RMSE:', rmse(true,pred))

Here are the results for the three algorithms I tried. The SVD++ algorithm is chosen, as it performed slightly better than SVD.

In the recommend_collab() function we implement the process explained in the flowchart earlier. movies_watched is the set of movies rated by the user; if it is empty, we recommend top20_ids (the popular movies); otherwise, we predict the user’s ratings on all the unwatched movies using svdpp_algo, choose the top10_ids predicted to be rated highest, and recommend those. liked_ids contains the 5 movies the user rated best.

svdpp_algo = surprise.dump.load(path_+'svdpp_algo.pkl')[1]   # load returns a (predictions, algo) tuple

def recommend_collab(user_id):
    movies_watched = set(train_ratings[train_ratings.userId==user_id].movieId.values)

    if len(movies_watched)==0:   # cold start: new user
        return [], []

    # train_movies here is the dataframe of unique movies in the train split
    movies_unwatched = set(train_movies.movieId) - movies_watched

    results = []
    for mid in movies_unwatched:
        results.append(svdpp_algo.predict(user_id, mid))

    df = pd.DataFrame([(i.iid, i.est) for i in results], columns=['movieId', 'rating']).sort_values('rating', ascending=False)
    top10_ids = df.movieId[:10]

    liked_ids = train_ratings[train_ratings.userId==user_id].sort_values('rating', ascending=False).movieId.values[:5]
    return (top10_ids, liked_ids)

top10_ids, liked_ids = recommend_collab(user_no)

3. Graph based Recommender System

Finally!! Drumroll please :) In this system, we recommend 5 movies similar to the movie the user looks for via the search (or selection) bar, as shown below.

As I mentioned earlier, we first create a bipartite graph of movies against genres and plot keywords.

Movie V/s Genres and Keywords Bipartite Graph

One more important thing: during EDA it was found that some general keywords like man, ship, love, pen, etc. occur far more frequently than specific ones like man vs machine or artificial intelligence; likewise, genres like drama, comedy, thriller and romance are much more common than others. These nodes, although very important, may be visited disproportionately often during the random walk, even if the query movie is, say, an animation, a rare genre in the dataset.

So, to avoid this, we assign a weight to each link/edge such that the most frequent keywords and genres are given less importance. A simple trick is to compute the distribution of movies across each unique genre and keyword, and take the inverse of these values as the weight of an edge connecting a movie to that genre/keyword.

We also give 20 times more weight to genres, since genre is, well, a pretty important factor in obtaining similar movies.

import networkx as nx
from networkx.algorithms import bipartite
import pickle

dist = []
den = len(train_movies)
for i in genres:   # genres is a set of all the 20 unique genres
    v = len(train_movies[train_movies.genres.str.match('.*'+i+'.*')==True])/den*100   # % of movies with genre i
    dist.append((v, i))
dist = pd.DataFrame(dist).sort_values(0)
dist[0] = dist[0].apply(lambda x: x**(-1))   # inverse frequency as weight
weights_genre = dict(zip(list(dist[1].values), list(dist[0].values)))


from tqdm import tqdm

dist = []
den = len(train_movies)
for i in tqdm(keywords):   # keywords is a set of all the 11611 unique keywords
    if i=='': continue     # skip the empty keyword
    v = len(train_movies[train_movies.keywords.str.match('.*'+i+'.*')==True])/den*100
    dist.append((v, i))
dist = pd.DataFrame(dist).sort_values(0)
dist[0] = dist[0].apply(lambda x: (x+1)**(-1))   # smoothed inverse frequency
weights_keywords = dict(zip(list(dist[1].values), list(dist[0].values)))

import ast

# create edges: (movie node, keyword/genre node, edge weight)
edges = []
for row in train_movies.iterrows():
    mid = row[1].movieId
    edges.extend([(mid, k, weights_keywords.get(k,0)*10) for k in ast.literal_eval(row[1].keywords)])
    edges.extend([(mid, g, weights_genre[g]*200) for g in ast.literal_eval(row[1].genres)])   # genres get 20x the keyword weight
edges[:10]   # peek at a few edges
data = pd.DataFrame(edges, columns=['movieId','keywords','weights'])

# create graph
if not os.path.isfile(path_+'graph.pkl'):
    B = nx.Graph()
    B.add_nodes_from(data.movieId.unique(), bipartite=0, label='movie')
    B.add_nodes_from(data.keywords.unique(), bipartite=1, label='keygen')
    B.add_weighted_edges_from(edges)
    pickle.dump(B, open(path_+'graph.pkl','wb'))

Here’s the node for the movie ‘Toy Story’, showing the links with its genres (family, animation, comedy) and keywords. Note how highly weighted genres and keywords sit pretty close to the node.

sub = nx.ego_graph(B, 1)   # Toy Story (movieId 1)
nx.draw_networkx(sub)
weighted edges of ‘Toy Story’ movie node

Without the weights, the random walk would visit movies through frequent nodes like friendship, rivalry, jealousy, comedy, friends, boy, etc., the recommendations being rom-coms, love triangles, crime and such. Oh god!

We choose a walk length of 10,000; two other important parameters are p and q.

  • 1/p defines the probability with which we return to the source node
  • 1/q defines the probability with which we move away from the source node.

Thus, p should be low for 1/p to be high, and the exact opposite holds for q. For Pinterest’s Pixie we saw that 1/p was 0.5, but this did not work too well with the movies dataset that we have. So I tweaked these two parameters over various combinations of values, tested the results on a few movies, and finally settled on p=0.01 and q=100.
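For the curious: StellarGraph’s BiasedRandomWalk performs node2vec-style second-order walks. Having just stepped from node t to node v, the unnormalized weight of moving on to a neighbour x of v is scaled by

α(t, x) = 1/p   if x = t (step back to the previous node)
α(t, x) = 1     if x is a neighbour of t
α(t, x) = 1/q   otherwise (move further away)

So with p = 0.01 and q = 100, the return weight is 100 while the move-away weight is 0.01, keeping the walk tightly anchored around the query movie.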

The random walk contains movieIds, genres and keywords. We get rid of the strings and keep just the movieIds. After this we use sklearn’s CountVectorizer to count the occurrences of the movieIds and then pick the top5_ids that occur the most.

from stellargraph.data import BiasedRandomWalk
from stellargraph import StellarGraph
from sklearn.feature_extraction.text import CountVectorizer

B = pickle.load(open(path_+'graph.pkl','rb'))

def graph_recommend(q):
    rw = BiasedRandomWalk(StellarGraph(B))
    walk = rw.run(nodes=[q], n=1, length=10000, p=0.01, q=100, weighted=True, seed=42)

    # with weight proportional to 1/p, the walk returns to the source node
    # with weight proportional to 1/q, it moves away from the source node
    # shape of walk: (1, 10000)

    walk = list(filter(lambda x: type(x)==int, walk[0]))   # drop keyword and genre nodes... left with only movieIds
    walk = list(map(str, walk))   # for CountVectorizer
    walk = ' '.join(walk)         # ['m1','m2','m3'] ====> 'm1 m2 m3'... for tokenization

    vocab = {str(mov): ind for ind, mov in enumerate(train_movies.movieId.sort_values().unique())}   # movieId: index
    vec = CountVectorizer(vocabulary=vocab)
    embed = vec.fit_transform([walk])

    reverse_vocab = {v: int(k) for k, v in vocab.items()}   # index: movieId
    embed = np.array(embed.todense())[0]

    top5_ids = []
    for ind in embed.argsort()[::-1]:   # indices sorted by visit count, descending
        if len(top5_ids)==5: break
        movid = reverse_vocab[ind]
        if movid != q: top5_ids.append(movid)   # never recommend the query movie itself

    return top5_ids

Now coming to the final act!

We are almost done. Now, we just have to implement the UI (User Interface) of our application.

  • For both the “new user” and the “old user” we have a search/selection bar to look up any movie in our dataset. Other than that,
  • the “new user” homepage consists of the 20 popular movie recommendations;
  • the “old user” homepage consists of a 5 favourite movies block and the 10 Collaborative Filtering recommendations;
  • when the user searches for a movie, the movie details are displayed under the selection bar, below which is a block showing 5 movies similar to it, obtained from the graph based recommender.
a) new user b) old user c) search results

This is how we make it happen:


if len(top10_ids)==0:   # new user (no watch history)
    st.header('Hello stranger!!')

    show_bar()
    st.title('Most Popular movies on the platform')
    show_blocks(top20_ids, [4,8,12,16,20])

else:                   # old user
    st.header('Welcome user '+str(user_no))
    show_bar()
    st.title('Your Favourites...')
    show_blocks(liked_ids, [1,2,3,4,5])
    st.title('Based on your taste...')
    show_blocks(top10_ids, [2,4,6,8,10])

The above code uses two user-defined functions: show_bar() to display the selection bar and show_blocks() to display the movie tiles. The list passed to show_blocks() defines the boundaries at which the list of movie ids is split across the five columns; for example, [4,8,12,16,20] puts tiles 1-4 in the first column, tiles 5-8 in the second, and so on. You’ll understand it better when you look at the function definition.
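By the way, once everything sits in a single script (say app.py, a filename of my choosing), the app is launched from the terminal with Streamlit’s CLI:

streamlit run app.py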

Another important thing: the search results page shows details of the movie like its poster, plot overview, release date, runtime, average rating and number of votes. We fetch these details from the TMDB API. You’ll have to create an account, fill in a few details and request an API key for your project.

import requests

base_url = 'https://image.tmdb.org/t/p/w500'   # base url for poster images

def fetch_details(tmdbId):
    response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key=<<your_api_key>>&language=en-US'.format(tmdbId))
    return response.json()

def show_blocks(id_lst, steps):
    # fetch (title, poster url) for every movie id
    lst = []
    for ind in id_lst:
        tid = train_movies[train_movies.movieId==ind].tmdbId.values[0]
        details = fetch_details(tid)
        lst.append((details['title'], base_url+details['poster_path']))

    # split the tiles across 5 columns at the boundaries given by steps
    bounds = [0] + list(steps)
    for col, start, end in zip(st.columns(5), bounds[:-1], bounds[1:]):
        with col:
            for name, img in lst[start:end]:
                st.image(img)
                st.text(name)

def show_bar():
    mov_name = st.selectbox('What are you looking for...?', train_movies.title.values)
    mid, tid = train_movies[train_movies.title==mov_name][['movieId','tmdbId']].values[0]
    details = fetch_details(tid)

    if st.button('Search'):

        col1, col2 = st.columns(2)
        with col1:
            st.image(base_url+details['poster_path'])
        with col2:
            st.header(details['title'])
            st.caption(details['tagline'])
            st.write(details['overview'])
            st.markdown("**Released in {}**".format(details['release_date']))
            st.write('Runtime: {} mins'.format(details['runtime']))
            st.write('Avg. Rating: {} :star: Votes: {} :thumbsup:'.format(details['vote_average'], details['vote_count']))

        st.header("More like this...")
        top5_ids = graph_recommend(mid)
        show_blocks(top5_ids, [1,2,3,4,5])

And so we are done at last! Now let’s see how well we are doing with the recommendations. In the case of the graph (genre and keyword) based recommendations we are doing pretty well, as seen in the Toy Story example, considering that it shows a few animation movies. For the Collaborative Filtering recommendations we can’t say for sure, because the movies are recommended based on user-movie interactions and user-user/movie-movie similarity; but looking at the example below, it seems we are doing a decent job. The user seems to have an interest in crime movies and has also watched ‘The Lord of the Rings’. Our system too recommends some crime movies, and although we don’t see any Batman movies, we do see another ‘The Lord of the Rings’ movie there. So yeah!

Lastly,

I’m really thankful if you stayed till here! We have implemented a Movie Recommendation App using three recommendation techniques: Popularity based, Collaborative Filtering based and Graph based. The results are decent, and of course there’s always scope for improvement: maybe we can try different values for the random walk parameters p and q, or a different weighting method for the edges. Let me know if you discover something.

Also, if you’re like me, you might have wondered whether we could use the graph based method to do collaborative filtering itself, with a graph of ‘movie’ and ‘user’ nodes! Well, if you did, you may want to check this blog.

You can find the code on my GitHub profile. If you found this article helpful, please clap for it or leave feedback below. Thank you!

[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://dl.acm.org/doi/10.1145/2827872
