Recommender System with Collaborative Filtering


A recommender system is a type of information filtering system that predicts a user’s preferences for items (such as movies, books, music, products, etc.) and suggests relevant items to the user. These systems are widely used in e-commerce platforms, streaming services, social media platforms, and many other applications where personalized recommendations can enhance user experience and engagement.

There are several types of recommender systems, including:

Collaborative Filtering: This approach recommends items based on the preferences of users who have similar tastes. It doesn’t require explicit knowledge about the items or users, but rather relies on the patterns and similarities in user behavior.
Content-Based Filtering: Content-based filtering recommends items based on their attributes and features. It analyzes the characteristics of both the items and the user’s preferences to make recommendations. For example, recommending movies based on their genre, actors, directors, etc., and matching them with the user’s historical preferences.
Hybrid Recommender Systems: Hybrid systems combine multiple recommendation techniques to provide more accurate and diverse recommendations. For instance, combining collaborative filtering and content-based filtering to leverage the strengths of both approaches.
Knowledge-Based Recommender Systems: Knowledge-based systems recommend items based on explicit knowledge about user preferences, domain-specific rules, or constraints. These systems are often used in domains where there is rich domain knowledge available.
Context-Aware Recommender Systems: Context-aware systems take into account contextual information such as time, location, and device used when making recommendations. For example, recommending nearby restaurants based on a user’s current location and time of day.
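To make the collaborative filtering idea concrete, here is a minimal sketch on a hypothetical 4-user, 4-item rating matrix (the numbers are invented for illustration and are not from MovieLens). It compares items purely by who rated them and how, with no item attributes involved.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items; 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def item_cosine_similarity(r):
    """Cosine similarity between the item columns of a user-item matrix."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalized = r / norms
    return normalized.T @ normalized

item_sim = item_cosine_similarity(ratings)

# Items ranked by similarity to item 0, excluding item 0 itself
neighbours = np.argsort(-item_sim[0])
neighbours = neighbours[neighbours != 0]
print(neighbours[0])  # index of the item whose rating pattern is closest to item 0
```

Items 0 and 1 are rated highly by the same two users, so item 1 comes out as the nearest neighbour of item 0. The rest of this post applies the same idea at MovieLens scale, after compressing the rating matrix with SVD.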

Collaborative Filtering

import numpy as np
import pandas as pd
import sys

movies = pd.read_csv("./ml-20m/movies.csv")
tags = pd.read_csv("./ml-20m/tags.csv")
ratings = pd.read_csv("./ml-20m/ratings.csv", nrows=16000000)

movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy

tags.head()

	userId	movieId	tag	timestamp
0	18	4141	Mark Waters	1240597180
1	65	208	dark hero	1368150078
2	65	353	dark hero	1368150079
3	65	521	noir thriller	1368149983
4	65	592	dark hero	1368150078

ratings.head()

	userId	movieId	rating	timestamp
0	1	2	3.5	1112486027
1	1	29	3.5	1112484676
2	1	32	3.5	1112484819
3	1	47	3.5	1112484727
4	1	50	3.5	1112484580

tags.drop(['timestamp'], axis=1, inplace=True)
ratings.drop(['timestamp'], axis=1, inplace=True)

len(ratings.movieId.unique())

# '|' is a regex metacharacter, so replace it as a literal string
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)

# Restrict to users who have rated at least 60 movies
ratings_df = ratings.groupby('userId').filter(lambda x: len(x) >= 60)

ratings_df.shape

(14248972, 3)

ratings.shape

(16000000, 3)

len(ratings.userId.unique())

len(ratings_df.userId.unique())
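On 16 million rows, calling a Python lambda once per user can be slow. A possible alternative, sketched here on a hypothetical toy frame (`toy_ratings` and `min_ratings` are illustrative names, not from the notebook), is to broadcast the per-user count back to each row and filter with a boolean mask. It should give the same rows as the `filter` call above.

```python
import pandas as pd

# Hypothetical toy ratings frame with the same columns as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})
min_ratings = 2  # the post uses 60

# Number of ratings per user, repeated on every row that belongs to that user
counts = toy_ratings.groupby("userId")["movieId"].transform("size")
active = toy_ratings[counts >= min_ratings]

# The filter-with-lambda form used in the post, for comparison
expected = toy_ratings.groupby("userId").filter(lambda x: len(x) >= min_ratings)
print(active.equals(expected))
```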

# Keep only the movies that appear in ratings_df
ratings_movie_list = ratings_df['movieId'].unique().tolist()
movies = movies[movies['movieId'].isin(ratings_movie_list)]
movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy
1	2	Jumanji (1995)	Adventure Children Fantasy
2	3	Grumpier Old Men (1995)	Comedy Romance
3	4	Waiting to Exhale (1995)	Comedy Drama Romance
4	5	Father of the Bride Part II (1995)	Comedy

merged_df = pd.merge(movies, tags, on='movieId', how='left')
merged_df.head()

	movieId	title	genres	userId	tag
0	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy	1644.0	Watched
1	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy	1741.0	computer animation
2	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy	1741.0	Disney animated feature
3	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy	1741.0	Pixar animation
4	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy	1741.0	Téa Leoni does not star in this movie

merged_df.fillna("", inplace=True)
merged_df = pd.DataFrame(merged_df.groupby('movieId')['tag'].apply(' '.join))

merged_df.head()

	tag
movieId
1	Watched computer animation Disney animated fea...
2	time travel adapted from:book board game child...
3	old people that is actually funny sequel fever...
4	chick flick revenge characters chick flick cha...
5	Diane Keaton family sequel Steve Martin weddin...

final_df = pd.merge(movies, merged_df, on='movieId', how='left')

final_df.head()

	movieId	title	genres	tag
0	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy	Watched computer animation Disney animated fea...
1	2	Jumanji (1995)	Adventure Children Fantasy	time travel adapted from:book board game child...
2	3	Grumpier Old Men (1995)	Comedy Romance	old people that is actually funny sequel fever...
3	4	Waiting to Exhale (1995)	Comedy Drama Romance	chick flick revenge characters chick flick cha...
4	5	Father of the Bride Part II (1995)	Comedy	Diane Keaton family sequel Steve Martin weddin...

final_df['metadata'] = final_df[['tag', 'genres']].apply(' '.join, axis=1)

final_df.head()

	movieId	title	genres	tag	metadata
0	1	Toy Story (1995)	Adventure Animation Children Comedy Fantasy	Watched computer animation Disney animated fea...	Watched computer animation Disney animated fea...
1	2	Jumanji (1995)	Adventure Children Fantasy	time travel adapted from:book board game child...	time travel adapted from:book board game child...
2	3	Grumpier Old Men (1995)	Comedy Romance	old people that is actually funny sequel fever...	old people that is actually funny sequel fever...
3	4	Waiting to Exhale (1995)	Comedy Drama Romance	chick flick revenge characters chick flick cha...	chick flick revenge characters chick flick cha...
4	5	Father of the Bride Part II (1995)	Comedy	Diane Keaton family sequel Steve Martin weddin...	Diane Keaton family sequel Steve Martin weddin...

final_df.shape

(25093, 5)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
tfidf_mat = tfidf.fit_transform(final_df['metadata'])

tfidf_df = pd.DataFrame(tfidf_mat.toarray(), index=final_df.index.tolist()) 
tfidf_df.head()

	0	1	2	3	4	5	6	7	8	9	...	23519	23520	23521	23522	23523	23524	23525	23526	23527	23528
0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 23529 columns

tfidf_df.shape  #each row is a movie

(25093, 23529)
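A dense 25093 × 23529 float matrix takes several gigabytes of memory. `TruncatedSVD` accepts SciPy sparse input, so the TF-IDF output can be passed to it without calling `toarray()`. The sketch below shows this on a hypothetical five-document corpus; the strings and `n_components=2` are placeholders for `final_df['metadata']` and the 200 components used in this post.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical metadata strings standing in for final_df['metadata']
docs = [
    "pixar animation toys children comedy",
    "pixar animation fish ocean children",
    "board game jungle adventure children",
    "noir thriller dark hero crime",
    "dark hero crime gotham action",
]

# fit_transform returns a SciPy sparse matrix; it is never densified here
tfidf_sparse = TfidfVectorizer(stop_words="english").fit_transform(docs)

# n_components=2 only suits this toy corpus
svd_sketch = TruncatedSVD(n_components=2, random_state=0)
latent = svd_sketch.fit_transform(tfidf_sparse)  # dense array, one row per document
print(latent.shape)
```

The reduced matrix is small and dense, so wrapping it in a DataFrame indexed by title, as done below, stays cheap.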

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=200)
latent_matrix = svd.fit_transform(tfidf_df)

latent_matrix_df1 = pd.DataFrame(latent_matrix[:,0:200], index = final_df['title'].tolist())

import matplotlib.pyplot as plt
%matplotlib inline

explained_var = svd.explained_variance_ratio_.cumsum()
plt.plot(explained_var, '.-')
plt.xlabel("SVD components")
plt.ylabel("cumulative variance explained (fraction)")
plt.show()

[Figure: cumulative explained variance ratio versus number of SVD components, content (TF-IDF) matrix]
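The cumulative curve can also be read programmatically. As a rough sketch, the snippet below finds how many components are needed to reach a target share of variance. The ratios and the 0.70 target are made-up values for illustration; in the notebook the array would be `svd.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical per-component ratios, standing in for svd.explained_variance_ratio_
explained_variance_ratio = np.array([0.30, 0.20, 0.15, 0.10, 0.05, 0.05])
target = 0.70  # assumed threshold, not a value from the post

cumulative = explained_variance_ratio.cumsum()
if cumulative[-1] >= target:
    # Index of the first cumulative value that reaches the target, converted to a count
    n_needed = int(np.searchsorted(cumulative, target) + 1)
else:
    n_needed = None  # the fitted components never reach the target

print(n_needed)
```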

latent_matrix.shape

(25093, 200)

ratings_df.head()

	userId	movieId	rating
0	1	2	3.5
1	1	29	3.5
2	1	32	3.5
3	1	47	3.5
4	1	50	3.5

ratings_df1 = pd.merge(movies[["movieId"]], ratings_df, on="movieId", how = "right")

ratings_df1.head()

	movieId	userId	rating
0	1	3	4.0
1	1	8	4.0
2	1	11	4.5
3	1	13	4.0
4	1	14	4.5

ratings_df2 = ratings_df1.pivot(index='movieId', columns='userId', values = 'rating').fillna(0)
ratings_df2.head()

userId	1	2	3	5	7	8	11	13	14	16	...	110703	110706	110707	110708	110710	110711	110712	110714	110722	110724
movieId
1	0.0	0.0	4.0	0.0	0.0	4.0	4.5	4.0	4.5	3.0	...	0.0	5.0	0.0	5.0	0.0	4.0	0.0	3.0	4.0	0.0
2	3.5	0.0	0.0	3.0	0.0	0.0	0.0	3.0	0.0	0.0	...	0.0	4.0	3.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	4.0	0.0	0.0	3.0	5.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	3.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	3.0	2.0	0.0	0.0	0.0	0.0	0.0	0.0	3.0

5 rows × 60448 columns

ratings_df2.shape

(25093, 60448)
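The pivoted matrix is mostly zeros, since each user rates only a small fraction of the movies. One way to avoid holding all 25093 × 60448 cells in memory is to build a SciPy sparse matrix straight from the long-format ratings. The sketch below uses a hypothetical toy frame with the same columns as `ratings_df1`, and checks the result against the dense pivot.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings with the same columns as ratings_df1
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3],
    "userId":  [7, 9, 7, 8, 9],
    "rating":  [4.0, 3.0, 5.0, 2.5, 4.5],
})

# Map raw ids to consecutive positions; categories are sorted, like pivot's axes
movie_cat = pd.Categorical(toy["movieId"])
user_cat = pd.Categorical(toy["userId"])
rows = movie_cat.codes.astype("int64")
cols = user_cat.codes.astype("int64")

# Rows are movies, columns are users; unrated cells are implicit zeros
item_user = csr_matrix(
    (toy["rating"].to_numpy(), (rows, cols)),
    shape=(len(movie_cat.categories), len(user_cat.categories)),
)

# Dense equivalent, as built in the post, for comparison
dense_pivot = toy.pivot(index="movieId", columns="userId", values="rating").fillna(0)
print((item_user.toarray() == dense_pivot.to_numpy()).all())
```

`TruncatedSVD.fit_transform` accepts a sparse matrix like `item_user` directly. Row `i` then corresponds to `movie_cat.categories[i]`.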

svd = TruncatedSVD(n_components=200)
latent_matrix = svd.fit_transform(ratings_df2)
latent_matrix_df2 = pd.DataFrame(latent_matrix, index = final_df['title'].tolist())

latent_matrix_df2.head()

	0	1	2	3	4	5	6	7	8	9	...	190	191	192	193	194	195	196	197	198	199
Toy Story (1995)	503.065269	-10.274285	118.147003	63.001323	33.324811	144.449895	-58.370170	58.839586	-50.801258	7.799550	...	-17.939855	-6.105493	13.302291	10.795169	-6.692900	-19.925751	8.026037	-17.779819	-3.730701	-11.210057
Jumanji (1995)	226.951509	-6.880712	142.044769	-38.659399	-34.455545	9.189622	-59.369870	43.351875	19.546100	-23.091564	...	-5.759097	15.838553	-4.651140	-7.410019	3.693425	12.570774	-3.953578	24.891350	10.353275	3.837859
Grumpier Old Men (1995)	94.293293	-45.533961	61.644364	-38.511678	-28.622765	-0.040053	-3.608147	1.412690	-17.248191	-29.525224	...	-0.670059	1.097869	-2.833690	-2.598111	-3.171674	-3.332366	4.101573	-3.049846	4.151136	-5.025944
Waiting to Exhale (1995)	23.234759	-25.256543	18.811186	-7.308903	-25.304709	0.539134	-0.387492	3.263833	-4.996824	2.722049	...	0.020090	-1.532013	-2.560055	-0.007939	-1.131231	0.192202	0.356880	-3.906472	-0.169543	-1.063388
Father of the Bride Part II (1995)	80.873515	-40.008944	67.371184	-34.720191	-44.404007	13.227772	-10.653996	8.077569	-11.464432	-18.275107	...	-0.786211	-0.501860	-4.795863	-3.421932	-3.465008	3.484481	0.667434	-0.128325	3.316472	-3.240816

5 rows × 200 columns
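Labelling the latent rows with `final_df['title']` assumes that `ratings_df2` and `final_df` list the movies in the same order. That holds when both are sorted by `movieId`. A defensive sketch is to reindex the rating matrix by the catalog's `movieId` column before attaching titles; `catalog` and `pivoted` below are hypothetical stand-ins for `final_df` and `ratings_df2`.

```python
import pandas as pd

# Hypothetical stand-ins: `catalog` plays the role of final_df, `pivoted` of ratings_df2
catalog = pd.DataFrame({"movieId": [3, 1, 2], "title": ["C", "A", "B"]})
pivoted = pd.DataFrame(
    [[4.0, 0.0], [0.0, 5.0], [3.0, 3.0]],
    index=pd.Index([1, 2, 3], name="movieId"),
    columns=[10, 11],
)

# Put the rating rows in the same order as the catalog before attaching titles
aligned = pivoted.reindex(index=catalog["movieId"].to_numpy())
assert not aligned.isna().any().any(), "some catalog movies have no ratings row"
aligned.index = catalog["title"].tolist()
print(aligned)
```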

explained_var = svd.explained_variance_ratio_.cumsum()
plt.plot(explained_var, '.-')
plt.xlabel("SVD components")
plt.ylabel("cumulative variance explained (fraction)")
plt.show()

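[Figure: cumulative explained variance ratio versus number of SVD components, collaborative (ratings) matrix]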

from sklearn.metrics.pairwise import cosine_similarity

latent_matrix_df1.head()

	0	1	2	3	4	5	6	7	8	9	...	190	191	192	193	194	195	196	197	198	199
Toy Story (1995)	0.027829	0.053107	0.019021	0.003907	0.005143	-0.027165	0.117464	-0.000203	0.000751	0.073886	...	-0.072060	0.014005	-0.030398	0.085778	0.211589	-0.046935	0.008914	0.015430	0.035303	-0.034852
Jumanji (1995)	0.011114	0.011237	0.025765	0.002484	0.014320	-0.001906	0.070994	-0.001397	0.008939	0.040755	...	0.017030	0.026041	0.014499	0.021231	-0.076609	-0.022787	0.054812	-0.017474	0.044199	-0.005890
Grumpier Old Men (1995)	0.040006	0.073972	-0.004636	-0.001118	0.031234	0.002447	-0.003453	0.000312	-0.001469	0.000665	...	0.021166	-0.003256	-0.014490	0.009730	-0.000665	-0.009319	0.000651	0.005782	-0.006482	0.036551
Waiting to Exhale (1995)	0.138340	0.076832	-0.021021	-0.002120	0.100808	0.013420	-0.012406	-0.003615	-0.006283	-0.002056	...	0.037234	0.028600	0.028611	-0.039458	-0.026834	-0.080583	0.047081	-0.010145	-0.005737	0.013725
Father of the Bride Part II (1995)	0.040096	0.084344	0.000854	0.000621	-0.013870	-0.000925	0.013931	0.006003	0.006188	0.011663	...	-0.029974	-0.014503	0.007450	0.044511	-0.000263	0.021819	0.004452	-0.039057	0.003836	0.003095

5 rows × 200 columns

latent_matrix_df2.head()

	0	1	2	3	4	5	6	7	8	9	...	190	191	192	193	194	195	196	197	198	199
Toy Story (1995)	503.065269	-10.274285	118.147003	63.001323	33.324811	144.449895	-58.370170	58.839586	-50.801258	7.799550	...	-17.939855	-6.105493	13.302291	10.795169	-6.692900	-19.925751	8.026037	-17.779819	-3.730701	-11.210057
Jumanji (1995)	226.951509	-6.880712	142.044769	-38.659399	-34.455545	9.189622	-59.369870	43.351875	19.546100	-23.091564	...	-5.759097	15.838553	-4.651140	-7.410019	3.693425	12.570774	-3.953578	24.891350	10.353275	3.837859
Grumpier Old Men (1995)	94.293293	-45.533961	61.644364	-38.511678	-28.622765	-0.040053	-3.608147	1.412690	-17.248191	-29.525224	...	-0.670059	1.097869	-2.833690	-2.598111	-3.171674	-3.332366	4.101573	-3.049846	4.151136	-5.025944
Waiting to Exhale (1995)	23.234759	-25.256543	18.811186	-7.308903	-25.304709	0.539134	-0.387492	3.263833	-4.996824	2.722049	...	0.020090	-1.532013	-2.560055	-0.007939	-1.131231	0.192202	0.356880	-3.906472	-0.169543	-1.063388
Father of the Bride Part II (1995)	80.873515	-40.008944	67.371184	-34.720191	-44.404007	13.227772	-10.653996	8.077569	-11.464432	-18.275107	...	-0.786211	-0.501860	-4.795863	-3.421932	-3.465008	3.484481	0.667434	-0.128325	3.316472	-3.240816

5 rows × 200 columns

# Check the similarity of one movie to all others using the content and collaborative matrices
movie_content_vector = np.array(latent_matrix_df1.loc['Toy Story (1995)']).reshape(1,-1)
movie_collab_vector = np.array(latent_matrix_df2.loc['Toy Story (1995)']).reshape(1,-1)

score_1 = cosine_similarity(latent_matrix_df1, movie_content_vector).reshape(-1)
score_2 = cosine_similarity(latent_matrix_df2, movie_collab_vector).reshape(-1)

# Hybrid score: unweighted mean of the content and collaborative similarities
av_score = (score_1 + score_2)/2.0

movie_sim = {'content':score_1, 'collab': score_2, 'hybrid':av_score}
simil_df = pd.DataFrame(movie_sim, index=latent_matrix_df1.index)

simil_df.head()

	content	collab	hybrid
Toy Story (1995)	1.000000	1.000000	1.000000
Jumanji (1995)	0.073692	0.570838	0.322265
Grumpier Old Men (1995)	0.058560	0.460268	0.259414
Waiting to Exhale (1995)	0.029523	0.277400	0.153462
Father of the Bride Part II (1995)	0.052702	0.450922	0.251812

simil_df.sort_values('content', ascending=False) # by content, Toy Story (1995) is most similar to Toy Story 2 (1999)

	content	collab	hybrid
Toy Story (1995)	1.000000	1.000000	1.000000
Toy Story 2 (1999)	0.960975	0.765740	0.863358
Bug's Life, A (1998)	0.905825	0.654965	0.780395
Ratatouille (2007)	0.898999	0.429254	0.664126
Monsters, Inc. (2001)	0.883235	0.621118	0.752176
...	...	...	...
Life, Above All (2010)	-0.113142	0.073144	-0.019999
Nell (1994)	-0.117324	0.341333	0.112004
Stevie (2002)	-0.117623	0.174810	0.028593
Newsfront (1978)	-0.123699	0.029111	-0.047294
Samson and Delilah (2009)	-0.124658	0.108386	-0.008136

25093 rows × 3 columns

simil_df.sort_values('collab', ascending=False) # by ratings, users who like Toy Story (1995) also tend to like Toy Story 2 (1999)

	content	collab	hybrid
Toy Story (1995)	1.000000	1.000000	1.000000
Toy Story 2 (1999)	0.960975	0.765740	0.863358
Aladdin (1992)	0.413372	0.686649	0.550010
Lion King, The (1994)	0.456150	0.675752	0.565951
Star Wars: Episode IV - A New Hope (1977)	0.020970	0.675528	0.348249
...	...	...	...
Koumiko Mystery, The (Mystère Koumiko, Le) (1967)	0.031572	-0.021734	0.004919
Steep (2007)	-0.001602	-0.021942	-0.011772
Cheers for Miss Bishop (1941)	-0.000040	-0.023719	-0.011880
Stranger, The (Agantuk) (Visitor, The) (1991)	0.009928	-0.025073	-0.007573
Happy End (1967)	0.047334	-0.029115	0.009110

25093 rows × 3 columns

simil_df.sort_values('hybrid', ascending=False)

	content	collab	hybrid
Toy Story (1995)	1.000000	1.000000	1.000000
Toy Story 2 (1999)	0.960975	0.765740	0.863358
Bug's Life, A (1998)	0.905825	0.654965	0.780395
Monsters, Inc. (2001)	0.883235	0.621118	0.752176
Finding Nemo (2003)	0.869589	0.603694	0.736641
...	...	...	...
Love unto Death (L'amour a mort) (1984)	-0.096918	0.016217	-0.040350
Cane Toads: The Conquest (2010)	-0.092350	0.002899	-0.044726
Amish Murder, An (2013)	-0.090066	-0.000880	-0.045473
Newsfront (1978)	-0.123699	0.029111	-0.047294
You Ain't Seen Nothin' Yet (Vous n'avez encore rien vu) (2012)	-0.096918	-0.021501	-0.059209

25093 rows × 3 columns
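The steps above can be wrapped in a small helper that returns the top-N neighbours for any title. This is only a sketch: the function name `top_n_similar` and the `content_weight` parameter are not from the original notebook, and the code assumes titles are unique in the index. With `content_weight=0.5` it reproduces the plain average used for the hybrid score.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def top_n_similar(title, content_df, collab_df, n=3, content_weight=0.5):
    """Rank titles by a weighted mean of content and collaborative cosine similarity."""
    content_vec = content_df.loc[[title]].to_numpy()
    collab_vec = collab_df.loc[[title]].to_numpy()
    content = cosine_similarity(content_df.to_numpy(), content_vec).ravel()
    collab = cosine_similarity(collab_df.to_numpy(), collab_vec).ravel()
    hybrid = content_weight * content + (1.0 - content_weight) * collab
    scores = pd.DataFrame(
        {"content": content, "collab": collab, "hybrid": hybrid},
        index=content_df.index,
    )
    # Drop the query title itself, then keep the n best hybrid scores
    return scores.drop(index=title).sort_values("hybrid", ascending=False).head(n)

# Hypothetical 2-d latent vectors standing in for latent_matrix_df1 / latent_matrix_df2
titles = ["A", "B", "C"]
content_latent = pd.DataFrame([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], index=titles)
collab_latent = pd.DataFrame([[1.0, 0.0], [0.8, 0.6], [0.1, 1.0]], index=titles)

recs = top_n_similar("A", content_latent, collab_latent, n=2)
print(recs)
```

In the notebook the call would look like `top_n_similar('Toy Story (1995)', latent_matrix_df1, latent_matrix_df2, n=10)`. Raising `content_weight` favours metadata similarity, which may help for movies with few ratings.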
