by kulka053/presentation full · dark knight rocky sita aur gita star trek cliffhanger a.i. mi...
TRANSCRIPT
ByAtul S. Kulkarni
Graduate Student,University of Minnesota Duluth
Under The Guidance ofDr. Richard Maclin
Problem Statementy Given a set of users with their previous ratings for a set of
movies, can we predict the rating they will assign to a movie they have not previously rated?
y Netflix puts it as y “The Netflix Prize seeks to substantially improve the accuracy of
predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love.” – www.netlfixprize.com
y So what do they want?y 10% improvement to their existing system.
y They are paying $1 Million for this.
Problem Statementy Similarly, “which movie will you like” given that you
have seen X-Men, X-Men II, X-Men : The Last Stand and users who saw these movies also liked “X-Men Origins : Wolverine”?
y Answer:?
Background - Datasety Data in the training file is per movie
y It looks like thisMovie#
Customer#,Rating,Date of Rating
Customer#,Rating,Date of Rating
Customer#,Rating,Date of Rating
- Example 4:
1065039,3,2005-09-06
1544320,1,2004-06-28
410199,5,2004-10-16
Background – Dataset statsy Total ratings possible = 480,189 (user) * 17,770 (movies) = 8532958530 (8.5
Billion)y Total available = 100 Milliony The User x Movies matrix has 8.4 Billion entries
missingy Sparse Data
Background of the Solutiony What if I was very conservative about my rating and
someone else was too generous?y I rate the movie I like the most as 3 and the least as 1.y someone else rates his/her high at 5 and high at 3.y So am I like this person?
y Difficult to say.
y We are comparing two people with very high personal biases. Which will result in obvious flawed similarity measure.
y Solution? Normalization of the data.
Proposed Solutiony K-Nearest Neighbor approach (Overview)
y Given a query instance q(movieId, UserId)y normalize the data before processing.y Find the distance of this instance with all the users who
rated this movie.y Of the these users select the K users that are nearest to
the query instance as its neighborhood.y Average the rating of the users form this neighborhood
for this particular movie.y This is the predicted rating for the query instance.
Proposed Solution - Exampley Example: (Representative data, not real)
Matrix
Star
Wars
Dark
knight Rocky
Sita
Aur
Gita
Star
Trek Cliffhanger A.I. MI X-Men
Jim 1 3 1 5 2 1 1
Sean 2 3 2 4 5 3
John 3 4 5 3 4
Sidd 4 3 4 2
Penny 5 2 2 5 1
Pete 5 ? 4 4
Proposed Solution - Exampley calculate the Mean and Standard Deviation vectors.
meanRating standardDeviation
Jim 2 1.527525232
Sean 3.166666667 1.169045194
John 3.8 0.836660027
Sidd 3.25 0.957427108
Penny 3 1.870828693
Pete 4.333333333 0.577350269
Proposed Solution - Exampley Normalized data
MatrixStar
Wars
Dark
knightRocky
Sita
Aur
Gita
Star
TrekCliffhanger A.I. MI X-Men
Jim -0.65 0.65 -0.65 1.96 0 -0.65 -0.7
Sean -1 -0.14 -1 0.71 1.57 -0.14
John -1 0.24 1.434 -1 0.24
Sidd 0.783 -0.26 0.78 -1.3
Penny 1.069 -0.53 -0.53 1.07 -1.1
Pete 1.15 ? -0.6 -0.58
Proposed Solution - Exampley So now we have a query instance q(Pete, Sita Aur Gita)
y i.e. we wish to evaluate how much will Pete like movie “Sita Aur Gita” on a scale of 1 - 5.
y To do this we need to indentify Pete’s two neighbors who rated this movie. (2-NN case).
y Users who rated the movie Sita Aur Gita are.
candidate_users
Jim
Sidd
Penny
Proposed Solution - Exampley Users with their distance and the 2 neighbors in the
neighborhood are
y 2 Nearest Neighbors are Jim and Sidd.
Users Distance
Jim 0.500046868
Sidd 1.360699721
Peny 1.646395237
Proposed Solution - Exampley The average of the ratings by Jim and Sidd to movie
“Sita Aur Gita” is “0.7956”.y So is our prediction “0.7956” correct? Not yet.y This prediction is in normalized form.y We need to bring it back to Pete’s prediction level.
How?y Multiply by Standard Deviation of Pete’s ratings.y Add Pete mean rating to this product.
y (0.7956 * 0.5773) + 4.3333 = 4.7925y So predicted rating for Pete is 4.7925.
Experiments - Setupy This is a regression problem, hence we want to know if
we are off the expected value, how off are we?y Hence, Test Metric used is
y Root Mean Square Error (RMSE):
y Absolute Average Error (AAE):
y Time taken.
Experiments - Resultsy Result on described dataset
Method Absolute Average Error Root Mean Square Error Time (Minutes)
K-NN 0.5087 0.67164 8640 *
C-K-NN 0.6894 0.88995 9
Netflix (Ladder Board
Topper)
NA 0.8596 NA
Netflix Current System1 NA 0.9514 NA
Experiments - ResultsRMSE Comparisons Time taken
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
K-NN C-K-NN Netflix (Current Topper)
Netflix (Current System)
Comparison of the RMSE and Absolute Average Error
RMSE
Absolute Average Error
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
K-NN C-K-NN
Time in Minutes
Time in Minutes