by kulka053/presentation full · dark knight rocky sita aur gita star trek cliffhanger a.i. mi...

By Atul S. Kulkarni

Graduate Student, University of Minnesota Duluth

Under the guidance of Dr. Richard Maclin

http://www.d.umn.edu/~kulka053/Presentation_full.pdf

Problem Statement

- Given a set of users with their previous ratings for a set of movies, can we predict the rating they will assign to a movie they have not previously rated?

- Netflix puts it as:
  - “The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love.” – www.netflixprize.com

- So what do they want?
  - A 10% improvement over their existing system.
  - They are paying $1 Million for this.

Problem Statement

- Similarly: “which movie will you like”, given that you have seen X-Men, X-Men II, and X-Men: The Last Stand, and users who saw these movies also liked “X-Men Origins: Wolverine”?

- Answer: ?

Background - Dataset

- Data in the training file is per movie.

- It looks like this:

Movie#

Customer#,Rating,Date of Rating

- Example:

4:

1065039,3,2005-09-06

1544320,1,2004-06-28

410199,5,2004-10-16
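
The per-movie layout above can be read with a few lines of Python. This is a minimal sketch, assuming each block starts with the movie number followed by a colon (as in the "4:" example) and that every following line is a `Customer#,Rating,Date` triple; the function name `parse_movie_block` is hypothetical.

```python
from datetime import date

def parse_movie_block(text):
    """Parse one per-movie block into (movie_id, [(customer, rating, date), ...])."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # Assumption: the first line is the movie number followed by a colon.
    movie_id = int(lines[0].rstrip(":"))
    ratings = []
    for ln in lines[1:]:
        customer, rating, day = ln.split(",")
        ratings.append((int(customer), int(rating), date.fromisoformat(day)))
    return movie_id, ratings

# The example rows shown on the slide.
sample = """4:
1065039,3,2005-09-06
1544320,1,2004-06-28
410199,5,2004-10-16
"""
movie_id, ratings = parse_movie_block(sample)
```

Because the file is grouped by movie, the list of users who rated a given movie is available directly from its block.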

Background – Dataset stats

- Total ratings possible = 480,189 (users) * 17,770 (movies) = 8,532,958,530 (8.5 billion)

- Total available = 100 million

- The User x Movies matrix has 8.4 billion entries missing

- Sparse data
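
The sparsity figures can be checked with simple arithmetic. The sketch below uses the rounded "100 million" figure from the slide, so the missing count and density are approximate.

```python
users = 480_189
movies = 17_770
available = 100_000_000  # "100 million" as stated on the slide (rounded)

possible = users * movies          # every user rating every movie
missing = possible - available     # empty cells in the User x Movies matrix
density = available / possible     # fraction of cells that hold a rating

print(f"possible: {possible:,}")
print(f"missing:  {missing:,}")
print(f"density:  {density:.2%}")
```

Only a little over 1% of the matrix is filled, which is why the data is described as sparse.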

Background of the Solution

- What if I was very conservative about my ratings and someone else was too generous?
  - I rate the movie I like the most as 3 and the least as 1.
  - Someone else rates his/her highest at 5 and lowest at 3.

- So am I like this person?
  - Difficult to say.

- We are comparing two people with very strong personal biases, which results in an obviously flawed similarity measure.

- Solution? Normalization of the data.

==> Subtract the user's mean rating, then divide by the user's standard deviation (STD).
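
A small sketch of this normalization, applied to the conservative and generous raters described above. The two rating lists are hypothetical, and the sample standard deviation (dividing by n - 1) is an assumption, chosen because it reproduces figures quoted in the example slides, such as 0.577 for the ratings 5, 4, 4.

```python
import statistics

def normalize(ratings):
    """Return (z_scores, mean, std) for one user's ratings."""
    mean = statistics.mean(ratings)
    std = statistics.stdev(ratings)  # sample standard deviation (n - 1)
    return [(r - mean) / std for r in ratings], mean, std

conservative = [1, 2, 3]  # hypothetical user who rates between 1 and 3
generous = [3, 4, 5]      # hypothetical user who rates between 3 and 5

z_conservative, _, _ = normalize(conservative)
z_generous, _, _ = normalize(generous)
```

After normalization both users map to the same values, so their tastes become comparable even though their raw scales differ. A user who gave every movie the same rating has a standard deviation of zero and would need special handling.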

Proposed Solution

- K-Nearest Neighbor approach (Overview)
  - Given a query instance q(movieId, userId):
  - Normalize the data before processing.
  - Find the distance between this instance and all the users who rated this movie.
  - Of these users, select the K users nearest to the query instance as its neighborhood.
  - Average the ratings given by the users from this neighborhood for this particular movie.
  - This is the predicted rating for the query instance.
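
The steps above can be sketched as follows. The slides do not state which distance measure was used, so the root mean squared difference over co-rated movies is an assumption here, and the users, movies, and values are hypothetical.

```python
import math

def distance(a, b):
    """Root mean squared difference over movies both users rated (assumed metric)."""
    common = a.keys() & b.keys()
    if not common:
        return math.inf  # no overlap: treat as infinitely far apart
    return math.sqrt(sum((a[m] - b[m]) ** 2 for m in common) / len(common))

def predict(normalized, user, movie, k):
    """Average the normalized ratings of the k nearest users who rated `movie`."""
    candidates = [u for u, r in normalized.items() if u != user and movie in r]
    nearest = sorted(candidates,
                     key=lambda u: distance(normalized[user], normalized[u]))[:k]
    if not nearest:
        return 0.0  # nobody rated the movie: fall back to the user's own mean
    return sum(normalized[u][movie] for u in nearest) / len(nearest)

# Hypothetical normalized ratings: user -> {movie: z-score}.
normalized = {
    "A": {"m1": 1.0, "m2": -1.0},
    "B": {"m1": 0.9, "m2": -0.8, "m3": 0.5},
    "C": {"m1": 1.0, "m2": -1.1, "m3": 0.7},
    "D": {"m1": -1.0, "m2": 1.0, "m3": -1.2},
}
prediction = predict(normalized, "A", "m3", k=2)  # neighbors are C and B
```

The result is still in normalized units and has to be mapped back to the querying user's own rating scale.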

Proposed Solution - Example

- Example: (Representative data, not real)

Matrix   Star Wars   Dark Knight   Rocky   Sita Aur Gita   Star Trek   Cliffhanger   A.I.   MI   X-Men

Jim 1 3 1 5 2 1 1

Sean 2 3 2 4 5 3

John 3 4 5 3 4

Sidd 4 3 4 2

Penny 5 2 2 5 1

Pete 5 ? 4 4

Proposed Solution - Example

- Calculate the mean and standard deviation vectors.

meanRating standardDeviation

Jim 2 1.527525232

Sean 3.166666667 1.169045194

John 3.8 0.836660027

Sidd 3.25 0.957427108

Penny 3 1.870828693

Pete 4.333333333 0.577350269
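
As a check on the table, Jim's row can be worked by hand from his ratings 1, 3, 1, 5, 2, 1, 1. The use of the sample standard deviation (dividing by n - 1) is inferred from the fact that it reproduces the tabulated value; the slides do not state it.

```latex
\[
\bar{r}_{\text{Jim}} = \frac{1+3+1+5+2+1+1}{7} = \frac{14}{7} = 2
\]
\[
s_{\text{Jim}} = \sqrt{\frac{(1-2)^2+(3-2)^2+(1-2)^2+(5-2)^2+(2-2)^2+(1-2)^2+(1-2)^2}{7-1}}
             = \sqrt{\frac{14}{6}} \approx 1.5275
\]
```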

Proposed Solution - Example

- Normalized data

Matrix   Star Wars   Dark Knight   Rocky   Sita Aur Gita   Star Trek   Cliffhanger   A.I.   MI   X-Men

Jim -0.65 0.65 -0.65 1.96 0 -0.65 -0.7

Sean -1 -0.14 -1 0.71 1.57 -0.14

John -1 0.24 1.434 -1 0.24

Sidd 0.783 -0.26 0.78 -1.3

Penny 1.069 -0.53 -0.53 1.07 -1.1

Pete 1.15 ? -0.6 -0.58

Proposed Solution - Example

- So now we have a query instance q(Pete, Sita Aur Gita),

- i.e. we wish to estimate how much Pete will like the movie “Sita Aur Gita” on a scale of 1 - 5.

- To do this we need to identify Pete’s two nearest neighbors among the users who rated this movie (2-NN case).

- The users who rated the movie Sita Aur Gita are:

candidate_users

Jim

Sidd

Penny

Proposed Solution - Example

- Users with their distances from the query instance:

Users Distance

Jim 0.500046868

Sidd 1.360699721

Penny 1.646395237

- The 2 nearest neighbors are Jim and Sidd.
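
Picking the neighborhood from these distances is a sort followed by a cut at K. A minimal sketch using the three distances listed on the slide:

```python
# Distances of the candidate users from the query instance, as listed above.
distances = {"Jim": 0.500046868, "Sidd": 1.360699721, "Penny": 1.646395237}

k = 2
# Sort users by ascending distance and keep the first k.
neighbors = sorted(distances, key=distances.get)[:k]
print(neighbors)
```

With K = 2 this keeps Jim and Sidd and drops Penny, the farthest candidate.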

Proposed Solution - Example

- The average of the normalized ratings given by Jim and Sidd to the movie “Sita Aur Gita” is 0.7956.

- So is our prediction of 0.7956 correct? Not yet.
  - This prediction is in normalized form.
  - We need to bring it back to Pete’s rating scale.

- How?
  - Multiply by the standard deviation of Pete’s ratings.
  - Add Pete’s mean rating to this product.

- (0.7956 * 0.5773) + 4.3333 = 4.7925

- So the predicted rating for Pete is 4.7925.
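
The de-normalization step is the inverse of the earlier normalization. The sketch below plugs in the rounded values quoted on the slide; the final clipping to the 1 - 5 scale is an added assumption, not something the slides mention.

```python
def denormalize(z, mean, std):
    """Map a normalized prediction back to a user's own rating scale."""
    return z * std + mean

# Rounded values from the slide: neighborhood average, Pete's mean and STD.
predicted = denormalize(0.7956, mean=4.3333, std=0.5773)

# Assumption: keep the prediction inside the valid 1 - 5 rating range.
clipped = min(5.0, max(1.0, predicted))
print(round(predicted, 2))
```

With these rounded inputs the result is about 4.79, in line with the figure on the slide; small differences in the last digit come from rounding of the intermediate values.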

Experiments - Setup

- This is a regression problem, so we want to know how far off we are from the expected value.

- Hence, the test metrics used are:
  - Root Mean Square Error (RMSE)
  - Absolute Average Error (AAE)
  - Time taken.
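
The formulas for the two error metrics did not survive in this copy of the slides. The sketch below uses the standard definitions, taking Absolute Average Error to mean the mean absolute difference between predicted and actual ratings, which is an assumption; the sample values are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error: square root of the mean squared difference."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def aae(predicted, actual):
    """Absolute Average Error, read here as the mean absolute difference."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions and true ratings.
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]

print(rmse(predicted, actual), aae(predicted, actual))
```

RMSE penalizes large misses more heavily than AAE, so it is never smaller than AAE on the same predictions.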

Experiments - Results

- Results on the described dataset

Method                         Absolute Average Error   Root Mean Square Error   Time (Minutes)

K-NN                           0.5087                   0.67164                  8640 *

C-K-NN                         0.6894                   0.88995                  9

Netflix (Leaderboard Topper)   NA                       0.8596                   NA

Netflix Current System¹        NA                       0.9514                   NA

Experiments - Results

[Figure: RMSE Comparisons. Bar chart "Comparison of the RMSE and Absolute Average Error" for K-NN, C-K-NN, Netflix (Current Topper), and Netflix (Current System); series: RMSE and Absolute Average Error; vertical axis from 0 to 1.]

[Figure: Time taken. Bar chart of Time in Minutes for K-NN and C-K-NN; vertical axis from 0 to 9000.]
