CS 572: Information Retrieval
Recommender Systems
Acknowledgements
Many slides in this lecture are adapted from Xavier Amatriain (Netflix) and Y. Koren (Yahoo!).
Big Picture
Part 1: Classical IR: indexing, ranking, metrics
Part 2: Web: crawling, indexing, social web
Part 3: User-centric Information Access
– Recommender Systems
– Personalized Search
– Search interfaces and “intelligent assistants”
Recommender Systems
• Recommender Systems: general approaches (today)
– Applications: News, Social recommendation (Wed)
• User feedback in Search (Wed)
• Personalized search (4/11)
• Search interfaces (4/13)
• Question Answering (4/18, Denis Savenkov)
– Collaborative Filtering
• Application: News Recommendation
– Wed: Applications to Search
• Improving Search Ranking with User Feedback
Logistics for next 2 weeks
• Monday, 18 April: Question Answering (Denis Savenkov)
• Wed, April 20: Final project Milestone presentations:
– 10-minutes (max 8 slides)
– Email slides or URL to me by Tuesday night, 4/19.
Information Retrieval Until Now
[Diagram: a query string and a document corpus feed into an IR system, which returns a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, …]
Information Overload Is Real
Recommender Systems Help
Problem Definition: Recommendation
Recommendation: Formal Problem
RS: Utility Matrix
[Example utility matrix: users Alice, Bob, Carol, David × movies King Kong, LOTR, Matrix, Nacho Libre. Only a few entries are known (values such as 1, 0.2, 0.3, 0.5, 0.4); most cells are empty.]
Recommendation System: Process
Paradigms of recommender systems
Recommender systems reduce information overload by estimating relevance
Paradigms of recommender systems
Personalized recommendations
Paradigms of recommender systems
Collaborative: "Tell me what's popular among my peers"
Paradigms of recommender systems
Content-based: "Show me more of the same things I've liked"
Paradigms of recommender systems
Knowledge-based: "Tell me what fits based on my needs"
Paradigms of recommender systems
Hybrid: combinations of various inputs and/or composition of different mechanism
Key Problems
• Gathering “known” ratings for matrix
• Extrapolate unknown ratings from known ratings
– Mainly interested in high unknown ratings
• Evaluating extrapolation methods
Gathering Ratings
• Explicit
– Ask people to rate items
– Doesn’t work well in practice – people can’t be bothered
• Implicit
– Learn ratings from user actions
• e.g., purchase implies high rating
– What about low ratings?
Rating interface
Extrapolating Utilities
• Key problem: matrix U is sparse
– most people have not rated most items
• Three approaches
– Content-based
– Collaborative
– Hybrid
Main Approaches
Tradeoffs
• Collaborative – Pros: no knowledge-engineering effort, serendipity of results, learns market segments. Cons: requires some form of rating feedback, cold start for new users and new items.
• Content-based – Pros: no community required, comparison between items possible. Cons: content descriptions necessary, cold start for new users, no surprises.
• Knowledge-based – Pros: deterministic recommendations, assured quality, no cold start, can resemble a sales dialogue. Cons: knowledge-engineering effort to bootstrap, basically static, does not react to short-term trends.
What works (in “real world”)
COLLABORATIVE FILTERING
Collaborative Filtering (CF)
• The most prominent approach to generating recommendations
– used by large, commercial e-commerce sites
– well-understood; various algorithms and variations exist
– applicable in many domains (books, movies, DVDs, …)
• Approach
– use the "wisdom of the crowd" to recommend items
• Basic assumption and idea
– users give ratings to catalog items (implicitly or explicitly)
– customers who had similar tastes in the past will have similar tastes in the future
User-based nearest-neighbor collaborative filtering (1)
• The basic technique:
– Given an "active user" (Alice) and an item i not yet seen by Alice
– The goal is to estimate Alice's rating for this item, e.g., by:
• finding a set of users (peers) who liked the same items as Alice in the past and who have rated item i
• using, e.g., the average of their ratings to predict whether Alice will like item i
• doing this for all items Alice has not seen and recommending the best-rated
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
Collaborative Filtering in Action
[Diagram: a user database of rating vectors over items A…Z (e.g., one user has A=9, B=3, Z=5; another has C=9, Z=10; …). The active user's vector (A=10, B=4, C=8, …, Z=1) is correlation-matched against the database, and the ratings of the best-matching users are used to extract recommendations (here, item C).]
User-based nearest-neighbor collaborative filtering (2)
• Some first questions:
– How do we measure similarity?
– How many neighbors should we consider?
– How do we generate a prediction from the neighbors' ratings?
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
Measuring user similarity
• A popular similarity measure in user-based CF: Pearson correlation

sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \, \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}

– a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated by both a and b; \bar{r}_a, \bar{r}_b: the users' average ratings
– Possible similarity values are between -1 and 1
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
sim(Alice, User1) = 0.85; sim(Alice, User2) = 0.70; sim(Alice, User4) = -0.79
Pearson correlation
• Takes differences in rating behavior into account
• Works well in usual domains, compared with alternative measures such as cosine similarity
[Chart: ratings of Alice, User1, and User4 on Item1–Item4 (scale 0–6), illustrating differences in individual rating behavior]
Making predictions
• A common prediction function:

pred(a,p) = \bar{r}_a + \frac{\sum_{b \in N} sim(a,b) \cdot (r_{b,p} - \bar{r}_b)}{\sum_{b \in N} sim(a,b)}

• Calculate whether the neighbors' ratings for the unseen item are higher or lower than their average
• Combine the rating differences, using the similarity as a weight
• Add/subtract the neighbors' bias from the active user's average and use this as a prediction
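A sketch of this prediction function, reusing pearson() from the previous snippet (illustrative; the neighborhood size k and dict layout are assumptions):

```python
def predict(ratings, active, item, k=2):
    """Predict `active`'s rating for `item` from the k most similar raters."""
    mean_active = sum(ratings[active].values()) / len(ratings[active])
    peers = sorted(((pearson(ratings, active, u), u)
                    for u in ratings if u != active and item in ratings[u]),
                   reverse=True)[:k]                       # top-k neighbors
    num = sum(s * (ratings[u][item] -
                   sum(ratings[u].values()) / len(ratings[u]))
              for s, u in peers)
    den = sum(abs(s) for s, u in peers)
    # fall back to the user's own average when no usable neighbors exist
    return mean_active + num / den if den > 0 else mean_active
```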
Significance Weighting
• Important not to trust correlations based on very few co-rated items.
• Include significance weights s_{a,u}, based on the number m of co-rated items:

w_{a,u} = s_{a,u} \cdot c_{a,u}, \quad s_{a,u} = \begin{cases} m/50 & \text{if } m < 50 \\ 1 & \text{if } m \geq 50 \end{cases}

where c_{a,u} is the correlation between users a and u.
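A one-line sketch of this weighting (illustrative helper name):

```python
def significance_weight(corr, m, cutoff=50):
    """Damp a correlation that rests on fewer than `cutoff` co-rated items."""
    return corr * min(m, cutoff) / cutoff
```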
Collaborative Filtering Method
• Weight all users with respect to similarity with the active user.
• Select a subset of the users (neighbors) to use as predictors.
• Normalize ratings and compute a prediction from a weighted combination of the selected neighbors’ ratings.
• Present items with highest predicted ratings as recommendations.
Making recommendations
• Making predictions is typically not the ultimate goal
• Usual approach (in academia)
– Rank items based on their predicted ratings
• However
– This might lead to the inclusion of (only) niche items
– In practice also: Take item popularity into account
• Approaches
– "Learning to rank"
• Optimize according to a given rank evaluation metric (see later)
Problems with Collaborative Filtering
• Cold Start: There needs to be enough other users already in the system to find a match.
• Sparsity: If there are many items to be recommended, even if there are many users, the user/ratings matrix is sparse, and it is hard to find users that have rated the same items.
• First Rater: Cannot recommend an item that has not been previously rated
– New items
– Esoteric items
• Popularity Bias: Cannot recommend items to someone with unique tastes
– Tends to recommend popular items
Improving the metrics / prediction function
• Not all neighbor ratings might be equally "valuable"
– Agreement on commonly liked items is not as informative as agreement on controversial items
– Possible solution: give more weight to items that have a higher variance
• Value of the number of co-rated items
– Use "significance weighting", e.g., by linearly reducing the weight when the number of co-rated items is low
• Case amplification
– Intuition: give more weight to "very similar" neighbors, i.e., where the similarity value is close to 1
• Neighborhood selection
– Use a similarity threshold or a fixed number of neighbors
Memory-based and model-based approaches
• User-based CF is said to be "memory-based"
– the rating matrix is directly used to find neighbors / make predictions
– does not scale for most real-world scenarios
– large e-commerce sites have tens of millions of customers and millions of items
• Model-based approaches
– based on an offline pre-processing or "model-learning" phase
– at run time, only the learned model is used to make predictions
– models are updated / re-trained periodically
– large variety of techniques used
– model building and updating can be computationally expensive
2001: Item-based collaborative filtering recommendation algorithms, B.
Sarwar et al., WWW 2001
• Scalability issues arise with user-to-user CF if there are many more users than items (m >> n, m = |users|, n = |items|)
– e.g., Amazon.com
– Space complexity O(m²) when pre-computed
– Time complexity for computing Pearson: O(m²n)
• High sparsity leads to few common ratings between two users
• Basic idea: "Item-based CF exploits relationships between items first, instead of relationships between users"
Item-based collaborative filtering
• Basic idea:
– Use the similarity between items (and not users) to make predictions
• Example:
– Look for items that are similar to Item5
– Take Alice's ratings for these items to predict the rating for Item5
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
The cosine similarity measure
• Produces better results in item-to-item filtering
– for some datasets; no consistent picture in the literature
• Ratings are seen as vectors in n-dimensional space
• Similarity is calculated based on the angle between the vectors
• Adjusted cosine similarity
– takes average user ratings into account, transforming the original ratings:

sim(a,b) = \frac{\sum_{u \in U} (r_{u,a} - \bar{r}_u)(r_{u,b} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,a} - \bar{r}_u)^2} \, \sqrt{\sum_{u \in U} (r_{u,b} - \bar{r}_u)^2}}

– U: set of users who have rated both items a and b
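A sketch of adjusted cosine on a small dense matrix (illustrative; rows are users, columns are items, 0 marks a missing rating):

```python
import numpy as np

def adjusted_cosine(R, a, b):
    """Adjusted cosine similarity between item columns a and b of R."""
    both = (R[:, a] > 0) & (R[:, b] > 0)        # users who rated both items
    if not both.any():
        return 0.0
    means = np.nanmean(np.where(R > 0, R, np.nan), axis=1)  # user averages
    da = R[both, a] - means[both]               # mean-centered ratings
    db = R[both, b] - means[both]
    denom = np.linalg.norm(da) * np.linalg.norm(db)
    return float(da @ db / denom) if denom > 0 else 0.0
```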
Pre-processing for item-based filtering
• Item-based filtering does not solve the scalability problem by itself
• Pre-processing approach by Amazon.com (in 2003)
– Calculate all pairwise item similarities in advance
– The neighborhood used at run time is typically rather small, because only items the user has rated are taken into account
– Item similarities are supposed to be more stable than user similarities
• Memory requirements
– Up to N² pairwise similarities to be memorized (N = number of items) in theory
– In practice, this is significantly lower (items with no co-ratings)
– Further reductions possible
• Minimum threshold for co-ratings (items rated by at least n users)
• Limit the size of the neighborhood (might affect recommendation accuracy)
More on ratings
• Pure CF-based systems rely only on the rating matrix
• Explicit ratings
– Most commonly used (1-to-5, 1-to-7 Likert response scales)
– Research topics
• "Optimal" granularity of scale; indication that a 10-point scale is better accepted in the movie domain
• Multidimensional ratings (multiple ratings per movie)
– Challenge
• Users are not always willing to rate many items; sparse rating matrices
• How to stimulate users to rate more items?
• Implicit ratings
– clicks, page views, time spent on some page, demo downloads, …
– Can be used in addition to explicit ones; question of correctness of interpretation
Data sparsity problems
• Cold start problem
– How to recommend new items? What to recommend to new users?
• Straightforward approaches
– Ask/force users to rate a set of items
– Use another method (e.g., content-based, demographic, or simply non-personalized) in the initial phase
• Alternatives
– Use better algorithms (beyond nearest-neighbor approaches)
– Example:
• In nearest-neighbor approaches, the set of sufficiently similar neighbors might be too small to make good predictions
• Assume "transitivity" of neighborhoods
Example algorithms for sparse datasets
• Recursive CF
– Assume there is a very close neighbor n of u who, however, has not rated the target item i yet
– Idea:
• Apply the CF method recursively and predict a rating for item i for the neighbor
• Use this predicted rating instead of the rating of a more distant direct neighbor
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 ?
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
sim = 0.85
Predict rating for User1
Graph-based methods
• "Spreading activation" (sketch)
– Idea: use indirect paths in the user–item graph to recommend items
– Length 3: recommend Item3 to User1
– Length 5: Item1 also becomes recommendable
More model-based approaches
• Plethora of different techniques proposed in recent years, e.g.,
– Matrix factorization techniques, statistics
• singular value decomposition, principal component analysis
– Association rule mining
• compare: shopping basket analysis
– Probabilistic models
• clustering models, Bayesian networks, probabilistic Latent Semantic Analysis
– Various other machine learning approaches
• Costs of pre-processing
– Usually not discussed
– Incremental updates possible?
2000: Application of Dimensionality Reduction in
Recommender System, B. Sarwar et al., WebKDD Workshop
• Basic idea: Trade more complex offline model building for faster online prediction generation
• Singular Value Decomposition for dimensionality reduction of rating matrices
– Captures important factors/aspects and their weights in the data
– factors can be genre, actors but also non-understandable ones
– Assumption that k dimensions capture the signals and filter out noise (k = 20 to 100)
• Constant time to make recommendations
• Approach also popular in IR (Latent Semantic Indexing), data compression, …
A picture says …
[Scatter plot: users Alice, Bob, Mary, and Sue placed in the two-dimensional latent space (both axes from -1 to 1) obtained by the decomposition]
Matrix factorization
• SVD: M_k = U_k \Sigma_k V_k^T

[Example factor matrices:
U_k: Alice (0.47, -0.30), Bob (-0.44, 0.23), Mary (0.70, -0.06), Sue (0.31, 0.93)
\Sigma_k = diag(5.63, 3.23)
V_k^T: Dim1 = (-0.44, -0.57, 0.06, 0.38, 0.57), Dim2 = (0.58, -0.66, 0.26, 0.18, -0.36)]

• Prediction: \hat{r}_{ui} = \bar{r}_u + U_k(Alice) \cdot \Sigma_k \cdot V_k^T(EPL) = 3 + 0.84 = 3.84
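A sketch of the rank-k idea with NumPy (toy data; real systems must first handle missing entries, e.g., by mean-filling, since plain SVD needs a complete matrix):

```python
import numpy as np

R = np.array([[5, 3, 4, 4, 3],     # toy ratings: rows = users, cols = items
              [3, 1, 2, 3, 3],
              [4, 3, 4, 3, 5],
              [3, 3, 1, 5, 4]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation
print(np.round(R_k, 2))                        # smoothed/predicted ratings
```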
2008: Factorization meets the neighborhood: a multifaceted
collaborative filtering model, Y. Koren, ACM SIGKDD
• Stimulated by work on the Netflix competition
– Prize of $1,000,000 for a 10% RMSE accuracy improvement over Netflix's own Cinematch system
– Very large dataset (~100M ratings, ~480K users, ~18K movies)
– Last ratings per user withheld (set K)
• Root mean squared error metric, optimized to 0.8567:

RMSE = \sqrt{\frac{\sum_{(u,i) \in K} (\hat{r}_{ui} - r_{ui})^2}{|K|}}
• Merges neighborhood models with latent factor models
– Latent factor models: good at capturing weak signals in the overall data
– Neighborhood models: good at detecting strong relationships between close items
• Combination in a single prediction function:

\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i

– Parameters determined by a local search method such as stochastic gradient descent
– A penalty on large values is added to avoid over-fitting:

\min_{p_*, q_*, b_*} \sum_{(u,i) \in K} \big( r_{ui} - \mu - b_u - b_i - p_u^T q_i \big)^2 + \lambda \big( \|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2 \big)
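A sketch of this biased factorization model trained with stochastic gradient descent (toy data; the learning rate, regularization, and epoch count are illustrative, not the paper's):

```python
import random
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0)]
n_users, n_items, k = 3, 3, 2
lr, lam, epochs = 0.01, 0.05, 200

mu = np.mean([r for _, _, r in ratings])        # global average
bu, bi = np.zeros(n_users), np.zeros(n_items)   # user and item biases
P = np.random.normal(0, 0.1, (n_users, k))      # user factors p_u
Q = np.random.normal(0, 0.1, (n_items, k))      # item factors q_i

for _ in range(epochs):
    random.shuffle(ratings)
    for u, i, r in ratings:
        err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
        bu[u] += lr * (err - lam * bu[u])        # gradient steps with
        bi[i] += lr * (err - lam * bi[i])        # L2 shrinkage (the penalty)
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - lam * P[u]),
                      Q[i] + lr * (err * P[u] - lam * Q[i]))

print(mu + bu[0] + bi[2] + P[0] @ Q[2])          # predicted rating for (0, 2)
```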
Collaborative Filtering Issues
• Pros:
– well-understood, works well in some domains, no knowledge engineering required
• Cons:
– requires a user community, sparsity problems, no integration of other knowledge sources, no explanation of results
• What is the best CF method?
– In which situation and which domain? Inconsistent findings; always the same domains and data sets; differences between methods are often very small (1/100)
• How do we evaluate prediction quality?
– MAE / RMSE: what does an MAE of 0.7 actually mean?
– Serendipity: not yet fully understood
• What about multi-dimensional ratings?
Content-based recommendations
• Main idea: recommend items to customer C similar to previous items rated highly by C
• Movie recommendations
– recommend movies with same actor(s), director, genre, …
• Websites, blogs, news
– recommend other sites with “similar” content
Overall idea
[Diagram: items the user likes are summarized into item profiles (features such as "red", "circles", "triangles"); these build a user profile, which is then matched against item profiles to recommend new items]
Item Profiles
• For each item, create an item profile
• Profile is a set of features
– movies: author, title, actor, director, …
– text: set of "important" words in document
• How to pick important words?
– Usual heuristic is TF.IDF (Term Frequency times Inverse Document Frequency)
TF.IDF Strikes Again!
f_{ij} = frequency of term t_i in document d_j; TF_{ij} = f_{ij} / \max_k f_{kj}
n_i = number of docs that mention term i; N = total number of docs; IDF_i = \log(N / n_i)
TF.IDF score: w_{ij} = TF_{ij} \times IDF_i
Doc profile = set of words with highest TF.IDF scores, together with their scores
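A sketch of building such a profile (toy corpus; the max-frequency TF normalization follows the definition above):

```python
import math
from collections import Counter

docs = {"d1": "the matrix is a science fiction movie".split(),
        "d2": "the lion king is an animated movie".split()}

N = len(docs)
df = Counter(t for words in docs.values() for t in set(words))  # n_i per term

def profile(words, top=3):
    tf = Counter(words)
    max_f = max(tf.values())
    scores = {t: (f / max_f) * math.log(N / df[t]) for t, f in tf.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

print(profile(docs["d1"]))   # the doc's highest-TF.IDF words with scores
```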
User profiles and prediction
• User profile possibilities:
– Weighted average of rated item profiles
– Variation: weight by difference from average rating for item
– …
• Prediction heuristic
– Given user profile c and item profile s, estimate u(c,s) = cos(c,s) = (c · s) / (|c| |s|)
– Need efficient method to find items with high utility
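A sketch of the weighted-average profile and the cosine prediction heuristic (toy feature vectors; item names are illustrative):

```python
import numpy as np

item_profiles = {"i1": np.array([1.0, 0.0, 1.0]),   # e.g., TF.IDF features
                 "i2": np.array([0.0, 1.0, 1.0]),
                 "i3": np.array([1.0, 1.0, 0.0])}
user_ratings = {"i1": 5.0, "i2": 2.0}

# user profile c: rating-weighted average of the rated items' profiles
c = sum(r * item_profiles[i] for i, r in user_ratings.items()) / len(user_ratings)

def utility(c, s):
    return float(c @ s / (np.linalg.norm(c) * np.linalg.norm(s)))

print(utility(c, item_profiles["i3"]))   # estimated utility of unseen item i3
```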
Model-based approaches
• For each user, learn a classifier that classifies items into rating classes
– liked by user and not liked by user
– e.g., Bayesian, regression, SVM
• Apply classifier to each item to find recommendation candidates
• Problem: scalability
– Can revisit when we cover classifiers in detail
Advantages of Content-Based Approach
• No need for data on other users.
– No cold-start or sparsity problems.
• Able to recommend to users with unique tastes.
• Able to recommend new and unpopular items
– No first-rater problem.
• Can provide explanations of recommended items by listing content-features that caused an item to be recommended.
Disadvantages of Content-Based Approach
• Requires content that can be encoded as meaningful features.
• Users’ tastes must be represented as a learnable function of these content features.
• Unable to exploit quality judgments of other users.
– Unless these are somehow included in the content features.
Combining Content and Collaboration
• Content-based and collaborative methods have complementary strengths and weaknesses.
• Combine methods to obtain the best of both.
• Various hybrid approaches:
– Apply both methods and combine recommendations.
– Use collaborative data as content.
– Use content-based predictor as another collaborator.
– Use content-based predictor to complete collaborative data.
Movie Domain
• EachMovie Dataset [Compaq Research Labs]
– Contains user ratings for movies on a 0–5 scale.
– 72,916 users (avg. 39 ratings each).
– 1,628 movies.
– Sparse user-ratings matrix – (2.6% full).
• Crawled Internet Movie Database (IMDb)
– Extracted content for titles in EachMovie.
• Basic movie information:
– Title, Director, Cast, Genre, etc.
• Popular opinions:
– User comments, Newspaper and Newsgroup reviews, etc.
Content-Boosted Collaborative Filtering (movie domain)
[Diagram: an EachMovie web crawler pulls movie content from IMDb into a movie content database; a content-based predictor uses it to fill the sparse user-ratings matrix into a full user-ratings matrix, which collaborative filtering combines with the active user's ratings to produce recommendations]
Content-Boosted CF - I
[Diagram: a user's rated items serve as training examples for a content-based predictor, which predicts ratings for the unrated items; the user-ratings vector plus these predicted ratings form a "pseudo user-ratings vector"]
Content-Boosted CF - II
• Compute a pseudo user-ratings matrix
– A full matrix, approximating the actual full user-ratings matrix
• Perform CF
– Using Pearson correlation between pseudo user-rating vectors
[Diagram: the content-based predictor maps the sparse user-ratings matrix to the full pseudo user-ratings matrix]
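A sketch of the pseudo-matrix construction (illustrative; content_predict stands in for any content-based model, here just the user's mean rating):

```python
import numpy as np

R = np.array([[5, 0, 4],     # toy user-ratings matrix, 0 = unrated
              [0, 2, 0],
              [3, 0, 0]], dtype=float)

def content_predict(u, i):
    """Hypothetical content-based predictor; here, the user's mean rating."""
    rated = R[u][R[u] > 0]
    return rated.mean() if rated.size else 3.0

pseudo = R.copy()
for u in range(R.shape[0]):
    for i in range(R.shape[1]):
        if pseudo[u, i] == 0:
            pseudo[u, i] = content_predict(u, i)

print(pseudo)   # dense pseudo matrix, ready for Pearson-based CF
```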
Classic Experiments (Konstan)
• Used a subset of EachMovie (7,893 users; 299,997 ratings)
• Test set: 10% of the users, selected at random
– Test users rated at least 40 movies
– Train on the remainder
• Hold-out set: 25% of items for each test user
– Predict the rating of each item in the hold-out set
• Compared CBCF to other prediction approaches:
– Pure CF
– Pure content-based
– Naïve hybrid (averages CF and content-based predictions)
Metrics
• Mean Absolute Error (MAE)
– Compares numerical predictions with user ratings
• ROC sensitivity [Herlocker 99]
– How well predictions help users select high-quality items
– Ratings ≥ 4 considered "good"; < 4 considered "bad"
• Paired t-test for statistical significance
Results - I
[Bar chart: MAE (y-axis ≈ 0.90–1.06) for CF, Content, Naïve, and CBCF]
CBCF is significantly better (4% over CF) at p < 0.001
Results - II
[Bar chart: ROC sensitivity (y-axis ≈ 0.58–0.68) for CF, Content, Naïve, and CBCF]
CBCF outperforms the rest (5% improvement over CF)
Novelty versus Trust
• There is a trade-off
– High-confidence recommendations
• Recommendations are obvious
• Low utility for the user
• However, they build trust
– Users like to see some recommendations that they know are right
– Recommendations with high predictions yet lower confidence
• Higher variability of error
• Higher novelty → higher utility for the user
– McLaughlin and Herlocker argue that "very obscure" recommendations are often bad (e.g., hard to obtain)
Sample Results: [Mooney et al., SIGIR 04]
• Much better at predicting top-rated movies
• Drawback: tends to predict blockbuster/already-popular movies
• A serendipity/trust trade-off
[Chart: Modified Precision at Top-N (Top 1 through Top 20, values ≈ 0–0.3) for User-to-User, Item-Item, and Distribution variants]
Recommender Systems in e-Commerce
• One recommender-systems research question:
– What should be in that list?
Recommender Systems in e-Commerce
• Another question, both in research and practice:
– How do we know that these are good recommendations?
Recommender Systems in e-Commerce
• This might lead to …
– What is a good recommendation?
– What is a good recommendation strategy?
– What is a good recommendation strategy for my business?
(Example nudges: "We hope you will also buy …", "These have been in stock for quite a while now …")
What is a good recommendation? What are the measures in practice?
• Total sales numbers
• Promotion of certain items
• …
• Click-through rates
• Interactivity on the platform
• …
• Customer return rates
• Customer satisfaction and loyalty
Purpose and success criteria (1)
• Different perspectives/aspects
– Depends on domain and purpose
– No holistic evaluation scenario exists
• Retrieval perspective
– Reduce search costs
– Provide "correct" proposals
– Assumption: users know in advance what they want
• Recommendation perspective
– Serendipity: identify items from the long tail
– Users did not know about their existence
When does a RS do its job well?
"Recommend widely unknown items that users might actually like!"
20% of items accumulate 74% of all positive ratings
Recommend items from the long tail
Purpose and success criteria (2)
• Prediction perspective
– Predict to what degree users like an item
– Most popular evaluation scenario in research
• Interaction perspective
– Give users a "good feeling"
– Educate users about the product domain
– Convince/persuade users; explain
• Finally, conversion perspective
– Commercial situations
– Increase "hit", "click-through", "lookers to bookers" rates
– Optimize sales margins and profit
How do we as researchers know?
• Test with real users
– A/B tests
– Example measures: sales increase, click-through rates
• Laboratory studies
– Controlled experiments
– Example measures: satisfaction with the system (questionnaires)
• Offline experiments
– Based on historical data
– Example measures: prediction accuracy, coverage
Empirical research
• Characterizing dimensions:
– Who is the subject that is in the focus of research?
– What research methods are applied?
– In which setting does the research take place?
– Subject: online customers, students, historical online sessions, computers, …
– Research method: experiments, quasi-experiments, non-experimental research
– Setting: lab, real-world scenarios
Research methods
• Experimental vs. non-experimental (observational) research methods
– Experiment (test, trial):
• "An experiment is a study in which at least one variable is manipulated and units are randomly assigned to different levels or categories of the manipulated variable(s)."
• Units: users, historic sessions, …
• Manipulated variable: type of RS, groups of recommended items, explanation strategies, …
• Categories of manipulated variable(s): content-based RS, collaborative RS
Experiment designs
Evaluation ~ IR Evaluation
• Recommendation is viewed as an information retrieval task:
– Retrieve (recommend) all items that are predicted to be "good" or "relevant"
• Common protocol:
– Hide some items with known ground truth
– Rank items or predict ratings → count → cross-validate
• Ground truth established by human domain experts

                 Actually Good          Actually Bad
Rated Good       True Positive (tp)     False Positive (fp)
Rated Bad        False Negative (fn)    True Negative (tn)
Dilemma of IR measures in RS
• IR measures are frequently applied; however:
– Ground truth for most items is actually unknown
– What is a relevant item?
– Different ways of measuring precision are possible
– Results from offline experimentation may have limited predictive power for online user behavior
Metrics: Rank Score – position matters
• Rank Score extends recall and precision to take the positions of correct items in a ranked list into account
– Particularly important in recommender systems, as lower-ranked items may be overlooked by users
– Learning-to-rank: optimize models for such measures (e.g., AUC)
[Example, for one user: actually good = {Item 237, Item 899}; recommended (predicted as good) = Item 345, Item 237, Item 187; Item 237 is a hit]
Accuracy measures
• Datasets with items rated by users
– MovieLens datasets: 100K–10M ratings
– Netflix: 100M ratings
• Historic user ratings constitute the ground truth
• Metrics measure error rate
– Mean Absolute Error (MAE) computes the deviation between predicted and actual ratings:

MAE = \frac{1}{|T|} \sum_{(u,i) \in T} |\hat{r}_{ui} - r_{ui}|

– Root Mean Square Error (RMSE) is similar to MAE, but puts more emphasis on larger deviations:

RMSE = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}
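A sketch of both metrics over (predicted, actual) pairs (toy numbers):

```python
import math

pairs = [(3.8, 4), (2.1, 2), (4.6, 5), (3.0, 1)]   # (predicted, actual)

mae = sum(abs(p - a) for p, a in pairs) / len(pairs)
rmse = math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")   # RMSE >= MAE; it punishes outliers
```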
Offline experimentation example
• Netflix competition
– Web-based movie rental
– Prize of $1,000,000 for an accuracy improvement (RMSE) of 10% over Netflix's own Cinematch system
• Historical dataset
– ~480K users rated ~18K movies on a scale of 1 to 5 (~100M ratings)
– Last 9 ratings per user withheld:
• Probe set – given to teams for evaluation
• Quiz set – evaluates teams' submissions for the leaderboard
• Test set – used by Netflix to determine the winner
• Today
– Rating prediction is seen as only one additional input into the recommendation process
An imperfect world
• Offline evaluation is the cheapest variant
– Still, it gives us valuable insights
– and lets us compare our results (in theory)
• Dangers and trends:
– Domination of accuracy measures
– Focus on a small set of domains (40% on movies in CS)
• Alternative and complementary measures:
– Diversity, coverage, novelty, familiarity, serendipity, popularity, concentration effects (long tail)
Netflix Challenge
Movie rating data
[Tables: training data as (user, movie, score) triples with known scores; test data as (user, movie, ?) pairs with scores withheld]
Netflix Prize
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005
• Test data
– Last few ratings of each user (2.8 million)
– Evaluation criterion: root mean squared error (RMSE)
– Netflix Cinematch RMSE: 0.9514
• Competition
– 2700+ teams
– $1 million grand prize for 10% improvement on Cinematch result
– $50,000 2007 progress prize for 8.43% improvement
Overall rating distribution
• A third of all ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
#ratings per movie
• Avg #ratings/movie: 5627
#ratings per user
• Avg #ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count     Avg rating   Title
137812    4.593        The Shawshank Redemption
133597    4.545        Lord of the Rings: The Return of the King
180883    4.306        The Green Mile
150676    4.460        Lord of the Rings: The Two Towers
139050    4.415        Finding Nemo
117456    4.504        Raiders of the Lost Ark
180736    4.299        Forrest Gump
147932    4.433        Lord of the Rings: The Fellowship of the Ring
149199    4.325        The Sixth Sense
144027    4.333        Indiana Jones and the Last Crusade
Important RMSEs (scale from accurate to erroneous)
Grand Prize: 0.8563; 10% improvement
BellKor: 0.8693; 8.63% improvement
Cinematch: 0.9514; baseline
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Inherent noise: ????
["Personalization" labels the gap between the non-personalized averages and the personalized systems]
Challenges
• Size of data
– Scalability
– Keeping data in memory
• Missing data
– 99 percent missing
– Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly
The Rules
• Improve RMSE metric by 10%
– RMSE = the square root of the averaged squared difference between each prediction and the actual rating
– amplifies the contributions of egregious errors, both false positives ("trust busters") and false negatives ("missed opportunities").
• 100 million “training” ratings provided
– User, Rating, Movie name, date
Lessons from the Netflix Prize
Yehuda Koren
The BellKor team
(with Robert Bell & Chris Volinsky)
The BellKor recommender system
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single, fully tuned model
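One common way to combine such an ensemble is to fit blend weights on a held-out probe set by least squares; a sketch under that assumption (toy numbers, not BellKor's actual blending):

```python
import numpy as np

probe_actual = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
preds = np.array([[3.8, 3.2, 4.6, 2.5, 3.9],    # predictor 1 on the probe set
                  [4.2, 2.7, 4.9, 2.1, 4.3],    # predictor 2
                  [3.5, 3.1, 4.4, 2.8, 3.7]])   # predictor 3

w, *_ = np.linalg.lstsq(preds.T, probe_actual, rcond=None)  # blend weights
blend = w @ preds
print(w, np.sqrt(np.mean((blend - probe_actual) ** 2)))     # blended RMSE
```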
Effect of ensemble size
[Plot: RMSE vs. number of predictors (1 to 50); error falls from ≈ 0.8925 toward ≈ 0.87 as predictors are added]
The BellKor recommender system
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single, fully tuned model
• But: many seemingly different models expose similar characteristics of the data and won't mix well
• Concentrate efforts along three axes…
The three dimensions of the BellKor system
The first axis:
• Multi-scale modeling of the data
• Combine top-level, regional modeling of the data with a refined, local view:
– k-NN: extracting local patterns
– Factorization: addressing regional effects
[Diagram: global effects → factorization → k-NN, from coarse to fine]
Multi-scale modeling – 1st tier
Global effects:
• Mean rating: 3.7 stars
• The Sixth Sense is 0.5 stars above average
• Joe rates 0.2 stars below average
→ Baseline estimation: Joe will rate The Sixth Sense 4 stars
Multi-scale modeling – 2nd tier
Factors model:
• Both The Sixth Sense and Joe are placed high on the "Supernatural Thrillers" scale
→ Adjusted estimate: Joe will rate The Sixth Sense 4.5 stars
Multi-scale modeling – 3rd tier
Neighborhood model:
• Joe didn't like the related movie Signs
→ Final estimate: Joe will rate The Sixth Sense 4.2 stars
The three dimensions of the BellKor system
The second axis: quality of modeling
• Make the best out of a model
• Strive for:
– Fundamental derivation
– Simplicity
– Avoiding overfitting
– Robustness against #iterations, parameter settings, etc.
• Optimizing is good, but don't overdo it!
The three dimensions of the BellKor system
• The third dimension will be discussed later…
• Next: moving the multi-scale view along the quality axis
[Diagram: global–local and quality axes; the third axis is marked "???"]
Local modeling through k-NN
• Earliest and most popular collaborative filtering method
• Derive unknown ratings from those of "similar" items (movie-movie variant)
• A parallel user-user flavor: rely on ratings of like-minded users (not in this talk)
k-NN
[Ratings matrix: movies (rows) × users 1–12 (columns); entries are ratings between 1 and 5, most are unknown]
k-NN
[Same matrix; task: estimate the rating of movie 1 by user 5, marked "?"]
k-NN
Neighbor selection: identify movies similar to movie 1 that were rated by user 5
k-NN
Compute similarity weights: s13 = 0.2, s16 = 0.3
k-NN
Predict by taking the weighted average: (0.2·2 + 0.3·3) / (0.2 + 0.3) = 2.6
Properties of k-NN
• Intuitive
• No substantial preprocessing is required
• Easy to explain reasoning behind a recommendation
• Accurate?
k-NN on the RMSE scale (accurate → erroneous)
Grand Prize: 0.8563
BellKor: 0.8693
k-NN variants: ≈ 0.91 – 0.96
Cinematch: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Inherent noise: ????
k-NN - Common practice
1. Define a similarity measure between items: s_{ij}
2. Select neighbors: N(i;u) = the items most similar to i that were rated by u
3. Estimate the unknown rating r_{ui} as the weighted average:

\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in N(i;u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in N(i;u)} s_{ij}}

where b_{ui} is a baseline estimate for r_{ui}
k-NN - Common practice
1. Define a similarity measure between items: s_{ij}
2. Select neighbors: N(i;u) = items similar to i, rated by u
3. Estimate the unknown rating r_{ui} as the weighted average (as above)

Problems:
1. Similarity measures are arbitrary; no fundamental justification
2. Pairwise similarities isolate each neighbor; they neglect interdependencies among neighbors
3. Taking a weighted average is restricting, e.g., when neighborhood information is limited
Interpolation weights
• Use a weighted sum rather than a weighted average:

\hat{r}_{ui} = b_{ui} + \sum_{j \in N(i;u)} w_{ij} (r_{uj} - b_{uj})

(we allow \sum_{j \in N(i;u)} w_{ij} \neq 1)
• Model relationships between item i and its neighbors
• Weights can be learnt through a least squares problem from all other users v that rated i:

\min_{w} \sum_{v \neq u} \Big( r_{vi} - b_{vi} - \sum_{j \in N(i;u)} w_{ij} (r_{vj} - b_{vj}) \Big)^2
Interpolation weights
• Interpolation weights are derived based on their role; no use of an arbitrary similarity measure
• Explicitly account for interrelationships among the neighbors
• Challenges:
– Dealing with missing values: the inner products among movie ratings needed by the least squares problem are mostly unknown and must be estimated
– Avoiding overfitting
– Efficient implementation
Latent factor models
[2-D illustration: movies placed on two latent axes, "geared towards females" vs. "geared towards males" and "serious" vs. "escapist"; e.g., The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, plus users Gus and Dave]
Latent factor models
[A rank-3 SVD approximation: the items × users rating matrix is approximated by the product of an items × 3 factor matrix and a 3 × users factor matrix]
Estimate unknown ratings as inner products of factors:
[Same rank-3 SVD approximation; an unknown cell "?" in the rating matrix is filled from the corresponding item-factor row and user-factor column]
Estimate unknown ratings as inner products of factors:
[The inner product of the item's factor row and the user's factor column yields the estimate; in the example, 2.4]
Latent factor models
Properties:
• SVD isn't defined when entries are unknown → use specialized methods
• Very powerful model; can easily overfit, sensitive to regularization
• Probably the most popular model among contestants
– 12/11/2006: Simon Funk describes an SVD-based method
– 12/29/2006: Free implementation at timelydevelopment.com
Factorization on the RMSE scale (accurate → erroneous)
Grand Prize: 0.8563
BellKor: 0.8693
Factorization variants: ≈ 0.89 – 0.93
Cinematch: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Inherent noise: ????
Our approach
• User factors: model a user u as a vector p_u ~ N_k(\mu, \Sigma)
• Movie factors: model a movie i as a vector q_i ~ N_k(\gamma, \Lambda)
• Ratings: measure "agreement" between u and i: r_{ui} ~ N(p_u^T q_i, \varepsilon^2)
• Maximize the model's likelihood:
– Alternate between recomputing user factors, movie factors, and model parameters
– Special cases:
• Alternating ridge regression
• Nonnegative matrix factorization
Combining multi-scale views
[Diagram: global effects → factorization (regional effects) → k-NN (local effects), combined either by residual fitting or by a weighted average; alternatively, a unified model of k-NN + factorization]
Localized factorization model
• Standard factorization: user u is a linear function parameterized by p_u: r_{ui} = p_u^T q_i
• Allow the user factors p_u to depend on the item being predicted: r_{ui} = p_u(i)^T q_i
• The vector p_u(i) models the behavior of u on items like i
RMSE vs. #Factors
[Plot, results on the Netflix Probe set: RMSE (≈ 0.905–0.935) vs. number of factors (0–80); localized factorization is consistently more accurate than plain factorization, and both improve with more factors]
Top contenders for Progress Prize 2007
[Plot: % improvement over Cinematch (0–10%) from 10/2006 to 10/2007 for teams including BellKor, Gravity, wxyzConsulting, ML@Toronto, and "How low can he go?"; BellKor leads, its multi-scale modeling approaching but not reaching the 10% grand prize line]
Seek alternative perspectives of the data
• Can exploit movie titles and release year
• But the movie side is pretty much covered anyway…
• It's about the users!
• Turning to the third dimension…
The third dimension of the BellKor system
• A powerful source of information: characterize users by which movies they rated, rather than how they rated them
• A binary representation of the data
[Matrices: the movies × users rating matrix alongside its binary version, where each cell is 1 if the user rated the movie and 0 otherwise]
The third dimension of the BellKor system
Great news for recommender systems:
• Works even better on real-life datasets (?)
• Improve accuracy by exploiting implicit feedback
• Implicit behavior is abundant and easy to collect:
– Rental history
– Search patterns
– Browsing history
– …
• Implicit feedback allows predicting personalized ratings for users who never rated!
The three dimensions of the BellKor system
[Axes: global–local, quality, and explicit (ratings) – implicit (binary)]
Where do you want to be?
• All over the global-local axis
• Relatively high on the quality axis
• Can we go in between on the explicit-implicit axis?
• Yes! Relevant methods:
– Conditional RBMs [ML@UToronto]
– NSVD [Arek Paterek]
– Asymmetric factor models
Lessons
What it takes to win:
1. Think deeper – design better algorithms
2. Think broader – use an ensemble of multiple predictors
3. Think different – model the data from different perspectives
At the personal level:
1. Have fun with the data
2. Work hard, long breath
3. Good teammates
Rapid progress of science:
1. Availability of large, real-life data
2. Challenge, competition
3. Effective collaboration
Yehuda Koren
AT&T Labs – Research
BellKor homepage: www.research.att.com/~volinsky/netflix/
Summary
• The most effective approaches are often hybrids of collaborative and content-based filtering
– cf. Netflix, Google News recommendations
• Lots of data available, but algorithm scalability is critical
– (use Map/Reduce, Hadoop)
• Suggested rules of thumb (take with a grain of salt):
– If you have moderate amounts of rating data for each user, use collaborative filtering with kNN (a reasonable baseline)
– When little data is available, use content-based (model-based) recommendations
Resources
• Recommender Systems Survey: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1423975
• Koren et al. on winning the Netflix challenge:
– http://videolectures.net/kdd09_koren_cftd/
– http://www.searchenginecaffe.com/2009/09/winning-1m-netflix-prize-methods.html
• http://www.recommenderbook.net/