CS 572: Information Retrieval
Recommender Systems
Acknowledgements
Many slides in this lecture are adapted from Xavier Amatriain (Netflix) and Y. Koren (Yahoo!).
Big Picture
Part 1: Classical IR: indexing, ranking, metrics
Part 2: Web: crawling, indexing, social web
Part 3: User-centric Information Access
– Recommender Systems
– Personalized Search
– Search interfaces and “intelligent assistants”
Recommender Systems
• Recommender Systems: general approaches (today)
– Applications: News, Social recommendation (Wed)
• User feedback in Search (Wed)
• Personalized search (4/11)
• Search interfaces (4/13)
• Question Answering (4/18, Denis Savenkov)
– Collaborative Filtering
• Application: News Recommendation
– Wed: Applications to Search
• Improving Search Ranking with User Feedback
Logistics for next 2 weeks
• Monday, 18 April: Question Answering (Denis Savenkov)
• Wed, April 20: Final project Milestone presentations:
– 10-minutes (max 8 slides)
– Email slides or URL to me by Tuesday night, 4/19.
Information Retrieval Until Now
[Diagram: a query string and a document corpus feed into an IR system, which returns a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, …]
Information Overload Is Real
Recommender Systems Help
Problem Definition: Recommendation
Recommendation: Formal Problem
RS: Utility Matrix
[Example utility matrix: users Alice, Bob, Carol, David × movies King Kong, LOTR, Matrix, Nacho Libre. Only a few entries are known (values such as 1, 0.2, 0.3, 0.5, 0.4); most cells are empty.]
Recommendation System: Process
Paradigms of recommender systems
Recommender systems reduce information overload by estimating relevance
Paradigms of recommender systems
Personalized recommendations
Paradigms of recommender systems
Collaborative: "Tell me what's popular among my peers"
Paradigms of recommender systems
Content-based: "Show me more of the same things I've liked"
Paradigms of recommender systems
Knowledge-based: "Tell me what fits based on my needs"
Paradigms of recommender systems
Hybrid: combinations of various inputs and/or composition of different mechanism
Key Problems
• Gathering “known” ratings for matrix
• Extrapolate unknown ratings from known ratings
– Mainly interested in high unknown ratings
• Evaluating extrapolation methods
Gathering Ratings
• Explicit
– Ask people to rate items
– Doesn’t work well in practice – people can’t be bothered
• Implicit
– Learn ratings from user actions
• e.g., purchase implies high rating
– What about low ratings?
Rating interface
Extrapolating Utilities
• Key problem: matrix U is sparse
– most people have not rated most items
• Three approaches
– Content-based
– Collaborative
– Hybrid
Main Approaches
Tradeoffs
• Collaborative – Pros: no knowledge-engineering effort, serendipity of results, learns market segments. Cons: requires some form of rating feedback, cold start for new users and new items.
• Content-based – Pros: no community required, comparison between items possible. Cons: content descriptions necessary, cold start for new users, no surprises.
• Knowledge-based – Pros: deterministic recommendations, assured quality, no cold start, can resemble a sales dialogue. Cons: knowledge-engineering effort to bootstrap, basically static, does not react to short-term trends.
What works (in “real world”)
COLLABORATIVE FILTERING
Collaborative Filtering (CF)
• The most prominent approach to generating recommendations
– used by large, commercial e-commerce sites
– well-understood; various algorithms and variations exist
– applicable in many domains (books, movies, DVDs, …)
• Approach
– use the "wisdom of the crowd" to recommend items
• Basic assumption and idea
– users give ratings to catalog items (implicitly or explicitly)
– customers who had similar tastes in the past will have similar tastes in the future
User-based nearest-neighbor collaborative filtering (1)
• The basic technique:
– Given an "active user" (Alice) and an item i not yet seen by Alice
– The goal is to estimate Alice's rating for this item, e.g., by:
• finding a set of users (peers) who liked the same items as Alice in the past and who have rated item i
• using, e.g., the average of their ratings to predict whether Alice will like item i
• doing this for all items Alice has not seen and recommending the best-rated
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
Collaborative Filtering in Action
[Diagram: a user database of rating vectors over items A…Z (e.g., one user has A=9, B=3, Z=5; another has C=9, Z=10; …). The active user's vector (A=10, B=4, C=8, …, Z=1) is correlation-matched against the database, and the ratings of the best-matching users are used to extract recommendations (here, item C).]
User-based nearest-neighbor collaborative filtering (2)
• Some first questions:
– How do we measure similarity?
– How many neighbors should we consider?
– How do we generate a prediction from the neighbors' ratings?
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
Measuring user similarity
• A popular similarity measure in user-based CF: Pearson correlation

sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \, \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}

– a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated by both a and b; \bar{r}_a, \bar{r}_b: the users' average ratings
– Possible similarity values are between -1 and 1
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
sim(Alice, User1) = 0.85; sim(Alice, User2) = 0.70; sim(Alice, User4) = -0.79
Pearson correlation
• Takes differences in rating behavior into account
• Works well in usual domains, compared with alternative measures such as cosine similarity
[Chart: ratings of Alice, User1, and User4 on Item1–Item4 (scale 0–6), illustrating differences in individual rating behavior]
Making predictions
• A common prediction function:

pred(a,p) = \bar{r}_a + \frac{\sum_{b \in N} sim(a,b) \cdot (r_{b,p} - \bar{r}_b)}{\sum_{b \in N} sim(a,b)}

• Calculate whether the neighbors' ratings for the unseen item are higher or lower than their average
• Combine the rating differences, using the similarity as a weight
• Add/subtract the neighbors' bias from the active user's average and use this as a prediction
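A sketch of this prediction function, reusing pearson() from the previous snippet (illustrative; the neighborhood size k and dict layout are assumptions):

```python
def predict(ratings, active, item, k=2):
    """Predict `active`'s rating for `item` from the k most similar raters."""
    mean_active = sum(ratings[active].values()) / len(ratings[active])
    peers = sorted(((pearson(ratings, active, u), u)
                    for u in ratings if u != active and item in ratings[u]),
                   reverse=True)[:k]                       # top-k neighbors
    num = sum(s * (ratings[u][item] -
                   sum(ratings[u].values()) / len(ratings[u]))
              for s, u in peers)
    den = sum(abs(s) for s, u in peers)
    # fall back to the user's own average when no usable neighbors exist
    return mean_active + num / den if den > 0 else mean_active
```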
Significance Weighting
• Important not to trust correlations based on very few co-rated items.
• Include significance weights s_{a,u}, based on the number m of co-rated items:

w_{a,u} = s_{a,u} \cdot c_{a,u}, \quad s_{a,u} = \begin{cases} m/50 & \text{if } m < 50 \\ 1 & \text{if } m \geq 50 \end{cases}

where c_{a,u} is the correlation between users a and u.
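A one-line sketch of this weighting (illustrative helper name):

```python
def significance_weight(corr, m, cutoff=50):
    """Damp a correlation that rests on fewer than `cutoff` co-rated items."""
    return corr * min(m, cutoff) / cutoff
```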
Collaborative Filtering Method
• Weight all users with respect to similarity with the active user.
• Select a subset of the users (neighbors) to use as predictors.
• Normalize ratings and compute a prediction from a weighted combination of the selected neighbors’ ratings.
• Present items with highest predicted ratings as recommendations.
Making recommendations
• Making predictions is typically not the ultimate goal
• Usual approach (in academia)
– Rank items based on their predicted ratings
• However
– This might lead to the inclusion of (only) niche items
– In practice also: Take item popularity into account
• Approaches
– "Learning to rank"
• Optimize according to a given rank evaluation metric (see later)
Problems with Collaborative Filtering
• Cold Start: There needs to be enough other users already in the system to find a match.
• Sparsity: If there are many items to be recommended, even if there are many users, the user/ratings matrix is sparse, and it is hard to find users that have rated the same items.
• First Rater: Cannot recommend an item that has not been previously rated
– New items
– Esoteric items
• Popularity Bias: Cannot recommend items to someone with unique tastes
– Tends to recommend popular items
Improving the metrics / prediction function
• Not all neighbor ratings might be equally "valuable"
– Agreement on commonly liked items is not as informative as agreement on controversial items
– Possible solution: give more weight to items that have a higher variance
• Value of the number of co-rated items
– Use "significance weighting", e.g., by linearly reducing the weight when the number of co-rated items is low
• Case amplification
– Intuition: give more weight to "very similar" neighbors, i.e., where the similarity value is close to 1
• Neighborhood selection
– Use a similarity threshold or a fixed number of neighbors
Memory-based and model-based approaches
• User-based CF is said to be "memory-based"
– the rating matrix is directly used to find neighbors / make predictions
– does not scale for most real-world scenarios
– large e-commerce sites have tens of millions of customers and millions of items
• Model-based approaches
– based on an offline pre-processing or "model-learning" phase
– at run time, only the learned model is used to make predictions
– models are updated / re-trained periodically
– large variety of techniques used
– model building and updating can be computationally expensive
2001: Item-based collaborative filtering recommendation algorithms, B.
Sarwar et al., WWW 2001
• Scalability issues arise with user-to-user CF if there are many more users than items (m >> n, m = |users|, n = |items|)
– e.g., Amazon.com
– Space complexity O(m²) when pre-computed
– Time complexity for computing Pearson: O(m²n)
• High sparsity leads to few common ratings between two users
• Basic idea: "Item-based CF exploits relationships between items first, instead of relationships between users"
Item-based collaborative filtering
• Basic idea:
– Use the similarity between items (and not users) to make predictions
• Example:
– Look for items that are similar to Item5
– Take Alice's ratings for these items to predict the rating for Item5
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
The cosine similarity measure
• Produces better results in item-to-item filtering
– for some datasets; no consistent picture in the literature
• Ratings are seen as vectors in n-dimensional space
• Similarity is calculated based on the angle between the vectors
• Adjusted cosine similarity
– takes average user ratings into account, transforming the original ratings:

sim(a,b) = \frac{\sum_{u \in U} (r_{u,a} - \bar{r}_u)(r_{u,b} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,a} - \bar{r}_u)^2} \, \sqrt{\sum_{u \in U} (r_{u,b} - \bar{r}_u)^2}}

– U: set of users who have rated both items a and b
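A sketch of adjusted cosine on a small dense matrix (illustrative; rows are users, columns are items, 0 marks a missing rating):

```python
import numpy as np

def adjusted_cosine(R, a, b):
    """Adjusted cosine similarity between item columns a and b of R."""
    both = (R[:, a] > 0) & (R[:, b] > 0)        # users who rated both items
    if not both.any():
        return 0.0
    means = np.nanmean(np.where(R > 0, R, np.nan), axis=1)  # user averages
    da = R[both, a] - means[both]               # mean-centered ratings
    db = R[both, b] - means[both]
    denom = np.linalg.norm(da) * np.linalg.norm(db)
    return float(da @ db / denom) if denom > 0 else 0.0
```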
Pre-processing for item-based filtering
• Item-based filtering does not solve the scalability problem by itself
• Pre-processing approach by Amazon.com (in 2003)
– Calculate all pairwise item similarities in advance
– The neighborhood used at run time is typically rather small, because only items the user has rated are taken into account
– Item similarities are supposed to be more stable than user similarities
• Memory requirements
– Up to N² pairwise similarities to be memorized (N = number of items) in theory
– In practice, this is significantly lower (items with no co-ratings)
– Further reductions possible
• Minimum threshold for co-ratings (items rated by at least n users)
• Limit the size of the neighborhood (might affect recommendation accuracy)
More on ratings
• Pure CF-based systems rely only on the rating matrix
• Explicit ratings
– Most commonly used (1-to-5, 1-to-7 Likert response scales)
– Research topics
• "Optimal" granularity of scale; indication that a 10-point scale is better accepted in the movie domain
• Multidimensional ratings (multiple ratings per movie)
– Challenge
• Users are not always willing to rate many items; sparse rating matrices
• How to stimulate users to rate more items?
• Implicit ratings
– clicks, page views, time spent on some page, demo downloads, …
– Can be used in addition to explicit ones; question of correctness of interpretation
Data sparsity problems
• Cold start problem
– How to recommend new items? What to recommend to new users?
• Straightforward approaches
– Ask/force users to rate a set of items
– Use another method (e.g., content-based, demographic, or simply non-personalized) in the initial phase
• Alternatives
– Use better algorithms (beyond nearest-neighbor approaches)
– Example:
• In nearest-neighbor approaches, the set of sufficiently similar neighbors might be too small to make good predictions
• Assume "transitivity" of neighborhoods
Example algorithms for sparse datasets
• Recursive CF
– Assume there is a very close neighbor n of u who, however, has not rated the target item i yet
– Idea:
• Apply the CF method recursively and predict a rating for item i for the neighbor
• Use this predicted rating instead of the rating of a more distant direct neighbor
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 ?
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
sim = 0.85
Predict rating for User1
Graph-based methods
• "Spreading activation" (sketch)
– Idea: use indirect paths in the user–item graph to recommend items
– Length 3: recommend Item3 to User1
– Length 5: Item1 also becomes recommendable
More model-based approaches
• Plethora of different techniques proposed in recent years, e.g.,
– Matrix factorization techniques, statistics
• singular value decomposition, principal component analysis
– Association rule mining
• compare: shopping basket analysis
– Probabilistic models
• clustering models, Bayesian networks, probabilistic Latent Semantic Analysis
– Various other machine learning approaches
• Costs of pre-processing
– Usually not discussed
– Incremental updates possible?
2000: Application of Dimensionality Reduction in
Recommender System, B. Sarwar et al., WebKDD Workshop
• Basic idea: Trade more complex offline model building for faster online prediction generation
• Singular Value Decomposition for dimensionality reduction of rating matrices
– Captures important factors/aspects and their weights in the data
– factors can be genre, actors but also non-understandable ones
– Assumption that k dimensions capture the signals and filter out noise (k = 20 to 100)
• Constant time to make recommendations
• Approach also popular in IR (Latent Semantic Indexing), data compression, …
A picture says …
[Scatter plot: users Alice, Bob, Mary, and Sue placed in the two-dimensional latent space (both axes from -1 to 1) obtained by the decomposition]
Matrix factorization
• SVD: M_k = U_k \Sigma_k V_k^T

[Example factor matrices:
U_k: Alice (0.47, -0.30), Bob (-0.44, 0.23), Mary (0.70, -0.06), Sue (0.31, 0.93)
\Sigma_k = diag(5.63, 3.23)
V_k^T: Dim1 = (-0.44, -0.57, 0.06, 0.38, 0.57), Dim2 = (0.58, -0.66, 0.26, 0.18, -0.36)]

• Prediction: \hat{r}_{ui} = \bar{r}_u + U_k(Alice) \cdot \Sigma_k \cdot V_k^T(EPL) = 3 + 0.84 = 3.84
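A sketch of the rank-k idea with NumPy (toy data; real systems must first handle missing entries, e.g., by mean-filling, since plain SVD needs a complete matrix):

```python
import numpy as np

R = np.array([[5, 3, 4, 4, 3],     # toy ratings: rows = users, cols = items
              [3, 1, 2, 3, 3],
              [4, 3, 4, 3, 5],
              [3, 3, 1, 5, 4]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation
print(np.round(R_k, 2))                        # smoothed/predicted ratings
```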
2008: Factorization meets the neighborhood: a multifaceted
collaborative filtering model, Y. Koren, ACM SIGKDD
• Stimulated by work on the Netflix competition
– Prize of $1,000,000 for a 10% RMSE accuracy improvement over Netflix's own Cinematch system
– Very large dataset (~100M ratings, ~480K users, ~18K movies)
– Last ratings per user withheld (set K)
• Root mean squared error metric, optimized to 0.8567:

RMSE = \sqrt{\frac{\sum_{(u,i) \in K} (\hat{r}_{ui} - r_{ui})^2}{|K|}}
• Merges neighborhood models with latent factor models
– Latent factor models: good at capturing weak signals in the overall data
– Neighborhood models: good at detecting strong relationships between close items
• Combination in a single prediction function:

\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i

– Parameters determined by a local search method such as stochastic gradient descent
– A penalty on large values is added to avoid over-fitting:

\min_{p_*, q_*, b_*} \sum_{(u,i) \in K} \big( r_{ui} - \mu - b_u - b_i - p_u^T q_i \big)^2 + \lambda \big( \|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2 \big)
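A sketch of this biased factorization model trained with stochastic gradient descent (toy data; the learning rate, regularization, and epoch count are illustrative, not the paper's):

```python
import random
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0)]
n_users, n_items, k = 3, 3, 2
lr, lam, epochs = 0.01, 0.05, 200

mu = np.mean([r for _, _, r in ratings])        # global average
bu, bi = np.zeros(n_users), np.zeros(n_items)   # user and item biases
P = np.random.normal(0, 0.1, (n_users, k))      # user factors p_u
Q = np.random.normal(0, 0.1, (n_items, k))      # item factors q_i

for _ in range(epochs):
    random.shuffle(ratings)
    for u, i, r in ratings:
        err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
        bu[u] += lr * (err - lam * bu[u])        # gradient steps with
        bi[i] += lr * (err - lam * bi[i])        # L2 shrinkage (the penalty)
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - lam * P[u]),
                      Q[i] + lr * (err * P[u] - lam * Q[i]))

print(mu + bu[0] + bi[2] + P[0] @ Q[2])          # predicted rating for (0, 2)
```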
Collaborative Filtering Issues
• Pros:
– well-understood, works well in some domains, no knowledge engineering required
• Cons:
– requires a user community, sparsity problems, no integration of other knowledge sources, no explanation of results
• What is the best CF method?
– In which situation and which domain? Inconsistent findings; always the same domains and data sets; differences between methods are often very small (1/100)
• How do we evaluate prediction quality?
– MAE / RMSE: what does an MAE of 0.7 actually mean?
– Serendipity: not yet fully understood
• What about multi-dimensional ratings?
Content-based recommendations
• Main idea: recommend items to customer C similar to previous items rated highly by C
• Movie recommendations
– recommend movies with same actor(s), director, genre, …
• Websites, blogs, news
– recommend other sites with “similar” content
Overall idea
[Diagram: items the user likes are summarized into item profiles (features such as "red", "circles", "triangles"); these build a user profile, which is then matched against item profiles to recommend new items]
Item Profiles
• For each item, create an item profile
• Profile is a set of features
– movies: author, title, actor, director, …
– text: set of "important" words in document
• How to pick important words?
– Usual heuristic is TF.IDF (Term Frequency times Inverse Document Frequency)
TF.IDF Strikes Again!
f_{ij} = frequency of term t_i in document d_j; TF_{ij} = f_{ij} / \max_k f_{kj}
n_i = number of docs that mention term i; N = total number of docs; IDF_i = \log(N / n_i)
TF.IDF score: w_{ij} = TF_{ij} \times IDF_i
Doc profile = set of words with highest TF.IDF scores, together with their scores
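A sketch of building such a profile (toy corpus; the max-frequency TF normalization follows the definition above):

```python
import math
from collections import Counter

docs = {"d1": "the matrix is a science fiction movie".split(),
        "d2": "the lion king is an animated movie".split()}

N = len(docs)
df = Counter(t for words in docs.values() for t in set(words))  # n_i per term

def profile(words, top=3):
    tf = Counter(words)
    max_f = max(tf.values())
    scores = {t: (f / max_f) * math.log(N / df[t]) for t, f in tf.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

print(profile(docs["d1"]))   # the doc's highest-TF.IDF words with scores
```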
User profiles and prediction
• User profile possibilities:
– Weighted average of rated item profiles
– Variation: weight by difference from average rating for item
– …
• Prediction heuristic
– Given user profile c and item profile s, estimate u(c,s) = cos(c,s) = (c · s) / (|c| |s|)
– Need efficient method to find items with high utility
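A sketch of the weighted-average profile and the cosine prediction heuristic (toy feature vectors; item names are illustrative):

```python
import numpy as np

item_profiles = {"i1": np.array([1.0, 0.0, 1.0]),   # e.g., TF.IDF features
                 "i2": np.array([0.0, 1.0, 1.0]),
                 "i3": np.array([1.0, 1.0, 0.0])}
user_ratings = {"i1": 5.0, "i2": 2.0}

# user profile c: rating-weighted average of the rated items' profiles
c = sum(r * item_profiles[i] for i, r in user_ratings.items()) / len(user_ratings)

def utility(c, s):
    return float(c @ s / (np.linalg.norm(c) * np.linalg.norm(s)))

print(utility(c, item_profiles["i3"]))   # estimated utility of unseen item i3
```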
Model-based approaches
• For each user, learn a classifier that classifies items into rating classes
– liked by user and not liked by user
– e.g., Bayesian, regression, SVM
• Apply classifier to each item to find recommendation candidates
• Problem: scalability
– Can revisit when we cover classifiers in detail
Advantages of Content-Based Approach
• No need for data on other users.
– No cold-start or sparsity problems.
• Able to recommend to users with unique tastes.
• Able to recommend new and unpopular items
– No first-rater problem.
• Can provide explanations of recommended items by listing content-features that caused an item to be recommended.
Disadvantages of Content-Based Approach
• Requires content that can be encoded as meaningful features.
• Users’ tastes must be represented as a learnable function of these content features.
• Unable to exploit quality judgments of other users.
– Unless these are somehow included in the content features.
Combining Content and Collaboration
• Content-based and collaborative methods have complementary strengths and weaknesses.
• Combine methods to obtain the best of both.
• Various hybrid approaches:
– Apply both methods and combine recommendations.
– Use collaborative data as content.
– Use content-based predictor as another collaborator.
– Use content-based predictor to complete collaborative data.
Movie Domain
• EachMovie Dataset [Compaq Research Labs]
– Contains user ratings for movies on a 0–5 scale.
– 72,916 users (avg. 39 ratings each).
– 1,628 movies.
– Sparse user-ratings matrix – (2.6% full).
• Crawled Internet Movie Database (IMDb)
– Extracted content for titles in EachMovie.
• Basic movie information:
– Title, Director, Cast, Genre, etc.
• Popular opinions:
– User comments, Newspaper and Newsgroup reviews, etc.
Content-Boosted Collaborative Filtering (movie domain)
[Diagram: an EachMovie web crawler pulls movie content from IMDb into a movie content database; a content-based predictor uses it to fill the sparse user-ratings matrix into a full user-ratings matrix, which collaborative filtering combines with the active user's ratings to produce recommendations]
Content-Boosted CF - I
[Diagram: a user's rated items serve as training examples for a content-based predictor, which predicts ratings for the unrated items; the user-ratings vector plus these predicted ratings form a "pseudo user-ratings vector"]
Content-Boosted CF - II
• Compute a pseudo user-ratings matrix
– A full matrix, approximating the actual full user-ratings matrix
• Perform CF
– Using Pearson correlation between pseudo user-rating vectors
[Diagram: the content-based predictor maps the sparse user-ratings matrix to the full pseudo user-ratings matrix]
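A sketch of the pseudo-matrix construction (illustrative; content_predict stands in for any content-based model, here just the user's mean rating):

```python
import numpy as np

R = np.array([[5, 0, 4],     # toy user-ratings matrix, 0 = unrated
              [0, 2, 0],
              [3, 0, 0]], dtype=float)

def content_predict(u, i):
    """Hypothetical content-based predictor; here, the user's mean rating."""
    rated = R[u][R[u] > 0]
    return rated.mean() if rated.size else 3.0

pseudo = R.copy()
for u in range(R.shape[0]):
    for i in range(R.shape[1]):
        if pseudo[u, i] == 0:
            pseudo[u, i] = content_predict(u, i)

print(pseudo)   # dense pseudo matrix, ready for Pearson-based CF
```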
Classic Experiments (Konstan)
• Used a subset of EachMovie (7,893 users; 299,997 ratings)
• Test set: 10% of the users, selected at random
– Test users rated at least 40 movies
– Train on the remainder
• Hold-out set: 25% of items for each test user
– Predict the rating of each item in the hold-out set
• Compared CBCF to other prediction approaches:
– Pure CF
– Pure content-based
– Naïve hybrid (averages CF and content-based predictions)
Metrics
• Mean Absolute Error (MAE)
– Compares numerical predictions with user ratings
• ROC sensitivity [Herlocker 99]
– How well predictions help users select high-quality items
– Ratings ≥ 4 considered "good"; < 4 considered "bad"
• Paired t-test for statistical significance
Results - I
[Bar chart: MAE (y-axis ≈ 0.90–1.06) for CF, Content, Naïve, and CBCF]
CBCF is significantly better (4% over CF) at p < 0.001
Results - II
[Bar chart: ROC sensitivity (y-axis ≈ 0.58–0.68) for CF, Content, Naïve, and CBCF]
CBCF outperforms the rest (5% improvement over CF)
Novelty versus Trust
• There is a trade-off
– High-confidence recommendations
• Recommendations are obvious
• Low utility for the user
• However, they build trust
– Users like to see some recommendations that they know are right
– Recommendations with high predictions yet lower confidence
• Higher variability of error
• Higher novelty → higher utility for the user
– McLaughlin and Herlocker argue that "very obscure" recommendations are often bad (e.g., hard to obtain)
Sample Results: [Mooney et al., SIGIR 04]
• Much better at predicting top-rated movies
• Drawback: tends to predict blockbuster/already-popular movies
• A serendipity/trust trade-off
[Chart: Modified Precision at Top-N (Top 1 through Top 20, values ≈ 0–0.3) for User-to-User, Item-Item, and Distribution variants]
Recommender Systems in e-Commerce
• One recommender-systems research question:
– What should be in that list?
Recommender Systems in e-Commerce
• Another question, both in research and practice:
– How do we know that these are good recommendations?
Recommender Systems in e-Commerce
• This might lead to …
– What is a good recommendation?
– What is a good recommendation strategy?
– What is a good recommendation strategy for my business?
(Example nudges: "We hope you will also buy …", "These have been in stock for quite a while now …")
What is a good recommendation? What are the measures in practice?
• Total sales numbers
• Promotion of certain items
• …
• Click-through rates
• Interactivity on the platform
• …
• Customer return rates
• Customer satisfaction and loyalty
Purpose and success criteria (1)
• Different perspectives/aspects
– Depends on domain and purpose
– No holistic evaluation scenario exists
• Retrieval perspective
– Reduce search costs
– Provide "correct" proposals
– Assumption: users know in advance what they want
• Recommendation perspective
– Serendipity: identify items from the long tail
– Users did not know about their existence
When does a RS do its job well?
"Recommend widely unknown items that users might actually like!"
20% of items accumulate 74% of all positive ratings
Recommend items from the long tail
Purpose and success criteria (2)
• Prediction perspective
– Predict to what degree users like an item
– Most popular evaluation scenario in research
• Interaction perspective
– Give users a "good feeling"
– Educate users about the product domain
– Convince/persuade users; explain
• Finally, conversion perspective
– Commercial situations
– Increase "hit", "click-through", "lookers to bookers" rates
– Optimize sales margins and profit
How do we as researchers know?
• Test with real users
– A/B tests
– Example measures: sales increase, click-through rates
• Laboratory studies
– Controlled experiments
– Example measures: satisfaction with the system (questionnaires)
• Offline experiments
– Based on historical data
– Example measures: prediction accuracy, coverage
Empirical research
• Characterizing dimensions:
– Who is the subject that is in the focus of research?
– What research methods are applied?
– In which setting does the research take place?
– Subject: online customers, students, historical online sessions, computers, …
– Research method: experiments, quasi-experiments, non-experimental research
– Setting: lab, real-world scenarios
Research methods
• Experimental vs. non-experimental (observational) research methods
– Experiment (test, trial):
• "An experiment is a study in which at least one variable is manipulated and units are randomly assigned to different levels or categories of the manipulated variable(s)."
• Units: users, historic sessions, …
• Manipulated variable: type of RS, groups of recommended items, explanation strategies, …
• Categories of manipulated variable(s): content-based RS, collaborative RS
Experiment designs
Evaluation ~ IR Evaluation
• Recommendation is viewed as an information retrieval task:
– Retrieve (recommend) all items that are predicted to be "good" or "relevant"
• Common protocol:
– Hide some items with known ground truth
– Rank items or predict ratings → count → cross-validate
• Ground truth established by human domain experts

                 Actually Good          Actually Bad
Rated Good       True Positive (tp)     False Positive (fp)
Rated Bad        False Negative (fn)    True Negative (tn)
Dilemma of IR measures in RS
• IR measures are frequently applied; however:
– Ground truth for most items is actually unknown
– What is a relevant item?
– Different ways of measuring precision are possible
– Results from offline experimentation may have limited predictive power for online user behavior
Metrics: Rank Score – position matters
• Rank Score extends recall and precision to take the positions of correct items in a ranked list into account
– Particularly important in recommender systems, as lower-ranked items may be overlooked by users
– Learning-to-rank: optimize models for such measures (e.g., AUC)
[Example, for one user: actually good = {Item 237, Item 899}; recommended (predicted as good) = Item 345, Item 237, Item 187; Item 237 is a hit]
Accuracy measures
• Datasets with items rated by users
– MovieLens datasets: 100K–10M ratings
– Netflix: 100M ratings
• Historic user ratings constitute the ground truth
• Metrics measure error rate
– Mean Absolute Error (MAE) computes the deviation between predicted and actual ratings:

MAE = \frac{1}{|T|} \sum_{(u,i) \in T} |\hat{r}_{ui} - r_{ui}|

– Root Mean Square Error (RMSE) is similar to MAE, but puts more emphasis on larger deviations:

RMSE = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}
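A sketch of both metrics over (predicted, actual) pairs (toy numbers):

```python
import math

pairs = [(3.8, 4), (2.1, 2), (4.6, 5), (3.0, 1)]   # (predicted, actual)

mae = sum(abs(p - a) for p, a in pairs) / len(pairs)
rmse = math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")   # RMSE >= MAE; it punishes outliers
```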
Offline experimentation example
• Netflix competition
– Web-based movie rental
– Prize of $1,000,000 for an accuracy improvement (RMSE) of 10% over Netflix's own Cinematch system
• Historical dataset
– ~480K users rated ~18K movies on a scale of 1 to 5 (~100M ratings)
– Last 9 ratings per user withheld:
• Probe set – given to teams for evaluation
• Quiz set – evaluates teams' submissions for the leaderboard
• Test set – used by Netflix to determine the winner
• Today
– Rating prediction is seen as only one additional input into the recommendation process
An imperfect world
• Offline evaluation is the cheapest variant
– Still, it gives us valuable insights
– and lets us compare our results (in theory)
• Dangers and trends:
– Domination of accuracy measures
– Focus on a small set of domains (40% on movies in CS)
• Alternative and complementary measures:
– Diversity, coverage, novelty, familiarity, serendipity, popularity, concentration effects (long tail)
Netflix Challenge
Movie rating data
[Tables: training data as (user, movie, score) triples with known scores; test data as (user, movie, ?) pairs with scores withheld]
Netflix Prize
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005
• Test data
– Last few ratings of each user (2.8 million)
– Evaluation criterion: root mean squared error (RMSE)
– Netflix Cinematch RMSE: 0.9514
• Competition
– 2700+ teams
– $1 million grand prize for 10% improvement on Cinematch result
– $50,000 2007 progress prize for 8.43% improvement
Overall rating distribution
• A third of all ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
#ratings per movie
• Avg #ratings/movie: 5627
#ratings per user
• Avg #ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count     Avg rating   Title
137812    4.593        The Shawshank Redemption
133597    4.545        Lord of the Rings: The Return of the King
180883    4.306        The Green Mile
150676    4.460        Lord of the Rings: The Two Towers
139050    4.415        Finding Nemo
117456    4.504        Raiders of the Lost Ark
180736    4.299        Forrest Gump
147932    4.433        Lord of the Rings: The Fellowship of the Ring
149199    4.325        The Sixth Sense
144027    4.333        Indiana Jones and the Last Crusade
Important RMSEs (scale from accurate to erroneous)
Grand Prize: 0.8563; 10% improvement
BellKor: 0.8693; 8.63% improvement
Cinematch: 0.9514; baseline
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Inherent noise: ????
["Personalization" labels the gap between the non-personalized averages and the personalized systems]
Challenges
• Size of data
– Scalability
– Keeping data in memory
• Missing data
– 99 percent missing
– Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly
The Rules
• Improve RMSE metric by 10%
– RMSE = the square root of the averaged squared difference between each prediction and the actual rating
– amplifies the contributions of egregious errors, both false positives ("trust busters") and false negatives ("missed opportunities").
• 100 million “training” ratings provided
– User, Rating, Movie name, date
Lessons from the Netflix Prize
Yehuda Koren
The BellKor team
(with Robert Bell & Chris Volinsky)
The BellKor recommender system
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single, fully tuned model
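One common way to combine such an ensemble is to fit blend weights on a held-out probe set by least squares; a sketch under that assumption (toy numbers, not BellKor's actual blending):

```python
import numpy as np

probe_actual = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
preds = np.array([[3.8, 3.2, 4.6, 2.5, 3.9],    # predictor 1 on the probe set
                  [4.2, 2.7, 4.9, 2.1, 4.3],    # predictor 2
                  [3.5, 3.1, 4.4, 2.8, 3.7]])   # predictor 3

w, *_ = np.linalg.lstsq(preds.T, probe_actual, rcond=None)  # blend weights
blend = w @ preds
print(w, np.sqrt(np.mean((blend - probe_actual) ** 2)))     # blended RMSE
```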
Effect of ensemble size
[Plot: RMSE vs. number of predictors (1 to 50); error falls from ≈ 0.8925 toward ≈ 0.87 as predictors are added]
The BellKor recommender system
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single, fully tuned model
• But: many seemingly different models expose similar characteristics of the data and won't mix well
• Concentrate efforts along three axes…
The three dimensions of the BellKor system
The first axis:
• Multi-scale modeling of the data
• Combine top-level, regional modeling of the data with a refined, local view:
– k-NN: extracting local patterns
– Factorization: addressing regional effects
[Diagram: global effects → factorization → k-NN, from coarse to fine]
Multi-scale modeling – 1st tier
Global effects:
• Mean rating: 3.7 stars
• The Sixth Sense is 0.5 stars above average
• Joe rates 0.2 stars below average
→ Baseline estimation: Joe will rate The Sixth Sense 4 stars
Multi-scale modeling – 2nd tier
Factors model:
• Both The Sixth Sense and Joe are placed high on the "Supernatural Thrillers" scale
→ Adjusted estimate: Joe will rate The Sixth Sense 4.5 stars
Multi-scale modeling – 3rd tier
Neighborhood model:
• Joe didn't like the related movie Signs
→ Final estimate: Joe will rate The Sixth Sense 4.2 stars
The three dimensions of the BellKor system
The second axis: quality of modeling
• Make the best out of a model
• Strive for:
– Fundamental derivation
– Simplicity
– Avoiding overfitting
– Robustness against #iterations, parameter settings, etc.
• Optimizing is good, but don't overdo it!
The three dimensions of the BellKor system
• The third dimension will be discussed later…
• Next: moving the multi-scale view along the quality axis
[Diagram: global–local and quality axes; the third axis is marked "???"]
Local modeling through k-NN
• Earliest and most popular collaborative filtering method
• Derive unknown ratings from those of "similar" items (movie-movie variant)
• A parallel user-user flavor: rely on ratings of like-minded users (not in this talk)
k-NN
[Ratings matrix: movies (rows) × users 1–12 (columns); entries are ratings between 1 and 5, most are unknown]
k-NN
[Same matrix; task: estimate the rating of movie 1 by user 5, marked "?"]
k-NN
Neighbor selection: identify movies similar to movie 1 that were rated by user 5
k-NN
Compute similarity weights: s13 = 0.2, s16 = 0.3
k-NN
Predict by taking the weighted average: (0.2·2 + 0.3·3) / (0.2 + 0.3) = 2.6
Properties of k-NN
• Intuitive
• No substantial preprocessing is required
• Easy to explain reasoning behind a recommendation
• Accurate?
k-NN on the RMSE scale (accurate → erroneous)
Grand Prize: 0.8563
BellKor: 0.8693
k-NN variants: ≈ 0.91 – 0.96
Cinematch: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Inherent noise: ????
k-NN - Common practice
1. Define a similarity measure between items: s_{ij}
2. Select neighbors: N(i;u) = the items most similar to i that were rated by u
3. Estimate the unknown rating r_{ui} as the weighted average:

\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in N(i;u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in N(i;u)} s_{ij}}

where b_{ui} is a baseline estimate for r_{ui}
k-NN - Common practice
1. Define a similarity measure between items: s_{ij}
2. Select neighbors: N(i;u) = items similar to i, rated by u
3. Estimate the unknown rating r_{ui} as the weighted average (as above)

Problems:
1. Similarity measures are arbitrary; no fundamental justification
2. Pairwise similarities isolate each neighbor; they neglect interdependencies among neighbors
3. Taking a weighted average is restricting, e.g., when neighborhood information is limited
Interpolation weights
• Use a weighted sum rather than a weighted average:

\hat{r}_{ui} = b_{ui} + \sum_{j \in N(i;u)} w_{ij} (r_{uj} - b_{uj})

(we allow \sum_{j \in N(i;u)} w_{ij} \neq 1)
• Model relationships between item i and its neighbors
• Weights can be learnt through a least squares problem from all other users v that rated i:

\min_{w} \sum_{v \neq u} \Big( r_{vi} - b_{vi} - \sum_{j \in N(i;u)} w_{ij} (r_{vj} - b_{vj}) \Big)^2
Interpolation weights
• Interpolation weights are derived based on their role; no use of an arbitrary similarity measure
• Explicitly account for interrelationships among the neighbors
• Challenges:
– Dealing with missing values: the inner products among movie ratings needed by the least squares problem are mostly unknown and must be estimated
– Avoiding overfitting
– Efficient implementation
Latent factor models
[2-D illustration: movies placed on two latent axes, "geared towards females" vs. "geared towards males" and "serious" vs. "escapist"; e.g., The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, plus users Gus and Dave]
Latent factor models
[A rank-3 SVD approximation: the items × users rating matrix is approximated by the product of an items × 3 factor matrix and a 3 × users factor matrix]
Estimate unknown ratings as inner products of factors:
[Same rank-3 SVD approximation; an unknown cell "?" in the rating matrix is filled from the corresponding item-factor row and user-factor column]
Estimate unknown ratings as inner products of factors:
[The inner product of the item's factor row and the user's factor column yields the estimate; in the example, 2.4]
Latent factor models
Properties:
• SVD isn't defined when entries are unknown → use specialized methods
• Very powerful model; can easily overfit, sensitive to regularization
• Probably the most popular model among contestants
– 12/11/2006: Simon Funk describes an SVD-based method
– 12/29/2006: Free implementation at timelydevelopment.com
Factorization on the RMSE scale (accurate → erroneous)
Grand Prize: 0.8563
BellKor: 0.8693
Factorization variants: ≈ 0.89 – 0.93
Cinematch: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Inherent noise: ????
Our approach
• User factors: model a user u as a vector p_u ~ N_k(\mu, \Sigma)
• Movie factors: model a movie i as a vector q_i ~ N_k(\gamma, \Lambda)
• Ratings: measure "agreement" between u and i: r_{ui} ~ N(p_u^T q_i, \varepsilon^2)
• Maximize the model's likelihood:
– Alternate between recomputing user factors, movie factors, and model parameters
– Special cases:
• Alternating ridge regression
• Nonnegative matrix factorization
Combining multi-scale views
[Diagram: global effects → factorization (regional effects) → k-NN (local effects), combined either by residual fitting or by a weighted average; alternatively, a unified model of k-NN + factorization]
Localized factorization model
• Standard factorization: user u is a linear function parameterized by p_u: r_{ui} = p_u^T q_i
• Allow the user factors p_u to depend on the item being predicted: r_{ui} = p_u(i)^T q_i
• The vector p_u(i) models the behavior of u on items like i
RMSE vs. #Factors
[Plot, results on the Netflix Probe set: RMSE (≈ 0.905–0.935) vs. number of factors (0–80); localized factorization is consistently more accurate than plain factorization, and both improve with more factors]
Top contenders for Progress Prize 2007
[Plot: % improvement over Cinematch (0–10%) from 10/2006 to 10/2007 for teams including BellKor, Gravity, wxyzConsulting, ML@Toronto, and "How low can he go?"; BellKor leads, its multi-scale modeling approaching but not reaching the 10% grand prize line]
Seek alternative perspectives of the data
• Can exploit movie titles and release year
• But the movie side is pretty much covered anyway…
• It's about the users!
• Turning to the third dimension…
The third dimension of the BellKor system
• A powerful source of information: characterize users by which movies they rated, rather than how they rated them
• A binary representation of the data
[Matrices: the movies × users rating matrix alongside its binary version, where each cell is 1 if the user rated the movie and 0 otherwise]
The third dimension of the BellKor system
Great news for recommender systems:
• Works even better on real-life datasets (?)
• Improve accuracy by exploiting implicit feedback
• Implicit behavior is abundant and easy to collect:
– Rental history
– Search patterns
– Browsing history
– …
• Implicit feedback allows predicting personalized ratings for users who never rated!
The three dimensions of the BellKor system
[Axes: global–local, quality, and explicit (ratings) – implicit (binary)]
Where do you want to be?
• All over the global-local axis
• Relatively high on the quality axis
• Can we go in between on the explicit-implicit axis?
• Yes! Relevant methods:
– Conditional RBMs [ML@UToronto]
– NSVD [Arek Paterek]
– Asymmetric factor models
Lessons
What it takes to win:
1. Think deeper – design better algorithms
2. Think broader – use an ensemble of multiple predictors
3. Think different – model the data from different perspectives
At the personal level:
1. Have fun with the data
2. Work hard, long breath
3. Good teammates
Rapid progress of science:
1. Availability of large, real-life data
2. Challenge, competition
3. Effective collaboration
Yehuda Koren
AT&T Labs – Research
BellKor homepage: www.research.att.com/~volinsky/netflix/
Summary
• The most effective approaches are often hybrids of collaborative and content-based filtering
– cf. Netflix, Google News recommendations
• Lots of data available, but algorithm scalability is critical
– (use Map/Reduce, Hadoop)
• Suggested rules of thumb (take with a grain of salt):
– If you have moderate amounts of rating data for each user, use collaborative filtering with kNN (a reasonable baseline)
– When little data is available, use content-based (model-based) recommendations
Resources
• Recommender Systems Survey: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1423975
• Koren et al. on winning the Netflix challenge:
– http://videolectures.net/kdd09_koren_cftd/
– http://www.searchenginecaffe.com/2009/09/winning-1m-netflix-prize-methods.html
• http://www.recommenderbook.net/