-
Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 10
Jan-Willem van de Meent (credit: Andrew Ng, Alex Smola, Yehuda Koren, Stanford CS246)
-
Project
-
• 3 Feb: Form teams of 2-4 people
• 17 Feb: Submit abstract (1 paragraph)
• 3 Mar: Submit proposals (2 pages)
• 13 Mar: Milestone 1 (exploratory analysis)
• 31 Mar: Milestone 2 (statistical analysis)
• 16 Apr (Sun): Submit reports (10 pages)
• 21 Apr (Fri): Submit peer reviews
Project Deadlines
-
• ~10 pages (rough guideline)
• Guidelines for contents:
  • Introduction / Motivation
  • Exploratory analysis (if applicable)
  • Data mining analysis
  • Discussion of results
Project Reports
-
• 2 per person (randomly assigned)
• Reviews should discuss 4 aspects of the report:
  • Clarity (is the writing clear?)
  • Technical merit (are the methods valid?)
  • Reproducibility (is it clear how results were obtained?)
  • Discussion (are the results interpretable?)
Project Review
-
Final Exam
-
Emphasis on post-midterm topics (but some pre-midterm topics included)
Topic List
http://www.ccs.neu.edu/home/jwvdm/teaching/cs6220/spring2017/final-topics.html
-
Recommender Systems
-
The Long Tail
(from: https://www.wired.com/2004/10/tail/)
-
Problem Setting
• Task: Predict user preferences for unseen items
-
Content-based Filtering
Latent factor models
(Figure: movies placed along two latent dimensions, serious vs. escapist and geared towards females vs. geared towards males; examples include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility, with users Gus and Dave placed in the same space)
-
Content-based Filtering
Idea: Predict rating using item features on a per-user basis
-
Content-based Filtering
Idea: Predict rating using user features on a per-item basis
-
Collaborative Filtering
Part I: Basic neighborhood methods
(Figure: a user, Joe, connected to movies #1-#4 in a bipartite user-item graph)
Idea: Predict rating based on similarity to other users
-
Problem Setting
• Task: Predict user preferences for unseen items
• Content-based filtering: model user/item features
• Collaborative filtering: exploit implicit similarity of users and items
-
Recommender Systems
• Movie recommendation (Netflix)
• Related product recommendation (Amazon)
• Web page ranking (Google)
• Social recommendation (Facebook)
• Priority inbox & spam filtering (Google)
• Online dating (OkCupid)
• Computational advertising (everyone)
-
Challenges
• Scalability: millions of objects, 100s of millions of users
• Cold start: changing user base, changing inventory
• Imbalanced dataset: user activity / item reviews are power-law distributed
• Ratings are not missing at random
-
Running Example: Netflix Data

Training data (score, date, movie, user):
score  date      movie  user
1      5/7/02    21     1
5      8/2/04    213    1
4      3/6/01    345    2
4      5/1/05    123    2
3      7/15/02   768    2
5      1/22/01   76     3
4      8/3/00    45     4
1      9/10/05   568    5
2      3/5/03    342    5
2      12/28/00  234    5
5      8/11/02   76     6
4      6/15/03   56     6

Test data (score withheld):
score  date      movie  user
?      1/6/05    62     1
?      9/13/04   96     1
?      8/18/05   7      2
?      11/22/05  3      2
?      6/13/02   47     3
?      8/12/01   15     3
?      9/1/00    41     4
?      8/27/05   28     4
?      4/4/05    93     5
?      7/16/03   74     5
?      2/14/04   69     6
?      10/3/03   83     6
Movie rating data
• Released as part of a $1M competition by Netflix in 2006
• Prize awarded to BellKor in 2009
-
Running Yardstick: RMSE
\mathrm{rmse}(S) = \sqrt{ |S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2 }
(doesn’t tell you how to actually do recommendation)
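In code, the yardstick is a one-liner; a minimal sketch (the dict-based data layout keyed by (item, user) pairs is an illustrative assumption, not from the slides):

```python
import math

def rmse(predictions, targets):
    """Root mean squared error over a set S of (item, user) pairs.

    predictions, targets: dicts mapping (i, u) -> rating,
    with the same keys (the set S in the formula above).
    """
    n = len(targets)
    sq_err = sum((predictions[key] - targets[key]) ** 2 for key in targets)
    return math.sqrt(sq_err / n)

# Toy example: two predicted ratings vs. true ratings
truth = {(21, 1): 1.0, (213, 1): 5.0}
preds = {(21, 1): 2.0, (213, 1): 4.0}
print(rmse(preds, truth))  # sqrt((1 + 1) / 2) = 1.0
```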
-
Content-based Filtering
-
Item-based Features
-
Per-user Regression

w_u = \arg\min_w \| r_u - X w \|^2

Learn a set of regression coefficients for each user
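The per-user least-squares problem above has a closed-form solution; a minimal sketch with a small ridge penalty added for stability (the penalty λ and all names are illustrative assumptions):

```python
import numpy as np

def fit_user_weights(X, r_u, lam=1.0):
    """Closed-form ridge solution w_u = (X^T X + lam I)^{-1} X^T r_u.

    X:   (n_items, n_features) item feature matrix
    r_u: (n_items,) ratings given by one user
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r_u)

# Toy example: 4 items described by 2 features; one user's ratings
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
r_u = np.array([4.0, 2.0, 5.0, 3.0])
w_u = fit_user_weights(X, r_u, lam=0.1)
pred = X @ w_u  # predicted ratings for this user on all items
```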
-
User Bias and Item Popularity
-
Bias
Problem: Some movies are universally loved / hated, and some users are pickier than others
Solution: Introduce a per-movie and per-user bias
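One simple way to estimate such biases is from average residuals around the global mean, giving the baseline predictor r̂_ui = μ + b_u + b_i; a hedged sketch (the toy data and the use of plain averages are illustrative assumptions, not the slides' estimator):

```python
ratings = [  # (user, item, score) toy data
    (0, 0, 4), (0, 1, 1), (1, 0, 5), (1, 1, 2), (2, 0, 4),
]
mu = sum(s for _, _, s in ratings) / len(ratings)  # global mean

def bias(pairs):
    """Average residual (rating - mu), per user or per item."""
    sums, counts = {}, {}
    for key, s in pairs:
        sums[key] = sums.get(key, 0.0) + (s - mu)
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

b_user = bias([(u, s) for u, _, s in ratings])  # per-user bias b_u
b_item = bias([(i, s) for _, i, s in ratings])  # per-item bias b_i

def predict(u, i):
    """Baseline prediction mu + b_u + b_i (0 for unseen users/items)."""
    return mu + b_user.get(u, 0.0) + b_item.get(i, 0.0)
```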
-
Collaborative Filtering
-
Neighborhood Based Methods
Part I: Basic neighborhood methods
(Figure: a user, Joe, connected to movies #1-#4 in a bipartite user-item graph)
Users and items form a bipartite graph (edges are ratings)
-
Neighborhood Based Methods
(user, user) similarity
• predict rating based on average from k-nearest users
• good if the item base is small
• good if the item base changes rapidly
(item, item) similarity
• predict rating based on average from k-nearest items
• good if the user base is small
• good if the user base changes rapidly
-
Parzen-Window Style CF
• Define a similarity s_ij between items
• Find the set ε_k(i; u) of the k nearest neighbors of i that were rated by user u
• Predict the rating using a weighted average over this set
• How should we define s_ij?
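The weighted-average prediction can be sketched as follows, assuming the similarities s_ij are already available (the dict-based data layout is an illustrative assumption):

```python
def predict_knn(u, i, ratings_by_user, sim, k=2):
    """Item-based k-NN prediction: similarity-weighted average of the
    ratings u gave to the k items most similar to i.

    ratings_by_user: dict user -> dict item -> rating
    sim: dict (i, j) -> similarity s_ij (assumed precomputed)
    """
    rated = ratings_by_user[u]
    # the k nearest neighbors of i among the items u has rated
    neighbors = sorted(rated, key=lambda j: sim.get((i, j), 0.0),
                       reverse=True)[:k]
    num = sum(sim.get((i, j), 0.0) * rated[j] for j in neighbors)
    den = sum(sim.get((i, j), 0.0) for j in neighbors)
    return num / den if den > 0 else None

# Toy example: user 0 rated items 1-3; predict item 0 from the 2 nearest
ratings_by_user = {0: {1: 5.0, 2: 3.0, 3: 1.0}}
sim = {(0, 1): 0.9, (0, 2): 0.3, (0, 3): 0.1}
print(predict_knn(0, 0, ratings_by_user, sim, k=2))  # (0.9*5 + 0.3*3)/1.2 = 4.5
```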
-
Pearson Correlation Coefficient

s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}] \, \mathrm{Std}[r_{uj}]}

Estimating item-item similarities
• Common practice: rely on the Pearson correlation coefficient
• Challenge: non-uniform user support of item ratings; each item is rated by a distinct set of users

User ratings for item i: 1 ? ? 5 5 3 ? ? ? 4 2 ? ? ? ? 4 ? 5 4 1 ?
User ratings for item j: ? ? 4 2 5 ? ? 1 2 5 ? ? 2 ? ? 3 ? ? ? 5 4

• Compute the correlation over the shared support
-
(item, item) similarity

Empirical estimate of the Pearson correlation coefficient:

\hat\rho_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}

Regularize towards 0 for small support:

s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda} \, \hat\rho_{ij}

Regularize towards the baseline for small neighborhoods
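The shrunk estimator, sketched in plain Python (the dict-based layout and the default λ are illustrative assumptions):

```python
import math

def shrunk_pearson(r_i, r_j, b_i, b_j, lam=100.0):
    """Pearson correlation over shared support, shrunk toward 0
    when the support |U(i,j)| is small.

    r_i, r_j: dicts user -> rating for items i and j
    b_i, b_j: dicts user -> baseline prediction b_ui, b_uj
    """
    shared = set(r_i) & set(r_j)  # U(i, j): users who rated both items
    n = len(shared)
    if n < 2:
        return 0.0
    num = sum((r_i[u] - b_i[u]) * (r_j[u] - b_j[u]) for u in shared)
    den = math.sqrt(sum((r_i[u] - b_i[u]) ** 2 for u in shared)
                    * sum((r_j[u] - b_j[u]) ** 2 for u in shared))
    rho = num / den if den > 0 else 0.0
    return (n - 1) / (n - 1 + lam) * rho  # shrink for small support

# Toy example: three shared users, constant baseline of 3.0
r_i = {0: 5.0, 1: 1.0, 2: 4.0}
r_j = {0: 4.0, 1: 2.0, 2: 5.0}
base = {0: 3.0, 1: 3.0, 2: 3.0}
s = shrunk_pearson(r_i, r_j, base, base, lam=100.0)
```

With only three shared ratings the raw correlation is large, but the shrinkage factor (n-1)/(n-1+λ) pulls the estimate close to zero, which is the intended behavior for thin support.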
-
Similarity for binary labels

Pearson correlation is not meaningful for binary labels (e.g. views, purchases, clicks)

m_i: users acting on i
m_ij: users acting on both i and j
m: total number of users

Jaccard similarity:
s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}}

Observed / Expected ratio:
s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}
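Both similarity measures for binary (implicit) feedback, sketched directly from the formulas above; the smoothing constant α is a free parameter and the set-based data layout is an illustrative assumption:

```python
def jaccard_sim(users_i, users_j, alpha=5.0):
    """s_ij = m_ij / (alpha + m_i + m_j - m_ij); alpha damps small counts."""
    m_ij = len(users_i & users_j)
    return m_ij / (alpha + len(users_i) + len(users_j) - m_ij)

def observed_expected_sim(users_i, users_j, m, alpha=5.0):
    """s_ij ~ m_ij / (alpha + m_i * m_j / m):
    observed co-occurrences vs. those expected under independence."""
    m_ij = len(users_i & users_j)
    return m_ij / (alpha + len(users_i) * len(users_j) / m)

# Toy example: sets of user ids who clicked each item, out of m=100 users
users_i = {1, 2, 3, 4}
users_j = {3, 4, 5}
print(jaccard_sim(users_i, users_j))                 # 2 / (5 + 4 + 3 - 2) = 0.2
print(observed_expected_sim(users_i, users_j, 100))  # 2 / (5 + 12/100) ~ 0.39
```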
-
Matrix Factorization Methods
-
Matrix Factorization
Idea: pose as (biased) matrix factorization problem
-
Matrix Factorization
Basic matrix factorization model
(Figure: the users × items ratings matrix approximated as the product of a user-factor matrix and an item-factor matrix)
A rank-3 SVD approximation
-
Prediction
Estimate unknown ratings as inner products of factors
(Figure: an unknown entry "?" of the ratings matrix is filled in with the inner product of the corresponding user and item factor vectors, yielding 2.4 in the example)
A rank-3 SVD approximation
-
SVD with missing values
Matrix factorization model
(Figure: a ratings matrix with missing entries approximated by a rank-3 factorization)
Properties:
• SVD isn't defined when entries are unknown → use specialized methods
• Very powerful model → can easily overfit, sensitive to regularization
• Probably the most popular model among Netflix contestants
  – 12/11/2006: Simon Funk describes an SVD-based method
  – 12/29/2006: Free implementation at timelydevelopment.com
Pose as regression problem
Regularize using Frobenius norm
-
Alternating Least Squares
Matrix factorization model (as above)
(regress w_u given X)
-
Alternating Least Squares
(regress w_u given X)

Remember ridge regression?

Regularization
• L2: closed-form solution
  w = (X^T X + \lambda I)^{-1} X^T y
• L1: no closed-form solution; use quadratic programming:
  minimize \| X w - y \|^2 \text{ s.t. } \| w \|_1 \le s
-
Alternating Least Squares
Alternate between:
(regress w_u given X)
(regress x_i given W)
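The alternation amounts to two interleaved ridge regressions over the observed entries; a minimal NumPy sketch (the rank, regularization strength, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Alternating least squares for R ~ W X^T on observed entries only.

    R:    (n_users, n_items) rating matrix
    mask: boolean matrix, True where a rating is observed
    Each half-step is a ridge regression: w_u given X, then x_i given W.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    W = rng.normal(scale=0.1, size=(n_users, k))
    X = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(iters):
        for u in range(n_users):          # regress w_u given X
            obs = mask[u]
            A = X[obs].T @ X[obs] + lam * np.eye(k)
            W[u] = np.linalg.solve(A, X[obs].T @ R[u, obs])
        for i in range(n_items):          # regress x_i given W
            obs = mask[:, i]
            A = W[obs].T @ W[obs] + lam * np.eye(k)
            X[i] = np.linalg.solve(A, W[obs].T @ R[obs, i])
    return W, X

# Toy 3x3 example with two missing ratings
R = np.array([[4.0, 5.0, 0.0], [5.0, 4.0, 1.0], [1.0, 0.0, 5.0]])
mask = np.array([[True, True, False], [True, True, True], [True, False, True]])
W, X = als(R, mask, k=2)
R_hat = W @ X.T  # predictions, including the missing entries
```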
-
Stochastic Gradient Descent
Matrix factorization model (as above)
• No need for locking
• Multicore updates asynchronously
(Recht, Ré, Wright, 2011 - Hogwild!)
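A sequential version of the SGD updates, including the biases μ, b_u, and b_i; a minimal sketch (learning rate, regularization, and epoch count are illustrative; a Hogwild!-style version would run these same lock-free updates from multiple threads):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.01, lam=0.1,
           epochs=200, seed=0):
    """SGD on the biased MF model r_hat = mu + b_u + b_i + w_u . x_i.

    ratings: list of (u, i, r) triples (the observed entries).
    Each observed rating contributes one gradient step per epoch.
    """
    rng = np.random.default_rng(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    W = rng.normal(scale=0.1, size=(n_users, k))
    X = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + W[u] @ X[i])
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
            # simultaneous update of both factor vectors
            W[u], X[i] = (W[u] + lr * (err * X[i] - lam * W[u]),
                          X[i] + lr * (err * W[u] - lam * X[i]))
    return mu, b_u, b_i, W, X

# Toy example: two users, two items
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]
mu, b_u, b_i, W, X = sgd_mf(ratings, n_users=2, n_items=2)
pred = lambda u, i: mu + b_u[u] + b_i[i] + W[u] @ X[i]
```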
-
Sampling Bias
-
Ratings are not given at random!
(Figure: distribution of ratings in Yahoo! survey answers vs. Yahoo! music ratings vs. Netflix ratings)
B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007
-
Ratings are not given at random
• A powerful source of information: characterize users by which movies they rated, rather than how they rated them
• A dense binary representation of the data:
(Figure: the sparse users × movies ratings matrix R with entries r_ui alongside a dense binary matrix B with entries b_ui indicating which movies each user rated; R feeds the matrix factorization, B the regression)
Which movies do users rate?
-
Temporal Effects
-
Changes in user behavior
Something happened in early 2004…
(Figure: Netflix ratings by date, showing a jump in early 2004, when Netflix changed its rating labels)
-
Are movies getting better with time?
-
Temporal Effects
Are movies getting better with time?
Solution: Model temporal effects in the bias terms, not the weights
-
Netflix Prize
-
Netflix Prize
Training data
• 100 million ratings, 480,000 users, 17,770 movies
• 6 years of data: 2000-2005
Test data
• Last few ratings of each user (2.8 million)
• Evaluation criterion: Root Mean Square Error (RMSE)
Competition
• 2,700+ teams
• Netflix's system RMSE: 0.9514
• $1 million prize for a 10% improvement over Netflix
-
Improvements
Factor models: error vs. #parameters
(Figure: RMSE against millions of parameters for NMF, BiasSVD, SVD++, and SVD v.2-v.4, with factor dimensions ranging from 40 to 1500; RMSE falls from about 0.91 to below 0.88 as the models grow richer)
Add biases
Do SGD, but also learn the biases μ, b_u, and b_i
-
Improvements
Account for the fact that ratings are not missing at random: "who rated what"
(Figure: same RMSE vs. #parameters plot)
-
Improvements
Add temporal effects
(Figure: same RMSE vs. #parameters plot)
-
Improvements
(Figure: same RMSE vs. #parameters plot)
Still pretty far from the 0.8563 grand prize target
-
Winning Solution from BellKor
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
-
Last 30 days
June 26th submission triggers 30-day “last call”
-
BellKor fends off competitors by a hair
-
Ratings aren't everything
Netflix then vs. Netflix now
• Only the simpler submodels (SVD, RBMs) were implemented
• Ratings eventually proved to be only weakly informative