TRANSCRIPT
Recommendation with a Reason:
Collaborative Topic Modeling for Recommending Scientific Articles
Chong Wang, joint work with David M. Blei
Computer Science Department, Princeton University
NYC Predictive Analytics, New York
Wang and Blei (Princeton) Recommending Scientific Articles December 01, 2011 1 / 68
Outline
1 Overview for Recommender Systems
2 Matrix factorization for recommendation
3 Topic modeling
4 Collaborative topic models
5 Empirical Results
Too many choices to choose from
Type “laptop” in Amazon,
What would you do?
Read every description yourself.
Ask a good friend to do it.
Ask many “friends” to do it.
What do they say?
Sorted by Avg. Customer Review
So, where are those recommender systems?
But I am a graduate student and I also do research ...
Today’s focus — Recommending Scientific articles
A search of “machine learning” in Google scholar gives 2,520,000 results.
If I have read article A, B and C, what should I read next?
The problem of finding relevant articles
1 Finding relevant articles is an important task for researchers who want to
learn the general ideas in an area, and
keep up with the state of the art in an area.
2 Two popular existing approaches:
following article references: relevant citations are easily missed;
using keyword search: queries are difficult to form, and it is only good for directed exploration.
3 Our approach is to develop recommendation algorithms for online communities that share reference libraries. (www.citeulike.org)
Example: articles from a user’s library
1 Probabilistic latent semantic indexing
2 Optimizing search engines using clickthrough data
3 A re-examination of text categorization methods
4 On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
5 A Comparative Study on Feature Selection in Text Categorization
6 ...
What should we recommend to this user?
Demo
http://www.cs.princeton.edu/~chongw/citeulike/
Users can browse their interests
A user's interests are summarized using the top topics they are interested in.
http://www.cs.princeton.edu/~chongw/citeulike/users/user2832.html
User can read the recommendations
When a user clicks on one recommendation — the topics
How the word content differs from the readers' view.
Matrix factorization for recommendation
Three important elements:
users: you and me
items: articles
ratings: a user likes/dislikes some of the articles
Popular solution: collaborative filtering (CF);
matrix factorization is one of the most popular algorithms for recommender systems.
The user-item matrix (1 = like, 0 = dislike, ? = unknown):

         item 1   item 2   item 3
user 1      1        0        ?
user 2      1        ?        0
user 3      ?        1        0
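The toy matrix on this slide can be written down directly; a minimal numpy sketch, with NaN standing in for the unknown "?" entries:

```python
import numpy as np

# The 3x3 user-item matrix from the slide: 1 = like, 0 = dislike,
# and NaN stands in for "?" (an unobserved rating the system must predict).
R = np.array([
    [1.0, 0.0, np.nan],   # user 1
    [1.0, np.nan, 0.0],   # user 2
    [np.nan, 1.0, 0.0],   # user 3
])

observed = ~np.isnan(R)
print(int(observed.sum()))   # number of observed ratings
```

The recommendation task is exactly to fill in the NaN entries.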
Matrix factorization
Users and items are represented in a shared but unknown latent space:
user i: u_i ∈ R^K; item j: v_j ∈ R^K.
Each dimension of the latent space is assumed to represent some kind of unknown common interest.
The rating of item j by user i is the dot product
    r_ij = u_i^T v_j,
where r_ij = 1 indicates like and 0 dislike. In matrix form,
    R = U^T V.
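The dot-product prediction rule can be sketched in a few lines of numpy (the sizes below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_users, n_items = 2, 3, 4        # K is the latent dimension

U = rng.normal(size=(K, n_users))    # column i is the user vector u_i
V = rng.normal(size=(K, n_items))    # column j is the item vector v_j

# The predicted rating of item j by user i is the dot product u_i^T v_j;
# stacking all pairs gives the matrix form R = U^T V.
R = U.T @ V
assert np.isclose(R[1, 2], U[:, 1] @ V[:, 2])
```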
Probabilistic matrix factorization (PMF)
1 For each user i, draw the user latent vector u_i ~ N(0, λ_u^{-1} I_K).
2 For each item j, draw the item latent vector v_j ~ N(0, λ_v^{-1} I_K).
3 For each user-item pair (i, j), draw the rating r_ij ~ N(u_i^T v_j, c_ij^{-1}), where c_ij is the precision parameter.
The expectation of the rating is
    E[r_ij] = u_i^T v_j.
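The three-step PMF generative process can be sketched as follows (numpy; the hyperparameter values, and using one shared precision c for all c_ij, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_users, n_items = 5, 10, 20
lam_u, lam_v, c = 0.1, 0.1, 10.0     # illustrative hyperparameters

# 1. Draw each user latent vector u_i ~ N(0, lam_u^{-1} I_K).
U = rng.normal(scale=lam_u ** -0.5, size=(n_users, K))
# 2. Draw each item latent vector v_j ~ N(0, lam_v^{-1} I_K).
V = rng.normal(scale=lam_v ** -0.5, size=(n_items, K))
# 3. Draw each rating r_ij ~ N(u_i^T v_j, c_ij^{-1}); a single
#    precision c stands in for all the c_ij here.
R = rng.normal(loc=U @ V.T, scale=c ** -0.5)

# E[r_ij] = u_i^T v_j: sampled ratings center on the dot products.
print(R.shape)
```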
Learning and Prediction
Learning the latent vectors for users and items:
    min_{U,V} Σ_{i,j} (r_ij − u_i^T v_j)^2 + λ_u Σ_i ||u_i||^2 + λ_v Σ_j ||v_j||^2,
where λ_u and λ_v are regularization parameters.
Prediction for user i on item j (not rated by user i before):
    r_ij ≈ u_i^T v_j.
But how do we interpret these latent vectors for users and items?
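One common way to minimize this objective (the slide does not specify which optimizer the authors use) is alternating least squares: with V held fixed, each u_i is the solution of a ridge regression, and symmetrically for V. A minimal sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, K = 8, 12, 3       # illustrative sizes
lam_u = lam_v = 0.1
R = (rng.random((n_users, n_items)) < 0.5).astype(float)   # toy 0/1 ratings

U = rng.normal(size=(n_users, K))
V = rng.normal(size=(n_items, K))

def objective(U, V):
    return (((R - U @ V.T) ** 2).sum()
            + lam_u * (U ** 2).sum() + lam_v * (V ** 2).sum())

before = objective(U, V)
# Alternating least squares: each update exactly solves its ridge
# subproblem, so the objective can only decrease.
for _ in range(20):
    U = R @ V @ np.linalg.inv(V.T @ V + lam_u * np.eye(K))
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam_v * np.eye(K))
after = objective(U, V)
print(after < before)
```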
An example commonly used to explain matrix factorization
[Figure: users and articles plotted in a two-dimensional latent space whose axes roughly correspond to "machine learning" and "information retrieval".]
Understanding can be very helpful
Provide new ways of improving end-user interactions: not just a list of articles (items), but more of an exploratory tool.
Increase users' confidence in the system: they become more likely to accept recommendations.
Provide new ways of diagnosing the system: not just accuracy metrics.
Unfortunately, matrix factorization does not provide this kind of understanding in any easy way.
Our criteria for an article recommender system
It should be able to
1 recommend old articles (easy)
2 recommend new articles (not that easy, but doable)
3 provide interpretability, not just a list of items (challenging)
So, what distinguishes our approach? Our goal is not only to improve
the performance, but also the interpretability.
Topic modeling is one way to achieve interpretability.
Topic modeling
[Figure: the intuitions behind latent Dirichlet allocation. Some number of "topics," which are distributions over words, exist for the whole collection, e.g. {gene, dna, genetic}, {life, evolve, organism}, {brain, neuron, nerve}, {data, number, computer}. Each document is generated by first choosing a distribution over the topics, then, for each word, choosing a topic assignment and choosing the word from the corresponding topic. The topics and topic assignments shown are illustrative; they are not fit from real data.]

Each topic is a distribution over words.
Each document is a mixture of corpus-wide topics.
Each word is drawn from one of those topics.
Adapted from Introduction to Probabilistic Topic Models, D. Blei 2011.
Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a popular topic model. It assumes
There are K topics.
For each article, topic proportions θ ∼ Dirichlet(α).
[Figure: illustrative topics, e.g. {gene, dna, genetic}, {life, evolve, organism}, {data, number, computer}, together with the article's topic proportions θ_j.]
Note that θ can explain the topics that the article talks about!
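The two-stage generative process behind θ can be sketched directly (the vocabulary, topic distributions, and α below are made up for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["gene", "dna", "life", "evolve", "data", "computer"]
# Three illustrative topics, each a distribution over the vocabulary.
beta = np.array([
    [0.45, 0.45, 0.04, 0.02, 0.02, 0.02],   # a "genetics" topic
    [0.04, 0.02, 0.45, 0.45, 0.02, 0.02],   # an "evolution" topic
    [0.02, 0.02, 0.02, 0.04, 0.45, 0.45],   # a "computing" topic
])
alpha = np.full(3, 0.5)

# Step 1: choose the article's topic proportions theta ~ Dirichlet(alpha).
theta = rng.dirichlet(alpha)
# Step 2: for each word, choose a topic z from theta,
# then choose the word from beta[z].
doc = [vocab[rng.choice(len(vocab), p=beta[rng.choice(3, p=theta)])]
       for _ in range(10)]
print(doc)
```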
LDA as a graphical model
[Figure: plate diagram. The Dirichlet parameter α generates per-document topic proportions θ_d; each word has a per-word topic assignment z_{d,n} and an observed word w_{d,n}; the topics β_k have their own topic hyperparameter; plates replicate over the N words, D documents, and K topics.]
Vertices denote random variables.
Edges denote dependence between random variables.
Shading denotes observed variables.
Plates denote replicated variables.
Adapted from David Blei's slides.
Running a topic model
Data: article titles and abstracts from CiteULike: 16,980 articles, 1.6M words, 8K unique terms.
Model: 200-topic LDA model with variational inference.
Inferred topics for the whole corpus
gene, genes, expression, tissues, regulation, coexpression, tissue-specific, expressed, tissue, regulatory
nodes, wireless, protocol, routing, protocols, node, sensor, peer-to-peer, scalable, hoc
distribution, random, probability, distributions, sampling, stochastic, markov, density, estimation, statistics
learning, machine, training, vector, learn, machines, kernel, learned, classifiers, classifier
relative, importance, give, original, respect, obtain, ranking, metric, weighted, compute
Inferred topic proportions for article I
Top topics: {relative, importance, give, original, respect, obtain, ranking}; {large, small, numbers, larger, extremely, amounts, smaller}; {web, semantic, pages, page, metadata, standards, rdf, xml}.
[Figure: bar chart of the article's topic proportions.]
Inferred topic proportions for article II
Top topics: {estimate, estimates, likelihood, maximum, estimated, missing}; {distribution, random, probability, distributions, sampling, stochastic}; {algorithm, signal, input, signals, output, exact, performs, music}.
[Figure: bar chart of the article's topic proportions, shown next to the first page of "Maximum Likelihood from Incomplete Data via the EM Algorithm" by A. P. Dempster, N. M. Laird and D. B. Rubin (read before the Royal Statistical Society, December 8th, 1976). The abstract describes a broadly applicable algorithm for computing maximum likelihood estimates from incomplete data, each iteration consisting of an expectation step (E-step) followed by a maximization step (M-step).]
Comparison of the article representations
[Figure: the same EM paper shown with its two representations. Topic modeling gives interpretable topic proportions over topics such as {estimate, estimates, likelihood, maximum} and {distribution, random, probability}; matrix factorization gives only an uninterpretable latent vector, depicted as a row of question marks.]
Collaborative topic models: motivations
Article representation in different methods
[Figure: the same article represented two ways. Matrix factorization gives an unlabeled latent vector, depicted as a question mark; topic modeling gives proportions over named topics such as {gene, dna, ...} and {data, number, ...}.]
In matrix factorization, an article has a latent representation v in some unknown latent space.
In topic modeling, an article has topic proportions θ in the learned topic space.
If we simply fix v = θ, we seem to have found a way to explain the unknown space using the topic space.
Should this work?
Collaborative topic models: motivations (cont'd)
[Figure: two topics (machine learning and social network analysis), two articles (A and B), and two people (a boy and a girl), with their preferences.]
The boy likes A and B --- no problem.
The girl likes A and B --- problem?
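For context on where this is headed: in the published collaborative topic regression model, the resolution is not to fix v = θ exactly but to draw v_j ~ N(θ_j, λ_v^{-1} I_K), i.e. add a per-item latent offset ε_j so that ratings data (like the girl's library) can pull an article away from what its words alone suggest. A minimal numpy sketch of that prediction rule (sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
K, n_items = 3, 5
lam_v = 1.0

# Topic proportions theta_j per article (random stand-ins for LDA output).
theta = rng.dirichlet(np.ones(K), size=n_items)

# Instead of fixing v_j = theta_j, draw v_j ~ N(theta_j, lam_v^{-1} I_K):
# the offset epsilon_j lets ratings move an item away from its text.
epsilon = rng.normal(scale=lam_v ** -0.5, size=(n_items, K))
V = theta + epsilon

u = rng.normal(size=K)               # one user's latent vector
scores = V @ u                       # predicted ratings u^T v_j
print(scores.shape)
```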