TRANSCRIPT
Recommendation with a Reason:
Collaborative Topic Modeling for Recommending Scientific Articles
Chong Wang, joint work with David M. Blei
Computer Science Department, Princeton University
NYC Predictive Analytics, New York
Wang and Blei (Princeton) Recommending Scientific Articles December 01, 2011 1 / 68
Outline
1 Overview for Recommender Systems
2 Matrix factorization for recommendation
3 Topic modeling
4 Collaborative topic models
5 Empirical Results
Too many choices to choose from
Type “laptop” in Amazon,
What would you do?
Read every description yourself.
Ask a good friend to do it.
Ask many “friends” to do it.
What do they say?
Sorted by Avg. Customer Review
So, where are those recommender systems?
But I am a graduate student and I also do research ...
Today’s focus — Recommending Scientific articles
A search of “machine learning” in Google scholar gives 2,520,000 results.
If I have read article A, B and C, what should I read next?
The problem of finding relevant articles
1 Finding relevant articles is an important task for researchers who want to
learn the general ideas in an area, and
keep up with the state of the art in an area.
2 Two popular existing approaches:
following article references: relevant citations are easily missed;
using keyword search: queries are difficult to form, and it is only good for directed exploration.
3 Our approach is to develop recommendation algorithms for online communities that share reference libraries. (www.citeulike.org)
Example: articles from a user’s library
1 Probabilistic latent semantic indexing
2 Optimizing search engines using clickthrough data
3 A re-examination of text categorization methods
4 On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
5 A Comparative Study on Feature Selection in Text Categorization
6 ...
What should we recommend to this user?
Demo
http://www.cs.princeton.edu/~chongw/citeulike/
Users can browse their interests
A user's interests are summarized using the top topics they are interested in.
http://www.cs.princeton.edu/~chongw/citeulike/users/user2832.html
User can read the recommendations
When a user clicks on one recommendation — the topics
How the word content differs from the readers' view.
Matrix factorization for recommendation
Three important elements:
users: you and me
items: articles
ratings: a user likes/dislikes some of the articles
Popular solution: collaborative filtering (CF);
matrix factorization is one of the most popular algorithms for recommender systems.
The user-item matrix (1 = like, 0 = dislike, ? = unknown):

         item 1   item 2   item 3
user 1      1        0        ?
user 2      1        ?        0
user 3      ?        1        0
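The toy matrix on this slide can be written down directly; a minimal numpy sketch, with NaN standing in for the unknown "?" entries:

```python
import numpy as np

# The 3x3 user-item matrix from the slide: 1 = like, 0 = dislike,
# and NaN stands in for "?" (an unobserved rating the system must predict).
R = np.array([
    [1.0, 0.0, np.nan],   # user 1
    [1.0, np.nan, 0.0],   # user 2
    [np.nan, 1.0, 0.0],   # user 3
])

observed = ~np.isnan(R)
print(int(observed.sum()))   # number of observed ratings
```

The recommendation task is exactly to fill in the NaN entries.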
Matrix factorization
Users and items are represented in a shared but unknown latent space:
user i: u_i ∈ R^K; item j: v_j ∈ R^K.
Each dimension of the latent space is assumed to represent some kind of unknown common interest.
The rating of item j by user i is the dot product
    r_ij = u_i^T v_j,
where r_ij = 1 indicates like and 0 dislike. In matrix form,
    R = U^T V.
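The dot-product prediction rule can be sketched in a few lines of numpy (the sizes below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_users, n_items = 2, 3, 4        # K is the latent dimension

U = rng.normal(size=(K, n_users))    # column i is the user vector u_i
V = rng.normal(size=(K, n_items))    # column j is the item vector v_j

# The predicted rating of item j by user i is the dot product u_i^T v_j;
# stacking all pairs gives the matrix form R = U^T V.
R = U.T @ V
assert np.isclose(R[1, 2], U[:, 1] @ V[:, 2])
```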
Probabilistic matrix factorization (PMF)
1 For each user i, draw the user latent vector u_i ~ N(0, λ_u^{-1} I_K).
2 For each item j, draw the item latent vector v_j ~ N(0, λ_v^{-1} I_K).
3 For each user-item pair (i, j), draw the rating r_ij ~ N(u_i^T v_j, c_ij^{-1}), where c_ij is the precision parameter.
The expectation of the rating is
    E[r_ij] = u_i^T v_j.
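The three-step PMF generative process can be sketched as follows (numpy; the hyperparameter values, and using one shared precision c for all c_ij, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_users, n_items = 5, 10, 20
lam_u, lam_v, c = 0.1, 0.1, 10.0     # illustrative hyperparameters

# 1. Draw each user latent vector u_i ~ N(0, lam_u^{-1} I_K).
U = rng.normal(scale=lam_u ** -0.5, size=(n_users, K))
# 2. Draw each item latent vector v_j ~ N(0, lam_v^{-1} I_K).
V = rng.normal(scale=lam_v ** -0.5, size=(n_items, K))
# 3. Draw each rating r_ij ~ N(u_i^T v_j, c_ij^{-1}); a single
#    precision c stands in for all the c_ij here.
R = rng.normal(loc=U @ V.T, scale=c ** -0.5)

# E[r_ij] = u_i^T v_j: sampled ratings center on the dot products.
print(R.shape)
```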
Learning and Prediction
Learning the latent vectors for users and items:
    min_{U,V} Σ_{i,j} (r_ij − u_i^T v_j)^2 + λ_u Σ_i ||u_i||^2 + λ_v Σ_j ||v_j||^2,
where λ_u and λ_v are regularization parameters.
Prediction for user i on item j (not rated by user i before):
    r_ij ≈ u_i^T v_j.
But how do we interpret these latent vectors for users and items?
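One common way to minimize this objective (the slide does not specify which optimizer the authors use) is alternating least squares: with V held fixed, each u_i is the solution of a ridge regression, and symmetrically for V. A minimal sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, K = 8, 12, 3       # illustrative sizes
lam_u = lam_v = 0.1
R = (rng.random((n_users, n_items)) < 0.5).astype(float)   # toy 0/1 ratings

U = rng.normal(size=(n_users, K))
V = rng.normal(size=(n_items, K))

def objective(U, V):
    return (((R - U @ V.T) ** 2).sum()
            + lam_u * (U ** 2).sum() + lam_v * (V ** 2).sum())

before = objective(U, V)
# Alternating least squares: each update exactly solves its ridge
# subproblem, so the objective can only decrease.
for _ in range(20):
    U = R @ V @ np.linalg.inv(V.T @ V + lam_u * np.eye(K))
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam_v * np.eye(K))
after = objective(U, V)
print(after < before)
```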
An example commonly used to explain matrix factorization
[Figure: users and articles plotted in a two-dimensional latent space whose axes roughly correspond to "machine learning" and "information retrieval".]
Understanding can be very helpful
Provide new ways of improving end-user interactions: not just a list of articles (items), but more of an exploratory tool.
Increase users' confidence in the system: they become more likely to accept recommendations.
Provide new ways of diagnosing the system: not just accuracy metrics.
Unfortunately, matrix factorization does not provide this kind of understanding in any easy way.
Our criteria for an article recommender system
It should be able to
1 recommend old articles (easy)
2 recommend new articles (not that easy, but doable)
3 provide interpretability, not just a list of items (challenging)
So, what distinguishes our approach? Our goal is not only to improve
the performance, but also the interpretability.
Topic modeling is one way to achieve interpretability.
Topic modeling
[Figure: the intuitions behind latent Dirichlet allocation. Some number of "topics," which are distributions over words, exist for the whole collection, e.g. {gene, dna, genetic}, {life, evolve, organism}, {brain, neuron, nerve}, {data, number, computer}. Each document is generated by first choosing a distribution over the topics, then, for each word, choosing a topic assignment and choosing the word from the corresponding topic. The topics and topic assignments shown are illustrative; they are not fit from real data.]

Each topic is a distribution over words.
Each document is a mixture of corpus-wide topics.
Each word is drawn from one of those topics.
Adapted from Introduction to Probabilistic Topic Models, D. Blei 2011.
Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a popular topic model. It assumes
There are K topics.
For each article, topic proportions θ ∼ Dirichlet(α).
[Figure: illustrative topics, e.g. {gene, dna, genetic}, {life, evolve, organism}, {data, number, computer}, together with the article's topic proportions θ_j.]
Note that θ can explain the topics that the article talks about!
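The two-stage generative process behind θ can be sketched directly (the vocabulary, topic distributions, and α below are made up for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["gene", "dna", "life", "evolve", "data", "computer"]
# Three illustrative topics, each a distribution over the vocabulary.
beta = np.array([
    [0.45, 0.45, 0.04, 0.02, 0.02, 0.02],   # a "genetics" topic
    [0.04, 0.02, 0.45, 0.45, 0.02, 0.02],   # an "evolution" topic
    [0.02, 0.02, 0.02, 0.04, 0.45, 0.45],   # a "computing" topic
])
alpha = np.full(3, 0.5)

# Step 1: choose the article's topic proportions theta ~ Dirichlet(alpha).
theta = rng.dirichlet(alpha)
# Step 2: for each word, choose a topic z from theta,
# then choose the word from beta[z].
doc = [vocab[rng.choice(len(vocab), p=beta[rng.choice(3, p=theta)])]
       for _ in range(10)]
print(doc)
```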
LDA as a graphical model
[Figure: plate diagram. The Dirichlet parameter α generates per-document topic proportions θ_d; each word has a per-word topic assignment z_{d,n} and an observed word w_{d,n}; the topics β_k have their own topic hyperparameter; plates replicate over the N words, D documents, and K topics.]
Vertices denote random variables.
Edges denote dependence between random variables.
Shading denotes observed variables.
Plates denote replicated variables.
Adapted from David Blei's slides.
Running a topic model
Data: article titles and abstracts from CiteULike: 16,980 articles, 1.6M words, 8K unique terms.
Model: 200-topic LDA model with variational inference.
Inferred topics for the whole corpus
gene, genes, expression, tissues, regulation, coexpression, tissue-specific, expressed, tissue, regulatory
nodes, wireless, protocol, routing, protocols, node, sensor, peer-to-peer, scalable, hoc
distribution, random, probability, distributions, sampling, stochastic, markov, density, estimation, statistics
learning, machine, training, vector, learn, machines, kernel, learned, classifiers, classifier
relative, importance, give, original, respect, obtain, ranking, metric, weighted, compute
Inferred topic proportions for article I
Top topics: {relative, importance, give, original, respect, obtain, ranking}; {large, small, numbers, larger, extremely, amounts, smaller}; {web, semantic, pages, page, metadata, standards, rdf, xml}.
[Figure: bar chart of the article's topic proportions.]
Inferred topic proportions for article II
Top topics: {estimate, estimates, likelihood, maximum, estimated, missing}; {distribution, random, probability, distributions, sampling, stochastic}; {algorithm, signal, input, signals, output, exact, performs, music}.
[Figure: bar chart of the article's topic proportions, shown next to the first page of "Maximum Likelihood from Incomplete Data via the EM Algorithm" by A. P. Dempster, N. M. Laird and D. B. Rubin (read before the Royal Statistical Society, December 8th, 1976). The abstract describes a broadly applicable algorithm for computing maximum likelihood estimates from incomplete data, each iteration consisting of an expectation step (E-step) followed by a maximization step (M-step).]
Comparison of the article representations
[Figure: the same EM paper shown with its two representations. Topic modeling gives interpretable topic proportions over topics such as {estimate, estimates, likelihood, maximum} and {distribution, random, probability}; matrix factorization gives only an uninterpretable latent vector, depicted as a row of question marks.]
Collaborative topic models: motivations
Article representation in different methods
[Figure: the same article represented two ways. Matrix factorization gives an unlabeled latent vector, depicted as a question mark; topic modeling gives proportions over named topics such as {gene, dna, ...} and {data, number, ...}.]
In matrix factorization, an article has a latent representation v in some unknown latent space.
In topic modeling, an article has topic proportions θ in the learned topic space.
If we simply fix v = θ, we seem to have found a way to explain the unknown space using the topic space.
Should this work?
Collaborative topic models: motivations (cont'd)
[Figure: two topics (machine learning and social network analysis), two articles (A and B), and two people (a boy and a girl), with their preferences.]
The boy likes A and B --- no problem.
The girl likes A and B --- problem?
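For context on where this is headed: in the published collaborative topic regression model, the resolution is not to fix v = θ exactly but to draw v_j ~ N(θ_j, λ_v^{-1} I_K), i.e. add a per-item latent offset ε_j so that ratings data (like the girl's library) can pull an article away from what its words alone suggest. A minimal numpy sketch of that prediction rule (sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
K, n_items = 3, 5
lam_v = 1.0

# Topic proportions theta_j per article (random stand-ins for LDA output).
theta = rng.dirichlet(np.ones(K), size=n_items)

# Instead of fixing v_j = theta_j, draw v_j ~ N(theta_j, lam_v^{-1} I_K):
# the offset epsilon_j lets ratings move an item away from its text.
epsilon = rng.normal(scale=lam_v ** -0.5, size=(n_items, K))
V = theta + epsilon

u = rng.normal(size=K)               # one user's latent vector
scores = V @ u                       # predicted ratings u^T v_j
print(scores.shape)
```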