Probabilistic Graphical Models (epxing/Class/10708-14/lectures/lecture15-MF-topicmodel.pdf), Lecture 15

School of Computer Science, Probabilistic Graphical Models. Mean Field Approximation & Topic Models. Eric Xing. Lecture 15, March 5, 2014. Reading: See class website. © Eric Xing @ CMU, 2005-2014


Page 1:

School of Computer Science

Probabilistic Graphical Models

Mean Field Approximation & Topic Models

Eric Xing, Lecture 15, March 5, 2014

Reading: See class website

Page 2: Probabilistic Graphical Modelsepxing/Class/10708-14/lectures/lecture15-MF-topicmodel.pdfLecture 15, March 5, 2014 Reading: ... 2005-2014 1. Variational Principle ... on graph G, the

Variational Principle: exact variational formulation

M: the marginal polytope, difficult to characterize; A*: the negative entropy function, no explicit form

Mean field method: non-convex inner bound and exact form of entropy

Bethe approximation and loopy belief propagation: polyhedral outer bound and non-convex Bethe approximation
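As a point of reference, the exact variational principle that both approximations above start from can be written in the standard form (with M the marginal polytope and A* the conjugate dual / negative entropy):

```latex
A(\theta) \;=\; \sup_{\mu \in \mathcal{M}} \bigl\{ \langle \theta, \mu \rangle - A^{*}(\mu) \bigr\}
```

Mean field restricts μ to a tractable (non-convex) inner subset of M while keeping A* exact on it; Bethe/loopy BP instead relaxes M to a polyhedral outer bound and replaces A* with the non-convex Bethe approximation.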

Page 3:

Mean Field Approximation

Page 4:

Mean Field Methods

For a given tractable subgraph F, a subset of canonical parameters is E(F) = { θ : θ_α = 0 for all α not in F's index set }

Inner approximation: M_F(G) ⊆ M(G)

Mean field solves the relaxed problem max_{μ ∈ M_F(G)} { ⟨θ, μ⟩ − A*_F(μ) }, where A*_F is the exact dual function restricted to M_F(G)

Page 5:

For an exponential family with sufficient statistics φ defined on graph G, the set of realizable mean parameters is M(G) = { μ : ∃ p such that E_p[φ(X)] = μ }

Idea: restrict p to a subset of distributions associated with a tractable subgraph

Tractable Subgraphs

Page 6:

Example: Naïve Mean Field for Ising Model

Ising model in {0,1} representation

Mean parameters

For the fully disconnected graph F, M_F(G) = { μ : 0 ≤ μ_s ≤ 1, μ_st = μ_s μ_t }

The dual decomposes into a sum, one term for each node

μ_s = E_p[X_s] = P[X_s = 1] for all s ∈ V, and

μ_st = E_p[X_s X_t] = P[(X_s, X_t) = (1,1)] for all (s,t) ∈ E.

Page 7:

Naïve Mean Field for Ising Model Optimization Problem

Update Rule: μ_s ← σ( θ_s + Σ_{t ∈ N(s)} θ_st μ_t ), where σ(z) = 1/(1 + e^{−z})

μ_t resembles the "message" sent from node t to s

{ μ_t : t ∈ N(s) } forms the "mean field" applied to s from its neighborhood

Also yields a lower bound on the log partition function

KL(Q‖P) = −H_Q(X) − E_Q[log f(X)] + log Z ≥ 0, hence log Z ≥ H_Q(X) + E_Q[log f(X)]
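As an illustrative sketch (the function name and signature are mine, not from the slides), the fixed-point update μ_s ← σ(θ_s + Σ_{t∈N(s)} θ_st μ_t) for the {0,1} Ising model can be coded directly:

```python
import math

def naive_mean_field_ising(theta, theta_pair, n_iters=200):
    """Naive mean-field updates for an Ising model in {0,1} representation.
    theta: dict node -> field parameter theta_s
    theta_pair: dict (s, t) -> coupling theta_st (undirected)
    Returns approximate marginals mu[s] ~= P[X_s = 1]."""
    nodes = list(theta)
    # build a symmetric neighbor lookup from the undirected couplings
    nbrs = {s: {} for s in nodes}
    for (s, t), w in theta_pair.items():
        nbrs[s][t] = w
        nbrs[t][s] = w
    mu = {s: 0.5 for s in nodes}
    for _ in range(n_iters):
        for s in nodes:
            # mean field applied to s from its neighborhood
            field = theta[s] + sum(w * mu[t] for t, w in nbrs[s].items())
            mu[s] = 1.0 / (1.0 + math.exp(-field))  # sigmoid
    return mu
```

Coordinate updates use the latest values of the other marginals, so each sweep is a coordinate-ascent step on the mean-field objective.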

Page 8:

Geometry of Mean Field Mean field optimization is always non-convex for any

exponential family in which the state space is finite

Recall the marginal polytope M(G) is a convex hull

M_F(G) contains all the extreme points; if it is a strict subset, then it must be non-convex

Example: two-node Ising model, M_F(G) = { 0 ≤ μ_1, μ_2 ≤ 1, μ_12 = μ_1 μ_2 }

It has a parabolic cross section along μ_1 = μ_2, hence is non-convex

Page 9:

Exact: G[p(X)] (intractable)

Clusters: G[{q_c(X_c)}]

Cluster-based approx. to the Gibbs free energy (Wiegerinck 2001,

Xing et al 03,04)

Page 10:

Mean field approx. to Gibbs free energy

Given a disjoint clustering, {C_1, …, C_I}, of all variables, let

q(X) = ∏_i q_i(X_{C_i})

Mean-field free energy:

G_MF = E_q[E(X)] + Σ_i Σ_{x_{C_i}} q_i(x_{C_i}) ln q_i(x_{C_i})

e.g., for naïve mean field (singleton clusters) on a pairwise model:

G_MF = Σ_{(i,j)} Σ_{x_i, x_j} q_i(x_i) q_j(x_j) E(x_i, x_j) + Σ_i Σ_{x_i} q_i(x_i) E(x_i) + Σ_i Σ_{x_i} q_i(x_i) ln q_i(x_i)

It will never equal the exact Gibbs free energy no matter what clustering is used, but it always defines a lower bound of the likelihood

Optimize each q_i(x_{C_i}) via variational calculus; do inference in each q_i(x_{C_i}) using any tractable algorithm

Page 11:

The Generalized Mean Field theorem

Theorem: The optimum GMF approximation to the cluster marginal is isomorphic to the cluster posterior of the original distribution given internal evidence and its generalized mean fields:

q_i*(X_{H,C_i}) = p(X_{H,C_i} | x_{E,C_i}, ⟨X_{H,MB_i}⟩_{q_{j≠i}})

GMF algorithm: iterate over each q_i

Page 12:

[Xing et al., UAI 2003]

A generalized mean field algorithm

Page 13:

[Xing et al., UAI 2003]

A generalized mean field algorithm

Page 14:

Convergence theorem

Theorem: The GMF algorithm is guaranteed to converge to a local optimum, and provides a lower bound on the likelihood of evidence (or partition function) of the model.

Page 15:

The naive mean field approximation

Approximate p(X) by a fully factorized q(X) = ∏_i q_i(X_i)

For a Boltzmann distribution p(X) = exp{ Σ_{i<j} θ_ij X_i X_j + θ_i0 X_i } / Z:

Gibbs predictive distribution: p(X_i | X_{N_i}) = exp{ θ_i0 X_i + Σ_{j ∈ N_i} θ_ij X_i X_j − A_i }

Mean field equation: q_i(X_i) = exp{ θ_i0 X_i + Σ_{j ∈ N_i} θ_ij X_i ⟨X_j⟩_{q_j} − Ã_i } = p(X_i | { ⟨X_j⟩_{q_j} : j ∈ N_i })

⟨X_j⟩_{q_j} resembles a "message" sent from node j to i

{ ⟨X_j⟩_{q_j} : j ∈ N_i } forms the "mean field" applied to X_i from its neighborhood

Page 16:

Example 1: Generalized MF approximations to Ising models

Cluster marginal of a square block C_k:

q(X_{C_k}) ∝ exp{ Σ_{i ∈ C_k} θ_i0 X_i + Σ_{(i,j) ∈ C_k} θ_ij X_i X_j + Σ_{i ∈ C_k, j ∈ MB(C_k)} θ_ij X_i ⟨X_j⟩_{q_j} }

Virtually a reparameterized Ising model of small size.

Page 17:

GMF approximation to Ising models

(Figure legend: GMF2x2, GMF4x4, BP)

Attractive coupling: positively weighted; repulsive coupling: negatively weighted

Page 18:

Example 2: Sigmoid belief network

(Figure legend: GMFr, GMFb, BP)

Page 19:

Example 3: Factorial HMM

Page 20:

Automatic Variational Inference

Currently, for each new model we have to derive the variational update equations and write application-specific code to find the solution

Each can be time consuming and error prone

Can we build a general-purpose inference engine which automates these procedures?

(Figure: factorial HMM with hidden chains S_1 … S_N and observations y, shown with its mean field approximation and its structured variational approximation)

Page 21:

Probabilistic Topic Models

Humans cannot afford to deal with (e.g., search, browse, or measure similarity) a huge number of text documents

We need computers to help out …

Page 22:

How to get started? Here are some important elements to consider before you start:

Task: Embedding? Classification? Clustering? Topic extraction? …

Data representation: Input and output (e.g., continuous, binary, counts, …)

Model: BN? MRF? Regression? SVM?

Inference: Exact inference? MCMC? Variational?

Learning: MLE? MCLE? Max margin?

Evaluation: Visualization? Human interpretability? Perplexity? Predictive accuracy?

It is better to consider one element at a time!

Page 23:

Tasks: document embedding Say, we want to have a mapping …, so that

Compare similarity; classify contents; cluster/group/categorize; distill semantics and perspectives …

Page 24:

Summarizing the data using topics

Page 25:

See how data changes over time

Page 26:

User interest modeling using topics


http://cogito-demos.ml.cmu.edu/cgi-bin/recommendation.cgi

Page 27:

Representation:

Data: each document is a vector in the word space. Ignore the order of words in a document; only counts matter!

A high-dimensional and sparse representation:
– Not efficient for text processing tasks, e.g., search, document classification, or similarity measures
– Not effective for browsing

As for the Arabian and Palestinean voices that are against the current negotiations and the so-called peace process, they are not against peace per se, but rather for their well-founded predictions that Israel would NOT give an inch of the West bank (and most probably the same for Golan Heights) back to the Arabs. An 18 months of "negotiations" in Madrid, and Washington proved these predictions. Now many will jump on me saying why are you blaming israelis for no-result negotiations. I would say why would the Arabs stall the negotiations, what do they have to loose ?

(Figure: salient words extracted from the document above: Arabian, negotiations, against, peace, Israel, Arabs, blaming)

Bag of Words Representation
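The bag-of-words mapping described above can be sketched in a few lines (the function name, whitespace tokenization, and sparse-dict output are my choices, not the slides'):

```python
from collections import Counter

def bag_of_words(docs):
    """Map each document (a string) to a sparse word-count vector:
    word order is discarded, only counts matter."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        # sparse representation: only non-zero dimensions are stored
        vectors.append({index[w]: c for w, c in counts.items()})
    return vocab, vectors
```

The sparse dict-of-counts form reflects the slide's point: the full vector lives in a vocabulary-sized space but almost all entries are zero.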

Page 28:

How to Model Semantics? Q: What is it about? A: Mainly MT, with syntax, some learning

A Hierarchical Phrase-Based Model for Statistical Machine Translation

"We present a statistical phrase-based translation model that uses hierarchical phrases—phrases that contain sub-phrases. The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system."

(Figure: each topic is a unigram distribution over the vocabulary, with mixing proportions 0.6, 0.3, 0.1 over topics MT, Syntax, Learning; e.g., MT: source, target, SMT, alignment, score, BLEU; Syntax: parse, tree, noun, phrase, grammar, CFG; Learning: likelihood, EM, hidden, parameters, estimation, argMax)

Topic Models

Page 29:

Why is this Useful? Q: What is it about? A: Mainly MT, with syntax, some learning

A Hierarchical Phrase-Based Model for Statistical Machine Translation

"We present a statistical phrase-based translation model that uses hierarchical phrases—phrases that contain sub-phrases. The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system."

MT Syntax Learning

Mixing Proportion

0.6 0.3 0.1

Q: give me similar documents? A structured way of browsing the collection

Other tasks Dimensionality reduction

TF-IDF vs. topic mixing proportion

Classification, clustering, and more …
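Since the slide contrasts TF-IDF with topic mixing proportions as document representations, here is a minimal TF-IDF sketch (the raw-count tf and log(N/df) idf variant are my choices; other variants exist):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weighting of tokenized documents (lists of words):
    tf = raw count in the document, idf = log(N / document frequency)."""
    n = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter(w for d in docs for w in set(d))
    return [{w: c * math.log(n / df[w]) for w, c in Counter(d).items()}
            for d in docs]
```

Note that a word appearing in every document gets weight 0, which is the sense in which TF-IDF downweights uninformative words; topic mixing proportions instead give a dense, low-dimensional alternative.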

Page 30:

Topic Models: The Big Picture

Unstructured Collection Structured Topic Network

Topic Discovery

Dimensionality Reduction

(Figure: documents as points in the word simplex spanned by words w1 … wn, mapped to points in the topic simplex spanned by topics T1, T2, … Tk)

Page 31:

LSI versus Topic Model (probabilistic LSI)

LSI: factor the words × documents matrix as X' = W D, where W is words × topics and D is topics × documents.

Topic models: factor the word probabilities as P(w) = Σ_z P(w|z) P(z); topic mixing is via repeated word labeling.

Page 32:

Words in Contexts

“It was a nice shot. ”

Page 33:

Words in Contexts (cont'd)

"the opposition Labor Party fared even worse, with a predicted 35 seats, seven less than last election."

Page 34:

"Words" in Contexts (con'd)

Sivic et al., ICCV 2005

Page 35:

Admixture Models

Objects are bags of elements

Mixtures are distributions over elements

Each object has a mixing vector representing each mixture's contribution

An object is generated as follows: for each element, pick a mixture component from the mixing vector, then pick an element from that component

money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

(The same topic-labeled document is shown once per mixing vector: 0.1 0.1 0.5…, 0.1 0.5 0.1…, 0.5 0.1 0.1…)

Page 36:

Topic Models: Generating a document

(Plate diagram: prior → θ → z → w ← β; N_d words per document, N documents, K topics)

- Draw θ from the prior
- For each word n:
  - Draw z_n from multinomial(θ)
  - Draw w_n | z_n, β_{1:K} from multinomial(β_{z_n})

Which prior to use?
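The generative process above can be sketched directly (function names and the inverse-CDF sampler are my own; topic-word probabilities β are passed as plain lists):

```python
import random

def generate_document(theta, beta, n_words, rng=random.Random(0)):
    """Sample one document from the topic-model generative process:
    theta: topic mixing proportions (length K),
    beta: K lists of word probabilities over the vocabulary.
    Returns a list of (topic assignment z_n, word id w_n)."""
    def draw(probs):
        # draw an index from a discrete distribution via its CDF
        u, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if u < acc:
                return i
        return len(probs) - 1
    doc = []
    for _ in range(n_words):
        z = draw(theta)      # topic assignment for this word
        w = draw(beta[z])    # word drawn from that topic's unigram
        doc.append((z, w))
    return doc
```

This is the forward direction; inference (next slides) is the reverse problem of recovering θ and z from the words alone.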

Page 37:

Choices of Priors

Dirichlet (LDA) (Blei et al. 2003): conjugate prior means efficient inference; can only capture variations in each topic's intensity independently

Logistic Normal (CTM = LoNTAM) (Blei & Lafferty 2005; Ahmed & Xing 2006): captures the intuition that some topics are highly correlated and can rise up in intensity together; not a conjugate prior, which implies hard inference

Page 38:

Generative Semantics of LoNTAM: Generating a document

- Draw γ = (γ_1, …, γ_{K-1}) ~ N_{K-1}(μ, Σ), with γ_K fixed to 0
- θ_i = e^{γ_i} / Σ_{j=1}^{K} e^{γ_j} (logistic transformation)
- For each word n: draw z_n from multinomial(θ); draw w_n | z_n, β from multinomial(β_{z_n})

C(γ) = log( 1 + Σ_{i=1}^{K-1} e^{γ_i} ): log partition function / normalization constant

(Plate diagram: μ, Σ → γ → z → w ← β; N_d words per document, N documents, K topics)
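The logistic-normal draw above can be sketched as follows; for simplicity I assume a diagonal Σ (given as per-coordinate standard deviations) so the stdlib `random` module suffices, which is an assumption of this sketch, not of the model:

```python
import math
import random

def sample_logistic_normal(mu, sigma, rng=random.Random(0)):
    """Draw topic proportions from a diagonal logistic-normal prior:
    gamma_i ~ N(mu_i, sigma_i^2) for i < K, gamma_K fixed to 0,
    then theta = softmax(gamma)."""
    gamma = [rng.gauss(m, s) for m, s in zip(mu, sigma)] + [0.0]
    z = sum(math.exp(g) for g in gamma)  # = exp(C(gamma)) up to the fixed gamma_K
    return [math.exp(g) / z for g in gamma]
```

Unlike a Dirichlet draw, the covariance Σ lets components of γ (and hence topic intensities) rise and fall together, which is exactly the correlation the logistic-normal prior is meant to capture.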

Page 39:

Posterior inference

Page 40:

Posterior inference results

(Plate diagram: α → θ → z → w ← β_{1:K}; D documents)

Topics (example word lists: "Bayesian, model, inference, …"; "input, output, system, …"; "cortex, cortical, areas, …"); topic proportions; topic assignments

Page 41:

Joint likelihood of all variables


We are interested in computing the posterior, and the data likelihood!

Page 42:

A possible query: p(θ_n | D)? p(z_{n,m} | D)?

Closed-form solution?

p(D) = ∏_n ∫ p(θ_n | α) ( ∏_m Σ_{z_{n,m}} p(z_{n,m} | θ_n) p(x_{n,m} | z_{n,m}, β) ) dθ_n

p(θ_n | D) = p(θ_n, D) / p(D), with the same sum-and-integral structure in the numerator

The sum in the denominator is over T^n terms, and we must integrate over n k-dimensional topic vectors

Learning: what to learn? What is the objective function?

Inference and learning are both intractable

Page 43:

Approximate Inference

Variational Inference

Mean field approximation (Blei et al) Expectation propagation (Minka et al) Variational 2nd-order Taylor approximation (Xing)

Markov Chain Monte Carlo

Gibbs sampling (Griffiths et al)

Page 44:

Mean-field assumption

Break the dependencies in the true posterior using a fully factorized distribution

Mean-field family usually does NOT include the true posterior.

Page 45:

Update each marginal

In LDA, updating q(θ) we obtain another Dirichlet: the same family as its prior!

Page 46:

Coordinate ascent algorithm for LDA
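For the LDA case, the coordinate ascent updates are γ_k = α_k + Σ_n φ_{nk} and φ_{nk} ∝ β_{k,w_n} exp(ψ(γ_k)) (Blei et al. 2003). A minimal per-document sketch (function names are mine; ψ is implemented with a standard recurrence-plus-series since the Python stdlib lacks a digamma):

```python
import math

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def lda_e_step(doc, alpha, beta, n_iters=50):
    """Per-document coordinate ascent for LDA:
    gamma_k = alpha_k + sum_n phi_nk;  phi_nk proportional to beta[k][w_n]*exp(psi(gamma_k)).
    doc: list of word ids; alpha: length-K Dirichlet prior; beta: K x V topic-word probs."""
    K = len(alpha)
    gamma = [alpha[k] + len(doc) / K for k in range(K)]
    phi = [[1.0 / K] * K for _ in doc]
    for _ in range(n_iters):
        for n, w in enumerate(doc):
            ex = [beta[k][w] * math.exp(digamma(gamma[k])) for k in range(K)]
            s = sum(ex)
            phi[n] = [e / s for e in ex]  # normalize per word
        gamma = [alpha[k] + sum(phi[n][k] for n in range(len(doc))) for k in range(K)]
    return gamma, phi
```

Each sweep alternates the two updates until the variational parameters stabilize; a full learning loop would also re-estimate β across documents.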

Page 47:

Choice of q() does matter

Approximate P(θ, {z} | D) with the factorization q(γ, z_{1:N}) = q*_γ(γ) ∏_n q*_{z_n}(z_n)

Log partition function: log( 1 + Σ_{i=1}^{K-1} e^{γ_i} )

Ahmed & Xing: multivariate quadratic approximation; closed-form solution for μ*, Σ*; Σ* is a full matrix

Blei & Lafferty: tangent approximation; numerical optimization to fit μ*, Diag(Σ*); Σ* is assumed to be diagonal

(Plate diagrams: μ, Σ → γ → z → w ← β, with variational parameters μ*, Σ*, φ)

Page 48:

Tangent Approximation

Page 49:

How to evaluate? Empirical visualization: e.g., topic discovery on New York Times

Page 50:

How to evaluate?

(Plate diagram: μ, Σ → γ → z → w ← β)

• Test on synthetic text where ground truth is known

Page 51:

Comparison: accuracy and speed. L2 error in topic vector estimation and # of iterations

Varying Num. of Topics

Varying Voc. Size

Varying Num. Words Per Document

Page 52:

Comparison: perplexity
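Perplexity, the held-out metric compared here, is the exponential of the negative average per-word log-likelihood; a minimal sketch (signature is my own, assuming per-document log-likelihoods are already computed):

```python
import math

def perplexity(doc_log_likelihoods, n_words):
    """Perplexity from held-out per-document log-likelihoods log p(w_d),
    with n_words total words across the held-out set. Lower is better."""
    return math.exp(-sum(doc_log_likelihoods) / n_words)
```

A perplexity of V on a vocabulary of size V corresponds to a model no better than uniform word guessing.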

Page 53:

Classification Result on the PNAS collection

PNAS abstracts from 1997-2002: 2500 documents, an average of 170 words per document

Fitted a 40-topic model using both approaches; use the low-dimensional representation to predict the abstract category

Use an SVM classifier; 85% for training and 15% for testing

Classification accuracy: notable difference; examine the low-dimensional representations below

Page 54:

What makes topic models useful --- The Zoo of Topic Models! It is a building block of many models.

© Eric Xing @ CMU, 2005-2013 54

Williamson et al. 2010; Chang & Blei, 2009; Boyd-Graber & Blei, 2008; Wang & Blei, 2008; McCallum et al. 2007; Titov & McDonald, 2008

(Figure: a gallery of topic-model extensions built on the basic model, including a relational topic model over document pairs and a syntactic topic model in which parse trees are grouped into M documents)

Page 55:

Conclusion: GM-based topic models are cool

Flexible, modular, interactive

There are many ways of implementing topic models: unsupervised, supervised

Efficient inference/learning algorithms: GMF, with Laplace approximation for non-conjugate distributions; MCMC

Many applications: word-sense disambiguation, image understanding, network inference

Page 56:

Summary on VI

Variational methods in general turn inference into an optimization problem via exponential families and convex duality

The exact variational principle is intractable to solve; there are two distinct components for approximations: either an inner or outer bound to the marginal polytope, and various approximations to the entropy function

Mean field: non-convex inner bound and exact form of entropy

BP: polyhedral outer bound and non-convex Bethe approximation

Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (Yedidia et al. 2002)

© Eric Xing @ CMU, 2005-2014 56