TRANSCRIPT
Probabilistic Graphical Models
Variational Inference 1
Eric Xing
Lecture 7, February 5, 2020
Reading: see class homepage
© Eric Xing @ CMU, 2005-2020
Inference Problems in Graphical Models
- E.g., a general undirected graphical model (MRF):

      p(x) = (1/Z) ∏_{C ∈ 𝒞} ψ_C(x_C)

- The quantities of interest:
  - marginal distributions:  p(x_i) = Σ_{x_j, j ≠ i} p(x)
  - normalization constant (partition function):  Z = Σ_x ∏_{C ∈ 𝒞} ψ_C(x_C)
- Exact inference: tree graphs, discrete scopes or known integrals, ...
- What if exact inference is expensive or even impossible? (When can this happen?)
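Both quantities can be computed by brute-force enumeration when the model is tiny; a minimal sketch (a 3-node chain MRF with illustrative potentials):

```python
import itertools
import numpy as np

# A tiny 3-node chain MRF over binary variables with pairwise
# potentials psi(x_a, x_b); the potential values are illustrative.
psi = np.array([[2.0, 1.0],
                [1.0, 2.0]])  # favors agreement between neighbors

def unnormalized(x):
    # Product of clique potentials along the chain 0-1-2.
    return psi[x[0], x[1]] * psi[x[1], x[2]]

# Partition function Z: sum over all 2^3 joint configurations.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)

# Marginal p(x_i = v): sum p(x) over configurations with x_i = v.
def marginal(i, v):
    return sum(unnormalized(x) for x in states if x[i] == v) / Z

print(Z)
print(marginal(0, 0) + marginal(0, 1))  # marginals sum to 1
```

Enumeration costs 2^n terms, which is exactly why approximate inference is needed for large n.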
Approximate Inference: The Big Picture
- Variational Inference
  - Mean-field (inner approximation)
  - Loopy Belief Propagation (outer approximation)
  - Kikuchi and variants (tighter outer approximation)
  - Expectation Propagation (reverse KL)
  - ...
- Sampling
  - Monte Carlo
  - Sequential Monte Carlo (Particle Filters)
  - Markov Chain Monte Carlo
  - Hybrid Monte Carlo
  - ...
Variational Methods
- "Variational": a fancy name for optimization-based formulations
  - i.e., represent the quantity of interest as the solution to an optimization problem
  - approximate the desired solution by relaxing/approximating the intractable optimization problem
- Examples:
  - Courant-Fischer for eigenvalues:

        λ_max(A) = max_{‖x‖₂ = 1} xᵀAx

  - Linear system of equations:  Ax = b,  A ≻ 0,  x* = A⁻¹b
    - variational formulation:

          x* = argmin_x { (1/2) xᵀAx − bᵀx }

    - for a large system, apply the conjugate gradient method
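The variational formulation of the linear system can be sketched directly; A and b below are illustrative, and the solver is a plain conjugate gradient loop minimizing the quadratic:

```python
import numpy as np

# Solve Ax = b (A symmetric positive definite) by minimizing the
# quadratic f(x) = 0.5 x^T A x - b^T x with conjugate gradients.
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - A @ x          # residual = negative gradient of f
    p = r.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)   # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p         # next A-conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # illustrative SPD system
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # True: the minimizer solves Ax = b
```

The minimizer of the quadratic is exactly the solution of the linear system, which is the sense in which the optimization problem "represents" the quantity of interest.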
Variational Inference: High-level Idea
- Inference: answer queries about P
- Challenge: direct inference on P is often intractable
- Indirect approach:
  - Project P to a tractable family of distributions Q
  - Perform inference on the projected Q
- Projection requires a measure of distance
  - A convenient choice: KL(Q ‖ P)
- Mean field: assume Q is fully factorized
  - The simplest possible family of distributions
- Example: Latent Dirichlet Allocation (LDA)
3.4 Mean Parameterization and Inference Problems

Fig. 3.5: Generic illustration of M for a discrete random variable with |X^m| finite. In this case, the set M is a convex polytope, corresponding to the convex hull of {φ(x) | x ∈ X^m}. By the Minkowski-Weyl theorem, this polytope can also be written as the intersection of a finite number of half-spaces, each of the form {μ ∈ R^d | ⟨a_j, μ⟩ ≥ b_j} for some pair (a_j, b_j) ∈ R^d × R.

Example 3.8 (Ising Mean Parameters). Continuing from Example 3.1, the sufficient statistics for the Ising model are the singleton functions (x_s, s ∈ V) and the pairwise functions (x_s x_t, (s, t) ∈ E). The vector of sufficient statistics takes the form:

    φ(x) := ( x_s, s ∈ V;  x_s x_t, (s, t) ∈ E ) ∈ R^{|V|+|E|}.    (3.30)

The associated mean parameters correspond to particular marginal probabilities, associated with nodes and edges of the graph G as

    μ_s = E_p[X_s] = P[X_s = 1]                    for all s ∈ V, and    (3.31a)
    μ_st = E_p[X_s X_t] = P[(X_s, X_t) = (1, 1)]   for all (s, t) ∈ E.   (3.31b)

Consequently, the mean parameter vector μ ∈ R^{|V|+|E|} consists of marginal probabilities over singletons (μ_s), and pairwise marginals over variable pairs on graph edges (μ_st). The set M consists of the convex hull of {φ(x), x ∈ {0,1}^m}, where φ is given in Equation (3.30). In probabilistic terms, the set M corresponds to the set of all singleton and pairwise marginal probabilities that can be realized by some distribution over (X_1, ..., X_m) ∈ {0,1}^m. In the polyhedral combinatorics literature, this set is known as the correlation polytope, or the cut polytope [69, 187].
    Q* = argmin_{Q ∈ Q_tractable} KL(Q ‖ P)
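A small numeric sketch of the projection idea (all distributions illustrative): a fully factorized Q cannot represent the correlation in P, and KL(Q ‖ P) measures what is lost:

```python
import numpy as np

# Approximate a 2-variable joint P by a fully factorized
# Q(x, y) = q1(x) q2(y) and measure KL(Q || P). P is illustrative.
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])  # correlated binary pair

def kl(Q, P):
    # KL(Q || P) = sum_x Q(x) log( Q(x) / P(x) )
    return float(np.sum(Q * np.log(Q / P)))

# One natural factorized candidate: the product of P's marginals.
q1 = P.sum(axis=1)   # marginal of x
q2 = P.sum(axis=0)   # marginal of y
Q = np.outer(q1, q2)

print(kl(Q, P) >= 0)       # True: KL is nonnegative
print(round(kl(Q, P), 4))  # strictly positive: the correlation is lost
```

The gap is exactly the price of restricting Q to a factorized family; variational inference searches within that family for the member closest to P.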
Probabilistic Topic Models
- Humans cannot afford to deal with (e.g., search, browse, or measure similarity across) a huge number of text documents
- We need computers to help out ...
How to get started for a new modeling task?
Here are some important elements to consider before you start:
- Task:
  - Embedding? Classification? Clustering? Topic extraction? ...
- Data representation:
  - Input and output (e.g., continuous, binary, counts, ...)
- Model:
  - BN? MRF? Regression? SVM?
- Inference:
  - Exact inference? MCMC? Variational?
- Learning:
  - MLE? MCLE? Max margin?
- Evaluation:
  - Visualization? Human interpretability? Perplexity? Predictive accuracy?

It is better to consider one element at a time!
Tasks: document embedding
- Say we want to have a mapping ..., so that we can:
  - Compare similarity
  - Classify contents
  - Cluster/group/categorize
  - Distill semantics and perspectives
  - ...
Summarizing the data using topics

Some example topics:

- Education: students, education, learning, educational, teaching, school, students, skills, teacher, academic
- Market: market, economic, financial, economics, markets, returns, price, stock, value, investment
- Visual cortex: cortex, cortical, areas, visual, area, primary, connections, ventral, cerebral, sensory
- Bayesian modeling: Bayesian, model, inference, models, probability, probabilistic, Markov, prior, hidden, approach

Chong Wang, Probabilistic Modeling for Large-scale Data Exploration
See how data changes over time

[Figure 6: Time stamp prediction; bar charts of average absolute error for 1-, 3-, 5-, and 10-topic models. 'm' stands for the flat approach of 'month', 'w' for 'week', 'd' for 'day', and 'h' for 'hour'. 'm+w' stands for the hierarchical approach of combining 'month' and 'week', and 'm+w+d', 'w+d', 'w+d+h' are similarly defined. The baseline is the expectation of the error obtained by randomly assigning a time. (a) AP data, error in hours: the 5-topic and 10-topic models perform better than the others, and the hierarchical approach always achieves the best performance. (b) Election 08 data, error in days: the 1-topic model performs best due to the short documents; the hierarchical approach achieves comparable performance.]
[Figure 7: Examples from a 3-topic cDTM using the week model on the Election 08 data; top words are shown at seven time points from 2/27/2007 to 2/19/2008 (early lists led by words like "healthcare", "wisconsin", "vegas", "superdelegate"; later lists led by "obama", "clinton", "hillary", "barack", "campaign"). In 2007 the topics were more about general issues, while around 2008 they were more about candidates and changed faster.]
User interest modeling using topics
http://cogito-demos.ml.cmu.edu/cgi-bin/recommendation.cgi
Representation:
- Data:
  - Each document is a vector in the word space
  - Ignore the order of words in a document; only counts matter!
- A high-dimensional and sparse representation
  - Not efficient for text processing tasks, e.g., search, document classification, or similarity measures
  - Not effective for browsing
As for the Arabian and Palestinean voices that are against the current negotiations and the so-called peace process, they are not against peace per se, but rather for their well-founded predictions that Israel would NOT give an inch of the West bank (and most probably the same for Golan Heights) back to the Arabs. An 18 months of "negotiations" in Madrid, and Washington proved these predictions. Now many will jump on me saying why are you blaming israelis for no-result negotiations. I would say why would the Arabs stall the negotiations, what do they have to loose ?
Highlighted words: Arabian, negotiations, against, peace, Israel, Arabs, blaming
Bag of Words Representation
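A bag-of-words sketch (crude tokenizer; the sample text is truncated):

```python
from collections import Counter
import re

# Bag of words: map a document to word counts, discarding order.
doc = ("As for the Arabian and Palestinean voices that are against the "
       "current negotiations and the so-called peace process ...")

tokens = re.findall(r"[a-z]+", doc.lower())  # crude tokenizer
bow = Counter(tokens)

print(bow["the"])           # counts, not positions, are all that survive
print(bow["negotiations"])
```

The resulting vector lives in a space with one dimension per vocabulary word, which is why the representation is high-dimensional and sparse.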
How to Model Semantics?

- Q: What is it about?
- A: Mainly MT, with syntax, some learning
A Hierarchical Phrase-Based Model for Statistical Machine Translation
We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.
Example topics (each a unigram distribution over the vocabulary) and the document's mixing proportions:

- MT (0.6): source, target, SMT, alignment, score, BLEU
- Syntax (0.3): parse, tree, noun, phrase, grammar, CFG
- Learning (0.1): likelihood, EM, hidden, parameters, estimation, argMax

Topic Models
Why is this Useful?

- Q: What is it about?
- A: Mainly MT, with syntax, some learning
A Hierarchical Phrase-Based Model for Statistical Machine Translation
We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.
Topics: MT, Syntax, Learning; mixing proportions: 0.6, 0.3, 0.1
- Q: Give me similar documents?
  - A structured way of browsing the collection
- Other tasks:
  - Dimensionality reduction
    - TF-IDF vs. topic mixing proportion
  - Classification, clustering, and more ...
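For contrast with topic mixing proportions, a TF-IDF sketch (toy corpus; all documents illustrative):

```python
import math
from collections import Counter

# TF-IDF: the classical alternative to topic mixing proportions
# for representing documents in a lower-effort way.
docs = [
    "statistical machine translation model",
    "syntax tree grammar parse",
    "machine learning model parameters",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tfidf(term, doc_tokens):
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for d in tokenized if term in d)  # document frequency
    idf = math.log(N / df)                       # rarer terms weigh more
    return tf * idf

# "model" appears in 2 of 3 docs, "translation" in only 1, so the
# rarer term gets the larger weight in document 0:
print(tfidf("translation", tokenized[0]) > tfidf("model", tokenized[0]))
```

Unlike a topic mixing vector, the TF-IDF vector has one dimension per vocabulary word, so it reduces noise but not dimensionality.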
Topic Models: The Big Picture
Unstructured Collection Structured Topic Network
Topic Discovery
Dimensionality Reduction
[Diagram: documents as points x in the word simplex spanned by words w1, w2, ..., wn; topic discovery maps them onto a lower-dimensional topic simplex spanned by topics T1, T2, ..., Tk.]
LSI versus Topic Model (probabilistic LSI)

- LSI: factorize the words-by-documents count matrix into words-by-topics, topic-by-topic, and topics-by-documents factors:

      C = W L Dᵀ

- Topic models: a probabilistic factorization of the words-by-documents probabilities into words-by-topics and topics-by-documents distributions:

      p(w) = Σ_z p(w|z) p(z)

  Topic mixing is via repeated word labeling.
Words in Contexts
- "It was a nice shot."
Words in Contexts (con'd)
- "The opposition Labor Party fared even worse, with a predicted 35 seats, seven less than last election."
"Words" in Contexts (con'd)
Sivic et al., ICCV 2005
More Generally: Admixture Models
- Objects are bags of elements
- Mixtures are distributions over elements
- Objects have a mixing vector θ
  - Represents each mixture's contribution
- An object is generated as follows:
  - Pick a mixture component from θ
  - Pick an element from that component
Example documents (each word token is tagged with the mixture component that generated it):

money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

Mixing vectors θ, one per document: (0.1, 0.1, 0.5, ...), (0.1, 0.5, 0.1, ...), (0.5, 0.1, 0.1, ...)
Topic Models Represented as a GM
[Plate diagram: prior → θ → z → w, with topics β; word plate N_d, document plate N, K topics.]

Generating a document:

1. Draw θ from the prior
2. For each word n = 1, ..., N_d:
   - Draw z_n from Multinomial(θ)
   - Draw w_n from Multinomial(w | z_n, β_{1:k}), i.e., from the topic β_{z_n}
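With a Dirichlet prior on θ this generative process is LDA; a sketch (vocabulary, topics, and hyperparameters all illustrative):

```python
import numpy as np

# Sampling a document: draw topic proportions theta from the prior,
# then for each word draw a topic z_n and a word w_n from that topic.
rng = np.random.default_rng(0)

vocab = ["money", "bank", "loan", "river", "stream"]
beta = np.array([[0.3, 0.4, 0.3, 0.0, 0.0],   # a "finance" topic
                 [0.0, 0.3, 0.0, 0.4, 0.3]])  # a "geography" topic
alpha = np.array([1.0, 1.0])                  # Dirichlet prior on theta

def generate_document(n_words):
    theta = rng.dirichlet(alpha)              # topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(beta), p=theta)    # z_n ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=beta[z]) # w_n ~ Multinomial(beta_z)
        words.append(vocab[w])
    return words

doc = generate_document(10)
print(len(doc))  # 10 tokens drawn from a mixture of the two topics
```

Swapping the Dirichlet prior for a logistic normal gives the correlated topic model discussed next.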
Which prior to use?
Choices of Priors
- Dirichlet (LDA) (Blei et al., 2003)
  - A conjugate prior means efficient inference
  - Can only capture variations in each topic's intensity independently
- Logistic Normal (CTM = LoNTAM) (Blei & Lafferty, 2005; Ahmed & Xing, 2006)
  - Captures the intuition that some topics are highly correlated and can rise in intensity together
  - Not a conjugate prior, which implies harder inference
Generative Semantic of LoNTAM
[Plate diagram: μ, Σ → γ → z → w, with topics β; word plate N_d, document plate N, K topics.]

Generating a document:

1. Draw γ ~ N(μ, Σ), with γ_K = 0
2. Map γ to the simplex via the logistic transformation:

       θ_i = exp{ γ_i − C(γ) },   where   C(γ) = log( 1 + Σ_{i=1}^{K−1} e^{γ_i} )

3. For each word n = 1, ..., N_d:
   - Draw z_n from Multinomial(θ)
   - Draw w_n from Multinomial(w | z_n, β_{1:k})

Problem: C(γ) is the log partition function (normalization constant), and it couples all the components of γ, which complicates inference.
Posterior inference
[Plate diagram: α → topic proportions θ_d → topic assignments z_dn → words w_dn, with topics β_k (prior η); word plate N, document plate D, K topics.]
Posterior inference results
[Plate diagram: α → θ → z → w with topics β (prior η), annotated with inference results: inferred topics (e.g., "Bayesian, model, inference, ..."; "input, output, system, ..."; "cortex, cortical, areas, ..."), per-document topic proportions, and per-word topic assignments.]
Joint likelihood of all variables
[Plate diagram: α → θ → z → w, with topics β (prior η); word plate N, document plate D, K topics.]
We are interested in computing the posterior, and the data likelihood!
Inference and Learning are both intractable
- A possible query:

      p(θ_n | D) = ?,    p(z_n,m | D) = ?

- Closed-form solution?

      p(θ_n, z_n | D) = p(D, θ_n, z_n) / p(D)

      p(D) = ∫∫ ( ∏_n ∏_m Σ_{z_n,m} p(w_n,m | z_n,m, β) p(z_n,m | θ_n) ) p(θ_n | α) p(β | η) dθ dβ

- The sum in the denominator is over T^N terms, and we integrate over n K-dimensional topic vectors
- Learning: what to learn? What is the objective function?
Approximate Inference
- Variational inference:
  - Mean-field approximation (Blei et al.)
  - Expectation propagation (Minka et al.)
  - Variational 2nd-order Taylor approximation (Xing)
- Markov chain Monte Carlo:
  - Gibbs sampling (Griffiths et al.)
Variational Inference
- Consider a generative model p_θ(x|z) and prior p(z)
  - Joint distribution: p_θ(x, z) = p_θ(x|z) p(z)
- Assume a variational distribution q_φ(z|x)
- Objective: maximize a lower bound on the log likelihood

      log p(x) = KL( q_φ(z|x) ‖ p_θ(z|x) ) + Σ_z q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ]
               ≥ Σ_z q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ]  =:  ℒ(θ, φ; x)

- Equivalently, minimize the free energy

      F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) ‖ p_θ(z|x) )
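The decomposition above can be checked numerically on a tiny discrete model (all distributions illustrative):

```python
import numpy as np

# Check: log p(x) = KL(q(z|x) || p(z|x)) + ELBO, hence ELBO <= log p(x).
p_z = np.array([0.6, 0.4])             # prior p(z)
p_x_given_z = np.array([0.2, 0.7])     # p(x=1 | z) for z = 0, 1

p_joint = p_z * p_x_given_z            # p(x=1, z)
p_x = p_joint.sum()                    # evidence p(x=1)
p_post = p_joint / p_x                 # exact posterior p(z | x=1)

q = np.array([0.5, 0.5])               # some variational q(z | x=1)
kl = np.sum(q * np.log(q / p_post))    # KL(q || posterior)
elbo = np.sum(q * np.log(p_joint / q)) # the lower bound

print(np.isclose(np.log(p_x), kl + elbo))  # True: the identity holds
print(elbo <= np.log(p_x))                 # True: ELBO is a lower bound
```

Since KL is nonnegative, maximizing the ELBO over q tightens the bound, and the bound is exact when q equals the true posterior.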
Variational Inference
Maximize the variational lower bound

      ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
                 = log p(x) − KL( q_φ(z|x) ‖ p_θ(z|x) )

- E-step: maximize ℒ w.r.t. φ, with θ fixed:  max_φ ℒ(θ, φ; x)
  - If closed-form solutions exist:  q_φ*(z|x) ∝ exp[ log p_θ(x, z) ]
- M-step: maximize ℒ w.r.t. θ, with φ fixed:  max_θ ℒ(θ, φ; x)
Mean-field assumption (in topic models)
- True posterior:

      p(β, θ, z | w) = p(β, θ, z, w) / p(w)

- Break the dependencies using the fully factorized distribution:

      q(β, θ, z) = ∏_k q(β_k) ∏_d q(θ_d) ∏_n q(z_dn)

- The mean-field family usually does NOT include the true posterior.
Mean Field Approximation
- Parametric form for each marginal factor in q(β, θ, z | λ, γ, φ):

      q(β_k | λ_k) = Dirichlet(β_k | λ_k)
      q(θ_d | γ_d) = Dirichlet(θ_d | γ_d)
      q(z_dn | φ_dn) = Multinomial(z_dn | φ_dn)

- Learning the parameters of the variational distribution (E-step):

      λ*, γ*, φ* = argmin_{λ, γ, φ} KL( q(β, θ, z | λ, γ, φ) ‖ p(β, θ, z | w, α, η) )

- For LDA, we can compute the optimal MF approximation in closed form.
Update each marginal

- Update: each optimal factor takes the form q*(θ_d) ∝ exp{ E_q[ log p(θ_d, z_d, w_d | α, β) ] }
- Where, in LDA, this gives the coordinate update:

      γ_d(k) = α(k) + Σ_{n=1}^{N_d} φ_dn(k)

- And we obtain: this is also a Dirichlet, the same family as its prior!
Update each marginal
- Similarly to q(θ_d | γ_d), we obtain optimal parameters φ_dn* for q(z_dn | φ_dn):

      q(z_dn = k | φ_dn) = φ_dn(k) ∝ β_k(w_dn) exp{ Ψ(γ_d(k)) − Ψ( Σ_{j=1}^K γ_d(j) ) }

- And optimal parameters λ_k* for q(β_k | λ_k):

      λ_k(j) = η(j) + Σ_{d=1}^D Σ_{n=1}^{N_d} φ_dn*(k) 1[w_dn = j]

- Iterating these equations to convergence yields the MF approximation to the posterior distribution.
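One round of these updates on a toy corpus (sizes and hyperparameters illustrative; digamma is implemented inline so the sketch needs nothing beyond NumPy):

```python
import numpy as np

# LDA mean-field coordinate updates: phi (assignments), gamma
# (proportions), lambda (topics), on a two-document toy corpus.
def digamma(x):
    # recurrence plus asymptotic expansion; adequate for x > 0
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + np.log(x) - 1.0/(2*x) - 1.0/(12*x**2)

digamma_v = np.vectorize(digamma)

K, V = 2, 4                        # topics, vocabulary size
alpha, eta = 0.5, 0.5
docs = [[0, 1, 1, 2], [2, 3, 3]]   # word ids per document

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(V), size=K)  # current topic-word dists
gamma = np.ones((len(docs), K))           # variational Dirichlet params
lam = np.full((K, V), eta)

for d, words in enumerate(docs):
    for _ in range(20):            # inner loop: phi and gamma updates
        expE = np.exp(digamma_v(gamma[d]) - digamma(gamma[d].sum()))
        phi = beta[:, words].T * expE             # unnormalized phi_dn(k)
        phi /= phi.sum(axis=1, keepdims=True)     # normalize over topics
        gamma[d] = alpha + phi.sum(axis=0)        # gamma update
    for n, w in enumerate(words):
        lam[:, w] += phi[n]                       # lambda update

# Each gamma_d sums to K*alpha + N_d by construction of the update.
print(np.allclose(gamma.sum(axis=1), [alpha*K + len(d) for d in docs]))
```

The exp-digamma factor is the E_q[log θ_d(k)] term from the update equation above; everything else is bookkeeping.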
Coordinate ascent algorithm for LDA

 1: Initialize variational topics q(β_k), k = 1, ..., K
 2: repeat
 3:   for each document d ∈ {1, 2, ..., D} do
 4:     Initialize variational topic assignments q(z_dn), n = 1, ..., N
 5:     repeat
 6:       Update variational topic proportions q(θ_d)
 7:       Update variational topic assignments q(z_dn), n = 1, ..., N
 8:     until the change of q(θ_d) is small enough
 9:   end for
10:   Update variational topics q(β_k), k = 1, ..., K
11: until the lower bound ℒ(q) converges

Chong Wang (CMU), Large-scale Latent Variable Models, February 28, 2013
Conclusion
- GM-based topic models are cool
  - Flexible
  - Modular
  - Interactive
- There are many ways of implementing topic models
  - Unsupervised
  - Supervised
- Efficient inference/learning algorithms
  - GMF, with Laplace approximation for non-conjugate distributions
  - MCMC
- Many applications
  - Word-sense disambiguation
  - Image understanding
  - Network inference
  - ...
Supplementary
Supplementary: More on strategies in VI
- Alternative approximation schemes
- How to evaluate: empirical (ground truth unknown) vs. simulation (ground truth known)
- Comparison (of what?)
- Building blocks
Recall Choices of Priors
- Dirichlet (LDA) (Blei et al., 2003)
  - A conjugate prior means efficient inference
  - Can only capture variations in each topic's intensity independently
- Logistic Normal (CTM = LoNTAM) (Blei & Lafferty, 2005; Ahmed & Xing, 2006)
  - Captures the intuition that some topics are highly correlated and can rise in intensity together
  - Not a conjugate prior, which implies harder inference
Computing P({z}, γ | D):

[Plate diagram: μ, Σ → γ → z → w, with topics β.]

Log partition function:

    C(γ) = log( 1 + Σ_{i=1}^{K−1} e^{γ_i} )

The choice of q() does matter:

    q(γ, z_{1:N}) = q(γ | μ*, Σ*) ∏_n q(z_n | φ_n)

- Ahmed & Xing: Σ* is a full matrix; multivariate quadratic approximation; closed-form solution for μ*, Σ*
- Blei & Lafferty: Σ* is assumed to be diagonal; tangent approximation; numerical optimization to fit μ*, Diag(Σ*)
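A sketch of the logistic-normal mapping and its log partition function C(γ) (all parameter values illustrative):

```python
import numpy as np

# gamma ~ N(mu, Sigma) with the K-th component pinned to 0;
# theta_i = exp(gamma_i - C(gamma)), C(gamma) = log(1 + sum exp(gamma_i)).
rng = np.random.default_rng(0)

K = 4
mu = np.zeros(K - 1)
Sigma = np.eye(K - 1)

gamma = rng.multivariate_normal(mu, Sigma)  # K-1 free components
C = np.log1p(np.exp(gamma).sum())           # log partition function
theta = np.exp(np.append(gamma, 0.0) - C)   # point on the simplex

print(np.isclose(theta.sum(), 1.0))  # True: theta lies on the simplex
print((theta > 0).all())             # True
```

Off-diagonal entries of Sigma induce correlations between topics, which is precisely what the Dirichlet prior cannot express.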
Tangent Approximation
How to evaluate?
- Empirical visualization: e.g., topic discovery on the New York Times

[Figure: "The 5 most frequent topics from the HDP on the New York Times," shown as top-word lists; the figure displays 15 topics in all, e.g., sports (game, second, season, team, play, games, players, points, coach, giants), film/TV (life, says, show, man, director, television, film, story, movie, films), politics (bush, political, party, clinton, campaign, republican, democratic, senator, democrats), finance (percent, business, market, companies, stock, bank, financial, fund, investors, funds), the Iraq war (government, officials, war, military, iraq, army, forces, troops, iraqi, soldiers), and crime (street, yesterday, police, man, case, found, officer, shot, officers, charged).]

Chong Wang, Probabilistic Modeling for Large-scale Data Exploration
How to evaluate?
- Test on synthetic text where the ground truth is known:

[Plate diagram: μ, Σ → θ → z → w, with topics β.]
Comparison: accuracy and speed
- L2 error in topic-vector estimation and number of iterations:
  - varying the number of topics
  - varying the vocabulary size
  - varying the number of words per document
Comparison: perplexity
Classification Result on PNAS collection
- PNAS abstracts from 1997-2002
  - 2500 documents
  - Average of 170 words per document
- Fitted 40-topic models using both approaches
- Used the low-dimensional representation to predict the abstract category
  - SVM classifier
  - 85% for training and 15% for testing

Classification accuracy:
- Notable difference
- Examine the low-dimensional representations below
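A sketch of the evaluation protocol (synthetic topic proportions; a nearest-centroid classifier stands in for the SVM so the example stays dependency-free):

```python
import numpy as np

# Represent each document by its low-dimensional topic proportions,
# split 85/15, and classify. All data here are synthetic.
rng = np.random.default_rng(0)

n_per_class = 100  # two classes with different typical topic mixes
class0 = rng.dirichlet([8, 2, 1, 1], size=n_per_class)
class1 = rng.dirichlet([1, 1, 2, 8], size=n_per_class)
X = np.vstack([class0, class1])
y = np.array([0]*n_per_class + [1]*n_per_class)

# 85% train / 15% test split, as on the slide.
idx = rng.permutation(len(y))
cut = int(0.85 * len(y))
tr, te = idx[:cut], idx[cut:]

# Nearest-centroid: assign each test point to the closest class mean.
centroids = np.array([X[tr][y[tr] == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(
    np.linalg.norm(X[te][:, None, :] - centroids[None], axis=2), axis=1)
accuracy = (pred == y[te]).mean()
print(accuracy > 0.9)  # well-separated proportions classify easily
```

The point of the experiment is that the quality of the inferred low-dimensional representation, not the classifier, drives the accuracy gap between the two inference schemes.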
What makes topic models useful? The Zoo of Topic Models!

- It is a building block of many models: Williamson et al., 2010; Chang & Blei, 2009; Boyd-Graber & Blei, 2008; Wang & Blei, 2008; McCallum et al., 2007; Titov & McDonald, 2008
Extending LDA

... by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.
The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, recipient, and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.
The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network: topics which the models that only examine words are unable to detect.
We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters' patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.
In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_st for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V(b) is generated, where each cell V(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL    DESCRIPTION
g_it      entity i's group assignment in topic t
t_b       topic of an event b
w(b)_k    the kth token in the event b
V(b)_ij   entity i and j's groups behaved the same (1) or differently (2) on the event b
S         number of entities
T         number of topics
G         number of groups
B         number of events
V         number of unique words
N_b       number of word tokens in the event b
S_b       number of entities who participated in the event b

Table 1: Notation used in this paper
[Figure 1: The Group-Topic model (plate diagram over the variables w, v, t, g with plates N_b, S_b, T, B, G²).]
same or not on a bill). The elements of V are sampled from a binomial distribution γ(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.
Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities' groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).

When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows,
    t_b ~ Uniform(1/T)
    w_it | φ_t ~ Multinomial(φ_t)
    φ_t | η ~ Dirichlet(η)
    g_it | θ_t ~ Multinomial(θ_t)
    θ_t | α ~ Dirichlet(α)
    V(b)_ij | γ(b)_{g_i g_j} ~ Binomial(γ(b)_{g_i g_j})
    γ(b)_gh | δ ~ Beta(δ).
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and the gamma random variables φ determine the relative masses associated with these atoms.

2.4. Focused Topic Models

Suppose H parametrizes distributions over words. Then the ICD defines a generative topic model, where it is used to generate a set of sparse distributions over an infinite number of components, called "topics." Each topic is drawn from a Dirichlet distribution over words. In order to specify a fully generative model, we sample the number of words for each document from a negative binomial distribution, n(m)_· ~ NB(Σ_k b_mk φ_k, 1/2).²

The generative model for M documents is

1. for k = 1, 2, ...,
   (a) Sample the stick length π_k according to Eq. 1.
   (b) Sample the relative mass φ_k ~ Gamma(γ, 1).
   (c) Draw the topic distribution over words, β_k ~ Dirichlet(η).
2. for m = 1, ..., M,
   (a) Sample a binary vector b_m according to Eq. 1.
   (b) Draw the total number of words, n(m)_· ~ NB(Σ_k b_mk φ_k, 1/2).
   (c) Sample the distribution over topics, θ_m ~ Dirichlet(b_m · φ).
   (d) For each word w_mi, i = 1, ..., n(m)_·,
       i. Draw the topic index z_mi ~ Discrete(θ_m).
       ii. Draw the word w_mi ~ Discrete(β_{z_mi}).

We call this the focused topic model (FTM) because the infinite binary matrix B serves to focus the distribution over topics onto a finite subset (see Figure 1). The number of topics within a single document is almost surely finite, though the total number of topics is unbounded. The topic distribution for the mth document, θ_m, is drawn from a Dirichlet distribution over the topics selected by b_m. The Dirichlet distribution models uncertainty about topic proportions while maintaining the restriction to a sparse set of topics.

The ICD models the distribution over the global topic proportion parameters φ separately from the distribution over the binary matrix B. This captures the idea that a topic may appear infrequently in a corpus, but make up a high proportion of those documents in which it occurs. Conversely, a topic may appear frequently in a corpus, but only with low proportion.

² Notation: n(m)_k is the number of words assigned to the kth topic of the mth document, and we use a dot notation to represent summation, i.e., n(m)
· =P
k n(m)k .
Figure 1. Graphical model for the focused topic model
3. Related ModelsTitsias (2007) introduced the infinite gamma-Poisson pro-cess, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic modelof images. In this model, the distribution over featuresfor the mth image is given by a Dirichlet distribution overthe non-negative elements of the mth row of the infinitegamma-Poisson process matrix, with parameters propor-tional to the values at these elements. While this results ina sparse matrix of distributions, the number of zero entriesin any column of the matrix is correlated with the valuesof the non-zero entries. Columns which have entries withlarge values will not typically be sparse. Therefore, thismodel will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zeroentries is controlled by a separate process, the IBP, fromthe values of the non-zero entries, which are controlled bythe gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009)uses a finite spike and slab model to ensure that each topicis represented by a sparse distribution over words. Thespikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from asymmetric Dirichlet distribution defined over these spikes.The ICD also uses a spike and slab approach, but allowsan unbounded number of “spikes” (due to the IBP) and amore globally informative “slab” (due to the shared gammarandom variables). We extend the SparseTM’s approxima-tion of the expectation of a finite mixture of Dirichlet dis-tributions, to approximate the more complicated mixture ofDirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBPto select subsets of an infinite set of states, to model multi-ple dynamic systems with shared states. (A state in the dy-namic system is like a component in a mixed membershipmodel.) The probability of transitioning from the ith stateto the jth state in the mth dynamic system is drawn from aDirichlet distribution with parameters bmj� + ��i,j , where
Chang, Blei
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Blei and McAuliffe 2007), ensures that the same latent topic assignments used to generate the content of the documents also generate their link structure. Models which do not enforce this coupling, such as Nallapati et al. (2008), might divide the topics into two independent subsets: one for links and the other for words. Such a decomposition prevents these models from making meaningful predictions about links given words and words given links. In Section 4 we demonstrate empirically that the RTM outperforms such models on these tasks.

3 INFERENCE, ESTIMATION, AND PREDICTION

With the model defined, we turn to approximate posterior inference, parameter estimation, and prediction. We develop a variational inference procedure for approximating the posterior. We use this procedure in a variational expectation-maximization (EM) algorithm for parameter estimation. Finally, we show how a model whose parameters have been estimated can be used as a predictive model of words and links.

Inference In posterior inference, we seek to compute the posterior distribution of the latent variables conditioned on the observations. Exact posterior inference is intractable (Blei et al. 2003; Blei and McAuliffe 2007). We appeal to variational methods.

In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully-factorized family,
q(Θ, Z | γ, Φ) = ∏_d [ q_θ(θ_d | γ_d) ∏_n q_z(z_{d,n} | φ_{d,n}) ],   (3)

where γ is a set of Dirichlet parameters, one for each document, and Φ is a set of multinomial parameters, one for each word in each document. Note that E_q[z_{d,n}] = φ_{d,n}.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = Σ_{(d1,d2)} E_q[log p(y_{d1,d2} | z_{d1}, z_{d2}, η, ν)]
  + Σ_d Σ_n E_q[log p(w_{d,n} | β_{1:K}, z_{d,n})]
  + Σ_d Σ_n E_q[log p(z_{d,n} | θ_d)]
  + Σ_d E_q[log p(θ_d | α)] + H(q),   (4)

where (d1, d2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_{d1,d2} is either 1 or unobserved).1 We do this for two reasons.

First, while one can fix y_{d1,d2} = 1 whenever a link is observed between d1 and d2 and set y_{d1,d2} = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_{d1,d2} = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other's existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.

Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model they can be removed when-
1Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words' topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document, and the node's parent's successor weights. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence "Some phrases laid in his mind for years." The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition "of." Thematically, because it is in a travel brochure, we would expect to see words such as "Acapulco," "Costa Rica," or "Australia" more than "kitchen," "debt," or "pocket." Our model can capture these kinds of regularities and exploit them in predictive problems.

Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as "fly," which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.

Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.

The remainder of this paper is organized as follows. We describe the syntactic topic model, and develop an approximate posterior inference technique based on variational methods. We study its performance both on synthetic data and hand-parsed data [10]. We show that the STM captures relationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence "In the near future, you could find yourself in .". The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.

The STM attempts to capture these joint influences on words. It models a document corpus as exchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.

We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.

The rest of the paper is organized as follows. In section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models

In a time-stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.

More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
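To make the parenthetical concrete: if the natural parameters are the logs of the multinomial probabilities, then exponentiating and renormalizing (a softmax) maps the unconstrained parameters back to a probability vector. A minimal sketch:

```python
import math

def softmax(beta):
    """Map natural parameters (log scale) to multinomial probabilities."""
    m = max(beta)                       # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in beta]
    s = sum(exps)
    return [e / s for e in exps]

probs = [0.5, 0.3, 0.2]
beta = [math.log(p) for p in probs]     # natural parameters: logs of probabilities
recovered = softmax(beta)
assert all(abs(p - q) < 1e-9 for p, q in zip(probs, recovered))
```

This mapping is why Gaussian dynamics on the natural parameters (as in the state space model above) induce a logistic normal distribution over the topic's word probabilities.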
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters β_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.
A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.

Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.

In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and Δ_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth
(1 ≤ k ≤ K) topic's parameter at term w is:

β_{0,k,w} ∼ N(m, v_0)
β_{j,k,w} | β_{i,k,w}, s ∼ N(β_{i,k,w}, v Δ_{s_j,s_i}),   (1)

where the variance increases linearly with the lag.

This construction is used as a component in the full generative process. (Note: if j = i + 1, we write Δ_{s_j,s_i} as Δ_{s_j} for short.)

1. For each topic k, 1 ≤ k ≤ K,

(a) Draw β_{0,k} ∼ N(m, v_0 I).
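A minimal simulation of Eq. 1 for a single topic and term, with illustrative values for m, v_0, and v: each increment is Gaussian with variance proportional to the gap between consecutive time stamps, so closely spaced documents get tightly tied parameters.

```python
import random

random.seed(0)
m, v0, v = 0.0, 1.0, 0.1            # prior mean/variance and diffusion rate (illustrative)
stamps = [0.0, 0.5, 0.7, 2.0, 2.1]  # irregularly spaced document time stamps

# beta[j] ~ N(beta[j-1], v * (stamps[j] - stamps[j-1])): Brownian motion per Eq. 1
beta = [random.gauss(m, v0 ** 0.5)]
for i in range(1, len(stamps)):
    delta = stamps[i] - stamps[i - 1]                 # elapsed time between stamps
    beta.append(random.gauss(beta[-1], (v * delta) ** 0.5))

assert len(beta) == len(stamps)
```

Note how the construction never instantiates parameters between observed stamps, which is exactly the memory saving over a finely discretized dDTM.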
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.

Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics: hotels in France, New York hotels, youth hostels, or, similarly, when applied to a collection of Mp3 players' reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even in this case there is a danger that the model will be overwhelmed by very fine-grained global topics, or that the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.

One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific to the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: ". . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50". It can be viewed as a mixture of the topic London shared by the entire review (words: "London", "tube", "£"), and the ratable aspect location, specific to the local context of the sentence (words: "transport", "walk", "bus"). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle, though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics θ^loc_{d,v} and a distribution defining preference for local topics versus global topics π_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution ψ_s. Importantly, the fact that the windows overlap permits exploiting a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introduction of a symmetric Dirichlet prior Dir(γ) for the distribution ψ_s permits control of the smoothness of topic transitions in our model.

The formal definition of the model, with K_gl global and K_loc local topics, is the following. First, draw K_gl word distributions for global topics φ^gl_z from a Dirichlet prior Dir(β^gl) and K_loc word distributions for local topics φ^loc_{z'} from Dir(β^loc). Then, for each document d:

• Choose a distribution of global topics θ^gl_d ∼ Dir(α^gl).

• For each sentence s choose a distribution ψ_{d,s}(v) ∼ Dir(γ).
• For each sliding window v
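The sliding-window machinery can be sketched in a few lines. The helper below (hypothetical, not the authors' code) enumerates the windows covering a sentence, showing how overlapping windows let adjacent sentences share a co-occurrence domain; the final two draws stand in for the paper's window-selection distribution and local-versus-global mixture choice.

```python
import random

random.seed(1)
T = 3                # window length in sentences (illustrative)
num_sentences = 7

def covering_windows(s, num_sentences, T):
    """Window v covers sentences v .. v+T-1, so sentence s is covered by
    windows v in [s-T+1, s], clipped to the valid range."""
    return list(range(max(0, s - T + 1), min(s, num_sentences - T) + 1))

s = 4
windows = covering_windows(s, num_sentences, T)
assert all(v <= s <= v + T - 1 for v in windows)   # every window indeed covers s

# To generate a word in sentence s: pick a covering window, then local vs global.
v = random.choice(windows)          # stand-in for the window-selection draw
is_local = random.random() < 0.5    # stand-in for the local/global mixture draw
```

Because consecutive sentences share most of their covering windows, a local topic active in one window is naturally smoothed across neighboring sentences, which is the effect the Dirichlet prior on the window choice controls.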
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
[Figure 1 shows four graphical models: Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]; the Author Model (Multi-label Mixture Model) [McCallum 1999]; the Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]; and the Author-Recipient-Topic Model (ART) [this paper].]
Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, φ_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, θ, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, θ, and authors are sampled uniformly from the observed list of the document's authors. In the Author-Recipient-Topic model, there is a separate topic distribution for each author-recipient pair, and the selection of topic distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution θ_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution φ_z. However, as described previously, none of these models is suitable for modeling message data.
An email message has one sender and, in general, more than one recipient. We could treat both the sender and the recipients as "authors" of the message, and then employ the AT model, but this does not distinguish the author and the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable since they do not reflect our expertise or roles.

Alternatively we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we lose all information about the recipients, and the connections between people implied by the sender-recipient relationships.
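The distinction the ART model draws can be seen in a toy generative sketch: the topic distribution is indexed by the (author, recipient) pair, so the same author writing to different recipients produces differently themed messages. All names and distributions below are illustrative, not estimated values.

```python
import random

random.seed(2)
vocab = ["budget", "meeting", "game", "score"]
topics = {0: [0.4, 0.4, 0.1, 0.1],     # toy topic-word multinomials (phi_z)
          1: [0.1, 0.1, 0.4, 0.4]}
# toy per author-recipient topic distributions (theta for each (a, r) pair)
theta = {("manager", "secretary"): [0.9, 0.1],
         ("manager", "friend"):    [0.2, 0.8]}

def generate_message(author, recipients, n_words):
    """Sample one message: uniform recipient, then topic and word per token."""
    r = random.choice(recipients)                   # recipient drawn uniformly
    words = []
    for _ in range(n_words):
        z = random.choices([0, 1], weights=theta[(author, r)])[0]
        words.append(random.choices(vocab, weights=topics[z])[0])
    return r, words

r, words = generate_message("manager", ["secretary", "friend"], 5)
assert len(words) == 5 and all(w in vocab for w in words)
```

Conditioning on the pair rather than the author alone is what lets the model separate, e.g., manager-to-secretary requests from the same manager's personal mail.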
What makes topic models useful?

• It is a building block of many models.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation, similarity, visualization, summarization, and other applications.
© Eric Xing @ CMU, 2005-2013 37
Extending LDA

by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.
The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, the recipient, and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.

The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network, topics which models that only examine words are unable to detect.

We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters' patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.

In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_{st} for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V^(b) is generated, where each cell V^(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL    DESCRIPTION
g_{it}    entity i's group assignment in topic t
t_b       topic of an event b
w^(b)_k   the kth token in the event b
V^(b)_ij  whether entity i's and j's groups behaved the same (1) or differently (2) on the event b
S         number of entities
T         number of topics
G         number of groups
B         number of events
V         number of unique words
N_b       number of word tokens in the event b
S_b       number of entities who participated in the event b
Table 1: Notation used in this paper
 
Figure 1: The Group-Topic model
same or not on a bill). The elements of V are sampled from a binomial distribution with parameter γ^(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.
Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities' groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).

When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows:
t_b ∼ Uniform(1/T)
w_{it} | φ_t ∼ Multinomial(φ_t)
φ_t | η ∼ Dirichlet(η)
g_{it} | θ_t ∼ Multinomial(θ_t)
θ_t | α ∼ Dirichlet(α)
V^(b)_{ij} | γ^(b)_{g_i g_j} ∼ Binomial(γ^(b)_{g_i g_j})
γ^(b)_{gh} | δ ∼ Beta(δ).
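As a sanity check of the generative story, here is a toy end-to-end sample of one event. The Greek parameter names and hyperparameter values are illustrative (the originals are unreadable in this copy), and the binomial over V entries is reduced to a single Bernoulli draw coded as 1 (same) or 2 (different).

```python
import random

random.seed(3)
T, G, S = 2, 2, 4                    # topics, groups, entities (toy sizes)
vocab = ["tax", "trade", "arms", "peace"]

def dirichlet(alpha, k):
    """Sample a k-dimensional symmetric Dirichlet(alpha) via Gamma draws."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    return [x / sum(g) for x in g]

phi = [dirichlet(0.5, len(vocab)) for _ in range(T)]   # per-topic word dists
theta = [dirichlet(1.0, G) for _ in range(T)]          # per-topic group dists
gamma = [[random.betavariate(1, 1) for _ in range(G)] for _ in range(G)]

t = random.randrange(T)                                # topic of the event
words = random.choices(vocab, weights=phi[t], k=6)     # event's word attributes
g = [random.choices(range(G), weights=theta[t])[0] for _ in range(S)]
# V[i][j] = 1 if entities i and j's groups behaved the same, 2 otherwise
V = [[1 if random.random() < gamma[g[i]][g[j]] else 2 for j in range(S)]
     for i in range(S)]

assert all(V[i][j] in (1, 2) for i in range(S) for j in range(S))
```

The key coupling is visible in the code: the group assignments g depend on the sampled topic t, so word usage and voting blocs are explained jointly rather than separately.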
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and the gamma random variables φ determine the relative masses associated with these atoms.
2.4. Focused Topic Models
Suppose H parametrizes distributions over words. Then, the ICD defines a generative topic model, where it is used to generate a set of sparse distributions over an infinite number of components, called "topics." Each topic is drawn from a Dirichlet distribution over words. In order to specify a fully generative model, we sample the number of words for each document from a negative binomial distribution, n^(m)_· ∼ NB(Σ_k b_{mk} φ_k, 1/2).2
The generative model for M documents is

1. for k = 1, 2, . . . ,
(a) Sample the stick length π_k according to Eq. 1.
(b) Sample the relative mass φ_k ∼ Gamma(γ, 1).
(c) Draw the topic distribution over words, β_k ∼ Dirichlet(η).

2. for m = 1, . . . , M,
(a) Sample a binary vector b_m according to Eq. 1.
(b) Draw the total number of words, n^(m)_· ∼ NB(Σ_k b_{mk} φ_k, 1/2).
(c) Sample the distribution over topics, θ_m ∼ Dirichlet(b_m · φ).
(d) For each word w_mi, i = 1, . . . , n^(m)_·,
    i. Draw the topic index z_mi ∼ Discrete(θ_m).
    ii. Draw the word w_mi ∼ Discrete(β_{z_mi}).
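A toy simulation of the per-document part of this process, with the IBP truncated to K topics and fixed illustrative stick lengths, and the negative binomial NB(r, 1/2) drawn through its Gamma-Poisson mixture (λ ∼ Gamma(r, 1), then n ∼ Poisson(λ)). Symbol names are standard choices, not necessarily the paper's.

```python
import math
import random

random.seed(4)
K, M = 5, 3                          # truncated topic count, documents (toy)

def dirichlet(alphas):
    """Dirichlet via normalized Gamma draws; zero alphas give zero mass."""
    g = [random.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def poisson(lam):
    """Knuth's method; adequate for the small rates used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

mass = [random.gammavariate(1.0, 1.0) for _ in range(K)]  # relative masses
pi = [0.8 ** (k + 1) for k in range(K)]                   # toy IBP stick lengths

for m in range(M):
    b = [1 if random.random() < pi[k] else 0 for k in range(K)]  # focused topics
    if sum(b) == 0:
        b[0] = 1                                 # ensure at least one topic
    r = sum(bk * mk for bk, mk in zip(b, mass))
    n_words = poisson(random.gammavariate(r, 1.0))   # NB(r, 1/2) via Gamma-Poisson
    theta = dirichlet([float(bk) for bk in b])       # Dirichlet over selected topics
    assert all(t == 0.0 for t, bk in zip(theta, b) if bk == 0)
```

The final assertion makes the "focusing" visible: topics switched off by b_m receive exactly zero proportion, while the shared masses still shape document length through r.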
We call this the focused topic model (FTM) because the infinite binary matrix B serves to focus the distribution over topics onto a finite subset (see Figure 1). The number of topics within a single document is almost surely finite, though the total number of topics is unbounded. The topic distribution for the mth document, θ_m, is drawn from a Dirichlet distribution over the topics selected by b_m. The Dirichlet distribution models uncertainty about topic proportions while maintaining the restriction to a sparse set of topics.

The ICD models the distribution over the global topic proportion parameters φ separately from the distribution over the binary matrix B. This captures the idea that a topic may appear infrequently in a corpus, but make up a high proportion of those documents in which it occurs. Conversely, a topic may appear frequently in a corpus, but only with low proportion.
²Notation: n^(m)_k is the number of words assigned to the kth topic of the mth document, and we use a dot to represent summation, i.e., n^(m)_· = Σ_k n^(m)_k.
Figure 1. Graphical model for the focused topic model
3. Related Models
Titsias (2007) introduced the infinite gamma-Poisson process, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic model of images. In this model, the distribution over features for the mth image is given by a Dirichlet distribution over the non-negative elements of the mth row of the infinite gamma-Poisson process matrix, with parameters proportional to the values at these elements. While this results in a sparse matrix of distributions, the number of zero entries in any column of the matrix is correlated with the values of the non-zero entries. Columns which have entries with large values will not typically be sparse. Therefore, this model will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zero entries is controlled by a separate process, the IBP, from the values of the non-zero entries, which are controlled by the gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009) uses a finite spike and slab model to ensure that each topic is represented by a sparse distribution over words. The spikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from a symmetric Dirichlet distribution defined over these spikes. The ICD also uses a spike and slab approach, but allows an unbounded number of “spikes” (due to the IBP) and a more globally informative “slab” (due to the shared gamma random variables). We extend the SparseTM’s approximation of the expectation of a finite mixture of Dirichlet distributions to approximate the more complicated mixture of Dirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBP to select subsets of an infinite set of states, to model multiple dynamic systems with shared states. (A state in the dynamic system is like a component in a mixed membership model.) The probability of transitioning from the ith state to the jth state in the mth dynamic system is drawn from a Dirichlet distribution with parameters b_mj γ + κ δ_i,j, where
Chang, Blei
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Blei and McAuliffe 2007), ensures that the same latent topic assignments used to generate the content of the documents also generate their link structure. Models which do not enforce this coupling, such as Nallapati et al. (2008), might divide the topics into two independent subsets: one for links and the other for words. Such a decomposition prevents these models from making meaningful predictions about links given words and words given links. In Section 4 we demonstrate empirically that the RTM outperforms such models on these tasks.
3 INFERENCE, ESTIMATION, AND PREDICTION
With the model defined, we turn to approximate posterior inference, parameter estimation, and prediction. We develop a variational inference procedure for approximating the posterior. We use this procedure in a variational expectation-maximization (EM) algorithm for parameter estimation. Finally, we show how a model whose parameters have been estimated can be used as a predictive model of words and links.
Inference   In posterior inference, we seek to compute the posterior distribution of the latent variables conditioned on the observations. Exact posterior inference is intractable (Blei et al. 2003; Blei and McAuliffe 2007). We appeal to variational methods.
In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully-factorized family,

q(Θ, Z | γ, Φ) = Π_d [ q_θ(θ_d | γ_d) Π_n q_z(z_d,n | φ_d,n) ],   (3)

where γ is a set of Dirichlet parameters, one for each document, and Φ is a set of multinomial parameters, one for each word in each document. Note that E_q[z_d,n] = φ_d,n.
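Under this fully-factorized family, the expectations needed for variational inference have closed forms. A small sketch (variable names are illustrative; `scipy` supplies the digamma function):

```python
import numpy as np
from scipy.special import digamma

def expected_log_theta(gamma_d):
    """E_q[log theta_d] when q(theta_d) = Dirichlet(gamma_d)."""
    return digamma(gamma_d) - digamma(gamma_d.sum())

gamma_d = np.array([2.0, 1.0, 3.0])   # variational Dirichlet parameters for one document
phi_dn = np.array([0.2, 0.3, 0.5])    # variational multinomial for one word; E_q[z_dn] = phi_dn
# E_q[log p(z_dn | theta_d)] = phi_dn . E_q[log theta_d]:
term = float(phi_dn @ expected_log_theta(gamma_d))
```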
Minimizing the relative entropy is equivalent to maximizing Jensen’s lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = Σ_(d1,d2) E_q[log p(y_d1,d2 | z_d1, z_d2, η, ν)]
  + Σ_d Σ_n E_q[log p(w_d,n | β_1:K, z_d,n)]
  + Σ_d Σ_n E_q[log p(z_d,n | θ_d)]
  + Σ_d E_q[log p(θ_d | α)] + H(q),   (4)
where (d1, d2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
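The first term involves the link likelihood p(y_d1,d2 | z_d1, z_d2, ...). This excerpt does not spell that likelihood out, so the following is only a plausible sketch, assuming a sigmoid of a weighted inner product of the two documents' mean topic assignments; the names `eta`, `nu`, and `zbar` are assumptions here, not taken from the excerpt:

```python
import numpy as np

def link_prob(zbar_d1, zbar_d2, eta, nu):
    """P(y = 1) as a sigmoid of a weighted inner product of mean topic assignments."""
    s = float(eta @ (zbar_d1 * zbar_d2) + nu)
    return 1.0 / (1.0 + np.exp(-s))

# Two documents with similar topic proportions yield a higher link probability:
eta, nu = np.ones(3), -1.0
p_similar = link_prob(np.array([0.8, 0.1, 0.1]), np.array([0.7, 0.2, 0.1]), eta, nu)
p_dissimilar = link_prob(np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8]), eta, nu)
```

A link function of this shape is what couples the link structure to the same topic assignments that generate the words.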
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_d1,d2 is either 1 or unobserved).¹ We do this for two reasons.
First, while one can fix y_d1,d2 = 1 whenever a link is observed between d1 and d2 and set y_d1,d2 = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_d1,d2 = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other’s existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.
Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model they can be removed when-
¹Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
[Figure 1: (a) Overall Graphical Model, with parse trees grouped into M documents; (b) Sentence Graphical Model, showing the automatic parse of “Some phrases laid in his mind for years.”]
Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words’ topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document θ, and the node’s parent’s successor weights π. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence “Some phrases laid in his mind for years.” The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition “of.” Thematically, because it is in a travel brochure, we would expect to see words such as “Acapulco,” “Costa Rica,” or “Australia” more than “kitchen,” “debt,” or “pocket.” Our model can capture these kinds of regularities and exploit them in predictive problems.
Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as “fly,” which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.
Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.
The remainder of this paper is organized as follows. We describe the syntactic topic model, and develop an approximate posterior inference technique based on variational methods. We study its performance both on synthetic data and hand-parsed data [10]. We show that the STM captures relationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence “In the near future, you could find yourself in .” The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.
The STM attempts to capture these joint influences on words. It models a document corpus as exchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.
We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.
The rest of the paper is organized as follows. In Section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In Section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models
In a time-stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.
More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
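A minimal illustration of that parameterization: the natural parameters live in an unconstrained space, and a softmax map recovers the word probabilities (the helper name is ours, not the paper's):

```python
import numpy as np

def to_word_probs(beta_k):
    """Map a topic's natural parameters (log-scale) to a multinomial over the vocabulary."""
    e = np.exp(beta_k - beta_k.max())   # subtract the max for numerical stability
    return e / e.sum()

beta_k = np.array([0.5, -1.0, 2.0, 0.0])   # unconstrained natural parameters
p = to_word_probs(beta_k)                   # a proper distribution over 4 words
```

Working in the unconstrained space is what allows Gaussian (state space or Brownian) dynamics to be placed on the topics.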
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters β_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.
A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.
Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.
In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and Δ_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth (1 ≤ k ≤ K) topic’s parameter at term w is:

β_{0,k,w} ∼ N(m, v_0)
β_{j,k,w} | β_{i,k,w}, s ∼ N(β_{i,k,w}, v Δ_{s_j,s_i}),   (1)

where the variance increases linearly with the lag.

This construction is used as a component in the full generative process. (Note: if j = i + 1, we write Δ_{s_j,s_i} as Δ_{s_j} for short.)

1. For each topic k, 1 ≤ k ≤ K,

   (a) Draw β_{0,k} ∼ N(m, v_0 I).
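Eq. 1 can be simulated directly: each topic parameter performs a Gaussian random walk whose step variance scales with the elapsed time between consecutive document time stamps. All values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 10                       # vocabulary size (illustrative)
m, v0, v = 0.0, 1.0, 0.05    # prior mean, initial variance, diffusion variance per unit time
stamps = np.array([0.0, 1.0, 1.5, 4.0])   # irregularly spaced document time stamps

beta = [rng.normal(m, np.sqrt(v0), size=V)]              # beta_0 ~ N(m, v0)
for j in range(1, len(stamps)):
    delta = stamps[j] - stamps[j - 1]                    # elapsed time Delta
    beta.append(rng.normal(beta[-1], np.sqrt(v * delta)))  # N(previous, v * Delta)
beta = np.array(beta)        # shape (num_stamps, V): one parameter vector per time stamp
```

Because the step variance is v·Δ, long gaps between documents simply produce larger diffusion, with no need to represent the intermediate ticks that the dDTM would require.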
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In Section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.
Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics such as hotels in France, New York hotels, or youth hostels; similarly, when applied to a collection of Mp3 players’ reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even then there is a danger that the model will be overwhelmed by very fine-grained global topics, or that the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.
One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram
models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific to the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: “. . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50”. It can be viewed as a mixture of the topic London shared by the entire review (words: “London”, “tube”, “£”), and the ratable aspect location, specific to the local context of the sentence (words: “transport”, “walk”, “bus”). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle, though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics θ^loc_{d,v} and a distribution defining preference for local topics versus global topics π_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution ψ_s. Importantly, the fact that the windows overlap makes it possible to exploit a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introducing a symmetric Dirichlet prior Dir(γ) for the distribution ψ_s makes it possible to control the smoothness of topic transitions in our model.
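The window bookkeeping is easy to make concrete. A hypothetical helper (not the authors' code) that lists the sliding windows of T adjacent sentences covering a given sentence:

```python
def windows_covering(s, n_sentences, T):
    """Indices of the length-T sliding windows that contain sentence s.

    Window i covers sentences i, i+1, ..., i+T-1.
    """
    first = max(0, s - T + 1)
    last = min(s, n_sentences - T)
    return list(range(first, last + 1))

# With 5 sentences and T = 3, sentence 2 is covered by windows 0, 1, and 2,
# so a word in that sentence may be generated via any of those three windows.
covering = windows_covering(2, 5, 3)
```

Because adjacent windows share sentences, co-occurrence information flows across window boundaries, which is what lets MG-LDA capture local topics without explicit topic-transition machinery.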
The formal definition of the model with K_gl global and K_loc local topics is the following. First, draw K_gl word distributions for global topics φ^gl_z from a Dirichlet prior Dir(β^gl) and K_loc word distributions for local topics φ^loc_{z′} from Dir(β^loc). Then, for each document d:

• Choose a distribution of global topics θ^gl_d ∼ Dir(α^gl).
• For each sentence s choose a distribution ψ_{d,s}(v) ∼ Dir(γ).
• For each sliding window v
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
[Figure 1 panels: Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]; Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]; Author-Recipient-Topic Model (ART) [This paper]; Author Model (Multi-label Mixture Model) [McCallum 1999]]
Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, φ_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, θ, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, θ, and authors are sampled uniformly from the observed list of the document’s authors. In the Author-Recipient-Topic model, there is a separate topic distribution for each author-recipient pair, and the selection of topic distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution θ_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution φ_z. However, as described previously, none of these models is suitable for modeling message data.
An email message has one sender and in general more than one recipient. We could treat both the sender and the recipients as “authors” of the message, and then employ the AT model, but this does not distinguish the author from the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable, since they do not reflect our expertise or roles.
Alternatively, we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we lose all information about the recipients, and the connections between people implied by the sender-recipient relationships.
What makes topic models useful?

• It is a building block of many models.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation, similarity, visualization, summarization, and other applications.

© Eric Xing @ CMU, 2005-2013 37
Extending LDA

by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.
The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, recipient and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.
The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network: topics which the models that only examine words are unable to detect.
We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters’ patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.
In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_st for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V^(b) is generated, where each cell V^(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL      DESCRIPTION
g_it        entity i’s group assignment in topic t
t_b         topic of an event b
w^(b)_k     the kth token in the event b
V^(b)_ij    whether entity i’s and j’s groups behaved the same (1) or differently (2) on the event b
S           number of entities
T           number of topics
G           number of groups
B           number of events
V           number of unique words
N_b         number of word tokens in the event b
S_b         number of entities who participated in the event b

Table 1: Notation used in this paper
Figure 1: The Group-Topic model
same or not on a bill). The elements of V are sampled from a binomial distribution with parameter γ^(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.
Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities’ groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).
When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows,
t_b ∼ Uniform(1/T)
w_it | φ_t ∼ Multinomial(φ_t)
φ_t | β ∼ Dirichlet(β)
g_it | θ_t ∼ Multinomial(θ_t)
θ_t | α ∼ Dirichlet(α)
V^(b)_ij | γ^(b)_{g_i g_j} ∼ Binomial(γ^(b)_{g_i g_j})
γ^(b)_gh | δ ∼ Beta(δ).
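This generative process can be simulated end to end as a toy example. The Greek-letter names and all sizes below are illustrative reconstructions of the garbled symbols in this excerpt, not a verbatim transcription of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
T, G, S, V, B, Nb = 3, 2, 10, 20, 4, 8   # topics, groups, entities, vocab, events, tokens/event

phi = rng.dirichlet(0.5 * np.ones(V), size=T)    # per-topic word distributions
theta = rng.dirichlet(1.0 * np.ones(G), size=T)  # per-topic group distributions
events = []
for b_idx in range(B):
    t = rng.integers(T)                           # event topic, uniform over topics
    words = rng.choice(V, size=Nb, p=phi[t])      # word attributes of the event
    g = rng.choice(G, size=S, p=theta[t])         # group assignment for each entity
    gamma = rng.beta(1.0, 1.0, size=(G, G))       # per-pair agreement probabilities
    # V^(b)_ij: 1 if entities i and j behaved the same on this event, else 0
    # (the paper codes same/different as 1/2; this sketch uses 1/0)
    agree = rng.binomial(1, gamma[g[:, None], g[None, :]])
    events.append((t, words, g, agree))
```

Note how the group assignments g depend on the event topic t: this is the coupling that makes the discovered coalitions topic-specific.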
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and thegamma random variables � determine the relative massesassociated with these atoms.
2.4. Focused Topic Models
Suppose H parametrizes distributions over words. Then,the ICD defines a generative topic model, where it is usedto generate a set of sparse distributions over an infinite num-ber of components, called “topics.” Each topic is drawnfrom a Dirichlet distribution over words. In order to specifya fully generative model, we sample the number of wordsfor each document from a negative binomial distribution,n(m)
· � NB(�
k bmk�k, 1/2).2
The generative model for M documents is
1. for k = 1, 2, . . . ,
(a) Sample the stick length �k according to Eq. 1.(b) Sample the relative mass �k � Gamma(�, 1).(c) Draw the topic distribution over words,
�k � Dirichlet(�).
2. for m = 1, . . . , M ,
(a) Sample a binary vector bm according to Eq. 1.(b) Draw the total number of words,
n(m)· � NB(
�k bmk�k, 1/2).
(c) Sample the distribution over topics,�m � Dirichlet(bm · �).
(d) For each word wmi, i = 1, . . . , n(m)· ,
i. Draw the topic index zmi � Discrete(�m).ii. Draw the word wmi � Discrete(�zmi
).
We call this the focused topic model (FTM) because theinfinite binary matrix B serves to focus the distributionover topics onto a finite subset (see Figure 1). The numberof topics within a single document is almost surely finite,though the total number of topics is unbounded. The topicdistribution for the mth document, �m, is drawn from aDirichlet distribution over the topics selected by bm. TheDirichlet distribution models uncertainty about topic pro-portions while maintaining the restriction to a sparse set oftopics.
The ICD models the distribution over the global topic pro-portion parameters � separately from the distribution overthe binary matrix B. This captures the idea that a topic mayappear infrequently in a corpus, but make up a high propor-tion of those documents in which it occurs. Conversely, atopic may appear frequently in a corpus, but only with lowproportion.
2Notation n(m)k is the number of words assigned to the kth
topic of the mth document, and we use a dot notation to representsummation - i.e. n(m)
· =P
k n(m)k .
Figure 1. Graphical model for the focused topic model
3. Related ModelsTitsias (2007) introduced the infinite gamma-Poisson pro-cess, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic modelof images. In this model, the distribution over featuresfor the mth image is given by a Dirichlet distribution overthe non-negative elements of the mth row of the infinitegamma-Poisson process matrix, with parameters propor-tional to the values at these elements. While this results ina sparse matrix of distributions, the number of zero entriesin any column of the matrix is correlated with the valuesof the non-zero entries. Columns which have entries withlarge values will not typically be sparse. Therefore, thismodel will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zeroentries is controlled by a separate process, the IBP, fromthe values of the non-zero entries, which are controlled bythe gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009)uses a finite spike and slab model to ensure that each topicis represented by a sparse distribution over words. Thespikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from asymmetric Dirichlet distribution defined over these spikes.The ICD also uses a spike and slab approach, but allowsan unbounded number of “spikes” (due to the IBP) and amore globally informative “slab” (due to the shared gammarandom variables). We extend the SparseTM’s approxima-tion of the expectation of a finite mixture of Dirichlet dis-tributions, to approximate the more complicated mixture ofDirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBP to select subsets of an infinite set of states, to model multiple dynamic systems with shared states. (A state in the dynamic system is like a component in a mixed membership model.) The probability of transitioning from the ith state to the jth state in the mth dynamic system is drawn from a Dirichlet distribution with parameters b_mj γ + κ δ_{i,j}, where
Chang, Blei
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Blei and McAuliffe 2007), ensures that the same latent topic assignments used to generate the content of the documents also generate their link structure. Models which do not enforce this coupling, such as Nallapati et al. (2008), might divide the topics into two independent subsets, one for links and the other for words. Such a decomposition prevents these models from making meaningful predictions about links given words and words given links. In Section 4 we demonstrate empirically that the RTM outperforms such models on these tasks.
3 INFERENCE, ESTIMATION, AND PREDICTION

With the model defined, we turn to approximate posterior inference, parameter estimation, and prediction. We develop a variational inference procedure for approximating the posterior. We use this procedure in a variational expectation-maximization (EM) algorithm for parameter estimation. Finally, we show how a model whose parameters have been estimated can be used as a predictive model of words and links.

Inference In posterior inference, we seek to compute the posterior distribution of the latent variables conditioned on the observations. Exact posterior inference is intractable (Blei et al. 2003; Blei and McAuliffe 2007). We appeal to variational methods.
In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully factorized family,

q(Θ, Z | γ, Φ) = Π_d [ q_θ(θ_d | γ_d) Π_n q_z(z_{d,n} | φ_{d,n}) ],   (3)

where γ is a set of Dirichlet parameters, one for each document, and Φ is a set of multinomial parameters, one for each word in each document. Note that E_q[z_{d,n}] = φ_{d,n}.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = Σ_{(d1,d2)} E_q[log p(y_{d1,d2} | z_{d1}, z_{d2}, η, ν)]
  + Σ_d Σ_n E_q[log p(w_{d,n} | β_{1:K}, z_{d,n})]
  + Σ_d Σ_n E_q[log p(z_{d,n} | θ_d)]
  + Σ_d E_q[log p(θ_d | α)] + H(q),   (4)

where (d1, d2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
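To make the factorized family concrete, here is a minimal numerical sketch (hypothetical variational parameter values, NumPy only, not code from the paper) of the identity E_q[z_{d,n}] = φ_{d,n} and of evaluating the topic-assignment term of the ELBO by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N_d = 4, 6    # number of topics and words in one document (toy sizes)

# Hypothetical variational parameters for a single document:
gamma_d = rng.uniform(0.5, 2.0, size=K)       # Dirichlet parameters of q(theta_d)
phi_d = rng.dirichlet(np.ones(K), size=N_d)   # multinomial parameters of q(z_{d,n})

# Under the fully factorized family, E_q[z_{d,n}] = phi_{d,n} directly.
E_z = phi_d

# Monte Carlo estimate of E_q[log theta_d] under Dirichlet(gamma_d):
theta_samples = rng.dirichlet(gamma_d, size=20000)
E_log_theta = np.log(theta_samples).mean(axis=0)

# The ELBO term sum_n E_q[log p(z_{d,n} | theta_d)] factorizes as phi . E_log_theta:
assignment_term = float((phi_d @ E_log_theta).sum())
```

Because each factor of q is a standard exponential-family distribution, every expectation in the bound decomposes this way into simple functions of the free parameters.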
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_{d1,d2} is either 1 or unobserved).¹ We do this for two reasons.

First, while one can fix y_{d1,d2} = 1 whenever a link is observed between d1 and d2 and set y_{d1,d2} = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_{d1,d2} = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other's existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.

Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model they can be removed when-

¹Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
[Figure 1 graphics: (a) Overall Graphical Model, with parse trees grouped into M documents; (b) Sentence Graphical Model, showing latent topics z_1, ..., z_7 generating the words of an example sentence.]
Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words' topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document θ, and the node's parent's successor weights π. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence "Some phrases laid in his mind for years." The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition "of." Thematically, because it is in a travel brochure, we would expect to see words such as "Acapulco," "Costa Rica," or "Australia" more than "kitchen," "debt," or "pocket." Our model can capture these kinds of regularities and exploit them in predictive problems.

Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as "fly," which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.

Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.
The remainder of this paper is organized as follows. We describe the syntactic topic model, anddevelop an approximate posterior inference technique based on variational methods. We study itsperformance both on synthetic data and hand parsed data [10]. We show that the STM capturesrelationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence "In the near future, you could find yourself in .". The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.
The STM attempts to capture these joint influences on words. It models a document corpus asexchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.

We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.

The rest of the paper is organized as follows. In Section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In Section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models

In a time stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.

More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
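As a small illustration of that parenthetical remark (toy numbers of our own, not from the paper), the natural parameters of a multinomial map back to the probability simplex through a softmax, and differ from the log-probabilities only by a shared constant:

```python
import numpy as np

# Toy natural parameters for a multinomial over a four-word vocabulary
# (illustrative values; in the dDTM these evolve under a state space model).
beta = np.array([1.2, -0.3, 0.0, 2.1])

# Map natural parameters back to the simplex with a (shifted) softmax:
probs = np.exp(beta - beta.max())
probs /= probs.sum()

# The natural parameters equal the log-probabilities up to one shared offset:
offset = beta - np.log(probs)
```

This is why the parameterization is convenient for time-series modeling: Gaussian noise can be added to the unconstrained natural parameters without leaving the family.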
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters β_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.

A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.

Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.
In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and Δ_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth (1 ≤ k ≤ K) topic's parameter at term w is:

β_{0,k,w} ~ N(m, v_0)
β_{j,k,w} | β_{i,k,w}, s ~ N(β_{i,k,w}, v Δ_{s_j,s_i}),   (1)

where the variance increases linearly with the lag.
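The Brownian-motion evolution is easy to simulate; the following sketch (assumed toy values for m, v_0, and v, our own illustration) draws one topic's parameters along irregularly spaced time stamps, with each step's variance proportional to the elapsed time:

```python
import numpy as np

rng = np.random.default_rng(1)
W = 5                       # vocabulary size (toy)
m, v0, v = 0.0, 1.0, 0.05   # hypothetical mean and variance hyperparameters

# beta_{0,k,w} ~ N(m, v0): initial natural parameters of one topic.
beta = rng.normal(m, np.sqrt(v0), size=W)

# Irregularly spaced document time stamps; each step's variance is v times the lag.
stamps = [0.0, 0.5, 0.7, 3.2]
trajectory = [beta]
for s_i, s_j in zip(stamps, stamps[1:]):
    delta = s_j - s_i                              # elapsed time Delta_{s_j, s_i}
    beta = rng.normal(beta, np.sqrt(v * delta))    # beta_j | beta_i ~ N(beta_i, v*Delta)
    trajectory.append(beta)
```

Note that no parameters are needed for the gaps between observed stamps, which is the source of the memory savings over the dDTM.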
This construction is used as a component in the full generative process. (Note: if j = i + 1, we write Δ_{s_j,s_i} as Δ_{s_j} for short.)

1. For each topic k, 1 ≤ k ≤ K,
   (a) Draw β_{0,k} ~ N(m, v_0 I).
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In Section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.

Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics: hotels in France, New York hotels, youth hostels, or, similarly, when applied to a collection of Mp3 players' reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even in this case there is a danger that the model will be overwhelmed by very fine-grain global topics, or the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.

One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific to the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: ". . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50". It can be viewed as a mixture of topic London shared by the entire review (words: "London", "tube", "£"), and the ratable aspect location, specific for the local context of the sentence (words: "transport", "walk", "bus"). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle, though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics θ^loc_{d,v} and a distribution defining preference for local topics versus global topics π_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution ψ_s. Importantly, the fact that the windows overlap makes it possible to exploit a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introducing a symmetric Dirichlet prior Dir(γ) for the distribution ψ_s allows controlling the smoothness of topic transitions in our model.
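A minimal sketch of the window bookkeeping described above (the helper `windows_covering` is our own illustration, not from the paper): with windows of T adjacent sentences that may overhang the document edges, every sentence is covered by exactly T overlapping windows, which is what enlarges the co-occurrence domain.

```python
def windows_covering(num_sentences, T):
    """Map each sentence index to the sliding windows (start indices) covering it."""
    cover = {s: [] for s in range(num_sentences)}
    for v in range(-(T - 1), num_sentences):          # windows may overhang the edges
        for s in range(max(v, 0), min(v + T, num_sentences)):
            cover[s].append(v)
    return cover

cover = windows_covering(num_sentences=5, T=3)
# Every sentence is covered by exactly T windows, so adjacent sentences share
# windows and their words can co-occur under the same local topic distribution.
assert all(len(vs) == 3 for vs in cover.values())
```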
The formal definition of the model with K^gl global and K^loc local topics is the following. First, draw K^gl word distributions for global topics φ^gl_z from a Dirichlet prior Dir(β^gl) and K^loc word distributions for local topics φ^loc_{z'} from Dir(β^loc). Then, for each document d:

• Choose a distribution of global topics θ^gl_d ~ Dir(α^gl).
• For each sentence s choose a distribution ψ_{d,s}(v) ~ Dir(γ).
• For each sliding window v
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
[Figure 1 graphics: plate diagrams of Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]; the Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]; the Author-Recipient-Topic Model (ART) [this paper]; and the Author Model (Multi-label Mixture Model) [McCallum 1999].]
Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, φ_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, θ, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, θ, and authors are sampled uniformly from the observed list of the document's authors. In the Author-Recipient-Topic model, there is a separate topic-distribution for each author-recipient pair, and the selection of topic-distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution θ_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution φ_z. However, as described previously, none of these models is suitable for modeling message data.

An email message has one sender and, in general, more than one recipient. We could treat both the sender and the recipients as "authors" of the message, and then employ the AT model, but this does not distinguish the author and the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable, since they do not reflect our expertise or roles.

Alternatively, we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we lose all information about the recipients, and the connections between people implied by the sender-recipient relationships.
What makes topic models useful?

• It is a building block of many models.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation, similarity, visualization, summarization, and other applications.

© Eric Xing @ CMU, 2005-2013
Extending LDA

by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.

The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, recipient, and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.

The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network, topics which the models that only examine words are unable to detect.

We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters' patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.
In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_st for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V^(b) is generated, where each cell V^(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL     DESCRIPTION
g_it       entity i's group assignment in topic t
t_b        topic of an event b
w_k^(b)    the kth token in the event b
V_ij^(b)   whether entity i's and j's groups behaved the same (1) or differently (2) on the event b
S          number of entities
T          number of topics
G          number of groups
B          number of events
V          number of unique words
N_b        number of word tokens in the event b
S_b        number of entities who participated in the event b

Table 1: Notation used in this paper
! " w v # $
t g % &
Nb S
b2
T BG2
S T
B
Figure 1: The Group-Topic model
same or not on a bill). The elements of V are sampled from a binomial distribution γ^(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.

Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities' groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).
When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows:

t_b ~ Uniform(1/T)
w_it | φ_t ~ Multinomial(φ_t)
φ_t | η ~ Dirichlet(η)
g_it | θ_t ~ Multinomial(θ_t)
θ_t | α ~ Dirichlet(α)
V_ij^(b) | γ^(b)_{g_i g_j} ~ Binomial(γ^(b)_{g_i g_j})
γ^(b)_{gh} | β ~ Beta(β).
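The conditional distributions above can be turned into a toy forward sampler. This sketch (assumed symmetric hyperparameter values and toy sizes; our own illustration, not the authors' code) draws one event's topic, its words, the per-entity group assignments, and the behavior matrix V^(b):

```python
import numpy as np

rng = np.random.default_rng(2)
T, G, S, W = 3, 2, 10, 50        # topics, groups, entities, vocabulary (toy sizes)
eta = alpha = beta_h = 1.0       # assumed symmetric hyperparameter values

phi = rng.dirichlet(np.full(W, eta), size=T)      # phi_t | eta ~ Dirichlet(eta)
theta = rng.dirichlet(np.full(G, alpha), size=T)  # theta_t | alpha ~ Dirichlet(alpha)

def sample_event(n_words=8):
    """Sample one event (e.g., a bill) from the GT generative process."""
    t = rng.integers(T)                                # t_b ~ Uniform(1/T)
    words = rng.choice(W, size=n_words, p=phi[t])      # w | phi_t ~ Multinomial(phi_t)
    groups = rng.choice(G, size=S, p=theta[t])         # g_it | theta_t ~ Multinomial(theta_t)
    gamma = rng.beta(beta_h, beta_h, size=(G, G))      # gamma^(b)_gh | beta ~ Beta(beta)
    # V^(b)_ij: whether entities i's and j's groups behaved the same on event b.
    V = rng.binomial(1, gamma[np.ix_(groups, groups)])
    return t, words, groups, V

t, words, groups, V = sample_event()
```

Note how the group assignments are resampled per topic, which is what lets coalitions differ across topics.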
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and the gamma random variables φ determine the relative masses associated with these atoms.
2.4. Focused Topic Models
Suppose H parametrizes distributions over words. Then, the ICD defines a generative topic model, where it is used to generate a set of sparse distributions over an infinite number of components, called "topics." Each topic is drawn from a Dirichlet distribution over words. In order to specify a fully generative model, we sample the number of words for each document from a negative binomial distribution, n_·^(m) ~ NB(Σ_k b_mk φ_k, 1/2).²

The generative model for M documents is

1. for k = 1, 2, . . . ,
   (a) Sample the stick length π_k according to Eq. 1.
   (b) Sample the relative mass φ_k ~ Gamma(γ, 1).
   (c) Draw the topic distribution over words, β_k ~ Dirichlet(η).
2. for m = 1, . . . , M,
   (a) Sample a binary vector b_m according to Eq. 1.
   (b) Draw the total number of words, n_·^(m) ~ NB(Σ_k b_mk φ_k, 1/2).
   (c) Sample the distribution over topics, θ_m ~ Dirichlet(b_m · φ).
   (d) For each word w_mi, i = 1, . . . , n_·^(m):
       i. Draw the topic index z_mi ~ Discrete(θ_m).
       ii. Draw the word w_mi ~ Discrete(β_{z_mi}).
We call this the focused topic model (FTM) because the infinite binary matrix B serves to focus the distribution over topics onto a finite subset (see Figure 1). The number of topics within a single document is almost surely finite, though the total number of topics is unbounded. The topic distribution for the mth document, θ_m, is drawn from a Dirichlet distribution over the topics selected by b_m. The Dirichlet distribution models uncertainty about topic proportions while maintaining the restriction to a sparse set of topics.
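A toy forward simulation of this generative process (truncated IBP stick-breaking, assumed hyperparameter values; our own sketch, not the authors' code) makes the sparsity mechanism concrete: each document draws its topic proportions only over the topics its binary vector selects.

```python
import numpy as np

rng = np.random.default_rng(3)
K, W, M = 20, 30, 4                       # IBP truncation, vocabulary, documents (toy)
alpha_ibp, gamma_h, eta = 2.0, 1.0, 0.5   # assumed hyperparameters

# Truncated stick-breaking for the IBP inclusion probabilities pi_k.
pi = np.cumprod(rng.beta(alpha_ibp, 1.0, size=K))

phi = rng.gamma(gamma_h, 1.0, size=K)             # relative masses phi_k ~ Gamma(gamma, 1)
beta = rng.dirichlet(np.full(W, eta), size=K)     # topic-word distributions beta_k

docs = []
for _ in range(M):
    b = rng.random(K) < pi                           # binary topic-selection vector b_m
    if not b.any():
        b[0] = True                                  # toy safeguard: keep at least one topic
    n = rng.negative_binomial((b * phi).sum(), 0.5)  # n^(m) ~ NB(sum_k b_mk phi_k, 1/2)
    theta = rng.dirichlet(phi[b])                    # theta_m ~ Dirichlet over selected topics
    z = rng.choice(np.flatnonzero(b), size=n, p=theta)       # topic indices z_mi
    w = np.array([rng.choice(W, p=beta[k]) for k in z])      # words w_mi ~ Discrete(beta_z)
    docs.append((b, z, w))
```

Because θ_m has support only on the selected topics, a document's word count and its set of active topics are decoupled, which is exactly the prevalence/proportion decoupling the ICD is designed for.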
The ICD models the distribution over the global topic pro-portion parameters � separately from the distribution overthe binary matrix B. This captures the idea that a topic mayappear infrequently in a corpus, but make up a high propor-tion of those documents in which it occurs. Conversely, atopic may appear frequently in a corpus, but only with lowproportion.
2Notation n(m)k is the number of words assigned to the kth
topic of the mth document, and we use a dot notation to representsummation - i.e. n(m)
· =P
k n(m)k .
Figure 1. Graphical model for the focused topic model
3. Related ModelsTitsias (2007) introduced the infinite gamma-Poisson pro-cess, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic modelof images. In this model, the distribution over featuresfor the mth image is given by a Dirichlet distribution overthe non-negative elements of the mth row of the infinitegamma-Poisson process matrix, with parameters propor-tional to the values at these elements. While this results ina sparse matrix of distributions, the number of zero entriesin any column of the matrix is correlated with the valuesof the non-zero entries. Columns which have entries withlarge values will not typically be sparse. Therefore, thismodel will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zeroentries is controlled by a separate process, the IBP, fromthe values of the non-zero entries, which are controlled bythe gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009)uses a finite spike and slab model to ensure that each topicis represented by a sparse distribution over words. Thespikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from asymmetric Dirichlet distribution defined over these spikes.The ICD also uses a spike and slab approach, but allowsan unbounded number of “spikes” (due to the IBP) and amore globally informative “slab” (due to the shared gammarandom variables). We extend the SparseTM’s approxima-tion of the expectation of a finite mixture of Dirichlet dis-tributions, to approximate the more complicated mixture ofDirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBPto select subsets of an infinite set of states, to model multi-ple dynamic systems with shared states. (A state in the dy-namic system is like a component in a mixed membershipmodel.) The probability of transitioning from the ith stateto the jth state in the mth dynamic system is drawn from aDirichlet distribution with parameters bmj� + ��i,j , where
Chang, Blei
!
Nd
"d
wd,n
zd,n
K
#k
yd,d'
$
Nd'
"d'
wd',n
zd',n
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete modelcontains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the linkstructure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Bleiand McAuliffe 2007), ensures that the same latent topic as-signments used to generate the content of the documentsalso generates their link structure. Models which do notenforce this coupling, such as Nallapati et al. (2008), mightdivide the topics into two independent subsets—one forlinks and the other for words. Such a decomposition pre-vents these models from making meaningful predictionsabout links given words and words given links. In Sec-tion 4 we demonstrate empirically that the RTM outper-forms such models on these tasks.
3 INFERENCE, ESTIMATION, ANDPREDICTION
With the model defined, we turn to approximate poste-rior inference, parameter estimation, and prediction. Wedevelop a variational inference procedure for approximat-ing the posterior. We use this procedure in a variationalexpectation-maximization (EM) algorithm for parameterestimation. Finally, we show how a model whose parame-ters have been estimated can be used as a predictive modelof words and links.
Inference In posterior inference, we seek to computethe posterior distribution of the latent variables condi-tioned on the observations. Exact posterior inference is in-tractable (Blei et al. 2003; Blei and McAuliffe 2007). Weappeal to variational methods.
In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully-factorized family,
q(\Theta, Z \mid \gamma, \Phi) = \prod_d \big[ q_\theta(\theta_d \mid \gamma_d) \prod_n q_z(z_{d,n} \mid \phi_{d,n}) \big], (3)

where \gamma is a set of Dirichlet parameters, one for each document, and \Phi is a set of multinomial parameters, one for each word in each document. Note that E_q[z_{d,n}] = \phi_{d,n}.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

\mathcal{L} = \sum_{(d_1,d_2)} E_q[\log p(y_{d_1,d_2} \mid z_{d_1}, z_{d_2}, \eta, \nu)] + \sum_d \sum_n E_q[\log p(w_{d,n} \mid \beta_{1:K}, z_{d,n})] + \sum_d \sum_n E_q[\log p(z_{d,n} \mid \theta_d)] + \sum_d E_q[\log p(\theta_d \mid \alpha)] + H(q), (4)
where (d_1, d_2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
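To make the structure of such a bound concrete, here is a small sketch, not the RTM itself: a toy model with one discrete latent variable and one observation, whose mean-field ELBO decomposes, as in Eq. 4, into an expected complete log-probability term plus an entropy term, and is tight exactly at the true posterior. All sizes, names, and numbers below are illustrative assumptions.

```python
import numpy as np

# Toy model (illustrative only, not the RTM): latent z in {0..K-1},
# p(z) = prior[z], p(w|z) = lik[z, w].  The variational distribution
# q(z) = Categorical(phi) plays the role of the factorized family.
rng = np.random.default_rng(0)
K, V = 3, 5
prior = rng.dirichlet(np.ones(K))
lik = rng.dirichlet(np.ones(V), size=K)   # K x V emission matrix
w = 2                                      # an observed word id

def elbo(phi):
    """E_q[log p(w, z)] + H(q): expected complete log-lik plus entropy."""
    ell = np.sum(phi * (np.log(prior) + np.log(lik[:, w])))
    entropy = -np.sum(phi * np.log(phi + 1e-12))
    return ell + entropy

# The exact posterior maximizes the bound and makes it tight:
post = prior * lik[:, w]
evidence = post.sum()                      # p(w), exact by enumeration
post /= evidence

phi0 = np.ones(K) / K                      # a suboptimal q
assert elbo(phi0) <= np.log(evidence) + 1e-9
assert abs(elbo(post) - np.log(evidence)) < 1e-9
```

Because ELBO = log p(w) − KL(q ‖ p(z|w)), any q gives a lower bound on the log evidence, and coordinate updates on the variational parameters only tighten it.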
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_{d_1,d_2} is either 1 or unobserved).1 We do this for two reasons.
First, while one can fix y_{d_1,d_2} = 1 whenever a link is observed between d_1 and d_2 and set y_{d_1,d_2} = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_{d_1,d_2} = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other's existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.
Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model, they can be removed whenever they are unobserved.
1Sums over document pairs (d1, d2) are understood to rangeover pairs for which a link has been observed.
(a) Overall Graphical Model
(b) Sentence Graphical Model

Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words' topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document \theta, and the node's parent's successor weights \pi. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence “Some phrases laid in his mind for years.” The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition “of.” Thematically, because it is in a travel brochure, we would expect to see words such as “Acapulco,” “Costa Rica,” or “Australia” more than “kitchen,” “debt,” or “pocket.” Our model can capture these kinds of regularities and exploit them in predictive problems.
Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as “fly,” which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.
Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.
The remainder of this paper is organized as follows. We describe the syntactic topic model, and develop an approximate posterior inference technique based on variational methods. We study its performance both on synthetic data and hand-parsed data [10]. We show that the STM captures relationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence “In the near future, you could find yourself in .” The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.
The STM attempts to capture these joint influences on words. It models a document corpus as exchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.
We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.
The rest of the paper is organized as follows. In Section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In Section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models
In a time stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.
More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
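To make the parenthetical concrete, a minimal sketch (the helper name is our own) of the correspondence between natural parameters and multinomial probabilities: the natural parameters are the log probabilities up to an additive constant, and the softmax map recovers the distribution.

```python
import numpy as np

def softmax(beta):
    """Map natural parameters (log scale) back to multinomial probabilities."""
    b = beta - beta.max()          # shift for numerical stability
    e = np.exp(b)
    return e / e.sum()

probs = np.array([0.5, 0.3, 0.2])
beta = np.log(probs)               # natural parameters = log probabilities
assert np.allclose(softmax(beta), probs)

# Natural parameters are identified only up to an additive constant:
assert np.allclose(softmax(beta + 7.0), probs)
```

This unconstrained log-scale representation is what lets the state space model place Gaussian dynamics on topics without leaving the simplex.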
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters \beta_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.
A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.
Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.
In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and \Delta_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth (1 \le k \le K) topic's parameter at term w is:

\beta_{0,k,w} \sim \mathcal{N}(m, v_0)
\beta_{j,k,w} \mid \beta_{i,k,w}, s \sim \mathcal{N}(\beta_{i,k,w},\, v\,\Delta_{s_j,s_i}), (1)

where the variance increases linearly with the lag.

This construction is used as a component in the full generative process. (Note: if j = i+1, we write \Delta_{s_j,s_i} as \Delta_{s_j} for short.)
1. For each topic k, 1 \le k \le K,

(a) Draw \beta_{0,k} \sim \mathcal{N}(m, v_0 I).
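A minimal simulation of the Brownian-motion prior in Eq. 1 (the vocabulary size, variances, and time stamps below are invented for illustration): the natural parameters of one topic diffuse with variance proportional to the elapsed time between stamps, and a softmax maps them to a word distribution at any stamp.

```python
import numpy as np

rng = np.random.default_rng(1)
V, m, v0, v = 4, 0.0, 1.0, 0.1        # toy vocab size and variances
stamps = [0.0, 0.5, 2.0, 2.25]        # arbitrary, unevenly spaced time stamps

beta = [rng.normal(m, np.sqrt(v0), size=V)]     # beta_0 ~ N(m, v0 I)
for i in range(1, len(stamps)):
    delta = stamps[i] - stamps[i - 1]           # elapsed time Delta
    # beta_j | beta_i ~ N(beta_i, v * Delta): variance grows with the lag
    beta.append(beta[-1] + rng.normal(0.0, np.sqrt(v * delta), size=V))

def to_word_dist(b):
    """Softmax: natural parameters -> multinomial over the vocabulary."""
    e = np.exp(b - b.max())
    return e / e.sum()

for b in beta:
    p = to_word_dist(b)
    assert abs(p.sum() - 1.0) < 1e-12 and (p > 0).all()
```

Because the increments depend only on elapsed time, unevenly spaced stamps need no intermediate "ticks," which is the memory saving over the dDTM described above.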
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In Section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.
Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics: hotels in France, New York hotels, youth hostels; or, similarly, when applied to a collection of Mp3 players' reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even in this case there is a danger that the model will be overwhelmed by very fine-grained global topics, or that the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.
One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram
models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific for the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: “. . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50”. It can be viewed as a mixture of topic London shared by the entire review (words: “London”, “tube”, “£”), and the ratable aspect location, specific for the local context of the sentence (words: “transport”, “walk”, “bus”). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics \theta^{loc}_{d,v} and a distribution defining preference for local topics versus global topics \pi_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution \psi_s. Importantly, the fact that the windows overlap permits exploiting a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introduction of a symmetrical Dirichlet prior Dir(\gamma) for the distribution \psi_s permits control of the smoothness of topic transitions in our model.
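The sliding-window bookkeeping can be sketched with a hypothetical helper (`windows_covering` is our own name, not the authors' code): with windows of T adjacent sentences, a word in sentence s may be generated through any window whose span contains s, and overlapping windows are what enlarge the co-occurrence domain.

```python
def windows_covering(s, n_sentences, T):
    """Indices of sliding windows (each spanning T adjacent sentences)
    that contain sentence s.  Window v covers sentences v .. v+T-1."""
    first = max(0, s - T + 1)
    last = min(s, n_sentences - T)
    return list(range(first, last + 1))

# A 5-sentence document with T=3 has windows {0,1,2}, {1,2,3}, {2,3,4}:
assert windows_covering(0, 5, 3) == [0]          # first sentence: one window
assert windows_covering(2, 5, 3) == [0, 1, 2]    # middle sentence: all three
assert windows_covering(4, 5, 3) == [2]          # last sentence: one window
```

A middle sentence is shared by T windows, so its words co-occur with words up to T−1 sentences away in either direction without any explicit transition model.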
The formal definition of the model with K_{gl} global and K_{loc} local topics is the following. First, draw K_{gl} word distributions for global topics \varphi^{gl}_z from a Dirichlet prior Dir(\beta^{gl}) and K_{loc} word distributions for local topics \varphi^{loc}_{z'} from Dir(\beta^{loc}). Then, for each document d:

• Choose a distribution of global topics \theta^{gl}_d \sim Dir(\alpha^{gl}).
• For each sentence s choose a distribution \psi_{d,s}(v) \sim Dir(\gamma).
• For each sliding window v
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]
Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]
Author-Recipient-Topic Model (ART) [This paper]
Author Model (Multi-label Mixture Model) [McCallum 1999]

Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, \varphi_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, \theta, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, \theta, and authors are sampled uniformly from the observed list of the document's authors. In the Author-Recipient-Topic model, there is a separate topic-distribution for each author-recipient pair, and the selection of topic-distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution \theta_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution \varphi_z. However, as described previously, none of these models is suitable for modeling message data.
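The AT-style generative process just described can be sketched in a few lines; the parameter matrices below are random placeholders rather than fitted models, and the sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
A, T, V = 3, 4, 10                          # authors, topics, vocabulary
theta = rng.dirichlet(np.ones(T), size=A)   # per-author topic distribution
phi = rng.dirichlet(np.ones(V), size=T)     # per-topic word distribution

def generate_word(authors):
    """One word under the AT-style process: author uniform from the
    document's author set, topic from theta_x, word from phi_z."""
    x = rng.choice(authors)                 # author sampled uniformly
    z = rng.choice(T, p=theta[x])           # topic from author's distribution
    w = rng.choice(V, p=phi[z])             # word from topic's distribution
    return x, z, w

x, z, w = generate_word([0, 2])
assert x in (0, 2) and 0 <= z < T and 0 <= w < V
```

The ART extension described next would replace `theta[x]` with a distribution indexed by an (author, recipient) pair rather than by the author alone.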
An email message has one sender and in general more than one recipient. We could treat both the sender and the recipients as “authors” of the message, and then employ the AT model, but this does not distinguish the author and the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable since they do not reflect our expertise or roles.
Alternatively we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we are losing all information about the recipients, and the connections between people implied by the sender-recipient relationships.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation,similarity, visualization, summarization, and other applications.
© Eric Xing @ CMU, 2005-2020 47
More on Mean Field Approximation
The naive mean field approximation

q Approximate p(X) by fully factorized q(X) = \prod_i q_i(X_i)
q For the Boltzmann distribution p(X) = \exp\{\sum_{i<j} \theta_{ij} X_i X_j + \theta_{i0} X_i\}/Z:

q Gibbs predictive distribution:

p(x_i \mid \mathbf{x}_{N_i}) = \exp\big\{\theta_{i0} x_i + \sum_{j \in N_i} \theta_{ij} x_i x_j - A_i\big\} = p(x_i \mid \{x_j : j \in N_i\})

q Mean field equation:

q_i(x_i) = \exp\big\{\theta_{i0} x_i + \sum_{j \in N_i} \theta_{ij} x_i \langle X_j \rangle_{q_j} - A_i\big\} = p(x_i \mid \{\langle X_j \rangle_{q_j} : j \in N_i\})

§ \langle x_j \rangle_{q_j} resembles a “message” sent from node j to i
§ \{\langle X_j \rangle_{q_j} : j \in N_i\} forms the “mean field” applied to X_i from its neighborhood
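The mean field equation yields a simple fixed-point iteration. A sketch for spins x_i ∈ {−1, +1} (a toy instance of our own; for this case the update has the closed form μ_i = tanh(θ_i0 + Σ_{j∈N_i} θ_ij μ_j)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
Theta = rng.normal(0, 0.3, size=(n, n))
Theta = np.triu(Theta, 1)
Theta = Theta + Theta.T                 # symmetric couplings, zero diagonal
theta0 = rng.normal(0, 0.5, size=n)     # biases theta_i0

mu = np.zeros(n)                        # the mean-field "messages" <X_j>
for _ in range(200):                    # coordinate-ascent sweeps
    for i in range(n):
        mu[i] = np.tanh(theta0[i] + Theta[i] @ mu)

# At convergence each mu_i satisfies its own mean field equation:
resid = np.abs(mu - np.tanh(theta0 + Theta @ mu)).max()
assert resid < 1e-6
```

Each update replaces node i's random neighbors by their current means, exactly the "mean field applied to X_i from its neighborhood" in the slide above; for weak couplings the sweep is a contraction and converges quickly.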
Cluster-based approx. to the Gibbs free energy (Wiegerinck 2001, Xing et al. 03, 04)

Exact: G[p(X)] (intractable)
Clusters: G[\{q_C(X_C)\}]
Mean field approx. to Gibbs free energy

q Given a disjoint clustering, {C_1, …, C_I}, of all variables
q Let q(\mathbf{X}) = \prod_i q_i(\mathbf{X}_{C_i})
q Mean-field free energy:

G_{MF} = \sum_{\mathbf{x}} \Big[\prod_i q_i(\mathbf{x}_{C_i})\Big] E(\mathbf{x}) + \sum_i \sum_{\mathbf{x}_{C_i}} q_i(\mathbf{x}_{C_i}) \ln q_i(\mathbf{x}_{C_i})

e.g., G_{MF} = \sum_{i<j} \sum_{x_i,x_j} q_i(x_i)\, q_j(x_j)\, \phi_{ij}(x_i,x_j) + \sum_i \sum_{x_i} q_i(x_i)\, \phi_i(x_i) + \sum_i \sum_{x_i} q_i(x_i) \ln q_i(x_i) (naïve mean field)

q Will never equal the exact Gibbs free energy no matter what clustering is used, but it does always define a lower bound of the likelihood
q Optimize each q_i(\mathbf{x}_{C_i}):
q Variational calculus …
q Do inference in each q_i(\mathbf{x}_{C_i}) using any tractable algorithm
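The lower-bound claim can be checked numerically on a tiny model (brute-force code of our own, using the equivalent statement that E_q[S(x)] + H(q) ≤ ln Z for the score S(x) = Σ_{i<j} θ_ij x_i x_j + Σ_i θ_i0 x_i, i.e., −G_MF ≤ ln Z):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n = 4
Theta = np.triu(rng.normal(0, 0.5, (n, n)), 1)   # couplings theta_ij, i<j
theta0 = rng.normal(0, 0.5, n)                    # biases theta_i0

def score(x):
    """S(x) = sum_{i<j} theta_ij x_i x_j + sum_i theta_i0 x_i, x_i in {-1,+1}."""
    return x @ Theta @ x + theta0 @ x

# Exact log partition function by enumerating all 2^n states
states = np.array(list(itertools.product([-1, 1], repeat=n)))
logZ = np.log(np.sum([np.exp(score(x)) for x in states]))

def mf_bound(mu):
    """E_q[S] + H(q) for a factorized q with means mu_i = <x_i>."""
    q1 = (1 + mu) / 2                         # q_i(x_i = +1)
    es = mu @ Theta @ mu + theta0 @ mu        # cross terms factorize under q
    h = -np.sum(q1 * np.log(q1) + (1 - q1) * np.log(1 - q1))
    return es + h

for _ in range(20):                           # random factorized q's
    mu = rng.uniform(-0.95, 0.95, n)
    assert mf_bound(mu) <= logZ + 1e-9
```

Every factorized q gives a valid lower bound on ln Z; optimizing the q_i only tightens it, which is why G_MF never reaches the exact Gibbs free energy but always bounds the likelihood.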
The Generalized Mean Field theorem

q Theorem: The optimum GMF approximation to the cluster marginal is isomorphic to the cluster posterior of the original distribution given internal evidence and its generalized mean fields:

q_i^*(\mathbf{X}_{C_i,H}) = p(\mathbf{X}_{C_i,H} \mid \mathbf{x}_{C_i,E}, \langle \mathbf{X}_{MB_i,H} \rangle_{q_{j \neq i}})

q GMF algorithm: Iterate over each q_i
A generalized mean field algorithm [Xing et al., UAI 2003]
Convergence theorem

q Theorem: The GMF algorithm is guaranteed to converge to a local optimum, and provides a lower bound for the likelihood of evidence (or partition function) of the model.
Example 1: Generalized MF approximations to Ising models

q Cluster marginal of a square block C_k:

q_{C_k}(\mathbf{X}_{C_k}) \propto \exp\Big\{\sum_{i \in C_k} \theta_{i0} X_i + \sum_{i,j \in C_k} \theta_{ij} X_i X_j + \sum_{i \in C_k,\, j \in MB_{C_k}} \theta_{ij} X_i \langle X_j \rangle_{q_{MB_{C_k}}}\Big\}

q Virtually a reparameterized Ising model of small size.
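A sketch of evaluating such a block marginal by enumeration (the block membership, couplings, and neighbor means below are invented toy values): the block keeps its internal couplings exactly, while boundary couplings see the neighbors only through their current means.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n = 6
block = [0, 1, 2, 3]                    # indices of a 2x2 block C_k
boundary = [4, 5]                       # Markov-blanket nodes of the block
Theta = np.triu(rng.normal(0, 0.4, (n, n)), 1)
Theta = Theta + Theta.T                 # symmetric couplings theta_ij
theta0 = rng.normal(0, 0.4, n)          # biases theta_i0
mu_boundary = {4: 0.3, 5: -0.2}         # current mean fields <X_j>

def unnorm(xb):
    """exp{internal block terms + boundary terms with neighbor means}."""
    s = 0.0
    for a, i in enumerate(block):
        s += theta0[i] * xb[a]
        for b, j in enumerate(block):
            if j > i:                   # each internal pair once
                s += Theta[i, j] * xb[a] * xb[b]
        for j in boundary:              # neighbors enter via their means
            s += Theta[i, j] * xb[a] * mu_boundary[j]
    return np.exp(s)

configs = list(itertools.product([-1, 1], repeat=len(block)))
weights = np.array([unnorm(x) for x in configs])
q_block = weights / weights.sum()       # the cluster marginal q_Ck
assert abs(q_block.sum() - 1.0) < 1e-12
```

As the slide notes, this is just a small reparameterized Ising model: the boundary terms act as extra per-node biases, so any exact method for small Ising models can handle each block.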
GMF approximation to Ising models

(Figure legends: GMF2x2, GMF4x4, BP)

Attractive coupling: positively weighted
Repulsive coupling: negatively weighted
Example 2: Sigmoid belief network

(Figure legends: GMFr, GMFb, BP)
Example 3: Factorial HMM