TRANSCRIPT
Probabilistic Graphical Models
Variational Inference 1
Eric Xing
Lecture 7, February 5, 2020
Reading: see class homepage
© Eric Xing @ CMU, 2005-2020
Inference Problems in Graphical Models
- E.g., a general undirected graphical model (MRF):

      p(x) = (1/Z) ∏_{C ∈ 𝒞} ψ_C(x_C)

- The quantities of interest:
  - marginal distributions:  p(x_i) = Σ_{x_j, j ≠ i} p(x)
  - normalization constant (partition function):  Z = Σ_x ∏_{C ∈ 𝒞} ψ_C(x_C)
- Exact inference: tree graphs, discrete scopes or known integrals, ...
- What if exact inference is expensive or even impossible? (When can this happen?)
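Both quantities can be computed by brute-force enumeration when the model is tiny; a minimal sketch (a 3-node chain MRF with illustrative potentials):

```python
import itertools
import numpy as np

# A tiny 3-node chain MRF over binary variables with pairwise
# potentials psi(x_a, x_b); the potential values are illustrative.
psi = np.array([[2.0, 1.0],
                [1.0, 2.0]])  # favors agreement between neighbors

def unnormalized(x):
    # Product of clique potentials along the chain 0-1-2.
    return psi[x[0], x[1]] * psi[x[1], x[2]]

# Partition function Z: sum over all 2^3 joint configurations.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)

# Marginal p(x_i = v): sum p(x) over configurations with x_i = v.
def marginal(i, v):
    return sum(unnormalized(x) for x in states if x[i] == v) / Z

print(Z)
print(marginal(0, 0) + marginal(0, 1))  # marginals sum to 1
```

Enumeration costs 2^n terms, which is exactly why approximate inference is needed for large n.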
Approximate Inference: The Big Picture
- Variational Inference
  - Mean-field (inner approximation)
  - Loopy Belief Propagation (outer approximation)
  - Kikuchi and variants (tighter outer approximation)
  - Expectation Propagation (reverse KL)
  - ...
- Sampling
  - Monte Carlo
  - Sequential Monte Carlo (Particle Filters)
  - Markov Chain Monte Carlo
  - Hybrid Monte Carlo
  - ...
Variational Methods
- "Variational": a fancy name for optimization-based formulations
  - i.e., represent the quantity of interest as the solution to an optimization problem
  - approximate the desired solution by relaxing/approximating the intractable optimization problem
- Examples:
  - Courant-Fischer for eigenvalues:

        λ_max(A) = max_{‖x‖₂ = 1} xᵀAx

  - Linear system of equations:  Ax = b,  A ≻ 0,  x* = A⁻¹b
    - variational formulation:

          x* = argmin_x { (1/2) xᵀAx − bᵀx }

    - for a large system, apply the conjugate gradient method
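The variational formulation of the linear system can be sketched directly; A and b below are illustrative, and the solver is a plain conjugate gradient loop minimizing the quadratic:

```python
import numpy as np

# Solve Ax = b (A symmetric positive definite) by minimizing the
# quadratic f(x) = 0.5 x^T A x - b^T x with conjugate gradients.
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - A @ x          # residual = negative gradient of f
    p = r.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)   # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p         # next A-conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # illustrative SPD system
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # True: the minimizer solves Ax = b
```

The minimizer of the quadratic is exactly the solution of the linear system, which is the sense in which the optimization problem "represents" the quantity of interest.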
Variational Inference: High-level Idea
- Inference: answer queries about P
- Challenge: direct inference on P is often intractable
- Indirect approach:
  - Project P to a tractable family of distributions Q
  - Perform inference on the projected Q
- Projection requires a measure of distance
  - A convenient choice: KL(Q ‖ P)
- Mean field: assume Q is fully factorized
  - The simplest possible family of distributions
- Example: Latent Dirichlet Allocation (LDA)
3.4 Mean Parameterization and Inference Problems

Fig. 3.5: Generic illustration of M for a discrete random variable with |X^m| finite. In this case, the set M is a convex polytope, corresponding to the convex hull of {φ(x) | x ∈ X^m}. By the Minkowski-Weyl theorem, this polytope can also be written as the intersection of a finite number of half-spaces, each of the form {μ ∈ R^d | ⟨a_j, μ⟩ ≥ b_j} for some pair (a_j, b_j) ∈ R^d × R.

Example 3.8 (Ising Mean Parameters). Continuing from Example 3.1, the sufficient statistics for the Ising model are the singleton functions (x_s, s ∈ V) and the pairwise functions (x_s x_t, (s, t) ∈ E). The vector of sufficient statistics takes the form:

    φ(x) := ( x_s, s ∈ V;  x_s x_t, (s, t) ∈ E ) ∈ R^{|V|+|E|}.    (3.30)

The associated mean parameters correspond to particular marginal probabilities, associated with nodes and edges of the graph G as

    μ_s = E_p[X_s] = P[X_s = 1]                    for all s ∈ V, and    (3.31a)
    μ_st = E_p[X_s X_t] = P[(X_s, X_t) = (1, 1)]   for all (s, t) ∈ E.   (3.31b)

Consequently, the mean parameter vector μ ∈ R^{|V|+|E|} consists of marginal probabilities over singletons (μ_s), and pairwise marginals over variable pairs on graph edges (μ_st). The set M consists of the convex hull of {φ(x), x ∈ {0,1}^m}, where φ is given in Equation (3.30). In probabilistic terms, the set M corresponds to the set of all singleton and pairwise marginal probabilities that can be realized by some distribution over (X_1, ..., X_m) ∈ {0,1}^m. In the polyhedral combinatorics literature, this set is known as the correlation polytope, or the cut polytope [69, 187].
    Q* = argmin_{Q ∈ Q_tractable} KL(Q ‖ P)
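A small numeric sketch of the projection idea (all distributions illustrative): a fully factorized Q cannot represent the correlation in P, and KL(Q ‖ P) measures what is lost:

```python
import numpy as np

# Approximate a 2-variable joint P by a fully factorized
# Q(x, y) = q1(x) q2(y) and measure KL(Q || P). P is illustrative.
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])  # correlated binary pair

def kl(Q, P):
    # KL(Q || P) = sum_x Q(x) log( Q(x) / P(x) )
    return float(np.sum(Q * np.log(Q / P)))

# One natural factorized candidate: the product of P's marginals.
q1 = P.sum(axis=1)   # marginal of x
q2 = P.sum(axis=0)   # marginal of y
Q = np.outer(q1, q2)

print(kl(Q, P) >= 0)       # True: KL is nonnegative
print(round(kl(Q, P), 4))  # strictly positive: the correlation is lost
```

The gap is exactly the price of restricting Q to a factorized family; variational inference searches within that family for the member closest to P.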
Probabilistic Topic Models
- Humans cannot afford to deal with (e.g., search, browse, or measure similarity across) a huge number of text documents
- We need computers to help out ...
How to get started for a new modeling task?
Here are some important elements to consider before you start:
- Task:
  - Embedding? Classification? Clustering? Topic extraction? ...
- Data representation:
  - Input and output (e.g., continuous, binary, counts, ...)
- Model:
  - BN? MRF? Regression? SVM?
- Inference:
  - Exact inference? MCMC? Variational?
- Learning:
  - MLE? MCLE? Max margin?
- Evaluation:
  - Visualization? Human interpretability? Perplexity? Predictive accuracy?

It is better to consider one element at a time!
Tasks: document embedding
- Say we want to have a mapping ..., so that we can:
  - Compare similarity
  - Classify contents
  - Cluster/group/categorize
  - Distill semantics and perspectives
  - ...
Summarizing the data using topics

Some example topics:

- Education: students, education, learning, educational, teaching, school, students, skills, teacher, academic
- Market: market, economic, financial, economics, markets, returns, price, stock, value, investment
- Visual cortex: cortex, cortical, areas, visual, area, primary, connections, ventral, cerebral, sensory
- Bayesian modeling: Bayesian, model, inference, models, probability, probabilistic, Markov, prior, hidden, approach

Chong Wang, Probabilistic Modeling for Large-scale Data Exploration
See how data changes over time

[Figure 6: Time stamp prediction; bar charts of average absolute error for 1-, 3-, 5-, and 10-topic models. 'm' stands for the flat approach of 'month', 'w' for 'week', 'd' for 'day', and 'h' for 'hour'. 'm+w' stands for the hierarchical approach of combining 'month' and 'week', and 'm+w+d', 'w+d', 'w+d+h' are similarly defined. The baseline is the expectation of the error obtained by randomly assigning a time. (a) AP data, error in hours: the 5-topic and 10-topic models perform better than the others, and the hierarchical approach always achieves the best performance. (b) Election 08 data, error in days: the 1-topic model performs best due to the short documents; the hierarchical approach achieves comparable performance.]
[Figure 7: Examples from a 3-topic cDTM using the week model on the Election 08 data; top words are shown at seven time points from 2/27/2007 to 2/19/2008 (early lists led by words like "healthcare", "wisconsin", "vegas", "superdelegate"; later lists led by "obama", "clinton", "hillary", "barack", "campaign"). In 2007 the topics were more about general issues, while around 2008 they were more about candidates and changed faster.]
User interest modeling using topics
http://cogito-demos.ml.cmu.edu/cgi-bin/recommendation.cgi
Representation:
- Data:
  - Each document is a vector in the word space
  - Ignore the order of words in a document; only counts matter!
- A high-dimensional and sparse representation
  - Not efficient for text processing tasks, e.g., search, document classification, or similarity measures
  - Not effective for browsing
As for the Arabian and Palestinean voices that are against the current negotiations and the so-called peace process, they are not against peace per se, but rather for their well-founded predictions that Israel would NOT give an inch of the West bank (and most probably the same for Golan Heights) back to the Arabs. An 18 months of "negotiations" in Madrid, and Washington proved these predictions. Now many will jump on me saying why are you blaming israelis for no-result negotiations. I would say why would the Arabs stall the negotiations, what do they have to loose ?
Highlighted words: Arabian, negotiations, against, peace, Israel, Arabs, blaming
Bag of Words Representation
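A bag-of-words sketch (crude tokenizer; the sample text is truncated):

```python
from collections import Counter
import re

# Bag of words: map a document to word counts, discarding order.
doc = ("As for the Arabian and Palestinean voices that are against the "
       "current negotiations and the so-called peace process ...")

tokens = re.findall(r"[a-z]+", doc.lower())  # crude tokenizer
bow = Counter(tokens)

print(bow["the"])           # counts, not positions, are all that survive
print(bow["negotiations"])
```

The resulting vector lives in a space with one dimension per vocabulary word, which is why the representation is high-dimensional and sparse.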
How to Model Semantics?

- Q: What is it about?
- A: Mainly MT, with syntax, some learning
A Hierarchical Phrase-Based Model for Statistical Machine Translation
We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.
Example topics (each a unigram distribution over the vocabulary) and the document's mixing proportions:

- MT (0.6): source, target, SMT, alignment, score, BLEU
- Syntax (0.3): parse, tree, noun, phrase, grammar, CFG
- Learning (0.1): likelihood, EM, hidden, parameters, estimation, argMax

Topic Models
Why is this Useful?

- Q: What is it about?
- A: Mainly MT, with syntax, some learning
A Hierarchical Phrase-Based Model for Statistical Machine Translation
We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.
Topics: MT, Syntax, Learning; mixing proportions: 0.6, 0.3, 0.1
- Q: Give me similar documents?
  - A structured way of browsing the collection
- Other tasks:
  - Dimensionality reduction
    - TF-IDF vs. topic mixing proportion
  - Classification, clustering, and more ...
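For contrast with topic mixing proportions, a TF-IDF sketch (toy corpus; all documents illustrative):

```python
import math
from collections import Counter

# TF-IDF: the classical alternative to topic mixing proportions
# for representing documents in a lower-effort way.
docs = [
    "statistical machine translation model",
    "syntax tree grammar parse",
    "machine learning model parameters",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tfidf(term, doc_tokens):
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for d in tokenized if term in d)  # document frequency
    idf = math.log(N / df)                       # rarer terms weigh more
    return tf * idf

# "model" appears in 2 of 3 docs, "translation" in only 1, so the
# rarer term gets the larger weight in document 0:
print(tfidf("translation", tokenized[0]) > tfidf("model", tokenized[0]))
```

Unlike a topic mixing vector, the TF-IDF vector has one dimension per vocabulary word, so it reduces noise but not dimensionality.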
Topic Models: The Big Picture
Unstructured Collection Structured Topic Network
Topic Discovery
Dimensionality Reduction
[Diagram: documents as points x in the word simplex spanned by words w1, w2, ..., wn; topic discovery maps them onto a lower-dimensional topic simplex spanned by topics T1, T2, ..., Tk.]
LSI versus Topic Model (probabilistic LSI)

- LSI: factorize the words-by-documents count matrix into words-by-topics, topic-by-topic, and topics-by-documents factors:

      C = W L Dᵀ

- Topic models: a probabilistic factorization of the words-by-documents probabilities into words-by-topics and topics-by-documents distributions:

      p(w) = Σ_z p(w|z) p(z)

  Topic mixing is via repeated word labeling.
Words in Contexts
- "It was a nice shot."
Words in Contexts (con'd)
- "The opposition Labor Party fared even worse, with a predicted 35 seats, seven less than last election."
"Words" in Contexts (con'd)
Sivic et al., ICCV 2005
More Generally: Admixture Models
- Objects are bags of elements
- Mixtures are distributions over elements
- Objects have a mixing vector θ
  - Represents each mixture's contribution
- An object is generated as follows:
  - Pick a mixture component from θ
  - Pick an element from that component
Example documents (each word token is tagged with the mixture component that generated it):

money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

Mixing vectors θ, one per document: (0.1, 0.1, 0.5, ...), (0.1, 0.5, 0.1, ...), (0.5, 0.1, 0.1, ...)
Topic Models Represented as a GM
[Plate diagram: prior → θ → z → w, with topics β; word plate N_d, document plate N, K topics.]

Generating a document:

1. Draw θ from the prior
2. For each word n = 1, ..., N_d:
   - Draw z_n from Multinomial(θ)
   - Draw w_n from Multinomial(w | z_n, β_{1:k}), i.e., from the topic β_{z_n}
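With a Dirichlet prior on θ this generative process is LDA; a sketch (vocabulary, topics, and hyperparameters all illustrative):

```python
import numpy as np

# Sampling a document: draw topic proportions theta from the prior,
# then for each word draw a topic z_n and a word w_n from that topic.
rng = np.random.default_rng(0)

vocab = ["money", "bank", "loan", "river", "stream"]
beta = np.array([[0.3, 0.4, 0.3, 0.0, 0.0],   # a "finance" topic
                 [0.0, 0.3, 0.0, 0.4, 0.3]])  # a "geography" topic
alpha = np.array([1.0, 1.0])                  # Dirichlet prior on theta

def generate_document(n_words):
    theta = rng.dirichlet(alpha)              # topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(beta), p=theta)    # z_n ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=beta[z]) # w_n ~ Multinomial(beta_z)
        words.append(vocab[w])
    return words

doc = generate_document(10)
print(len(doc))  # 10 tokens drawn from a mixture of the two topics
```

Swapping the Dirichlet prior for a logistic normal gives the correlated topic model discussed next.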
Which prior to use?
Choices of Priors
- Dirichlet (LDA) (Blei et al., 2003)
  - A conjugate prior means efficient inference
  - Can only capture variations in each topic's intensity independently
- Logistic Normal (CTM = LoNTAM) (Blei & Lafferty, 2005; Ahmed & Xing, 2006)
  - Captures the intuition that some topics are highly correlated and can rise in intensity together
  - Not a conjugate prior, which implies harder inference
Generative Semantic of LoNTAM
[Plate diagram: μ, Σ → γ → z → w, with topics β; word plate N_d, document plate N, K topics.]

Generating a document:

1. Draw γ ~ N(μ, Σ), with γ_K = 0
2. Map γ to the simplex via the logistic transformation:

       θ_i = exp{ γ_i − C(γ) },   where   C(γ) = log( 1 + Σ_{i=1}^{K−1} e^{γ_i} )

3. For each word n = 1, ..., N_d:
   - Draw z_n from Multinomial(θ)
   - Draw w_n from Multinomial(w | z_n, β_{1:k})

Problem: C(γ) is the log partition function (normalization constant), and it couples all the components of γ, which complicates inference.
Posterior inference
[Plate diagram: α → topic proportions θ_d → topic assignments z_dn → words w_dn, with topics β_k (prior η); word plate N, document plate D, K topics.]
Posterior inference results
[Plate diagram: α → θ → z → w with topics β (prior η), annotated with inference results: inferred topics (e.g., "Bayesian, model, inference, ..."; "input, output, system, ..."; "cortex, cortical, areas, ..."), per-document topic proportions, and per-word topic assignments.]
Joint likelihood of all variables
[Plate diagram: α → θ → z → w, with topics β (prior η); word plate N, document plate D, K topics.]
We are interested in computing the posterior, and the data likelihood!
Inference and Learning are both intractable
- A possible query:

      p(θ_n | D) = ?,    p(z_n,m | D) = ?

- Closed-form solution?

      p(θ_n, z_n | D) = p(D, θ_n, z_n) / p(D)

      p(D) = ∫∫ ( ∏_n ∏_m Σ_{z_n,m} p(w_n,m | z_n,m, β) p(z_n,m | θ_n) ) p(θ_n | α) p(β | η) dθ dβ

- The sum in the denominator is over T^N terms, and we integrate over n K-dimensional topic vectors
- Learning: what to learn? What is the objective function?
Approximate Inference
- Variational inference:
  - Mean-field approximation (Blei et al.)
  - Expectation propagation (Minka et al.)
  - Variational 2nd-order Taylor approximation (Xing)
- Markov chain Monte Carlo:
  - Gibbs sampling (Griffiths et al.)
Variational Inference
- Consider a generative model p_θ(x|z) and prior p(z)
  - Joint distribution: p_θ(x, z) = p_θ(x|z) p(z)
- Assume a variational distribution q_φ(z|x)
- Objective: maximize a lower bound on the log likelihood

      log p(x) = KL( q_φ(z|x) ‖ p_θ(z|x) ) + Σ_z q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ]
               ≥ Σ_z q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ]  =:  ℒ(θ, φ; x)

- Equivalently, minimize the free energy

      F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) ‖ p_θ(z|x) )
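The decomposition above can be checked numerically on a tiny discrete model (all distributions illustrative):

```python
import numpy as np

# Check: log p(x) = KL(q(z|x) || p(z|x)) + ELBO, hence ELBO <= log p(x).
p_z = np.array([0.6, 0.4])             # prior p(z)
p_x_given_z = np.array([0.2, 0.7])     # p(x=1 | z) for z = 0, 1

p_joint = p_z * p_x_given_z            # p(x=1, z)
p_x = p_joint.sum()                    # evidence p(x=1)
p_post = p_joint / p_x                 # exact posterior p(z | x=1)

q = np.array([0.5, 0.5])               # some variational q(z | x=1)
kl = np.sum(q * np.log(q / p_post))    # KL(q || posterior)
elbo = np.sum(q * np.log(p_joint / q)) # the lower bound

print(np.isclose(np.log(p_x), kl + elbo))  # True: the identity holds
print(elbo <= np.log(p_x))                 # True: ELBO is a lower bound
```

Since KL is nonnegative, maximizing the ELBO over q tightens the bound, and the bound is exact when q equals the true posterior.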
Variational Inference
Maximize the variational lower bound

      ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) )
                 = log p(x) − KL( q_φ(z|x) ‖ p_θ(z|x) )

- E-step: maximize ℒ w.r.t. φ, with θ fixed:  max_φ ℒ(θ, φ; x)
  - If closed-form solutions exist:  q_φ*(z|x) ∝ exp[ log p_θ(x, z) ]
- M-step: maximize ℒ w.r.t. θ, with φ fixed:  max_θ ℒ(θ, φ; x)
Mean-field assumption (in topic models)
- True posterior:

      p(β, θ, z | w) = p(β, θ, z, w) / p(w)

- Break the dependencies using the fully factorized distribution:

      q(β, θ, z) = ∏_k q(β_k) ∏_d q(θ_d) ∏_n q(z_dn)

- The mean-field family usually does NOT include the true posterior.
Mean Field Approximation
- Parametric form for each marginal factor in q(β, θ, z | λ, γ, φ):

      q(β_k | λ_k) = Dirichlet(β_k | λ_k)
      q(θ_d | γ_d) = Dirichlet(θ_d | γ_d)
      q(z_dn | φ_dn) = Multinomial(z_dn | φ_dn)

- Learning the parameters of the variational distribution (E-step):

      λ*, γ*, φ* = argmin_{λ, γ, φ} KL( q(β, θ, z | λ, γ, φ) ‖ p(β, θ, z | w, α, η) )

- For LDA, we can compute the optimal MF approximation in closed form.
Update each marginal

- Update: each optimal factor takes the form q*(θ_d) ∝ exp{ E_q[ log p(θ_d, z_d, w_d | α, β) ] }
- Where, in LDA, this gives the coordinate update:

      γ_d(k) = α(k) + Σ_{n=1}^{N_d} φ_dn(k)

- And we obtain: this is also a Dirichlet, the same family as its prior!
Update each marginal
- Similarly to q(θ_d | γ_d), we obtain optimal parameters φ_dn* for q(z_dn | φ_dn):

      q(z_dn = k | φ_dn) = φ_dn(k) ∝ β_k(w_dn) exp{ Ψ(γ_d(k)) − Ψ( Σ_{j=1}^K γ_d(j) ) }

- And optimal parameters λ_k* for q(β_k | λ_k):

      λ_k(j) = η(j) + Σ_{d=1}^D Σ_{n=1}^{N_d} φ_dn*(k) 1[w_dn = j]

- Iterating these equations to convergence yields the MF approximation to the posterior distribution.
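One round of these updates on a toy corpus (sizes and hyperparameters illustrative; digamma is implemented inline so the sketch needs nothing beyond NumPy):

```python
import numpy as np

# LDA mean-field coordinate updates: phi (assignments), gamma
# (proportions), lambda (topics), on a two-document toy corpus.
def digamma(x):
    # recurrence plus asymptotic expansion; adequate for x > 0
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + np.log(x) - 1.0/(2*x) - 1.0/(12*x**2)

digamma_v = np.vectorize(digamma)

K, V = 2, 4                        # topics, vocabulary size
alpha, eta = 0.5, 0.5
docs = [[0, 1, 1, 2], [2, 3, 3]]   # word ids per document

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(V), size=K)  # current topic-word dists
gamma = np.ones((len(docs), K))           # variational Dirichlet params
lam = np.full((K, V), eta)

for d, words in enumerate(docs):
    for _ in range(20):            # inner loop: phi and gamma updates
        expE = np.exp(digamma_v(gamma[d]) - digamma(gamma[d].sum()))
        phi = beta[:, words].T * expE             # unnormalized phi_dn(k)
        phi /= phi.sum(axis=1, keepdims=True)     # normalize over topics
        gamma[d] = alpha + phi.sum(axis=0)        # gamma update
    for n, w in enumerate(words):
        lam[:, w] += phi[n]                       # lambda update

# Each gamma_d sums to K*alpha + N_d by construction of the update.
print(np.allclose(gamma.sum(axis=1), [alpha*K + len(d) for d in docs]))
```

The exp-digamma factor is the E_q[log θ_d(k)] term from the update equation above; everything else is bookkeeping.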
Coordinate ascent algorithm for LDA

 1: Initialize variational topics q(β_k), k = 1, ..., K
 2: repeat
 3:   for each document d ∈ {1, 2, ..., D} do
 4:     Initialize variational topic assignments q(z_dn), n = 1, ..., N
 5:     repeat
 6:       Update variational topic proportions q(θ_d)
 7:       Update variational topic assignments q(z_dn), n = 1, ..., N
 8:     until the change of q(θ_d) is small enough
 9:   end for
10:   Update variational topics q(β_k), k = 1, ..., K
11: until the lower bound ℒ(q) converges

Chong Wang (CMU), Large-scale Latent Variable Models, February 28, 2013
Conclusion
- GM-based topic models are cool
  - Flexible
  - Modular
  - Interactive
- There are many ways of implementing topic models
  - Unsupervised
  - Supervised
- Efficient inference/learning algorithms
  - GMF, with Laplace approximation for non-conjugate distributions
  - MCMC
- Many applications
  - Word-sense disambiguation
  - Image understanding
  - Network inference
  - ...
Supplementary
Supplementary: More on strategies in VI
- Alternative approximation schemes
- How to evaluate: empirical (ground truth unknown) vs. simulation (ground truth known)
- Comparison (of what?)
- Building blocks
Recall Choices of Priors
- Dirichlet (LDA) (Blei et al., 2003)
  - A conjugate prior means efficient inference
  - Can only capture variations in each topic's intensity independently
- Logistic Normal (CTM = LoNTAM) (Blei & Lafferty, 2005; Ahmed & Xing, 2006)
  - Captures the intuition that some topics are highly correlated and can rise in intensity together
  - Not a conjugate prior, which implies harder inference
Computing P({z}, γ | D):

[Plate diagram: μ, Σ → γ → z → w, with topics β.]

Log partition function:

    C(γ) = log( 1 + Σ_{i=1}^{K−1} e^{γ_i} )

The choice of q() does matter:

    q(γ, z_{1:N}) = q(γ | μ*, Σ*) ∏_n q(z_n | φ_n)

- Ahmed & Xing: Σ* is a full matrix; multivariate quadratic approximation; closed-form solution for μ*, Σ*
- Blei & Lafferty: Σ* is assumed to be diagonal; tangent approximation; numerical optimization to fit μ*, Diag(Σ*)
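A sketch of the logistic-normal mapping and its log partition function C(γ) (all parameter values illustrative):

```python
import numpy as np

# gamma ~ N(mu, Sigma) with the K-th component pinned to 0;
# theta_i = exp(gamma_i - C(gamma)), C(gamma) = log(1 + sum exp(gamma_i)).
rng = np.random.default_rng(0)

K = 4
mu = np.zeros(K - 1)
Sigma = np.eye(K - 1)

gamma = rng.multivariate_normal(mu, Sigma)  # K-1 free components
C = np.log1p(np.exp(gamma).sum())           # log partition function
theta = np.exp(np.append(gamma, 0.0) - C)   # point on the simplex

print(np.isclose(theta.sum(), 1.0))  # True: theta lies on the simplex
print((theta > 0).all())             # True
```

Off-diagonal entries of Sigma induce correlations between topics, which is precisely what the Dirichlet prior cannot express.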
Tangent Approximation
How to evaluate?
- Empirical visualization: e.g., topic discovery on the New York Times

[Figure: "The 5 most frequent topics from the HDP on the New York Times," shown as top-word lists; the figure displays 15 topics in all, e.g., sports (game, second, season, team, play, games, players, points, coach, giants), film/TV (life, says, show, man, director, television, film, story, movie, films), politics (bush, political, party, clinton, campaign, republican, democratic, senator, democrats), finance (percent, business, market, companies, stock, bank, financial, fund, investors, funds), the Iraq war (government, officials, war, military, iraq, army, forces, troops, iraqi, soldiers), and crime (street, yesterday, police, man, case, found, officer, shot, officers, charged).]

Chong Wang, Probabilistic Modeling for Large-scale Data Exploration
How to evaluate?
- Test on synthetic text where the ground truth is known:

[Plate diagram: μ, Σ → θ → z → w, with topics β.]
Comparison: accuracy and speed
- L2 error in topic-vector estimation and number of iterations:
  - varying the number of topics
  - varying the vocabulary size
  - varying the number of words per document
Comparison: perplexity
Classification Result on PNAS collection
- PNAS abstracts from 1997-2002
  - 2500 documents
  - Average of 170 words per document
- Fitted 40-topic models using both approaches
- Used the low-dimensional representation to predict the abstract category
  - SVM classifier
  - 85% for training and 15% for testing

Classification accuracy:
- Notable difference
- Examine the low-dimensional representations below
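A sketch of the evaluation protocol (synthetic topic proportions; a nearest-centroid classifier stands in for the SVM so the example stays dependency-free):

```python
import numpy as np

# Represent each document by its low-dimensional topic proportions,
# split 85/15, and classify. All data here are synthetic.
rng = np.random.default_rng(0)

n_per_class = 100  # two classes with different typical topic mixes
class0 = rng.dirichlet([8, 2, 1, 1], size=n_per_class)
class1 = rng.dirichlet([1, 1, 2, 8], size=n_per_class)
X = np.vstack([class0, class1])
y = np.array([0]*n_per_class + [1]*n_per_class)

# 85% train / 15% test split, as on the slide.
idx = rng.permutation(len(y))
cut = int(0.85 * len(y))
tr, te = idx[:cut], idx[cut:]

# Nearest-centroid: assign each test point to the closest class mean.
centroids = np.array([X[tr][y[tr] == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(
    np.linalg.norm(X[te][:, None, :] - centroids[None], axis=2), axis=1)
accuracy = (pred == y[te]).mean()
print(accuracy > 0.9)  # well-separated proportions classify easily
```

The point of the experiment is that the quality of the inferred low-dimensional representation, not the classifier, drives the accuracy gap between the two inference schemes.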
What makes topic models useful? The Zoo of Topic Models!

- It is a building block of many models: Williamson et al., 2010; Chang & Blei, 2009; Boyd-Graber & Blei, 2008; Wang & Blei, 2008; McCallum et al., 2007; Titov & McDonald, 2008
Extending LDA

... by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.
The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, recipient, and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.
The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network: topics which the models that only examine words are unable to detect.
We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters' patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.
In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_st for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V(b) is generated, where each cell V(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL    DESCRIPTION
g_it      entity i's group assignment in topic t
t_b       topic of an event b
w(b)_k    the kth token in the event b
V(b)_ij   entity i and j's groups behaved the same (1) or differently (2) on the event b
S         number of entities
T         number of topics
G         number of groups
B         number of events
V         number of unique words
N_b       number of word tokens in the event b
S_b       number of entities who participated in the event b

Table 1: Notation used in this paper
[Figure 1: The Group-Topic model (plate diagram over the variables w, v, t, g with plates N_b, S_b, T, B, G²).]
same or not on a bill). The elements of V are sampled from a binomial distribution γ(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.
Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities' groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).

When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows,
    t_b ~ Uniform(1/T)
    w_it | φ_t ~ Multinomial(φ_t)
    φ_t | η ~ Dirichlet(η)
    g_it | θ_t ~ Multinomial(θ_t)
    θ_t | α ~ Dirichlet(α)
    V(b)_ij | γ(b)_{g_i g_j} ~ Binomial(γ(b)_{g_i g_j})
    γ(b)_gh | δ ~ Beta(δ).
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and the gamma random variables φ determine the relative masses associated with these atoms.

2.4. Focused Topic Models

Suppose H parametrizes distributions over words. Then the ICD defines a generative topic model, where it is used to generate a set of sparse distributions over an infinite number of components, called "topics." Each topic is drawn from a Dirichlet distribution over words. In order to specify a fully generative model, we sample the number of words for each document from a negative binomial distribution, n(m)_· ~ NB(Σ_k b_mk φ_k, 1/2).²

The generative model for M documents is

1. for k = 1, 2, ...,
   (a) Sample the stick length π_k according to Eq. 1.
   (b) Sample the relative mass φ_k ~ Gamma(γ, 1).
   (c) Draw the topic distribution over words, β_k ~ Dirichlet(η).
2. for m = 1, ..., M,
   (a) Sample a binary vector b_m according to Eq. 1.
   (b) Draw the total number of words, n(m)_· ~ NB(Σ_k b_mk φ_k, 1/2).
   (c) Sample the distribution over topics, θ_m ~ Dirichlet(b_m · φ).
   (d) For each word w_mi, i = 1, ..., n(m)_·,
       i. Draw the topic index z_mi ~ Discrete(θ_m).
       ii. Draw the word w_mi ~ Discrete(β_{z_mi}).

We call this the focused topic model (FTM) because the infinite binary matrix B serves to focus the distribution over topics onto a finite subset (see Figure 1). The number of topics within a single document is almost surely finite, though the total number of topics is unbounded. The topic distribution for the mth document, θ_m, is drawn from a Dirichlet distribution over the topics selected by b_m. The Dirichlet distribution models uncertainty about topic proportions while maintaining the restriction to a sparse set of topics.

The ICD models the distribution over the global topic proportion parameters φ separately from the distribution over the binary matrix B. This captures the idea that a topic may appear infrequently in a corpus, but make up a high proportion of those documents in which it occurs. Conversely, a topic may appear frequently in a corpus, but only with low proportion.

² Notation: n(m)_k is the number of words assigned to the kth topic of the mth document, and we use a dot notation to represent summation, i.e., n(m)
· =P
k n(m)k .
Figure 1. Graphical model for the focused topic model
3. Related ModelsTitsias (2007) introduced the infinite gamma-Poisson pro-cess, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic modelof images. In this model, the distribution over featuresfor the mth image is given by a Dirichlet distribution overthe non-negative elements of the mth row of the infinitegamma-Poisson process matrix, with parameters propor-tional to the values at these elements. While this results ina sparse matrix of distributions, the number of zero entriesin any column of the matrix is correlated with the valuesof the non-zero entries. Columns which have entries withlarge values will not typically be sparse. Therefore, thismodel will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zeroentries is controlled by a separate process, the IBP, fromthe values of the non-zero entries, which are controlled bythe gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009)uses a finite spike and slab model to ensure that each topicis represented by a sparse distribution over words. Thespikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from asymmetric Dirichlet distribution defined over these spikes.The ICD also uses a spike and slab approach, but allowsan unbounded number of “spikes” (due to the IBP) and amore globally informative “slab” (due to the shared gammarandom variables). We extend the SparseTM’s approxima-tion of the expectation of a finite mixture of Dirichlet dis-tributions, to approximate the more complicated mixture ofDirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBPto select subsets of an infinite set of states, to model multi-ple dynamic systems with shared states. (A state in the dy-namic system is like a component in a mixed membershipmodel.) The probability of transitioning from the ith stateto the jth state in the mth dynamic system is drawn from aDirichlet distribution with parameters bmj� + ��i,j , where
Chang, Blei
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Blei and McAuliffe 2007), ensures that the same latent topic assignments used to generate the content of the documents also generate their link structure. Models which do not enforce this coupling, such as Nallapati et al. (2008), might divide the topics into two independent subsets: one for links and the other for words. Such a decomposition prevents these models from making meaningful predictions about links given words and words given links. In Section 4 we demonstrate empirically that the RTM outperforms such models on these tasks.

3 INFERENCE, ESTIMATION, AND PREDICTION

With the model defined, we turn to approximate posterior inference, parameter estimation, and prediction. We develop a variational inference procedure for approximating the posterior. We use this procedure in a variational expectation-maximization (EM) algorithm for parameter estimation. Finally, we show how a model whose parameters have been estimated can be used as a predictive model of words and links.

Inference In posterior inference, we seek to compute the posterior distribution of the latent variables conditioned on the observations. Exact posterior inference is intractable (Blei et al. 2003; Blei and McAuliffe 2007). We appeal to variational methods.

In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully-factorized family,
q(Θ, Z | γ, Φ) = ∏_d [ q_θ(θ_d | γ_d) ∏_n q_z(z_{d,n} | φ_{d,n}) ],   (3)

where γ is a set of Dirichlet parameters, one for each document, and Φ is a set of multinomial parameters, one for each word in each document. Note that E_q[z_{d,n}] = φ_{d,n}.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = Σ_{(d1,d2)} E_q[log p(y_{d1,d2} | z_{d1}, z_{d2}, η, ν)]
  + Σ_d Σ_n E_q[log p(w_{d,n} | β_{1:K}, z_{d,n})]
  + Σ_d Σ_n E_q[log p(z_{d,n} | θ_d)]
  + Σ_d E_q[log p(θ_d | α)] + H(q),   (4)

where (d1, d2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_{d1,d2} is either 1 or unobserved).1 We do this for two reasons.

First, while one can fix y_{d1,d2} = 1 whenever a link is observed between d1 and d2 and set y_{d1,d2} = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_{d1,d2} = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other's existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.

Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model they can be removed when-
1Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words' topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document, and the node's parent's successor weights. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence "Some phrases laid in his mind for years." The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition "of." Thematically, because it is in a travel brochure, we would expect to see words such as "Acapulco," "Costa Rica," or "Australia" more than "kitchen," "debt," or "pocket." Our model can capture these kinds of regularities and exploit them in predictive problems.

Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as "fly," which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.

Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.

The remainder of this paper is organized as follows. We describe the syntactic topic model, and develop an approximate posterior inference technique based on variational methods. We study its performance both on synthetic data and hand-parsed data [10]. We show that the STM captures relationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence "In the near future, you could find yourself in .". The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.

The STM attempts to capture these joint influences on words. It models a document corpus as exchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.

We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.

The rest of the paper is organized as follows. In section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models

In a time-stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.

More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
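To make the parenthetical concrete: if the natural parameters are the logs of the multinomial probabilities, then exponentiating and renormalizing (a softmax) maps the unconstrained parameters back to a probability vector. A minimal sketch:

```python
import math

def softmax(beta):
    """Map natural parameters (log scale) to multinomial probabilities."""
    m = max(beta)                       # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in beta]
    s = sum(exps)
    return [e / s for e in exps]

probs = [0.5, 0.3, 0.2]
beta = [math.log(p) for p in probs]     # natural parameters: logs of probabilities
recovered = softmax(beta)
assert all(abs(p - q) < 1e-9 for p, q in zip(probs, recovered))
```

This mapping is why Gaussian dynamics on the natural parameters (as in the state space model above) induce a logistic normal distribution over the topic's word probabilities.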
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters β_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.
A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.

Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.

In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and Δ_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth
(1 ≤ k ≤ K) topic's parameter at term w is:

β_{0,k,w} ∼ N(m, v_0)
β_{j,k,w} | β_{i,k,w}, s ∼ N(β_{i,k,w}, v Δ_{s_j,s_i}),   (1)

where the variance increases linearly with the lag.

This construction is used as a component in the full generative process. (Note: if j = i + 1, we write Δ_{s_j,s_i} as Δ_{s_j} for short.)

1. For each topic k, 1 ≤ k ≤ K,

(a) Draw β_{0,k} ∼ N(m, v_0 I).
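A minimal simulation of Eq. 1 for a single topic and term, with illustrative values for m, v_0, and v: each increment is Gaussian with variance proportional to the gap between consecutive time stamps, so closely spaced documents get tightly tied parameters.

```python
import random

random.seed(0)
m, v0, v = 0.0, 1.0, 0.1            # prior mean/variance and diffusion rate (illustrative)
stamps = [0.0, 0.5, 0.7, 2.0, 2.1]  # irregularly spaced document time stamps

# beta[j] ~ N(beta[j-1], v * (stamps[j] - stamps[j-1])): Brownian motion per Eq. 1
beta = [random.gauss(m, v0 ** 0.5)]
for i in range(1, len(stamps)):
    delta = stamps[i] - stamps[i - 1]                 # elapsed time between stamps
    beta.append(random.gauss(beta[-1], (v * delta) ** 0.5))

assert len(beta) == len(stamps)
```

Note how the construction never instantiates parameters between observed stamps, which is exactly the memory saving over a finely discretized dDTM.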
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.

Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics: hotels in France, New York hotels, youth hostels, or, similarly, when applied to a collection of Mp3 players' reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even in this case there is a danger that the model will be overwhelmed by very fine-grained global topics, or that the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.

One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific to the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: ". . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50". It can be viewed as a mixture of the topic London shared by the entire review (words: "London", "tube", "£"), and the ratable aspect location, specific to the local context of the sentence (words: "transport", "walk", "bus"). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle, though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics θ^loc_{d,v} and a distribution defining preference for local topics versus global topics π_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution ψ_s. Importantly, the fact that the windows overlap permits exploiting a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introduction of a symmetric Dirichlet prior Dir(γ) for the distribution ψ_s permits control of the smoothness of topic transitions in our model.

The formal definition of the model, with K_gl global and K_loc local topics, is the following. First, draw K_gl word distributions for global topics φ^gl_z from a Dirichlet prior Dir(β^gl) and K_loc word distributions for local topics φ^loc_{z'} from Dir(β^loc). Then, for each document d:

• Choose a distribution of global topics θ^gl_d ∼ Dir(α^gl).

• For each sentence s choose a distribution ψ_{d,s}(v) ∼ Dir(γ).
• For each sliding window v
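The sliding-window machinery can be sketched in a few lines. The helper below (hypothetical, not the authors' code) enumerates the windows covering a sentence, showing how overlapping windows let adjacent sentences share a co-occurrence domain; the final two draws stand in for the paper's window-selection distribution and local-versus-global mixture choice.

```python
import random

random.seed(1)
T = 3                # window length in sentences (illustrative)
num_sentences = 7

def covering_windows(s, num_sentences, T):
    """Window v covers sentences v .. v+T-1, so sentence s is covered by
    windows v in [s-T+1, s], clipped to the valid range."""
    return list(range(max(0, s - T + 1), min(s, num_sentences - T) + 1))

s = 4
windows = covering_windows(s, num_sentences, T)
assert all(v <= s <= v + T - 1 for v in windows)   # every window indeed covers s

# To generate a word in sentence s: pick a covering window, then local vs global.
v = random.choice(windows)          # stand-in for the window-selection draw
is_local = random.random() < 0.5    # stand-in for the local/global mixture draw
```

Because consecutive sentences share most of their covering windows, a local topic active in one window is naturally smoothed across neighboring sentences, which is the effect the Dirichlet prior on the window choice controls.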
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
[Figure 1 shows four graphical models: Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]; the Author Model (Multi-label Mixture Model) [McCallum 1999]; the Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]; and the Author-Recipient-Topic Model (ART) [this paper].]
Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, φ_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, θ, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, θ, and authors are sampled uniformly from the observed list of the document's authors. In the Author-Recipient-Topic model, there is a separate topic distribution for each author-recipient pair, and the selection of topic distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution θ_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution φ_z. However, as described previously, none of these models is suitable for modeling message data.
An email message has one sender and, in general, more than one recipient. We could treat both the sender and the recipients as "authors" of the message, and then employ the AT model, but this does not distinguish the author and the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable since they do not reflect our expertise or roles.

Alternatively we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we lose all information about the recipients, and the connections between people implied by the sender-recipient relationships.
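The distinction the ART model draws can be seen in a toy generative sketch: the topic distribution is indexed by the (author, recipient) pair, so the same author writing to different recipients produces differently themed messages. All names and distributions below are illustrative, not estimated values.

```python
import random

random.seed(2)
vocab = ["budget", "meeting", "game", "score"]
topics = {0: [0.4, 0.4, 0.1, 0.1],     # toy topic-word multinomials (phi_z)
          1: [0.1, 0.1, 0.4, 0.4]}
# toy per author-recipient topic distributions (theta for each (a, r) pair)
theta = {("manager", "secretary"): [0.9, 0.1],
         ("manager", "friend"):    [0.2, 0.8]}

def generate_message(author, recipients, n_words):
    """Sample one message: uniform recipient, then topic and word per token."""
    r = random.choice(recipients)                   # recipient drawn uniformly
    words = []
    for _ in range(n_words):
        z = random.choices([0, 1], weights=theta[(author, r)])[0]
        words.append(random.choices(vocab, weights=topics[z])[0])
    return r, words

r, words = generate_message("manager", ["secretary", "friend"], 5)
assert len(words) == 5 and all(w in vocab for w in words)
```

Conditioning on the pair rather than the author alone is what lets the model separate, e.g., manager-to-secretary requests from the same manager's personal mail.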
What makes topic models useful?

• It is a building block of many models.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation, similarity, visualization, summarization, and other applications.
© Eric Xing @ CMU, 2005-2013 37
Extending LDA

by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.
The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, the recipient, and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.

The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network, topics which models that only examine words are unable to detect.

We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters' patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.

In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_{st} for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V^(b) is generated, where each cell V^(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL    DESCRIPTION
g_{it}    entity i's group assignment in topic t
t_b       topic of an event b
w^(b)_k   the kth token in the event b
V^(b)_ij  whether entity i's and j's groups behaved the same (1) or differently (2) on the event b
S         number of entities
T         number of topics
G         number of groups
B         number of events
V         number of unique words
N_b       number of word tokens in the event b
S_b       number of entities who participated in the event b
Table 1: Notation used in this paper
 
Figure 1: The Group-Topic model
same or not on a bill). The elements of V are sampled from a binomial distribution with parameter γ^(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.
Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities' groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).

When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows:
t_b ∼ Uniform(1/T)
w_{it} | φ_t ∼ Multinomial(φ_t)
φ_t | η ∼ Dirichlet(η)
g_{it} | θ_t ∼ Multinomial(θ_t)
θ_t | α ∼ Dirichlet(α)
V^(b)_{ij} | γ^(b)_{g_i g_j} ∼ Binomial(γ^(b)_{g_i g_j})
γ^(b)_{gh} | δ ∼ Beta(δ).
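As a sanity check of the generative story, here is a toy end-to-end sample of one event. The Greek parameter names and hyperparameter values are illustrative (the originals are unreadable in this copy), and the binomial over V entries is reduced to a single Bernoulli draw coded as 1 (same) or 2 (different).

```python
import random

random.seed(3)
T, G, S = 2, 2, 4                    # topics, groups, entities (toy sizes)
vocab = ["tax", "trade", "arms", "peace"]

def dirichlet(alpha, k):
    """Sample a k-dimensional symmetric Dirichlet(alpha) via Gamma draws."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    return [x / sum(g) for x in g]

phi = [dirichlet(0.5, len(vocab)) for _ in range(T)]   # per-topic word dists
theta = [dirichlet(1.0, G) for _ in range(T)]          # per-topic group dists
gamma = [[random.betavariate(1, 1) for _ in range(G)] for _ in range(G)]

t = random.randrange(T)                                # topic of the event
words = random.choices(vocab, weights=phi[t], k=6)     # event's word attributes
g = [random.choices(range(G), weights=theta[t])[0] for _ in range(S)]
# V[i][j] = 1 if entities i and j's groups behaved the same, 2 otherwise
V = [[1 if random.random() < gamma[g[i]][g[j]] else 2 for j in range(S)]
     for i in range(S)]

assert all(V[i][j] in (1, 2) for i in range(S) for j in range(S))
```

The key coupling is visible in the code: the group assignments g depend on the sampled topic t, so word usage and voting blocs are explained jointly rather than separately.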
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and the gamma random variables φ determine the relative masses associated with these atoms.
2.4. Focused Topic Models
Suppose H parametrizes distributions over words. Then, the ICD defines a generative topic model, where it is used to generate a set of sparse distributions over an infinite number of components, called "topics." Each topic is drawn from a Dirichlet distribution over words. In order to specify a fully generative model, we sample the number of words for each document from a negative binomial distribution, n^(m)_· ∼ NB(Σ_k b_{mk} φ_k, 1/2).2
The generative model for M documents is

1. for k = 1, 2, . . . ,
(a) Sample the stick length π_k according to Eq. 1.
(b) Sample the relative mass φ_k ∼ Gamma(γ, 1).
(c) Draw the topic distribution over words, β_k ∼ Dirichlet(η).

2. for m = 1, . . . , M,
(a) Sample a binary vector b_m according to Eq. 1.
(b) Draw the total number of words, n^(m)_· ∼ NB(Σ_k b_{mk} φ_k, 1/2).
(c) Sample the distribution over topics, θ_m ∼ Dirichlet(b_m · φ).
(d) For each word w_mi, i = 1, . . . , n^(m)_·,
    i. Draw the topic index z_mi ∼ Discrete(θ_m).
    ii. Draw the word w_mi ∼ Discrete(β_{z_mi}).
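A toy simulation of the per-document part of this process, with the IBP truncated to K topics and fixed illustrative stick lengths, and the negative binomial NB(r, 1/2) drawn through its Gamma-Poisson mixture (λ ∼ Gamma(r, 1), then n ∼ Poisson(λ)). Symbol names are standard choices, not necessarily the paper's.

```python
import math
import random

random.seed(4)
K, M = 5, 3                          # truncated topic count, documents (toy)

def dirichlet(alphas):
    """Dirichlet via normalized Gamma draws; zero alphas give zero mass."""
    g = [random.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def poisson(lam):
    """Knuth's method; adequate for the small rates used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

mass = [random.gammavariate(1.0, 1.0) for _ in range(K)]  # relative masses
pi = [0.8 ** (k + 1) for k in range(K)]                   # toy IBP stick lengths

for m in range(M):
    b = [1 if random.random() < pi[k] else 0 for k in range(K)]  # focused topics
    if sum(b) == 0:
        b[0] = 1                                 # ensure at least one topic
    r = sum(bk * mk for bk, mk in zip(b, mass))
    n_words = poisson(random.gammavariate(r, 1.0))   # NB(r, 1/2) via Gamma-Poisson
    theta = dirichlet([float(bk) for bk in b])       # Dirichlet over selected topics
    assert all(t == 0.0 for t, bk in zip(theta, b) if bk == 0)
```

The final assertion makes the "focusing" visible: topics switched off by b_m receive exactly zero proportion, while the shared masses still shape document length through r.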
We call this the focused topic model (FTM) because the infinite binary matrix B serves to focus the distribution over topics onto a finite subset (see Figure 1). The number of topics within a single document is almost surely finite, though the total number of topics is unbounded. The topic distribution for the mth document, θ_m, is drawn from a Dirichlet distribution over the topics selected by b_m. The Dirichlet distribution models uncertainty about topic proportions while maintaining the restriction to a sparse set of topics.

The ICD models the distribution over the global topic proportion parameters φ separately from the distribution over the binary matrix B. This captures the idea that a topic may appear infrequently in a corpus, but make up a high proportion of those documents in which it occurs. Conversely, a topic may appear frequently in a corpus, but only with low proportion.
²Notation: n^(m)_k is the number of words assigned to the kth topic of the mth document, and we use a dot to represent summation, i.e., n^(m)_· = Σ_k n^(m)_k.
Figure 1. Graphical model for the focused topic model
3. Related Models
Titsias (2007) introduced the infinite gamma-Poisson process, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic model of images. In this model, the distribution over features for the mth image is given by a Dirichlet distribution over the non-negative elements of the mth row of the infinite gamma-Poisson process matrix, with parameters proportional to the values at these elements. While this results in a sparse matrix of distributions, the number of zero entries in any column of the matrix is correlated with the values of the non-zero entries. Columns which have entries with large values will not typically be sparse. Therefore, this model will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zero entries is controlled by a separate process, the IBP, from the values of the non-zero entries, which are controlled by the gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009) uses a finite spike and slab model to ensure that each topic is represented by a sparse distribution over words. The spikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from a symmetric Dirichlet distribution defined over these spikes. The ICD also uses a spike and slab approach, but allows an unbounded number of “spikes” (due to the IBP) and a more globally informative “slab” (due to the shared gamma random variables). We extend the SparseTM’s approximation of the expectation of a finite mixture of Dirichlet distributions to approximate the more complicated mixture of Dirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBP to select subsets of an infinite set of states, to model multiple dynamic systems with shared states. (A state in the dynamic system is like a component in a mixed membership model.) The probability of transitioning from the ith state to the jth state in the mth dynamic system is drawn from a Dirichlet distribution with parameters b_mj γ + κ δ_i,j, where
Chang, Blei
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Blei and McAuliffe 2007), ensures that the same latent topic assignments used to generate the content of the documents also generate their link structure. Models which do not enforce this coupling, such as Nallapati et al. (2008), might divide the topics into two independent subsets: one for links and the other for words. Such a decomposition prevents these models from making meaningful predictions about links given words and words given links. In Section 4 we demonstrate empirically that the RTM outperforms such models on these tasks.
3 INFERENCE, ESTIMATION, AND PREDICTION
With the model defined, we turn to approximate posterior inference, parameter estimation, and prediction. We develop a variational inference procedure for approximating the posterior. We use this procedure in a variational expectation-maximization (EM) algorithm for parameter estimation. Finally, we show how a model whose parameters have been estimated can be used as a predictive model of words and links.
Inference   In posterior inference, we seek to compute the posterior distribution of the latent variables conditioned on the observations. Exact posterior inference is intractable (Blei et al. 2003; Blei and McAuliffe 2007). We appeal to variational methods.
In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully-factorized family,

q(Θ, Z | γ, Φ) = Π_d [ q_θ(θ_d | γ_d) Π_n q_z(z_d,n | φ_d,n) ],   (3)

where γ is a set of Dirichlet parameters, one for each document, and Φ is a set of multinomial parameters, one for each word in each document. Note that E_q[z_d,n] = φ_d,n.
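Under this fully-factorized family, the expectations needed for variational inference have closed forms. A small sketch (variable names are illustrative; `scipy` supplies the digamma function):

```python
import numpy as np
from scipy.special import digamma

def expected_log_theta(gamma_d):
    """E_q[log theta_d] when q(theta_d) = Dirichlet(gamma_d)."""
    return digamma(gamma_d) - digamma(gamma_d.sum())

gamma_d = np.array([2.0, 1.0, 3.0])   # variational Dirichlet parameters for one document
phi_dn = np.array([0.2, 0.3, 0.5])    # variational multinomial for one word; E_q[z_dn] = phi_dn
# E_q[log p(z_dn | theta_d)] = phi_dn . E_q[log theta_d]:
term = float(phi_dn @ expected_log_theta(gamma_d))
```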
Minimizing the relative entropy is equivalent to maximizing Jensen’s lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = Σ_(d1,d2) E_q[log p(y_d1,d2 | z_d1, z_d2, η, ν)]
  + Σ_d Σ_n E_q[log p(w_d,n | β_1:K, z_d,n)]
  + Σ_d Σ_n E_q[log p(z_d,n | θ_d)]
  + Σ_d E_q[log p(θ_d | α)] + H(q),   (4)
where (d1, d2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
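The first term involves the link likelihood p(y_d1,d2 | z_d1, z_d2, ...). This excerpt does not spell that likelihood out, so the following is only a plausible sketch, assuming a sigmoid of a weighted inner product of the two documents' mean topic assignments; the names `eta`, `nu`, and `zbar` are assumptions here, not taken from the excerpt:

```python
import numpy as np

def link_prob(zbar_d1, zbar_d2, eta, nu):
    """P(y = 1) as a sigmoid of a weighted inner product of mean topic assignments."""
    s = float(eta @ (zbar_d1 * zbar_d2) + nu)
    return 1.0 / (1.0 + np.exp(-s))

# Two documents with similar topic proportions yield a higher link probability:
eta, nu = np.ones(3), -1.0
p_similar = link_prob(np.array([0.8, 0.1, 0.1]), np.array([0.7, 0.2, 0.1]), eta, nu)
p_dissimilar = link_prob(np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8]), eta, nu)
```

A link function of this shape is what couples the link structure to the same topic assignments that generate the words.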
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_d1,d2 is either 1 or unobserved).¹ We do this for two reasons.
First, while one can fix y_d1,d2 = 1 whenever a link is observed between d1 and d2 and set y_d1,d2 = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_d1,d2 = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other’s existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.
Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model they can be removed when-
¹Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
[Figure 1: (a) Overall Graphical Model, with parse trees grouped into M documents; (b) Sentence Graphical Model, showing the automatic parse of “Some phrases laid in his mind for years.”]
Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words’ topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document θ, and the node’s parent’s successor weights π. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence “Some phrases laid in his mind for years.” The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition “of.” Thematically, because it is in a travel brochure, we would expect to see words such as “Acapulco,” “Costa Rica,” or “Australia” more than “kitchen,” “debt,” or “pocket.” Our model can capture these kinds of regularities and exploit them in predictive problems.
Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as “fly,” which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.
Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.
The remainder of this paper is organized as follows. We describe the syntactic topic model, and develop an approximate posterior inference technique based on variational methods. We study its performance both on synthetic data and hand-parsed data [10]. We show that the STM captures relationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence “In the near future, you could find yourself in .” The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.
The STM attempts to capture these joint influences on words. It models a document corpus as exchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.
We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.
The rest of the paper is organized as follows. In Section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In Section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models
In a time-stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.
More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
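A minimal illustration of that parameterization: the natural parameters live in an unconstrained space, and a softmax map recovers the word probabilities (the helper name is ours, not the paper's):

```python
import numpy as np

def to_word_probs(beta_k):
    """Map a topic's natural parameters (log-scale) to a multinomial over the vocabulary."""
    e = np.exp(beta_k - beta_k.max())   # subtract the max for numerical stability
    return e / e.sum()

beta_k = np.array([0.5, -1.0, 2.0, 0.0])   # unconstrained natural parameters
p = to_word_probs(beta_k)                   # a proper distribution over 4 words
```

Working in the unconstrained space is what allows Gaussian (state space or Brownian) dynamics to be placed on the topics.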
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters β_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.
A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.
Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.
In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and Δ_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth (1 ≤ k ≤ K) topic’s parameter at term w is:

β_{0,k,w} ∼ N(m, v_0)
β_{j,k,w} | β_{i,k,w}, s ∼ N(β_{i,k,w}, v Δ_{s_j,s_i}),   (1)

where the variance increases linearly with the lag.

This construction is used as a component in the full generative process. (Note: if j = i + 1, we write Δ_{s_j,s_i} as Δ_{s_j} for short.)

1. For each topic k, 1 ≤ k ≤ K,

   (a) Draw β_{0,k} ∼ N(m, v_0 I).
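Eq. 1 can be simulated directly: each topic parameter performs a Gaussian random walk whose step variance scales with the elapsed time between consecutive document time stamps. All values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 10                       # vocabulary size (illustrative)
m, v0, v = 0.0, 1.0, 0.05    # prior mean, initial variance, diffusion variance per unit time
stamps = np.array([0.0, 1.0, 1.5, 4.0])   # irregularly spaced document time stamps

beta = [rng.normal(m, np.sqrt(v0), size=V)]              # beta_0 ~ N(m, v0)
for j in range(1, len(stamps)):
    delta = stamps[j] - stamps[j - 1]                    # elapsed time Delta
    beta.append(rng.normal(beta[-1], np.sqrt(v * delta)))  # N(previous, v * Delta)
beta = np.array(beta)        # shape (num_stamps, V): one parameter vector per time stamp
```

Because the step variance is v·Δ, long gaps between documents simply produce larger diffusion, with no need to represent the intermediate ticks that the dDTM would require.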
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In Section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.
Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics such as hotels in France, New York hotels, or youth hostels; similarly, when applied to a collection of Mp3 players’ reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even then there is a danger that the model will be overwhelmed by very fine-grained global topics, or that the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.
One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram
models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific to the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: “. . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50”. It can be viewed as a mixture of the topic London shared by the entire review (words: “London”, “tube”, “£”), and the ratable aspect location, specific to the local context of the sentence (words: “transport”, “walk”, “bus”). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle, though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics θ^loc_{d,v} and a distribution defining preference for local topics versus global topics π_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution ψ_s. Importantly, the fact that the windows overlap makes it possible to exploit a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introducing a symmetric Dirichlet prior Dir(γ) for the distribution ψ_s makes it possible to control the smoothness of topic transitions in our model.
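The window bookkeeping is easy to make concrete. A hypothetical helper (not the authors' code) that lists the sliding windows of T adjacent sentences covering a given sentence:

```python
def windows_covering(s, n_sentences, T):
    """Indices of the length-T sliding windows that contain sentence s.

    Window i covers sentences i, i+1, ..., i+T-1.
    """
    first = max(0, s - T + 1)
    last = min(s, n_sentences - T)
    return list(range(first, last + 1))

# With 5 sentences and T = 3, sentence 2 is covered by windows 0, 1, and 2,
# so a word in that sentence may be generated via any of those three windows.
covering = windows_covering(2, 5, 3)
```

Because adjacent windows share sentences, co-occurrence information flows across window boundaries, which is what lets MG-LDA capture local topics without explicit topic-transition machinery.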
The formal definition of the model with K_gl global and K_loc local topics is the following. First, draw K_gl word distributions for global topics φ^gl_z from a Dirichlet prior Dir(β^gl) and K_loc word distributions for local topics φ^loc_{z′} from Dir(β^loc). Then, for each document d:

• Choose a distribution of global topics θ^gl_d ∼ Dir(α^gl).
• For each sentence s choose a distribution ψ_{d,s}(v) ∼ Dir(γ).
• For each sliding window v
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
[Figure 1 panels: Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]; Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]; Author-Recipient-Topic Model (ART) [This paper]; Author Model (Multi-label Mixture Model) [McCallum 1999]]
Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, φ_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, θ, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, θ, and authors are sampled uniformly from the observed list of the document’s authors. In the Author-Recipient-Topic model, there is a separate topic distribution for each author-recipient pair, and the selection of topic distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution θ_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution φ_z. However, as described previously, none of these models is suitable for modeling message data.
An email message has one sender and in general more than one recipient. We could treat both the sender and the recipients as “authors” of the message, and then employ the AT model, but this does not distinguish the author from the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable, since they do not reflect our expertise or roles.
Alternatively, we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we lose all information about the recipients, and the connections between people implied by the sender-recipient relationships.
What makes topic models useful?

• It is a building block of many models.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation, similarity, visualization, summarization, and other applications.

© Eric Xing @ CMU, 2005-2013 37
Extending LDA

by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.
The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, recipient and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.
The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network: topics which the models that only examine words are unable to detect.
We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters’ patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.
In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_st for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V^(b) is generated, where each cell V^(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL      DESCRIPTION
g_it        entity i’s group assignment in topic t
t_b         topic of an event b
w^(b)_k     the kth token in the event b
V^(b)_ij    whether entity i’s and j’s groups behaved the same (1) or differently (2) on the event b
S           number of entities
T           number of topics
G           number of groups
B           number of events
V           number of unique words
N_b         number of word tokens in the event b
S_b         number of entities who participated in the event b

Table 1: Notation used in this paper
Figure 1: The Group-Topic model
same or not on a bill). The elements of V are sampled from a binomial distribution with parameter γ^(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.
Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities’ groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).
When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows,
t_b ∼ Uniform(1/T)
w_it | φ_t ∼ Multinomial(φ_t)
φ_t | β ∼ Dirichlet(β)
g_it | θ_t ∼ Multinomial(θ_t)
θ_t | α ∼ Dirichlet(α)
V^(b)_ij | γ^(b)_{g_i g_j} ∼ Binomial(γ^(b)_{g_i g_j})
γ^(b)_gh | δ ∼ Beta(δ).
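This generative process can be simulated end to end as a toy example. The Greek-letter names and all sizes below are illustrative reconstructions of the garbled symbols in this excerpt, not a verbatim transcription of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
T, G, S, V, B, Nb = 3, 2, 10, 20, 4, 8   # topics, groups, entities, vocab, events, tokens/event

phi = rng.dirichlet(0.5 * np.ones(V), size=T)    # per-topic word distributions
theta = rng.dirichlet(1.0 * np.ones(G), size=T)  # per-topic group distributions
events = []
for b_idx in range(B):
    t = rng.integers(T)                           # event topic, uniform over topics
    words = rng.choice(V, size=Nb, p=phi[t])      # word attributes of the event
    g = rng.choice(G, size=S, p=theta[t])         # group assignment for each entity
    gamma = rng.beta(1.0, 1.0, size=(G, G))       # per-pair agreement probabilities
    # V^(b)_ij: 1 if entities i and j behaved the same on this event, else 0
    # (the paper codes same/different as 1/2; this sketch uses 1/0)
    agree = rng.binomial(1, gamma[g[:, None], g[None, :]])
    events.append((t, words, g, agree))
```

Note how the group assignments g depend on the event topic t: this is the coupling that makes the discovered coalitions topic-specific.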
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and thegamma random variables � determine the relative massesassociated with these atoms.
2.4. Focused Topic Models
Suppose H parametrizes distributions over words. Then,the ICD defines a generative topic model, where it is usedto generate a set of sparse distributions over an infinite num-ber of components, called “topics.” Each topic is drawnfrom a Dirichlet distribution over words. In order to specifya fully generative model, we sample the number of wordsfor each document from a negative binomial distribution,n(m)
· � NB(�
k bmk�k, 1/2).2
The generative model for M documents is
1. for k = 1, 2, . . . ,
(a) Sample the stick length �k according to Eq. 1.(b) Sample the relative mass �k � Gamma(�, 1).(c) Draw the topic distribution over words,
�k � Dirichlet(�).
2. for m = 1, . . . , M ,
(a) Sample a binary vector bm according to Eq. 1.(b) Draw the total number of words,
n(m)· � NB(
�k bmk�k, 1/2).
(c) Sample the distribution over topics,�m � Dirichlet(bm · �).
(d) For each word wmi, i = 1, . . . , n(m)· ,
i. Draw the topic index zmi � Discrete(�m).ii. Draw the word wmi � Discrete(�zmi
).
We call this the focused topic model (FTM) because theinfinite binary matrix B serves to focus the distributionover topics onto a finite subset (see Figure 1). The numberof topics within a single document is almost surely finite,though the total number of topics is unbounded. The topicdistribution for the mth document, �m, is drawn from aDirichlet distribution over the topics selected by bm. TheDirichlet distribution models uncertainty about topic pro-portions while maintaining the restriction to a sparse set oftopics.
The ICD models the distribution over the global topic pro-portion parameters � separately from the distribution overthe binary matrix B. This captures the idea that a topic mayappear infrequently in a corpus, but make up a high propor-tion of those documents in which it occurs. Conversely, atopic may appear frequently in a corpus, but only with lowproportion.
2Notation n(m)k is the number of words assigned to the kth
topic of the mth document, and we use a dot notation to representsummation - i.e. n(m)
· =P
k n(m)k .
Figure 1. Graphical model for the focused topic model
3. Related ModelsTitsias (2007) introduced the infinite gamma-Poisson pro-cess, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic modelof images. In this model, the distribution over featuresfor the mth image is given by a Dirichlet distribution overthe non-negative elements of the mth row of the infinitegamma-Poisson process matrix, with parameters propor-tional to the values at these elements. While this results ina sparse matrix of distributions, the number of zero entriesin any column of the matrix is correlated with the valuesof the non-zero entries. Columns which have entries withlarge values will not typically be sparse. Therefore, thismodel will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zeroentries is controlled by a separate process, the IBP, fromthe values of the non-zero entries, which are controlled bythe gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009)uses a finite spike and slab model to ensure that each topicis represented by a sparse distribution over words. Thespikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from asymmetric Dirichlet distribution defined over these spikes.The ICD also uses a spike and slab approach, but allowsan unbounded number of “spikes” (due to the IBP) and amore globally informative “slab” (due to the shared gammarandom variables). We extend the SparseTM’s approxima-tion of the expectation of a finite mixture of Dirichlet dis-tributions, to approximate the more complicated mixture ofDirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBP to select subsets of an infinite set of states, to model multiple dynamic systems with shared states. (A state in the dynamic system is like a component in a mixed membership model.) The probability of transitioning from the ith state to the jth state in the mth dynamic system is drawn from a Dirichlet distribution with parameters b_mj γ + κ δ_{i,j}, where
Chang, Blei
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Blei and McAuliffe 2007), ensures that the same latent topic assignments used to generate the content of the documents also generate their link structure. Models which do not enforce this coupling, such as Nallapati et al. (2008), might divide the topics into two independent subsets, one for links and the other for words. Such a decomposition prevents these models from making meaningful predictions about links given words and words given links. In Section 4 we demonstrate empirically that the RTM outperforms such models on these tasks.
3 INFERENCE, ESTIMATION, AND PREDICTION

With the model defined, we turn to approximate posterior inference, parameter estimation, and prediction. We develop a variational inference procedure for approximating the posterior. We use this procedure in a variational expectation-maximization (EM) algorithm for parameter estimation. Finally, we show how a model whose parameters have been estimated can be used as a predictive model of words and links.

Inference In posterior inference, we seek to compute the posterior distribution of the latent variables conditioned on the observations. Exact posterior inference is intractable (Blei et al. 2003; Blei and McAuliffe 2007). We appeal to variational methods.
In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully factorized family,

q(Θ, Z | γ, Φ) = Π_d [ q_θ(θ_d | γ_d) Π_n q_z(z_{d,n} | φ_{d,n}) ],   (3)

where γ is a set of Dirichlet parameters, one for each document, and Φ is a set of multinomial parameters, one for each word in each document. Note that E_q[z_{d,n}] = φ_{d,n}.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = Σ_{(d1,d2)} E_q[log p(y_{d1,d2} | z_{d1}, z_{d2}, η, ν)]
  + Σ_d Σ_n E_q[log p(w_{d,n} | β_{1:K}, z_{d,n})]
  + Σ_d Σ_n E_q[log p(z_{d,n} | θ_d)]
  + Σ_d E_q[log p(θ_d | α)] + H(q),   (4)

where (d1, d2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
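To make the factorized family concrete, here is a minimal numerical sketch (hypothetical variational parameter values, NumPy only, not code from the paper) of the identity E_q[z_{d,n}] = φ_{d,n} and of evaluating the topic-assignment term of the ELBO by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N_d = 4, 6    # number of topics and words in one document (toy sizes)

# Hypothetical variational parameters for a single document:
gamma_d = rng.uniform(0.5, 2.0, size=K)       # Dirichlet parameters of q(theta_d)
phi_d = rng.dirichlet(np.ones(K), size=N_d)   # multinomial parameters of q(z_{d,n})

# Under the fully factorized family, E_q[z_{d,n}] = phi_{d,n} directly.
E_z = phi_d

# Monte Carlo estimate of E_q[log theta_d] under Dirichlet(gamma_d):
theta_samples = rng.dirichlet(gamma_d, size=20000)
E_log_theta = np.log(theta_samples).mean(axis=0)

# The ELBO term sum_n E_q[log p(z_{d,n} | theta_d)] factorizes as phi . E_log_theta:
assignment_term = float((phi_d @ E_log_theta).sum())
```

Because each factor of q is a standard exponential-family distribution, every expectation in the bound decomposes this way into simple functions of the free parameters.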
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_{d1,d2} is either 1 or unobserved).¹ We do this for two reasons.

First, while one can fix y_{d1,d2} = 1 whenever a link is observed between d1 and d2 and set y_{d1,d2} = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_{d1,d2} = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other's existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.

Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model they can be removed when-

¹Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
[Figure 1 graphics: (a) Overall Graphical Model, with parse trees grouped into M documents; (b) Sentence Graphical Model, showing latent topics z_1, ..., z_7 generating the words of an example sentence.]
Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words' topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document θ, and the node's parent's successor weights π. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence "Some phrases laid in his mind for years." The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition "of." Thematically, because it is in a travel brochure, we would expect to see words such as "Acapulco," "Costa Rica," or "Australia" more than "kitchen," "debt," or "pocket." Our model can capture these kinds of regularities and exploit them in predictive problems.

Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as "fly," which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.

Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.
The remainder of this paper is organized as follows. We describe the syntactic topic model, anddevelop an approximate posterior inference technique based on variational methods. We study itsperformance both on synthetic data and hand parsed data [10]. We show that the STM capturesrelationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence "In the near future, you could find yourself in .". The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.
The STM attempts to capture these joint influences on words. It models a document corpus asexchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.

We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.

The rest of the paper is organized as follows. In Section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In Section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models

In a time stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.

More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
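As a small illustration of that parenthetical remark (toy numbers of our own, not from the paper), the natural parameters of a multinomial map back to the probability simplex through a softmax, and differ from the log-probabilities only by a shared constant:

```python
import numpy as np

# Toy natural parameters for a multinomial over a four-word vocabulary
# (illustrative values; in the dDTM these evolve under a state space model).
beta = np.array([1.2, -0.3, 0.0, 2.1])

# Map natural parameters back to the simplex with a (shifted) softmax:
probs = np.exp(beta - beta.max())
probs /= probs.sum()

# The natural parameters equal the log-probabilities up to one shared offset:
offset = beta - np.log(probs)
```

This is why the parameterization is convenient for time-series modeling: Gaussian noise can be added to the unconstrained natural parameters without leaving the family.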
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters β_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.

A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.

Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.
In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and Δ_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth (1 ≤ k ≤ K) topic's parameter at term w is:

β_{0,k,w} ~ N(m, v_0)
β_{j,k,w} | β_{i,k,w}, s ~ N(β_{i,k,w}, v Δ_{s_j,s_i}),   (1)

where the variance increases linearly with the lag.
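The Brownian-motion evolution is easy to simulate; the following sketch (assumed toy values for m, v_0, and v, our own illustration) draws one topic's parameters along irregularly spaced time stamps, with each step's variance proportional to the elapsed time:

```python
import numpy as np

rng = np.random.default_rng(1)
W = 5                       # vocabulary size (toy)
m, v0, v = 0.0, 1.0, 0.05   # hypothetical mean and variance hyperparameters

# beta_{0,k,w} ~ N(m, v0): initial natural parameters of one topic.
beta = rng.normal(m, np.sqrt(v0), size=W)

# Irregularly spaced document time stamps; each step's variance is v times the lag.
stamps = [0.0, 0.5, 0.7, 3.2]
trajectory = [beta]
for s_i, s_j in zip(stamps, stamps[1:]):
    delta = s_j - s_i                              # elapsed time Delta_{s_j, s_i}
    beta = rng.normal(beta, np.sqrt(v * delta))    # beta_j | beta_i ~ N(beta_i, v*Delta)
    trajectory.append(beta)
```

Note that no parameters are needed for the gaps between observed stamps, which is the source of the memory savings over the dDTM.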
This construction is used as a component in the full generative process. (Note: if j = i + 1, we write Δ_{s_j,s_i} as Δ_{s_j} for short.)

1. For each topic k, 1 ≤ k ≤ K,
   (a) Draw β_{0,k} ~ N(m, v_0 I).
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In Section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.

Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics: hotels in France, New York hotels, youth hostels, or, similarly, when applied to a collection of Mp3 players' reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even in this case there is a danger that the model will be overwhelmed by very fine-grain global topics, or the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.

One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific to the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: ". . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50". It can be viewed as a mixture of topic London shared by the entire review (words: "London", "tube", "£"), and the ratable aspect location, specific for the local context of the sentence (words: "transport", "walk", "bus"). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle, though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics θ^loc_{d,v} and a distribution defining preference for local topics versus global topics π_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution ψ_s. Importantly, the fact that the windows overlap makes it possible to exploit a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introducing a symmetric Dirichlet prior Dir(γ) for the distribution ψ_s allows controlling the smoothness of topic transitions in our model.
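A minimal sketch of the window bookkeeping described above (the helper `windows_covering` is our own illustration, not from the paper): with windows of T adjacent sentences that may overhang the document edges, every sentence is covered by exactly T overlapping windows, which is what enlarges the co-occurrence domain.

```python
def windows_covering(num_sentences, T):
    """Map each sentence index to the sliding windows (start indices) covering it."""
    cover = {s: [] for s in range(num_sentences)}
    for v in range(-(T - 1), num_sentences):          # windows may overhang the edges
        for s in range(max(v, 0), min(v + T, num_sentences)):
            cover[s].append(v)
    return cover

cover = windows_covering(num_sentences=5, T=3)
# Every sentence is covered by exactly T windows, so adjacent sentences share
# windows and their words can co-occur under the same local topic distribution.
assert all(len(vs) == 3 for vs in cover.values())
```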
The formal definition of the model with K^gl global and K^loc local topics is the following. First, draw K^gl word distributions for global topics φ^gl_z from a Dirichlet prior Dir(β^gl) and K^loc word distributions for local topics φ^loc_{z'} from Dir(β^loc). Then, for each document d:

• Choose a distribution of global topics θ^gl_d ~ Dir(α^gl).
• For each sentence s choose a distribution ψ_{d,s}(v) ~ Dir(γ).
• For each sliding window v
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
[Figure 1 graphics: plate diagrams of Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]; the Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]; the Author-Recipient-Topic Model (ART) [this paper]; and the Author Model (Multi-label Mixture Model) [McCallum 1999].]
Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, φ_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, θ, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, θ, and authors are sampled uniformly from the observed list of the document's authors. In the Author-Recipient-Topic model, there is a separate topic-distribution for each author-recipient pair, and the selection of topic-distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution θ_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution φ_z. However, as described previously, none of these models is suitable for modeling message data.

An email message has one sender and, in general, more than one recipient. We could treat both the sender and the recipients as "authors" of the message, and then employ the AT model, but this does not distinguish the author and the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable, since they do not reflect our expertise or roles.

Alternatively, we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we lose all information about the recipients, and the connections between people implied by the sender-recipient relationships.
What makes topic models useful?

• It is a building block of many models.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation, similarity, visualization, summarization, and other applications.

© Eric Xing @ CMU, 2005-2013
Extending LDA

by emerging groups. Both modalities are driven by the common goal of increasing data likelihood. Consider the voting example again; resolutions that would have been assigned the same topic in a model using words alone may be assigned to different topics if they exhibit distinct voting patterns. Distinct word-based topics may be merged if the entities vote very similarly on them. Likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics.

The importance of modeling the language associated with interactions between people has recently been demonstrated in the Author-Recipient-Topic (ART) model [16]. In ART the words in a message between people in a network are generated conditioned on the author, recipient, and a set of topics that describes the message. The model thus captures both the network structure within which the people interact as well as the language associated with the interactions. In experiments with Enron and academic email, the ART model is able to discover role similarity of people better than SNA models that consider network connectivity alone. However, the ART model does not explicitly capture groups formed by entities in the network.

The GT model simultaneously clusters entities into groups and clusters words into topics, unlike models that generate topics solely based on word distributions, such as Latent Dirichlet Allocation [4]. In this way the GT model discovers salient topics relevant to relationships between entities in the social network, topics which the models that only examine words are unable to detect.

We demonstrate the capabilities of the GT model by applying it to two large sets of voting data: one from the US Senate and the other from the General Assembly of the UN. The model clusters voting entities into coalitions and simultaneously discovers topics for word attributes describing the relations (bills or resolutions) between entities. We find that the groups obtained from the GT model are significantly more cohesive (p-value < .01) than those obtained from the Blockstructures model. The GT model also discovers new and more salient topics in both the UN and Senate datasets; in comparison with topics discovered by only examining the words of the resolutions, the GT topics are either split or joined together as influenced by the voters' patterns of behavior.
2. GROUP-TOPIC MODEL

The Group-Topic Model is a directed graphical model that clusters entities with relations between them, as well as attributes of those relations. The relations may be either directed or undirected and have multiple attributes. In this paper, we focus on undirected relations and have words as the attributes on relations.
In the generative process for each event (an interaction between entities), the model first picks the topic t of the event and then generates all the words describing the event, where each word is generated independently according to a multinomial distribution φ_t specific to the topic t. To generate the relational structure of the network, first the group assignment g_st for each entity s is chosen conditionally on the topic, from a particular multinomial distribution θ_t over groups for each topic t. Given the group assignments on an event b, the matrix V^(b) is generated, where each cell V^(b)_{g_i g_j} represents how often the groups of two senators behaved the same or not during the event b (e.g., voted the
SYMBOL     DESCRIPTION
g_it       entity i's group assignment in topic t
t_b        topic of an event b
w_k^(b)    the kth token in the event b
V_ij^(b)   whether entity i's and j's groups behaved the same (1) or differently (2) on the event b
S          number of entities
T          number of topics
G          number of groups
B          number of events
V          number of unique words
N_b        number of word tokens in the event b
S_b        number of entities who participated in the event b

Table 1: Notation used in this paper
! " w v # $
t g % &
Nb S
b2
T BG2
S T
B
Figure 1: The Group-Topic model
same or not on a bill). The elements of V are sampled from a binomial distribution γ^(b)_{g_i g_j}. Our notation is summarized in Table 1, and the graphical model representation of the model is shown in Figure 1.

Without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of Figure 1) becomes equivalent to the stochastic Blockstructures model [17]. To match the Blockstructures model, each event defines a relationship, e.g., whether in the event two entities' groups behave the same or not. On the other hand, in our model a relation may have multiple attributes (which in our experiments are the words describing the event, generated by a per-topic multinomial).
When we consider the complete model, the dataset is dynamically divided into T sub-blocks, each of which corresponds to a topic. The complete GT model is as follows:

t_b ~ Uniform(1/T)
w_it | φ_t ~ Multinomial(φ_t)
φ_t | η ~ Dirichlet(η)
g_it | θ_t ~ Multinomial(θ_t)
θ_t | α ~ Dirichlet(α)
V_ij^(b) | γ^(b)_{g_i g_j} ~ Binomial(γ^(b)_{g_i g_j})
γ^(b)_{gh} | β ~ Beta(β).
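The conditional distributions above can be turned into a toy forward sampler. This sketch (assumed symmetric hyperparameter values and toy sizes; our own illustration, not the authors' code) draws one event's topic, its words, the per-entity group assignments, and the behavior matrix V^(b):

```python
import numpy as np

rng = np.random.default_rng(2)
T, G, S, W = 3, 2, 10, 50        # topics, groups, entities, vocabulary (toy sizes)
eta = alpha = beta_h = 1.0       # assumed symmetric hyperparameter values

phi = rng.dirichlet(np.full(W, eta), size=T)      # phi_t | eta ~ Dirichlet(eta)
theta = rng.dirichlet(np.full(G, alpha), size=T)  # theta_t | alpha ~ Dirichlet(alpha)

def sample_event(n_words=8):
    """Sample one event (e.g., a bill) from the GT generative process."""
    t = rng.integers(T)                                # t_b ~ Uniform(1/T)
    words = rng.choice(W, size=n_words, p=phi[t])      # w | phi_t ~ Multinomial(phi_t)
    groups = rng.choice(G, size=S, p=theta[t])         # g_it | theta_t ~ Multinomial(theta_t)
    gamma = rng.beta(beta_h, beta_h, size=(G, G))      # gamma^(b)_gh | beta ~ Beta(beta)
    # V^(b)_ij: whether entities i's and j's groups behaved the same on event b.
    V = rng.binomial(1, gamma[np.ix_(groups, groups)])
    return t, words, groups, V

t, words, groups, V = sample_event()
```

Note how the group assignments are resampled per topic, which is what lets coalitions differ across topics.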
We want to perform joint inference on (text) attributes and relations to obtain topic-wise group memberships. Since inference cannot be done exactly on such complicated probabilistic graphical models, we employ Gibbs sampling to conduct inference. Note that we adopt conjugate priors in our
Indian Buffet Process Compound Dirichlet Process
B selects a subset of atoms for each distribution, and the gamma random variables φ determine the relative masses associated with these atoms.
2.4. Focused Topic Models
Suppose H parametrizes distributions over words. Then, the ICD defines a generative topic model, where it is used to generate a set of sparse distributions over an infinite number of components, called "topics." Each topic is drawn from a Dirichlet distribution over words. In order to specify a fully generative model, we sample the number of words for each document from a negative binomial distribution, n_·^(m) ~ NB(Σ_k b_mk φ_k, 1/2).²

The generative model for M documents is

1. for k = 1, 2, . . . ,
   (a) Sample the stick length π_k according to Eq. 1.
   (b) Sample the relative mass φ_k ~ Gamma(γ, 1).
   (c) Draw the topic distribution over words, β_k ~ Dirichlet(η).
2. for m = 1, . . . , M,
   (a) Sample a binary vector b_m according to Eq. 1.
   (b) Draw the total number of words, n_·^(m) ~ NB(Σ_k b_mk φ_k, 1/2).
   (c) Sample the distribution over topics, θ_m ~ Dirichlet(b_m · φ).
   (d) For each word w_mi, i = 1, . . . , n_·^(m):
       i. Draw the topic index z_mi ~ Discrete(θ_m).
       ii. Draw the word w_mi ~ Discrete(β_{z_mi}).
We call this the focused topic model (FTM) because the infinite binary matrix B serves to focus the distribution over topics onto a finite subset (see Figure 1). The number of topics within a single document is almost surely finite, though the total number of topics is unbounded. The topic distribution for the mth document, θ_m, is drawn from a Dirichlet distribution over the topics selected by b_m. The Dirichlet distribution models uncertainty about topic proportions while maintaining the restriction to a sparse set of topics.
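A toy forward simulation of this generative process (truncated IBP stick-breaking, assumed hyperparameter values; our own sketch, not the authors' code) makes the sparsity mechanism concrete: each document draws its topic proportions only over the topics its binary vector selects.

```python
import numpy as np

rng = np.random.default_rng(3)
K, W, M = 20, 30, 4                       # IBP truncation, vocabulary, documents (toy)
alpha_ibp, gamma_h, eta = 2.0, 1.0, 0.5   # assumed hyperparameters

# Truncated stick-breaking for the IBP inclusion probabilities pi_k.
pi = np.cumprod(rng.beta(alpha_ibp, 1.0, size=K))

phi = rng.gamma(gamma_h, 1.0, size=K)             # relative masses phi_k ~ Gamma(gamma, 1)
beta = rng.dirichlet(np.full(W, eta), size=K)     # topic-word distributions beta_k

docs = []
for _ in range(M):
    b = rng.random(K) < pi                           # binary topic-selection vector b_m
    if not b.any():
        b[0] = True                                  # toy safeguard: keep at least one topic
    n = rng.negative_binomial((b * phi).sum(), 0.5)  # n^(m) ~ NB(sum_k b_mk phi_k, 1/2)
    theta = rng.dirichlet(phi[b])                    # theta_m ~ Dirichlet over selected topics
    z = rng.choice(np.flatnonzero(b), size=n, p=theta)       # topic indices z_mi
    w = np.array([rng.choice(W, p=beta[k]) for k in z])      # words w_mi ~ Discrete(beta_z)
    docs.append((b, z, w))
```

Because θ_m has support only on the selected topics, a document's word count and its set of active topics are decoupled, which is exactly the prevalence/proportion decoupling the ICD is designed for.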
The ICD models the distribution over the global topic pro-portion parameters � separately from the distribution overthe binary matrix B. This captures the idea that a topic mayappear infrequently in a corpus, but make up a high propor-tion of those documents in which it occurs. Conversely, atopic may appear frequently in a corpus, but only with lowproportion.
2Notation n(m)k is the number of words assigned to the kth
topic of the mth document, and we use a dot notation to representsummation - i.e. n(m)
· =P
k n(m)k .
Figure 1. Graphical model for the focused topic model
3. Related ModelsTitsias (2007) introduced the infinite gamma-Poisson pro-cess, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic modelof images. In this model, the distribution over featuresfor the mth image is given by a Dirichlet distribution overthe non-negative elements of the mth row of the infinitegamma-Poisson process matrix, with parameters propor-tional to the values at these elements. While this results ina sparse matrix of distributions, the number of zero entriesin any column of the matrix is correlated with the valuesof the non-zero entries. Columns which have entries withlarge values will not typically be sparse. Therefore, thismodel will not decouple across-data prevalence and within-data proportions of topics. In the ICD the number of zeroentries is controlled by a separate process, the IBP, fromthe values of the non-zero entries, which are controlled bythe gamma random variables.
The sparse topic model (SparseTM, Wang & Blei, 2009)uses a finite spike and slab model to ensure that each topicis represented by a sparse distribution over words. Thespikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from asymmetric Dirichlet distribution defined over these spikes.The ICD also uses a spike and slab approach, but allowsan unbounded number of “spikes” (due to the IBP) and amore globally informative “slab” (due to the shared gammarandom variables). We extend the SparseTM’s approxima-tion of the expectation of a finite mixture of Dirichlet dis-tributions, to approximate the more complicated mixture ofDirichlet distributions given in Eq. 2.
Recent work by Fox et al. (2009) uses draws from an IBPto select subsets of an infinite set of states, to model multi-ple dynamic systems with shared states. (A state in the dy-namic system is like a component in a mixed membershipmodel.) The probability of transitioning from the ith stateto the jth state in the mth dynamic system is drawn from aDirichlet distribution with parameters bmj� + ��i,j , where
Chang, Blei
!
Nd
"d
wd,n
zd,n
K
#k
yd,d'
$
Nd'
"d'
wd',n
zd',n
Figure 2: A two-document segment of the RTM. The variable y indicates whether the two documents are linked. The complete modelcontains this variable for each pair of documents. The plates indicate replication. This model captures both the words and the linkstructure of the data shown in Figure 1.
formulation, inspired by the supervised LDA model (Bleiand McAuliffe 2007), ensures that the same latent topic as-signments used to generate the content of the documentsalso generates their link structure. Models which do notenforce this coupling, such as Nallapati et al. (2008), mightdivide the topics into two independent subsets—one forlinks and the other for words. Such a decomposition pre-vents these models from making meaningful predictionsabout links given words and words given links. In Sec-tion 4 we demonstrate empirically that the RTM outper-forms such models on these tasks.
3 INFERENCE, ESTIMATION, ANDPREDICTION
With the model defined, we turn to approximate poste-rior inference, parameter estimation, and prediction. Wedevelop a variational inference procedure for approximat-ing the posterior. We use this procedure in a variationalexpectation-maximization (EM) algorithm for parameterestimation. Finally, we show how a model whose parame-ters have been estimated can be used as a predictive modelof words and links.
Inference In posterior inference, we seek to computethe posterior distribution of the latent variables condi-tioned on the observations. Exact posterior inference is in-tractable (Blei et al. 2003; Blei and McAuliffe 2007). Weappeal to variational methods.
In variational methods, we posit a family of distributions over the latent variables indexed by free variational parameters. Those parameters are fit to be close to the true posterior, where closeness is measured by relative entropy. See Jordan et al. (1999) for a review. We use the fully-factorized family,
q(\Theta, Z \mid \gamma, \Phi) = \prod_d \big[ q_\theta(\theta_d \mid \gamma_d) \prod_n q_z(z_{d,n} \mid \phi_{d,n}) \big], (3)

where \gamma is a set of Dirichlet parameters, one for each document, and \Phi is a set of multinomial parameters, one for each word in each document. Note that E_q[z_{d,n}] = \phi_{d,n}.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

\mathcal{L} = \sum_{(d_1,d_2)} E_q[\log p(y_{d_1,d_2} \mid z_{d_1}, z_{d_2}, \eta, \nu)] + \sum_d \sum_n E_q[\log p(w_{d,n} \mid \beta_{1:K}, z_{d,n})] + \sum_d \sum_n E_q[\log p(z_{d,n} \mid \theta_d)] + \sum_d E_q[\log p(\theta_d \mid \alpha)] + H(q), (4)
where (d_1, d_2) denotes all document pairs. The first term of the ELBO differentiates the RTM from LDA (Blei et al. 2003). The connections between documents affect the objective in approximate posterior inference (and, below, in parameter estimation).
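To make the structure of such a bound concrete, here is a small sketch, not the RTM itself: a toy model with one discrete latent variable and one observation, whose mean-field ELBO decomposes, as in Eq. 4, into an expected complete log-probability term plus an entropy term, and is tight exactly at the true posterior. All sizes, names, and numbers below are illustrative assumptions.

```python
import numpy as np

# Toy model (illustrative only, not the RTM): latent z in {0..K-1},
# p(z) = prior[z], p(w|z) = lik[z, w].  The variational distribution
# q(z) = Categorical(phi) plays the role of the factorized family.
rng = np.random.default_rng(0)
K, V = 3, 5
prior = rng.dirichlet(np.ones(K))
lik = rng.dirichlet(np.ones(V), size=K)   # K x V emission matrix
w = 2                                      # an observed word id

def elbo(phi):
    """E_q[log p(w, z)] + H(q): expected complete log-lik plus entropy."""
    ell = np.sum(phi * (np.log(prior) + np.log(lik[:, w])))
    entropy = -np.sum(phi * np.log(phi + 1e-12))
    return ell + entropy

# The exact posterior maximizes the bound and makes it tight:
post = prior * lik[:, w]
evidence = post.sum()                      # p(w), exact by enumeration
post /= evidence

phi0 = np.ones(K) / K                      # a suboptimal q
assert elbo(phi0) <= np.log(evidence) + 1e-9
assert abs(elbo(post) - np.log(evidence)) < 1e-9
```

Because ELBO = log p(w) − KL(q ‖ p(z|w)), any q gives a lower bound on the log evidence, and coordinate updates on the variational parameters only tighten it.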
We develop the inference procedure under the assumption that only observed links will be modeled (i.e., y_{d_1,d_2} is either 1 or unobserved).1 We do this for two reasons.
First, while one can fix y_{d_1,d_2} = 1 whenever a link is observed between d_1 and d_2 and set y_{d_1,d_2} = 0 otherwise, this approach is inappropriate in corpora where the absence of a link cannot be construed as evidence for y_{d_1,d_2} = 0. In these cases, treating these links as unobserved variables is more faithful to the underlying semantics of the data. For example, in large social networks such as Facebook the absence of a link between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other's existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.
Second, treating non-links as hidden decreases the computational cost of inference; since the link variables are leaves in the graphical model, they can be removed whenever they are unobserved.
1Sums over document pairs (d1, d2) are understood to rangeover pairs for which a link has been observed.
(a) Overall Graphical Model
(b) Sentence Graphical Model

Figure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words' topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document \theta, and the node's parent's successor weights \pi. (For clarity, not all dependencies of sentence nodes are shown.) The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence “Some phrases laid in his mind for years.” The STM assumes that the tree structure and words are given, but the latent topics z are not.
is going to be a noun consistent as the object of the preposition “of.” Thematically, because it is in a travel brochure, we would expect to see words such as “Acapulco,” “Costa Rica,” or “Australia” more than “kitchen,” “debt,” or “pocket.” Our model can capture these kinds of regularities and exploit them in predictive problems.
Previous efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as “fly,” which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.
Other techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words either from the syntactic or thematic context. In the syntactic topic model, words are constrained to be consistent with both.
The remainder of this paper is organized as follows. We describe the syntactic topic model, and develop an approximate posterior inference technique based on variational methods. We study its performance both on synthetic data and hand-parsed data [10]. We show that the STM captures relationships missed by other models and achieves lower held-out perplexity.
2 The syntactic topic model
We describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence “In the near future, you could find yourself in .” The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.
The STM attempts to capture these joint influences on words. It models a document corpus as exchangeable collections of sentences, each of which is associated with a tree structure such as a
This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.
We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include timestamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.
The rest of the paper is organized as follows. In Section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In Section 4, we present experimental results on two news corpora.
2 Continuous time dynamic topic models
In a time stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.
More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
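To make the parenthetical concrete, a minimal sketch (the helper name is our own) of the correspondence between natural parameters and multinomial probabilities: the natural parameters are the log probabilities up to an additive constant, and the softmax map recovers the distribution.

```python
import numpy as np

def softmax(beta):
    """Map natural parameters (log scale) back to multinomial probabilities."""
    b = beta - beta.max()          # shift for numerical stability
    e = np.exp(b)
    return e / e.sum()

probs = np.array([0.5, 0.3, 0.2])
beta = np.log(probs)               # natural parameters = log probabilities
assert np.allclose(softmax(beta), probs)

# Natural parameters are identified only up to an additive constant:
assert np.allclose(softmax(beta + 7.0), probs)
```

This unconstrained log-scale representation is what lets the state space model place Gaussian dynamics on topics without leaving the simplex.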
Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters \beta_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.
A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the discretization should be a decision based on assumptions about the data. However, the computational concerns might prevent analysis at the appropriate time scale.
Thus, we develop the continuous time dynamic topic model (cDTM) for modeling sequential time-series data with arbitrary granularity. The cDTM can be seen as a natural limit of the dDTM at its finest possible resolution, the resolution at which the document time stamps are measured.
In the cDTM, we still represent topics in their natural parameterization, but we use Brownian motion [14] to model their evolution through time. Let i, j (j > i > 0) be two arbitrary time indexes, s_i and s_j be the time stamps, and \Delta_{s_j,s_i} be the elapsed time between them. In a K-topic cDTM model, the distribution of the kth (1 \le k \le K) topic's parameter at term w is:

\beta_{0,k,w} \sim \mathcal{N}(m, v_0)
\beta_{j,k,w} \mid \beta_{i,k,w}, s \sim \mathcal{N}(\beta_{i,k,w},\, v\,\Delta_{s_j,s_i}), (1)

where the variance increases linearly with the lag.

This construction is used as a component in the full generative process. (Note: if j = i+1, we write \Delta_{s_j,s_i} as \Delta_{s_j} for short.)
1. For each topic k, 1 \le k \le K,

(a) Draw \beta_{0,k} \sim \mathcal{N}(m, v_0 I).
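A minimal simulation of the Brownian-motion prior in Eq. 1 (the vocabulary size, variances, and time stamps below are invented for illustration): the natural parameters of one topic diffuse with variance proportional to the elapsed time between stamps, and a softmax maps them to a word distribution at any stamp.

```python
import numpy as np

rng = np.random.default_rng(1)
V, m, v0, v = 4, 0.0, 1.0, 0.1        # toy vocab size and variances
stamps = [0.0, 0.5, 2.0, 2.25]        # arbitrary, unevenly spaced time stamps

beta = [rng.normal(m, np.sqrt(v0), size=V)]     # beta_0 ~ N(m, v0 I)
for i in range(1, len(stamps)):
    delta = stamps[i] - stamps[i - 1]           # elapsed time Delta
    # beta_j | beta_i ~ N(beta_i, v * Delta): variance grows with the lag
    beta.append(beta[-1] + rng.normal(0.0, np.sqrt(v * delta), size=V))

def to_word_dist(b):
    """Softmax: natural parameters -> multinomial over the vocabulary."""
    e = np.exp(b - b.max())
    return e / e.sum()

for b in beta:
    p = to_word_dist(b)
    assert abs(p.sum() - 1.0) < 1e-12 and (p > 0).all()
```

Because the increments depend only on elapsed time, unevenly spaced stamps need no intermediate "ticks," which is the memory saving over the dDTM described above.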
Figure 1: (a) LDA model. (b) MG-LDA model.
is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In Section 3 we will describe a modification of this sampling method for the proposed Multi-grain LDA model.
Both LDA and PLSA methods use the bag-of-words representation of documents, therefore they can only explore co-occurrences at the document level. This is fine, provided the goal is to represent an overall topic of the document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics: hotels in France, New York hotels, youth hostels; or, similarly, when applied to a collection of Mp3 players' reviews, these models will infer topics like reviews of iPod or reviews of Creative Zen player. Though these are all valid topics, they do not represent ratable aspects, but rather define clusterings of the reviewed items into specific types. In further discussion we will refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation. Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA methods. Most of these topics are present in some way in every review. Therefore, it is difficult to discover them by using only co-occurrence information at the document level. In this case exceedingly large amounts of training data are needed, as well as a very large number of topics K. Even in this case there is a danger that the model will be overwhelmed by very fine-grained global topics, or that the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.
One way to address this problem would be to consider co-occurrences at the sentence level, i.e., apply LDA or PLSA to individual sentences. But in this case we will not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram
models are considerably more computationally expensive. Also, like LDA and PLSA, they will not be able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we will introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA

We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific for the local context of the word. The hypothesis is that ratable aspects will be captured by local topics and global topics will capture properties of reviewed items. For example, consider an extract from a review of a London hotel: “. . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50”. It can be viewed as a mixture of topic London shared by the entire review (words: “London”, “tube”, “£”), and the ratable aspect location, specific for the local context of the sentence (words: “transport”, “walk”, “bus”). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes. Other applications of multi-grain topic models conceivably might prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.
We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics \theta^{loc}_{d,v} and a distribution defining preference for local topics versus global topics \pi_{d,v}. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution \psi_s. Importantly, the fact that the windows overlap permits exploiting a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introduction of a symmetrical Dirichlet prior Dir(\gamma) for the distribution \psi_s permits control of the smoothness of topic transitions in our model.
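The sliding-window bookkeeping can be sketched with a hypothetical helper (`windows_covering` is our own name, not the authors' code): with windows of T adjacent sentences, a word in sentence s may be generated through any window whose span contains s, and overlapping windows are what enlarge the co-occurrence domain.

```python
def windows_covering(s, n_sentences, T):
    """Indices of sliding windows (each spanning T adjacent sentences)
    that contain sentence s.  Window v covers sentences v .. v+T-1."""
    first = max(0, s - T + 1)
    last = min(s, n_sentences - T)
    return list(range(first, last + 1))

# A 5-sentence document with T=3 has windows {0,1,2}, {1,2,3}, {2,3,4}:
assert windows_covering(0, 5, 3) == [0]          # first sentence: one window
assert windows_covering(2, 5, 3) == [0, 1, 2]    # middle sentence: all three
assert windows_covering(4, 5, 3) == [2]          # last sentence: one window
```

A middle sentence is shared by T windows, so its words co-occur with words up to T−1 sentences away in either direction without any explicit transition model.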
The formal definition of the model with K_{gl} global and K_{loc} local topics is the following. First, draw K_{gl} word distributions for global topics \varphi^{gl}_z from a Dirichlet prior Dir(\beta^{gl}) and K_{loc} word distributions for local topics \varphi^{loc}_{z'} from Dir(\beta^{loc}). Then, for each document d:

• Choose a distribution of global topics \theta^{gl}_d \sim Dir(\alpha^{gl}).
• For each sentence s choose a distribution \psi_{d,s}(v) \sim Dir(\gamma).
• For each sliding window v
WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China
McCallum, Wang, & Corrada-Emmanuel
Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]
Author-Topic Model (AT) [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]
Author-Recipient-Topic Model (ART) [This paper]
Author Model (Multi-label Mixture Model) [McCallum 1999]

Figure 1: Three related models, and the ART model. In all models, each observed word, w, is generated from a multinomial word distribution, \varphi_z, specific to a particular topic/author, z; however, topics are selected differently in each of the models. In LDA, the topic is sampled from a per-document topic distribution, \theta, which in turn is sampled from a Dirichlet over topics. In the Author Model, there is one topic associated with each author (or category), and authors are sampled uniformly. In the Author-Topic model, the topic is sampled from a per-author multinomial distribution, \theta, and authors are sampled uniformly from the observed list of the document's authors. In the Author-Recipient-Topic model, there is a separate topic-distribution for each author-recipient pair, and the selection of topic-distribution is determined from the observed author, and by uniformly sampling a recipient from the set of recipients for the document.
its generative process for each document d, a set of authors, a_d, is observed. To generate each word, an author x is chosen uniformly from this set, then a topic z is selected from a topic distribution \theta_x that is specific to the author, and then a word w is generated from a topic-specific multinomial distribution \varphi_z. However, as described previously, none of these models is suitable for modeling message data.
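The AT-style generative process just described can be sketched in a few lines; the parameter matrices below are random placeholders rather than fitted models, and the sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
A, T, V = 3, 4, 10                          # authors, topics, vocabulary
theta = rng.dirichlet(np.ones(T), size=A)   # per-author topic distribution
phi = rng.dirichlet(np.ones(V), size=T)     # per-topic word distribution

def generate_word(authors):
    """One word under the AT-style process: author uniform from the
    document's author set, topic from theta_x, word from phi_z."""
    x = rng.choice(authors)                 # author sampled uniformly
    z = rng.choice(T, p=theta[x])           # topic from author's distribution
    w = rng.choice(V, p=phi[z])             # word from topic's distribution
    return x, z, w

x, z, w = generate_word([0, 2])
assert x in (0, 2) and 0 <= z < T and 0 <= w < V
```

The ART extension described next would replace `theta[x]` with a distribution indexed by an (author, recipient) pair rather than by the author alone.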
An email message has one sender and in general more than one recipient. We could treat both the sender and the recipients as “authors” of the message, and then employ the AT model, but this does not distinguish the author and the recipients of the message, which is undesirable in many real-world situations. A manager may send email to a secretary and vice versa, but the nature of the requests and language used may be quite different. Even more dramatically, consider the large quantity of junk email that we receive; modeling the topics of these messages as undistinguished from the topics we write about as authors would be extremely confounding and undesirable since they do not reflect our expertise or roles.
Alternatively we could still employ the AT model by ignoring the recipient information of email and treating each email document as if it only has one author. However, in this case (which is similar to the LDA model) we are losing all information about the recipients, and the connections between people implied by the sender-recipient relationships.
• The posterior can be used in creative ways.
• E.g., we can use inferences in information retrieval, recommendation,similarity, visualization, summarization, and other applications.
© Eric Xing @ CMU, 2005-2020 47
More on Mean Field Approximation
The naive mean field approximation

q Approximate p(X) by fully factorized q(X) = \prod_i q_i(X_i)
q For the Boltzmann distribution p(X) = \exp\{\sum_{i<j} \theta_{ij} X_i X_j + \theta_{i0} X_i\}/Z:

q Gibbs predictive distribution:

p(x_i \mid \mathbf{x}_{N_i}) = \exp\big\{\theta_{i0} x_i + \sum_{j \in N_i} \theta_{ij} x_i x_j - A_i\big\} = p(x_i \mid \{x_j : j \in N_i\})

q Mean field equation:

q_i(x_i) = \exp\big\{\theta_{i0} x_i + \sum_{j \in N_i} \theta_{ij} x_i \langle X_j \rangle_{q_j} - A_i\big\} = p(x_i \mid \{\langle X_j \rangle_{q_j} : j \in N_i\})

§ \langle x_j \rangle_{q_j} resembles a “message” sent from node j to i
§ \{\langle X_j \rangle_{q_j} : j \in N_i\} forms the “mean field” applied to X_i from its neighborhood
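The mean field equation yields a simple fixed-point iteration. A sketch for spins x_i ∈ {−1, +1} (a toy instance of our own; for this case the update has the closed form μ_i = tanh(θ_i0 + Σ_{j∈N_i} θ_ij μ_j)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
Theta = rng.normal(0, 0.3, size=(n, n))
Theta = np.triu(Theta, 1)
Theta = Theta + Theta.T                 # symmetric couplings, zero diagonal
theta0 = rng.normal(0, 0.5, size=n)     # biases theta_i0

mu = np.zeros(n)                        # the mean-field "messages" <X_j>
for _ in range(200):                    # coordinate-ascent sweeps
    for i in range(n):
        mu[i] = np.tanh(theta0[i] + Theta[i] @ mu)

# At convergence each mu_i satisfies its own mean field equation:
resid = np.abs(mu - np.tanh(theta0 + Theta @ mu)).max()
assert resid < 1e-6
```

Each update replaces node i's random neighbors by their current means, exactly the "mean field applied to X_i from its neighborhood" in the slide above; for weak couplings the sweep is a contraction and converges quickly.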
Cluster-based approx. to the Gibbs free energy (Wiegerinck 2001, Xing et al. 03, 04)

Exact: G[p(X)] (intractable)
Clusters: G[\{q_C(X_C)\}]
Mean field approx. to Gibbs free energy

q Given a disjoint clustering, {C_1, …, C_I}, of all variables
q Let q(\mathbf{X}) = \prod_i q_i(\mathbf{X}_{C_i})
q Mean-field free energy:

G_{MF} = \sum_{\mathbf{x}} \Big[\prod_i q_i(\mathbf{x}_{C_i})\Big] E(\mathbf{x}) + \sum_i \sum_{\mathbf{x}_{C_i}} q_i(\mathbf{x}_{C_i}) \ln q_i(\mathbf{x}_{C_i})

e.g., G_{MF} = \sum_{i<j} \sum_{x_i,x_j} q_i(x_i)\, q_j(x_j)\, \phi_{ij}(x_i,x_j) + \sum_i \sum_{x_i} q_i(x_i)\, \phi_i(x_i) + \sum_i \sum_{x_i} q_i(x_i) \ln q_i(x_i) (naïve mean field)

q Will never equal the exact Gibbs free energy no matter what clustering is used, but it does always define a lower bound of the likelihood
q Optimize each q_i(\mathbf{x}_{C_i}):
q Variational calculus …
q Do inference in each q_i(\mathbf{x}_{C_i}) using any tractable algorithm
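The lower-bound claim can be checked numerically on a tiny model (brute-force code of our own, using the equivalent statement that E_q[S(x)] + H(q) ≤ ln Z for the score S(x) = Σ_{i<j} θ_ij x_i x_j + Σ_i θ_i0 x_i, i.e., −G_MF ≤ ln Z):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n = 4
Theta = np.triu(rng.normal(0, 0.5, (n, n)), 1)   # couplings theta_ij, i<j
theta0 = rng.normal(0, 0.5, n)                    # biases theta_i0

def score(x):
    """S(x) = sum_{i<j} theta_ij x_i x_j + sum_i theta_i0 x_i, x_i in {-1,+1}."""
    return x @ Theta @ x + theta0 @ x

# Exact log partition function by enumerating all 2^n states
states = np.array(list(itertools.product([-1, 1], repeat=n)))
logZ = np.log(np.sum([np.exp(score(x)) for x in states]))

def mf_bound(mu):
    """E_q[S] + H(q) for a factorized q with means mu_i = <x_i>."""
    q1 = (1 + mu) / 2                         # q_i(x_i = +1)
    es = mu @ Theta @ mu + theta0 @ mu        # cross terms factorize under q
    h = -np.sum(q1 * np.log(q1) + (1 - q1) * np.log(1 - q1))
    return es + h

for _ in range(20):                           # random factorized q's
    mu = rng.uniform(-0.95, 0.95, n)
    assert mf_bound(mu) <= logZ + 1e-9
```

Every factorized q gives a valid lower bound on ln Z; optimizing the q_i only tightens it, which is why G_MF never reaches the exact Gibbs free energy but always bounds the likelihood.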
The Generalized Mean Field theorem

q Theorem: The optimum GMF approximation to the cluster marginal is isomorphic to the cluster posterior of the original distribution given internal evidence and its generalized mean fields:

q_i^*(\mathbf{X}_{C_i,H}) = p(\mathbf{X}_{C_i,H} \mid \mathbf{x}_{C_i,E}, \langle \mathbf{X}_{MB_i,H} \rangle_{q_{j \neq i}})

q GMF algorithm: Iterate over each q_i
A generalized mean field algorithm [Xing et al., UAI 2003]
Convergence theorem

q Theorem: The GMF algorithm is guaranteed to converge to a local optimum, and provides a lower bound for the likelihood of evidence (or partition function) of the model.
Example 1: Generalized MF approximations to Ising models

q Cluster marginal of a square block C_k:

q_{C_k}(\mathbf{X}_{C_k}) \propto \exp\Big\{\sum_{i \in C_k} \theta_{i0} X_i + \sum_{i,j \in C_k} \theta_{ij} X_i X_j + \sum_{i \in C_k,\, j \in MB_{C_k}} \theta_{ij} X_i \langle X_j \rangle_{q_{MB_{C_k}}}\Big\}

q Virtually a reparameterized Ising model of small size.
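A sketch of evaluating such a block marginal by enumeration (the block membership, couplings, and neighbor means below are invented toy values): the block keeps its internal couplings exactly, while boundary couplings see the neighbors only through their current means.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n = 6
block = [0, 1, 2, 3]                    # indices of a 2x2 block C_k
boundary = [4, 5]                       # Markov-blanket nodes of the block
Theta = np.triu(rng.normal(0, 0.4, (n, n)), 1)
Theta = Theta + Theta.T                 # symmetric couplings theta_ij
theta0 = rng.normal(0, 0.4, n)          # biases theta_i0
mu_boundary = {4: 0.3, 5: -0.2}         # current mean fields <X_j>

def unnorm(xb):
    """exp{internal block terms + boundary terms with neighbor means}."""
    s = 0.0
    for a, i in enumerate(block):
        s += theta0[i] * xb[a]
        for b, j in enumerate(block):
            if j > i:                   # each internal pair once
                s += Theta[i, j] * xb[a] * xb[b]
        for j in boundary:              # neighbors enter via their means
            s += Theta[i, j] * xb[a] * mu_boundary[j]
    return np.exp(s)

configs = list(itertools.product([-1, 1], repeat=len(block)))
weights = np.array([unnorm(x) for x in configs])
q_block = weights / weights.sum()       # the cluster marginal q_Ck
assert abs(q_block.sum() - 1.0) < 1e-12
```

As the slide notes, this is just a small reparameterized Ising model: the boundary terms act as extra per-node biases, so any exact method for small Ising models can handle each block.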
GMF approximation to Ising models

(Figure legends: GMF2x2, GMF4x4, BP)

Attractive coupling: positively weighted
Repulsive coupling: negatively weighted
Example 2: Sigmoid belief network

(Figure legends: GMFr, GMFb, BP)
Example 3: Factorial HMM