STATISTICAL TOPIC MODELING part 1
Andrea Tagarelli, Univ. of Calabria, Italy


Page 1: STATISTICAL TOPIC MODELING part 1

Andrea Tagarelli

Univ. of Calabria, Italy

Page 2: Statistical topic modeling (1/3)

• Key assumption:
  • text data is represented as a mixture of topics, i.e., probability distributions over terms
• Generative model for documents:
  • document features are modeled as being generated by latent variables
• Topic modeling vs. vector-space text modeling:
  • (latent) semantic aspects underlying correlations between words
  • document topical structure

Page 3: Statistical topic modeling (2/3)

• Training on a (large) corpus to learn:
  • per-topic word distributions
  • per-document topic distributions

[Blei, CACM, 2012]
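To make these two learned objects concrete, here is a minimal sketch using scikit-learn's LatentDirichletAllocation (one possible tool; the slides do not prescribe a library, and the toy corpus is my own): after fitting, the rows of `components_` give the per-topic word distributions once normalized, and `transform` yields the per-document topic distributions.

```python
# Minimal sketch (not from the slides): learning per-topic word distributions
# and per-document topic distributions with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors traded stocks and bonds",
]

# Bag-of-words term counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic distributions, shape (docs, topics)

# Per-topic word distributions: normalize components_ rows to sum to 1
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(doc_topics.shape)   # (4, 2)
print(topic_words.shape)  # (2, vocabulary size)
```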

Page 4: Statistical topic modeling (3/3)

• Graphical “plate” notation
  • standard representation for generative models
  • rectangles (plates) represent repeated parts of the model; the plate label gives the number of times the enclosed variable(s) are repeated

[Hofmann, SIGIR, 1999]

Page 5: Observed and latent variables

• Observed variable: a variable whose current value is known
• Latent variable: a variable whose state cannot be observed
• Estimation problem:
  • estimate values for a set of distribution parameters that can best explain a set of observations
  • most likely values of the parameters: maximum likelihood of a model
• The likelihood is impossible to calculate in full; it is approximated through:
  • the expectation-maximization (EM) algorithm: an iterative method to estimate the probabilities of the unobserved, latent variables, repeated until a local optimum is reached (see the sketch after this list)
  • Gibbs sampling: update parameters sample-wise
  • variational inference: approximate the model by a simpler one
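To make the EM idea concrete, below is a small sketch of EM for a two-component 1-D Gaussian mixture (my own illustration, not from the slides): the E-step computes the posterior responsibilities of the latent component assignments, and the M-step re-estimates the parameters from them, iterating toward a local optimum of the likelihood.

```python
# Minimal EM sketch for a two-component 1-D Gaussian mixture (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

# Initial guesses for mixture weights, means, and variances
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility r[i, k] = P(z = k | x_i) for the latent assignment z
    joint = pi * gauss(x[:, None], mu, var)          # shape (n, 2)
    r = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi, mu, var)  # should approach [0.5, 0.5], [-2, 3], [1, 1]
```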

Page 6: Probabilistic LSA

• PLSA [Hofmann, 2001]
  • probabilistic version of LSA, conceived to better handle problems of term polysemy

[Plate diagram: d → z → w, i.e., observed document d and word w linked through the latent topic z, with nested plates of sizes M and N repeating the model over the documents and the word positions]

Page 7: PLSA training (1/2)

• Joint probability model:

    P(d, w) = P(d) P(w | d),  where  P(w | d) = Σz P(w | z) P(z | d)

• Likelihood, with n(d, w) the count of word w in document d:

    L = Σd Σw n(d, w) log P(d, w)

Page 8: PLSA training (2/2)

• Training with EM:
  • initialization of the per-topic word distributions P(w | z) and per-document topic distributions P(z | d)
  • E-step (posterior of the latent topic for each document-word pair):

      P(z | d, w) = P(w | z) P(z | d) / Σz' P(w | z') P(z' | d)

  • M-step (re-estimation from expected counts):

      P(w | z) ∝ Σd n(d, w) P(z | d, w)
      P(z | d) ∝ Σw n(d, w) P(z | d, w)
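A compact NumPy sketch of this EM loop (my own illustration; the toy counts and variable names such as `n_dw` are assumptions, not from the slides), operating directly on a document-word count matrix n(d, w):

```python
# PLSA training via EM on a document-word count matrix (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
M, V, K = 8, 20, 2                                     # documents, vocabulary, topics
n_dw = rng.integers(0, 5, size=(M, V)).astype(float)   # toy counts n(d, w)

# Random initialization of P(z|d), shape (M, K), and P(w|z), shape (K, V)
p_z_d = rng.random((M, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: P(z|d,w) for every (d, w) pair, shape (M, V, K)
    joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
    p_z_dw = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate both distributions from expected counts n(d,w) * P(z|d,w)
    exp_counts = n_dw[:, :, None] * p_z_dw             # (M, V, K)
    p_w_z = exp_counts.sum(axis=0).T                   # (K, V)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = exp_counts.sum(axis=1)                     # (M, K)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

# Corpus log-likelihood term Σ n(d,w) log P(w|d); the P(d) factor only adds a constant
p_wd = (p_z_d[:, None, :] * p_w_z.T[None, :, :]).sum(axis=2)
print((n_dw * np.log(p_wd + 1e-12)).sum())
```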

Page 9: Latent Dirichlet Allocation (1/2)

• LDA [Blei et al., 2003]
  • adds a Dirichlet prior on the per-document topic distribution
• 3-level scheme: corpus, documents, and terms
  • terms are the only observed variables

[Plate diagram legend:
  • outer plate: for each doc dj in a collection of N docs
  • inner plate: for each word position in a doc of length M
  • zji: topic assignment to the word at position i in doc dj
  • wji: word token at position i in doc dj
  • θ: per-document topic distribution
  • β: per-topic word distribution]

[Moens and Vulic, Tutorial @WSDM 2014]
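Read as a generative process, the plate diagram says: draw each per-topic word distribution from a Dirichlet prior, then, for each document, draw θ and generate every word by first sampling a topic z and then a word w. A NumPy sketch of sampling a toy corpus this way (my own illustration; the hyperparameter values are arbitrary choices, not from the slides):

```python
# Sampling a toy corpus from the LDA generative process (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
N, M, K, V = 5, 30, 3, 50    # docs, words per doc, topics, vocabulary size
alpha, eta = 0.1, 0.01       # symmetric Dirichlet hyperparameters (arbitrary)

# Per-topic word distributions: beta_k ~ Dir(eta, ..., eta)
beta = rng.dirichlet(np.full(V, eta), size=K)        # shape (K, V)

corpus = []
for _ in range(N):                                   # for each doc in N docs
    theta = rng.dirichlet(np.full(K, alpha))         # per-document topic distribution
    doc = []
    for _ in range(M):                               # for each word position
        z = rng.choice(K, p=theta)                   # topic assignment z_ji
        w = rng.choice(V, p=beta[z])                 # word token w_ji
        doc.append(w)
    corpus.append(doc)

print(corpus[0])  # word ids of the first sampled document
```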

Page 10: Latent Dirichlet Allocation (2/2)

• Meaning of the Dirichlet priors
  • θ ~ Dir(α1, …, αK)
  • each αk acts as a prior observation count for the number of times topic zk is sampled in a document, before any word observations
  • analogously for each ηi, with β ~ Dir(η1, …, ηV)
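The pseudo-count reading of α can be seen by sampling θ for a few concentration values (a small illustration of mine, not from the slides): small αk gives sparse, peaked topic distributions, while large αk gives near-uniform ones.

```python
# Effect of the Dirichlet concentration parameter (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics

for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(K, alpha))
    print(alpha, np.round(theta, 3))
# alpha << 1: mass concentrates on a few topics (sparse theta)
# alpha >> 1: theta approaches the uniform distribution
```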

• Inference for a new document: given α, β, η, infer θ
• The exact inference problem is intractable; training goes through:
  • Gibbs sampling
  • variational inference