Probabilistic Topic Models - Latent Dirichlet Allocation
Machine Learning Reading Group
March 2014
ML Reading Group Probabilistic Topic Models 1 / 12
Related Papers
D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012.
D. Blei, A. Ng, M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
Motivation
Searching and exploring documents based on the themes that run through them
Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.
Probabilistic topic modeling: a suite of algorithms that aim to discover and annotate large archives of documents with thematic information.
Note: these algorithms, sometimes under different names, are used for other data types (audio, image, video, ...)
Topic Models (1)
Unigram model: the words of every document are drawn independently from a single multinomial distribution

p(w) = ∏_n p(w_n)    (1)
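Eq. (1) can be sketched in a few lines; the toy vocabulary and probabilities below are assumed illustrative values, not taken from the papers.

```python
import numpy as np

# Hypothetical toy vocabulary and one shared multinomial (assumed values).
vocab = ["topic", "model", "word", "data"]
p_word = np.array([0.4, 0.3, 0.2, 0.1])

def unigram_likelihood(doc):
    """p(w) = prod_n p(w_n): every word drawn from the same distribution."""
    idx = [vocab.index(w) for w in doc]
    return float(np.prod(p_word[idx]))

likelihood = unigram_likelihood(["topic", "model"])  # 0.4 * 0.3 = 0.12
```

Note that a single distribution is shared by the whole corpus: the model has no notion of a topic at all.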
Topic Models (2)
Mixture of unigrams: each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial p(w|z)

p(w) = ∑_z p(z) ∏_n p(w_n|z)    (2)
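Eq. (2) marginalises over a single per-document topic; a minimal sketch with assumed toy numbers:

```python
import numpy as np

# Hypothetical toy setup: 2 topics, 4 vocabulary words (assumed values).
p_z = np.array([0.5, 0.5])                      # p(z)
p_w_given_z = np.array([[0.7, 0.1, 0.1, 0.1],   # topic 0
                        [0.1, 0.1, 0.1, 0.7]])  # topic 1

def mixture_likelihood(word_ids):
    """p(w) = sum_z p(z) prod_n p(w_n|z): one topic per document."""
    per_topic = np.prod(p_w_given_z[:, word_ids], axis=1)  # prod_n p(w_n|z)
    return float(np.dot(p_z, per_topic))

likelihood = mixture_likelihood([0, 0])  # 0.5*0.49 + 0.5*0.01 = 0.25
```

The key restriction is visible in the code: each document's words are scored under one topic at a time, never a mixture within the document.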
Topic Models (3)
Probabilistic latent semantic indexing (PLSI): it captures the possibility that a document d may contain multiple topics

p(d, w_n) = p(d) ∑_z p(w_n|z) p(z|d)    (3)
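Eq. (3) gives each document its own mixing weights p(z|d); a sketch with assumed toy values:

```python
import numpy as np

# Hypothetical toy setup (assumed values): 2 documents, 2 topics, 3 words.
p_d = np.array([0.5, 0.5])                 # p(d)
p_z_given_d = np.array([[0.9, 0.1],        # document 0 mixes the topics
                        [0.2, 0.8]])       # document 1 mixes them differently
p_w_given_z = np.array([[0.8, 0.1, 0.1],   # topic 0
                        [0.1, 0.1, 0.8]])  # topic 1

def plsi_joint(d, w):
    """p(d, w_n) = p(d) * sum_z p(w_n|z) p(z|d)."""
    return float(p_d[d] * np.dot(p_z_given_d[d], p_w_given_z[:, w]))

joint = plsi_joint(0, 0)  # 0.5 * (0.9*0.8 + 0.1*0.1) = 0.365
```

Unlike the mixture of unigrams, the sum over z now sits inside the per-word term, so a single document can draw on several topics. The cost is that p(z|d) is a fitted parameter per training document, with no generative story for new documents.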
Latent Dirichlet Allocation (1)
Latent Dirichlet Allocation (LDA): extends PLSI by introducing priors on the probability distributions
Better generalisation to new documents
Latent Dirichlet Allocation (2)
Generative Process
1. Randomly choose a distribution over topics
2. For each word in the document:
   - Randomly choose a topic from the distribution over topics in step 1
   - Randomly choose a word from the corresponding distribution over the vocabulary
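The generative process above can be simulated directly; the sizes and hyperparameter values below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (assumed values).
K, V, n_words = 3, 8, 10
alpha = np.full(K, 0.5)                          # Dirichlet prior on topic weights
topics = rng.dirichlet(np.full(V, 0.1), size=K)  # per-topic word distributions

def generate_document():
    theta = rng.dirichlet(alpha)          # step 1: distribution over topics
    words = []
    for _ in range(n_words):              # step 2: for each word
        z = rng.choice(K, p=theta)        #   choose a topic from theta
        w = rng.choice(V, p=topics[z])    #   choose a word from that topic
        words.append(int(w))
    return words

doc = generate_document()
```

The Dirichlet draws are where LDA departs from PLSI: both theta and the topics themselves are random variables with priors, so the same process applies to unseen documents.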
Latent Dirichlet Allocation (3)
Graphical Model
Latent Dirichlet Allocation (4)
Posterior Computation
Posterior is intractable - need to approximate it
Variational inference
MCMC - Gibbs sampling
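As a sketch of the MCMC option, a minimal collapsed Gibbs sampler for LDA on an assumed toy corpus (word-id lists and hyperparameters are illustrative, and this particular update rule is the standard collapsed sampler rather than anything specific to these slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy corpus of word-id lists and hyperparameters (assumed values).
docs = [[0, 0, 1, 2], [2, 3, 3, 1], [0, 1, 3, 3]]
K, V = 2, 4             # number of topics, vocabulary size
alpha, beta = 0.5, 0.1  # symmetric Dirichlet hyperparameters

# Random initial topic assignment for every token, plus the count tables
# that collapsed Gibbs sampling maintains.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
ndk = np.zeros((len(docs), K))  # tokens in document d assigned to topic k
nkw = np.zeros((K, V))          # occurrences of word w under topic k
nk = np.zeros(K)                # total tokens assigned to topic k
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndk[d, z[d][i]] += 1
        nkw[z[d][i], w] += 1
        nk[z[d][i]] += 1

for _ in range(50):  # Gibbs sweeps over all tokens
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1  # remove this token
            # p(z_i = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1  # add it back
```

Each token is resampled conditioned on all the others via the count tables, avoiding explicit representation of theta and the topics; after burn-in, the assignments are samples from the posterior.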
Latent Dirichlet Allocation (5)
Extensions
LDA can be extended by relaxing some of the original assumptions
Bag-of-words: Not suitable for language generation
Solution: Integrating syntax
Bag-of-documents: Not suitable for chronologically ordered documents
Solution: Dynamic topic models
Number of topics: Assumed to be known and fixed
Solution: Bayesian nonparametric topic models
Latent Dirichlet Allocation (6)
Open Issues
Evaluation and model checking: Interpretability over goodness of fit.
Visualization and user interfaces: Intuitive ways to visualise topics.
Data discovery: Seeking the help of domain experts.