Latent Dirichlet Allocation (LDA) for Topic Modeling
PSTAT 226 Project Presentation
Presenter: Sina Miran
Outline

1. Introduction
2. Model Definition
3. Parameter Estimation and Inference
4. Example Output and Simulation
5. References
1. Introduction
As more information becomes available, it becomes more difficult to find and discover what we need.

We need tools to help us organize, search, and understand these vast amounts of information.

Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives:

1. Discover the hidden themes in the collection
2. Annotate the documents according to these themes
3. Use annotations to organize, summarize, search, and form predictions
Today, these large collections of data call for unsupervised probabilistic models.
Example Applications:
1. Summarizing collections of images
2. Evolution of pervasive topics over time
2. Model Definition
Some Assumptions:
• We have a set of documents D_1, D_2, …, D_D.

• Each document is just a collection of words, or a "bag of words". Thus, the order of the words and the grammatical role of the words (subject, object, verb, …) are not considered in the model.

• Words like am/is/are/of/a/the/but/… can be eliminated from the documents as a preprocessing step, since they don't carry any information about the "topics".

• In fact, we can eliminate words that occur in at least 80%–90% of the documents!

• Each document is composed of N "important" or "effective" words, and we want K topics.
• Each topic is a distribution over words.

• Each document is a mixture of corpus-wide topics.

• Each word is drawn from one of these topics.

• We only observe the words within the documents; the other structures are hidden variables.
Our goal is to infer or estimate the hidden variables, i.e. to compute their distribution conditioned on the documents: p(topics, proportions, assignments | documents)
• Nodes are RVs; edges indicate dependence.

• Shaded nodes are observed, and unshaded nodes are hidden.

• Plates indicate replicated variables.
1) Draw each topic β_k ~ Dir(η) for k = 1, …, K

2) For each document d:
   First, draw topic proportions θ_d ~ Dir(α)
   For each word n within the document:
   a) Draw z_{d,n} ~ Mult(θ_d)
   b) Draw w_{d,n} ~ Mult(β_{z_{d,n}})
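The generative process above can be simulated directly. A minimal numpy sketch, with made-up toy sizes (K = 3 topics, V = 8 vocabulary words, D = 5 documents of N = 20 words) and assumed hyperparameter values for η and α:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 8, 5, 20        # topics, vocabulary size, documents, words per doc
eta, alpha = 0.5, 0.1           # Dirichlet hyperparameters (assumed toy values)

# 1) Draw each topic beta_k ~ Dir(eta): a distribution over the V vocabulary words
beta = rng.dirichlet(np.full(V, eta), size=K)   # shape (K, V); rows sum to 1

docs = []
for d in range(D):
    # 2) Draw per-document topic proportions theta_d ~ Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta)              # a) z_{d,n} ~ Mult(theta_d)
        w = rng.choice(V, p=beta[z])            # b) w_{d,n} ~ Mult(beta_{z_{d,n}})
        words.append(w)
    docs.append(words)

print(len(docs), len(docs[0]))  # 5 documents of 20 word ids each
```

With a small α, each simulated document tends to use only one or two of the K topics, which is exactly the "mixture of corpus-wide topics" picture.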
This joint distribution defines a posterior p(θ, z, β | w).

From a collection of documents we have to infer:
1. Per-word topic assignments z_{d,n}
2. Per-document topic proportions θ_d
3. Per-corpus topic distributions β_k

Then use posterior expectations (E[β | w] for the corpus, E[θ_d | w] for each document) to perform the task at hand: information retrieval, document similarity, exploration, and others.
Formal definition of the model:
p(β, θ, z, w) = ∏_{k=1}^{K} p(β_k | η) · ∏_{d=1}^{D} [ p(θ_d | α) · ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n}) ]

β_k | η ~ Dir(η)    θ_d | α ~ Dir(α)    z_{d,n} ~ Mult(θ_d)    w_{d,n} ~ Mult(β_{z_{d,n}})

p(z_{d,n} = k | θ_d) = θ_{d,k}    p(w_{d,n} | z_{d,n}, β_{1:K}) = β_{z_{d,n}, w_{d,n}}

θ_d: topic proportions for each document    β_k: word probabilities for each topic
Review of Multinomial and Dirichlet distributions:
1. Multinomial:

P(X_1 = x_1, …, X_K = x_K) = n! / (x_1! ⋯ x_K!) · p_1^{x_1} ⋯ p_K^{x_K},    x_i ∈ {0, …, n},    ∑_{i=1}^{K} x_i = n
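As a sanity check on the multinomial pmf, we can code it up and verify that the probabilities over all count vectors summing to n add to 1 (by the multinomial theorem). A small sketch with assumed toy values n = 4 and p = (0.2, 0.3, 0.5):

```python
from math import factorial
from itertools import product

def multinomial_pmf(xs, ps, n):
    """P(X_1=x_1,...,X_K=x_K) = n!/(x_1!...x_K!) * p_1^x_1 ... p_K^x_K"""
    coef = factorial(n)
    for x in xs:
        coef //= factorial(x)          # multinomial coefficient n!/(x_1!...x_K!)
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p ** x
    return prob

# Enumerate every count vector (x_1, x_2, x_3) with x_1 + x_2 + x_3 = n
n, ps = 4, [0.2, 0.3, 0.5]
total = sum(multinomial_pmf(xs, ps, n)
            for xs in product(range(n + 1), repeat=3) if sum(xs) == n)
print(round(total, 10))  # 1.0
```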
2. Dirichlet: good for modeling a distribution over distributions

p(θ | α) = Γ(∑_{i=1}^{K} α_i) / ∏_{i=1}^{K} Γ(α_i) · θ_1^{α_1 − 1} ⋯ θ_K^{α_K − 1}

α = K-dimensional parameter vector with α_i > 0

θ is defined on the (K − 1)-simplex: θ_i > 0 and ∑_{i=1}^{K} θ_i = 1
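The "distribution over distributions" point is easy to see by sampling: every draw from a Dirichlet is itself a valid probability vector. A quick numpy check, with an assumed parameter vector α = (2, 5, 3):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([2.0, 5.0, 3.0])          # K = 3 parameter vector, alpha_i > 0
thetas = rng.dirichlet(alpha, size=10000)  # each row is itself a probability vector

# Every draw lies on the (K-1)-simplex: theta_i > 0 and the entries sum to 1
print(thetas.min() > 0, np.allclose(thetas.sum(axis=1), 1.0))
```

Each sampled row could be used directly as the p-vector of a multinomial, which is exactly how the Dirichlet enters the LDA model.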
Role of parameter α = (α_1, …, α_K):    E[θ_i | α] = α_i / ∑_j α_j

Note that here we are working with symmetric (exchangeable) Dirichlet distributions, meaning α_1 = ⋯ = α_K.
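An empirical check of this expectation for the symmetric case: the mean of θ_i stays at 1/K for every choice of α_i, while α_i controls how concentrated the draws are around it. A numpy sketch with an assumed K = 4:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
means, spreads = {}, {}

# Symmetric (exchangeable) Dirichlet: alpha_1 = ... = alpha_K = a
for a in (0.1, 1.0, 10.0):
    thetas = rng.dirichlet(np.full(K, a), size=50000)
    means[a] = thetas.mean(axis=0)            # E[theta_i | alpha] = 1/K regardless of a
    spreads[a] = thetas.std(axis=0).mean()    # but the spread shrinks as a grows
    print(a, means[a].round(2), round(spreads[a], 2))
```

Small a (e.g. 0.1) yields sparse draws near the corners of the simplex; large a (e.g. 10) yields draws near the uniform center, which is the pattern in the Dirichlet figures for α_i = 0.1, 1, 10.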
[Figure: draws from a symmetric Dirichlet for α_i = 0.1, α_i = 1, and α_i = 10]
3. Parameter Estimation and Inference
As mentioned earlier, from a set of D documents and the observed words within each document, we want to infer the posterior distribution p(θ, z, β | w) (Bayesian inference).

There are many approximate posterior inference algorithms for this! We will briefly review Gibbs sampling here as an example.
Gibbs Sampling for Bayesian Inference:
Denote Φ as the collection of model parameters and X as the observed data. For Bayesian inference:

p(Φ | X) = p(X | Φ) p(Φ) / p(X) = p(X | Φ) p(Φ) / ∫ p(X | Φ) p(Φ) dΦ
Computing the integral in the denominator is impractical → Gibbs sampling.
Simple Gibbs Sampling:

Suppose you wish to sample (X_1, X_2) ~ p(x_1, x_2) but cannot use direct simulation or some other methods.

But you can sample from p(x_1 | x_2) and p(x_2 | x_1). Then you can use Gibbs sampling.

Algorithm:
1. Initialize x_1^(0), x_2^(0)
2. Repeat the following steps consecutively to compute x_1^(t), x_2^(t):
   a) Sample x_1^(t) ~ p(x_1 | x_2^(t−1))
   b) Sample x_2^(t) ~ p(x_2 | x_1^(t))
Theorem: X^(t) = (x_1^(t), x_2^(t)) converges in distribution to X = (X_1, X_2) ~ p(x_1, x_2).

Example: bivariate normal X ~ N_2(0, Σ) where Σ = [[1, ρ], [ρ, 1]].

We can show that X_1 | X_2 ~ N(ρ X_2, 1 − ρ²) and X_2 | X_1 ~ N(ρ X_1, 1 − ρ²).
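The bivariate normal example can be run directly, since both conditionals are one-dimensional normals. A numpy sketch with an assumed ρ = 0.8 and a short burn-in:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8
T, burn = 20000, 1000
sd = np.sqrt(1 - rho ** 2)             # conditional std dev, same for both steps

x1, x2 = 0.0, 0.0                      # step 1: initialize x_1^(0), x_2^(0)
samples = []
for t in range(T):
    x1 = rng.normal(rho * x2, sd)      # step 2a: x_1 ~ N(rho*x_2, 1 - rho^2)
    x2 = rng.normal(rho * x1, sd)      # step 2b: x_2 ~ N(rho*x_1, 1 - rho^2)
    samples.append((x1, x2))

chain = np.array(samples[burn:])       # discard burn-in draws
corr = np.corrcoef(chain.T)[0, 1]
print(round(corr, 2))                  # empirical correlation, close to rho
```

The empirical correlation of the retained draws recovers ρ, illustrating the convergence theorem.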
Applying this technique to estimating p(θ, z | w, β), we can form a simple Gibbs sampler:

1. p(θ | z_{1:N}, w_{1:N}) = Dir(α + n(z_{1:N})), where n(·) is a vector of the count of each topic (this is because of the choice of conjugate priors!)

2. p(z_i | z_{−i}, θ, w_{1:N}) ∝ p(z_i | θ) p(w_i | β_{1:K}, z_i)

We have assumed β_{1:K} is fixed in the above iterations! We can use a more complex Gibbs sampler to infer β_{1:K} as well, or we can use other, more efficient methods like mean-field variational inference.
Now you have samples from the inferred posterior distribution for Bayesian Inference! Enjoy!
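The two-step sampler above can be sketched for a single toy document. Everything here is an assumed illustration: K = 2 fixed topics β over a V = 4 word vocabulary, and one observed document w; the sampler alternates between resampling z given θ and resampling θ given z:

```python
import numpy as np

rng = np.random.default_rng(4)

K, V = 2, 4
alpha = np.full(K, 1.0)
# Topics beta_{1:K} assumed fixed and known, as in the iterations above
beta = np.array([[0.4, 0.4, 0.1, 0.1],    # topic 1 favors words 0 and 1
                 [0.1, 0.1, 0.4, 0.4]])   # topic 2 favors words 2 and 3
w = np.array([0, 0, 1, 2, 3, 3, 3, 3])    # one document's observed word ids
N = len(w)

theta = rng.dirichlet(alpha)
theta_draws = []
for t in range(5000):
    # Step 2: p(z_i | theta, w_i) ∝ p(z_i|theta) p(w_i|beta, z_i) = theta_k * beta_{k, w_i}
    probs = theta[:, None] * beta[:, w]          # shape (K, N)
    probs /= probs.sum(axis=0)
    z = np.array([rng.choice(K, p=probs[:, i]) for i in range(N)])
    # Step 1: p(theta | z) = Dir(alpha + n(z)) by Dirichlet-multinomial conjugacy
    counts = np.bincount(z, minlength=K)
    theta = rng.dirichlet(alpha + counts)
    theta_draws.append(theta)

post_mean = np.mean(theta_draws[1000:], axis=0)  # posterior mean E[theta | w, beta]
print(post_mean.round(2))
```

Since five of the eight words favor topic 2, the posterior mean of θ puts more mass on topic 2, as expected.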
4. Example Output and Simulation
Text Mining package in R: library("tm")

Topic Models package in R: library("topicmodels")

We can perform preprocessing steps using the function tm_map in the tm package, such as removing unimportant words (stopwords, …).

The LDA function in the topicmodels package can fit the LDA model for a specific number of topics K.

R > LDA(x, k, method = "VEM", control = NULL, model = NULL, ...)
Data: collection of Science articles from 1990–2000
- 17K documents
- 11M words
- 20K unique terms (redundant words removed)

Model: 100-topic LDA model fit using variational inference
Most probable words in four of the topics:

[Figure: word-probability bar plots for Topic 1, Topic 2, Topic 3, and Topic 4]
5. References

[1] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993–1022.

[2] Homepage of David Blei, Associate Professor of Computer Science at Princeton University: http://www.cs.princeton.edu/~blei/topicmodeling.html

[3] Video lectures of David Blei on videolectures.net: http://videolectures.net/mlss09uk_blei_tm/
Questions?!