Topic Modeling using Latent Dirichlet Allocation
Topic Modeling
• The process of analyzing a large collection of documents to discover the latent topics it contains.
• Helps organize and structure the documents
• Discovers the different topics that a document contains
• Shows how similar certain documents are
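To make the similarity point concrete, here is a minimal sketch that assumes each document has already been summarized as a vector of topic proportions. The Hellinger distance is one common choice for comparing probability vectors, not something the slides prescribe, and the example vectors are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two probability vectors (0 = identical, 1 = disjoint)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

# Hypothetical topic proportions for three documents over three topics.
doc_a = [0.80, 0.15, 0.05]
doc_b = [0.70, 0.20, 0.10]
doc_c = [0.05, 0.10, 0.85]

dist_ab = hellinger(doc_a, doc_b)  # small: similar topic mixtures
dist_ac = hellinger(doc_a, doc_c)  # large: different topic mixtures
```

Documents whose topic mixtures are close get a small distance, so the same function can be used to rank documents by similarity.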
Latent Dirichlet Allocation (LDA)
• It is an unsupervised learning method
• Produces a generative model
Terminology
• Word: w ∈ {1,…,V}
• Document: Sequence of N words
• Corpus: a set of M documents
• Topic: z ∈ {1,…,K}
Topic
A topic is a probability distribution over the vocabulary; its high-probability terms are words that tend to co-occur.
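Generative Process
For each document in the corpus:
1. Choose N based on a Poisson distribution (N is the number of words in the document)
2. Choose θ based on a Dirichlet distribution (θ is the document's topic weight vector)
3. For each of the N words:
   1. Choose a topic z from the multinomial distribution given by θ
   2. Choose a word w from the word distribution of topic z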
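The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the names `alpha` (Dirichlet parameter), `beta` (one word distribution per topic), `mean_length` (Poisson mean) and the toy vocabulary are hypothetical, not taken from the slides.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def generate_document(alpha, beta, vocab, mean_length, rng):
    """Generate one document following the three steps of the generative process."""
    n_words = sample_poisson(mean_length, rng)        # step 1: N ~ Poisson
    theta = sample_dirichlet(alpha, rng)              # step 2: theta ~ Dirichlet
    topic_ids = list(range(len(alpha)))
    assignments, words = [], []
    for _ in range(n_words):                          # step 3: one draw per word
        z = rng.choices(topic_ids, weights=theta)[0]  # 3.1: topic z from theta
        w = rng.choices(vocab, weights=beta[z])[0]    # 3.2: word w from topic z
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

# Toy setup (assumed): two topics over a six-word vocabulary.
rng = random.Random(0)
vocab = ["gene", "dna", "cell", "ball", "team", "goal"]
beta = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0 only emits biology words
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # topic 1 only emits sports words
]
theta, assignments, words = generate_document([0.5, 0.5], beta, vocab, 8, rng)
```

A small `alpha` such as 0.5 tends to produce documents dominated by one topic, while larger values give more even mixtures.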
Learning
• Variational Bayes
• Gibbs Sampling
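As a rough illustration of the Gibbs sampling option, the sketch below implements a collapsed Gibbs sampler, a commonly used variant that resamples each word's topic from its conditional distribution given all other assignments. The function name `gibbs_lda`, the symmetric priors `alpha` and `eta`, and the toy corpus are assumptions made for this example; the slides do not specify a particular sampler.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    topic_ids = list(range(n_topics))
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                              # total words assigned to topic k

    # Start from a random topic assignment for every word.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in topic_ids
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(topic_ids, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus (assumed): word ids 0-2 and 3-5 never appear in the same document.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 4, 3], [0, 2, 1, 2], [5, 3, 4, 5]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
```

After sampling, a document's topic weights can be estimated by normalizing its row of `doc_topic` (after adding `alpha` to each count), and each topic's word distribution by normalizing its row of `topic_word` (after adding `eta`).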
Applications of LDA
• Collaborative Filtering
• Spam Detection
• Music
• Image
References
D M Blei, A Y Ng, M I Jordan. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
D J Hu. (2009). Latent Dirichlet Allocation for text, images, and music.