Topic Extraction using Machine Learning

Sanjib Basak
Director of Data Science, Digital River
Jan 2016

Twin Cities Big Data Analytics and Apache Spark User Group meetup

Agenda
• History of Topic Models
• A Use Case
• Demo using R
• Demo using Spark
• Conclusion

History of Topic Modeling
• TF-IDF model (Salton and McGill, 1983)
• A basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word (TF)
• After suitable normalization, this term frequency count is compared to an inverse document frequency (IDF) count, which measures the number of occurrences of a word in the entire corpus
• Not a generative model
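The TF-IDF weighting described above can be sketched in a few lines of Python. This is only an illustration, not the code from the demo: the function name `tf_idf`, the toy documents, and the specific choices (term frequency normalized by document length, IDF as the log of the number of documents divided by the number of documents containing the term) are assumptions; other normalizations are common.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (illustrative variant)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Normalized term frequency times log-scaled inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [
    "spark makes big data simple".split(),
    "topic models find topics in data".split(),
    "spark runs topic models".split(),
]
weights = tf_idf(docs)
```

In this variant a term that appears in every document gets weight zero, while a term confined to one document gets the largest weight for a given count.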

TF-IDF

History of Topic Modeling
• To address the shortcomings of TF-IDF, Deerwester et al. (1990) came up with the LSI (Latent Semantic Indexing) model
• LSI uses a singular value decomposition of the term-document matrix to identify a linear subspace in the space of TF-IDF features that captures most of the variance in the collection
• They claim that the model can capture some aspects of basic linguistic notions such as synonymy and polysemy
• Still not a useful model for capturing the distribution of words

LSI

pLSI
• Hofmann (1999) presented the probabilistic Latent Semantic Indexing (pLSI) model, also known as the aspect model, as an alternative to LSI
• Models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of “topics”
• The model is still incomplete
  • Not a probabilistic model at the level of documents
  • Each document is represented as a list of numbers (the mixing proportions for topics)

History of Topic Modeling

• De Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. This line of thinking leads to the latent Dirichlet allocation (LDA) model
• Blei, Ng and Jordan (2003) presented LDA
• Hierarchical Bayesian model: each item of a collection (for example, a document) is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
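To make the "mixture over topics" idea concrete, here is a rough sketch of how LDA assumes a single document is generated: draw topic proportions from a Dirichlet, then for each word draw a topic and then a word from that topic. The two topics, their word probabilities, the `alpha` values, and the helper names are all made up for illustration; a real model learns the topics from data instead of fixing them by hand.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """topics: list of {word: probability} dicts; alpha: one Dirichlet parameter per topic."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word from the chosen topic's word distribution.
        vocab = list(topics[z])
        word = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append(word)
    return theta, words

# Hypothetical hand-written topics, for illustration only.
topics = [
    {"spark": 0.5, "scala": 0.3, "cluster": 0.2},
    {"topic": 0.4, "word": 0.4, "corpus": 0.2},
]
rng = random.Random(42)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Inference runs this story in reverse: given only the words, it estimates the topic proportions and the topics themselves.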

History of Topic Modeling

LDA (figure taken from Wikipedia)

LDA
• The original paper used a variational Bayes approximation of the posterior distribution
• Alternative inference techniques use Gibbs sampling, expectation maximization (EM), online variational inference, and many more

Model Workflow
• Step 1: Preprocessing
• Step 2: Create Document Term Matrix
• Step 3: Apply Models
• Review Results
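Steps 1 and 2 of the workflow can be sketched as follows. This is a minimal illustration rather than the demo code: the stop-word list, the tokenization rule (lowercase alphabetic runs), the function names, and the two sample texts are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def preprocess(text):
    """Step 1: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_term_matrix(texts):
    """Step 2: one row per document, one column per vocabulary term, cells are counts."""
    tokenized = [preprocess(t) for t in texts]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical sample texts.
texts = ["The history of topic models.", "Topic models in Spark and R!"]
vocabulary, matrix = document_term_matrix(texts)
```

The resulting matrix is the input that Step 3 hands to a model such as K-Means or LDA.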

K-Means
• Choose the number of clusters (K)
• Initialize the clusters: pick one observation as the centroid of each cluster
• Assign each observation to the cluster whose centroid is closest
• Revise each cluster center as the mean of its assigned observations
• Repeat the assignment and revision steps until convergence
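The steps above can be sketched in plain Python. This is a bare-bones illustration, not the R code from the demo: the function names, the convention of passing the starting observations explicitly, and the toy 2-D points are assumptions, and production code would normally rely on a library implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, max_iter=100):
    """K is the number of initial centroids, each taken from the observations."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignments = None
    for _ in range(max_iter):
        # Assign each observation to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Converged when no observation changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Revise each centroid as the mean of its assigned observations.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = k_means(points, initial_centroids=[points[0], points[3]])
```

Each observation ends up in exactly one cluster, which is the "distinct topics" behavior that the conclusion contrasts with LDA.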

Demo in R
• Use Case
• Model with K-Means
• Model with LDA and visualization
• GitHub code location: https://github.com/sanjibb/R-Code

K-Means Result

Experimentation with Spark MLlib
• Work with the dataset in Scala
• 2 variations of the optimization model:
  • EM optimizer
  • Online variational inference: http://www.cs.columbia.edu/~blei/papers/WangPaisleyBlei2011.pdf
• GitHub code location: https://github.com/sanjibb/spark_example

https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf





Conclusion
1. LDA provides a mixture of topics over the words, whereas K-Means provides distinct topics
   1. In real life, topics may not be distinctly separated
2. An unsupervised LDA model may require working with SMEs to get a better representation of topics
   1. There is also a supervised LDA model (sLDA), which I have not covered in this presentation

Bibliography
• https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
• https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
• http://vis.stanford.edu/files/2012-Termite-AVI.pdf
• http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf


