TRANSCRIPT
CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou Sun, [email protected]
December 4, 2017
Text Data: Topic Model
Methods to be Learnt
• Classification
 • Vector Data: Logistic Regression; Decision Tree; KNN; SVM; NN
 • Text Data: Naïve Bayes for Text
• Clustering
 • Vector Data: K-means; hierarchical clustering; DBSCAN; Mixture Models
 • Text Data: pLSA
• Prediction
 • Vector Data: Linear Regression; GLM*
• Frequent Pattern Mining
 • Set Data: Apriori; FP-growth
 • Sequence Data: GSP; PrefixSpan
• Similarity Search
 • Sequence Data: DTW
Text Data: Topic Models
•Text Data and Topic Models
•Revisit of Mixture Model
•Probabilistic Latent Semantic Analysis (pLSA)
•Summary
Text Data
•Word/term
•Document
• A sequence of words
•Corpus
• A collection of documents
Represent a Document
•Most common way: Bag-of-Words
• Ignore the order of words
• Keep the counts
(Figure: example term-document count matrix over documents c1–c5 and m1–m4)
Vector space model
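To make the bag-of-words idea concrete, here is a minimal Python sketch (my own illustration, with a made-up toy corpus rather than the c1–m4 documents from the slide) that builds count vectors over a shared vocabulary:

```python
from collections import Counter

# Toy corpus; documents and vocabulary are made up for illustration.
docs = [
    "data mining mining frequent pattern",
    "web information retrieval information",
]

# Build the vocabulary from the corpus (order of words is ignored).
vocab = sorted({w for d in docs for w in d.split()})

# Bag-of-words: one count vector per document, c(w, d).
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)   # ['data', 'frequent', 'information', 'mining', 'pattern', 'retrieval', 'web']
print(bow)     # [[1, 1, 0, 2, 1, 0, 0], [0, 0, 2, 0, 0, 1, 1]]
```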
Topics
•Topic
• A topic is represented by a word distribution
• Relates to an issue (e.g., a "sports" topic would put high probability on words such as game, team, and score)
Topic Models
• Topic modeling
 • Get topics automatically from a corpus
 • Assign documents to topics automatically
• Most frequently used topic models
 • pLSA
 • LDA
Text Data: Topic Models
•Text Data and Topic Models
•Revisit of Mixture Model
•Probabilistic Latent Semantic Analysis (pLSA)
•Summary
Mixture Model-Based Clustering
• A set $\mathcal{C}$ of $k$ probabilistic clusters $C_1, \dots, C_k$
 • Probability density/mass functions: $f_1, \dots, f_k$
 • Cluster prior probabilities: $w_1, \dots, w_k$, with $\sum_j w_j = 1$
• Joint probability of an object $x_i$ and its cluster $C_j$:
 • $P(x_i, z_i = C_j) = w_j f_j(x_i)$
 • $z_i$: hidden random variable
• Probability of $x_i$:
 • $P(x_i) = \sum_j w_j f_j(x_i)$
(Figure: two component densities $f_1(x)$ and $f_2(x)$)
Maximum Likelihood Estimation
• Since objects are assumed to be generated independently, for a data set $D = \{x_1, \dots, x_n\}$, we have
 $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
 $\Rightarrow \log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
• Task: find a set $\mathcal{C}$ of $k$ probabilistic clusters such that $P(D)$ is maximized
Gaussian Mixture Model
•Generative model
• For each object:
 • Pick its cluster, i.e., a distribution component: $Z \sim \mathrm{Multinoulli}(w_1, \dots, w_k)$
 • Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
•Overall likelihood function
 • $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \geq 0$
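A small numpy/scipy sketch of this generative story (my own illustration with made-up parameters, not course code): pick a component $Z$ from the priors, sample $X$ from the chosen Gaussian, and evaluate the mixture log-likelihood defined above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative parameters (made up): 2 components.
w = np.array([0.6, 0.4])          # cluster priors, sum to 1
mu = np.array([0.0, 5.0])         # component means
sigma = np.array([1.0, 0.5])      # component standard deviations

# Generative model: for each object, pick Z ~ Multinoulli(w), then X | Z ~ N(mu_Z, sigma_Z^2).
n = 1000
z = rng.choice(len(w), size=n, p=w)
x = rng.normal(mu[z], sigma[z])

# Mixture log-likelihood: log P(D) = sum_i log sum_j w_j N(x_i | mu_j, sigma_j^2).
densities = w * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (n, k)
log_likelihood = np.log(densities.sum(axis=1)).sum()
print(log_likelihood)
```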
Multinomial Mixture Model
• For documents with bag-of-words representation
 • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word of the vocabulary in document $d$
• Generative model: for each document
 • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
  • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the $k$th cluster
  • $p(z = k) = \pi_k$
 • Sample its word vector $\boldsymbol{x}_d \sim \mathrm{Multinomial}(\boldsymbol{\beta}_z)$
  • $\boldsymbol{\beta}_z = (\beta_{z1}, \beta_{z2}, \dots, \beta_{zN})$, where $\beta_{zn}$ is the parameter associated with the $n$th word in the vocabulary
  • $p(\boldsymbol{x}_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}} \propto \prod_n \beta_{kn}^{x_{dn}}$
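As a quick illustrative check with made-up numbers: take a three-word vocabulary, $\boldsymbol{\beta}_k = (0.5, 0.3, 0.2)$, and counts $\boldsymbol{x}_d = (2, 1, 0)$; then
 $p(\boldsymbol{x}_d \mid z = k) = \frac{3!}{2!\,1!\,0!}\, 0.5^{2} \cdot 0.3^{1} \cdot 0.2^{0} = 3 \times 0.075 = 0.225 \;\propto\; 0.5^{2} \cdot 0.3^{1} \cdot 0.2^{0}$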
Likelihood Function
•For a set of M documents
 $L = \prod_d p(\boldsymbol{x}_d) = \prod_d \sum_k p(\boldsymbol{x}_d, z = k)$
 $= \prod_d \sum_k p(\boldsymbol{x}_d \mid z = k)\, p(z = k)$
 $\propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$
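A short numpy sketch of evaluating this likelihood in log space, dropping the multinomial coefficient since it does not depend on the parameters; the cluster proportions, word distributions, and count vectors below are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Illustrative parameters (made up): K = 2 clusters, vocabulary of 4 words.
pi = np.array([0.5, 0.5])                      # cluster proportions pi_k
beta = np.array([[0.4, 0.3, 0.2, 0.1],         # word distributions beta_k, rows sum to 1
                 [0.1, 0.1, 0.3, 0.5]])
X = np.array([[3, 2, 0, 1],                    # count vectors x_d (one row per document)
              [0, 1, 4, 2]])

# log p(x_d | z = k), up to the multinomial coefficient: sum_n x_dn * log(beta_kn)
log_p_x_given_z = X @ np.log(beta).T           # shape (M documents, K clusters)

# log L (up to that constant) = sum_d log sum_k pi_k * p(x_d | z = k)
log_likelihood = np.logaddexp.reduce(np.log(pi) + log_p_x_given_z, axis=1).sum()
print(log_likelihood)
```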
Mixture of Unigrams
• For documents represented by a sequence of words
 • $\boldsymbol{w}_d = (w_{d1}, w_{d2}, \dots, w_{dN_d})$, where $N_d$ is the length of document $d$ and $w_{dn}$ is the word at the $n$th position of the document
• Generative model: for each document
 • Sample its cluster label $z \sim \mathrm{Multinoulli}(\boldsymbol{\pi})$
  • $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_K)$, where $\pi_k$ is the proportion of the $k$th cluster
  • $p(z = k) = \pi_k$
 • For each word in the sequence
  • Sample the word $w_{dn} \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_z)$
  • $p(w_{dn} \mid z = k) = \beta_{k w_{dn}}$
Likelihood Function
•For a set of M documents
 $L = \prod_d p(\boldsymbol{w}_d) = \prod_d \sum_k p(\boldsymbol{w}_d, z = k)$
 $= \prod_d \sum_k p(\boldsymbol{w}_d \mid z = k)\, p(z = k)$
 $= \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$
Question
•Are the multinomial mixture model and the mixture-of-unigrams model equivalent? Why?
Text Data: Topic Models
•Text Data and Topic Models
•Revisit of Mixture Model
•Probabilistic Latent Semantic Analysis (pLSA)
•Summary
Notations
•Word, document, topic
 • $w$, $d$, $z$
•Word count in document
 • $c(w, d)$
•Word distribution for each topic ($\boldsymbol{\beta}_z$)
 • $\beta_{zw} = p(w \mid z)$
•Topic distribution for each document ($\boldsymbol{\theta}_d$)
 • $\theta_{dz} = p(z \mid d)$ (yes, this is soft clustering)
Issues of Mixture of Unigrams
•All the words in the same document are sampled from the same topic
• In practice, people switch topics during their writing
Illustration of pLSA
Generative Model for pLSA
•Describe how a document is generated probabilistically
• For each position in $d$, $n = 1, \dots, N_d$
 • Generate the topic for the position as $z_n \sim \mathrm{Multinoulli}(\boldsymbol{\theta}_d)$, i.e., $p(z_n = k) = \theta_{dk}$
 (Note: a 1-trial multinomial, i.e., a categorical distribution)
 • Generate the word for the position as $w_n \sim \mathrm{Multinoulli}(\boldsymbol{\beta}_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
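A toy numpy sketch of this generative process for one document (my own illustration; the vocabulary and the values of $\boldsymbol{\theta}_d$ and $\boldsymbol{\beta}$ are made up): at every position, draw a topic from $\boldsymbol{\theta}_d$, then draw a word from the chosen topic's word distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["data", "mining", "web", "retrieval"]      # made-up vocabulary
theta_d = np.array([0.7, 0.3])                      # topic distribution of document d
beta = np.array([[0.5, 0.4, 0.05, 0.05],            # word distribution of topic 1
                 [0.05, 0.05, 0.5, 0.4]])           # word distribution of topic 2

N_d = 10                                            # document length
words = []
for _ in range(N_d):
    z = rng.choice(len(theta_d), p=theta_d)         # z_n ~ Multinoulli(theta_d)
    w = rng.choice(len(vocab), p=beta[z])           # w_n ~ Multinoulli(beta_{z_n})
    words.append(vocab[w])

print(" ".join(words))
```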
Graphical Model
Note: sometimes people also add the parameters $\theta$ and $\beta$ into the graphical model
The Likelihood Function for a Corpus
•Probability of a word
 $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$
•Likelihood of a corpus
 $L = \prod_d p(d) \prod_w p(w \mid d)^{c(w,d)} = \prod_d \pi_d \prod_w \Big(\sum_k \beta_{kw} \theta_{dk}\Big)^{c(w,d)}$
 ($\pi_d = p(d)$ is usually considered as uniform, i.e., $1/M$)
Re-arrange the Likelihood Function
•Group the same word from different positions together
 $\max \log L = \sum_d \sum_w c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
 $\text{s.t. } \sum_z \theta_{dz} = 1 \text{ and } \sum_w \beta_{zw} = 1$
Optimization: EM Algorithm
• Repeat until converge
• E-step: for each word in each document, calculate its conditional probability of belonging to each topic
 $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz}$, i.e., $p(z \mid w, d) = \dfrac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$
• M-step: given the conditional distribution, find the parameters that maximize the expected likelihood
 $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$, i.e., $\beta_{zw} = \dfrac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}$
 $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$, i.e., $\theta_{dz} = \dfrac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$
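A compact numpy sketch of this EM loop on a document-word count matrix (a minimal illustration of the updates above, with random initialization and a fixed number of iterations; it is not code provided by the course):

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, seed=0):
    """counts: (M documents x V words) matrix of c(w, d); K: number of topics."""
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    theta = rng.random((M, K)); theta /= theta.sum(axis=1, keepdims=True)   # p(z | d)
    beta = rng.random((K, V)); beta /= beta.sum(axis=1, keepdims=True)      # p(w | z)

    for _ in range(n_iter):
        # E-step: p(z | w, d) proportional to beta_zw * theta_dz, for every (d, w) pair.
        post = theta[:, :, None] * beta[None, :, :]          # shape (M, K, V)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate parameters from expected counts p(z | w, d) * c(w, d).
        weighted = post * counts[:, None, :]
        beta = weighted.sum(axis=0)                          # sum over documents
        beta /= beta.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=2)                         # sum over words
        theta /= theta.sum(axis=1, keepdims=True)            # divides by N_d
    return theta, beta

# Usage on a tiny made-up corpus: 2 documents, 4-word vocabulary, 2 topics.
counts = np.array([[5.0, 4.0, 1.0, 0.0],
                   [0.0, 1.0, 3.0, 4.0]])
theta, beta = plsa_em(counts, K=2)
print(np.round(theta, 2))
print(np.round(beta, 2))
```

The E-step computes $p(z \mid w, d)$ for all (document, word) pairs at once; normalizing the summed expected counts row-wise gives exactly the $\beta_{zw}$ and $\theta_{dz}$ updates stated above.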
Example
• Two documents, two topics
• Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
• At some iteration of the EM algorithm, E-step
 (Table: E-step results $p(z \mid w, d)$ for each word in each document)
Example (Continued)
• M-step
 $\beta_{11} = \dfrac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$
 $\beta_{12} = \dfrac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$
 $\beta_{13} = 3/17.6$, $\beta_{14} = 1.6/17.6$, $\beta_{15} = 1.3/17.6$, $\beta_{16} = 1.2/17.6$, $\beta_{17} = 0.8/17.6$
 $\theta_{11} = 11.8/17$
 $\theta_{12} = 5.2/17$
Text Data: Topic Models
•Text Data and Topic Models
•Revisit of Mixture Model
•Probabilistic Latent Semantic Analysis (pLSA)
•Summary
Summary
•Basic Concepts
• Word/term, document, corpus, topic
•Mixture of unigrams
•pLSA
• Generative model
• Likelihood function
• EM algorithm
Quiz
•Q1: Is Multinomial Naïve Bayes a linear classifier?
•Q2: In pLSA, does the same word appearing at different positions in a document have the same conditional probability $p(z \mid w, d)$?