Text-classification using Latent Dirichlet Allocation - intro graphical model (Lei Li)


Post on 20-Jan-2018


Page 1: Text-classification using Latent Dirichlet Allocation - intro graphical model Lei Li

Text-classification using Latent Dirichlet Allocation
- intro graphical model

Lei Li
leili@cs

Page 2:

Outline
• Introduction
• Unigram model and mixture
• Text classification using LDA
• Experiments
• Conclusion

Page 3:

Text Classification
What class can you tell given a doc?

• … the New York Stock Exchange … America’s Nasdaq … Buy …
• … bank debt loan interest billion buy …
• … Iraq war weapon army AK-47 bomb …

Classes: finance, military

Page 4:

Why do DB guys care?
• Could be adapted to model discrete random variables
  – Disk failures
  – User access patterns
  – Social networks, tags
  – Blogs

Page 5:

Document
• “Bag of words”: no order on words
• d = (w1, w2, …, wN)
• wi takes one value in 1…V (1-of-V scheme)
• V: vocabulary size
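
The bag-of-words encoding above can be sketched in a few lines. This is only an illustration: the helper names (`build_vocabulary`, `encode`, `one_of_v`) and the two toy documents are assumptions made for this example, not part of the slides.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to an index in 1..V (sorted for determinism)."""
    words = sorted({word for doc in documents for word in doc})
    return {word: index for index, word in enumerate(words, start=1)}


def encode(doc, vocab):
    """Represent a document as d = (w1, ..., wN), each wi an index in 1..V."""
    return [vocab[word] for word in doc]


def one_of_v(index, vocab_size):
    """1-of-V vector for a single word index (1-based)."""
    vector = [0] * vocab_size
    vector[index - 1] = 1
    return vector


# Toy documents, hypothetical and only for illustration.
docs = [["bank", "debt", "bank"], ["war", "army"]]
vocab = build_vocabulary(docs)      # V = len(vocab)
encoded = encode(docs[0], vocab)    # word indices for the first document
counts = Counter(encoded)           # word order is discarded: only counts matter
```

Because word order carries no information in this model, the `Counter` of indices is a sufficient summary of a document.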

Page 6:

Modeling Document
• Unigram: simple multinomial distribution
• Mixture of unigrams
• LDA
• Other: PLSA, bigram

Page 7:

Unigram Model for Classification

• Y is the class label
• d = {w1, w2, …, wN}
• Use Bayes' rule:

$$P(Y \mid d) = \frac{P(d \mid Y)\,P(Y)}{P(d \mid Y=1)\,P(Y=1) + P(d \mid Y=0)\,P(Y=0)}$$

• How to model the document given the class:

$$P(d \mid Y) = \prod_{i=1}^{N} P(w_i \mid Y)$$

• $P(w_i \mid Y)$ ~ multinomial distribution, estimated as word frequency

[Graphical model: Y → w, with w inside a plate of size N]
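
Estimating P(w|Y) "as word frequency" could look like the sketch below. The function name `estimate_word_probs` and the tiny training set are hypothetical. Plain relative frequencies are used, as the slide states, and no smoothing is applied, so a word never seen in a class gets no entry for that class.

```python
from collections import Counter, defaultdict


def estimate_word_probs(labeled_docs):
    """Estimate P(w | Y) as the relative frequency of w among all words of class Y."""
    counts = defaultdict(Counter)
    for label, words in labeled_docs:
        counts[label].update(words)
    probs = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        probs[label] = {word: n / total for word, n in counter.items()}
    return probs


# Hypothetical training data, only for illustration.
training = [
    ("finance", ["bank", "debt", "bank", "interest"]),
    ("military", ["war", "army", "war", "weapon"]),
]
word_probs = estimate_word_probs(training)
```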

Page 8:

Unigram: example

P(w|Y)     bank     debt     interest  war      army     weapon
finance    0.2      0.15     0.1       0.0001   0.0001   0.0001
military   0.0001   0.0001   0.0001    0.1      0.15     0.2

P(Y)
finance    0.6
military   0.4

d = bank * 100, debt * 110, interest * 130, war * 1, army * 0, weapon * 0
P(finance|d) = ?
P(military|d) = ?
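
One way to answer the question on this slide is to apply Bayes' rule with the numbers given. The sketch below works in log space, because multiplying several hundred small probabilities would underflow a float. The variable and function names are chosen for this example only.

```python
import math

# P(w | Y) and P(Y), copied from the slide.
P_WORD = {
    "finance": {"bank": 0.2, "debt": 0.15, "interest": 0.1,
                "war": 0.0001, "army": 0.0001, "weapon": 0.0001},
    "military": {"bank": 0.0001, "debt": 0.0001, "interest": 0.0001,
                 "war": 0.1, "army": 0.15, "weapon": 0.2},
}
PRIOR = {"finance": 0.6, "military": 0.4}

# Word counts of the document d from the slide.
DOC = {"bank": 100, "debt": 110, "interest": 130, "war": 1, "army": 0, "weapon": 0}


def log_joint(label, counts):
    """log P(Y) + sum over words of count(w) * log P(w | Y)."""
    total = math.log(PRIOR[label])
    for word, n in counts.items():
        total += n * math.log(P_WORD[label][word])
    return total


def posterior(counts):
    """Normalize the joint over classes, subtracting the max log for stability."""
    logs = {label: log_joint(label, counts) for label in PRIOR}
    top = max(logs.values())
    unnorm = {label: math.exp(value - top) for label, value in logs.items()}
    norm = sum(unnorm.values())
    return {label: value / norm for label, value in unnorm.items()}


post = posterior(DOC)
print(post)
```

With these numbers the document is assigned to finance almost with certainty: the single occurrence of "war" cannot outweigh 340 finance-typical words.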

Page 9:

Mixture of unigrams for classification

[Graphical model: class label Y, latent topic z, and words w inside a plate of size N]

• For each class, assume k topics
• Each topic represents a multinomial distribution
• Under each topic, each word is multinomial

Page 10:

Mixture of unigrams: example

d = bank * 100, debt * 110, interest * 130, war * 1, army * 0, weapon * 0
P(finance|d) = ?
P(military|d) = ?

P(Y)
finance    0.6
military   0.4

P(w|z,Y)            bank     debt     interest  war      army     weapon
finance, topic 1    0.01     0.15     0.1       0.0001   0.0001   0.0001
finance, topic 2    0.2      0.01     0.01      0.0001   0.0001   0.0001
military, topic 1   0.0001   0.0001   0.0001    0.1      0.15     0.01
military, topic 2   0.0001   0.0001   0.0001    0.01     0.01     0.2

P(z|Y)     topic 1  topic 2
finance    0.3      0.7
military   0.5      0.5
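
The mixture example can be worked through in the same way as the unigram one. The sketch below rests on two assumptions that the slide does not state explicitly: one topic z is drawn per document, so that P(d|Y) = Σ_z P(z|Y) Π_i P(w_i|z,Y), and within each class the first row of P(w|z,Y) pairs with the first value of P(z|Y). All names are illustrative.

```python
import math

# Counts of (bank, debt, interest, war, army, weapon) in the document d.
COUNTS = [100, 110, 130, 1, 0, 0]

# P(w | z, Y): one row per topic, rows in the order shown on the slide.
P_WORD = {
    "finance": [[0.01, 0.15, 0.1, 0.0001, 0.0001, 0.0001],
                [0.2, 0.01, 0.01, 0.0001, 0.0001, 0.0001]],
    "military": [[0.0001, 0.0001, 0.0001, 0.1, 0.15, 0.01],
                 [0.0001, 0.0001, 0.0001, 0.01, 0.01, 0.2]],
}
P_TOPIC = {"finance": [0.3, 0.7], "military": [0.5, 0.5]}  # P(z | Y)
PRIOR = {"finance": 0.6, "military": 0.4}                  # P(Y)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_joint(label):
    """log P(Y) + log sum_z P(z | Y) * prod_w P(w | z, Y)^count(w)."""
    per_topic = []
    for p_topic, p_words in zip(P_TOPIC[label], P_WORD[label]):
        log_lik = sum(n * math.log(p) for n, p in zip(COUNTS, p_words))
        per_topic.append(math.log(p_topic) + log_lik)
    return math.log(PRIOR[label]) + log_sum_exp(per_topic)


logs = {label: log_joint(label) for label in PRIOR}
top = max(logs.values())
unnorm = {label: math.exp(value - top) for label, value in logs.items()}
post = {label: value / sum(unnorm.values()) for label, value in unnorm.items()}
print(post)
```

Summing over topics inside the class is the only change from the plain unigram computation.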

Page 11:

Bayesian Network
• Given a DAG
• Nodes are random variables or parameters
• Arrows are conditional probability dependencies
• Given probabilities on some of the nodes, there are algorithms to infer values for the other nodes
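
As a brief illustration of these bullets, a Bayesian network factorizes the joint distribution into one conditional per node given its parents in the DAG. The notation pa(x_i) for the parents of x_i is introduced here and does not appear on the slides. The second equation applies the rule to the unigram classifier, where Y is the only parent of every word.

```latex
\begin{align}
P(x_1, \dots, x_n) &= \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr) \\
P(Y, w_1, \dots, w_N) &= P(Y) \prod_{i=1}^{N} P(w_i \mid Y)
\end{align}
```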

Page 12:

Latent Dirichlet Allocation

• Model θ as a Dirichlet distribution with parameter α
• For the n-th term wn:
  – Model the n-th latent variable zn as a multinomial distribution according to θ.
  – Model wn as a multinomial distribution according to zn and β.
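
The generative story above can be simulated with the standard library alone. This is a minimal sketch under stated assumptions: the Dirichlet draw is obtained by normalizing independent Gamma(α_i, 1) samples, and the values of `alpha` and `beta` below are made up for illustration (two topics over a four-word vocabulary).

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) samples."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def generate_document(alpha, beta, n_words, rng):
    """Generate word indices following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)          # per-document topic mixture
    topic_ids = list(range(len(alpha)))
    word_ids = list(range(len(beta[0])))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]    # z_n ~ Multinomial(theta)
        w = rng.choices(word_ids, weights=beta[z])[0]   # w_n ~ Multinomial(beta[z_n])
        words.append(w)
    return theta, words


# Hypothetical parameters: beta[z][w] = P(word w | topic z).
alpha = [0.5, 0.5]
beta = [[0.6, 0.3, 0.05, 0.05],
        [0.05, 0.05, 0.3, 0.6]]
rng = random.Random(0)
theta, words = generate_document(alpha, beta, n_words=20, rng=rng)
```

Unlike the mixture of unigrams, a new topic is drawn for every word, so a single document can mix several topics.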

Page 13:

Variational inference for LDA

• Direct inference with LDA is HARD
• Approximation with variational distribution
• Use a factorized distribution on variational parameters γ and Φ to approximate the posterior distribution of the latent variables θ and z.
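
For reference, the factorized distribution and the resulting coordinate updates can be written as follows. The slides do not show these formulas. They follow the Blei, Ng and Jordan paper listed in the reference, with φ written in lower case for the slide's Φ and Ψ denoting the digamma function. The two updates are alternated for each document until γ converges.

```latex
\begin{align}
q(\theta, z \mid \gamma, \phi) &= q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n) \\
\phi_{ni} &\propto \beta_{i w_n} \exp\Bigl(\Psi(\gamma_i) - \Psi\bigl(\textstyle\sum_{j} \gamma_j\bigr)\Bigr) \\
\gamma_i &= \alpha_i + \sum_{n=1}^{N} \phi_{ni}
\end{align}
```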

Page 14:

Experiment
• Data set: Reuters-21578, 8681 training documents, 2966 test documents.
• Classification task: “EARN” vs. “Non-EARN”
• For each document, learn LDA features and classify with them (discriminative)
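
The last bullet can be illustrated with a toy sketch. The slides do not say which discriminative classifier was used, so the logistic regression below is an assumption, and the four topic-proportion vectors are invented for the example rather than taken from Reuters-21578.

```python
import math


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit a logistic regression by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = bias + sum(w * xi for w, xi in zip(weights, x))
            pred = 1.0 / (1.0 + math.exp(-score))
            err = y - pred
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (EARN) if the linear score is positive, else 0 (Non-EARN)."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


# Hypothetical per-document topic proportions (the LDA features), 3 topics each.
topic_features = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
labels = [1, 1, 0, 0]  # 1 = EARN, 0 = Non-EARN

weights, bias = train_logistic(topic_features, labels)
predictions = [predict(weights, bias, x) for x in topic_features]
```

The point of the design is dimensionality: each document is described by a handful of topic proportions instead of a vocabulary-sized count vector.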

Page 15:

Result

Topic 1        Topic 2        Topic 3      Topic 4
'bank'         'trade'        'shares'     'tonnes'
'banks'        'japan'        'company'    'mln'
'debt'         'japanese'     'stock'      'reuter'
'billion'      'states'       'dlrs'       'sugar'
'foreign'      'united'       'share'      'production'
'dlrs'         'officials'    'reuter'     'gold'
'government'   'reuter'       'offer'      'wheat'
'interest'     'told'         'common'     'nil'
'loans'        'government'   'pct'        'gulf'

Most frequent words in each topic

Page 16:

Classification Accuracy

Page 17:

Comparison of Accuracy

Page 18:

Take Away Message
• LDA with few topics and little training data could produce relatively better results
• Bayesian networks are useful for modeling multiple random variables, and there are nice algorithms for them
• Potential uses of LDA:
  – disk failure
  – database access pattern
  – user preference (collaborative filtering)
  – social network (tags)

Page 19:

Reference
• Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research