Distributional Clustering of Words for Text Classification

Presentation by: Thomas Walsh (Rutgers University)

Paper by: L. Douglas Baker (Carnegie Mellon University) and Andrew Kachites McCallum (Justsystem Pittsburgh Research Center)

Clustering

• Define what it means for words to be “similar”.

• “Collapse” the word space by grouping similar words in “clusters”.

• Key idea for distributional clustering:

– The class probabilities given the words, P(C|w), estimated from a labeled document collection, provide rules for correlating words with classifications.

Voting

• Can be understood by a voting model:

• Each word in a document casts a weighted vote for classification.

• Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without negatively impacting performance.

Benefits of Word Clustering

• Useful semantic word clustering

– Automatically generates a “thesaurus”

• Higher classification accuracy

– Sort of; we’ll discuss this in the results section

• Smaller classification models

– Size reductions as dramatic as 50,000 → 50 features

Benefits of Smaller Models

• Easier to compute: with the constantly increasing amount of available text, reducing memory requirements is crucial.

• Memory-constrained devices such as PDAs could now use text classification algorithms to organize documents.

• More complex algorithms that would be infeasible in 50,000 dimensions become practical.

The Framework

• Start with training data consisting of:

– A set of classes C = {c1, c2, …, cm}

– A set of documents D = {d1, …, dn}

– A class label for each document

Mixture Models

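• $f(x_i \mid \theta) = \sum_{k=1}^{K} p_k \, h(x_i \mid \lambda_k)$

• The $p_k$ sum to 1.

• $h$ is a distribution function for $x$ (such as a Gaussian) with $\lambda_k$ as its parameter; in the Gaussian case $\lambda_k = (\mu_k, \Sigma_k)$.

• Thus $\theta = (p_1, \ldots, p_K, \lambda_1, \ldots, \lambda_K)$.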
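As a concrete illustration of the mixture density, here is a minimal Python sketch assuming one-dimensional Gaussian components. The function names and the example weights and parameters are hypothetical, chosen only to show how the weighted sum is formed.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian, playing the role of h."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, components):
    """Mixture density: sum over k of p_k * h(x | parameters of component k)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(p * gaussian_pdf(x, mean, std)
               for p, (mean, std) in zip(weights, components))

# Two hypothetical components: weights p_k and (mean, std) per component.
weights = [0.3, 0.7]
components = [(0.0, 1.0), (4.0, 2.0)]
density_at_one = mixture_density(1.0, weights, components)
```

With a single component of weight 1, the mixture reduces to the component density itself.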

What is θ in this case?

• Assumption: one-to-one correspondence between the mixture model components and the classes.

• The class priors are contained in the vector $\theta_0$.

• Each prior is estimated as the number of documents in the class divided by the total number of documents.

What is θ in this case?

• The rest of the entries in $\theta$ correspond to disjoint sets. The $j$th entry, $\theta_j$, contains the probability $P(w_t \mid c_j; \theta_j)$ of each word $w_t$ in the vocabulary $V$ given the class $c_j$.

• $N(w_t, d_i)$ is the number of times word $w_t$ appears in document $d_i$.

• $P(c_j \mid d_i) \in \{0, 1\}$, as given by the document's class label.
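The counts N(wt, di) and the 0/1 labels P(cj|di) are enough to estimate the parameters from labeled data. The sketch below is a minimal Python illustration, assuming the usual multinomial naive Bayes estimates: class priors as relative class frequencies, and word probabilities with add-one (Laplace) smoothing. The smoothing choice and all names are assumptions for illustration, not taken from the slides.

```python
from collections import Counter, defaultdict

def estimate_parameters(documents, labels):
    """documents: list of token lists; labels: one class label per document."""
    vocabulary = sorted({w for doc in documents for w in doc})
    class_doc_counts = Counter(labels)
    # Class prior: documents labeled with the class / total number of documents.
    priors = {c: n / len(documents) for c, n in class_doc_counts.items()}

    # word_counts[c][w] = total occurrences of w in the documents of class c.
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    word_probs = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values())
        # Add-one smoothing (an assumption here) keeps every probability nonzero.
        word_probs[c] = {w: (1 + word_counts[c][w]) / (len(vocabulary) + total)
                         for w in vocabulary}
    return priors, word_probs

# Hypothetical toy corpus.
docs = [["ball", "goal", "ball"], ["vote", "senate"], ["goal", "team"]]
labels = ["sports", "politics", "sports"]
priors, word_probs = estimate_parameters(docs, labels)
```

Because the smoothing adds one count per vocabulary word, the word probabilities of each class still sum to 1.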

Prob. of a given Document in the Model

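• The mixture model generates a document $d_i$ with probability:

$$P(d_i \mid \theta) = \sum_{j=1}^{m} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \qquad (1)$$

• This is just the sum, over all classes, of the probability of generating the document from that class.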

Documents as Collections of Words

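• Treat each document as an ordered collection of word events.

• $d_{ik}$ = the word in document $d_i$ at position $k$.

• Each word is dependent on the preceding words:

$$P(d_i \mid c_j; \theta) = \prod_{k=1}^{|d_i|} P(d_{ik} \mid c_j; \theta; d_{iq}, q < k) \qquad (2)$$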

Apply Naïve Bayes Assumption

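• Assume each word is independent of both context and position.

• Formulas (2) and (1) then become, where $d_{ik} = w_t$:

– (2) $P(d_i \mid c_j; \theta) = \prod_{k=1}^{|d_i|} P(w_t \mid c_j; \theta)$

– (1) $P(d_i \mid \theta) = \sum_{j=1}^{m} P(c_j \mid \theta) \prod_{k=1}^{|d_i|} P(w_t \mid c_j; \theta)$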

Incorporate Expanded Formulae for θ

• We can calculate the model parameters $\theta$ from the training data.

• Now we wish to calculate $P(c_j \mid d_i; \theta)$, the probability of document $d_i$ belonging to class $c_j$.

Final Equation

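$$P(c_j \mid d_i; \theta) = \frac{P(c_j \mid \theta) \prod_{k=1}^{|d_i|} P(w_t \mid c_j; \theta)}{\sum_{r=1}^{m} P(c_r \mid \theta) \prod_{k=1}^{|d_i|} P(w_t \mid c_r; \theta)}, \quad d_{ik} = w_t$$

• Numerator: the class prior times formula (2), the product of the probabilities of each word in the document assuming we are in class $c_j$.

• Denominator: formula (1), the sum over all classes $c_r$ of the class prior times the product of the word probabilities assuming we are in class $c_r$.

• The class $c_j$ that maximizes this value is the class assigned to the document.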
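The classification rule can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: it works in log space to avoid underflow on long documents, and the toy priors and word probabilities are hypothetical.

```python
import math

def posterior(document, priors, word_probs):
    """Return P(c | document) for every class under the naive Bayes model.

    priors: {class: P(c)}; word_probs: {class: {word: P(word | c)}}.
    """
    log_scores = {}
    for c, prior in priors.items():
        # Numerator: class prior times the product of the word probabilities.
        log_scores[c] = math.log(prior) + sum(
            math.log(word_probs[c][w]) for w in document)
    # Denominator: sum of the numerators over all classes (log-sum-exp trick).
    top = max(log_scores.values())
    log_norm = top + math.log(sum(math.exp(s - top) for s in log_scores.values()))
    return {c: math.exp(s - log_norm) for c, s in log_scores.items()}

def classify(document, priors, word_probs):
    """Pick the class with the highest posterior probability."""
    post = posterior(document, priors, word_probs)
    return max(post, key=post.get)

# Hypothetical toy parameters for two classes over a three-word vocabulary.
priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports":   {"goal": 0.6, "vote": 0.1, "team": 0.3},
    "politics": {"goal": 0.1, "vote": 0.7, "team": 0.2},
}
post = posterior(["goal", "team"], priors, word_probs)
```

Since the denominator is the same for every class, `classify` would return the same answer if it compared the unnormalized log scores directly.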

Shortcomings of the Framework

• In real-world data (documents) there is no underlying mixture model, and the independence assumption does not hold.

• But empirical evidence and some theoretical work (Domingos and Pazzani 1997) indicate that the damage from this is negligible.

What about clustering?

• So assuming the Framework holds… how does clustering fit into all this?

How Does Clustering affect probabilities?

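• The class distribution of the merged cluster is a weighted average: the fraction of the cluster from $w_t$ times $P(C \mid w_t)$, plus the fraction of the cluster from $w_s$ times $P(C \mid w_s)$:

$$P(C \mid w_t \vee w_s) = \frac{P(w_t)}{P(w_t) + P(w_s)} P(C \mid w_t) + \frac{P(w_s)}{P(w_t) + P(w_s)} P(C \mid w_s)$$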
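A small Python sketch of this merge, assuming each word's share of the cluster is its relative frequency P(w) within the cluster. The function name and the toy numbers are hypothetical.

```python
def merge_distributions(p_wt, dist_wt, p_ws, dist_ws):
    """Class distribution of a cluster formed from two words.

    p_wt, p_ws: word probabilities P(w_t) and P(w_s).
    dist_wt, dist_ws: {class: P(class | word)} for each word.
    """
    total = p_wt + p_ws
    return {c: (p_wt / total) * dist_wt[c] + (p_ws / total) * dist_ws[c]
            for c in dist_wt}

# The more frequent word (0.03 vs 0.01) pulls the merged distribution toward itself.
merged = merge_distributions(
    0.03, {"sports": 0.9, "politics": 0.1},
    0.01, {"sports": 0.7, "politics": 0.3},
)
```

The merged distribution is still a proper probability distribution over the classes, because it is a convex combination of two such distributions.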

Vs. other forms of learning

• Measures similarity based on the property it is trying to estimate (the classes)

– Makes the supervision in the training data really important.

• Clustering is based on the similarity of the class variable distributions

• Key Idea: Clustering preserves the “shape” of the class distributions.

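Kullback-Leibler Divergence

• Measures the difference between class distributions (smaller means more similar)

• $D(P(C \mid w_t) \,\|\, P(C \mid w_s)) = \sum_{j=1}^{m} P(c_j \mid w_t) \log \frac{P(c_j \mid w_t)}{P(c_j \mid w_s)}$

• If $P(c_j \mid w_t) = P(c_j \mid w_s)$, the term is $\log(1) = 0$, so identical distributions have zero divergence.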

Problems with K-L Divergence

• Not symmetric

• Denominator can be 0 if ws does not appear in any documents of class cj.
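Both problems are easy to see numerically. The sketch below assumes the standard definition D(p || q) = Σ p log(p / q) and represents distributions as plain dictionaries; the example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D(p || q) for two distributions given as {class: probability} dicts."""
    total = 0.0
    for c, p_c in p.items():
        if p_c == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q[c] == 0.0:
            return math.inf  # zero denominator: the divergence is unbounded
        total += p_c * math.log(p_c / q[c])
    return total

p = {"sports": 0.9, "politics": 0.1}
q = {"sports": 0.5, "politics": 0.5}
r = {"sports": 1.0, "politics": 0.0}  # word never seen in "politics" documents

forward = kl_divergence(p, q)
backward = kl_divergence(q, p)   # differs from `forward`: not symmetric
unbounded = kl_divergence(p, r)  # infinite: r has a zero where p does not
```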

K-L Divergence from the Mean

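• Each word's share of the cluster times the K-L divergence of that word's distribution from the cluster's distribution, where $P(C \mid w_t \vee w_s)$ is the class distribution of the merged cluster:

$$\frac{P(w_t)}{P(w_t)+P(w_s)} D(P(C \mid w_t) \,\|\, P(C \mid w_t \vee w_s)) + \frac{P(w_s)}{P(w_t)+P(w_s)} D(P(C \mid w_s) \,\|\, P(C \mid w_t \vee w_s))$$

• New and improved: uses a weighted average instead of just the unweighted mean.

• Justification: fits clustering because the two independent distributions are combined into a single set of statistics.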
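A minimal Python sketch of this measure, assuming the weights are the words' relative frequencies within the cluster and that both frequencies are positive. The names are hypothetical. Unlike plain K-L divergence, the result is symmetric in the two words and stays finite, because the mean distribution is nonzero wherever either word's distribution is.

```python
import math

def kl_divergence(p, q):
    """D(p || q); terms with p = 0 contribute nothing."""
    return sum(p_c * math.log(p_c / q[c]) for c, p_c in p.items() if p_c > 0.0)

def kl_to_mean(p_wt, dist_wt, p_ws, dist_ws):
    """Weighted average of each word's K-L divergence to the merged distribution."""
    total = p_wt + p_ws
    share_t, share_s = p_wt / total, p_ws / total
    mean = {c: share_t * dist_wt[c] + share_s * dist_ws[c] for c in dist_wt}
    return (share_t * kl_divergence(dist_wt, mean)
            + share_s * kl_divergence(dist_ws, mean))

p = {"sports": 0.9, "politics": 0.1}
q = {"sports": 0.5, "politics": 0.5}
r = {"sports": 1.0, "politics": 0.0}  # contains a zero probability

score_pq = kl_to_mean(0.03, p, 0.01, q)
score_qp = kl_to_mean(0.01, q, 0.03, p)
score_pr = kl_to_mean(0.02, p, 0.02, r)
```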

Minimizing Error in Naïve Bayes Scores

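• Assuming uniform class priors allows us to drop $P(c_j \mid \theta)$ and the whole denominator from (6).

• Taking the log of what remains and dividing by the document length gives the (negative) cross entropy between the document's word distribution and the class's word distribution:

$$\sum_{t=1}^{|V|} P(w_t \mid d_i) \log P(w_t \mid c_j; \theta), \quad P(w_t \mid d_i) = \frac{N(w_t, d_i)}{|d_i|}$$

• So the error can be measured as the difference in cross entropy caused by clustering. Minimizing this difference results in equation (9), so clustering by this method minimizes the error.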
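The link between the naive Bayes score and cross entropy can be checked numerically. The sketch below is an illustration under uniform class priors: it compares the log-likelihood of a document with the cross entropy between the document's empirical word distribution and each class's word distribution. The word probabilities are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(document, word_probs):
    """Log of the product of P(w | c) over every word position in the document."""
    return sum(math.log(word_probs[w]) for w in document)

def cross_entropy(document, word_probs):
    """Cross entropy between the document's word frequencies and P(w | c)."""
    counts = Counter(document)
    n = len(document)
    return -sum((k / n) * math.log(word_probs[w]) for w, k in counts.items())

# Hypothetical class-conditional word probabilities for two classes.
word_probs = {
    "sports":   {"goal": 0.6, "vote": 0.1, "team": 0.3},
    "politics": {"goal": 0.1, "vote": 0.7, "team": 0.2},
}
doc = ["goal", "goal", "team"]
scores = {c: log_likelihood(doc, p) for c, p in word_probs.items()}
entropies = {c: cross_entropy(doc, p) for c, p in word_probs.items()}
```

The log-likelihood equals the negated cross entropy times the document length, so the class with the highest score is the class with the lowest cross entropy.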

The Clustering Algorithm

• Comparing the similarity of all possible pairs of word clusters would be $O(|V|^2)$.

• Instead, a number M is set as the total number of desired clusters

– More supervision

• M clusters initialized with the M words with the highest mutual information to the class variable

• Properties: Greedy, scales efficiently

Algorithm

[Figure: the clustering algorithm, operating on the distributions P(C | wt)]
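A greedy procedure of this kind can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the vocabulary is assumed to be already sorted by decreasing mutual information with the class variable, all word probabilities are assumed positive, similarity is the weighted K-L divergence to the mean, and the loop merges the two closest clusters and then admits the next word as a new singleton so that M clusters are always kept. All names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def kl(p, q):
    """D(p || q); terms with p = 0 contribute nothing."""
    return sum(pc * math.log(pc / q[c]) for c, pc in p.items() if pc > 0.0)

def merge(a, b):
    """Merge two clusters, each a (weight, class distribution, members) tuple."""
    wa, da, ma = a
    wb, db, mb = b
    total = wa + wb
    dist = {c: (wa * da[c] + wb * db[c]) / total for c in da}
    return (total, dist, ma + mb)

def divergence(a, b):
    """Weighted K-L divergence of each cluster to the merged distribution."""
    total, mean, _ = merge(a, b)
    return (a[0] / total) * kl(a[1], mean) + (b[0] / total) * kl(b[1], mean)

def distributional_clustering(sorted_words, p_word, p_class_given_word, m):
    """sorted_words: vocabulary ordered by decreasing mutual information."""
    if m < 2:
        raise ValueError("need at least two clusters")

    def singleton(w):
        return (p_word[w], p_class_given_word[w], [w])

    clusters = [singleton(w) for w in sorted_words[:m]]
    for word in sorted_words[m:]:
        # Merge the two most similar clusters, leaving m - 1 clusters ...
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: divergence(clusters[ij[0]], clusters[ij[1]]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        # ... then add the next word as a singleton, restoring m clusters.
        clusters += [merged, singleton(word)]
    return clusters

# Hypothetical toy vocabulary, already sorted by mutual information.
words = ["goal", "senate", "team", "vote"]
p_word = {"goal": 0.3, "senate": 0.3, "team": 0.2, "vote": 0.2}
p_class = {
    "goal":   {"sports": 0.9, "politics": 0.1},
    "senate": {"sports": 0.1, "politics": 0.9},
    "team":   {"sports": 0.8, "politics": 0.2},
    "vote":   {"sports": 0.2, "politics": 0.8},
}
clusters = distributional_clustering(words, p_word, p_class, m=3)
```

Each step only compares the M current clusters with one another, so the cost per added word depends on M rather than on the vocabulary size.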

Related Work

• ChiMerge / Chi2

– Use distributional clustering to discretize numeric attributes

• Class-based clustering

– Uses the amount by which mutual information is reduced to determine when to cluster

– Not effective in text classification

• Feature selection by mutual information

– Cannot capture dependencies between words

• Markov-blanket-based feature selection

– Also attempts to preserve the shapes of P(C | wt)

• Latent Semantic Indexing

– Unsupervised, using PCA

The Experiment: Competitors to Distributional Clustering

• Clustering with LSI

• Information Gain Based Feature Selection

• Mutual-Information Feature Selection

• Feature Selection involves cutting out redundant instances

• Clustering combines these redundancies
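The difference between the two approaches shows up in how a document is turned into features. The sketch below is a hypothetical illustration: the kept-word set and the word-to-cluster mapping are made up, and both reduce the document to two features.

```python
from collections import Counter

def select_features(document, kept_words):
    """Feature selection: words outside the kept set are simply dropped."""
    return Counter(w for w in document if w in kept_words)

def cluster_features(document, word_to_cluster):
    """Clustering: every word's count is added to the count of its cluster."""
    return Counter(word_to_cluster[w] for w in document if w in word_to_cluster)

doc = ["goal", "team", "score", "vote"]
selected = select_features(doc, kept_words={"goal", "vote"})
clustered = cluster_features(doc, {"goal": 0, "team": 0, "score": 0, "vote": 1})
```

Here feature selection keeps only two of the four word occurrences, while clustering keeps all four by pooling them into cluster counts.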

The Experiment: Testbeds

• 20 Newsgroups

– 20,000 articles from 20 Usenet groups (approx. 62,000 words)

• ModApte “Reuters-21578”

– 9,603 training docs, 3,299 testing docs, 135 topics (approx. 16,000 words)

• Yahoo! Science (July 1997)

– 6,294 pages in 41 classes (approx. 44,000 words)

– Very noisy data

20 Newsgroups Results

• Averaged over 5-20 trials

• Computational constraints forced Markov blanket to a smaller data set (second graph)

• LSI uses only a 1/3 training ratio

20 Newsgroups Analysis

• Distributional clustering achieves 82.1% accuracy at 50 features, almost as good as having the full vocabulary.

• More accurate than all non-clustering approaches

• LSI did not add any improvement to clustering (claim: because it is unsupervised)

• On the smaller data set, D.C. achieves 80% accuracy far quicker than the others, in some cases doubling their performance for small numbers of features.

• Claim: Clustering outperforms Feature selection because it conserves information rather than discarding it.

Speed in 20-Newsgroups Test

• Distributional Clustering: 7.5 minutes

• LSI: 23 minutes

• Markov blanket: 10 hours

• Mutual information feature selection (???): 30 seconds

Reuters-21578 Results

• D.C. outperforms others for small numbers of features

• Information-Gain based feature selection does better for larger feature sets.

• In this data set, documents can have multiple labels.

Yahoo! Results

• Feature selection performs almost as well as, or better than, D.C. in these cases

• Claim: The data is so noisy that it is actually beneficial to “lose data” via feature selection.

Performance Summary

• Only a slight loss in accuracy despite the reduction in feature space

• Preserves “redundant” information better than feature selection.

• The improvement is not as drastic with noisy data.

Improvements on Earlier D.C. Work

• Does not show much improvement on sparse data because the performance measure is related to the data distribution

– D.C. preserves class distributions, even if these are poor estimates to begin with.

• Thus this whole method relies on accurate values for P(C | wi)

Future Work

• Improve D.C.’s handling of sparse data (ensure good estimates of P(C | wi))

• Find ways to combine feature selection and D.C. to utilize the strengths of both (perhaps increase performance on noisy data sets?)

Some Thoughts

• Extremely supervised

• Needs to be retrained when new documents come in

• In a paper with a lot of topics, does Naïve Bayes (word independent of context) make sense?

• Didn’t work well in noisy data

• How can we ensure proper theta values?