
Page 1: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

WEB BAR 2004 Advanced Retrieval and Web Mining

Lecture 16

Page 2: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Today’s Topics

• Evaluation of classification
• Vector space methods, nearest neighbors
• Decision trees

Page 3: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Measuring Classification: Figures of Merit

• Accuracy of classification: the main evaluation criterion in academia
• Speed of training a statistical classifier
• Speed of classification (docs/hour): no big differences for most algorithms; exception: kNN
• Effort in creating the training set (human hours/class): active learning can speed up training set creation

Page 4: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Measures of Accuracy

• Error rate: not a good measure for small classes. Why?
• Precision/recall for classification decisions
• F1 measure: 1/F1 = 1/2 (1/P + 1/R)
• Precision/recall for ranking classes (next slide)
• Correct estimate of the size of a category: why is this different?
• Stability over time / category drift
• Utility: costs of false positives and false negatives may differ, e.g., utility = tp - 0.5 fp
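
A minimal sketch of these measures from raw decision counts; tp, fp, fn here are hypothetical counts of true positives, false positives, and false negatives:

```python
def precision_recall_f1(tp, fp, fn):
    """P, R, and F1 from decision counts; guards against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of P and R: 1/F1 = 1/2 (1/P + 1/R)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def utility(tp, fp):
    # Asymmetric-cost utility from the slide: utility = tp - 0.5 fp
    return tp - 0.5 * fp
```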

Page 5: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Accuracy Measurement: 1 cat/doc

Confusion matrix

[Figure: example confusion matrix; rows = actual class, columns = class assigned by classifier; one highlighted entry is 53.]

An entry (i, j) of 53 means that 53 of the docs actually in class i were put in class j by the classifier.

Page 6: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Confusion matrix

• A function of the classifier, the classes, and the test docs.
• For a perfect classifier, all off-diagonal entries should be zero.
• For a perfect classifier, if there are n docs in category j, then entry (j, j) should be n.
• Straightforward when there is 1 category per document; can be extended to n categories per document.

Page 7: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Confusion measures (1 class / doc)

• Recall: fraction of docs in class i classified correctly:

  R_i = c_ii / (sum_j c_ij)

• Precision: fraction of docs assigned class i that are actually about class i:

  P_i = c_ii / (sum_j c_ji)

• "Correct rate" (1 - error rate): fraction of docs classified correctly:

  (sum_i c_ii) / (sum_i sum_j c_ij)
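
All three measures fall directly out of the confusion matrix; a sketch in NumPy, assuming C[i, j] counts docs of actual class i assigned to class j:

```python
import numpy as np

def confusion_measures(C):
    """Per-class recall and precision, plus the overall correct rate."""
    C = np.asarray(C, dtype=float)
    recall = np.diag(C) / C.sum(axis=1)     # c_ii over row sums (true class sizes)
    precision = np.diag(C) / C.sum(axis=0)  # c_ii over column sums (assigned sizes)
    correct_rate = np.trace(C) / C.sum()    # 1 - error rate
    return recall, precision, correct_rate
```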

Page 8: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Multiclass Problem (>1 cat/doc): Precision/Recall for Ranking Classes

Example doc: “Bad wheat harvest in Turkey”

True categories: wheat, turkey

Ranked category list:
0.9: turkey
0.7: poultry
0.5: armenia
0.4: barley
0.3: georgia

Precision at 5: 1/5 = 0.2; Recall at 5: 1/2 = 0.5
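
A sketch of precision and recall at rank k, using the slide's ranked list and gold categories:

```python
def precision_recall_at_k(ranked, gold, k):
    """ranked: categories sorted by descending score; gold: set of true categories."""
    hits = sum(1 for c in ranked[:k] if c in gold)
    return hits / k, hits / len(gold)

ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]
print(precision_recall_at_k(ranked, {"wheat", "turkey"}, 5))  # (0.2, 0.5)
```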

Page 9: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Precision/Recall for Ranking Classes

• Consider problems with many categories (>10)
• Use a method returning scores comparable across categories (not: Naïve Bayes)
• Rank categories and compute average precision (or another measure characterizing the precision/recall curve)
• A good measure for interactive support of human categorization
• Useless for an “autonomous” system (e.g., a filter on a stream of newswire stories)

Page 10: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Multiclass Problem (>1 cat/doc): Micro- vs. Macro-Averaging

If we have more than one class, how do we combine multiple performance measures into one quantity?

• Macroaveraging: compute performance for each class, then average.
• Microaveraging: collect decisions for all classes into one contingency table, then evaluate.

Page 11: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Micro- vs. Macro-Averaging: Example

Class 1:
                  Truth: yes   Truth: no
Classifier: yes       10           10
Classifier: no        10          970

Class 2:
                  Truth: yes   Truth: no
Classifier: yes       90           10
Classifier: no        10          890

Micro-averaged table (entrywise sum of the two):
                  Truth: yes   Truth: no
Classifier: yes      100           20
Classifier: no        20         1860

Macroaveraged precision: (10/20 + 90/100)/2 = (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83

Why this difference? Microaveraging is dominated by the large class (class 2), while macroaveraging weights every class equally.
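
A sketch reproducing the slide's numbers; each class contributes the (tp, fp) counts from its own contingency table:

```python
def macro_micro_precision(per_class):
    """per_class: list of (tp, fp) pairs, one pair per class."""
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    tp_all = sum(tp for tp, _ in per_class)
    fp_all = sum(fp for _, fp in per_class)
    micro = tp_all / (tp_all + fp_all)
    return macro, micro

# Class 1: tp=10, fp=10; class 2: tp=90, fp=10 (from the tables above)
print(macro_micro_precision([(10, 10), (90, 10)]))  # (0.7, 0.8333...)
```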

Page 12: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Variabilities of Measurements

Can we trust a computed figure of merit?

• Variability of the data: document size/length, quality/style of authorship, uniformity of vocabulary
• Variability of the “truth” / gold standard: we need a definitive judgment on which class(es) a doc belongs to, usually from a human; ideally, judgments are consistent

Page 13: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Vector Space Classification: Nearest Neighbor Classification

Page 14: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Classification Methods

• Naïve Bayes
• Vector space methods, nearest neighbors
• Decision trees
• Logistic regression
• Support vector machines

Page 15: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Recall Vector Space Representation

• Each document is a vector, with one component for each term (= word).
• Normalize vectors to unit length.
• Terms are axes: 10,000+ dimensions, or even 1,000,000+.
• Docs are vectors in this space.

Page 16: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Classification Using Vector Spaces

• Each training doc is a point (vector) labeled by its topic (= class)
• Hypothesis: docs of the same class form a contiguous region of space
• Define surfaces to delineate classes in the space

Page 17: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Classes in a Vector Space

[Figure: documents of three classes (Government, Science, Arts) forming distinct regions in the vector space]

Page 18: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Test Document = Government

[Figure: a test document falling in the Government region of a space with Government, Science, and Arts classes]

Is the similarity hypothesis true in general?

Page 19: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Binary Classification

• Consider 2-class problems
• How do we define (and find) the separating surface?
• How do we test which region a test doc is in?

Page 20: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Separation by Hyperplanes

• Assume linear separability for now: in 2 dimensions we can separate classes by a line; in higher dimensions we need hyperplanes
• A separating hyperplane can be found by linear programming (or, e.g., by the perceptron); in 2D the separator can be expressed as ax + by = c

Page 21: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Linear programming / Perceptron

Find a, b, c such that
ax + by ≥ c for red points
ax + by < c for green points

Page 22: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Which Hyperplane?

In general, there are lots of possible solutions for a, b, c.

Page 23: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not an optimal one (e.g., perceptron)
• Most methods find an optimal separating hyperplane
• Which points should influence optimality?
  • All points: linear regression, Naïve Bayes
  • Only “difficult points” close to the decision boundary: support vector machines, logistic regression (kind of)

Page 24: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Hyperplane: Example

Class: “interest” (as in interest rate)

Example features of a linear classifier (SVM), shown as weight w_i and term t_i pairs:

   w_i   t_i               w_i   t_i
   0.70  prime            -0.71  dlrs
   0.67  rate             -0.35  world
   0.63  interest         -0.33  sees
   0.60  rates            -0.25  year
   0.46  discount         -0.24  group
   0.43  bundesbank       -0.24  dlr
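
A sketch of how such a linear classifier scores a document; the weights below are a subset of the slide's SVM weights, and the zero threshold is an assumption for illustration:

```python
def linear_classify(weights, doc, threshold=0.0):
    """doc maps terms to their values t_i (e.g., tf-idf); score = sum_i w_i * t_i."""
    score = sum(w * doc.get(term, 0.0) for term, w in weights.items())
    return score > threshold

weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63,
           "dlrs": -0.71, "world": -0.35, "year": -0.25}
print(linear_classify(weights, {"rate": 1.0, "discount": 1.0}))  # True
```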

Page 25: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Relationship to Naïve Bayes?

Find a, b, c such that
ax + by ≥ c for red points
ax + by < c for green points

Page 26: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Linear Classifiers

Many common text classifiers are linear classifiers:
• Naïve Bayes
• Perceptron
• Rocchio
• Logistic regression
• Support vector machines (with linear kernel)
• Linear regression
• (Simple) neural networks

Despite this similarity, there are large performance differences:
• For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
• What to do for non-separable problems?
• Different training methods pick different hyperplanes

Classifiers more powerful than linear often don't perform better. Why?

Page 27: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

More Than Two Classes

Any-of or multiclass classification
• Classes are independent of each other.
• A document can belong to 0, 1, or more than 1 class.
• Decompose into n binary problems.

One-of or multinomial or polytomous classification
• Classes are mutually exclusive.
• Each document belongs to exactly one class.

Page 28: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

One-Of/Polytomous/Multinomial Classification

• Digit recognition is polytomous classification
• Digits are mutually exclusive
• A given image is to be assigned to exactly one of the 10 classes

Page 29: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Composing Surfaces: Issues


Page 30: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Set of Binary Classifiers: Any of

• Build a separator between each class and its complementary set (docs from all other classes).
• Given a test doc, evaluate it for membership in each class.
• Apply the decision criterion of each classifier independently.
• Done.

Page 31: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Set of Binary Classifiers: One of

• Build a separator between each class and its complementary set (docs from all other classes).
• Given a test doc, evaluate it for membership in each class.
• Assign the document to the class with: maximum score, maximum confidence, or maximum probability (see the sketch below).
• Why is this different from multiclass/any-of classification?
• Naïve Bayes?
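
A minimal sketch contrasting the two decision rules, assuming each binary classifier returns a real-valued score comparable across classes; the per-class thresholds are hypothetical:

```python
def any_of(scores, thresholds):
    """Apply each binary classifier's decision criterion independently."""
    return {c for c, s in scores.items() if s >= thresholds[c]}

def one_of(scores):
    """Assign the single class with the maximum score."""
    return max(scores, key=scores.get)

scores = {"grain": 0.4, "wheat": 0.7, "ship": -0.2}
print(any_of(scores, {"grain": 0.0, "wheat": 0.0, "ship": 0.0}))  # {'grain', 'wheat'}
print(one_of(scores))                                             # 'wheat'
```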

Page 32: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Negative Examples

Formulate as above, except negative examples for a class are added to its complementary set.

[Figure: a class region trained from positive examples together with explicitly supplied negative examples]

Page 33: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Centroid Classification

• Given the training docs for a class, compute their centroid
• Now we have a centroid for each class
• Given a query doc, assign it to the class whose centroid is nearest
• Exercise: compare to Rocchio
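
A sketch of centroid classification with unit-length doc vectors; renormalizing the mean vector is one common choice, not something the slide prescribes:

```python
import numpy as np

def train_centroids(X, y):
    """X: unit-length doc vectors (n_docs x n_terms); y: array of class labels."""
    centroids = {}
    for c in np.unique(y):
        mu = X[y == c].mean(axis=0)
        centroids[c] = mu / np.linalg.norm(mu)  # renormalize the mean vector
    return centroids

def classify(centroids, q):
    # With unit-length vectors, the dot product is the cosine similarity,
    # so the nearest centroid is the one with the largest dot product.
    return max(centroids, key=lambda c: q @ centroids[c])
```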

Page 34: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Example

[Figure: Government, Science, and Arts classes with the centroid of each class marked]

Page 35: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

k Nearest Neighbor Classification

To classify document d into class c:
• Define the k-neighborhood N as the k nearest neighbors of d
• Count the number i of documents in N that belong to c
• Estimate P(c|d) as i/k
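
A sketch of this estimate, assuming unit-length vectors so that the dot product ranks neighbors by cosine similarity:

```python
import numpy as np

def knn_posterior(X, y, q, c, k=6):
    """Estimate P(c|q) as i/k, where i counts the k nearest neighbors in class c.
    X: unit-length training vectors, y: label array, q: unit-length query."""
    sims = X @ q                       # cosine similarity to all training docs
    neighbors = np.argsort(-sims)[:k]  # indices of the k most similar docs
    return float(np.mean(y[neighbors] == c))
```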

Page 36: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Example: k=6 (6NN)

[Figure: a test document and its 6 nearest neighbors among the Government, Science, and Arts classes]

P(science | test doc)?

Page 37: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

kNN Is Close to Optimal

Cover and Hart (1967):
• Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate.
• In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
• Intuition: asymptotically, assume the query point coincides with a training point; both the query point and the training point contribute error, giving at most 2 times the Bayes rate.

Page 38: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

kNN: Discussion

• Classification time is linear in the size of the training set: expensive
• No feature selection necessary
• Scales well with a large number of classes: don't need to train n classifiers for n classes
• Classes can influence each other: small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities
• No training necessary. Actually: not true. Why?

Page 39: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Number of neighbors

Page 40: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

kNN vs. Naïve Bayes

Bias/variance tradeoff: variance ≈ capacity
• kNN has high variance and low bias (infinite memory)
• Naïve Bayes has low variance and high bias (the decision surface has to be linear, a hyperplane)

Consider: is an object a tree? (Burges)
• Too much capacity/variance, low bias: a botanist who memorizes; will always say “no” to a new object (e.g., a different number of leaves)
• Not enough capacity/variance, high bias: a lazy botanist; says “yes” if the object is green

Page 41: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

References

• David D. Lewis. 1995. Evaluating and Optimizing Autonomous Text Classification Systems. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
• Christopher Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing. Chapter 16. MIT Press.
• Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

Page 42: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Decision Tree Example

Page 43: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Decision Trees

• A tree with internal nodes labeled by terms
• Branches are labeled by tests on the weight that the term has
• Leaves are labeled by categories
• The classifier categorizes a document by descending the tree, following the tests, to a leaf (see the sketch below)
• The label of the leaf node is then assigned to the document
• Most decision trees are binary trees (never disadvantageous; may require extra internal nodes)
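
A sketch of this descent; the tuple-based node encoding is an assumption for illustration, not the slide's representation:

```python
def classify_doc(node, doc):
    """node: either a category label (a str, at a leaf) or a tuple
    (term, threshold, low_branch, high_branch) testing the term's weight."""
    while not isinstance(node, str):
        term, threshold, low, high = node
        node = low if doc.get(term, 0.0) <= threshold else high
    return node

# Hypothetical two-level tree: test "rate", then "prime"
tree = ("rate", 0.5, "other", ("prime", 0.5, "other", "interest"))
print(classify_doc(tree, {"rate": 1.0, "prime": 1.0}))  # 'interest'
```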

Page 44: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Decision trees vs. Hyperplanes

• Decision trees: optimal use of a few high-leverage features
• Hyperplanes: integrate many features, e.g., the weights for “interest”:

   0.70  prime            -0.71  dlrs
   0.67  rate             -0.35  world
   0.63  interest         -0.33  sees
   0.60  rates            -0.25  year
   0.46  discount         -0.24  group
   0.43  bundesbank       -0.24  dlr

Page 45: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Decision Tree Learning

• Learn a sequence of tests on features, typically using top-down, greedy search
• At each stage, choose the unused feature with the highest information gain
• Decisions can be binary (yes/no) or continuous

[Figure: a small tree splitting on f1 vs. !f1, then f7 vs. !f7, with leaf probabilities P(class) = 0.6, 0.9, and 0.2]
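
A sketch of information gain for a binary feature, the quantity maximized at each greedy split; binary class labels are assumed:

```python
import numpy as np

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, feature):
    """labels, feature: 0/1 integer arrays over the training docs at a node.
    IG = H(labels) - sum_v P(feature=v) * H(labels | feature=v)."""
    gain = entropy(labels)
    for v in (0, 1):
        mask = feature == v
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain
```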

Page 46: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Decision Tree Learning

• Fully grown trees tend to have decision rules that are overly specific and are therefore unable to categorize new documents well
• Pruning or early-stopping methods for decision trees are therefore a standard part of classification packages
• Using a small number of features is potentially bad in text categorization, but in practice decision trees do well for some text classification tasks
• Decision trees are very easily interpreted by humans, much more easily than probabilistic methods like Naïve Bayes
• Decision trees are normally regarded as a symbolic machine learning algorithm, though they can be used probabilistically

Page 47: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Classic Reuters Data Set

• 21,578 documents; 9,603 training and 3,299 test articles (ModApte split)
• 118 categories; an article can be in more than one category
• Learn 118 binary category distinctions

Example “interest rate” article:

2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
FRANKFURT, March 2. The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.

Common categories (#train, #test):
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)

Page 48: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Classic Reuters Data Set

• Average document: about 90 types, 200 tokens
• Average number of classes assigned: 1.24 for docs with at least one category
• Only about 10 of the 118 categories are large
• Microaveraging measures performance on the large categories

Page 49: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Dumais et al. 1998: Reuters - Accuracy

Recall: % labeled in category among those stories that are really in the category
Precision: % really in category among those stories labeled in the category
Break-even: (Recall + Precision) / 2

Category     Rocchio   NBayes   Trees   LinearSVM
earn          92.9%     95.9%    97.8%    98.2%
acq           64.7%     87.8%    89.7%    92.8%
money-fx      46.7%     56.6%    66.2%    74.0%
grain         67.5%     78.8%    85.0%    92.4%
crude         70.1%     79.5%    85.0%    88.3%
trade         65.1%     63.9%    72.5%    73.5%
interest      63.4%     64.9%    67.1%    76.3%
ship          49.2%     85.4%    74.2%    78.0%
wheat         68.9%     69.7%    92.5%    89.7%
corn          48.2%     65.3%    91.8%    91.1%

Avg Top 10    64.6%     81.5%    88.4%    91.4%
Avg All Cat   61.7%     75.2%     n/a     86.4%

Page 50: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Category: “interest”

[Figure: decision tree for the category “interest”, with internal tests such as rate=1, lending=0, prime=0, discount=0, pct=1, year=0/year=1, and rate.t=1 on the paths to the leaves]

Page 51: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Reuters ROC - Category Grain

[Figure: precision-recall curves for the Reuters category “grain”: LinearSVM, Decision Tree, Naïve Bayes, and Find Similar; precision (y-axis) vs. recall (x-axis), both from 0 to 1]

Recall: % labeled in category among those stories that are really in the category
Precision: % really in category among those stories labeled in the category

Two methods with the same curve: same performance?

Page 52: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

ROC for Category - Earn

[Figure: precision-recall curves for the Reuters category “earn”: LinearSVM, Decision Tree, Naïve Bayes, and Find Similar]

Page 53: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

ROC for Category - Crude

[Figure: precision-recall curves for the Reuters category “crude”: LinearSVM, Decision Tree, Naïve Bayes, and Find Similar]

Page 54: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

ROC for Category - Ship

[Figure: precision-recall curves for the Reuters category “ship”: LinearSVM, Decision Tree, Naïve Bayes, and Find Similar]

Page 55: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 16

Take Away

• Evaluation of text classification
• kNN
• Decision trees
• Hyperplanes
• Many popular classification methods are linear classifiers
• The training method is the key difference, not the “hypothesis space” of potential classifiers
• Any-of/multiclass vs. one-of/polytomous classification