WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 16
TRANSCRIPT
Today’s Topics
Evaluation of classification
Vector space methods, nearest neighbors
Decision trees
Measuring Classification: Figures of Merit
Accuracy of classification
  Main evaluation criterion in academia
Speed of training statistical classifier
Speed of classification (docs/hour)
  No big differences for most algorithms; exceptions: kNN
Effort in creating training set (human hours/class)
  Active learning can speed up training set creation
Measures of Accuracy
Error rate
  Not a good measure for small classes. Why?
Precision/recall for classification decisions
  F1 measure: 1/F1 = ½ (1/P + 1/R)
Precision/recall for ranking classes (next slide)
Correct estimate of size of category
  Why is this different?
Stability over time / category drift
Utility
  Costs of false positives / false negatives may be different
  For example, cost = tp - 0.5·fp
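As a quick sketch, the F1 measure and the slide's example cost function can be written out in Python (function names here are illustrative, not from the lecture):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: 1/F1 = 0.5 * (1/P + 1/R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def utility(tp, fp, fp_cost=0.5):
    """Utility with asymmetric costs, e.g. the slide's cost = tp - 0.5*fp."""
    return tp - fp_cost * fp
```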
Accuracy Measurement: 1 cat/doc

Confusion matrix
[Figure: confusion matrix, rows = actual class, columns = class assigned by classifier. The (i, j) entry, e.g. 53, means 53 of the docs actually in class i were put in class j by the classifier.]
Confusion matrix
Function of classifier, classes, and test docs.
For a perfect classifier, all off-diagonal entries should be zero.
For a perfect classifier, if there are n docs in category j, then entry (j,j) should be n.
Straightforward when there is 1 category per document.
Can be extended to n categories per document.
Confusion measures (1 class / doc)
Recall: fraction of docs in class i classified correctly:
  R_i = c_ii / Σ_j c_ij
Precision: fraction of docs assigned class i that are actually about class i:
  P_i = c_ii / Σ_j c_ji
"Correct rate" (1 - error rate): fraction of docs classified correctly:
  Σ_i c_ii / Σ_i Σ_j c_ij
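The three confusion measures can be sketched directly in Python, with C a nested list such that C[i][j] counts docs of actual class i assigned class j (a minimal illustration):

```python
def recall_i(C, i):
    """Fraction of docs actually in class i that were classified as i."""
    return C[i][i] / sum(C[i][j] for j in range(len(C)))

def precision_i(C, i):
    """Fraction of docs assigned class i that are actually in class i."""
    return C[i][i] / sum(C[j][i] for j in range(len(C)))

def correct_rate(C):
    """Fraction of all docs classified correctly (1 - error rate)."""
    n = len(C)
    return sum(C[i][i] for i in range(n)) / \
           sum(C[i][j] for i in range(n) for j in range(n))
```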
Multiclass Problem: >1 cat/doc
Precision/Recall for Ranking Classes
Example doc: “Bad wheat harvest in Turkey”
True categories: wheat, turkey
Ranked category list:
  0.9: turkey
  0.7: poultry
  0.5: armenia
  0.4: barley
  0.3: georgia
Precision at 5: 0.2; recall at 5: 0.5
Precision/Recall for Ranking Classes
Consider problems with many categories (>10)
Use a method returning scores comparable across categories (not: Naïve Bayes)
Rank categories and compute average precision (or another measure characterizing the precision/recall curve)
Good measure for interactive support of human categorization
Useless for an “autonomous” system (e.g., a filter on a stream of newswire stories)
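A minimal sketch of precision/recall at rank k over a ranked category list, reproducing the wheat/turkey example from the previous slide (function name is my own):

```python
def precision_recall_at_k(ranked, true_cats, k):
    """Precision and recall over the top-k ranked categories."""
    hits = len(set(ranked[:k]) & set(true_cats))
    return hits / k, hits / len(true_cats)

# Slide example: "Bad wheat harvest in Turkey"
ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]
p, r = precision_recall_at_k(ranked, {"wheat", "turkey"}, 5)
# p = 0.2 (1 of 5 ranked categories correct), r = 0.5 (1 of 2 true categories found)
```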
Multiclass Problem: >1 cat/doc
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: Compute performance for each class, then average.
Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.
Micro- vs. Macro-Averaging: Example
Class 1:
                 Truth: yes   Truth: no
Classifier: yes      10           10
Classifier: no       10          970

Class 2:
                 Truth: yes   Truth: no
Classifier: yes      90           10
Classifier: no       10          890

Micro-averaged (pooled) table:
                 Truth: yes   Truth: no
Classifier: yes     100           20
Classifier: no       20         1860
Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
Why this difference?
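The difference can be reproduced with a short sketch (tp/fp counts taken from the tables above):

```python
def precision(tp, fp):
    """Precision from true-positive and false-positive counts."""
    return tp / (tp + fp)

# (tp, fp) per class, from the slide's two contingency tables
classes = [(10, 10), (90, 10)]

# Macroaveraging: average the per-class precisions
macro = sum(precision(tp, fp) for tp, fp in classes) / len(classes)

# Microaveraging: pool the counts, then compute precision once
tp_sum = sum(tp for tp, _ in classes)
fp_sum = sum(fp for _, fp in classes)
micro = precision(tp_sum, fp_sum)
# macro = 0.7, micro = 100/120 ≈ 0.83
```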
Variabilities of Measurements
Can we trust a computed figure of merit?
Variability of data
  document size/length
  quality/style of authorship
  uniformity of vocabulary
Variability of “truth” / gold standard
  need definitive judgment on which class(es) a doc belongs to
  usually human
  Ideally: consistent judgments
Vector Space Classification
Nearest Neighbor Classification
Classification Methods
Naïve Bayes Vector space methods, Nearest neighbors Decision trees Logistic regression Support vector machines
Recall Vector Space Representation
Each document is a vector, one component for each term (= word).
Normalize to unit length.
Terms are axes: 10,000+ dimensions, or even 1,000,000+
Docs are vectors in this space
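Unit-length normalization is a one-liner; a minimal sketch over a dense weight vector:

```python
def unit_normalize(vec):
    """Scale a term-weight vector to unit Euclidean length."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec
```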
Classification Using Vector Spaces
Each training doc a point (vector) labeled by its topic (= class)
Hypothesis: docs of the same class form a contiguous region of space
Define surfaces to delineate classes in space
Classes in a Vector Space
Government
Science
Arts
Test Document = Government
Government
Science
Arts
Similarity hypothesis true in general?
Binary Classification
Consider 2-class problems
How do we define (and find) the separating surface?
How do we test which region a test doc is in?
Separation by Hyperplanes
Assume linear separability for now: in 2 dimensions, can separate by a line in higher dimensions, need hyperplanes
Can find separating hyperplane by linear programming (e.g. perceptron): separator can be expressed as ax + by = c
Linear programming / Perceptron
Find a, b, c such that
  ax + by ≥ c for red points
  ax + by < c for green points.
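A minimal perceptron sketch for the 2-D case described above (the learning rate and epoch cap are illustrative choices; convergence is only guaranteed for separable data):

```python
def perceptron_2d(points, labels, epochs=100, lr=1.0):
    """Perceptron in 2D: learns a, b, c with a*x + b*y >= c for label +1
    and a*x + b*y < c for label -1 (finds some separator, not the optimal one)."""
    a = b = c = 0.0
    for _ in range(epochs):
        updated = False
        for (x, y), t in zip(points, labels):
            pred = 1 if a * x + b * y - c >= 0 else -1
            if pred != t:
                # Standard mistake-driven update; c plays the role of a bias
                a += lr * t * x
                b += lr * t * y
                c -= lr * t
                updated = True
        if not updated:   # no mistakes in a full pass: converged
            break
    return a, b, c
```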
Which Hyperplane?
In general, lots of possible solutions for a, b, c.
Which Hyperplane?
Lots of possible solutions for a, b, c.
Some methods find a separating hyperplane, but not the optimal one (e.g., perceptron)
Most methods find an optimal separating hyperplane
Which points should influence optimality?
  All points
    Linear regression
    Naïve Bayes
  Only “difficult points” close to decision boundary
    Support vector machines
    Logistic regression (kind of)
Hyperplane: Example
Class: “interest” (as in interest rate)
Example features of a linear classifier (SVM), weight w_i and term t_i:
   0.70 prime        -0.71 dlrs
   0.67 rate         -0.35 world
   0.63 interest     -0.33 sees
   0.60 rates        -0.25 year
   0.46 discount     -0.24 group
   0.43 bundesbank   -0.24 dlr
Relationship to Naïve Bayes?
Find a, b, c such that
  ax + by ≥ c for red points
  ax + by < c for green points.
Linear Classifiers
Many common text classifiers are linear classifiers:
  Naïve Bayes, Perceptron, Rocchio, Logistic regression, Support vector machines (with linear kernel), Linear regression, (Simple) neural networks
Despite this similarity, large performance differences
For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
What to do for non-separable problems?
Different training methods pick different hyperplanes
Classifiers more powerful than linear often don’t perform better. Why?
More Than Two Classes
Any-of or multiclass classification Classes are independent of each other. A document can belong to 0, 1, or >1 classes. Decompose into n binary problems
One-of or multinomial or polytomous classification Classes are mutually exclusive. Each document belongs to exactly one class.
One-Of/Polytomous/Multinomial Classification
Digit recognition is polytomous classification
Digits are mutually exclusive
A given image is to be assigned to exactly one of the 10 classes
Composing Surfaces: Issues
[Figure: composed binary decision surfaces leave regions of the space, marked “?”, where class membership is ambiguous.]
Set of Binary Classifiers: Any of
Build a separator between each class and its complementary set (docs from all other classes).
Given test doc, evaluate it for membership in each class.
Apply decision criterion of classifiers independently
Done
Set of Binary Classifiers: One of
Build a separator between each class and its complementary set (docs from all other classes).
Given test doc, evaluate it for membership in each class.
Assign document to class with: maximum score maximum confidence maximum probability
Why different from multiclass/any of classification?
Naïve Bayes?
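The two decision rules can be contrasted in a short sketch (the scores and threshold here are hypothetical, not from the lecture):

```python
def one_of(scores):
    """One-of classification: assign the single class with maximum score."""
    return max(scores, key=scores.get)

def any_of(scores, threshold=0.0):
    """Any-of classification: apply each binary decision independently."""
    return {c for c, s in scores.items() if s >= threshold}

scores = {"wheat": 0.4, "turkey": 0.9, "poultry": 0.7}  # hypothetical scores
# one_of(scores) picks only "turkey";
# any_of(scores, 0.5) accepts every class above threshold
```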
Negative Examples
Formulate as above, except negative examples for a class are added to its complementary set.
Positive examples
Negative examples
Centroid Classification
Given training docs for a class, compute their centroid
Now we have a centroid for each class
Given a query doc, assign it to the class whose centroid is nearest.
Exercise: Compare to Rocchio
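A minimal centroid-classifier sketch over dense vectors (in practice the vectors would be length-normalized tf-idf weights; function names are illustrative):

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify_by_centroid(doc, class_docs):
    """Assign doc to the class whose centroid is nearest (Euclidean)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    centroids = {c: centroid(docs) for c, docs in class_docs.items()}
    return min(centroids, key=lambda c: dist(doc, centroids[c]))
```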
Example
Government
Science
Arts
k Nearest Neighbor Classification
To classify document d into class c:
  Define k-neighborhood N as the k nearest neighbors of d
  Count the number of documents i in N that belong to c
  Estimate P(c|d) as i/k
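The kNN estimate of P(c|d) can be sketched as follows (Euclidean distance is used here for simplicity; cosine similarity is more common for length-normalized documents):

```python
from collections import Counter

def knn_estimate(doc, training, k):
    """Estimate P(c|doc) as (neighbors of doc in class c) / k.
    training is a list of (vector, class) pairs."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    # Sort the training set by distance and keep the k closest points
    neighbors = sorted(training, key=lambda tc: dist(doc, tc[0]))[:k]
    counts = Counter(c for _, c in neighbors)
    return {c: n / k for c, n in counts.items()}
```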
Example: k=6 (6NN)
Government
Science
Arts
P(science | test doc)?
kNN Is Close to Optimal
Cover and Hart (1967): Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate.
In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
Assume: query point coincides with a training point.
Both query point and training point contribute error -> 2 times Bayes rate
kNN: Discussion
Classification time linear in training set: expensive
No feature selection necessary
Scales well with large number of classes
  Don't need to train n classifiers for n classes
Classes can influence each other
  Small changes to one class can have ripple effect
Scores can be hard to convert to probabilities
No training necessary
  Actually: not true. Why? (e.g., the number of neighbors k must be chosen, and the training set must be stored and indexed)
kNN vs. Naïve Bayes
Bias/variance tradeoff; variance ≈ capacity
kNN has high variance and low bias.
  Infinite memory
NB has low variance and high bias.
  Decision surface has to be linear (hyperplane)
Consider: Is an object a tree? (Burges)
  Too much capacity/variance, low bias: botanist who memorizes; will always say “no” to a new object (e.g., different #leaves)
  Not enough capacity/variance, high bias: lazy botanist; says “yes” if the object is green
References
David D. Lewis. 1995. Evaluating and Optimizing Autonomous Text Classification Systems. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Christopher Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing. Chapter 16. MIT Press.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.
Decision Tree Example
Decision Trees
Tree with internal nodes labeled by terms
Branches are labeled by tests on the weight that the term has
Leaves are labeled by categories
Classifier categorizes a document by descending the tree, following tests to a leaf
The label of the leaf node is then assigned to the document
Most decision trees are binary trees (never disadvantageous; may require extra internal nodes)
Decision trees vs. Hyperplanes
Optimal use of a few high-leverage features Hyperplanes integrate many features:
   0.70 prime        -0.71 dlrs
   0.67 rate         -0.35 world
   0.63 interest     -0.33 sees
   0.60 rates        -0.25 year
   0.46 discount     -0.24 group
   0.43 bundesbank   -0.24 dlr
Decision Tree Learning
Learn a sequence of tests on features, typically using top-down, greedy search
At each stage choose the unused feature with highest information gain
Binary (yes/no) or continuous decisions
[Figure: example tree splitting first on f1 vs. !f1, then on f7 vs. !f7; leaves labeled P(class) = .6, .9, and .2.]
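Information gain for a binary split can be sketched as entropy reduction (node counts are (positive, negative) pairs; a minimal illustration):

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a node containing pos/neg examples."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent=(pos, neg) into left/right."""
    n = sum(parent)
    return entropy(*parent) - (sum(left) / n) * entropy(*left) \
                            - (sum(right) / n) * entropy(*right)
```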
Decision Tree Learning
Fully grown trees tend to have decision rules that are overly specific and are therefore unable to categorize documents well
Therefore, pruning or early stopping methods for Decision Trees are normally a standard part of classification packages
Use of a small number of features is potentially bad in text categorization, but in practice decision trees do well for some text classification tasks
Decision trees are very easily interpreted by humans – much more easily than probabilistic methods like Naive Bayes
Decision Trees are normally regarded as a symbolic machine learning algorithm, though can be used probabilistically
Classic Reuters Data Set

21578 documents: 9603 training, 3299 test articles (ModApte split)
118 categories
  An article can be in more than one category
  Learn 118 binary category distinctions
Example “interest rate” article:
  2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
Common categories (#train, #test):
  Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

Average document: about 90 types, 200 tokens
Average number of classes assigned: 1.24 for docs with at least one category
Only about 10 out of 118 categories are large
Microaveraging measures performance on large categories.
Dumais et al. 1998: Reuters - Accuracy
Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category
Break-even: (Recall + Precision) / 2

Category      Rocchio   NBayes   Trees   LinearSVM
earn           92.9%    95.9%    97.8%    98.2%
acq            64.7%    87.8%    89.7%    92.8%
money-fx       46.7%    56.6%    66.2%    74.0%
grain          67.5%    78.8%    85.0%    92.4%
crude          70.1%    79.5%    85.0%    88.3%
trade          65.1%    63.9%    72.5%    73.5%
interest       63.4%    64.9%    67.1%    76.3%
ship           49.2%    85.4%    74.2%    78.0%
wheat          68.9%    69.7%    92.5%    89.7%
corn           48.2%    65.3%    91.8%    91.1%
Avg Top 10     64.6%    81.5%    88.4%    91.4%
Avg All Cat    61.7%    75.2%     n/a     86.4%
Category: “interest”
[Figure: decision tree for “interest” with node tests such as rate=1, rate.t=1, lending=0, prime=0, discount=0, pct=1, and year=0/year=1.]
Reuters ROC - Category Grain
[Figure: precision (y-axis) vs. recall (x-axis) curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar.]
Two methods with the same curve: same performance?
ROC for Category - Earn
[Figure: precision vs. recall curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar.]
ROC for Category - Crude
[Figure: precision vs. recall curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar.]
ROC for Category - Ship
[Figure: precision vs. recall curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar.]
Take Away
Evaluation of text classification
kNN
Decision trees
Hyperplanes
Many popular classification methods are linear classifiers
Training method is key difference, not the “hypothesis space” of potential classifiers
Any-of/multiclass vs One-of/polytomous