Evaluating What’s Been Learned
Cross-Validation
• Foundation is a simple idea – "holdout" – hold out a certain amount of the data for testing and use the rest for training
• The separation should NOT be done by "convenience"
– Should at least be random
– Better – "stratified" random – the division preserves the relative proportion of classes in both training and test data
• Enhancement: repeated holdout – enables using more data in training, while still getting a good test
• 10-fold cross-validation has become the standard
– It is improved if the folds are chosen in a "stratified" random way
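A stratified split can be sketched in plain Python (toy labels assumed for illustration): dealing each class's shuffled indices round-robin into the folds keeps the class proportions roughly equal in every fold.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=42):
    """Deal each class's (shuffled) instance indices round-robin into
    k folds, so every fold keeps roughly the same class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

# Assumed toy dataset: 60 "yes" and 40 "no" instances.
labels = ["yes"] * 60 + ["no"] * 40
folds = stratified_folds(labels, k=10)
# Here every fold ends up with exactly 6 "yes" and 4 "no" indices.
```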
For Small Datasets
• Leave One Out
• Bootstrapping
• To be discussed in turn
Leave One Out
• Train on all but one instance, test on that one (percent correct always equals 100% or 0%)
• Repeat until every instance has been tested on; average the results
• Really equivalent to N-fold cross-validation where N = number of instances available
• Plusses:
– Always trains on the maximum possible training data (without cheating)
– Efficient to run – no repeated runs needed (since fold contents are not randomized)
– No stratification, no random sampling necessary
• Minuses:
– Guarantees a non-stratified sample – the correct class will always be at least a little bit under-represented in the training data
– Statistical tests are not appropriate
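A minimal sketch of the procedure in plain Python, paired with a hypothetical zero-rule "classifier", shows the non-stratification minus in its most extreme form:

```python
from collections import Counter

def leave_one_out(instances, labels, train_and_predict):
    """N-fold cross-validation with N = number of instances: each
    instance is held out once; return the fraction predicted correctly."""
    correct = 0
    for i in range(len(instances)):
        train_X = instances[:i] + instances[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        correct += train_and_predict(train_X, train_y, instances[i]) == labels[i]
    return correct / len(instances)

def majority(train_X, train_y, x):
    # zero-rule "classifier": always predict the most common training class
    return Counter(train_y).most_common(1)[0][0]

# On a 50/50 dataset the held-out instance's class is always the
# *minority* of the remaining N-1 training instances, so a
# majority-class predictor scores 0% under leave-one-out.
X = list(range(100))
y = ["yes"] * 50 + ["no"] * 50
acc = leave_one_out(X, y, majority)   # 0.0
```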
Bootstrapping
• Sampling is done with replacement to form the training dataset
• Particular approach – the 0.632 bootstrap
– A dataset of n instances is sampled n times
– Some instances will be included multiple times
– Those not picked will be used as test data
– On a large enough dataset, 0.632 of the distinct data instances will end up in the training dataset; the rest will be in the test set
• This is a somewhat pessimistic estimate of performance, since only 63.2% of the data is used for training (vs 90% in 10-fold cross-validation)
• May try to balance this by also weighting in performance predicting training data (p 129) <but this doesn't seem fair>
• This procedure can be repeated any number of times, allowing statistical tests
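A sketch of one bootstrap split in plain Python; the 0.632 figure comes from the probability that an instance is never drawn, (1 − 1/n)^n → 1/e ≈ 0.368:

```python
import random

def bootstrap_split(n, seed=0):
    """Sample n instance indices with replacement for training;
    the indices never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))
    return train, test

# P(a given instance is never picked) = (1 - 1/n)^n -> 1/e ~ 0.368,
# so about 63.2% of the distinct instances land in training.
train, test = bootstrap_split(10_000)
fraction_in_training = len(set(train)) / 10_000   # roughly 0.632
```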
Counting the Cost
• Some mistakes are more costly to make than others
– Giving a loan to a defaulter is more costly than denying somebody who would be a good customer
– Sending a mail solicitation to somebody who won't buy is less costly than missing somebody who would buy (opportunity cost)
• Looking at a confusion matrix, each position could have an associated cost (or benefit, for the correct positions)
• The measurement could be average profit/loss per prediction
• To be fair, a cost-benefit analysis should also factor in the cost of collecting and preparing the data, building the model, …
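As a sketch of the idea for the loan example – every figure below is made up for illustration; negative cost means profit:

```python
# Hypothetical cost matrix: keys are (actual, predicted) pairs.
costs = {
    ("good", "good"): -50,   # profit from a good customer given a loan
    ("good", "bad"):   20,   # opportunity cost of denying a good customer
    ("bad",  "good"): 500,   # loss from a defaulter given a loan
    ("bad",  "bad"):    0,   # correctly denied defaulter
}

# Hypothetical confusion-matrix counts for the same (actual, predicted) pairs.
confusion = {
    ("good", "good"): 80, ("good", "bad"): 10,
    ("bad",  "good"):  5, ("bad",  "bad"):  5,
}

n = sum(confusion.values())
avg_cost = sum(confusion[k] * costs[k] for k in confusion) / n
# Here: (80*-50 + 10*20 + 5*500 + 5*0) / 100 = -13.0,
# i.e. an average profit of 13 per prediction.
```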
Lift Charts
• In practice, costs are frequently not known
• Decisions may be made by comparing possible scenarios
• Book example – promotional mailing
– Situation 1 – previous experience predicts that 0.1% of all 1,000,000 households will respond
– Situation 2 – the classifier predicts that 0.4% of the 100,000 most promising households will respond
– Situation 3 – the classifier predicts that 0.2% of the 400,000 most promising households will respond
– The increase in response rate is the lift (0.4 / 0.1 = 4 in situation 2, compared to sending to all)
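The arithmetic behind the three scenarios, using the figures from the book example:

```python
baseline_rate = 0.001   # situation 1: 0.1% of all 1,000,000 households
rate_2 = 0.004          # situation 2: 0.4% of the 100,000 most promising
rate_3 = 0.002          # situation 3: 0.2% of the 400,000 most promising

lift_2 = rate_2 / baseline_rate   # 4.0
lift_3 = rate_3 / baseline_rate   # 2.0

# Responders reached vs. mail sent, for comparing the scenarios:
responses_2 = rate_2 * 100_000    # 400 responders from 100,000 mailings
responses_3 = rate_3 * 400_000    # 800 responders from 400,000 mailings
```

Situation 2 has the higher lift, but situation 3 reaches twice as many responders; which is better depends on the (unknown) mailing costs, which is why the scenarios are compared on a lift chart.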
Information Retrieval (IR) Measures
• E.g., given a WWW search, a search engine produces a list of hits that are supposedly relevant
• Which is better?
– Retrieving 100 documents, of which 40 are actually relevant
– Retrieving 400 documents, of which 80 are actually relevant
– It really depends on the costs
Information Retrieval (IR) Measures
• The IR community has developed 3 measures:
– Recall = (number of relevant documents retrieved) / (total number of relevant documents)
– Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
– F-measure = (2 × recall × precision) / (recall + precision)
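A sketch of the three measures on the slide's first case, assuming (hypothetically – the slide does not say) that 200 documents are truly relevant:

```python
def ir_measures(retrieved, relevant):
    """Recall, precision and F-measure for a set of retrieved documents
    against the set of truly relevant documents."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant)
    precision = hits / len(retrieved)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

# Assumed: 200 truly relevant documents; the engine returns 100
# documents of which 40 are relevant (the slide's first case).
relevant = set(range(200))
retrieved = set(range(40)) | set(range(200, 260))
r, p, f = ir_measures(retrieved, relevant)
# recall = 40/200 = 0.2, precision = 40/100 = 0.4
```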
WEKA
• Part of the results provided by WEKA (that we've ignored so far)
• Let's look at an example (Naïve Bayes on my-weather-nominal)
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
0.667 0.125 0.8 0.667 0.727 yes
0.875 0.333 0.778 0.875 0.824 no
=== Confusion Matrix ===
a b <-- classified as
4 2 | a = yes
1 7 | b = no
• TP rate and recall are the same = TP / (TP + FN)
– For yes = 4 / (4 + 2); for no = 7 / (7 + 1)
• FP rate = FP / (FP + TN)
– For yes = 1 / (1 + 7); for no = 2 / (2 + 4)
• Precision = TP / (TP + FP)
– For yes = 4 / (4 + 1); for no = 7 / (7 + 2)
• F-measure = 2TP / (2TP + FP + FN)
– For yes = 2*4 / (2*4 + 1 + 2) = 8/11
– For no = 2*7 / (2*7 + 2 + 1) = 14/17
In terms of true positives etc.
• True positives = TP; false positives = FP
• True negatives = TN; false negatives = FN
• Recall = TP / (TP + FN) // true positives / actually positive
• Precision = TP / (TP + FP) // true positives / predicted positive
• F-measure = 2TP / (2TP + FP + FN)
– This form is derived by algebra from the previous formula
– Easier to understand this way – correct predictions are double counted, once for recall and once for precision; the denominator includes corrects and incorrects (based on either the recall or the precision idea – relevant but not retrieved, or retrieved but not relevant)
• There is no mathematics that says recall and precision can be combined this way – it is ad hoc – but it does balance the two
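The formulas can be checked against WEKA's my-weather-nominal output with a few lines of Python (a sketch; the TP/FP/FN/TN counts are read off the confusion matrix on the earlier slide):

```python
def per_class_metrics(tp, fp, fn, tn):
    """TP rate (= recall), FP rate, precision and F-measure,
    as WEKA reports them per class."""
    recall = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f = 2 * tp / (2 * tp + fp + fn)
    return recall, fp_rate, precision, f

# "yes" class from the confusion matrix: TP=4, FN=2, FP=1, TN=7
r, fpr, p, f = per_class_metrics(tp=4, fp=1, fn=2, tn=7)
# matches WEKA's row: 0.667  0.125  0.8  0.667  0.727  yes
```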
WEKA
• For many occasions, this borders on “too much information”, but it’s all there
• We can decide: are we more interested in Yes, or No?
• Are we more interested in recall or precision?
WEKA – with more than two classes
• Contact Lenses with Naïve Bayes
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
0.8 0.053 0.8 0.8 0.8 soft
0.25 0.1 0.333 0.25 0.286 hard
0.8 0.444 0.75 0.8 0.774 none
=== Confusion Matrix ===
a b c <-- classified as
4 0 1 | a = soft
0 1 3 | b = hard
1 2 12 | c = none
• Class exercise – show how to calculate recall, precision, f-measure for each class
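A sketch of the one-vs-rest calculation (it reproduces the table above, so it can serve as an answer key for the exercise):

```python
def multiclass_metrics(matrix, classes):
    """One-vs-rest recall, precision and F-measure for each class of a
    confusion matrix (rows = actual class, columns = predicted class)."""
    out = {}
    for k, c in enumerate(classes):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                    # rest of the row
        fp = sum(row[k] for row in matrix) - tp     # rest of the column
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        f = 2 * tp / (2 * tp + fp + fn)
        out[c] = (recall, precision, f)
    return out

# Contact Lenses confusion matrix from the slide
m = [[4, 0, 1],
     [0, 1, 3],
     [1, 2, 12]]
metrics = multiclass_metrics(m, ["soft", "hard", "none"])
# e.g. "hard": recall = 1/4 = 0.25, precision = 1/3 ~ 0.333, F ~ 0.286
```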
Applying Action Rules to change Detractor to Passive /Accuracy ~ Precision, Coverage ~ Recall/
• Assume we built action rules from the classifiers for Promoter & Detractor; the goal is to change Detractors -> Promoters
• The confidence of the action rule = 0.993 * 0.849 ≈ 0.84
• Our action rule can target only 4.2 (out of 10.2) detractors
• So we can expect 4.2 * 0.84 ≈ 3.5 detractors moving to promoter status
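The expected-effect arithmetic, using the figures from the slide:

```python
# Confidences of the two component rules are multiplied to get the
# action rule's confidence, which is then applied to the detractors
# the rule can actually target.
confidence = 0.993 * 0.849          # ~ 0.843
reachable_detractors = 4.2          # of 10.2 detractors in total
expected_moves = reachable_detractors * confidence   # ~ 3.5
```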