Learning from Examples: Standard Methodology for Evaluation
1) Start with a dataset of labeled examples
2) Randomly partition into N groups
3a) N times, combine N-1 groups into a train set
3b) Provide train set to learning system
3c) Measure accuracy on "left out" group (the test set)

Called N-fold cross validation (typically N = 10)
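The steps above can be sketched in code; the `learn` and `accuracy` callbacks are illustrative placeholders standing in for an actual learning system, not part of any library:

```python
import random

def cross_validate(examples, learn, accuracy, n_folds=10, seed=0):
    """Estimate future accuracy by N-fold cross validation."""
    examples = examples[:]                         # work on a copy
    random.Random(seed).shuffle(examples)          # 1) random partition...
    folds = [examples[i::n_folds] for i in range(n_folds)]  # 2) ...into N groups
    scores = []
    for i in range(n_folds):                       # 3a) N times,
        test_set = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i
                     for ex in fold]               # combine N-1 groups
        model = learn(train_set)                   # 3b) give train set to learner
        scores.append(accuracy(model, test_set))   # 3c) score the left-out group
    return sum(scores) / n_folds
```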
© Jude Shavlik 2006, David Page 2010
CS 760 – Machine Learning (UW-Madison)
Using Tuning Sets
• Often, an ML system has to choose when to stop learning, select among alternative answers, etc.
• One wants the model that produces the highest accuracy on future examples ("overfitting avoidance")
• It is a "cheat" to look at the test set while still learning
• Better method
 • Set aside part of the training set
 • Measure performance on this "tuning" data to estimate future performance for a given set of parameters
 • Use best parameter settings, train with all training data (except test set) to estimate future performance on new examples
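One way the tuning-set recipe above might look in code; `learn(data, params)` and `accuracy` are hypothetical stand-ins for your learning system:

```python
def fit_with_tuning(examples, learn, accuracy, param_grid, tune_frac=0.25):
    """Pick parameters on a held-out tuning set, then retrain on all data."""
    split = int(len(examples) * (1 - tune_frac))
    train_prime, tune = examples[:split], examples[split:]   # set aside tuning data
    # Estimate future performance of each parameter setting on the tune set only
    best = max(param_grid, key=lambda p: accuracy(learn(train_prime, p), tune))
    # Best settings found: train with ALL training data (test set untouched)
    return learn(examples, best), best
```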
Experimental Methodology: A Pictorial Overview
[Diagram: a collection of classified examples is split into training examples and testing examples; the training examples are further split into a train' set and a tune set; the LEARNER generates solutions and selects the best; the resulting classifier is applied to the testing examples to estimate expected accuracy on future examples]
Statistical techniques such as 10-fold cross validation and t-tests are used to get meaningful results
Proper Experimental Methodology Can Have a Huge Impact!
A 2002 paper in Nature (a major, major journal) needed to be corrected due to "training on the testing set"

Original report: 95% accuracy (5% error rate)
Corrected report (which still is buggy): 73% accuracy (27% error rate)

Error rate increased over 400%!!!
Parameter Setting
Notice that each train/test fold may get different parameter settings!
• That's fine (and proper)

I.e., a "parameterless"* algorithm internally sets parameters for each data set it gets
Using Multiple Tuning Sets

• Using a single tuning set can be an unreliable predictor, plus some data is "wasted"

Hence, often the following is done:
1) For each possible set of parameters,
 a) Divide training data into train' and tune sets, using N-fold cross validation
 b) Score this set of parameter values: average tune set accuracy
2) Use best combination of parameter settings on all (train' + tune) examples
3) Apply resulting model to test set
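Steps 1-3 above, sketched with hypothetical `learn(data, params)` and `accuracy` callbacks (illustrative names, not a specific library):

```python
def cv_score(examples, learn, accuracy, params, n_folds=10):
    """Step 1: average tune-set accuracy of one parameter setting."""
    folds = [examples[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        tune = folds[i]                                    # a) tune set
        train_prime = [ex for j, fold in enumerate(folds) if j != i
                       for ex in fold]                     # a) train' set
        scores.append(accuracy(learn(train_prime, params), tune))
    return sum(scores) / n_folds                           # b) average accuracy

def select_and_train(examples, learn, accuracy, param_grid, n_folds=10):
    """Step 2: retrain the best-scoring setting on all (train' + tune) examples."""
    best = max(param_grid,
               key=lambda p: cv_score(examples, learn, accuracy, p, n_folds))
    return learn(examples, best), best   # step 3: apply this model to the test set
```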
Tuning a Parameter - Sample Usage

Step 1: Try various values for k (e.g., in kNN). Use 10 train/tune splits for each k
Step 2: Pick best value for k (e.g., k = 2), then train using all training data
Step 3: Measure accuracy on test set
[Diagram: for each candidate value (k = 1, 2, …, 100), ten train/tune splits are scored; e.g., tune set accuracy (averaged over 10 runs) is 92% for k = 1, 97% for k = 2, and 80% for k = 100, so k = 2 is chosen]
What to Do for the FIELDED System?

• Do not use any test sets
• Instead only use tuning sets to determine good parameters
• Test sets used to estimate future performance
• You can report this estimate to your "customer," then use all the data to retrain a "product" to give them
What's Wrong with This?

1. Do a cross-validation study to set parameters
2. Do another cross-validation study, using the best parameters, to estimate future accuracy

• How will this relate to the "true" future accuracy?
• Likely to be an overestimate

What about
1. Do a proper train/tune/test experiment
2. Improve your algorithm; goto 1

(Machine Learning's "dirty little" secret!)
Why Not Learn After Each Test Example?

• In "production mode," this would make sense (assuming one received the correct label)
• In "experiments," we wish to estimate the probability we'll label the next example correctly; we need several samples to accurately estimate it
Choosing a Good N for CV (from Weiss & Kulikowski Textbook)

# of Examples      Method
< 50               Instead, use bootstrapping (B. Efron); see "bagging" later in cs760
50 < ex's < 100    Leave-one-out ("jackknife"): N = size of data set (leave out one example each time)
> 100              10-fold cross validation (CV), also useful for t-tests
Recap: N-fold Cross Validation

• Can be used to
 1) estimate future accuracy (by test sets)
 2) choose parameter settings (by tuning sets)
• Method
 1) Randomly permute examples
 2) Divide into N bins
 3) Train on N-1 bins, measure performance on bin "left out"
 4) Compute average accuracy on held-out sets

[Diagram: the examples laid out in a row, divided into Fold 1, Fold 2, Fold 3, Fold 4, Fold 5]
Confusion Matrices - Useful Way to Report TESTSET Errors

Useful for NETtalk testbed – task of pronouncing written words
Scatter Plots - Compare Two Algo's on Many Datasets

[Plot: Algo A's error rate vs. Algo B's error rate; each dot is the error rate of the two algo's on ONE dataset]
Statistical Analysis of Sampling Effects

• Assume we get e errors on N test set examples
• What can we say about the accuracy of our estimate of the true (future) error rate?
• We'll assume test set/future examples independently drawn (iid assumption)
• Can give probability our true error rate is in some range – error bars
The Binomial Distribution

• Distribution over the number of successes x in a fixed number n of independent trials (with the same probability of success p in each):

 Pr(x) = (n choose x) p^x (1 - p)^(n - x)

[Plot: Pr(x) vs. x for the binomial distribution with p = 0.5, n = 10]
Using the Binomial

• Let each test case (test data point) be a trial, and let a success be an incorrect prediction
• Maximum likelihood estimate of probability p of success is the fraction of predictions wrong
• Can exactly compute the probability that error rate estimate p is off by more than some amount, say 0.025, in either direction
• For large N, this computation's expensive
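The exact computation the slide describes, as a small sketch using only the standard library:

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(x) = (n choose x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def prob_estimate_off(n, p, delta):
    """Exact probability that the observed error fraction x/n differs from
    the true rate p by more than delta in either direction."""
    return sum(binom_pmf(x, n, p)
               for x in range(n + 1)
               if abs(x / n - p) > delta)
```

The sum over all n + 1 outcomes is what gets expensive for large N, motivating the Gaussian approximation on the following slides.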
Central Limit Theorem

• Roughly, for large enough N, all distributions look Gaussian when summing/averaging N values

Surprisingly, N = 30 is large enough! (in most cases at least) - see pg 132 of textbook

[Plot: the distribution P(Y_i) of the average of Y over N trials (repeated many times), on a 0-to-1 axis, approaching a Gaussian]
Confidence Intervals

If we can measure/estimate μ and σ (std deviation), the free parameters in a Gaussian P(Y), we can determine some accurate bounds on future errors.

We want to determine δ s.t. with prob M (typically 0.95) the Y we measure in the future will lie in [μ - δ, μ + δ]. Solve for δ in

 ∫ from μ-δ to μ+δ of P(Y) dY = M    [Equation 1]
As You Already Learned in "Stat 101"

If we estimate μ (mean error rate) and σ (std dev), we can say our ML algo's error rate is

 μ ± Z_M σ

Z_M: value you looked up in a table of N(0,1) for desired confidence; e.g., for 95% confidence it's 1.96
The Remaining Details

Let e = test set errors, N = test set size, so the estimated error rate is p = e / N,

which produces an M% confidence interval on the error rate:

 Error rate = e/N ± Z_M sqrt( (e/N)(1 - e/N) / N )

For M = 0.95, Z_M = 1.96; if N = 100, e = 10 we get 0.10 ± 0.06

See Table 5.1 for more Z_M's
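A sketch of that interval computation (normal approximation to the binomial; Z_M = 1.96 for 95% confidence):

```python
from math import sqrt

def error_rate_ci(e, n, z_m=1.96):
    """CI on the true error rate: e/N +- Z_M * sqrt((e/N)(1 - e/N)/N)."""
    p = e / n
    half_width = z_m * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width
```

For e = 10, N = 100 this gives 0.10 ± 0.059, which the slide rounds to 0.10 ± 0.06.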
Alternative: Bootstrap Confidence Intervals

• Given a data set of N items, sample N items uniformly with replacement
• Estimate value of interest (e.g., train on bootstrap sample, test on the rest)
• Repeat some number of times (1000 or 10,000 typical)
• 95% CI: values such that observed is only lower (higher) on 2.5% of runs
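A percentile-bootstrap sketch of those four steps; the `statistic` callback is whatever value of interest you estimate on each resample:

```python
import random

def bootstrap_ci(data, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for any statistic of the data."""
    rng = random.Random(seed)
    n = len(data)
    # Resample N items with replacement, n_boot times; sort the statistics
    stats = sorted(statistic([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]           # 2.5th percentile
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]   # 97.5th percentile
    return lo, hi
```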
Bootstrap

• Statisticians typically use this approach more now, given fast computers
• Many applications
 • CIs on estimates of accuracy, area under curve, etc.
 • CIs on estimates of mean squared error or absolute error in real-valued prediction
 • P-values for one algorithm vs. another according to the above measures
Contingency Tables

Counts of occurrences:

                          True Answer
                          +                    -
Algorithm   +   n(1,1) [true pos]    n(1,0) [false pos]
Answer      -   n(0,1) [false neg]   n(0,0) [true neg]
TPR and FPR

True Positive Rate (TPR) = n(1,1) / ( n(1,1) + n(0,1) )
 = correctly categorized +'s / total positives
 = P(algo outputs + | + is correct)

False Positive Rate (FPR) = n(1,0) / ( n(1,0) + n(0,0) )
 = incorrectly categorized -'s / total neg's
 = P(algo outputs + | - is correct)

Can similarly define False Negative Rate and True Negative Rate
See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
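These two rates as a small helper, assuming boolean predicted/true labels (True = positive); the function name is illustrative:

```python
def rates(predictions, truths):
    """TPR and FPR from paired predicted/true labels."""
    tp = sum(p and t for p, t in zip(predictions, truths))        # n(1,1)
    fp = sum(p and not t for p, t in zip(predictions, truths))    # n(1,0)
    fn = sum(not p and t for p, t in zip(predictions, truths))    # n(0,1)
    tn = sum(not p and not t for p, t in zip(predictions, truths))  # n(0,0)
    return tp / (tp + fn), fp / (fp + tn)   # (TPR, FPR)
```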
ROC Curves

• ROC: Receiver Operating Characteristics
• Started during radar research during WWII
• Judging algorithms on accuracy alone may not be good enough when getting a positive wrong costs more than getting a negative wrong (or vice versa)
 • E.g., medical tests for serious diseases
 • E.g., a movie-recommender (a la NetFlix) system
ROC Curves Graphically

[Plot: true positives rate, Prob(alg outputs + | + is correct), vs. false positives rate, Prob(alg outputs + | - is correct), both from 0 to 1.0; the "Ideal Spot" is the upper-left corner; curves for Alg 1 and Alg 2 cross]

Different algorithms can work better in different parts of ROC space. This depends on cost of false + vs false -
Creating an ROC Curve - the Standard Approach

• You need an ML algorithm that outputs NUMERIC results such as prob(example is +)
• You can use ensembles (later) to get this from a model that only provides Boolean outputs
 • E.g., have 100 models vote & count votes
Algo for Creating ROC Curves (most common but not only way)

Step 1: Sort predictions on test set
Step 2: Locate a threshold between examples with opposite categories
Step 3: Compute TPR & FPR for each threshold of Step 2
Step 4: Connect the dots
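A sketch of the four steps. For simplicity this version places a threshold after every distinct score, which yields a superset of the between-opposite-categories thresholds the slide describes (the extra points lie on the same connected curve):

```python
def roc_points(scored):
    """ROC points from (score, is_positive) pairs."""
    scored = sorted(scored, key=lambda s: -s[0])   # step 1: sort predictions
    n_pos = sum(1 for _, y in scored if y)
    n_neg = len(scored) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i, (score, y) in enumerate(scored):
        tp, fp = tp + y, fp + (not y)
        # step 2: place a threshold wherever the score changes
        if i + 1 == len(scored) or scored[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))   # step 3: (FPR, TPR)
    return points                                     # step 4: connect the dots
```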
Plotting ROC Curves - Example

ML Algo Output (Sorted)    Correct Category
Ex 9     .99    +
Ex 7     .98    +
Ex 1     .72    -
Ex 2     .70    +
Ex 6     .65    +
Ex 10    .51    -
Ex 3     .39    -
Ex 5     .24    +
Ex 4     .11    -
Ex 8     .01    -

[Plot: P(alg outputs + | + is correct) vs. P(alg outputs + | - is correct), through these points:]
TPR=(2/5), FPR=(0/5)
TPR=(2/5), FPR=(1/5)
TPR=(4/5), FPR=(1/5)
TPR=(4/5), FPR=(3/5)
TPR=(5/5), FPR=(3/5)
TPR=(5/5), FPR=(5/5)
To Get Smoother Curve, Linearly Interpolate

[Plot: the same ROC points connected by straight-line segments; axes from 0 to 1.0, x-axis P(alg outputs + | - is correct)]
Note: Each point is a model plus a threshold… call that a prediction algorithm

Achievable: To get points along a linear interpolation, flip a weighted coin to choose between prediction algorithms

Convex Hull: Perform all interpolations, and discard any point that lies below a line

[Plot: the ROC convex hull; axes from 0 to 1.0, x-axis P(alg outputs + | - is correct)]
Be Careful: The prediction algorithms (model and threshold pairs) that look best on the training set may not be the best on future data

Lessen Risk: Perform all interpolations and build the convex hull using a tuning set

[Plot: the ROC convex hull built on tuning data; axes from 0 to 1.0, x-axis P(alg outputs + | - is correct)]
ROC's and Many Models (not in the ensemble sense)

• It is not necessary that we learn one model and then threshold its output to produce an ROC curve
• You could learn different models for different regions of ROC space
• E.g., see Goadrich, Oliphant, & Shavlik ILP '04 and MLJ '06
Area Under ROC Curve

A common metric for experiments is to numerically integrate the ROC Curve

[Plot: true positives vs. false positives, axes 0 to 1.0, with the area under the curve shaded]

Area under curve (AUC) -- sometimes written AUCROC to be explicit
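The numeric integration can be as simple as the trapezoidal rule over the curve's points, sorted by x (a minimal sketch; the function name is illustrative):

```python
def auc(points):
    """Trapezoidal area under a curve given as x-sorted (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```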
Asymmetric Error Costs

• Assume that cost(FP) != cost(FN)
• You would like to pick a threshold that minimizes
 E(total cost) = cost(FP) x prob(FP) x (# of -) + cost(FN) x prob(FN) x (# of +)
• You could also have (maybe negative) costs for TP and TN (assumed zero in above)
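A sketch of picking the cost-minimizing operating point from a list of candidate ROC points (the expected-cost formula above, with TP/TN costs assumed zero; names are illustrative):

```python
def best_threshold(roc, n_pos, n_neg, cost_fp, cost_fn):
    """Pick the (FPR, TPR) point minimizing expected total cost."""
    def expected_cost(point):
        fpr, tpr = point
        fnr = 1 - tpr   # prob(FN) = 1 - TPR
        return cost_fp * fpr * n_neg + cost_fn * fnr * n_pos
    return min(roc, key=expected_cost)
```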
ROC's & Skewed Data

• One strength of ROC curves is that they are a good way to deal with skewed data (|-| >> |+|) since the axes are fractions (rates) independent of the # of examples
• You must be careful though!
 • Low FPR x (many negative ex) = sizable number of FP
 • Possibly more than # of TP
Precision vs. Recall (think about search engines)

• Precision = (# of relevant items retrieved) / (total # of items retrieved)
 = n(1,1) / ( n(1,1) + n(1,0) )
 = P(is pos | called pos)
• Recall = (# of relevant items retrieved) / (# of relevant items that exist)
 = n(1,1) / ( n(1,1) + n(0,1) ) = TPR
 = P(called pos | is pos)
• Notice that n(0,0) is not used in either formula. Therefore you get no credit for filtering out irrelevant items
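The two formulas as a small helper, assuming boolean predicted/true labels (True = relevant/positive); the name is illustrative:

```python
def precision_recall(predictions, truths):
    """Precision and recall from paired predicted/true labels."""
    tp = sum(p and t for p, t in zip(predictions, truths))      # n(1,1)
    fp = sum(p and not t for p, t in zip(predictions, truths))  # n(1,0)
    fn = sum(not p and t for p, t in zip(predictions, truths))  # n(0,1)
    # n(0,0) never appears -- no credit for filtering out irrelevant items
    return tp / (tp + fp), tp / (tp + fn)   # (precision, recall)
```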
ROC vs. Recall-Precision

You can get very different visual results on the same data.
The reason for this is that there may be lots of - ex's (e.g., might need to include 100 neg's to get 1 more pos)

[Two plots of the same data: an ROC curve, P(+ | + is correct) vs. P(+ | - is correct), and a precision vs. recall curve]
Recall-Precision Curves

You cannot simply connect the dots in Recall-Precision curves as we did for ROC

See Goadrich, Oliphant, & Shavlik, ILP '04 or MLJ '06

[Plot: precision vs. recall, with an "x" marking a point that naive linear interpolation would wrongly include]
Interpolating in PR Space

• Would like to interpolate correctly, then remove points that lie below the interpolation
• Analogous to convex hull in ROC space
• Can you do it efficiently?
 • Yes – convert to ROC space, take convex hull, convert back to PR space (Davis & Goadrich, ICML-06)
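A sketch of correct interpolation between two PR operating points, following the Davis & Goadrich idea of stepping the TP count linearly and letting FP follow the local FP-per-TP slope (raw TP/FP counts are required; it assumes tp_b > tp_a; names are illustrative):

```python
def pr_interpolate(tp_a, fp_a, tp_b, fp_b, n_pos, steps=10):
    """Correct PR-space interpolation between two operating points."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)   # extra FPs per extra TP
    points = []
    for i in range(steps + 1):
        tp = tp_a + (tp_b - tp_a) * i / steps
        fp = fp_a + slope * (tp - tp_a)
        points.append((tp / n_pos, tp / (tp + fp)))   # (recall, precision)
    return points
```

For example, between recall 0.5 (precision 0.5) and recall 1.0 (precision 0.25), this gives precision 0.3 at recall 0.75, not the 0.375 that a straight line drawn in PR space would suggest.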
The Relationship between Precision-Recall and ROC Curves

Jesse Davis & Mark Goadrich
Department of Computer Sciences
University of Wisconsin
Four Questions about PR Space and ROC Space

• Q1: Does optimizing AUC in one space optimize it in the other space?
• Q2: If a curve dominates in one space will it dominate in the other?
• Q3: What is the "best" PR curve?
• Q4: How do you interpolate in PR space?
Optimizing AUC

• Interest in learning algorithms that optimize Area Under the Curve (AUC) [Ferri et al. 2002, Cortes and Mohri 2003, Joachims 2005, Prati and Flach 2005, Yan et al. 2003, Herschtal and Raskutti 2004]
• Q: Does an algorithm that optimizes AUC-ROC also optimize AUC-PR?
• A: No. Can easily construct counterexample
Definition: Dominance

[Plot: precision vs. recall, axes 0.0 to 1.0; Algorithm 1's curve lies above Algorithm 2's everywhere, so Algorithm 1 dominates]
Definition: Area Under the Curve (AUC)

[Two shaded plots: precision vs. recall, and TPR vs. FPR, each with the area under the curve highlighted]
How do we evaluate ML algorithms?

• Common evaluation metrics
 • ROC curves [Provost et al '98]
 • PR curves [Raghavan '89, Manning & Schutze '99]
 • Cost curves [Drummond and Holte '00, '04]
• If the class distribution is highly skewed, most believe PR curves preferable to ROC curves
Two Highly Skewed Domains

[Images: Is an abnormality on a mammogram benign or malignant? Do these two identities refer to the same person?]
Predicting Aliases [Synthetic data: Davis et al. ICIA 2005]

[Plot: ROC space, True Positive Rate vs. False Positive Rate, axes 0.0 to 1.0, with curves for Algorithm 1, Algorithm 2, and Algorithm 3]
Predicting Aliases [Synthetic data: Davis et al. ICIA 2005]

[Plot: PR space, Precision vs. Recall, axes 0.0 to 1.0, with curves for Algorithm 1, Algorithm 2, and Algorithm 3]
Diagnosing Breast Cancer [Real data: Davis et al. IJCAI 2005]

[Plot: ROC space, True Positive Rate vs. False Positive Rate, axes 0.0 to 1.0, with curves for Algorithm 1 and Algorithm 2]
Diagnosing Breast Cancer [Real data: Davis et al. IJCAI 2005]

[Plot: PR space, Precision vs. Recall, axes 0.0 to 1.0, with curves for Algorithm 1 and Algorithm 2]
A2: Dominance Theorem

[Two plots of the same algorithms: in ROC space (True Positive Rate vs. False Positive Rate) and in PR space (Precision vs. Recall), Algorithm 1 dominates Algorithm 2 in both]
For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in PR space
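The theorem hinges on the fact that, once the counts of positives P and negatives N are fixed, each point in ROC space corresponds to exactly one point in PR space and vice versa (recall equals TPR). A minimal sketch of the two conversions (function names are ours, not from the paper):

```python
# Convert between ROC points (FPR, TPR) and PR points (recall, precision),
# assuming fixed counts of P positive and N negative examples.

def roc_to_pr(fpr, tpr, P, N):
    tp, fp = tpr * P, fpr * N
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0  # convention at (0, 0)
    return tpr, precision            # recall is exactly TPR

def pr_to_roc(recall, precision, P, N):
    tp = recall * P
    fp = tp * (1 - precision) / precision   # solve precision = tp/(tp+fp)
    return fp / N, recall                   # (FPR, TPR)
```

Because this mapping is one-to-one for fixed P and N, a curve that sits above another everywhere in one space must do so in the other, which is the content of the theorem.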
Q3: What is the “best” PR curve?
• The “best” curve in ROC space for a set of points is the convex hull [Provost et al. ’98]
  • It is achievable
  • It maximizes AUC
• Q: Does an analog to the convex hull exist in PR space?
• A3: Yes! We call it the Achievable PR Curve
Convex Hull
[Figure: the original points in ROC space (True Positive Rate vs. False Positive Rate)]
Convex Hull
[Figure: the convex hull drawn over the original points in ROC space]
A3: Achievable Curve
[Figure: the original points in PR space (precision axis from 0.00 to 0.30)]
A3: Achievable Curve
[Figure: the achievable curve drawn over the original points in PR space]
Constructing the Achievable Curve
Given: a set of PR points and a fixed number of positive and negative examples
• Translate the PR points to ROC points
• Construct the convex hull in ROC space
• Convert the hull curve back into PR space
Corollary: by the dominance theorem, the resulting curve in PR space dominates every other legal PR curve you could construct with the given points
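The three steps can be sketched directly; this is our own illustrative code (not the authors'), using a monotone-chain upper hull for the ROC convex hull and assuming fixed counts P of positives and N of negatives:

```python
def pr_to_roc(recall, precision, P, N):
    tp = recall * P
    fp = tp * (1 - precision) / precision   # from precision = tp/(tp+fp)
    return (fp / N, recall)                 # (FPR, TPR); recall == TPR

def roc_to_pr(fpr, tpr, P, N):
    tp, fp = tpr * P, fpr * N
    return (tpr, tp / (tp + fp) if tp + fp > 0 else 1.0)  # (recall, prec)

def roc_convex_hull(points):
    """Upper convex hull of ROC points, anchored at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for x, y in pts:
        # Pop the last hull point while it lies on or below the segment
        # from the point before it to the new point (monotone chain).
        # (Real inputs may want an epsilon here for near-collinear noise.)
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull

def achievable_pr_points(pr_points, P, N):
    roc = [pr_to_roc(r, p, P, N) for r, p in pr_points]
    return [roc_to_pr(fpr, tpr, P, N) for fpr, tpr in roc_convex_hull(roc)]
```

Note that only the hull vertices map back to PR points; between consecutive vertices the achievable curve is the curved image of the straight ROC segment, not a straight line in PR space.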
Q4: Interpolation
• Interpolation in ROC space is easy: a linear connection between points
[Figure: points A and B joined by a straight line in ROC space (TPR vs. FPR)]
Linear Interpolation Not Achievable in PR Space
• Precision interpolation is counterintuitive [Goadrich et al., ILP 2004]

Example counts with their ROC and PR coordinates (1000 positives, 9000 negatives):

  TP     FP     TP Rate   FP Rate   Recall   Prec
  500    500    0.50      0.06      0.50     0.50
  1000   9000   1.00      1.00      1.00     0.10
  750    4750   0.75      0.53      0.75     0.14

The last row is the midpoint of the ROC segment between the first two: linear in ROC space, yet far below the linear PR midpoint of 0.30.
[Figures: the corresponding ROC and PR curves]
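The interpolated row (TP = 750, FP = 4750) can be checked by hand from the raw counts; a quick arithmetic sketch (variable names are ours):

```python
# Counterexample from the slide: P = 1000 positives, N = 9000 negatives.
tp_a, fp_a = 500, 500       # point A: recall 0.50, precision 0.50
tp_b, fp_b = 1000, 9000     # point B: recall 1.00, precision 0.10

# Midpoint of the ROC segment between A and B (achievable by mixing):
tp_mid = (tp_a + tp_b) / 2              # 750
fp_mid = (fp_a + fp_b) / 2              # 4750
prec_on_roc_segment = tp_mid / (tp_mid + fp_mid)   # ~0.14, as in the table
prec_linear_in_pr = (0.50 + 0.10) / 2              # 0.30: what a straight
                                                   # line in PR space claims
```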
Q: For each extra TP covered, how many FPs do you cover?
Example Interpolation
A dataset with 20 positive and 2000 negative examples:

      TP   FP   REC    PREC
  A    5    5   0.25   0.5
  B   10   30   0.5    0.25

A: you cover (FPB - FPA) / (TPB - TPA) FPs for each extra TP
Example Interpolation
A dataset with 20 positive and 2000 negative examples:

      TP   FP   REC    PREC
  A    5    5   0.25   0.5
  .    6   10   0.3    0.375
  .    7   15   0.35   0.318
  .    8   20   0.4    0.286
  .    9   25   0.45   0.265
  B   10   30   0.5    0.25
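The rows between A and B follow the interpolation rule of Goadrich et al.: each extra true positive brings along (FPB - FPA)/(TPB - TPA) extra false positives. A sketch (function name is ours):

```python
def pr_interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """PR-space interpolation between two classifier operating points:
    each extra TP brings (fp_b - fp_a)/(tp_b - tp_a) extra FPs."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)   # FPs gained per extra TP
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + slope * x
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, prec)
    return points
```

With the slide's numbers, `pr_interpolate(5, 5, 10, 30, 20)` reproduces the table above, e.g. the second row (recall 0.3, precision 6/16 = 0.375).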
Back to Q2
• Answers A3 and A4 relied on A2
• Now let’s prove A2…
Dominance Theorem
[Figures: Algorithms 1 and 2 plotted in ROC space (True Positive Rate vs. False Positive Rate) and in PR space (Precision vs. Recall)]
For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in Precision-Recall space
For Fixed N, P, and TPR: FPR ↑ ⇒ Precision ↓ (Not =)

Confusion matrix (rows: algorithm's answer; columns: true answer), with P = 100 positives and N = 1000 negatives:

           True +   True -
  Alg +      75      100
  Alg -      25      900
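Writing the matrix's counts in terms of rates makes the claim explicit: precision = TPR·P / (TPR·P + FPR·N), so for fixed P, N, and TPR, precision falls as FPR rises, but not proportionally. A quick check with the slide's numbers (P = 100, N = 1000, TP = 75, FP = 100):

```python
# precision = TPR*P / (TPR*P + FPR*N): decreasing in FPR for fixed TPR, P, N
def precision(tpr, fpr, P, N):
    return tpr * P / (tpr * P + fpr * N)

p1 = precision(0.75, 0.10, 100, 1000)   # the slide's matrix: 75/(75+100)
p2 = precision(0.75, 0.20, 100, 1000)   # doubling FPR lowers precision
```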
Conclusions about PR and ROC Curves
• A curve dominates in one space iff it dominates in the other space
• An analog to the convex hull exists in PR space, which we call the achievable PR curve
• Linear interpolation is not achievable in PR space
• Optimizing AUC in one space does not optimize AUC in the other space
To Avoid Pitfalls, Ask:
1. Is my held-aside test data really representative of going out to collect new data?
• Even if your methodology is fine, someone may have collected features for positive examples differently than for negatives; collection should be randomized
• Example: samples from cancer patients processed by different people or on different days than samples from normal controls
To Avoid Pitfalls, Ask:
2. Did I repeat my entire data-processing procedure on every fold of cross-validation, using only the training data for that fold?
• On each fold of cross-validation, did I ever access in any way the label of a test case?
• Any preprocessing done over the entire data set (feature selection, parameter tuning, threshold selection) must not use labels
To Avoid Pitfalls, Ask:
3. Have I modified my algorithm so many times, or tried so many approaches, on this same data set that I (the human) am overfitting it?
• Have I continually modified my preprocessing or learning algorithm until I got some improvement on this data set?
• If so, I really need to get some additional data now, at least to test on
Alg 1 vs. Alg 2
• Alg 1 has accuracy 80%, Alg 2 has 82%
• Is this difference significant?
• It depends on how many test cases these estimates are based on
• The test we do depends on how we arrived at these estimates
Leave-One-Out: Sign Test
• Suppose we ran leave-one-out cross-validation on a data set of 100 cases
• Divide the cases into (1) Alg 1 won, (2) Alg 2 won, (3) ties (both wrong or both right); throw out the ties
• Suppose 10 ties and 50 wins for Alg 1
• Ask: under the null Binomial(90, 0.5), what is the probability of 50 or more, or 40 or fewer, successes?
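Under the null hypothesis each non-tied case is a fair coin flip, so the probability above is a binomial tail. A small sketch of the two-sided sign test (our own code):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test; ties are assumed already thrown out."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # P(X >= k)
    return min(1.0, 2 * tail)   # double for the symmetric lower tail

p = sign_test_p(50, 40)   # the slide's example: 50 wins vs. 40 wins
```

With 50 wins against 40, the p-value comes out far above 0.05, so this split alone is not significant.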
What about 10-fold?
• It is difficult to get significance from a sign test on 10 cases
• We’re throwing out the numbers (accuracy estimates) for each fold and just asking which is larger
• Use the numbers: the t-test is designed to test for a difference of means
Paired Student t-Tests
• Given
  • 10 training/test sets
  • 2 ML algorithms
  • Results of the 2 ML algorithms on the 10 test sets
• Determine
  • Which algorithm is better on this problem?
  • Is the difference statistically significant?
Paired Student t-Tests (cont.)
Example: accuracies on the test sets
  Algorithm 1:  80%  50  75  …  99
  Algorithm 2:  79   49  74  …  98
  δᵢ:           +1   +1  +1  …  +1
• Algorithm 1’s mean is better, but the two standard deviations will clearly overlap
• But Algorithm 1 is always better than Algorithm 2
The Random Variable in the t-Test
Consider the random variable
  δᵢ = (Algo A’s error on test set i) − (Algo B’s error on test set i)
Notice we’re “factoring out” test-set difficulty by looking at relative performance.
In general, one tries to explain the variance in results across experiments. Here we’re saying that
  Variance = f(problem difficulty) + g(algorithm strength)
More on the Paired t-Test
Our NULL HYPOTHESIS is that the two ML algorithms have equivalent average accuracies
• i.e., differences (in the scores) are due to “random fluctuations” about the mean of zero
We compute the probability that the observed δ arose from the null hypothesis
• If this probability is low, we reject the null hypothesis and say that the two algorithms appear different
• ‘Low’ is usually taken as prob ≤ 0.05
The Null Hypothesis Graphically (View #1)
[Figure: the distribution P(δ) under the null hypothesis, centered at zero]
1. Assume zero mean and use the sample’s variance (sample = experiment)
½(1 – M) probability mass in each tail (i.e., M inside); typically M = 0.95
Does our measured δ lie in the tails? If so, reject the null hypothesis, since it is unlikely we’d get such a δ by chance
View #2 – The Confidence Interval for δ
[Figure: the distribution P(δ) centered at the sample mean]
2. Use the sample’s mean and variance
Is zero inside the M% probability mass? If NOT, reject the null hypothesis
The t-Test Calculation
Compute:
• Mean:  δ̄ = (1/N) Σᵢ δᵢ
• Sample variance:  S² = (1 / (N(N−1))) Σᵢ (δᵢ − δ̄)²
• Look up the t value t_{M, N−1} for N folds and confidence level M (see Table 5.6 in Mitchell)
  – “N−1” is called the degrees of freedom
  – As N → ∞, t_{M, N−1} and Z_M become equivalent
We don’t know an analytical expression for the variance, so we need to estimate it on the data
The t-Test Calculation (cont.) – Using View #2 (you get the same result using View #1)
Calculate the interval:  δ̄ ± t_{M, N−1} · S
The interval contains 0 if:  |δ̄| ≤ t_{M, N−1} · S
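Putting the pieces together for a hypothetical 10-fold run (the δ values are illustrative; the critical value t_{0.95,9} ≈ 2.262 is taken from a standard t-table):

```python
from math import sqrt

def paired_t_interval(deltas, t_crit):
    """Confidence interval  mean(delta) +/- t_{M,N-1} * S  (View #2)."""
    n = len(deltas)
    mean = sum(deltas) / n
    # S^2 = (1/(N(N-1))) * sum (delta_i - mean)^2, as in the slide
    s2 = sum((d - mean) ** 2 for d in deltas) / (n * (n - 1))
    half = t_crit * sqrt(s2)
    return mean - half, mean + half

# Hypothetical per-fold accuracy differences (Alg 1 minus Alg 2), in %:
deltas = [1, 1, 1, 2, 1, 1, 1, 1, 2, 1]
lo, hi = paired_t_interval(deltas, t_crit=2.262)  # 95%, 9 degrees of freedom
# zero lies outside (lo, hi), so we reject the null hypothesis
```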
Some Jargon: P-values (Uses View #1)
P-value = probability of getting one’s results or greater, given the NULL HYPOTHESIS
(We usually want P ≤ 0.05 to be confident that a difference is statistically significant)
[Figure: the null-hypothesis distribution, with the measured value marked and the tail area P shaded]
From Wikipedia (http://en.wikipedia.org/wiki/P-value)
The p-value of an observed value X_observed of some random variable X is the probability that, given that the null hypothesis is true, X will assume a value as or more unfavorable to the null hypothesis as the observed value X_observed
“More unfavorable to the null hypothesis” can in some cases mean greater than, in some cases less than, and in some cases further away from a specified center
“Accepting” the Null Hypothesis
Note: even if the p-value is high, we cannot assume the null hypothesis is true
E.g., if we flip a coin twice and get one head, can we statistically infer that the coin is fair?
Vs. if we flip a coin 100 times and observe 10 heads, we can statistically infer that the coin is unfair, because that is very unlikely to happen with a fair coin
How would we show a coin is fair?
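Both coin questions are binomial-tail computations; a quick check (our own code):

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# 10 heads in 100 flips: the two-sided p-value is astronomically small,
# so we can reject "the coin is fair".
p_100 = 2 * binom_cdf(10, 100)

# 1 head in 2 flips is the most likely outcome under fairness, so the
# p-value is 1 -- yet this does not let us *accept* that the coin is fair.
p_2 = min(1.0, 2 * min(binom_cdf(1, 2), 1 - binom_cdf(0, 2)))
```

Showing a coin is (approximately) fair needs a different tool, e.g. a confidence interval for the heads probability that lies entirely inside a narrow band around 0.5.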
More on the t-Distribution
We typically don’t have enough folds to assume the central limit theorem (i.e., N < 30)
• So, we need to use the t distribution
• It’s wider (and hence shorter) than the Gaussian (Z) distribution (since PDFs integrate to 1)
• Hence, our confidence intervals will be wider
• Fortunately, t-tables exist
[Figure: the Gaussian P(δ) alongside the t_N densities; a different curve for each N]
Some Assumptions Underlying our Calculations
General: the central limit theorem applies (i.e., ≥ 30 measurements averaged)
ML-specific:
• #errors / #tests accurately estimates p, the probability of error on one example
  – used in the formula for the standard deviation, which characterizes expected future deviations about the mean (p)
• Using an independent sample of the space of possible instances
  – representative of future examples
  – individual examples drawn i.i.d.
• For paired t-tests, the learned classifier is the same for each fold (“stability”), since we combine results across folds
Stability
Stability = how much the model an algorithm learns changes due to minor perturbations of the training set
Paired t-test assumptions are a better match to a stable algorithm
Example: k-NN; the higher the k, the more stable
More on the Paired t-Test Assumption
Ideally, train on one data set and then do a 10-fold paired t-test
• What we should do: train, then test1 … test10 (the learned model does not vary while we’re measuring its performance)
• What we usually do: train1/test1 … train10/test10
• However, there is usually not enough data to do the ideal
• If we assume that the train data is part of each paired experiment, then we violate independence assumptions: each train set overlaps 90% with every other train set
Note: Many Statisticians Prefer the Bootstrap Instead
• Given a data set of N examples, do the following M times (where M is typically 1K or 10K):
  • Sample N examples from the data set uniformly at random, with replacement
  • Train both algorithms on the sampled data set and test on the remaining data
• The p-value is the fraction of runs on which Alg A is no better than Alg B
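A sketch of that loop; `train_a`, `train_b`, and `accuracy` are placeholder callables standing in for real learners and an evaluation routine:

```python
import random

def bootstrap_p_value(data, train_a, train_b, accuracy, m=1000, seed=0):
    """Fraction of bootstrap runs on which Alg A is no better than Alg B."""
    rng = random.Random(seed)
    n, runs, losses = len(data), 0, 0
    for _ in range(m):
        sample = [rng.choice(data) for _ in range(n)]    # with replacement
        held_out = [x for x in data if x not in sample]  # ~36.8% of the data
        if not held_out:        # (vanishingly rare) no test data this run
            continue
        runs += 1
        acc_a = accuracy(train_a(sample), held_out)
        acc_b = accuracy(train_b(sample), held_out)
        if acc_a <= acc_b:
            losses += 1
    return losses / runs
```

On average a bootstrap sample omits about 1/e ≈ 36.8% of the examples, which is what serves as the test set on each run.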
The Great Debate (or one of them, at least)
• Should you use a one-tailed or a two-tailed t-test?
• A two-tailed test asks the question: are algorithms A and B statistically different?
• A one-tailed test asks the question: is algorithm A statistically better than algorithm B?
One vs. Two-Tailed Graphically
[Figure: a distribution P(x); the two-tailed test marks 2.5% in each tail, while the one-tailed test puts its rejection region in a single tail]
The Great Debate (More)
• Which of these tests should you use when comparing your new algorithm to a state-of-the-art algorithm?
• You should use two-tailed, because by using it you are saying: there is a chance I am better and a chance I am worse
• One-tailed is saying: I know my algorithm is no worse, and therefore you are allowed a larger margin of error
• By being more confident, it is easier to show significance!
See http://www.psychstat.missouristate.edu/introbook/sbk25m.htm
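The last point can be made concrete with a normal approximation (z = 1.8 is an arbitrary illustrative value):

```python
from statistics import NormalDist

z = 1.8   # hypothetical standardized difference between Alg A and Alg B
one_tailed = 1 - NormalDist().cdf(z)   # "Is A better than B?"
two_tailed = 2 * one_tailed            # "Are A and B different?"
# one_tailed < 0.05 < two_tailed: the same data look "significant" under
# the one-tailed test but not under the two-tailed test
```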
Two-Sided vs. One-Sided
You need to think very carefully about the question you are asking
Are we within x of the true error rate?
[Figure: a two-sided interval from (mean − x) to (mean + x) around the measured mean]
Two-Sided vs. One-Sided
How confident are we that ML System A’s accuracy is at least 85%?
[Figure: a one-sided bound at 85%]
Two-Sided vs. One-Sided
Is ML algorithm A no more accurate than algorithm B?
[Figure: a one-sided test on the difference A − B]
Two-Sided vs. One-Sided
Are ML algorithms A and B equivalently accurate?
[Figure: a two-sided test on the difference A − B]