Lecture 7: Cross validation
TRANSCRIPT
![Page 1: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/1.jpg)
Lecture 7: Tuning hyperparameters using cross validation
Stéphane Canu <[email protected]>
São Paulo 2014
April 4, 2014
![Page 2: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/2.jpg)
Roadmap: 1 Tuning hyperparameters
- Motivation
- Machine learning without data
- Assessing the quality of a trained SVM
- Model selection

[Figure: validation error over a grid of the log of the bandwidth and the log of C.]

"Evaluation is the key to making real progress in data mining", [Witten & Frank, 2005], p. 143 (from N. Japkowicz & M. Shah, ICML 2012 tutorial)
![Page 3: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/3.jpg)
Motivation: the influence of C on SVM
[Figure: error rate (0.22 to 0.30) as a function of C (log scale), with three example decision boundaries on the ±1 classes: "C too small", "nice C", and "C too large".]
![Page 4: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/4.jpg)
Motivation
- Need for model selection (tuning the hyperparameters)
- Requires a good estimate of the performance on future data
- Choose a relevant performance measure
![Page 5: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/5.jpg)
Roadmap: 1 Tuning hyperparameters
- Motivation
- Machine learning without data
- Assessing the quality of a trained SVM
- Model selection
![Page 6: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/6.jpg)
Machine learning without data
Goal: minimizing IP(error), the probability of error on future data
![Page 7: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/7.jpg)
Roadmap: 1 Tuning hyperparameters
- Motivation
- Machine learning without data
- Assessing the quality of a trained SVM
- Model selection
![Page 8: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/8.jpg)
Training and test data
Split the dataset into two randomly picked groups (hold-out strategy):
- Training set: used to train the classifier
- Test set: used to estimate the error rate of the trained classifier
(X,y) total available data
(Xa,ya) training data (Xt,yt) test data
(Xa, ya, Xt, yt) ← split(X, y, option = 1/3)
- Generally, the larger the training set, the better the classifier
- The larger the test set, the more accurate the error estimate
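The hold-out split above can be sketched as follows (a minimal version; the function name `split` is the slides' pseudocode, and `test_fraction` is my reading of the `option` argument as the test-set proportion):

```python
import random

def split(X, y, test_fraction=1/3, seed=0):
    """Randomly split (X, y) into a training set (Xa, ya) and a
    test set (Xt, yt), following the hold-out strategy."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    Xa = [X[i] for i in train_idx]
    ya = [y[i] for i in train_idx]
    Xt = [X[i] for i in test_idx]
    yt = [y[i] for i in test_idx]
    return Xa, ya, Xt, yt

Xa, ya, Xt, yt = split(list(range(9)), [0, 1] * 4 + [0])
```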
![Page 9: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/9.jpg)
Assessing the quality of a trained SVM: minimum error rate
Definition (confusion matrix): a matrix showing the predicted and actual classifications. A confusion matrix is of size L × L, where L is the number of different classes.
| Observed \ predicted | Positive | Negative |
|---|---|---|
| positive | a | b |
| negative | c | d |

Error rate = 1 − Accuracy = (b + c)/(a + b + c + d) = (b + c)/n = 1 − (a + d)/n

True positive rate (recall, sensitivity): a/(a + b).
True negative rate (specificity): d/(c + d).
Also: precision, false positive rate, false negative rate, ...
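These quantities can be computed from the four counts of the table (the helper name `binary_metrics` is hypothetical; only a, b, c, d are taken from the slide):

```python
def binary_metrics(a, b, c, d):
    """Metrics from a 2x2 confusion matrix:
    a = observed positive, predicted positive (true positives)
    b = observed positive, predicted negative (false negatives)
    c = observed negative, predicted positive (false positives)
    d = observed negative, predicted negative (true negatives)
    """
    n = a + b + c + d
    return {
        "accuracy": (a + d) / n,
        "error_rate": (b + c) / n,
        "recall": a / (a + b),       # true positive rate, sensitivity
        "specificity": d / (c + d),  # true negative rate
        "precision": a / (a + c),
    }
```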
![Page 10: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/10.jpg)
Other performance measures
N. Japkowicz & M. Shah, "Evaluating Learning Algorithms: A Classification Perspective", Cambridge University Press, 2011
![Page 11: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/11.jpg)
The learning equation
Learning = training + testing + tuning
Table: my experimental error rates

| | State of the art | my new method | Bayes error |
|---|---|---|---|
| problem 1 | 10% ± 1.25 | 8.5% ± .5 | 11% |
| problem 2 | 5% (.25) | 4% (.5) | 2% |

Is my new method good for problem 1?
![Page 13: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/13.jpg)
Error bars on Bernoulli trials

Error rate: p̂ ∼ B(p)

With confidence α (normal approximation interval):

p = IP(error) ∈ p̂ ± u_{1−α/2} √( p̂(1 − p̂)/n_t )

With confidence α (improved approximation):

p = IP(error) ∈ 1/(1 + u²_{1−α/2}/n_t) · ( p̂ ± u_{1−α/2} √( p̂(1 − p̂)/n_t ) )

What if p̂ = 0? http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
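A sketch of both intervals, assuming 95% confidence (u_{1−α/2} ≈ 1.96); the shrinkage factor 1/(1 + u²/n_t) of the improved approximation is my reading of this slide, and the function name is hypothetical:

```python
import math

def binomial_intervals(p_hat, n_t, u=1.96):
    """Normal-approximation and corrected confidence intervals for an
    error rate p_hat estimated on n_t test examples."""
    half = u * math.sqrt(p_hat * (1 - p_hat) / n_t)
    normal = (p_hat - half, p_hat + half)
    shrink = 1 / (1 + u ** 2 / n_t)  # correction factor of the improved interval
    improved = (shrink * (p_hat - half), shrink * (p_hat + half))
    return normal, improved
```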
![Page 14: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/14.jpg)
To improve the estimate
- Random subsampling (the repeated holdout method)
- K-fold cross-validation (K = 10 or K = 2)
- Leave-one-out cross-validation (K = n, test folds of size 1)
- Bootstrap
![Page 15: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/15.jpg)
Error bars: the Gaussian approximation

... and to stabilize: iterate K times (say K = 10)

The repeated holdout method:
- The holdout estimate can be made more reliable by repeating the process with different subsamples
- In each iteration, use a different random splitting
- Average the error rates over the iterations

mean error rate ē = (1/K) Σ_{k=1}^{K} e_k,  variance σ̂² = (1/(K−1)) Σ_{k=1}^{K} (e_k − ē)²

confidence interval: ē ± t_{α/2, K−1} √(σ̂²/K), with t_{0.025, 9} = 2.262
![Page 16: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/16.jpg)
Cross validation
Definition (cross-validation): a method for estimating the accuracy of an inducer by dividing the data into K mutually exclusive subsets (the "folds") of approximately equal size.
Example of K = 3-fold cross-validation:

[Figure: the data split into three folds; each fold in turn serves as test data while the others form the training data.]
How many folds are needed (K = ?)
- large K: small bias, but large variance and large computational time
- small K: reduced computation time and small variance, but large bias
- A common choice for K-fold cross-validation is K = 5
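The procedure can be sketched as follows (a minimal version; `train` and `test` stand for the slides' pseudocode procedures, passed in here as hypothetical callables):

```python
def kfold_cv(X, y, K, train, test):
    """K-fold cross-validation: partition the indices into K mutually
    exclusive folds; each fold serves once as the test set while the
    remaining folds form the training set. Returns the mean error."""
    idx = list(range(len(X)))
    folds = [idx[k::K] for k in range(K)]  # approximately equal sizes
    errors = []
    for k in range(K):
        held_out = set(folds[k])
        Xa = [X[i] for i in idx if i not in held_out]
        ya = [y[i] for i in idx if i not in held_out]
        Xt = [X[i] for i in folds[k]]
        yt = [y[i] for i in folds[k]]
        model = train(Xa, ya)
        errors.append(test(model, Xt, yt))
    return sum(errors) / K
```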
![Page 17: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/17.jpg)
Leave one out cross validation
Theoretical guarantees
![Page 18: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/18.jpg)
The bootstrap
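The slide names the method without details; a minimal sketch of the standard out-of-bag variant (assumptions: `train`/`test` are hypothetical callables as in the earlier pseudocode, and the error is averaged over examples left out of each bootstrap sample):

```python
import random

def bootstrap_error(X, y, train, test, B=100, seed=0):
    """Bootstrap estimate of the error rate: for each of B rounds,
    train on a sample drawn with replacement and evaluate on the
    out-of-bag examples (those not drawn)."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(B):
        bag = [rng.randrange(n) for _ in range(n)]
        oob = [i for i in range(n) if i not in set(bag)]
        if not oob:  # rare: the sample covered every example
            continue
        model = train([X[i] for i in bag], [y[i] for i in bag])
        errors.append(test(model, [X[i] for i in oob], [y[i] for i in oob]))
    return sum(errors) / len(errors)
```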
![Page 19: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/19.jpg)
Comparing results
Two different issues:
- What is the best method for my problem?
- How good is my learning algorithm?
![Page 20: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/20.jpg)
Comparing two algorithms: McNemar's test

Build the confusion matrix of the two algorithms:

| Algo 1 \ Algo 2 | right | wrong |
|---|---|---|
| right | number of examples well classified by both | e01: number of examples well classified by 1 but not by 2 |
| wrong | e10: number of examples misclassified by 1 but not by 2 | number of examples misclassified by both |

H0: the two algorithms perform the same; we then expect e10 = e01 = (e10 + e01)/2.

(|e10 − e01| − 1)² / (e10 + e01) ∼ χ²₁
Beware: if e10 + e01 < 20, better to use the sign test.
Matlab function: http://www.mathworks.com/matlabcentral/fileexchange/189-discrim/content/discrim/mcnemar.m
J. L. Fleiss (1981) Statistical Methods for Rates and Proportions. Second Edition. Wiley.
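The statistic is straightforward to compute; the sketch below also derives the p-value for a χ² variable with one degree of freedom via the identity P(χ²₁ > x) = erfc(√(x/2)) (the function name is hypothetical):

```python
import math

def mcnemar(e10, e01):
    """McNemar's test with continuity correction. e10 (resp. e01) is
    the number of examples misclassified by algorithm 1 but not by 2
    (resp. by 2 but not by 1). Returns (statistic, p-value)."""
    stat = (abs(e10 - e01) - 1) ** 2 / (e10 + e01)
    # For X ~ chi2 with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```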
![Page 21: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/21.jpg)
Roadmap: 1 Tuning hyperparameters
- Motivation
- Machine learning without data
- Assessing the quality of a trained SVM
- Model selection
![Page 22: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/22.jpg)
Model selection strategy
Model selection criteria attempt to find a good compromise between:
- the complexity of the model
- its prediction accuracy on the training data
1. (Xa, ya, Xt, yt) ← split(X, y, options)
2. (C, b) ← tune(Xa, ya, options)
3. model ← train(Xa, ya, C, b, options)
4. error ← test(Xt, yt, C, b, options)
Occam’s Razor:the best theory is the smallest one that describes all the facts
![Page 23: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/23.jpg)
Model selection: the tuning function
function (C, b) ← tune(Xa, ya, options)
1. (Xℓ, yℓ, Xv, yv) ← split(Xa, ya, options)
2. loop over a grid for C
3. loop over a grid for b
   1. model ← train(Xℓ, yℓ, C, b, options)
   2. error ← test(Xv, yv, C, b, options)
return the (C, b) with the smallest validation error
The three sets:
- Training set: examples used for learning, i.e. to fit the parameters
- Validation set: examples used to tune the hyperparameters
- Test set: independent instances that have played no part in the formation of the classifier
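The tune function above can be sketched as a plain grid search (the `split`, `train`, and `test` procedures from the slides' pseudocode are passed in as hypothetical callables):

```python
def tune(Xa, ya, C_grid, b_grid, split, train, test):
    """Grid search for the hyperparameters (C, b): train on the
    learning part of a validation split, keep the pair with the
    smallest validation error."""
    Xl, yl, Xv, yv = split(Xa, ya)
    best = None
    for C in C_grid:
        for b in b_grid:
            model = train(Xl, yl, C, b)
            error = test(model, Xv, yv)
            if best is None or error < best[0]:
                best = (error, C, b)
    return best[1], best[2]
```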
![Page 24: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/24.jpg)
How to design the grids

A grid on b:
A much simpler trick is to pick, say, 1000 pairs (x, x′) at random from your dataset, compute the distances of all such pairs, and take the median, the 0.1 quantile, and the 0.9 quantile. Now pick b to be the inverse of any of these three numbers.
http://blog.smola.org/post/940859888/easy-kernel-width-choice

A grid on C:
from Cmin to ∞: too much!
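The bandwidth trick can be sketched as follows (assuming one-dimensional data for brevity; the function name and the empirical-quantile indexing are my own):

```python
import random

def bandwidth_candidates(X, n_pairs=1000, seed=0):
    """Candidate kernel bandwidths b via the median heuristic:
    sample random pairs, take the 0.1, 0.5 and 0.9 quantiles of
    their distances, and return the inverses."""
    rng = random.Random(seed)
    dists = sorted(
        abs(X[rng.randrange(len(X))] - X[rng.randrange(len(X))])
        for _ in range(n_pairs)
    )
    dists = [d for d in dists if d > 0]  # drop identical pairs
    quantiles = [dists[int(q * (len(dists) - 1))] for q in (0.1, 0.5, 0.9)]
    return [1 / d for d in quantiles]
```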
![Page 25: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/25.jpg)
The coarse to fine strategy
1. Use a large coarse grid on a few data points to localize interesting values
2. Fine-tune on all the data in this zone

1. (Xa, ya, Xt, yt) ← split(X, y)
2. (C, b) ← tune(Xa, ya, coarse grids, small training set)
3. fine grids ← fit_grid(C, b)
4. (C, b) ← tune(Xa, ya, fine grids, large training set)
5. model ← train(Xa, ya, C, b, options)
6. error ← test(Xt, yt, C, b, options)
The computing time is the key issue
![Page 26: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/26.jpg)
Evaluation measures
the span bound
![Page 27: Lecture7 cross validation](https://reader033.vdocuments.mx/reader033/viewer/2022052619/55506117b4c905ae3f8b53ec/html5/thumbnails/27.jpg)
Bibliography
http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf
http://www.cs.odu.edu/~mukka/cs795sum13dm/Lecturenotes/Day3/Chapter5.pdf
http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture8.pdf
http://www.mohakshah.com/tutorials/icml2012/Tutorial-ICML2012/Tutorial_at_ICML_2012_files/ICML2012-Tutorial.pdf
Stéphane Canu (INSA Rouen - LITIS), April 4, 2014