
Page 1

Error estimation

Data Mining II, Year 2009-10
Lluís Belanche, Alfredo Vellido

Page 2

Error estimation

• Introduction
• Resampling methods:
   • The Holdout
   • Cross-validation
      • Random subsampling
      • k-fold cross-validation
      • Leave-one-out
   • The Bootstrap
• Error evaluation
   • Accuracy and all that

Page 12

Bias and variance estimates with the bootstrap

Page 13

Example: estimating bias & variance
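The bootstrap estimates the bias and variance of a statistic by recomputing it on resamples drawn with replacement from the original sample. A minimal numpy sketch; the sample, the statistic (the plug-in variance estimator) and the number of resamples are illustrative choices rather than the slides' own example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample: n = 30 points from a Normal(5, 2) population.
sample = rng.normal(loc=5.0, scale=2.0, size=30)

def statistic(x):
    # Statistic of interest: the plug-in (maximum-likelihood) variance estimator.
    return np.var(x)

B = 2000                        # number of bootstrap resamples
theta_hat = statistic(sample)   # estimate on the original sample

# Recompute the statistic on B resamples drawn with replacement from the sample.
boot_stats = np.array([
    statistic(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(B)
])

bias_boot = boot_stats.mean() - theta_hat   # bootstrap estimate of the bias
var_boot = boot_stats.var()                 # bootstrap estimate of the variance

print(f"estimate = {theta_hat:.3f}, bias ~ {bias_boot:.3f}, variance ~ {var_boot:.3f}")
```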

Page 14

Three-way data splits (1)

Page 15

Three-way data splits (2)
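Three-way data splits typically partition the sample into a training set (to fit each model), a validation set (to select among models or tune parameters) and a test set (used once, to estimate the error of the final choice). A minimal sketch with scikit-learn; the iris data and the 60/20/20 proportions are illustrative assumptions, not taken from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the test set (20%), then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

# Fit on train, select/tune on validation, report the error once on test.
print(len(X_train), len(X_val), len(X_test))   # 90 30 30
```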

Page 16

Summary (data sample of size n)

• Resubstitution: optimistically-biased estimate, especially when the ratio of n to the dimension is small
• Holdout (if iterated, we get random subsampling): pessimistically-biased estimate; different partitions yield different estimates
• K-fold CV (K « n): higher bias than LOOCV but lower than holdout; lower variance than LOOCV
• LOOCV (n-fold CV): unbiased, but large variance
• Bootstrap: lower variance than LOOCV; useful for very small n

Computational burden
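To make the summary concrete, here is a minimal scikit-learn sketch of the four estimators; the dataset, the classifier and the number of bootstrap replicates are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Holdout: a single train/test split (iterating it with different seeds gives random subsampling).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_err = 1 - clf.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation.
kfold_err = 1 - cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean()

# Leave-one-out (n-fold CV): n fits, hence the computational burden.
loo_err = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# Bootstrap: train on a resample drawn with replacement, test on the out-of-bag points.
rng = np.random.default_rng(0)
boot_errs = []
for _ in range(50):
    idx = rng.integers(0, len(y), size=len(y))
    oob = np.setdiff1d(np.arange(len(y)), idx)
    boot_errs.append(1 - clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))

print(holdout_err, kfold_err, loo_err, np.mean(boot_errs))
```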

Page 17

Error Evaluation

Given:
• A hypothesis h(x): X → C, in hypothesis space H, mapping features x to one of a number of classes
• A data sample S of size n

Questions:
• What is the error of h on unseen data?
• If we have two competing hypotheses, which one will be better on unseen data?
• How do we compare two learning algorithms in the face of limited data?
• How certain are we about the answers to these questions?

Page 18

Apparent & True Error

We can define two errors:

1) error(h|S) is the apparent error, measured on the sample S:

$$\mathrm{error}(h \mid S) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left[\,h(x_i) \neq y_i\,\right]$$

2) error(h|P) is the true error on data sampled from the distribution P(x):

$$\mathrm{error}(h \mid P) \;=\; \int dx\; P(x)\; \mathbb{1}\!\left[\,h(x) \neq f(x)\,\right]$$

where f(x) is the true hypothesis.
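The apparent error is just the misclassification rate on S. A minimal numpy sketch; the predictions and labels are illustrative values:

```python
import numpy as np

def apparent_error(predictions, y):
    """error(h|S): the fraction of sample points where h(x_i) differs from y_i."""
    predictions = np.asarray(predictions)
    y = np.asarray(y)
    return np.mean(predictions != y)

# Illustrative values: one mistake out of five predictions.
print(apparent_error([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))   # 0.2
```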

Page 19

A note on True Error

The true error need not be zero, not even if we knew the probabilities P(x)!

Causes:
• Lack of relevant features
• Intrinsic randomness of the process

A consequence of this is that we should not attempt to fit hypotheses with zero apparent error, i.e., error(h|S) = 0!

Quite the contrary, we should favor hypotheses such that error(h|S) ≈ error(h|P):
• If error(h|S) >> error(h|P), then h is underfitting the sample S
• If error(h|S) << error(h|P), then h is overfitting the sample S

Page 20

How to estimate True Error (te)?

Estimate te on a test set TE ⊂ S (the rest of S is used for training). Note that the estimate is a random variable, so we can build a confidence interval (CI) for it.

Let TE⁻ be the subset of TE wrongly predicted by h, and let n = |S| and t = |TE|. Then |TE⁻| follows a binomial distribution with t trials and success probability te.

The maximum-likelihood estimate of te is

$$\widehat{te} \;=\; \frac{|TE^-|}{t}$$

This estimator is unbiased, $E[\widehat{te}] = te$, with $\mathrm{Var}[\widehat{te}] = te(1-te)/t$.

Page 21

Confidence Intervals for te

With N% confidence, te = error(h|P) is contained in the interval:

$$\widehat{te} - s \;\le\; te \;\le\; \widehat{te} + s, \qquad s = z_N\,\sqrt{\widehat{te}\,(1-\widehat{te})/t}$$

In words, te is within z_N standard errors of the estimate. This is because, for $\widehat{te}(1-\widehat{te})\,t > 5$ or t > 30, it is safe to approximate a binomial by a Gaussian, for which we can compute "z-values". For example, at the N = 80% level, z_0.80 = 1.28 for the standard Normal(0,1).

Page 22

Example 1

n = |S| = 1,000; t = |TE| = 250 (25% of S). Suppose |TE⁻| = 50 (our h hits 80% of TE). Then the estimated error is 0.2.

For a CI at the 95% level: z_0.95 = 1.96, and te is in [0.15, 0.25].

Exercise: recompute the CI at the 99% level, using z_0.99 = 2.576.
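The same computation in code, reproducing Example 1 under the Gaussian approximation of the previous slide; scipy's norm.ppf supplies the two-sided z-value:

```python
import numpy as np
from scipy.stats import norm

def error_ci(n_errors, t, confidence=0.95):
    """Normal-approximation CI for the true error, given t test cases and n_errors mistakes."""
    te_hat = n_errors / t
    z = norm.ppf(0.5 + confidence / 2)            # two-sided z-value (1.96 for 95%)
    s = z * np.sqrt(te_hat * (1 - te_hat) / t)    # z standard errors
    return te_hat, te_hat - s, te_hat + s

# Example 1: t = |TE| = 250 and |TE-| = 50 errors.
print(error_ci(50, 250, 0.95))   # (0.2, ~0.15, ~0.25)
print(error_ci(50, 250, 0.99))   # the 99% exercise
```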

Page 23

Example 2: comparing two hypotheses

Assume we need to compare 2 hypotheses h1 and h2 on the same data

We have t = |TE| = 100, on which h1 makes 10 errors and h2 makes 13

The CIs at the 95% (α=0.05) level are: [0.04, 0.16] for h1

[0.06, 0.20] for h2

We cannot conclude that h1 is better than h2

Note: the above can be written 10% ± 6% (h1) and 13% ± 7% (h2).

Page 24

Size does matter after all …

How large would TE need to be (say T) to affirm that h1 is better than h2?

Assume both h1 and h2 keep the same error rates.

Force the upper limit (UL) of the CI for h1 to fall below the lower limit (LL) of the CI for h2:
• UL of the CI for h1 is 0.10 + 1.96 √(0.10 · 0.90 / T)
• LL of the CI for h2 is 0.13 − 1.96 √(0.13 · 0.87 / T)

It turns out that T must be at least 1,729 (the old size was 100!)

The probability that this fails is at most (1-α)/2
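A quick numerical check of this calculation, searching for the smallest T at which the two 95% intervals no longer overlap; the error rates 0.10 and 0.13 are those assumed on the slide:

```python
import numpy as np

z = 1.96                  # two-sided 95% z-value
e1, e2 = 0.10, 0.13       # error rates of h1 and h2, assumed to stay the same

# Smallest T such that the UL of h1's CI falls below the LL of h2's CI:
#   e1 + z*sqrt(e1*(1-e1)/T)  <  e2 - z*sqrt(e2*(1-e2)/T)
T = 1
while e1 + z * np.sqrt(e1 * (1 - e1) / T) >= e2 - z * np.sqrt(e2 * (1 - e2) / T):
    T += 1
print(T)   # 1729
```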

Page 25

Paired t-test

• Chunk the data set S up into subsets s1, ..., sk with |si| > 30
• Design classifiers h1 and h2 on every S \ si
• On each subset si compute the two errors and define:

$$\delta_i \;=\; \mathrm{error}(h_1 \mid s_i) - \mathrm{error}(h_2 \mid s_i)$$

• Now compute:

$$\bar{\delta} \;=\; \frac{1}{k}\sum_{i=1}^{k}\delta_i, \qquad s \;=\; \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}\left(\delta_i - \bar{\delta}\right)^2}$$

• With N% confidence, the difference in error between h1 and h2 is:

$$\bar{\delta} \;\pm\; t_{N,\,k-1}\, s$$

• t_{N,k-1} is the critical value of the Student-t distribution with k − 1 degrees of freedom at the N% level
• Since error(h1 | si) and error(h2 | si) are both approximately Normal, their difference is approximately Normal
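A minimal sketch of the paired test; the per-fold errors below are hypothetical numbers, and scipy.stats supplies the Student-t quantile as well as an equivalent built-in paired test:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold errors of h1 and h2 on the same k subsets (not from the slides).
err_h1 = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.13, 0.14])
err_h2 = np.array([0.16, 0.17, 0.15, 0.18, 0.16, 0.19, 0.15, 0.18, 0.17, 0.18])

delta = err_h1 - err_h2                  # per-fold differences delta_i
k = len(delta)
d_bar = delta.mean()                     # mean difference
s = delta.std(ddof=1) / np.sqrt(k)       # sqrt( sum (delta_i - d_bar)^2 / (k(k-1)) )

t_crit = stats.t.ppf(0.975, df=k - 1)    # Student-t quantile for a 95% two-sided interval
print(f"difference: {d_bar:.3f} +/- {t_crit * s:.3f}")

# Equivalent built-in paired t-test (null hypothesis: no difference in error).
t_stat, p_value = stats.ttest_rel(err_h1, err_h2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```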

Page 26

Exercise: the real case …

A team of doctors has its own classifier and sample data of size 500. They split it into a TR of size 300 and a TE of size 200. They get an error of 22% on TE. They ask us for further advice …

We design a second classifier. It has an error of 15% on the same TE.

Page 27

Answer the following questions:

1. Will you affirm that yours is better than theirs?
2. How large would TE need to be to (very reasonably) affirm that yours is better than theirs?
3. What do you deduce from the above?
4. Suppose we move to 10-fold CV on the entire data set:
   1. Give a new estimate of the error of your classifier
   2. Perform a statistical test to check whether there is any real difference

The doctors' classifier errors: 0.22, 0.22, 0.29, 0.19, 0.23, 0.22, 0.20, 0.25, 0.19, 0.19
Your classifier's errors: 0.15, 0.17, 0.21, 0.14, 0.13, 0.15, 0.14, 0.19, 0.11, 0.11


Page 28

What is Accuracy?

Accuracy = (No. of correct predictions) / (No. of predictions) = (TP + TN) / (TP + TN + FP + FN)

Page 29

Example

classifier   TP   TN   FP   FN   Accuracy
A            25   25   25   25   50%
B            50   25   25    0   75%
C            25   50    0   25   75%
D            37   37   13   13   74%

Clearly, B, C and D are all better than A. Is B better than C and D? Is C better than B and D? Is D better than B and C?

Accuracy may not tell the whole story.

Page 30

What is Sensitivity (aka Recall)?

Sensitivity = (No. of correct positive predictions) / (No. of positives) = TP / (TP + FN)   (wrt positives)

Sometimes the sensitivity wrt negatives is termed specificity.

Page 31

What is Precision?

Precision = (No. of correct positive predictions) / (No. of positive predictions) = TP / (TP + FP)   (wrt positives)

Page 32

Precision-Recall Trade-off

A predicts better than B if A has better recall and precision than B

There is a trade-off between recall and precision

In some applications, once you reach a satisfactory precision, you optimize for recall

In some applications, once you reach a satisfactory recall, you optimize for precision

(Plot: the trade-off curve of recall against precision)

Page 33

Comparing prediction performance

Accuracy is the obvious measure. But it conveys the right intuition only when the positive and negative populations are roughly equal in size.

Recall and precision together form a better measure. But what do you do when A has better recall than B, and B has better precision than A?

Page 34

F-measure

The harmonic mean of recall and precision

F = (2 · recall · precision) / (recall + precision)   (wrt positives)

classifier   TP   TN   FP   FN   Accuracy   F-measure
A            25   75   75   25   50%        33%
B             0  150    0   50   75%        undefined
C            50    0  150    0   25%        40%
D            30  100   50   20   65%        46%

Does not accord with intuition
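The table can be reproduced directly from the confusion-matrix counts. A minimal sketch; note that classifier B makes no positive predictions, so its precision, and hence its F-measure, is undefined:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, recall (sensitivity), precision and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None   # undefined when there are no positive predictions
    if recall is None or precision is None or (recall + precision) == 0:
        f_measure = None
    else:
        f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, recall, precision, f_measure

# The four classifiers from the table above: (TP, TN, FP, FN).
table = {"A": (25, 75, 75, 25), "B": (0, 150, 0, 50), "C": (50, 0, 150, 0), "D": (30, 100, 50, 20)}
for name, counts in table.items():
    print(name, metrics(*counts))
# A: accuracy 50%, F 33%;  B: accuracy 75%, F undefined;  C: accuracy 25%, F 40%;  D: accuracy 65%, F 46%
```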

Page 35

Abstract model of a classifier

Given a test observation x, compute the prediction h(x):
• Predict x as negative if h(x) < t
• Predict x as positive if h(x) > t

t is the decision threshold of the classifier. Changing t affects the recall and precision, and hence the accuracy, of the classifier.

Page 36

ROC Curves

By changing t, we get a range of sensitivities and specificities of a classifier.

This leads to the ROC curve, which plots sensitivity vs. (1 − specificity).

A predicts better than B if A has a better sensitivity than B at most values of specificity.

Then the larger the area under the ROC curve, the better

(Plot: ROC curves; y-axis: sensitivity = P(TP), x-axis: 1 − specificity = P(FP))
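A minimal sketch of how sweeping the threshold t traces out the curve; the dataset and classifier are illustrative choices, and scikit-learn's roc_curve performs the threshold sweep:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# h(x): the predicted probability of the positive class; the threshold t is applied afterwards.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# roc_curve sweeps t and returns 1 - specificity = P(FP) and sensitivity = P(TP) at each value.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC =", roc_auc_score(y_te, scores))    # the larger the area under the ROC curve, the better

# Manual check at one particular threshold t:
t = 0.5
pred = (scores > t).astype(int)
sensitivity = ((pred == 1) & (y_te == 1)).sum() / (y_te == 1).sum()
specificity = ((pred == 0) & (y_te == 0)).sum() / (y_te == 0).sum()
print(f"t = {t}: sensitivity = {sensitivity:.2f}, 1 - specificity = {1 - specificity:.2f}")
```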