CpSc 810: Machine Learning Evaluation of Classifier

Page 1: CpSc 810: Machine Learning Evaluation of Classifier

CpSc 810: Machine Learning

Evaluation of Classifier

Page 2: CpSc 810: Machine Learning Evaluation of Classifier

2

Copyright Notice

Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!

Page 3: CpSc 810: Machine Learning Evaluation of Classifier

3

Classifier Accuracy Measures

                     Predicted C1             Predicted C2
Actual C1            True positive (TP)       False negative (FN)
Actual C2            False positive (FP)      True negative (TN)

Example (buy_computer):

classes              buy_computer = yes   buy_computer = no   total
buy_computer = yes   6954                 46                  7000
buy_computer = no    412                  2588                3000
total                7366                 2634                10000

Page 4: CpSc 810: Machine Learning Evaluation of Classifier

4

Classifier Accuracy Measures

The sensitivity: the percentage of correctly predicted positive data over the total number of positive data

The specificity: the percentage of correctly identified negative data over the total number of negative data.

The accuracy: the percentage of correctly predicted positive and negative data over the sum of positive and negative data

Sensitivity = TP / (TP + FN)    (TP + FN = total Positive)

Specificity = TN / (TN + FP)    (TN + FP = total Negative)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
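To make the formulas concrete, here is a minimal Python sketch (not from the slides) that plugs in the counts from the buy_computer table on the previous page:

```python
# Confusion-matrix counts from the buy_computer example (positive class: buy_computer = yes).
TP, FN = 6954, 46    # actual yes: predicted yes / predicted no
FP, TN = 412, 2588   # actual no:  predicted yes / predicted no

sensitivity = TP / (TP + FN)                # recall on the positive class
specificity = TN / (TN + FP)                # recall on the negative class
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"sensitivity = {sensitivity:.4f}")   # 0.9934
print(f"specificity = {specificity:.4f}")   # 0.8627
print(f"accuracy    = {accuracy:.4f}")      # 0.9542
```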

Page 5: CpSc 810: Machine Learning Evaluation of Classifier

5

Classifier Accuracy Measures

The precision: the percentage of correctly predicted positive data over the total number of predicted positive data.

The F-measure is also called the F-score. As the harmonic mean of precision and recall, it takes both the precision and the recall of the test into account to compute the score.

Precision = TP / (TP + FP)

F = 2 × (Precision × Recall) / (Precision + Recall)
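Continuing with the same buy_computer counts, a short sketch of precision, recall, and the F-measure:

```python
TP, FN, FP = 6954, 46, 412   # counts from the buy_computer example

precision = TP / (TP + FP)
recall = TP / (TP + FN)                              # same as sensitivity
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.4f}")  # 0.9441
print(f"recall    = {recall:.4f}")     # 0.9934
print(f"F-measure = {f_measure:.4f}")  # 0.9681
```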

Page 6: CpSc 810: Machine Learning Evaluation of Classifier

6

ROC curves

ROC = Receiver Operating Characteristic

Started in electronic signal detection theory (1940s - 1950s)

Has become a very popular method for assessing classifiers in machine learning applications

Page 7: CpSc 810: Machine Learning Evaluation of Classifier

7

ROC curves: simplest case

Consider diagnostic test for a disease

Test has 2 possible outcomes:

‘positive’ = suggesting presence of disease

‘negative’ = suggesting absence of disease

An individual can test either positive or negative for the disease

Page 8: CpSc 810: Machine Learning Evaluation of Classifier

8

Hypothesis testing refresher

2 ‘competing theories’ regarding a population parameter:

NULL hypothesis H0; ALTERNATIVE hypothesis HA

H0: NO DIFFERENCE; any observed deviation from what we expect to see is due to chance variability

HA: THE DIFFERENCE IS REAL

Page 9: CpSc 810: Machine Learning Evaluation of Classifier

9

Test Statistic

Measure how far the observed data are from what is expected assuming the NULL hypothesis H0, by computing the value of a test statistic (TS) from the data

The particular TS computed depends on the parameter

For example, to test the population mean μ, the TS is the sample mean (or standardized sample mean)

The NULL is rejected if the TS falls in a user-specified ‘rejection region’
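As a concrete illustration of the refresher (not from the slides), a tiny sketch of a standardized-sample-mean test statistic with a made-up sample and an assumed two-sided 5% rejection region:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.3, scale=2.0, size=40)   # observed data (made-up sample)

mu0 = 10.0                                     # population mean under the NULL hypothesis H0
# Standardized sample mean: the test statistic (TS) for a test about the mean.
z = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))

cutoff = 1.96                                  # two-sided rejection region at the 5% level
print(f"TS (z) = {z:.2f}; reject H0: {abs(z) > cutoff}")
```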

Page 10: CpSc 810: Machine Learning Evaluation of Classifier

10

True disease state vs. Test result

Disease \ Test         not rejected                rejected
No disease (D = 0)     specificity                 Type I error (False +)
Disease (D = 1)        Type II error (False -)     Power = 1 - β; sensitivity

Page 11: CpSc 810: Machine Learning Evaluation of Classifier

11

Specific Example

Test Result

Patients with the disease

Patients without the disease

Page 12: CpSc 810: Machine Learning Evaluation of Classifier

12

Test Result

Call these patients “negative”

Call these patients “positive”

Threshold

Page 13: CpSc 810: Machine Learning Evaluation of Classifier

13

Test Result

Call these patients “negative”

Call these patients “positive”

without the disease

with the disease

True Positives

Page 14: CpSc 810: Machine Learning Evaluation of Classifier

14

Test Result

Call these patients “negative”

Call these patients “positive”

False Positives

Page 15: CpSc 810: Machine Learning Evaluation of Classifier

15

Test Result

Call these patients “negative”

Call these patients “positive”

True negatives

Page 16: CpSc 810: Machine Learning Evaluation of Classifier

16

Test Result

Call these patients “negative”

Call these patients “positive”

False negatives

Page 17: CpSc 810: Machine Learning Evaluation of Classifier

17

Test Result

‘-’

‘+’

Moving the Threshold: right

Page 18: CpSc 810: Machine Learning Evaluation of Classifier

18

Test Result

‘-’

‘+’

Moving the Threshold: left

Page 19: CpSc 810: Machine Learning Evaluation of Classifier

19

ROC curve

[Figure: ROC curve. y-axis: True Positive Rate (sensitivity), 0% to 100%; x-axis: False Positive Rate (1 - specificity), 0% to 100%.]
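The curve is traced out exactly as in the preceding threshold pictures: sweep the threshold across the test scores and record (False Positive Rate, True Positive Rate) at each setting. A minimal numpy sketch with made-up scores and labels (scikit-learn's roc_curve computes essentially the same sweep):

```python
import numpy as np

# Made-up test results: higher score = more indicative of disease; label 1 = diseased.
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0,   0,   0,   1,    0,   1,    0,   1,   1,   1  ])

def roc_points(scores, labels):
    """Sweep the threshold over the observed scores and collect (FPR, TPR) pairs."""
    points = []
    for thr in np.append(np.unique(scores), np.inf):    # np.inf gives the (0%, 0%) corner
        pred = scores >= thr                             # call 'positive' at or above the threshold
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (1 - specificity, sensitivity)
    return sorted(points)

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR = {fpr:.2f}  TPR = {tpr:.2f}")
```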

Page 20: CpSc 810: Machine Learning Evaluation of Classifier

20

ROC curve comparison

[Figure: two ROC curves, a good test and a poor test. Axes: True Positive Rate (0% to 100%) vs. False Positive Rate (0% to 100%).]

Page 21: CpSc 810: Machine Learning Evaluation of Classifier

21

ROC curve extremes

Best test: the distributions don't overlap at all. Worst test: the distributions overlap completely.

[Figure: two ROC curves, the best test and the worst test. Axes: True Positive Rate (0% to 100%) vs. False Positive Rate (0% to 100%).]

Page 22: CpSc 810: Machine Learning Evaluation of Classifier

22

Area under ROC curve (AUC)

Overall measure of test performance

Comparisons between two tests can be based on differences between their (estimated) AUCs

For continuous data, the AUC is equivalent to the Mann-Whitney U-statistic (a nonparametric test of difference in location between two populations)

Page 23: CpSc 810: Machine Learning Evaluation of Classifier

23

AUC for ROC curves

[Figure: four ROC curves with AUC = 100%, AUC = 90%, AUC = 65%, and AUC = 50%. Axes: True Positive Rate (0% to 100%) vs. False Positive Rate (0% to 100%).]

Page 24: CpSc 810: Machine Learning Evaluation of Classifier

24

Interpretation of AUC

AUC can be interpreted as the probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen nondiseased individual: P(Xi > Xj | Di = 1, Dj = 0)

So can think of this as a nonparametric distance between disease/nondisease test results
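This interpretation gives a direct way to estimate AUC: compare every diseased/nondiseased pair of test results and count how often the diseased score is higher, with ties counted as 1/2 (matching the Mann-Whitney convention). A minimal sketch with made-up scores:

```python
import numpy as np

def auc_probability(scores, labels):
    """Estimate AUC = P(Xi > Xj | Di = 1, Dj = 0) over all diseased/nondiseased pairs."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = np.sum(pos[:, None] > neg[None, :])   # pairs where the diseased score is higher
    ties = np.sum(pos[:, None] == neg[None, :])     # tied pairs count 1/2 (Mann-Whitney convention)
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0,   0,   0,   1,    0,   1,    0,   1,   1,   1  ])
print(f"AUC = {auc_probability(scores, labels):.2f}")   # 0.88 for this toy data
```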

Page 25: CpSc 810: Machine Learning Evaluation of Classifier

25

Predictor Error Measures

Measure predictor accuracy: measure how far off the predicted value is from the actual known value

Loss function: measures the error between the actual value yi and the predicted value yi'

Absolute error: |yi - yi'|

Squared error: (yi - yi')^2

Page 26: CpSc 810: Machine Learning Evaluation of Classifier

26

Predictor Error Measures

Test error (generalization error): the average loss over the test set

Mean absolute error: (1/d) Σ_{i=1..d} |yi - yi'|

Mean squared error: (1/d) Σ_{i=1..d} (yi - yi')^2

Relative absolute error: Σ_{i=1..d} |yi - yi'| / Σ_{i=1..d} |yi - ȳ|

Relative squared error: Σ_{i=1..d} (yi - yi')^2 / Σ_{i=1..d} (yi - ȳ)^2

(Here ȳ is the mean of the actual values y1, ..., yd over the test set of size d.)

The mean squared error exaggerates the presence of outliers

The (square) root mean squared error is popularly used instead; similarly, the root relative squared error
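A short numpy sketch (made-up actual and predicted values) computing these measures:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])   # actual values yi (made-up)
y_pred = np.array([2.5,  0.0, 2.1, 7.8, 3.6])   # predicted values yi'

abs_err = np.abs(y_true - y_pred)
sq_err = (y_true - y_pred) ** 2
y_bar = y_true.mean()

mae = abs_err.mean()                                   # mean absolute error
mse = sq_err.mean()                                    # mean squared error
rmse = np.sqrt(mse)                                    # root mean squared error
rae = abs_err.sum() / np.abs(y_true - y_bar).sum()     # relative absolute error
rse = sq_err.sum() / ((y_true - y_bar) ** 2).sum()     # relative squared error

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
print(f"relative absolute error = {rae:.3f}, relative squared error = {rse:.3f}")
```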

Page 27: CpSc 810: Machine Learning Evaluation of Classifier

27

Evaluating the Accuracy of a Classifier or Predictor (I)

Holdout method: the given data is randomly partitioned into two independent sets

Training set (e.g., 2/3) for model construction

Test set (e.g., 1/3) for accuracy estimation

Random sampling: a variation of holdout

Repeat holdout k times, accuracy = avg. of the accuracies obtained

Cross-validation (k-fold, where k = 10 is most popular)

Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size

At the i-th iteration, use Di as the test set and the others as the training set

Leave-one-out: k folds where k = # of examples, for small-sized data

Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
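A minimal k-fold cross-validation sketch; scikit-learn's StratifiedKFold and a decision tree on the iris data are used here purely as stand-ins for whatever classifier and data set are being evaluated:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                         # stand-in data set
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

accuracies = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                  # train on the other k-1 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold Di

print(f"10-fold stratified CV accuracy: {np.mean(accuracies):.3f}")
```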

Page 28: CpSc 810: Machine Learning Evaluation of Classifier

28

Evaluating the Accuracy of a Classifier or Predictor (II)

Bootstrap: works well with small data sets

Samples the given training examples uniformly with replacement, i.e., each time an example is selected, it is equally likely to be selected again and re-added to the training set

There are several bootstrap methods; a common one is the .632 bootstrap

Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data examples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^(-1) ≈ 0.368)

Repeat the sampling procedure k times, overall accuracy of the model:

Acc(M) = Σ_{i=1..k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )
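A minimal sketch of the .632 bootstrap, again with a decision tree on the iris data as stand-ins; the k per-round estimates are averaged to report a single accuracy:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)           # stand-in data set of d examples
rng = np.random.default_rng(0)
d, k = len(y), 100                          # k bootstrap rounds

estimates = []
for _ in range(k):
    train_idx = rng.integers(0, d, size=d)  # sample d times with replacement (~63.2% unique)
    in_test = np.ones(d, dtype=bool)
    in_test[train_idx] = False              # examples never drawn form the test set (~36.8%)
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    acc_test = model.score(X[in_test], y[in_test])
    acc_train = model.score(X[train_idx], y[train_idx])
    estimates.append(0.632 * acc_test + 0.368 * acc_train)

print(f".632 bootstrap accuracy estimate: {np.mean(estimates):.3f}")
```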

Page 29: CpSc 810: Machine Learning Evaluation of Classifier

29

More About Bootstrap

The bootstrap method attempts to determine the probability distribution from the data itself, without recourse to the CLT (Central Limit Theorem).

The bootstrap method is not a way of reducing the error! It only tries to estimate it.

Basic idea of the bootstrap:

Originally, from some list of data, one computes an object.

Create an artificial list by randomly drawing elements from that list. Some elements will be picked more than once. Compute a new object.

Repeat 100-1000 times and look at the distribution of these objects.
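A minimal sketch of that recipe for one concrete "object", the sample mean: draw an artificial list with replacement, recompute the mean, repeat, and look at the spread of the recomputed values (made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=50)   # the original list of data (made-up)

boot_means = []
for _ in range(1000):                             # repeat 100-1000 times
    artificial = rng.choice(data, size=len(data), replace=True)  # duplicates allowed
    boot_means.append(artificial.mean())          # recompute the object on the artificial list

boot_means = np.array(boot_means)
print(f"sample mean               : {data.mean():.3f}")
print(f"spread of bootstrap means : {boot_means.std():.3f}")   # estimated standard error
```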

Page 30: CpSc 810: Machine Learning Evaluation of Classifier

30

More About Bootstrap

How many bootstraps? There is no clear answer to this. There are lots of theorems on asymptotic convergence, but no real estimates. Rule of thumb: try it 100 times, then 1000 times, and see whether your answers have changed by much. In any case there are N^N possible subsamples.

Is it reliable? A very good question! The jury is still out on how far it can be applied, but for now nobody is going to shoot you down for using it. There is good agreement for Normal (Gaussian) distributions; skewed distributions tend to be more problematic, particularly in the tails (the bootstrap underestimates the errors).

Page 31: CpSc 810: Machine Learning Evaluation of Classifier

31

Sampling

Sampling is the main technique employed for data selection.

It is often used for both the preliminary investigation of the data and the final data analysis.

Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Page 32: CpSc 810: Machine Learning Evaluation of Classifier

32

Sampling …

The key principle for effective sampling is the following:

Using a sample will work almost as well as using the entire data set, if the sample is representative

A sample is representative if it has approximately the same property (of interest) as the original set of data

Page 33: CpSc 810: Machine Learning Evaluation of Classifier

33

Sample Size

[Figure: the same data set drawn at sample sizes of 8000 points, 2000 points, and 500 points.]

Page 34: CpSc 810: Machine Learning Evaluation of Classifier

34

Types of Sampling

Simple random sampling: there is an equal probability of selecting any particular item

Sampling without replacement

As each item is selected, it is removed from the population

Sampling with replacement

Objects are not removed from the population as they are selected for the sample.

In sampling with replacement, the same object can be picked up more than once

Stratified sampling: split the data into several partitions, then draw random samples from each partition
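A short numpy sketch of the three schemes on a hypothetical population; for the stratified case, made-up class labels play the role of the partitions:

```python
import numpy as np

rng = np.random.default_rng(0)
items = np.arange(100)                          # the population: items 0..99
labels = np.repeat([0, 1, 2], [60, 30, 10])     # made-up class labels used as strata

# Simple random sampling without replacement: each item can appear at most once.
without = rng.choice(items, size=20, replace=False)

# Simple random sampling with replacement: the same item can be picked more than once.
with_repl = rng.choice(items, size=20, replace=True)

# Stratified sampling: draw roughly 20% from each partition.
strat = np.concatenate([
    rng.choice(items[labels == c], size=max(1, int(0.2 * np.sum(labels == c))), replace=False)
    for c in np.unique(labels)
])

print("without replacement:", np.sort(without))
print("with replacement   :", np.sort(with_repl))
print("stratified (20%)   :", np.sort(strat))
```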