
Page 1:

Case Selection and Resampling

Lucila Ohno-Machado
HST951

Page 2:

Topics

• Case selection (influence detection)
• Regression diagnostics
• Sampling procedures
– Bootstrap
– Jackknife
– Cross-validation

Page 3:

Unusual Data

• Outlier (discrepancy): an unusual observation that may change the parameters

• Leverage: an observation far from the mean (or centroid) of the other observations, i.e., an unusual combination of independent-variable values X, that may change the parameters

• Influence = discrepancy × leverage

Page 4:

Detecting Outliers: Residuals

• Residuals are a measure of error

• Studentized residuals can be calculated by removing one observation at a time (a sketch follows)

• Note: high-leverage observations may have small residuals
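A minimal numpy sketch of the leave-one-out (externally studentized) residual described above; the arrays X (n × p, with an intercept column) and y are hypothetical, and the formula is the standard textbook one rather than anything specific to this lecture:

```python
# Externally studentized residuals: refit the model with each observation
# deleted in turn, then scale that observation's prediction error.
import numpy as np

def studentized_residuals(X, y):
    n, p = X.shape
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                     # drop observation i
        beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        resid = y[keep] - X[keep] @ beta
        s2 = resid @ resid / (n - 1 - p)             # error variance without case i
        h = X[i] @ np.linalg.inv(X[keep].T @ X[keep]) @ X[i]
        t[i] = (y[i] - X[i] @ beta) / np.sqrt(s2 * (1 + h))
    return t
```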

Page 5:

Assessing Leverage

• Hat values measure the distance of an observation from the mean (or centroid) of all observations

• The dependent variable is not involved in determining leverage (see the sketch below)
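A sketch of hat values under the same hypothetical X as above; the 2p/n cutoff in the comment is a common rule of thumb, not something stated on the slide:

```python
# Hat values (leverage): the diagonal of H = X (X'X)^{-1} X'.
# Note that only X appears; y plays no role.
import numpy as np

def hat_values(X):
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# common rule of thumb: flag observations with h_i > 2p/n
```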

Page 6:

Measuring Influence

• Impact on the coefficients of deleting an observation (sketched below)
– DFBETA
– Cook's D
– DFFITS

• Impact on the standard errors
– COVRATIO
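DFBETA and Cook's D can be computed without actually refitting n times, using standard deletion identities; this numpy sketch (hypothetical X and y, as before) is one such transcription, not the lecture's code:

```python
import numpy as np

def deletion_diagnostics(X, y):
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    h = np.diag(X @ XtX_inv @ X.T)                   # hat values
    s2 = resid @ resid / (n - p)
    # DFBETA: row i is the change in the coefficients when case i is deleted
    dfbeta = (XtX_inv @ X.T).T * (resid / (1 - h))[:, None]
    # Cook's D: overall influence of case i on the fitted coefficients
    cooks_d = resid**2 * h / (p * s2 * (1 - h) ** 2)
    return dfbeta, cooks_d
```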

Page 7:

Case selection

• Not all cases are created equal
• Some influential cases are good
• Some are bad ("outliers")
• Some non-influential cases are redundant

• It would be nice to keep a "minimal" set of good cases in training sets for fast on-line training

Page 8:

Classical Diagnostics

• Unicase selection is determined by removing one observation and inspecting the results

• Unicase influence on
– Estimated parameters (coefficients)
– Fitted values (Y-hat)
– Residuals (error)

Page 9:

When outcomes are binary

• Residuals may not reflect discriminatory performance, but rather calibration

• Remember that a model with good discriminatory performance can be recalibrated

• The same rationale applies to the coefficients

Page 10:

Influence

• The definition of influence is not fixed

• If the main reason for building models is prediction, then evaluating model performance on different subsets of the original sample can point to good, redundant, and bad cases

Page 11:

Qualifying a case

• Bad cases, when removed, should result in models with better predictions

• Redundant cases, when removed, should not affect predictions

• Good cases, when removed, should result in models with worse predictions

Page 12:

Defining prediction performance

• Use, for example, areas under ROC curves (or mean squared error, or cross-entropy error)

• For each set of samples:
– Evaluate performance on the training and holdout sets
– Determine which cases to remove

• Determine performance on the test or validation sets (a scoring sketch follows)
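One way to make "performance on a holdout set" concrete; the scikit-learn classifier and the helper name holdout_auc are illustrative assumptions, not the lecture's setup:

```python
# Score a candidate training set by the AUC it achieves on held-out data.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def holdout_auc(X_train, y_train, X_hold, y_hold):
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
```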

Page 13:

Sequential Multicase Selection

• Sequential procedure (a greedy sketch follows)
– Remove the most influential case
– Remove the second-most influential case (conditioned on the first)
– And so on...

• By contrast, exhaustive selection would require evaluating C(n, i) subsets for each i = 1 to m, where C(n, m) represents the number of subsets of size m that can be built from n cases

• Problem: cases are not considered en bloc
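A sketch of the greedy sequential procedure, with "influence" operationalized as the change in holdout AUC via the hypothetical holdout_auc above; the lecture does not prescribe this exact criterion:

```python
import numpy as np

def sequential_removal(X, y, X_hold, y_hold, max_removed):
    keep = list(range(len(y)))
    for _ in range(max_removed):
        base = holdout_auc(X[keep], y[keep], X_hold, y_hold)
        gains = []
        for i in keep:          # score each case, conditioned on earlier removals
            trial = [j for j in keep if j != i]
            gains.append(holdout_auc(X[trial], y[trial], X_hold, y_hold) - base)
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break               # no single deletion helps any further
        del keep[best]          # drop the most "influentially bad" case
    return keep
```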

Page 14: Case Selection and Resampling Lucila Ohno-Machado HST951

Alternatives

• Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search)

• Analogous to variable selection

Page 15:

Genetic Algorithm

• Given a training set C and a selection of cases v, we construct a logistic regression model l_C(v). We evaluate the model using the AUC, and represent this evaluation as a(l_C(v)). For a total number of cases n, and m cases in selection v, we use the following fitness function:

• f(v, C) = a(l_C(v)) + (n − m)/n
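A direct transcription of this fitness function, assuming v is a boolean mask over the n training cases and reusing the hypothetical holdout_auc sketched earlier for a(·):

```python
import numpy as np

def fitness(v, X, y, X_hold, y_hold):
    n, m = len(v), int(np.sum(v))                    # total and selected cases
    auc = holdout_auc(X[v], y[v], X_hold, y_hold)    # a(l_C(v))
    return auc + (n - m) / n                         # rewards smaller case sets
```

The (n − m)/n term trades accuracy against parsimony: among selections with equal AUC, the one using fewer cases scores higher.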

Pages 16-19: (figure slides; no text was captured in the transcript)

Page 20:

Resampling

Page 21:

Bootstrap Motivation

• Sometimes it is not possible to collect many samples from a population

• Sometimes it is not correct to assume a certain distribution for the population

• Goal: Assess sampling variation

Page 22:

Bootstrap

• Efron (Stanford biostatistics), late 1970s
– "Pulling oneself up by one's bootstraps"

• Nonparametric approach to statistical inference
• Uses computation instead of traditional distributional assumptions and asymptotic results
• Can be used to derive standard errors, confidence intervals, and hypothesis tests

Page 23:

Example

• Adapted from Fox (1997), "Applied Regression Analysis"

• Goal: estimate the mean difference between Male and Female values of a measurement X

• Four pairs of observations are available:

Page 24:

Observation   Male   Female   Difference
1              24     18         6
2              14     17        -3
3              40     35         5
4              44     41         3

Page 25:

Mean Difference

• The sample mean is (6 − 3 + 5 + 3)/4 = 2.75

• If Y were normally distributed, the 95% CI would be Ȳ ± 1.96 σ_Y / √n

• But we do not know σ_Y

Page 26:

Estimates

• The estimate of σ is S = √[ Σ (Y_i − Ȳ)² / (n − 1) ]

• The estimate of the standard error is SE(Ȳ) = S / √n (it carries a hat, since it is an estimate)

• Assuming the population is normally distributed, we can use the t-distribution:

Ȳ ± t_(n−1, .025) S / √n

Page 27:

Confidence Interval

Ȳ ± t_(n−1, .025) S / √n = 2.75 ± 3.182 (2.015) = 2.75 ± 6.41

−3.66 < μ < 9.16

HUGE!!!

Page 28:

Sample mean and variance

• Use the distribution Y* of the sample to estimate the distribution of Y in the population

y*    p*(y*)
 6     .25
-3     .25
 5     .25
 3     .25

E*(Y*) = Σ y* p*(y*) = 2.75
V*(Y*) = Σ [y* − E*(Y*)]² p*(y*) = 12.187

Page 29:

Sample with Replacement

Sample   Y1*   Y2*   Y3*   Y4*    Ȳ*
1         6     6     6     6    6.00
2         6     6     6    -3    3.75
3         6     6     6     5    5.75
...
100      -3     5     6     3    2.75
101      -3     5    -3     6    1.25
...
255       3     3     3     5    3.50
256       3     3     3     3    3.00
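All 4⁴ = 256 equally likely resamples can be enumerated directly, reproducing the grand mean and the bootstrap SE quoted on the next slide; a short self-contained check:

```python
# Enumerate every bootstrap resample of the four observed differences.
from itertools import product
import numpy as np

data = [6, -3, 5, 3]
means = np.array([np.mean(s) for s in product(data, repeat=4)])  # 256 means
print(means.mean())                                   # 2.75
print(np.sqrt(np.mean((means - means.mean())**2)))    # 1.7455... (the 1.745 below)
```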

Page 30:

Calculating the CI

• The mean of the 256 bootstrap means is 2.75, but the standard error is

SE*(Ȳ*) = √[ Σ_b (Ȳ*_b − 2.75)² / 256 ] = 1.745

(no hat, since this SE is not estimated but known)

• Equivalently, SE*(Ȳ*) = √[(n − 1)/n] · S / √n = √(3/4) × 2.015 = 1.745

Page 31:

So what?

• We already knew that!

• But with the bootstrap
– Confidence intervals can be more accurate
– It can be used for non-linear statistics without known standard-error formulas

Page 32:

The population is to the sample
as
the sample is to the bootstrap samples

In practice (as opposed to the previous example), not all possible bootstrap samples are enumerated; a large number are drawn at random

Page 33:

Procedure

• 1. Specify the data-collection scheme that results in the observed sample:
Collect(population) -> sample

• 2. Use the sample as if it were the population (sampling with replacement):
Collect(sample) -> bootstrap sample 1
Collect(sample) -> bootstrap sample 2
etc.

Page 34:

Cont.

• 3. For each bootstrap sample, calculate the estimate you are looking for

• 4. Use the distribution of the bootstrap estimates to estimate the sampling properties of the statistic (a generic sketch follows)
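Steps 1-4 reduce to a short loop; this generic numpy sketch assumes the sample is an array and statistic is any function of it:

```python
import numpy as np

def bootstrap_distribution(sample, statistic, B=2000, seed=0):
    """Draw B resamples with replacement; return the statistic's distribution."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    idx = rng.integers(0, n, size=(B, n))                 # step 2: resample indices
    return np.array([statistic(sample[i]) for i in idx])  # steps 3-4
```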

Page 35:

Bootstrap Confidence Intervals

• Normal theory
• Percentile intervals (sketched below)

Example: a 95% percentile interval is obtained by sorting the bootstrap replicates and taking
– Lower bound = the replicate at position 0.025 × (number of replicates)
– Upper bound = the replicate at position 0.975 × (number of replicates)

• There are corrections for bootstrap intervals
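The percentile interval in code, reusing the bootstrap_distribution sketch from the previous page on the four observed differences:

```python
import numpy as np

reps = bootstrap_distribution(np.array([6, -3, 5, 3]), np.mean, B=5000)
lower, upper = np.percentile(reps, [2.5, 97.5])   # 0.025 and 0.975 quantiles
print(lower, upper)
```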

Page 36:

Bootstrapping Linear Regression

• The observed estimate is usually the coefficient(s); there are (at least) two ways of doing this

• Resample observations (the usual way) and re-regress (X will vary)

• Resample residuals (X is fixed; Y* = Ŷ + E* is the new dependent variable; re-regress with X fixed; sketched below)
– Assumes the errors are identically distributed
– The impact of high-leverage outliers may be lost
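A minimal numpy sketch of the residual-resampling scheme, again with a hypothetical X (intercept column included) and y:

```python
import numpy as np

def bootstrap_residuals(X, y, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    resid = y - fitted
    betas = np.empty((B, X.shape[1]))
    for b in range(B):
        e_star = resid[rng.integers(0, n, n)]    # resample the residuals
        y_star = fitted + e_star                 # Y* = Y-hat + E*, X held fixed
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return betas        # bootstrap distribution of the coefficients
```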

Page 37:

Bootstrap for other methods

• Used in other classification methods (neural networks, classification trees, etc.)

• Usually most useful when the sample size is small and no distributional assumptions can be made

• The same principles apply

Page 38:

Other resampling methods

• Jackknife (leave one out) is a special case of the bootstrap
– Resamples without one case and without replacement (samples have size n − 1)

• Cross-validation
– Divides the data into training and test sets (both schemes are sketched below)

• Generally used to estimate confidence intervals on predictions for the "full" model (i.e., the model that utilized all cases)
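Minimal sketches of both schemes, with illustrative names (jackknife, split_train_test) that are not from the lecture:

```python
import numpy as np

def jackknife(sample, statistic):
    # each pseudo-sample omits exactly one case: size n - 1, no replacement
    n = len(sample)
    return np.array([statistic(np.delete(sample, i)) for i in range(n)])

def split_train_test(n, test_fraction=0.5, seed=0):
    # one cross-validation-style split into training and test indices
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(n * (1 - test_fraction))
    return idx[:cut], idx[cut:]
```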