Robust strategies and model selection
Stefan Van Aelst
Department of Applied Mathematics and Computer Science, Ghent University, Belgium
ERCIM09 - COMISEF/COST Tutorial
Outline
1 Regression model
2 Least squares
3 Manual variable selection approach
4 Automatic variable selection approach
5 Robustness
6 Robust variable selection: sequencing
7 Robust variable selection: segmentation
Regression model
Regression setting
Consider a dataset Zn = {(yi, xi1, . . . , xid) = (yi, xi); i = 1, . . . , n} ⊂ R^(d+1).
Y is the response variable
X1, . . . ,Xd are the candidate regressors
The corresponding linear model is:
yi = β1 xi1 + · · · + βd xid + εi, i = 1, . . . , n
yi = x′i β + εi, i = 1, . . . , n
where the errors εi are assumed to be i.i.d. with E(εi) = 0 and Var(εi) = σ² > 0.
Estimate the regression coefficients β from the data.
Least squares
Least squares solution
β̂LS solves min_β Σ_{i=1}^n (yi − x′i β)²

Write X = (x1, . . . , xn)′ and y = (y1, . . . , yn)′.

Then β̂LS solves min_β (y − Xβ)′(y − Xβ)

⇒ β̂LS = (X′X)⁻¹ X′y

ŷ = X β̂LS = X (X′X)⁻¹ X′y = Hy
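As a quick illustration, a minimal numpy sketch of the closed-form solution and the hat matrix, on simulated data (the variable names are ours, not from the slides):

```python
import numpy as np

# Minimal sketch: closed-form LS and the hat matrix on simulated data.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

# beta_LS = (X'X)^{-1} X'y; lstsq is the numerically safer route.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hat matrix H = X (X'X)^{-1} X' maps y onto the fitted values.
H = X @ np.linalg.solve(X.T @ X, X.T)
assert np.allclose(H @ y, X @ beta_ls)
```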
Least squares
Least squares properties
Unbiased estimator: E(β̂LS) = β
Gauss–Markov theorem: LS has the smallest variance among all unbiased linear estimators of β.
Why do variable selection?
Least squares
Expected prediction error
Assume the true regression function is linear: Y|x = f(x) + ε = x′β + ε
Predict the response Y0 at x0: Y0 = x′0 β + ε0 = f(x0) + ε0
Use an estimator of the regression coefficients: β̂
Estimated prediction: f̂(x0) = x′0 β̂
Expected prediction error: E[(Y0 − f̂(x0))²]
Least squares
Expected prediction error
E[(Y0 − f̂(x0))²] = E[(f(x0) + ε0 − f̂(x0))²]
= σ² + E[(f(x0) − f̂(x0))²]
= σ² + MSE(f̂(x0))
σ²: irreducible variance of the new observation Y0
MSE(f̂(x0)): mean squared error of the prediction at x0 by the estimator f̂
Least squares
MSE of a prediction
MSE(f̂(x0)) = E[(f(x0) − f̂(x0))²]
= E[(x′0 (β − β̂))²]
= E[(x′0 (β − E(β̂) + E(β̂) − β̂))²]
= bias(f̂(x0))² + Var(f̂(x0))
LS is unbiased ⇒ bias(f̂(x0)) = 0
LS minimizes Var(f̂(x0)) (Gauss–Markov)
⇒ LS has the smallest MSPE among all linear unbiased estimators
Least squares
LS instability
LS becomes unstable, with large MSPE, if Var(f̂(x0)) is high. This can happen if
many noise variables are among the candidate regressors
the predictors are highly correlated (multicollinearity)
⇒ Improve on the LS MSPE by accepting a little bias in return for a large reduction in variance!
Manual variable selection approach
Manual variable selection
Try to determine the set of the most important regressors
Remove the noise regressors from the model
Avoid multicollinearity
Methods
All subsets
Backward elimination
Forward selection
Stepwise selection
→ choose a selection criterion
Manual variable selection approach
Submodels
Dataset Zn = {(yi, xi1, . . . , xid) = (yi, xi); i = 1, . . . , n} ⊂ R^(d+1).
Let α ⊂ {1, . . . , d} denote the predictors included in a submodel.
The corresponding submodel is:
yi = x′αi βα + εαi, i = 1, . . . , n.
A selected model is considered a good model if
It is parsimonious
It fits the data well
It yields good predictions for similar data
Manual variable selection approach
Some standard selection criteria
Adjusted R²: A(α) = 1 − [RSS(α)/(n − d(α))] / [RSS(1)/(n − 1)]
Mallows' Cp: C(α) = RSS(α)/σ̂² − (n − 2d(α))
Final Prediction Error: FPE(α) = RSS(α)/σ̂² + 2d(α)
AIC: AIC(α) = −2L(α) + 2d(α)
BIC: BIC(α) = −2L(α) + log(n) d(α)
where σ̂ is the residual scale estimate in the "full" model
Manual variable selection approach
Resampling based selection criteria
Consider the (conditional) expected prediction error:
PE(α) = E[ (1/n) Σ_{i=1}^n (zi − x′αi β̂α)² | y, X ],
where zi denotes a new response observed at xi.
Estimates of the PE can be used as a selection criterion.
Estimates can be obtained by cross-validation or the bootstrap.
A more advanced selection criterion takes both goodness-of-fit and PE into account:
PPE(α) = (1/n) Σ_{i=1}^n (yi − x′αi β̂α)² + f(n) d(α) + E[ (1/n) Σ_{i=1}^n (zi − x′αi β̂α)² | y, X ]
Automatic variable selection approach
Automatic variable selection
Try to find a stable model that fits the data well
Shrinkage: constrained least squares optimization
Stagewise forward procedures
Methods
Ridge regression
Lasso
Least Angle regression
L2 Boosting
Elastic Net
Automatic variable selection approach
Lasso
Least Absolute Shrinkage and Selection Operator
β̂lasso = argmin_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^d βj xij )²
subject to ‖β‖1 = Σ_{j=1}^d |βj| ≤ t
0 < t < ‖β̂LS‖1 is a tuning parameter
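For illustration, a small sketch that solves the equivalent penalized (Lagrangian) form min_β Σ (yi − x′iβ)² + λ‖β‖1 by coordinate descent; the function name and the standardization assumptions are ours:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the penalized lasso
    min_b sum_i (y_i - x_i'b)^2 + lam * sum_j |b_j|.
    Assumes standardized columns of X and a centered y (no intercept)."""
    n, d = X.shape
    b = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)             # per-column sums of squares
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual without x_j
            rho = X[:, j] @ r_j
            # Soft-thresholding update for coordinate j.
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
    return b
```

Each value of λ corresponds to some bound t in the constrained form above; larger λ (smaller t) sets more coefficients exactly to zero.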
Automatic variable selection approach
Example: LASSO fits
[Figure: LASSO coefficient paths — standardized coefficients plotted against the model degrees of freedom (Df).]
Automatic variable selection approach
Least angle regression
Standardize the variables.
1 Select x1 such that |cor(y, x1)| = max_j |cor(y, xj)|.
2 Put r = y − γ̂ x1, where γ̂ is determined such that |cor(r, x1)| = max_{j≠1} |cor(r, xj)|.
3 Select x2 corresponding to the maximum above. Determine the equiangular direction b such that x′1 b = x′2 b.
4 Put r = r − γ̂ b, where γ̂ is determined such that |cor(r, x1)| = |cor(r, x2)| = max_{j≠1,2} |cor(r, xj)|.
5 Continue the procedure . . .
Automatic variable selection approach
Properties of LAR
Least angle regression (LAR) selects the predictors in order of importance.
LAR changes the contributions of the predictors gradually, as they are needed.
LAR is very similar to the LASSO and can easily be adjusted to produce the LASSO solution.
LAR only uses the means, variances and correlations of the variables.
LAR is computationally as efficient as LS.
Automatic variable selection approach
Example: LAR fits
[Figure: LAR coefficient paths — standardized coefficients plotted against the model degrees of freedom (Df); the profiles closely resemble the LASSO paths.]
Automatic variable selection approach
L2 boosting
Standardize the variables.
1 Put r = y and F0 = 0
2 Select x1 such that |cor(r, x1)| = max_j |cor(r, xj)|.
3 Update r = y − ν f̂(x1), where 0 < ν ≤ 1 is the step length and f̂(x1) are the fitted values from the LS regression of y on x1. Similarly, update F1 = F0 + ν f̂(x1).
4 Continue the procedure . . .
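A compact sketch of componentwise L2 boosting following these steps (assumptions ours: standardized predictors, centered response; at later stages the current residuals, rather than y itself, are regressed on the selected predictor):

```python
import numpy as np

def l2_boost(X, y, nu=0.1, n_steps=50):
    """Sketch of componentwise L2 boosting.
    Assumes standardized columns of X and a centered y."""
    n, d = X.shape
    r = y.copy()                 # current residuals
    F = np.zeros(n)              # accumulated fit
    beta = np.zeros(d)           # accumulated coefficients
    for _ in range(n_steps):
        # Pick the predictor most correlated with the residuals.
        cors = np.array([np.corrcoef(r, X[:, j])[0, 1] for j in range(d)])
        j = int(np.argmax(np.abs(cors)))
        # LS fit of the residuals on x_j, shrunk by the step length nu.
        g = (X[:, j] @ r) / (X[:, j] @ X[:, j])
        beta[j] += nu * g
        F += nu * g * X[:, j]
        r = y - F
    return beta
```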
Automatic variable selection approach
Sequencing variables
Several selection algorithms sequence the predictors in "order of importance" or screen out the most relevant variables
Forward/stepwise selection
Stagewise forward selection
Penalty methods
Least angle regression
L2 boosting
These methods are computationally very efficient because they are based only on means, variances and correlations.
Robustness
Robustness: Data with outliers
Question: How many partners do men and women desire to have in the next 30 years?
Men: mean = 64.3, median = 1
−→ The mean is sensitive to outliers
−→ The median is robust and thus more reliable
Robustness
Least squares regression
[Figure: scatterplot of log light intensity against log surface temperature (stars data) with the LS fit.]
LS: Minimize Σ_i ri²(β)
Robustness
Outliers
[Figure: the same scatterplot with outliers added; the LS line is pulled toward them.]
Outliers attract LS!
Robustness
Robust regression estimators
[Figure: the scatterplot with both the LS and the robust MM fits; the MM line follows the bulk of the data.]
The robust MM estimator is less influenced by outliers!
Robustness
Robust univariate location estimators
The sample mean X̄n satisfies the equation Σ_{i=1}^n (Xi − X̄n) = 0
The ML estimator θ̂ solves the equation Σ_{i=1}^n ∂/∂θ log fθ(Xi) |_{θ=θ̂} = 0
For a suitable score function ψ, the M-estimator Tn solves the equation Σ_{i=1}^n ψ(Xi − Tn) = 0
Robustness
Univariate location M-estimators
Σ_{i=1}^n ψ(Xi − Tn) = 0
Consistent if ∫ ψ(y) dF(y) = E_F(ψ(y)) = 0
Asymptotic efficiency: (∫ ψ′ dΦ)² / ∫ ψ² dΦ
Robustness: maximal breakdown point (50%) if ψ(y) is bounded!
Robustness
Examples of M-estimators
Sample mean: ψ(t) = t: Unbounded! Efficiency: 100%
Median: ψ(t) = sign(t): Bounded, efficiency: 63.7%
Huber estimator: ψb(t) = min{b, max{t, −b}}
= t if |t| ≤ b, sign(t) b if |t| ≥ b, with b > 0
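A minimal sketch of the Huber ψ and the corresponding location M-estimate computed by iteratively reweighted averaging; the scale is fixed at the MAD, and the default b = 1.345 is a commonly used tuning for roughly 95% efficiency at the normal (assumptions ours):

```python
import numpy as np

def psi_huber(t, b=1.345):
    """Huber psi: the identity clipped at +/- b."""
    return np.clip(t, -b, b)

def huber_location(x, b=1.345, tol=1e-8, max_iter=100):
    """Location M-estimate via iteratively reweighted averaging,
    with the scale fixed at the MAD (assumed nonzero)."""
    x = np.asarray(x, dtype=float)
    s = 1.4826 * np.median(np.abs(x - np.median(x)))  # MAD scale
    t = np.median(x)                                  # robust start
    for _ in range(max_iter):
        u = (x - t) / s
        w = np.ones_like(u)
        nz = u != 0
        w[nz] = psi_huber(u[nz], b) / u[nz]           # weights psi(u)/u
        t_new = np.sum(w * x) / np.sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t
```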
Robustness
Huber psi function
[Figure: the Huber ψ function — linear between −b and b, constant outside.]
Robustness
Tuning the Huber M-estimator
The Huber M-estimator has maximal breakdown point for any b < ∞
→ b can be chosen for good efficiency at Φ
b = 1.37 yields 95% efficiency
→ trade-off between robustness and efficiency!
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
[Figure: dotplot of the 24 copper measurements (scale roughly 5–30 ppm); one extreme value stands out.]
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
Sample mean: 4.28
Sample median: 3.39
Huber M-estimator: 3.21
Robustness
Monotone M-estimates
The Huber M-estimator has a monotone ψ function.
If the function ψ(t) is monotone, then
the equation Σ_{i=1}^n ψ(Xi − Tn) = 0 has a unique solution
Tn is easy to compute
Tn has maximal breakdown point
large outliers still affect the estimate (although the effect remains bounded)
Robustness
Redescending M-estimates
If the function ψ(t) is not monotone, but redescends to zero, then
the equation Σ_{i=1}^n ψ(Xi − Tn) = 0 has multiple solutions
Define ρ(t) such that ρ′(t) = ψ(t); then we need the solution of
min_{Tn} Σ_{i=1}^n ρ(Xi − Tn)
Tn can be more difficult to compute
Tn has maximal breakdown point
The effect of large outliers on the estimate reduces to zero!
Increased robustness against large outliers
Robustness
Redescending M-estimates
A popular family of redescending loss functions is the Tukey biweight (bisquare) family of loss functions:
ρc(t) = t²/2 − t⁴/(2c²) + t⁶/(6c⁴) if |t| ≤ c
ρc(t) = c²/6 if |t| ≥ c
The constant c can be tuned for efficiency
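A short sketch of the biweight loss and its derivative (the redescending ψ); the default c = 4.685 is a commonly used efficiency tuning for regression, an assumption on our part:

```python
import numpy as np

def rho_tukey(t, c=4.685):
    """Tukey biweight loss, bounded at c^2/6 for |t| >= c."""
    a = np.minimum(np.abs(np.asarray(t, dtype=float)), c)
    return a**2 / 2 - a**4 / (2 * c**2) + a**6 / (6 * c**4)

def psi_tukey(t, c=4.685):
    """Its derivative: psi(t) = t (1 - (t/c)^2)^2 inside, 0 outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= c, t * (1 - (t / c) ** 2) ** 2, 0.0)
```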
Robustness
Tukey biweight ρ functions
[Figure: Tukey biweight ρ functions for c = 2, c = 3 and c = ∞ (the unbounded LS limit).]
Robustness
Tukey biweight ψ function
[Figure: Huber versus Tukey biweight ψ functions; the Huber ψ stays constant beyond b, while the biweight ψ redescends to zero beyond c.]
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
Sample mean: 4.28
Sample median: 3.39
Huber M-estimator: 3.21
Tukey biweight M-estimator: 3.16
Robustness
Univariate scale estimators
Example: Copper content (parts per million) in 24 wholemeal flour samples
Standard deviation: 5.30
Median absolute deviation (MAD): Sn = 1.483 · med_i(|Xi − med_j(Xj)|)
MAD: 0.53
−→ The standard deviation is sensitive to outliers
−→ The MAD is robust and thus more reliable
Robustness
M-estimators of scale
An M-estimator of scale is the solution Sn of Σ_{i=1}^n ψ(Xi/Sn) = 0
Symmetric distributions: use symmetric ψ functions. Consistent if ∫ ψ(y) dF(y) = E_F(ψ(y)) = 0
The Tukey biweight loss functions ρc are symmetric. Put b = E_Φ(ρc) and define ψc(t) = ρc(t) − b; then the Tukey biweight M-estimator of scale Sn solves
Σ_{i=1}^n ψc(Xi/Sn) = 0, or equivalently (1/n) Σ_{i=1}^n ρc(Xi/Sn) = b
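A fixed-point sketch of the biweight M-scale, reusing the rho_tukey function sketched earlier; computing b = E_Φ(ρc) numerically keeps the estimate consistent for any c, and c ≈ 1.548 is a common 50%-breakdown tuning (our assumption):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def m_scale(x, c=1.548, max_iter=100, tol=1e-10):
    """Fixed-point iteration for the biweight M-scale: find s with
    mean(rho_c(x/s)) = b, b = E_Phi(rho_c). Assumes x is already
    centered (e.g. residuals) and not degenerate."""
    b = quad(lambda u: rho_tukey(u, c) * norm.pdf(u), -np.inf, np.inf)[0]
    s = 1.4826 * np.median(np.abs(x - np.median(x)))  # MAD as start
    for _ in range(max_iter):
        s_new = s * np.sqrt(np.mean(rho_tukey(x / s, c)) / b)
        if abs(s_new - s) < tol * s:
            return s_new
        s = s_new
    return s
```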
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
Standard deviation: 5.30
Median absolute deviation: 0.53
Tukey biweight M-estimator: 0.66
Robustness
Robust regression
Denote by ri(β) = yi − x′i β the residuals corresponding to β.
β̂LS solves min_β Σ_{i=1}^n (yi − x′i β)² = Σ_{i=1}^n ri(β)²
Denote by σ̂(β) = √( Σ_{i=1}^n ri(β)² / (n − d) ) the estimate of the residual scale.
The LS estimator β̂LS then equivalently solves min_β σ̂(β)
⇒ Instead, minimize a robust estimate of the residual scale
Robustness
Least Median of Squares regression
LS: Minimize (1/(n − d)) Σ_{i=1}^n ri(β)² −→ LMS: Minimize med_i ri(β)²
Maximal breakdown point (50%)
Small bias
Slow rate of convergence (n^(−1/3))
Inefficient
Robustness
Least Trimmed Squares regression
LS: Minimize (1/(n − d)) Σ_{i=1}^n ri(β)² −→ LTS: Minimize (1/h) Σ_{i=1}^h (r(β)²)_{i:n}
where (r(β)²)_{1:n} ≤ · · · ≤ (r(β)²)_{n:n} are the ordered squared residuals
Breakdown point is min{h, n − h}/n ≤ 50%
Asymptotically normal
Trade-off robustness–efficiency
Low efficiency (less than 10%)
Robustness
Regression S-estimators
LS: Minimize (1/n) Σ_{i=1}^n ri(β)² −→ S-estimate: Minimize σ̂(β)
where, for each β, σ̂(β) solves (1/n) Σ_i ρc(ri(β)/σ) = b
c determines both robustness and efficiency
Trade-off robustness–efficiency
Breakdown point can be up to 50%
Asymptotically normal
Efficiency can still be low (less than 35%)
Robustness
Regression M-estimators
LS: Minimize Σ_{i=1}^n ri(β)² −→ M-estimate: Minimize Σ_{i=1}^n ρ(ri(β)/σ̂)
or solve Σ_{i=1}^n ψ(ri(β)/σ̂) xi = 0
Requires a robust scale estimate σ̂!
Robustness
MM estimates
LS: Minimize Σ_{i=1}^n ri(β)² −→ MM-estimate: Minimize Σ_{i=1}^n ρ(ri(β)/σ̂)
σ̂ is the S-estimator's M-scale
The M- and S-estimators both use Tukey biweight ρc functions
The S-estimator is tuned for robustness (breakdown point)
The redescending M-estimator is tuned for efficiency
Robustness
MM: loss functions
Standardized Tukey biweight family: ρc(t) = 3t²/c² − 3t⁴/c⁴ + t⁶/c⁶ if |t| ≤ c, and 1 if |t| > c
[Figure: two biweight loss functions, ρ0 with constant c0 and ρ1 with constant c1 > c0.]
ρ0 determines the breakdown point (S-estimator)
ρ1 determines the efficiency (MM-estimator)
Robustness
MM estimates
Same criterion as above: minimize Σ_{i=1}^n ρ(ri(β)/σ̂) with σ̂ the S-estimator's M-scale, ρ0 tuned for the breakdown point and ρ1 for efficiency.
⇒ Highly robust and efficient!
Robustness
Redescending psi function
⋆ A redescending ψ function is needed for robustness, but this implies
Multiple solutions of score equations
Global solution is needed (high breakdown point)
Difficult (time consuming) to compute
Robust variable selection: sequencing
Robust variable selection
Issues
Robust regression estimators are computationally demanding
'Outliers' depend on the model under consideration
High dimensional data: outlying cases?
Our approach: a two-step procedure
Sequencing: construct a reduced sequence of good predictors in an efficient way.
Segmentation: build an optimal model from the reduced set of predictors.
Robust variable selection: sequencing
Sequencing the variables in order of importance
Automatic variable selection methods such as forward/stepwise selection, LAR and L2 boosting are computationally efficient methods to sequence predictors.
These methods are based only on the means, variances and correlations of the data.
⇒ Construct computationally efficient, robust methods to sequence predictors by using computationally efficient and highly robust estimates of center, scale and correlation
Robust variable selection: sequencing
Robust building blocks
Location: Median
Scatter: Median Absolute Deviation
Correlation: Bivariate Winsorization
Correlation: Bivariate M-estimators
Correlation: Gnanadesikan-Kettenring estimators
Robust variable selection: sequencing
Winsorized correlation estimates
1 Robustly standardize the data using median and MAD
2 Transform the data by shifting outliers towards the center
3 Calculate the Pearson correlation of the transformed data
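A minimal sketch of these three steps with a univariate (componentwise) Winsorization; the clipping constant c = 2 and the function names are our choices:

```python
import numpy as np

def winsorized_cor(x, y, c=2.0):
    """Winsorized correlation sketch: robustly standardize with
    median/MAD, clip at +/- c, then take the Pearson correlation."""
    def standardize(v):
        med = np.median(v)
        mad = 1.4826 * np.median(np.abs(v - med))  # assumes mad > 0
        return (v - med) / mad
    u1 = np.clip(standardize(x), -c, c)  # shrink outliers to the center
    u2 = np.clip(standardize(y), -c, c)
    return np.corrcoef(u1, u2)[0, 1]
```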
Robust variable selection: sequencing
Univariate Winsorization
Componentwise transformation: u = ψc(x) = min(max(−c, x), c)
[Figure: the Huber ψ function with c = 2.]
Robust variable selection: sequencing
Univariate Winsorization
Componentwise transformation: u = ψc(x) = min(max(−c, x), c)
[Figure: bivariate data Winsorized componentwise; outlying points are clipped to the square [−c, c] × [−c, c].]
Robust variable selection: sequencing
Bivariate Winsorization
Bivariate transformation:
u = min(√(c/D(x)), 1) · x, with c = F⁻¹_{χ²₂}(0.95)
D(x) = x′ R0⁻¹ x, with R0 an initial bivariate correlation matrix.
[Figure: bivariate data with the tolerance ellipse defined by D(x) = c; points outside the ellipse are shrunken toward the center onto it.]
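A sketch of this transformation; we assume the data matrix has already been robustly standardized (median/MAD) and that an initial correlation matrix R0 is available (e.g. from adjusted Winsorization, next slide):

```python
import numpy as np
from scipy.stats import chi2

def bivariate_winsorize(X, R0, c=None):
    """Bivariate Winsorization sketch: shrink each point x toward the
    origin until D(x) = x' R0^{-1} x <= c, c = 0.95-quantile of chi2_2.
    X is an (n x 2) array of robustly standardized data."""
    if c is None:
        c = chi2.ppf(0.95, df=2)
    R0_inv = np.linalg.inv(R0)
    D = np.einsum('ij,jk,ik->i', X, R0_inv, X)   # D(x_i) for every row
    shrink = np.minimum(np.sqrt(c / np.maximum(D, 1e-12)), 1.0)
    return X * shrink[:, None]

# The Winsorized correlation is then the Pearson correlation of the
# transformed data.
```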
Robust variable selection: sequencing
Initial correlation estimate
Adjusted Winsorization: univariate Winsorization with different tuning constants in different quadrants.
Denote by h the ratio of the number of observations in the second and fourth quadrants to the number in the first and third quadrants.
Suppose h ≤ 1; then
use the constant c1 for Winsorizing points in the first and third quadrants,
use c2 = √h · c1 for the second and fourth quadrants.
R0 is the correlation matrix of the adjusted Winsorized data.
Robust variable selection: sequencing
Initial correlation estimate
Adjusted Winsorization: univariate Winsorization with different tuning constants in different quadrants.
[Figure: bivariate data with the two Winsorization squares; the larger constant is used in the quadrants containing most of the data.]
Robust variable selection: sequencing
Initial correlation estimate
Univariate Winsorization (for comparison)
[Figure: the same data Winsorized with a single univariate tuning constant.]
Robust variable selection: sequencing
Correlation M-estimators
1 First center the two variables using their medians
2 An M-estimate of the covariance matrix is the solution V of the equation
(1/n) Σ_i u2(di²) xi x′i = V,
where di² = x′i V⁻¹ xi and u2(t) = min(χ²₂(0.99)/t, 1)
3 Calculate the correlation corresponding to the bivariate covariance matrix V
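A fixed-point sketch of this bivariate M-estimator; the starting value (a diagonal matrix of squared MADs) and the numerical guards are our choices:

```python
import numpy as np
from scipy.stats import chi2

def m_cov_bivariate(X, max_iter=50, tol=1e-8):
    """Fixed-point sketch of the covariance M-estimator:
    V = (1/n) sum_i u2(d_i^2) x_i x_i', d_i^2 = x_i' V^{-1} x_i,
    u2(t) = min(chi2_2(0.99)/t, 1). Assumes the two columns of X
    were centered by their medians."""
    q = chi2.ppf(0.99, df=2)
    mads = 1.4826 * np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    V = np.diag(mads ** 2)                  # simple robust start
    for _ in range(max_iter):
        d2 = np.einsum('ij,jk,ik->i', X, np.linalg.inv(V), X)
        w = np.minimum(q / np.maximum(d2, 1e-12), 1.0)
        V_new = (w[:, None] * X).T @ X / X.shape[0]
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])
```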
Robust variable selection: sequencing
Gnanadesikan-Kettenring correlation estimators
Consider the identity
cov(X, Y) = (1/4) (sd(X + Y)² − sd(X − Y)²)
Replace the sample standard deviations by robust estimates of scale to obtain robust correlation estimates.
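A short sketch using the MAD as the robust scale; standardizing both variables by the same robust scale first turns the robust covariance into a correlation (the clipping to [−1, 1] is our safeguard, since the robust version need not stay inside that range):

```python
import numpy as np

def mad(v):
    """MAD, scaled for consistency at the normal."""
    return 1.4826 * np.median(np.abs(v - np.median(v)))

def gk_cor(x, y, scale=mad):
    """Gnanadesikan-Kettenring sketch: apply the identity
    cov(X,Y) = (scale(X+Y)^2 - scale(X-Y)^2)/4 to robustly
    standardized variables, so the result is a correlation."""
    u, v = x / scale(x), y / scale(y)
    r = (scale(u + v) ** 2 - scale(u - v) ** 2) / 4.0
    return float(np.clip(r, -1.0, 1.0))
```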
Robust variable selection: sequencing
Robust correlations: Computational efficiency
[Figure: CPU time against sample size (10,000–50,000) for the robust correlation estimates: univariate Winsorization, adjusted Winsorization, bivariate Winsorization and the Maronna M-estimator.]
Robust variable selection: sequencing
Robust LAR: Computational efficiency
The computational efficiency of the correlation estimates largely determines the computing time of robust LAR.
[Figure: CPU time against dimension (50–300) for LARS, W-RLARS (Winsorization based) and M-RLARS (M-estimator based).]
Robust variable selection: sequencing
Bootstrapping the sequencing algorithms
Use bootstrap averages to obtain more reliable and stable sequences. Procedure:
1 Generate 50 bootstrap samples
2 Sequence the predictors in each sample
3 Rank the predictors according to their average rank over the bootstrap samples
Not all predictors have to be ranked in each bootstrap sample.
Robust variable selection: sequencing
Bootstrap effect on robust LAR
Simulation design
Samples of size 150 in 200 dimensions
10 target predictors
20 noise covariates correlated with target predictors
170 independent noise covariates
10% of symmetric or asymmetric high leverage outliers
We compare with random forests, using variable importance measures to sequence the variables.
Robust variable selection: sequencing
Bootstrap RLAR vs RLAR/Random Forests
[Figure: number of target variables recovered against number of sequenced variables, for symmetric (left panel) and asymmetric (right panel) high leverage outliers; methods: B-RLARS, RLARS, RF-OOB, RF-IMP.]
Robust variable selection: sequencing
Example: Demographic data
n = 50 states of USA, d = 25 covariates.
Response y = murder rate
One outlier
5-fold cross validation selects a model with 7 variables
We sequence the variables using B-RLARS and construct a learning curve:
a graphical tool to select the size of the reduced sequence in practice,
based on a robust R² measure, e.g. R² = 1 − Med(residual²) / MAD²(y)
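A one-line implementation of this robust R², using the 1.483-scaled MAD defined earlier (function name ours):

```python
import numpy as np

def robust_r2(residuals, y):
    """Robust R^2 sketch from the slide: 1 - med(residual^2)/MAD(y)^2."""
    mad_y = 1.4826 * np.median(np.abs(y - np.median(y)))
    return 1.0 - np.median(np.asarray(residuals) ** 2) / mad_y ** 2
```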
Robust variable selection: sequencing
Demographic data: learning curve
[Figure: learning curve — robust learning rate against the number of variables in the model; the curve levels off after roughly a dozen variables.]
⇒ Reduced set of at most 12 predictors
Robust variable selection: sequencing
Demographic data: models
Full CV model: 7 predictors
B-RLAR+CV: 6 predictors
LAR+CV: 8 predictors
RF-SEL: 5 predictors
RF-SEL+CV: 4 predictors
RF-RED+CV: 5 predictors
MSVM-RFE: 8 predictors
MSVM-RFE+CV: 6 predictors
Robust variable selection: sequencing
Demographic data: model comparison
Density estimates based on 1000 5-fold CV-MSPE estimates.
[Figure: density estimates of the 5-fold CV-MSPE for Full-CV, LARS+CV, B-RLARS+CV, RF-SEL+CV, RF-RED+CV and MSVM-RFE.]
Robust variable selection: sequencing
Example: Protein data
n = 4141 protein sequences, d = 77 covariates.
Training sample of size 2072 and test sample of size 2069. We selected predictors using
B-RLAR: 5 predictors
RF using OOB importance: 22 predictors
MSVM-RFE: 22 predictors
For RF we could determine an optimal submodel in the reduced sequence using robust MM-estimates with robust FPE. ⇒ RF+RFPE: 18 predictors
Robust variable selection: sequencing
Protein data: test sample errors
Trimmed means of squared prediction errors
Model       1%       5%       10%    (trimming fraction)
B-RLAR      116.19   97.73    84.67
RF          111.11   93.80    81.30
RF-RFPE     111.30   93.92    81.27
MSVM-RFE    173.70   150.48   133.17
Robust variable selection: sequencing
Example: Particle data
Quantum physics data with d = 64 predictors.
Training sample of size 5,000, test sample of size 45,000.
Forward selection (FS) and stepwise selection (SW) produced a model with 25 predictors.
Robust FS and SW produced a model with only 1 predictor.
Indeed, for more than 80% of the cases X1 = Y = 0.
For the cases with X1 ≠ 0, FS produced a model with 5 predictors.
We fit the final models using MM-estimators.
Robust variable selection: sequencing
Particle data: test sample errors
Trimmed means of squared prediction errors
Model        1%      5%     (trimming fraction)
FS           0.110   0.012
Robust FS    0.032   0.001
Robust variable selection: segmentation
Segmentation: Robust adjusted R-squared
Adjusted R²: A(α) = 1 − [RSS(α)/(n − d(α))] / [RSS(1)/(n − 1)]
Based on a robust regression estimator we can construct a robust adjusted R²:
RR²a(α) = 1 − [σ̂²α / (n − d(α))] / [σ̂²0 / (n − 1)],
σ̂α is the robust residual scale of the submodel with predictors indexed by α
σ̂0 is the robust residual scale of the intercept-only model
Robust variable selection: segmentation
Segmentation: Robust FPE
FPE(α) = RSS(α)/σ̂² + 2d(α) estimates the final prediction error
FPE(α) = (1/σ²) Σ_{i=1}^n E[(zi − x′αi β̂α)²], assuming that the model is correct.
Consider now the robust final prediction error:
RFPE(α) = Σ_{i=1}^n E[ ρ( (zi − x′αi β̂α)/σ ) ]. Assuming that the model is correct and using a second-order Taylor expansion, this can be estimated by
RFPE(α) = Σ_{i=1}^n ρ(ri(β̂α)/σ̂n) + d(α) · [ Σ_{i=1}^n ψ²(ri(β̂α)/σ̂n) / Σ_{i=1}^n ψ′(ri(β̂α)/σ̂n) ]
σ̂n is the robust scale estimate of a 'full' model αf. Usually αf = {1, . . . , d}.
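Given the residuals of a candidate submodel, this estimate is a one-liner; the argument names are ours, and rho, psi and dpsi could be the Tukey biweight functions sketched earlier:

```python
import numpy as np

def rfpe(residuals, sigma_full, d_alpha, rho, psi, dpsi):
    """Robust FPE sketch: sum rho(r_i/s) + d(alpha) * sum psi^2 / sum psi',
    with s the robust residual scale of the 'full' model."""
    u = np.asarray(residuals) / sigma_full
    return np.sum(rho(u)) + d_alpha * np.sum(psi(u) ** 2) / np.sum(dpsi(u))
```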
Robust variable selection: segmentation
Robust resampling based selection criteria
Robust equivalents of the resampling based selection criteria:
RPE(α) = (σ̂²n / n) E⋆[ Σ_{i=1}^n ρ( (zi − x′αi β̂α)/σ̂n ) | y, X ]
PRPE(α) = (σ̂²n / n) { Σ_{i=1}^n ρ( (yi − x′αi β̂α)/σ̂n ) + f(n) d(α) } + Mn(α)
ρ is the MM loss function and β̂α,n is the MM estimate
f(n) d(α) is the penalty term, with e.g. f(n) = 2 log n
σ̂n is the robust scale estimate of a 'full' model αf. Usually αf = {1, . . . , d}
E⋆ is a robust resampling estimate of the expected value
Robust variable selection: segmentation
Robustness and resampling
Resampling robust estimators causes problems with
robustness
speed
Stratified bootstrap (Müller and Welsh, JASA, 2005) only solves the first problem.
−→ Limited practical use.
The fast and robust bootstrap solves both problems.
Robust variable selection: segmentation
MM-estimators revisited
For the model comparison we use slightly adjusted MM-estimators. The MM-estimates β̂α satisfy
(1/n) Σ_{i=1}^n ψ1( (yi − x′αi β̂α)/σ̂n ) xαi = 0,
where σ̂n minimizes the M-scale σ̂n(β), which for any β ∈ R^d is defined as the solution of
(1/n) Σ_{i=1}^n ρ0( (yi − x′i β)/σ̂n(β) ) = b
ρ0 determines the breakdown point (S-estimator)
ρ1 determines the efficiency (MM-estimator)
Robust variable selection: segmentation
Bootstrapping MM-estimates
Weighted least squares representation of the MM-estimator:
β̂α,n = [ Σ_{i=1}^n ωαi xαi x′αi ]⁻¹ Σ_{i=1}^n ωαi xαi yi
with ωαi = ρ′1(rαi/σ̂n)/rαi and rαi = yi − β̂′α,n xαi
Let (y⋆i, x⋆αi), i = 1, . . . , m be a bootstrap sample of size m ≤ n. Then β̂⋆α satisfies
β̂⋆_{α,m} = [ Σ_{i=1}^m ω⋆αi x⋆αi x⋆′αi ]⁻¹ Σ_{i=1}^m ω⋆αi x⋆αi y⋆i
with ω⋆αi = ρ′1(r⋆αi/σ̂⋆n)/r⋆αi and r⋆αi = y⋆i − β̂⋆′_{α,m} x⋆αi
Robust variable selection: segmentation
Fast and robust bootstrap
Recall the weighted least squares representation of the MM-estimator from the previous slide.
Let (y⋆i, x⋆αi), i = 1, . . . , m be a bootstrap sample of size m ≤ n. Define β̂^{1,⋆}_{α,m} by
β̂^{1,⋆}_{α,m} = [ Σ_{i=1}^m ω⋆αi x⋆αi x⋆′αi ]⁻¹ Σ_{i=1}^m ω⋆αi x⋆αi y⋆i
with ω⋆αi = ρ′1(r⋆αi/σ̂n)/r⋆αi and r⋆αi = y⋆i − β̂′α,n x⋆αi
Note that β̂α,n and σ̂n are not recalculated!
Robust variable selection: segmentation
Fast and robust bootstrap
The estimates β̂^{1,⋆}_{α,m} will under-estimate the variability of the completely recalculated estimates β̂⋆_{α,m}
→ a correction is needed
The fast and robust bootstrap estimates β̂^{R⋆}_{α,m} are given by
β̂^{R⋆}_{α,m} = β̂α,n + Kα,n ( β̂^{1,⋆}_{α,m} − β̂α,n )
where
Kα,n = σ̂n [ Σ_{i=1}^n ρ″1(rαi/σ̂n) xαi x′αi ]⁻¹ Σ_{i=1}^n ωαi xαi x′αi
Note that Kα,n is computed only once, for the original sample.
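A simplified sketch of these two slides for a fixed submodel: the weights and the correction matrix are computed once from the original MM fit, and each bootstrap replicate costs only one weighted LS solve. This is a sketch under our assumptions (no zero residuals, case resampling); the corresponding correction of the scale estimate is omitted:

```python
import numpy as np

def frb_draws(X, y, beta_hat, sigma_hat, psi1, dpsi1, n_boot=1000, m=None):
    """Fast and robust bootstrap sketch for an MM fit.
    beta_hat and its M-scale sigma_hat come from the original sample;
    psi1 (= rho_1') and its derivative dpsi1 are e.g. Tukey functions."""
    rng = np.random.default_rng(0)
    n, _ = X.shape
    m = m or n
    r = y - X @ beta_hat
    u = r / sigma_hat
    w = psi1(u) / r                    # omega_i; assumes no zero residuals
    # Correction matrix, computed once:
    # K = s [sum psi1'(u) x x']^{-1} sum w x x'
    A = X.T @ (dpsi1(u)[:, None] * X)
    K = sigma_hat * np.linalg.solve(A, X.T @ (w[:, None] * X))
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=m)         # resample cases
        Xb, yb, wb = X[idx], y[idx], w[idx]      # weights stay fixed
        b1 = np.linalg.solve(Xb.T @ (wb[:, None] * Xb), Xb.T @ (wb * yb))
        draws.append(beta_hat + K @ (b1 - beta_hat))
    return np.array(draws)
```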
Robust variable selection: segmentation
Properties of fast and robust bootstrap
Computationally efficient: only weighted least squares calculations
Robust: no recalculation of the observation weights
Robust variable selection: segmentation
Consistent model selection
Suppose a true model α0 ⊂ {1, . . . , d} exists and is included in the set A of models considered.
If we select the model that minimizes RPE(α) or PRPE(α), that is
α̂m,n = argmin_{α∈A} RPE(α) and α̃m,n = argmin_{α∈A} PRPE(α),
then, under appropriate regularity conditions, the model selection criteria are consistent in the sense that
lim_{n→∞} P(α̂m,n = α0) = 1 and lim_{n→∞} P(α̃m,n = α0) = 1.
Two conditions have practical consequences:
m = o(n) (m out of n bootstrap)
f(n) = o(n/m)
Robust variable selection: segmentation
Examples
We compare the full model with models selected by backward elimination based on
RPE(α)
PRPE(α) with f(n) = log(n)
RFPE
For each of the models we report RR²a(α), the robust adjusted R².
To compare predictive power we calculated the 5-fold CV trimmed MSPE.
Robust variable selection: segmentation
Example 1: Ozone data
Los Angeles Ozone Pollution Data, 1976.
366 observations (different days) on 9 variables.
Response: temperature (degrees F) at El Monte, CA.
Covariates: measurements of temperature, pressure, humidity, ozone, etc. at other places in CA.
We start from the full quadratic model (d = 45).

Model    size   RR²a     5% trimmed MSPE
Full      45    0.8660   10.78
RFPE      23    0.8174   10.66
α̂m,n      10    0.7583   11.67
α̃m,n       7    0.7643   10.45
Robust variable selection: segmentation
Example 2: Diabetes data
442 observations on 16 variables.
Response: measure of disease progression one year after baseline.
Covariates: 10 baseline variables (age, sex, BMI, blood pressure, ...).
We start from a quadratic model with some interactions (d = 65).

Model    size   RR²a     5% trimmed MSE
Full      65    0.7731   4988.1
RFPE      16    0.6045   2231.2
α̂m,n      11    0.5127   2657.2
α̃m,n       7    0.5302   2497.0
References
◮ Khan, J.A., Van Aelst, S., and Zamar, R.H. (2007). Building a robust linear model with forward selection and stepwise procedures. Computational Statistics and Data Analysis, 52, 239–248.
◮ Khan, J.A., Van Aelst, S., and Zamar, R.H. (2007). Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102, 1289–1299.
◮ Lutz, R.W., Kalisch, M., and Bühlmann, P. (2008). Robustified L2 boosting. Computational Statistics and Data Analysis, 52, 3331–3341.
◮ Maronna, R.A., Martin, D.R., and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
◮ Salibian-Barrera, M. and Van Aelst, S. (2007). Robust model selection using fast and robust bootstrap. Computational Statistics and Data Analysis, 52, 5121–5135.