Robust strategies and model selection
Stefan Van Aelst
Department of Applied Mathematics and Computer Science, Ghent University, Belgium
ERCIM09 - COMISEF/COST Tutorial
Outline
1 Regression model
2 Least squares
3 Manual variable selection approach
4 Automatic variable selection approach
5 Robustness
6 Robust variable selection: sequencing
7 Robust variable selection: segmentation
Regression model
Regression setting
Consider a dataset Zn = {(yi, xi1, . . . , xid) = (yi, xi); i = 1, . . . , n} ⊂ R^(d+1).
Y is the response variable
X1, . . . ,Xd are the candidate regressors
The corresponding linear model is:
yi = β1 xi1 + · · · + βd xid + εi, i = 1, . . . , n
yi = x′i β + εi, i = 1, . . . , n
where the errors εi are assumed to be i.i.d. with E(εi) = 0 and Var(εi) = σ² > 0.
Estimate the regression coefficients β from the data.
Least squares
Least squares solution
β̂LS solves min_β Σ_{i=1}^n (yi − x′i β)²

Write X = (x1, . . . , xn)′ and y = (y1, . . . , yn)′.

Then β̂LS solves min_β (y − Xβ)′(y − Xβ)

⇒ β̂LS = (X′X)⁻¹ X′y

ŷ = X β̂LS = X (X′X)⁻¹ X′y = Hy
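As a quick illustration, a minimal numpy sketch of the closed-form solution and the hat matrix, on simulated data (the variable names are ours, not from the slides):

```python
import numpy as np

# Minimal sketch: closed-form LS and the hat matrix on simulated data.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

# beta_LS = (X'X)^{-1} X'y; lstsq is the numerically safer route.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hat matrix H = X (X'X)^{-1} X' maps y onto the fitted values.
H = X @ np.linalg.solve(X.T @ X, X.T)
assert np.allclose(H @ y, X @ beta_ls)
```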
Least squares
Least squares properties
Unbiased estimator: E(β̂LS) = β
Gauss–Markov theorem: LS has the smallest variance among all unbiased linear estimators of β.
Why do variable selection?
Least squares
Expected prediction error
Assume the true regression function is linear: Y|x = f(x) + ε = x′β + ε
Predict the response Y0 at x0: Y0 = x′0 β + ε0 = f(x0) + ε0
Use an estimator of the regression coefficients: β̂
Estimated prediction: f̂(x0) = x′0 β̂
Expected prediction error: E[(Y0 − f̂(x0))²]
Least squares
Expected prediction error
E[(Y0 − f̂(x0))²] = E[(f(x0) + ε0 − f̂(x0))²]
= σ² + E[(f(x0) − f̂(x0))²]
= σ² + MSE(f̂(x0))
σ²: irreducible variance of the new observation Y0
MSE(f̂(x0)): mean squared error of the prediction at x0 by the estimator f̂
Least squares
MSE of a prediction
MSE(f̂(x0)) = E[(f(x0) − f̂(x0))²]
= E[(x′0 (β − β̂))²]
= E[(x′0 (β − E(β̂) + E(β̂) − β̂))²]
= bias(f̂(x0))² + Var(f̂(x0))
LS is unbiased ⇒ bias(f̂(x0)) = 0
LS minimizes Var(f̂(x0)) (Gauss–Markov)
⇒ LS has the smallest MSPE among all linear unbiased estimators
Least squares
LS instability
LS becomes unstable, with large MSPE, if Var(f̂(x0)) is high. This can happen if
many noise variables are among the candidate regressors
the predictors are highly correlated (multicollinearity)
⇒ Improve on the LS MSPE by accepting a little bias in return for a large reduction in variance!
Manual variable selection approach
Manual variable selection
Try to determine the set of the most important regressors
Remove the noise regressors from the model
Avoid multicollinearity
Methods
All subsets
Backward elimination
Forward selection
Stepwise selection
→ choose a selection criterion
Manual variable selection approach
Submodels
Dataset Zn = {(yi, xi1, . . . , xid) = (yi, xi); i = 1, . . . , n} ⊂ R^(d+1).
Let α ⊂ {1, . . . , d} denote the predictors included in a submodel.
The corresponding submodel is:
yi = x′αi βα + εαi, i = 1, . . . , n.
A selected model is considered a good model if
It is parsimonious
It fits the data well
It yields good predictions for similar data
Manual variable selection approach
Some standard selection criteria
Adjusted R²: A(α) = 1 − [RSS(α)/(n − d(α))] / [RSS(1)/(n − 1)]
Mallows' Cp: C(α) = RSS(α)/σ̂² − (n − 2d(α))
Final Prediction Error: FPE(α) = RSS(α)/σ̂² + 2d(α)
AIC: AIC(α) = −2L(α) + 2d(α)
BIC: BIC(α) = −2L(α) + log(n) d(α)
where σ̂ is the residual scale estimate in the "full" model
Manual variable selection approach
Resampling based selection criteria
Consider the (conditional) expected prediction error:
PE(α) = E[ (1/n) Σ_{i=1}^n (zi − x′αi β̂α)² | y, X ],
where zi denotes a new response observed at xi.
Estimates of the PE can be used as a selection criterion.
Estimates can be obtained by cross-validation or the bootstrap.
A more advanced selection criterion takes both goodness-of-fit and PE into account:
PPE(α) = (1/n) Σ_{i=1}^n (yi − x′αi β̂α)² + f(n) d(α) + E[ (1/n) Σ_{i=1}^n (zi − x′αi β̂α)² | y, X ]
Automatic variable selection approach
Automatic variable selection
Try to find a stable model that fits the data well
Shrinkage: constrained least squares optimization
Stagewise forward procedures
Methods
Ridge regression
Lasso
Least Angle regression
L2 Boosting
Elastic Net
Automatic variable selection approach
Lasso
Least Absolute Shrinkage and Selection Operator
β̂lasso = argmin_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^d βj xij )²
subject to ‖β‖1 = Σ_{j=1}^d |βj| ≤ t
0 < t < ‖β̂LS‖1 is a tuning parameter
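For illustration, a small sketch that solves the equivalent penalized (Lagrangian) form min_β Σ (yi − x′iβ)² + λ‖β‖1 by coordinate descent; the function name and the standardization assumptions are ours:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the penalized lasso
    min_b sum_i (y_i - x_i'b)^2 + lam * sum_j |b_j|.
    Assumes standardized columns of X and a centered y (no intercept)."""
    n, d = X.shape
    b = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)             # per-column sums of squares
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual without x_j
            rho = X[:, j] @ r_j
            # Soft-thresholding update for coordinate j.
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
    return b
```

Each value of λ corresponds to some bound t in the constrained form above; larger λ (smaller t) sets more coefficients exactly to zero.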
Automatic variable selection approach
Example: LASSO fits
[Figure: LASSO coefficient paths — standardized coefficients plotted against the model degrees of freedom (Df).]
Automatic variable selection approach
Least angle regression
Standardize the variables.
1 Select x1 such that |cor(y, x1)| = max_j |cor(y, xj)|.
2 Put r = y − γ̂ x1, where γ̂ is determined such that |cor(r, x1)| = max_{j≠1} |cor(r, xj)|.
3 Select x2 corresponding to the maximum above. Determine the equiangular direction b such that x′1 b = x′2 b.
4 Put r = r − γ̂ b, where γ̂ is determined such that |cor(r, x1)| = |cor(r, x2)| = max_{j≠1,2} |cor(r, xj)|.
5 Continue the procedure . . .
Automatic variable selection approach
Properties of LAR
Least angle regression (LAR) selects the predictors in order of importance.
LAR changes the contributions of the predictors gradually, as they are needed.
LAR is very similar to the LASSO and can easily be adjusted to produce the LASSO solution.
LAR only uses the means, variances and correlations of the variables.
LAR is computationally as efficient as LS.
Automatic variable selection approach
Example: LAR fits
[Figure: LAR coefficient paths — standardized coefficients plotted against the model degrees of freedom (Df); the profiles closely resemble the LASSO paths.]
Automatic variable selection approach
L2 boosting
Standardize the variables.
1 Put r = y and F0 = 0
2 Select x1 such that |cor(r, x1)| = max_j |cor(r, xj)|.
3 Update r = y − ν f̂(x1), where 0 < ν ≤ 1 is the step length and f̂(x1) are the fitted values from the LS regression of y on x1. Similarly, update F1 = F0 + ν f̂(x1).
4 Continue the procedure . . .
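A compact sketch of componentwise L2 boosting following these steps (assumptions ours: standardized predictors, centered response; at later stages the current residuals, rather than y itself, are regressed on the selected predictor):

```python
import numpy as np

def l2_boost(X, y, nu=0.1, n_steps=50):
    """Sketch of componentwise L2 boosting.
    Assumes standardized columns of X and a centered y."""
    n, d = X.shape
    r = y.copy()                 # current residuals
    F = np.zeros(n)              # accumulated fit
    beta = np.zeros(d)           # accumulated coefficients
    for _ in range(n_steps):
        # Pick the predictor most correlated with the residuals.
        cors = np.array([np.corrcoef(r, X[:, j])[0, 1] for j in range(d)])
        j = int(np.argmax(np.abs(cors)))
        # LS fit of the residuals on x_j, shrunk by the step length nu.
        g = (X[:, j] @ r) / (X[:, j] @ X[:, j])
        beta[j] += nu * g
        F += nu * g * X[:, j]
        r = y - F
    return beta
```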
Automatic variable selection approach
Sequencing variables
Several selection algorithms sequence the predictors in "order of importance" or screen out the most relevant variables
Forward/stepwise selection
Stagewise forward selection
Penalty methods
Least angle regression
L2 boosting
These methods are computationally very efficient because they are based only on means, variances and correlations.
Robustness
Robustness: Data with outliers
Question: How many partners do men and women desire to have in the next 30 years?
Men: mean = 64.3, median = 1
−→ The mean is sensitive to outliers
−→ The median is robust and thus more reliable
Robustness
Least squares regression
[Figure: scatterplot of log light intensity against log surface temperature (stars data) with the LS fit.]
LS: Minimize Σ_i ri²(β)
Robustness
Outliers
[Figure: the same scatterplot with outliers added; the LS line is pulled toward them.]
Outliers attract LS!
Robustness
Robust regression estimators
[Figure: the scatterplot with both the LS and the robust MM fits; the MM line follows the bulk of the data.]
The robust MM estimator is less influenced by outliers!
Robustness
Robust univariate location estimators
The sample mean X̄n satisfies the equation Σ_{i=1}^n (Xi − X̄n) = 0
The ML estimator θ̂ solves the equation Σ_{i=1}^n ∂/∂θ log fθ(Xi) |_{θ=θ̂} = 0
For a suitable score function ψ, the M-estimator Tn solves the equation Σ_{i=1}^n ψ(Xi − Tn) = 0
Robustness
Univariate location M-estimators
Σ_{i=1}^n ψ(Xi − Tn) = 0
Consistent if ∫ ψ(y) dF(y) = E_F(ψ(y)) = 0
Asymptotic efficiency: (∫ ψ′ dΦ)² / ∫ ψ² dΦ
Robustness: maximal breakdown point (50%) if ψ(y) is bounded!
Robustness
Examples of M-estimators
Sample mean: ψ(t) = t: Unbounded! Efficiency: 100%
Median: ψ(t) = sign(t): Bounded, efficiency: 63.7%
Huber estimator: ψb(t) = min{b, max{t, −b}}
= t if |t| ≤ b, sign(t) b if |t| ≥ b, with b > 0
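A minimal sketch of the Huber ψ and the corresponding location M-estimate computed by iteratively reweighted averaging; the scale is fixed at the MAD, and the default b = 1.345 is a commonly used tuning for roughly 95% efficiency at the normal (assumptions ours):

```python
import numpy as np

def psi_huber(t, b=1.345):
    """Huber psi: the identity clipped at +/- b."""
    return np.clip(t, -b, b)

def huber_location(x, b=1.345, tol=1e-8, max_iter=100):
    """Location M-estimate via iteratively reweighted averaging,
    with the scale fixed at the MAD (assumed nonzero)."""
    x = np.asarray(x, dtype=float)
    s = 1.4826 * np.median(np.abs(x - np.median(x)))  # MAD scale
    t = np.median(x)                                  # robust start
    for _ in range(max_iter):
        u = (x - t) / s
        w = np.ones_like(u)
        nz = u != 0
        w[nz] = psi_huber(u[nz], b) / u[nz]           # weights psi(u)/u
        t_new = np.sum(w * x) / np.sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t
```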
Robustness
Huber psi function
[Figure: the Huber ψ function — linear between −b and b, constant outside.]
Robustness
Tuning the Huber M-estimator
The Huber M-estimator has maximal breakdown point for any b < ∞
→ b can be chosen for good efficiency at Φ
b = 1.37 yields 95% efficiency
→ trade-off between robustness and efficiency!
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
[Figure: dotplot of the 24 copper measurements (scale roughly 5–30 ppm); one extreme value stands out.]
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
Sample mean: 4.28
Sample median: 3.39
Huber M-estimator: 3.21
Robustness
Monotone M-estimates
The Huber M-estimator has a monotone ψ function.
If the function ψ(t) is monotone, then
the equation Σ_{i=1}^n ψ(Xi − Tn) = 0 has a unique solution
Tn is easy to compute
Tn has maximal breakdown point
large outliers still affect the estimate (although the effect remains bounded)
Robustness
Redescending M-estimates
If the function ψ(t) is not monotone, but redescends to zero, then
the equation Σ_{i=1}^n ψ(Xi − Tn) = 0 has multiple solutions
Define ρ(t) such that ρ′(t) = ψ(t); then we need the solution of
min_{Tn} Σ_{i=1}^n ρ(Xi − Tn)
Tn can be more difficult to compute
Tn has maximal breakdown point
The effect of large outliers on the estimate reduces to zero!
Increased robustness against large outliers
Robustness
Redescending M-estimates
A popular family of redescending loss functions is the Tukey biweight (bisquare) family of loss functions:
ρc(t) = t²/2 − t⁴/(2c²) + t⁶/(6c⁴) if |t| ≤ c
ρc(t) = c²/6 if |t| ≥ c
The constant c can be tuned for efficiency
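A short sketch of the biweight loss and its derivative (the redescending ψ); the default c = 4.685 is a commonly used efficiency tuning for regression, an assumption on our part:

```python
import numpy as np

def rho_tukey(t, c=4.685):
    """Tukey biweight loss, bounded at c^2/6 for |t| >= c."""
    a = np.minimum(np.abs(np.asarray(t, dtype=float)), c)
    return a**2 / 2 - a**4 / (2 * c**2) + a**6 / (6 * c**4)

def psi_tukey(t, c=4.685):
    """Its derivative: psi(t) = t (1 - (t/c)^2)^2 inside, 0 outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= c, t * (1 - (t / c) ** 2) ** 2, 0.0)
```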
Robustness
Tukey biweight ρ functions
[Figure: Tukey biweight ρ functions for c = 2, c = 3 and c = ∞ (the unbounded LS limit).]
Robustness
Tukey biweight ψ function
[Figure: Huber versus Tukey biweight ψ functions; the Huber ψ stays constant beyond b, while the biweight ψ redescends to zero beyond c.]
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
Sample mean: 4.28
Sample median: 3.39
Huber M-estimator: 3.21
Tukey biweight M-estimator: 3.16
Robustness
Univariate scale estimators
Example: Copper content (parts per million) in 24 wholemeal flour samples
Standard deviation: 5.30
Median absolute deviation (MAD): Sn = 1.483 · med_i(|Xi − med_j(Xj)|)
MAD: 0.53
−→ The standard deviation is sensitive to outliers
−→ The MAD is robust and thus more reliable
Robustness
M-estimators of scale
An M-estimator of scale is the solution Sn of Σ_{i=1}^n ψ(Xi/Sn) = 0
Symmetric distributions: use symmetric ψ functions. Consistent if ∫ ψ(y) dF(y) = E_F(ψ(y)) = 0
The Tukey biweight loss functions ρc are symmetric. Put b = E_Φ(ρc) and define ψc(t) = ρc(t) − b; then the Tukey biweight M-estimator of scale Sn solves
Σ_{i=1}^n ψc(Xi/Sn) = 0, or equivalently (1/n) Σ_{i=1}^n ρc(Xi/Sn) = b
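A fixed-point sketch of the biweight M-scale, reusing the rho_tukey function sketched earlier; computing b = E_Φ(ρc) numerically keeps the estimate consistent for any c, and c ≈ 1.548 is a common 50%-breakdown tuning (our assumption):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def m_scale(x, c=1.548, max_iter=100, tol=1e-10):
    """Fixed-point iteration for the biweight M-scale: find s with
    mean(rho_c(x/s)) = b, b = E_Phi(rho_c). Assumes x is already
    centered (e.g. residuals) and not degenerate."""
    b = quad(lambda u: rho_tukey(u, c) * norm.pdf(u), -np.inf, np.inf)[0]
    s = 1.4826 * np.median(np.abs(x - np.median(x)))  # MAD as start
    for _ in range(max_iter):
        s_new = s * np.sqrt(np.mean(rho_tukey(x / s, c)) / b)
        if abs(s_new - s) < tol * s:
            return s_new
        s = s_new
    return s
```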
Robustness
Example: Copper content in flour
Copper content (parts per million) in 24 wholemeal flour samples
Standard deviation: 5.30
Median absolute deviation: 0.53
Tukey biweight M-estimator: 0.66
Robustness
Robust regression
Denote by ri(β) = yi − x′i β the residuals corresponding to β.
β̂LS solves min_β Σ_{i=1}^n (yi − x′i β)² = Σ_{i=1}^n ri(β)²
Denote by σ̂(β) = √( Σ_{i=1}^n ri(β)² / (n − d) ) the estimate of the residual scale.
The LS estimator β̂LS then equivalently solves min_β σ̂(β)
⇒ Instead, minimize a robust estimate of the residual scale
Robustness
Least Median of Squares regression
LS: Minimize (1/(n − d)) Σ_{i=1}^n ri(β)² −→ LMS: Minimize med_i ri(β)²
Maximal breakdown point (50%)
Small bias
Slow rate of convergence (n^(−1/3))
Inefficient
Robustness
Least Trimmed Squares regression
LS: Minimize (1/(n − d)) Σ_{i=1}^n ri(β)² −→ LTS: Minimize (1/h) Σ_{i=1}^h (r(β)²)_{i:n}
where (r(β)²)_{1:n} ≤ · · · ≤ (r(β)²)_{n:n} are the ordered squared residuals
Breakdown point is min{h, n − h}/n ≤ 50%
Asymptotically normal
Trade-off robustness–efficiency
Low efficiency (less than 10%)
Robustness
Regression S-estimators
LS: Minimize (1/n) Σ_{i=1}^n ri(β)² −→ S-estimate: Minimize σ̂(β)
where, for each β, σ̂(β) solves (1/n) Σ_i ρc(ri(β)/σ) = b
c determines both robustness and efficiency
Trade-off robustness–efficiency
Breakdown point can be up to 50%
Asymptotically normal
Efficiency can still be low (less than 35%)
Robustness
Regression M-estimators
LS: Minimize Σ_{i=1}^n ri(β)² −→ M-estimate: Minimize Σ_{i=1}^n ρ(ri(β)/σ̂)
or solve Σ_{i=1}^n ψ(ri(β)/σ̂) xi = 0
Requires a robust scale estimate σ̂!
Robustness
MM estimates
LS: Minimize Σ_{i=1}^n ri(β)² −→ MM-estimate: Minimize Σ_{i=1}^n ρ(ri(β)/σ̂)
σ̂ is the S-estimator's M-scale
The M- and S-estimators both use Tukey biweight ρc functions
The S-estimator is tuned for robustness (breakdown point)
The redescending M-estimator is tuned for efficiency
Robustness
MM: loss functions
Standardized Tukey biweight family: ρc(t) = 3t²/c² − 3t⁴/c⁴ + t⁶/c⁶ if |t| ≤ c, and 1 if |t| > c
[Figure: two biweight loss functions, ρ0 with constant c0 and ρ1 with constant c1 > c0.]
ρ0 determines the breakdown point (S-estimator)
ρ1 determines the efficiency (MM-estimator)
Robustness
MM estimates
Same criterion as above: minimize Σ_{i=1}^n ρ(ri(β)/σ̂) with σ̂ the S-estimator's M-scale, ρ0 tuned for the breakdown point and ρ1 for efficiency.
⇒ Highly robust and efficient!
Robustness
Redescending psi function
⋆ A redescending ψ function is needed for robustness, but this implies
Multiple solutions of score equations
Global solution is needed (high breakdown point)
Difficult (time consuming) to compute
Robust variable selection: sequencing
Robust variable selection
Issues
Robust regression estimators are computationally demanding
'Outliers' depend on the model under consideration
High dimensional data: outlying cases?
Our approach: a two-step procedure
Sequencing: construct a reduced sequence of good predictors in an efficient way.
Segmentation: build an optimal model from the reduced set of predictors.
Robust variable selection: sequencing
Sequencing the variables in order of importance
Automatic variable selection methods such as forward/stepwise selection, LAR and L2 boosting are computationally efficient methods to sequence predictors.
These methods are based only on the means, variances and correlations of the data.
⇒ Construct computationally efficient, robust methods to sequence predictors by using computationally efficient and highly robust estimates of center, scale and correlation
Robust variable selection: sequencing
Robust building blocks
Location: Median
Scatter: Median Absolute Deviation
Correlation: Bivariate Winsorization
Correlation: Bivariate M-estimators
Correlation: Gnanadesikan-Kettenring estimators
Robust variable selection: sequencing
Winsorized correlation estimates
1 Robustly standardize the data using median and MAD
2 Transform the data by shifting outliers towards the center
3 Calculate the Pearson correlation of the transformed data
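A minimal sketch of these three steps with a univariate (componentwise) Winsorization; the clipping constant c = 2 and the function names are our choices:

```python
import numpy as np

def winsorized_cor(x, y, c=2.0):
    """Winsorized correlation sketch: robustly standardize with
    median/MAD, clip at +/- c, then take the Pearson correlation."""
    def standardize(v):
        med = np.median(v)
        mad = 1.4826 * np.median(np.abs(v - med))  # assumes mad > 0
        return (v - med) / mad
    u1 = np.clip(standardize(x), -c, c)  # shrink outliers to the center
    u2 = np.clip(standardize(y), -c, c)
    return np.corrcoef(u1, u2)[0, 1]
```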
Robust variable selection: sequencing
Univariate Winsorization
Componentwise transformation: u = ψc(x) = min(max(−c, x), c)
[Figure: the Huber ψ function with c = 2.]
Robust variable selection: sequencing
Univariate Winsorization
Componentwise transformation: u = ψc(x) = min(max(−c, x), c)
[Figure: bivariate data Winsorized componentwise; outlying points are clipped to the square [−c, c] × [−c, c].]
Robust variable selection: sequencing
Bivariate Winsorization
Bivariate transformation:
u = min(√(c/D(x)), 1) · x, with c = F⁻¹_{χ²₂}(0.95)
D(x) = x′ R0⁻¹ x, with R0 an initial bivariate correlation matrix.
[Figure: bivariate data with the tolerance ellipse defined by D(x) = c; points outside the ellipse are shrunken toward the center onto it.]
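A sketch of this transformation; we assume the data matrix has already been robustly standardized (median/MAD) and that an initial correlation matrix R0 is available (e.g. from adjusted Winsorization, next slide):

```python
import numpy as np
from scipy.stats import chi2

def bivariate_winsorize(X, R0, c=None):
    """Bivariate Winsorization sketch: shrink each point x toward the
    origin until D(x) = x' R0^{-1} x <= c, c = 0.95-quantile of chi2_2.
    X is an (n x 2) array of robustly standardized data."""
    if c is None:
        c = chi2.ppf(0.95, df=2)
    R0_inv = np.linalg.inv(R0)
    D = np.einsum('ij,jk,ik->i', X, R0_inv, X)   # D(x_i) for every row
    shrink = np.minimum(np.sqrt(c / np.maximum(D, 1e-12)), 1.0)
    return X * shrink[:, None]

# The Winsorized correlation is then the Pearson correlation of the
# transformed data.
```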
Robust variable selection: sequencing
Initial correlation estimate
Adjusted Winsorization: univariate Winsorization with different tuning constants in different quadrants.
Denote by h the ratio of the number of observations in the second and fourth quadrants to the number in the first and third quadrants.
Suppose h ≤ 1; then
use the constant c1 for Winsorizing points in the first and third quadrants,
use c2 = √h · c1 for the second and fourth quadrants.
R0 is the correlation matrix of the adjusted Winsorized data.
Robust variable selection: sequencing
Initial correlation estimate
Adjusted Winsorization: univariate Winsorization with different tuning constants in different quadrants.
[Figure: bivariate data with the two Winsorization squares; the larger constant is used in the quadrants containing most of the data.]
Robust variable selection: sequencing
Initial correlation estimate
Univariate Winsorization (for comparison)
[Figure: the same data Winsorized with a single univariate tuning constant.]
Robust variable selection: sequencing
Correlation M-estimators
1 First center the two variables using their medians
2 An M-estimate of the covariance matrix is the solution V of the equation
(1/n) Σ_i u2(di²) xi x′i = V,
where di² = x′i V⁻¹ xi and u2(t) = min(χ²₂(0.99)/t, 1)
3 Calculate the correlation corresponding to the bivariate covariance matrix V
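A fixed-point sketch of this bivariate M-estimator; the starting value (a diagonal matrix of squared MADs) and the numerical guards are our choices:

```python
import numpy as np
from scipy.stats import chi2

def m_cov_bivariate(X, max_iter=50, tol=1e-8):
    """Fixed-point sketch of the covariance M-estimator:
    V = (1/n) sum_i u2(d_i^2) x_i x_i', d_i^2 = x_i' V^{-1} x_i,
    u2(t) = min(chi2_2(0.99)/t, 1). Assumes the two columns of X
    were centered by their medians."""
    q = chi2.ppf(0.99, df=2)
    mads = 1.4826 * np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    V = np.diag(mads ** 2)                  # simple robust start
    for _ in range(max_iter):
        d2 = np.einsum('ij,jk,ik->i', X, np.linalg.inv(V), X)
        w = np.minimum(q / np.maximum(d2, 1e-12), 1.0)
        V_new = (w[:, None] * X).T @ X / X.shape[0]
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])
```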
Robust variable selection: sequencing
Gnanadesikan-Kettenring correlation estimators
Consider the identity
cov(X, Y) = (1/4) (sd(X + Y)² − sd(X − Y)²)
Replace the sample standard deviations by robust estimates of scale to obtain robust correlation estimates.
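A short sketch using the MAD as the robust scale; standardizing both variables by the same robust scale first turns the robust covariance into a correlation (the clipping to [−1, 1] is our safeguard, since the robust version need not stay inside that range):

```python
import numpy as np

def mad(v):
    """MAD, scaled for consistency at the normal."""
    return 1.4826 * np.median(np.abs(v - np.median(v)))

def gk_cor(x, y, scale=mad):
    """Gnanadesikan-Kettenring sketch: apply the identity
    cov(X,Y) = (scale(X+Y)^2 - scale(X-Y)^2)/4 to robustly
    standardized variables, so the result is a correlation."""
    u, v = x / scale(x), y / scale(y)
    r = (scale(u + v) ** 2 - scale(u - v) ** 2) / 4.0
    return float(np.clip(r, -1.0, 1.0))
```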
Robust variable selection: sequencing
Robust correlations: Computational efficiency
[Figure: CPU time against sample size (10,000–50,000) for the robust correlation estimates: univariate Winsorization, adjusted Winsorization, bivariate Winsorization and the Maronna M-estimator.]
Robust variable selection: sequencing
Robust LAR: Computational efficiency
The computational efficiency of the correlation estimates largely determines the computing time of robust LAR.
[Figure: CPU time against dimension (50–300) for LARS, W-RLARS (Winsorization based) and M-RLARS (M-estimator based).]
Robust variable selection: sequencing
Bootstrapping the sequencing algorithms
Use bootstrap averages to obtain more reliable and stable sequences. Procedure:
1 Generate 50 bootstrap samples
2 Sequence the predictors in each sample
3 Rank the predictors according to their average rank over the bootstrap samples
Not all predictors have to be ranked in each bootstrap sample.
Robust variable selection: sequencing
Bootstrap effect on robust LAR
Simulation design
Samples of size 150 in 200 dimensions
10 target predictors
20 noise covariates correlated with target predictors
170 independent noise covariates
10% of symmetric or asymmetric high leverage outliers
We compare with random forests, using variable importance measures to sequence the variables.
Robust variable selection: sequencing
Bootstrap RLAR vs RLAR/Random Forests
[Figure: number of target variables recovered against number of sequenced variables, for symmetric (left panel) and asymmetric (right panel) high leverage outliers; methods: B-RLARS, RLARS, RF-OOB, RF-IMP.]
Robust variable selection: sequencing
Example: Demographic data
n = 50 states of USA, d = 25 covariates.
Response y = murder rate
One outlier
5-fold cross validation selects a model with 7 variables
We sequence the variables using B-RLARS and construct a learning curve:
a graphical tool to select the size of the reduced sequence in practice,
based on a robust R² measure, e.g. R² = 1 − Med(residual²) / MAD²(y)
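A one-line implementation of this robust R², using the 1.483-scaled MAD defined earlier (function name ours):

```python
import numpy as np

def robust_r2(residuals, y):
    """Robust R^2 sketch from the slide: 1 - med(residual^2)/MAD(y)^2."""
    mad_y = 1.4826 * np.median(np.abs(y - np.median(y)))
    return 1.0 - np.median(np.asarray(residuals) ** 2) / mad_y ** 2
```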
Robust variable selection: sequencing
Demographic data: learning curve
[Figure: learning curve — robust learning rate against the number of variables in the model; the curve levels off after roughly a dozen variables.]
⇒ Reduced set of at most 12 predictors
Robust variable selection: sequencing
Demographic data: models
Full CV model: 7 predictors
B-RLAR+CV: 6 predictors
LAR+CV: 8 predictors
RF-SEL: 5 predictors
RF-SEL+CV: 4 predictors
RF-RED+CV: 5 predictors
MSVM-RFE: 8 predictors
MSVM-RFE+CV: 6 predictors
Robust variable selection: sequencing
Demographic data: model comparison
Density estimates based on 1000 5-fold CV-MSPE estimates.
[Figure: density estimates of the 5-fold CV-MSPE for Full-CV, LARS+CV, B-RLARS+CV, RF-SEL+CV, RF-RED+CV and MSVM-RFE.]
Robust variable selection: sequencing
Example: Protein data
n = 4141 protein sequences, d = 77 covariates.
Training sample of size 2072 and test sample of size 2069. We selected predictors using
B-RLAR: 5 predictors
RF using OOB importance: 22 predictors
MSVM-RFE: 22 predictors
For RF we could determine an optimal submodel in the reduced sequence using robust MM-estimates with robust FPE. ⇒ RF+RFPE: 18 predictors
Robust variable selection: sequencing
Protein data: test sample errors
Trimmed means of squared prediction errors
Model       1%       5%       10%    (trimming fraction)
B-RLAR      116.19   97.73    84.67
RF          111.11   93.80    81.30
RF-RFPE     111.30   93.92    81.27
MSVM-RFE    173.70   150.48   133.17
Robust variable selection: sequencing
Example: Particle data
Quantum physics data with d = 64 predictors.
Training sample of size 5,000, test sample of size 45,000.
Forward selection (FS) and stepwise selection (SW) produced a model with 25 predictors.
Robust FS and SW produced a model with only 1 predictor.
Indeed, for more than 80% of the cases X1 = Y = 0.
For the cases with X1 ≠ 0, FS produced a model with 5 predictors.
We fit the final models using MM-estimators.
Robust variable selection: sequencing
Particle data: test sample errors
Trimmed means of squared prediction errors
Model        1%      5%     (trimming fraction)
FS           0.110   0.012
Robust FS    0.032   0.001
Robust variable selection: segmentation
Segmentation: Robust adjusted R-squared
Adjusted R²: A(α) = 1 − [RSS(α)/(n − d(α))] / [RSS(1)/(n − 1)]
Based on a robust regression estimator we can construct a robust adjusted R²:
RR²a(α) = 1 − [σ̂²α / (n − d(α))] / [σ̂²0 / (n − 1)],
σ̂α is the robust residual scale of the submodel with predictors indexed by α
σ̂0 is the robust residual scale of the intercept-only model
Robust variable selection: segmentation
Segmentation: Robust FPE
FPE(α) = RSS(α)/σ̂² + 2d(α) estimates the final prediction error
FPE(α) = (1/σ²) Σ_{i=1}^n E[(zi − x′αi β̂α)²], assuming that the model is correct.
Consider now the robust final prediction error:
RFPE(α) = Σ_{i=1}^n E[ ρ( (zi − x′αi β̂α)/σ ) ]. Assuming that the model is correct and using a second-order Taylor expansion, this can be estimated by
RFPE(α) = Σ_{i=1}^n ρ(ri(β̂α)/σ̂n) + d(α) · [ Σ_{i=1}^n ψ²(ri(β̂α)/σ̂n) / Σ_{i=1}^n ψ′(ri(β̂α)/σ̂n) ]
σ̂n is the robust scale estimate of a 'full' model αf. Usually αf = {1, . . . , d}.
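Given the residuals of a candidate submodel, this estimate is a one-liner; the argument names are ours, and rho, psi and dpsi could be the Tukey biweight functions sketched earlier:

```python
import numpy as np

def rfpe(residuals, sigma_full, d_alpha, rho, psi, dpsi):
    """Robust FPE sketch: sum rho(r_i/s) + d(alpha) * sum psi^2 / sum psi',
    with s the robust residual scale of the 'full' model."""
    u = np.asarray(residuals) / sigma_full
    return np.sum(rho(u)) + d_alpha * np.sum(psi(u) ** 2) / np.sum(dpsi(u))
```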
Robust variable selection: segmentation
Robust resampling based selection criteria
Robust equivalents of the resampling based selection criteria:
RPE(α) = (σ̂²n / n) E⋆[ Σ_{i=1}^n ρ( (zi − x′αi β̂α)/σ̂n ) | y, X ]
PRPE(α) = (σ̂²n / n) { Σ_{i=1}^n ρ( (yi − x′αi β̂α)/σ̂n ) + f(n) d(α) } + Mn(α)
ρ is the MM loss function and β̂α,n is the MM estimate
f(n) d(α) is the penalty term, with e.g. f(n) = 2 log n
σ̂n is the robust scale estimate of a 'full' model αf. Usually αf = {1, . . . , d}
E⋆ is a robust resampling estimate of the expected value
Robust variable selection: segmentation
Robustness and resampling
Resampling robust estimators causes problems with
robustness
speed
Stratified bootstrap (Müller and Welsh, JASA, 2005) only solves the first problem.
−→ Limited practical use.
The fast and robust bootstrap solves both problems.
Robust variable selection: segmentation
MM-estimators revisited
For the model comparison we use slightly adjusted MM-estimators. The MM-estimates β̂α satisfy
(1/n) Σ_{i=1}^n ψ1( (yi − x′αi β̂α)/σ̂n ) xαi = 0,
where σ̂n minimizes the M-scale σ̂n(β), which for any β ∈ R^d is defined as the solution of
(1/n) Σ_{i=1}^n ρ0( (yi − x′i β)/σ̂n(β) ) = b
ρ0 determines the breakdown point (S-estimator)
ρ1 determines the efficiency (MM-estimator)
Robust variable selection: segmentation
Bootstrapping MM-estimates
Weighted least squares representation of the MM-estimator:
β̂α,n = [ Σ_{i=1}^n ωαi xαi x′αi ]⁻¹ Σ_{i=1}^n ωαi xαi yi
with ωαi = ρ′1(rαi/σ̂n)/rαi and rαi = yi − β̂′α,n xαi
Let (y⋆i, x⋆αi), i = 1, . . . , m be a bootstrap sample of size m ≤ n. Then β̂⋆α satisfies
β̂⋆_{α,m} = [ Σ_{i=1}^m ω⋆αi x⋆αi x⋆′αi ]⁻¹ Σ_{i=1}^m ω⋆αi x⋆αi y⋆i
with ω⋆αi = ρ′1(r⋆αi/σ̂⋆n)/r⋆αi and r⋆αi = y⋆i − β̂⋆′_{α,m} x⋆αi
Robust variable selection: segmentation
Fast and robust bootstrap
Recall the weighted least squares representation of the MM-estimator from the previous slide.
Let (y⋆i, x⋆αi), i = 1, . . . , m be a bootstrap sample of size m ≤ n. Define β̂^{1,⋆}_{α,m} by
β̂^{1,⋆}_{α,m} = [ Σ_{i=1}^m ω⋆αi x⋆αi x⋆′αi ]⁻¹ Σ_{i=1}^m ω⋆αi x⋆αi y⋆i
with ω⋆αi = ρ′1(r⋆αi/σ̂n)/r⋆αi and r⋆αi = y⋆i − β̂′α,n x⋆αi
Note that β̂α,n and σ̂n are not recalculated!
Robust variable selection: segmentation
Fast and robust bootstrap
The estimates β̂^{1,⋆}_{α,m} will under-estimate the variability of the completely recalculated estimates β̂⋆_{α,m}
→ a correction is needed
The fast and robust bootstrap estimates β̂^{R⋆}_{α,m} are given by
β̂^{R⋆}_{α,m} = β̂α,n + Kα,n ( β̂^{1,⋆}_{α,m} − β̂α,n )
where
Kα,n = σ̂n [ Σ_{i=1}^n ρ″1(rαi/σ̂n) xαi x′αi ]⁻¹ Σ_{i=1}^n ωαi xαi x′αi
Note that Kα,n is computed only once, for the original sample.
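A simplified sketch of these two slides for a fixed submodel: the weights and the correction matrix are computed once from the original MM fit, and each bootstrap replicate costs only one weighted LS solve. This is a sketch under our assumptions (no zero residuals, case resampling); the corresponding correction of the scale estimate is omitted:

```python
import numpy as np

def frb_draws(X, y, beta_hat, sigma_hat, psi1, dpsi1, n_boot=1000, m=None):
    """Fast and robust bootstrap sketch for an MM fit.
    beta_hat and its M-scale sigma_hat come from the original sample;
    psi1 (= rho_1') and its derivative dpsi1 are e.g. Tukey functions."""
    rng = np.random.default_rng(0)
    n, _ = X.shape
    m = m or n
    r = y - X @ beta_hat
    u = r / sigma_hat
    w = psi1(u) / r                    # omega_i; assumes no zero residuals
    # Correction matrix, computed once:
    # K = s [sum psi1'(u) x x']^{-1} sum w x x'
    A = X.T @ (dpsi1(u)[:, None] * X)
    K = sigma_hat * np.linalg.solve(A, X.T @ (w[:, None] * X))
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=m)         # resample cases
        Xb, yb, wb = X[idx], y[idx], w[idx]      # weights stay fixed
        b1 = np.linalg.solve(Xb.T @ (wb[:, None] * Xb), Xb.T @ (wb * yb))
        draws.append(beta_hat + K @ (b1 - beta_hat))
    return np.array(draws)
```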
Robust variable selection: segmentation
Properties of fast and robust bootstrap
Computationally efficient: only weighted least squares calculations
Robust: no recalculation of the observation weights
Robust variable selection: segmentation
Consistent model selection
Suppose a true model α0 ⊂ {1, . . . , d} exists and is included in the set A of models considered.
If we select the model that minimizes RPE(α) or PRPE(α), that is
α̂m,n = argmin_{α∈A} RPE(α) and α̃m,n = argmin_{α∈A} PRPE(α),
then, under appropriate regularity conditions, the model selection criteria are consistent in the sense that
lim_{n→∞} P(α̂m,n = α0) = 1 and lim_{n→∞} P(α̃m,n = α0) = 1.
Two conditions have practical consequences:
m = o(n) (m out of n bootstrap)
f(n) = o(n/m)
Robust variable selection: segmentation
Examples
We compare the full model with models selected by backward elimination based on
RPE(α)
PRPE(α) with f(n) = log(n)
RFPE
For each of the models we report RR²a(α), the robust adjusted R².
To compare predictive power we calculated the 5-fold CV trimmed MSPE.
Robust variable selection: segmentation
Example 1: Ozone data
Los Angeles Ozone Pollution Data, 1976.
366 observations (different days) on 9 variables.
Response: temperature (degrees F) at El Monte, CA.
Covariates: measurements of temperature, pressure, humidity, ozone, etc. at other places in CA.
We start from the full quadratic model (d = 45).

Model    size   RR²a     5% trimmed MSPE
Full      45    0.8660   10.78
RFPE      23    0.8174   10.66
α̂m,n      10    0.7583   11.67
α̃m,n       7    0.7643   10.45
Robust variable selection: segmentation
Example 2: Diabetes data
442 observations on 16 variables.
Response: measure of disease progression one year after baseline.
Covariates: 10 baseline variables (age, sex, BMI, blood pressure, ...).
We start from a quadratic model with some interactions (d = 65).

Model    size   RR²a     5% trimmed MSE
Full      65    0.7731   4988.1
RFPE      16    0.6045   2231.2
α̂m,n      11    0.5127   2657.2
α̃m,n       7    0.5302   2497.0
References
◮ Khan, J.A., Van Aelst, S., and Zamar, R.H. (2007). Building a robust linear model with forward selection and stepwise procedures. Computational Statistics and Data Analysis, 52, 239–248.
◮ Khan, J.A., Van Aelst, S., and Zamar, R.H. (2007). Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102, 1289–1299.
◮ Lutz, R.W., Kalisch, M., and Bühlmann, P. (2008). Robustified L2 boosting. Computational Statistics and Data Analysis, 52, 3331–3341.
◮ Maronna, R.A., Martin, D.R., and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
◮ Salibian-Barrera, M. and Van Aelst, S. (2007). Robust model selection using fast and robust bootstrap. Computational Statistics and Data Analysis, 52, 5121–5135.