Model Assessment and Selection: Subset, Ridge, Lasso, and PCR


Page 1: Model Assessment and Selection: Subset, Ridge, Lasso, and PCR

Yuan Yao

Department of Mathematics, Hong Kong University of Science and Technology

Most of the materials here are from Chapters 5-6 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Spring, 2020

Page 2

Yesterday was a bad day...

Page 3

Model Assessment
  Cross Validation
  Bootstrap

Model Selection
  Subset selection
  Ridge Regression
  The Lasso
  Principal Component Regression

Page 4: Outline

Model Assessment
  Cross Validation
  Bootstrap

Model Selection
  Subset selection
  Ridge Regression
  The Lasso
  Principal Component Regression

Page 5: Training error is not sufficient

- Training error is easily computable from the training data.
- Because of the possibility of overfitting, it cannot be used to properly assess the test error.
- It is possible to "estimate" the test error by, for example, making adjustments to the training error.
  – The adjusted R-squared, Cp, AIC, BIC, etc. serve this purpose.
- A general-purpose method for estimating the prediction/test error: validation.

Page 6: Ideal scenario for performance assessment

- In a "data-rich" scenario, we can afford to separate the data into three parts:
  – training data: used to train various models.
  – validation data: used to assess the models and identify the best one.
  – test data: used to test the results of the best model.
- People often also refer to the validation data (hold-out data) as test data.
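A minimal sketch of such a three-way split in Python with NumPy (the 50/25/25 proportions, the seed, and the function name are illustrative assumptions, not from the slides):

import numpy as np

def three_way_split(X, y, train_frac=0.5, val_frac=0.25, seed=0):
    """Randomly split (X, y) into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    n_val = int(val_frac * len(y))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])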

Page 7: Cross validation: overcoming the drawback of the validation set approach

- Our ultimate goal is to produce the model with the best prediction accuracy.
- The validation set approach has the drawback of using ONLY the training data to fit the model.
- The validation data do not participate in model building, only in model assessment.
- A "waste" of data.
- We need more data to participate in model building.

Page 8: K-fold cross validation

- Divide the data into K subsets, usually of equal or similar sizes (about n/K each).
- Treat one subset as the validation set, and the rest together as the training set. Fit the model on the training set, and calculate the test error estimate on the validation set, denoted MSE_i, say.
- Repeat the procedure over every subset.
- Average the above K estimates of the test error to obtain

  CV_(K) = (1/K) Σ_{i=1}^K MSE_i

- Leave-One-Out Cross-Validation (LOOCV) is a special case of K-fold cross validation with K = n, i.e., n-fold cross validation.
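A minimal sketch of this procedure for least-squares regression, written with NumPy (the fold construction and the choice of least squares as the model being assessed are illustrative assumptions):

import numpy as np

def kfold_cv_mse(X, y, K=5, seed=0):
    """Estimate the test MSE of least-squares regression by K-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)  # K disjoint index sets
    mses = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        Xtr = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train_idx], rcond=None)
        Xval = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        mses.append(np.mean((y[val_idx] - Xval @ beta) ** 2))
    return np.mean(mses)  # CV_(K): the average of the K MSE_i estimates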

Page 9: K-fold cross validation

[Figure]

Figure: 5.5. A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

Page 10: Inference of Estimate Uncertainty

- Suppose we have data x_1, ..., x_n, representing the ages of n randomly selected people in HK.
- Use the sample mean x̄ to estimate the population mean µ, the average age of all residents of HK.
- How do we assess the estimation error x̄ − µ? Usually by t-confidence intervals and tests of hypothesis.
- These rely on the normality assumption or the central limit theorem.
- Is there another reliable way?
- Just bootstrap:

Page 11: Model Assessment and Selection: Subset, Ridge, Lasso, and PCR

Boostrap as a resampling procedure.

I Take n random sample (with replacement) from x1, ..., xn.

I calculate the sample mean of the “re-sample”, denoted as x∗1 .

I Repeat the above a large number M times. We havex∗1 , x

∗2 , ..., x

∗M .

I Use the distribution of x∗1 − x , ..., x∗M − x to approximate thatof x − µ.

Model Assessment 11
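A minimal sketch of this resampling in NumPy (M = 1000 and the seed are illustrative choices):

import numpy as np

def bootstrap_mean_errors(x, M=1000, seed=0):
    """Approximate the distribution of x̄ − µ by that of x̄* − x̄."""
    rng = np.random.default_rng(seed)
    xbar = x.mean()
    # M bootstrap resamples of size n, each drawn with replacement
    resamples = rng.choice(x, size=(M, len(x)), replace=True)
    return resamples.mean(axis=1) - xbar  # the M values x̄*_j − x̄

For instance, a basic bootstrap 95% interval for µ runs from x̄ − q_0.975 to x̄ − q_0.025, where the q's are quantiles of the returned errors.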

Page 12

- Essential idea: treat the data distribution (more professionally called the empirical distribution) as a proxy of the population distribution.
- Mimic the data generation from the true population by resampling from the empirical distribution.
- Mimic your statistical procedure (such as computing an estimate x̄) on the data by doing the same on the resampled data.
- Evaluate your statistical procedure (which may be difficult because it involves randomness and the unknown population distribution) by evaluating the analogous procedures on the re-samples.

Page 13: Example

- X and Y are two random variables. The minimizer of var(αX + (1 − α)Y) is

  α = (σ_Y^2 − σ_XY) / (σ_X^2 + σ_Y^2 − 2σ_XY)

- Data: (X_1, Y_1), ..., (X_n, Y_n).
- We can compute sample variances and covariances.
- Estimate α by

  α̂ = (σ̂_Y^2 − σ̂_XY) / (σ̂_X^2 + σ̂_Y^2 − 2σ̂_XY)

- How do we evaluate α̂ − α? (Remember α̂ is random and α is unknown.)
- Use the bootstrap.

Page 14: Example

- Sample n observations with replacement from (X_1, Y_1), ..., (X_n, Y_n), and compute the sample variances and covariance for this resample. Then compute

  α̂* = ((σ̂*_Y)^2 − σ̂*_XY) / ((σ̂*_X)^2 + (σ̂*_Y)^2 − 2σ̂*_XY)

- Repeat this procedure to obtain α̂*_1, ..., α̂*_M for a large M.
- Use the distribution of α̂*_1 − α̂, ..., α̂*_M − α̂ to approximate the distribution of α̂ − α.
- For example, we can use

  (1/M) Σ_{j=1}^M (α̂*_j − α̂)^2

  to estimate E(α̂ − α)^2.
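A sketch of this bootstrap in NumPy (M, the seed, and the function names are illustrative assumptions):

import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight α."""
    c = np.cov(x, y)  # 2x2 sample covariance matrix
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

def bootstrap_alpha_mse(x, y, M=1000, seed=0):
    """Bootstrap estimate of E(α̂ − α)^2 via (1/M) Σ_j (α̂*_j − α̂)^2."""
    rng = np.random.default_rng(seed)
    a_hat = alpha_hat(x, y)
    a_stars = np.empty(M)
    for j in range(M):
        idx = rng.integers(0, len(x), size=len(x))  # resample pairs with replacement
        a_stars[j] = alpha_hat(x[idx], y[idx])
    return np.mean((a_stars - a_hat) ** 2)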

Page 15

[Figure]

Figure: 5.10. Left: A histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. Center: A histogram of the estimates of α obtained from 1,000 bootstrap samples from a single data set. Right: The estimates of α displayed in the left and center panels are shown as boxplots. In each panel, the pink line indicates the true value of α.

Page 16

Original Data (Z):

Obs   X     Y
1     4.3   2.4
2     2.1   1.1
3     5.3   2.8

Z*_1 (yields α̂*_1):

Obs   X     Y
3     5.3   2.8
1     4.3   2.4
3     5.3   2.8

Z*_2 (yields α̂*_2):

Obs   X     Y
2     2.1   1.1
3     5.3   2.8
1     4.3   2.4

...

Z*_B (yields α̂*_B):

Obs   X     Y
2     2.1   1.1
2     2.1   1.1
1     4.3   2.4

Page 17

Figure: 5.11. A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set contains n observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of α.

Page 18: Outline

Model Assessment
  Cross Validation
  Bootstrap

Model Selection
  Subset selection
  Ridge Regression
  The Lasso
  Principal Component Regression

Page 19: Interpretability vs. Prediction

[Figure: interpretability (low to high) against flexibility (low to high). Subset Selection and Lasso sit at the most interpretable, least flexible end; then Least Squares; then Generalized Additive Models and Trees; Bagging, Boosting, and Support Vector Machines sit at the most flexible, least interpretable end.]

Figure: 2.7. As models become more flexible, interpretability drops. Occam's Razor principle: everything should be kept as simple as possible, but not simpler (attributed to Albert Einstein).

Page 20: About this chapter

- The linear model was already addressed in detail in Chapter 3:

  Y = β_0 + β_1 X_1 + ... + β_p X_p + ε

- Model assessment: cross-validation (prediction) error, in Chapter 5.
- This chapter is about model selection for linear models.
- The model selection techniques can be extended beyond linear models.
- Details about AIC, BIC, and Mallow's Cp were mentioned in Chapter 3.

Page 21: Feature/variable selection

- Not all available input variables are useful for predicting the output.
- Keeping redundant inputs in the model can lead to poor prediction and poor interpretation.
- We consider three ways of variable/model selection:
  – Subset selection.
  – Shrinkage/regularization: shrinking some regression coefficients towards 0, e.g. ridge and the lasso.
  – Dimension reduction: using "derived inputs" obtained by, for example, the principal component approach.

Page 22: Best subset selection

- Exhaust all possible combinations of inputs.
- With p variables, there are 2^p distinct combinations.
- Identify the best model among these models.
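A brute-force sketch in Python (scoring each subset of a given size by training RSS to find the best model of that size; the comparison across sizes would then use validation or the criteria discussed later; all names are illustrative):

import numpy as np
from itertools import combinations

def rss(X, y, cols):
    """Training RSS of least squares (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def best_subset(X, y):
    """For each size k, the subset of predictors with the smallest training RSS."""
    p = X.shape[1]
    return {k: min(combinations(range(p), k), key=lambda c: rss(X, y, list(c)))
            for k in range(1, p + 1)}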

Page 23: Pros and Cons of best subset selection

- Seems straightforward to carry out.
- Conceptually clear.
- But the search space is too large (2^p models), which may lead to overfitting.
- Computationally infeasible: too many models to run.
- If p = 20, there are 2^20 > 1,000,000 models.

Page 24: Forward stepwise selection

- Start with the null model.
- Find the best one-variable model.
- From the best one-variable model, add one more variable to get the best two-variable model.
- From the best two-variable model, add one more variable to get the best three-variable model.
- ...
- Find the best among all these best k-variable models.
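A minimal sketch of this greedy search (using training RSS as the per-step criterion, with the same rss helper as in the best-subset sketch above; the final choice among the best k-variable models would use validation or AIC/BIC):

import numpy as np

def rss(X, y, cols):
    """Training RSS of least squares (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def forward_stepwise(X, y):
    """Greedily build the best k-variable model for each k = 1, ..., p."""
    p = X.shape[1]
    selected, path = [], []
    for _ in range(p):
        rest = [j for j in range(p) if j not in selected]
        best_j = min(rest, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)       # once a variable is in, it stays in
        path.append(list(selected))
    return path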

Page 25: Pros and Cons of forward stepwise selection

- Less computation.
- Fewer models: 1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 models.
- (If p = 20, only 211 models, compared with more than 1 million models for best subset selection.)
- No problem for the first n steps even if p > n.
- Once an input is in, it does not get out.

Page 26: Backward stepwise selection

- Start with the largest model (all p inputs in).
- Find the best (p − 1)-variable model, by removing one variable from the largest model.
- Find the best (p − 2)-variable model, by removing one variable from the best (p − 1)-variable model.
- Find the best (p − 3)-variable model, by removing one variable from the best (p − 2)-variable model.
- ...
- Find the best 1-variable model, by removing one variable from the best 2-variable model.
- The null model.

Page 27: Pros and Cons of backward stepwise selection

- Less computation.
- Fewer models: 1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 models.
- (If p = 20, only 211 models, compared with more than 1 million models for best subset selection.)
- Once an input is out, it does not get back in.
- Not applicable to the case with p > n.

Page 28: Find the best model based on prediction error

- General approach: validation/cross-validation (addressed in ISLR Chapter 5).
- Model-based approach: adjusted R^2, AIC, BIC, or Cp (ISLR Chapter 3).

Page 29: R-squared

- Residual:

  ε̂_i = y_i − β̂_0 − Σ_{j=1}^p β̂_j x_ij

- Residual Sum of Squares as the training error:

  RSS = Σ_{i=1}^n ε̂_i^2 = Σ_{i=1}^n (y_i − β̂_0 − Σ_{j=1}^p β̂_j x_ij)^2

- R-squared:

  R^2 = 1 − RSS/TSS

  where the Total Sum of Squares TSS := Σ_{i=1}^n (y_i − ȳ)^2.

Page 30: Example: Credit data

[Figure: RSS (left) and R^2 (right) against the number of predictors]

Figure: 6.1. For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R^2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R^2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.

Page 31: The issues of R-squared

- The R-squared is the percentage of the total variation in the response explained by the inputs.
- The R-squared reflects the training error.
- However, a model with larger R-squared is not necessarily better than another model with smaller R-squared when we consider the test error!
- If model A contains all the inputs of model B, then model A's R-squared is always at least as large as that of model B.
- If model A's additional inputs are entirely uncorrelated with the response, model A contains more noise than model B. As a result, prediction based on model A would inevitably be poorer or no better.

Page 32: a) Adjusted R-squared

- The adjusted R-squared, taking into account the degrees of freedom, is defined as

  adjusted R^2 = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)]

- With more inputs, R^2 always increases, but the adjusted R^2 could decrease, since adding inputs is penalized through the smaller degrees of freedom of the residuals.
- Maximizing the adjusted R-squared is equivalent to minimizing RSS/(n − p − 1).

Page 33: b) Mallow's Cp

- Recall that our linear model (2.1) has p covariates, and σ̂^2 = RSS/(n − p − 1) is the unbiased estimator of σ^2.
- Suppose we use only d predictors, and RSS(d) is the residual sum of squares for the linear model with d predictors.
- Mallow's Cp statistic is defined as

  Cp = (RSS(d) + 2d σ̂^2) / n

- The smaller Mallow's Cp is, the better the model is.
- The following AIC is more often used, though Mallow's Cp and AIC usually give the same best model.

Page 34: c) AIC

- AIC stands for Akaike information criterion, which is defined as

  AIC = (RSS(d) + 2d σ̂^2) / (n σ̂^2),

  for a linear model with d ≤ p predictors, where σ̂^2 = ‖y − ŷ‖^2/(n − p − 1) is the unbiased estimator of σ^2 using the full model.
- AIC aims at maximizing the expected predictive likelihood (or minimizing the expected predictive error). The model with the smallest AIC is preferred.

Page 35: d) BIC

- BIC stands for Schwarz's Bayesian information criterion, which is defined as

  BIC = (RSS(d) + d σ̂^2 log(n)) / (n σ̂^2),

  for a linear model with d inputs.
- The model with the smallest BIC is preferred. The derivation of BIC comes from Bayesian statistics, and BIC has a Bayesian interpretation. It is formally similar to AIC, but penalizes models with more inputs more heavily.
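A sketch that computes the four criteria above for a given subset of predictors, directly following the formulas on these pages (the full-model estimate σ̂^2 follows the convention on the AIC page; the function names are illustrative):

import numpy as np

def ols_rss(X, y):
    """RSS of least squares with an intercept on design X (n x d)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def selection_criteria(X_full, y, cols):
    """Adjusted R^2, Cp, AIC, BIC for the model using columns `cols`."""
    n, p = X_full.shape
    sigma2 = ols_rss(X_full, y) / (n - p - 1)  # σ̂² from the full model
    d = len(cols)
    rss_d = ols_rss(X_full[:, cols], y)
    tss = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (rss_d / (n - d - 1)) / (tss / (n - 1))
    cp = (rss_d + 2 * d * sigma2) / n
    aic = (rss_d + 2 * d * sigma2) / (n * sigma2)
    bic = (rss_d + d * sigma2 * np.log(n)) / (n * sigma2)
    return adj_r2, cp, aic, bic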

Page 36: Penalized log-likelihood

- In general, AIC/BIC are penalized maximum likelihood criteria; e.g., BIC aims to

  minimize −(log likelihood) + d log(n)/n,

  where the first term is called the deviance (some reserve that term for −2 × log likelihood). In the case of linear regression with normal errors, the deviance is equivalent to log(s^2).

Page 37: Example: credit dataset

Variables   Best subset                      Forward stepwise
one         rating                           rating
two         rating, income                   rating, income
three       rating, income, student          rating, income, student
four        cards, income, student, limit    rating, income, student, limit

TABLE 6.1. The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.

Page 38: Example

[Figure: three panels plotting a model selection criterion against the number of predictors (1 to 11)]

Figure: 6.3. For the Credit data set, three quantities are displayed for the best model containing d predictors, for d ranging from 1 to 11. The overall best model, based on each of these quantities, is shown as a blue cross. Left: Square root of BIC. Center: Validation set errors (75% training data). Right: 10-fold cross-validation errors.

Page 39: The one standard deviation rule

- In the figure above, the model with 6 inputs does not seem to be much better than the models with 4 or 3 inputs.
- Keep in mind Occam's razor: choose the simplest model if the candidates are similar by the other criteria.

Page 40: The one standard deviation rule

- Calculate the standard error of the estimated test MSE for each model size.
- Consider the models whose estimated test MSE is within one standard deviation of the smallest test MSE.
- Among them, select the one with the smallest model size.
- (Applying this rule to the example in Figure 6.3 gives the model with 3 variables.)
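A sketch of the rule, taking as input the per-fold MSEs for each model size (e.g., the output of the K-fold CV on Page 8, run once per size; purely illustrative):

import numpy as np

def one_sd_rule(fold_mses):
    """fold_mses: array of shape (num_sizes, K); row i holds the K CV MSEs
    of the (i+1)-variable model. Return the smallest size whose mean CV MSE
    is within one standard error of the overall minimum."""
    fold_mses = np.asarray(fold_mses)
    K = fold_mses.shape[1]
    means = fold_mses.mean(axis=1)
    ses = fold_mses.std(axis=1, ddof=1) / np.sqrt(K)  # standard errors
    best = means.argmin()
    threshold = means[best] + ses[best]
    return int(np.argmax(means <= threshold)) + 1  # first (smallest) size under threshold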

Page 41: Ridge Regression

- The least squares estimator β̂ minimizes

  RSS = Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)^2

- The ridge regression estimator β̂^R_λ minimizes

  Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)^2 + λ Σ_{j=1}^p β_j^2

  where λ ≥ 0 is a tuning parameter.
- The first term measures goodness of fit: the smaller, the better.
- The second term, λ Σ_{j=1}^p β_j^2, is called the shrinkage penalty, which shrinks β_j towards 0.
- The shrinkage reduces variance (at the cost of increased bias)!

Page 42: Tuning parameter λ

- λ = 0: no penalty, β̂^R_0 = β̂^LS.
- λ = ∞: infinite penalty, β̂^R_∞ = 0.
- Large λ: heavy penalty, more shrinkage of the estimator.
- Note that β_0 is not penalized.

Page 43: Remark

- If p > n, ridge regression can still perform well by trading a small increase in bias for a large decrease in variance.
- Ridge regression works best in situations where the least squares estimates have high variance.
- Ridge regression also has substantial computational advantages:

  β̂^R_λ = (X^T X + λ I)^{−1} X^T y

  where I is the (p + 1) × (p + 1) diagonal matrix with diagonal elements (0, 1, 1, ..., 1).
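A direct implementation of this closed form (assuming the columns of X do not include the intercept; the code appends a column of ones and, matching the slide, leaves β_0 unpenalized):

import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X^T X + λ I)^{-1} X^T y with an unpenalized intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])  # prepend the intercept column
    I_pen = np.eye(p + 1)
    I_pen[0, 0] = 0.0                     # diagonal (0, 1, 1, ..., 1): β_0 unpenalized
    return np.linalg.solve(A.T @ A + lam * I_pen, A.T @ y)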

Page 44: Example: Ridge Regularization Path in Credit data

[Figure: standardized ridge coefficient paths for Income, Limit, Rating, and Student]

Figure: 6.4. The standardized ridge regression coefficients are displayed for the Credit data set, as a function of λ and ‖β̂^R_λ‖_2/‖β̂‖_2. Here ‖a‖_2 = (Σ_{j=1}^p a_j^2)^{1/2}.

Page 45: The Lasso

- Lasso stands for Least Absolute Shrinkage and Selection Operator.
- The lasso estimator β̂^L_λ is the minimizer of

  Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)^2 + λ Σ_{j=1}^p |β_j|

- We may write ‖β‖_1 = Σ_{j=1}^p |β_j|, which is the ℓ1 norm.
- The lasso often shrinks coefficients to be identically 0. (This is not the case for ridge.)
- Hence it performs variable selection, and yields sparse models.
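In practice this objective is solved numerically, e.g. by coordinate descent; scikit-learn's Lasso does exactly that, though note it scales the objective as (1/(2n))·RSS + α‖β‖_1, so its α plays the role of λ/(2n) here. A small usage sketch on synthetic data (all values illustrative):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]       # sparse ground truth
y = X @ beta_true + rng.standard_normal(n)

fit = Lasso(alpha=0.1).fit(X, y)       # alpha controls the penalty strength
print(fit.coef_)                       # many coefficients come out exactly 0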

Page 46: Example: Lasso Path in Credit data

[Figure: standardized lasso coefficient paths for Income, Limit, Rating, and Student]

Figure: 6.6. The standardized lasso coefficients on the Credit data set are shown as a function of λ and ‖β̂^L_λ‖_1/‖β̂‖_1.

Page 47: Another formulation

- For the lasso: minimize

  Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)^2 subject to Σ_{j=1}^p |β_j| ≤ s

- For ridge: minimize

  Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)^2 subject to Σ_{j=1}^p β_j^2 ≤ s

- For ℓ0: minimize

  Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)^2 subject to Σ_{j=1}^p I(β_j ≠ 0) ≤ s

  The ℓ0 method penalizes the number of non-zero coefficients: a difficult (NP-hard) optimization problem.

Page 48: Variable selection property for Lasso

[Figure]

Figure: 6.7. Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, |β_1| + |β_2| ≤ s and β_1^2 + β_2^2 ≤ s, while the red ellipses are the contours of the RSS.

Page 49: Simple cases

- Consider the simple model y_i = β_i + ε_i, i = 1, ..., n, with n = p. Then:
  – Least squares: β̂_j = y_j;
  – Ridge: β̂^R_j = y_j/(1 + λ);
  – Lasso: β̂^L_j = sign(y_j)(|y_j| − λ/2)_+.
- Slightly more generally, suppose the input columns of X are standardized to have mean 0 and variance 1 and are orthogonal. Then

  β̂^R_j = β̂^LSE_j/(1 + λ)
  β̂^L_j = sign(β̂^LSE_j)(|β̂^LSE_j| − λ/2)_+

  for j = 1, ..., p.
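These two shrinkage maps are one-liners to code; a quick sketch to see the qualitative difference (λ and the inputs are illustrative):

import numpy as np

def ridge_shrink(beta_ls, lam):
    """Proportional shrinkage β̂_j / (1 + λ) in the orthogonal design case."""
    return beta_ls / (1.0 + lam)

def lasso_soft_threshold(beta_ls, lam):
    """Soft-thresholding: sign(b) · (|b| − λ/2)_+."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)

b = np.array([-1.2, -0.3, 0.1, 0.8])
print(ridge_shrink(b, lam=1.0))           # every entry shrunk, none exactly 0
print(lasso_soft_threshold(b, lam=1.0))   # small entries set exactly to 0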

Page 50: Example

[Figure: coefficient estimate against y_j for ridge (left) and lasso (right), each compared with the least squares line]

Figure: 6.10. The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1's on the diagonal. Left: The ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates. Right: The lasso coefficient estimates are soft-thresholded towards zero.

Page 51: Dimension reduction methods (using derived inputs)

- When p is large, we may consider regressing not on the original inputs x, but on some small number of derived features φ_1, ..., φ_k with k < p:

  y_i = θ_0 + Σ_{j=1}^k θ_j φ_j(x_i) + ε_i,  i = 1, ..., n.

  – φ_j can be linear: linear combinations of X_1, ..., X_p.
  – φ_j can be nonlinear: basis functions, kernels, neural networks, trees, etc.

Page 52: Principal Component Analysis (PCA)

- Suppose there are n observations of p variables, presented as X = (x_1, ..., x_n)^T ∈ R^{n×p}, where x_i ∈ R^p.
- Define the sample covariance matrix

  Σ = (1/n) Σ_{i=1}^n (x_i − µ)(x_i − µ)^T

  where the sample mean µ = (1/n) Σ_i x_i.
- Σ has an eigenvalue decomposition

  Σ = U Λ U^T,

  with U^T U = I_p (U = [u_1, ..., u_p]), Λ = diag(λ_1, ..., λ_p), and λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0.
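A sketch of this decomposition in NumPy (np.linalg.eigh returns eigenvalues in ascending order, so they are re-sorted to match the convention λ_1 ≥ ... ≥ λ_p):

import numpy as np

def pca_eig(X):
    """Eigendecomposition U, Λ of the sample covariance of X (n x p)."""
    mu = X.mean(axis=0)
    Xc = X - mu                         # center the data
    Sigma = Xc.T @ Xc / len(X)          # p x p sample covariance (1/n convention)
    lams, U = np.linalg.eigh(Sigma)     # ascending eigenvalues
    order = np.argsort(lams)[::-1]      # reorder so λ_1 >= ... >= λ_p
    return lams[order], U[:, order], mu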

Page 53: Principal Component Regression

For X = (X_1, ..., X_p):

- Define φ_j as the projection onto the j-th eigenvector of the centered data:

  Z_j = φ_j(X) = u_j^T (X − µ)

- Principal Component Regression (PCR) model:

  y_i = θ_0 + Σ_{j=1}^k θ_j z_ij + ε_i,  i = 1, ..., n,

  where z_ij is the j-th principal component score of observation i.
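A minimal PCR fit, combining the eigendecomposition above with least squares on the top k component scores (illustrative; in practice k would be chosen by cross-validation):

import numpy as np

def pcr_fit(X, y, k):
    """Principal Component Regression on the top-k principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu
    lams, U = np.linalg.eigh(Xc.T @ Xc / len(X))
    U = U[:, np.argsort(lams)[::-1]]    # columns ordered by decreasing variance
    Z = Xc @ U[:, :k]                   # n x k matrix of scores z_ij
    A = np.column_stack([np.ones(len(y)), Z])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta, U[:, :k], mu

def pcr_predict(theta, Uk, mu, Xnew):
    """Predict on new rows using the fitted PCR parameters."""
    return theta[0] + (Xnew - mu) @ Uk @ theta[1:]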

Page 54: A summary table of PCs

               eigenvalue   eigenvector     percent of variation   P.C.s as projections
               (variance)   (combination    explained              of X − µ
                            coefficient)
1st P.C. Z_1   λ_1          u_1             λ_1 / Σ_{j=1}^p λ_j    Z_1 = u_1'(X − µ)
2nd P.C. Z_2   λ_2          u_2             λ_2 / Σ_{j=1}^p λ_j    Z_2 = u_2'(X − µ)
...            ...          ...             ...                    ...
p-th P.C. Z_p  λ_p          u_p             λ_p / Σ_{j=1}^p λ_j    Z_p = u_p'(X − µ)

- The top k principal components explain the following percentage of the total variation:

  Σ_{j=1}^k λ_j / trace(Σ)