TRANSCRIPT
Model Selection for Linear Models with SAS/STAT Software
Funda Gunes, SAS Institute Inc.
Outline
Introduction
  - Analysis 1: Full least squares model
Traditional model selection methods
  - Analysis 2: Traditional stepwise selection
Customizing the selection process
  - Analyses 3–6
Compare analyses 1–6
Penalized regression methods
Special methods
  - Model averaging
  - Selection for nonparametric models with spline effects
1 / 115
Learning Objectives
You will learn:
Problems with traditional selection methods
Modern penalty-based methods, including LASSO and adaptive LASSO, as alternatives to traditional methods
Bootstrap-based model averaging to reduce selection bias and improve predictive performance
You will learn how to:
Use model selection diagnostics, including graphics, for detecting problems
Use validation data to detect and prevent underfitting and overfitting
Customize the selection process using the features of the GLMSELECT procedure
2 / 115
Introduction
Introduction
With improvements in data collection technologies, regression problems that have large numbers of candidate predictor variables occur in a wide variety of scientific fields and business problems.
“I’ve got all these variables, but I don’t know which ones to use.”
Statistical model selection seeks an answer to this question.
3 / 115
Introduction
Model Selection and Its Goals
Model Selection: Estimating the performance of different models in order to choose the approximately best model.
Goals of model selection:
Simple and interpretable models
Accurate predictions
Model selection is often a trade-off between bias and variance.
4 / 115
Introduction
Graphical Illustration of Bias and Variance
5 / 115
Introduction
Bias-Variance Trade-Off
Suppose Y = f(X) + ε, where ε ∼ N(0, σ_ε^2).
The expected prediction error at a point x is
E[(Y − f̂(x))^2] = Bias^2 + Variance + Irreducible Error
where f̂ is the fitted model and the irreducible error is σ_ε^2.
6 / 115
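The decomposition above can be checked numerically. The following is a minimal sketch (in Python rather than SAS, and not part of the original slides): a deliberately biased, shrunken estimator of f(x0) is simulated many times, and the Monte Carlo estimate of the expected prediction error is compared with Bias^2 + Variance + σ_ε^2.

```python
import random

random.seed(1)
f_x0 = 2.0       # true f(x0)
sigma = 1.0      # noise standard deviation (sigma_eps)
n_train = 10     # training observations per replicate
shrink = 0.5     # a deliberately biased, shrunken estimator of f(x0)

estimates, errors = [], []
for _ in range(20000):
    train = [f_x0 + random.gauss(0, sigma) for _ in range(n_train)]
    fhat = shrink * sum(train) / n_train          # biased estimate of f(x0)
    estimates.append(fhat)
    y_new = f_x0 + random.gauss(0, sigma)         # independent test response
    errors.append((y_new - fhat) ** 2)

mean_est = sum(estimates) / len(estimates)
bias2 = (mean_est - f_x0) ** 2
variance = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
epe = sum(errors) / len(errors)                   # expected prediction error

# epe should be close to bias2 + variance + sigma**2
print(round(epe, 2), round(bias2 + variance + sigma ** 2, 2))
```

Here the shrinkage factor trades variance for bias; the two printed numbers agree up to Monte Carlo error.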
Introduction
The GLMSELECT Procedure
The GLMSELECT procedure implements statistical model selection in the framework of general linear models for selection from a very large number of effects.
Methods include:
Familiar methods such as forward, backward, and stepwise selection
Newer methods such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996)
7 / 115
Introduction
Difficulties of Model Selection
The implementation of model selection can lead to difficulties:
A model selection technique produces a single answer to the variable selection problem, although several different subsets might be equally good for regression purposes.
Model selection might be unduly affected by outliers.
Selection bias
8 / 115
Introduction
The GLMSELECT Procedure
PROC GLMSELECT can partially mitigate these problems with its
Extensive capabilities for customizing the selection
Flexibility and power in specifying complex potential effects
9 / 115
Introduction
Model Specification
The GLMSELECT procedure provides great flexibility for model specification:
Choice of parameterizations for classification effects
Any degree of interaction (crossed effects) and nested effects
Internal partitioning of data into training, validation, and testing roles
Hierarchy among effects
10 / 115
Introduction
Selection Control
The GLMSELECT procedure provides many options for selection control:
Multiple effect selection methods
Selection from a very large number of effects (tens of thousands)
Selection of individual levels of classification effects
Effect selection based on a variety of selection criteria
Stopping rules based on a variety of model evaluation criteria
Leave-one-out and k-fold cross validation
11 / 115
Introduction
Linear Regression Model
Suppose the data arise from a normal distribution with the following statistical model:
Y = f (x) + ε
In linear regression
f (x) = β0 + β1x1 + β2x2 + · · ·+ βpxp
Least squares is the most popular estimation method; it picks the coefficients β = (β0, β1, . . . , βp) that minimize the residual sum of squares:

RSS(β) = ∑_{i=1}^{N} ( y_i − (β_0 + ∑_{j=1}^{p} x_ij β_j) )^2
12 / 115
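For a single predictor, the minimizer of RSS(β) has a closed form. The following Python sketch (an illustration with made-up data values, not from the slides) computes it directly:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]        # roughly y = 2x

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = sxy / sxx                         # slope that minimizes RSS
b0 = ybar - b1 * xbar                  # intercept that minimizes RSS
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(round(b0, 3), round(b1, 3), round(rss, 4))   # slope near 2, small RSS
```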
Introduction
PROC GLMSELECT with Examples
Learn how to use PROC GLMSELECT in model development withexamples:
1 Simulate data
2 Fit full least squares model
3 Perform model selection by using five different approaches
4 Compare the selected models’ performances
13 / 115
Introduction
Simulate Data
data trainingData testData;
drop i j;
array x{20} x1-x20;
do i=1 to 5000;
/* Continuous predictors */
do j=1 to 20;
x{j} = ranuni(1);
end;
/* Classification variables */
c1 = int(1.5+ranuni(1)*7);
c2 = 1 + mod(i,3);
c3 = int(ranuni(1)*15);
yTrue = 2 + 5*x17 - 8*x5 + 7*x9*c2 - 7*x1*x2 + 6*(c1=2) + 5*(c1=5);
y = yTrue + 6*rannor(1);
if ranuni(1) < 2/3 then output trainingData;
else output testData;
end;
run;
Reserves one-third of the data as test data and the remaining two-thirds as
training data
14 / 115
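For readers without SAS, here is a rough Python analogue of the DATA step above (an approximation: Python's random streams differ from SAS's RANUNI/RANNOR, so the realized values and split will not match exactly):

```python
import random

random.seed(1)
training, test = [], []
for i in range(1, 5001):
    x = [random.random() for _ in range(20)]     # x1-x20 (x[j] here is x{j+1} in SAS)
    c1 = int(1.5 + random.random() * 7)          # classification levels 1..8
    c2 = 1 + i % 3
    c3 = int(random.random() * 15)
    y_true = (2 + 5 * x[16] - 8 * x[4] + 7 * x[8] * c2 - 7 * x[0] * x[1]
              + 6 * (c1 == 2) + 5 * (c1 == 5))
    y = y_true + 6 * random.gauss(0, 1)
    # reserve about one-third of the rows as test data
    (training if random.random() < 2 / 3 else test).append((x, c1, c2, c3, y))

print(len(training), len(test))   # roughly a 2:1 split of the 5000 rows
```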
Introduction
Training and Test Data
Use training data to develop a model.
Use test data to assess your model’s predictive performance.
15 / 115
Introduction
Analysis 1: Full Least Squares Model
proc glmselect data=trainingData testdata=testData plots=asePlot;
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection=forward(stop=none);
run;
Because STOP=NONE is specified, the selection proceeds until all the specified
effects are in the model.
16 / 115
Introduction
Dimensions of Full Least Squares Model
Class Level Information
Class Levels Values
c1 8 1 2 3 4 5 6 7 8
c2 3 1 2 3
c3 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Dimensions
Number of Effects 278
Number of Parameters 947
A full model that contains all main effects and their two-way interactions often leads to a large number of effects.
When the classification variables have many levels, the number of parameters available for selection is even larger.
17 / 115
Introduction
Assess Your Model’s Predictive Performance
You can assess the model's predictive performance by comparing the average square error (ASE) on the test data and the training data.
ASE on the test data:

ASE_test = (1/n_test) ∑_{i=1}^{n_test} ( y_test,i − (β̂_0 + ∑_{j=1}^{p} β̂_j x_test,ij) )^2

where the β̂_j's are the least squares estimates obtained by using the observations in the training data.
18 / 115
Introduction
Average Square Error (ASE) Plot
proc glmselect ... plots=asePlot;
[ASE plot: "Progression of Average Squared Errors by Role for y" — average squared error versus selection step (0–250) for the Training and Test roles, with the selected step marked]
19 / 115
Introduction
Overfitting and Variable Selection
So the more variables the better? NO!
Carefully selected variables can improve model accuracy, but adding too many features can lead to overfitting:
Overfitted models describe random error or noise instead of the underlying relationship.
Overfitted models generally have poor predictive performance.
Model selection can prove useful in finding a parsimonious model that has good predictive performance.
20 / 115
Introduction
Model Assessment
Model assessment aims to
1 Choose the number of predictors for a given technique.
2 Estimate the prediction ability of the chosen model.
For both of these purposes, the best approach is to evaluate the procedure on independent test data, if available.
If possible, one should use different test data for (1) and (2): a validation set for (1) and a test set for (2).
21 / 115
Introduction
Model Selection for Linear Regression Models
Suppose you have only two models to compare. Then you can use the following methods for model comparison:
F test
Likelihood ratio test
AIC, SBC, and so on
Cross validation
However, we usually have more than two models to compare!
For a model selection problem with p predictors, there are 2^p models to compare!
22 / 115
Introduction
Alternatives
Compare all possible subsets (all-subsets regression)
  - Computationally expensive
  - Introduces a large selection bias!
Use search algorithms
  - Traditional selection methods: forward, backward, and stepwise
  - Shrinkage and penalty methods
23 / 115
Traditional Selection Methods
TRADITIONAL SELECTION METHODS
24 / 115
Traditional Selection Methods
Traditional Selection Methods
Forward Selection: Begins with just the intercept and at each step adds the effect that shows the largest contribution to the model.
Backward Elimination: Begins with the full model and at each step deletes the effect that shows the smallest contribution to the model.
Stepwise Selection: Modification of the forward selection technique that differs in that effects already in the model do not necessarily stay there.
PROC GLMSELECT extends these methods as implemented in the REG procedure.
The SELECTION= option of the MODEL statement specifies the model selection method.
25 / 115
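The greedy logic of forward selection can be sketched as follows (a Python illustration with simulated data and a crude F-statistic entry threshold — not the actual PROC GLMSELECT or PROC REG implementation):

```python
import random

def lstsq(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    p = len(X[0])
    A = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    v = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        v[c], v[piv] = v[piv], v[c]
        for r in range(c + 1, p):
            m = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= m * A[c][k]
            v[r] -= m * v[c]
    b = [0.0] * p
    for c in reversed(range(p)):
        b[c] = (v[c] - sum(A[c][k] * b[k] for k in range(c + 1, p))) / A[c][c]
    return b

def rss_for(cols, xs, y):
    """RSS of the least squares fit using an intercept plus the listed columns."""
    X = [[1.0] + [row[j] for j in cols] for row in xs]
    b = lstsq(X, y)
    return sum((yi - sum(bj * xij for bj, xij in zip(b, Xi))) ** 2
               for Xi, yi in zip(X, y))

random.seed(2)
n, p = 200, 5
xs = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * row[0] - 2 * row[3] + random.gauss(0, 1) for row in xs]  # x1, x4 active

selected, remaining = [], list(range(p))
current_rss = rss_for(selected, xs, y)
while remaining:
    # greedy step: try each remaining effect and keep the best improvement
    best = min(remaining, key=lambda j: rss_for(selected + [j], xs, y))
    new_rss = rss_for(selected + [best], xs, y)
    df = n - len(selected) - 2           # residual df after adding one effect
    f_stat = (current_rss - new_rss) / (new_rss / df)
    if f_stat < 4.0:                     # crude entry threshold
        break
    selected.append(best)
    remaining.remove(best)
    current_rss = new_rss

print(selected)   # the truly active predictors (indices 0 and 3) enter early
```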
Traditional Selection Methods
Traditional Selection Methods
In traditional selection methods:
The F statistic and the related p-value reflect an effect's contribution to the model.
You choose the predictors and then estimate coefficients by using standard criteria such as least squares or maximum likelihood.
There are problems with the use of both the F statistic and coefficient estimation!
26 / 115
Traditional Selection Methods
Analysis 2: Traditional Stepwise Selection
proc glmselect data=analysisData testdata=testData
plots=(CoefficientPanel(unpack) asePlot Criteria);
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection = stepwise(select=sl);
run;
The SELECT=SL option uses the significance level criterion to determine the
order in which effects enter or leave the model.
27 / 115
Traditional Selection Methods
Selection Summary

Stepwise Selection Summary
Step | Effect Entered | Effect Removed | Number Effects In | Number Parms In | ASE | Test ASE | F Value | Pr > F
0 Intercept 1 1 81.1588 81.4026 0.00 1.0000
1 x9*c2 2 4 52.7431 52.5172 608.97 <.0001
2 x5*c1 3 12 42.1868 43.7057 105.82 <.0001
3 x1*x2 4 13 39.5841 41.9502 222.37 <.0001
4 x17*c1 5 21 36.7616 37.9282 32.38 <.0001
5 c1 6 28 35.7939 36.9287 13.00 <.0001
6 x1*x10 7 29 35.6546 36.8501 13.15 0.0003
7 x9*c1 8 36 35.4158 36.9118 3.24 0.0020
8 x7*c2 9 39 35.2761 37.0700 4.43 0.0041
9 x1*x9 10 40 35.2053 37.1232 6.74 0.0094
10 x9*x12 11 41 35.1544 37.1811 4.86 0.0276
11 x11*x12 12 42 35.1016 37.2530 5.04 0.0249
12 x10*c3 13 57 34.8201 37.6486 1.80 0.0293
13 x8*x11 14 58 34.7753 37.6597 4.31 0.0381
14 x7*x9 15 59 34.7390 37.7003 3.49 0.0620
15 c1*c3 16 171 33.3332 39.8716 1.21 0.0652
16 x3*c1 17 179 33.1678 40.0014 2.00 0.0422
17 c2*c3 18 209 32.7376 40.5367 1.40 0.0749
18 x7*x12 19 210 32.7107 40.4731 2.62 0.1059
19 x3*x12 20 211 32.6825 40.4880 2.75 0.0974
20 x15*c3 21 226 32.4562 40.6799 1.47 0.1060
21 x9*x15 22 227 32.3769 40.7047 7.75 0.0054
22 x2*x17 23 228 32.3523 40.7012 2.41 0.1207
23 x2*x5 24 229 32.2996 40.7218 5.17 0.0231
24 x3*x4 25 230 32.2755 40.6853 2.36 0.1243
25 x4*x12 26 231 32.2474 40.7480 2.75 0.0973
26 x2*x8 27 232 32.2240 40.7776 2.30 0.1294
28 / 115
Traditional Selection Methods
Stopping Details
Stop Details
Candidate For | Effect | Candidate Significance | Compare Significance
Entry | x1*c2 | 0.1536 | > 0.1500 (SLE)
Removal | x2*x8 | 0.1294 | < 0.1500 (SLS)
The stepwise selection terminates when these two conditions are satisfied:
None of the effects outside the model is significant at the entry significance level (SLE=0.15).
Every effect in the model is significant at the stay significance level (SLS=0.15).
29 / 115
Traditional Selection Methods
The Sequence of p-Values at Each Step
p-Values are not monotone increasing.
30 / 115
Traditional Selection Methods
The Selected Model Overfits the Training Data
The default SLE and SLS values of 0.15 produce a model that overfits the training data.
31 / 115
Traditional Selection Methods
Most Other Criteria Suggest Stopping the Selection before the Significance Level Criterion Is Reached
32 / 115
Customizing the Selection Process
CUSTOMIZING THE SELECTION PROCESS
33 / 115
Customizing the Selection Process
Customize the Selection Process by Using Various Criteria
You can use the following options to customize the selection process:
SELECT= criterion
Specifies the order in which effects enter or leave at each step of the specified selection method
STOP= criterion
Specifies when to stop the selection process
CHOOSE= criterion
Specifies the final model from the list of models in the steps of the selection process
34 / 115
Customizing the Selection Process
Example

model ... / selection=forward(select=sbc stop=aic choose=validate);
35 / 115
Customizing the Selection Process
Criteria Based on Likelihood Function
The following criteria are based on the likelihood function and are available for the SELECT=, STOP=, and CHOOSE= options:
ADJRSQ: Adjusted R-square
CP: Mallows' Cp statistic
Fit criteria
  - AIC: Akaike's information criterion
  - AICC: corrected Akaike's information criterion
  - SBC: Schwarz Bayesian information criterion
36 / 115
Customizing the Selection Process
Criteria Based on Estimating the True Prediction Error
The following criteria are based on estimating the true prediction error and are available for the SELECT=, STOP=, and CHOOSE= options:
If you have enough data, set aside a validation data set:
  - VALIDATE: ASE on validation data
If data are scarce, use cross validation:
  - CV: k-fold cross validation
  - PRESS: leave-one-out cross validation
37 / 115
Customizing the Selection Process
Data Roles: Training, Validation, and Testing
Training data
Always used to find parameter estimates
Can also be used to select effects, stop selection, and choose the final model
Validation data
Play a role in the selection process
Can be used for one or more of the following:
  - selecting effects to add or drop (SELECT=VALIDATE)
  - stopping the selection process (STOP=VALIDATE)
  - choosing the final model (CHOOSE=VALIDATE)
Test data
Used to assess predictive performance of models in the selection process, but do not affect the selection process
38 / 115
Customizing the Selection Process
Specifying Data Sets for Different Roles
PROC GLMSELECT statement options:
DATA= specifies training data
VALDATA= specifies validation data
TESTDATA= specifies test data
39 / 115
Customizing the Selection Process
PARTITION Statement
Another way to specify data sets for different roles is to use the PARTITION statement.
It internally partitions the DATA= data set:
Randomly in specified proportions
partition fraction(validate=0.3 test=0.2);
Based on the formatted value of the ROLEVAR= variable
partition rolevar=myVar (train='group a'
                         validate='group b'
                         test='group c');
40 / 115
Customizing the Selection Process
Cross Validation
When data are scarce, setting aside validation or test data is usually not possible.
Cross validation uses part of the training data to fit the model, and a different part to estimate the prediction error.
41 / 115
Customizing the Selection Process
k-Fold Cross Validation
Split the data into k approximately equal-sized parts.
Reserve one part of the data for validation, and fit the model to the remaining k − 1 parts of the data.
Use this model to calculate the prediction error for the reserved part of the data.
Do this for all k parts, and combine the k estimates of the prediction error.
5-Fold cross validation:
42 / 115
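The k-fold procedure described above can be sketched in a few lines (a Python illustration with a single-predictor least squares fit; the data and seed are made up):

```python
import random

random.seed(3)
n, k = 100, 5
data = []
for _ in range(n):
    x = random.uniform(0, 10)
    data.append((x, 2 * x + random.gauss(0, 1)))   # true slope 2, noise sd 1

random.shuffle(data)
folds = [data[i::k] for i in range(k)]             # k roughly equal parts

fold_errors = []
for held_out in folds:
    # fit on the other k-1 parts
    train = [pt for fold in folds if fold is not held_out for pt in fold]
    xbar = sum(x for x, _ in train) / len(train)
    ybar = sum(y for _, y in train) / len(train)
    sxx = sum((x - xbar) ** 2 for x, _ in train)
    b1 = sum((x - xbar) * (y - ybar) for x, y in train) / sxx
    b0 = ybar - b1 * xbar
    # prediction error on the reserved part
    ase = sum((y - (b0 + b1 * x)) ** 2 for x, y in held_out) / len(held_out)
    fold_errors.append(ase)

cv_error = sum(fold_errors) / k                    # combine the k estimates
print(round(cv_error, 2))                          # near the noise variance, 1.0
```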
Customizing the Selection Process
Choice of k
As an estimator of the true prediction error, cross validation tends to have decreasing bias but increasing variance as k increases.
A typical choice of k is 5 or 10.
43 / 115
Customizing the Selection Process
Leave-One-Out Cross Validation
Leave-one-out cross validation is a special case of k-fold cross validation where k = n and n is the total number of observations in the training data set.
Each omitted part consists of one observation.
The predicted residual sum of squares (PRESS) can be efficiently obtained without refitting the model n times.
It is approximately unbiased for the true prediction error but can have high variance because the n "training sets" are so similar to one another.
You can request leave-one-out cross validation by specifying PRESS instead of CV in the SELECT=, CHOOSE=, and STOP= suboptions.
44 / 115
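The no-refitting property of PRESS can be verified for simple linear regression, where the leave-one-out residual equals the ordinary residual divided by 1 − h_ii (h_ii is the observation's leverage). This Python sketch (an illustration, not SAS code) checks the identity against brute-force refitting:

```python
import random

random.seed(4)
n = 30
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

def fit(xp, yp):
    """Simple linear regression: return (intercept, slope)."""
    xbar = sum(xp) / len(xp)
    ybar = sum(yp) / len(yp)
    sxx = sum((x - xbar) ** 2 for x in xp)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xp, yp)) / sxx
    return ybar - b1 * xbar, b1

b0, b1 = fit(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# PRESS from a single fit: scale each residual by 1 - h_ii (its leverage)
press = sum(((y - (b0 + b1 * x)) / (1 - (1 / n + (x - xbar) ** 2 / sxx))) ** 2
            for x, y in zip(xs, ys))

# Brute-force check: refit n times, each time leaving one observation out
brute = 0.0
for i in range(n):
    b0i, b1i = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    brute += (ys[i] - (b0i + b1i * xs[i])) ** 2

print(abs(press - brute) < 1e-6)   # the two agree, with no refitting needed
```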
Customizing the Selection Process
Analysis 3: Traditional Stepwise Selection withCHOOSE=PRESS
proc glmselect data=analysisData testdata=testData
plots=(asePlot Criteria);
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection = stepwise(select=sl choose=press);
run;
The CHOOSE=PRESS option requests that among the models obtained at each step of the selection process, the final selected model be the model that has the smallest leave-one-out predicted residual sum of squares.
45 / 115
Customizing the Selection Process
Criterion Panel
The final selected model is the model that has the smallest leave-one-out predicted residual sum of squares (PRESS).
46 / 115
Customizing the Selection Process
ASE Plot
By choosing the model at step 14, you limit the overfitting of the training data that occurs when selection proceeds beyond this step.
47 / 115
Customizing the Selection Process
Problems in the Traditional Implementations of Forward, Backward, and Stepwise Selection
Traditional implementations of forward, backward, and stepwise selection methods are based on sequential hypothesis testing at the specified significance level. However:
F statistics might not follow an F distribution.
Hence p-values cannot reliably be viewed as probabilities.
A prespecified significance limit is not a data-driven criterion.
Hence the same significance level can cause overfitting for some data and underfitting for other data.
48 / 115
Customizing the Selection Process
Solution: Modify the Selection Process!
Replace the hypothesis-testing-based approach:
Use data-driven criteria such as information criteria, cross validation, or validation data instead of F statistics.
49 / 115
Customizing the Selection Process
Analysis 4: Default Stepwise Selection (SELECT=SBC)
By default, PROC GLMSELECT uses stepwise selection with the SELECT=SBC and STOP=SBC options.
proc glmselect data=analysisData testdata=testData
plots=(asePlot Coefficients Criteria);
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2;
run;
Data Set WORK.ANALYSISDATA
Test Data Set WORK.TESTDATA
Dependent Variable y
Selection Method Stepwise
Select Criterion SBC
Stop Criterion SBC
Effect Hierarchy Enforced None
50 / 115
Customizing the Selection Process
Selection Summary

Stepwise Selection Summary
Step | Effect Entered | Effect Removed | Number Effects In | Number Parms In | SBC | ASE | Test ASE
0 Intercept 1 1 14933.9344 81.1588 81.4026
1 x9*c2 2 4 13495.1664 52.7431 52.5172
2 x5*c1 3 12 12802.0131 42.1868 43.7057
3 x1*x2 4 13 12593.9481 39.5841 41.9502
4 x17*c1 5 21 12407.8500 36.7616 37.9282
5 c1 6 28 12374.1967 35.7939 36.9287
6 x1*x10 7 29 12369.0918* 35.6546 36.8501
* Optimal Value Of Criterion
Stepwise selection terminates when adding or dropping any effect increases the SBC statistic (≈ 12369).
Stop Details
Candidate For | Effect | Candidate SBC | Compare SBC
Entry | x1*x9 | 12370.6006 | > 12369.0918
Removal | x1*x10 | 12374.1967 | > 12369.0918
51 / 115
Customizing the Selection Process
Coefficient Progression Plot
Classification effects join the model along with all their levels.
52 / 115
Customizing the Selection Process
Parameter Estimates for the Classification Variable c1
Only levels 2 and 5 of the classification effect c1 contribute appreciably to the model.
53 / 115
Customizing the Selection Process
Analysis 5: Stepwise Selection with a Split ClassificationVariable
A more parsimonious model that has similar or better predictive power might be obtained if parameters that correspond to the levels of c1 are allowed to enter or leave the model independently:
proc glmselect data=analysisData testdata=testData;
class c1(split) c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/orderSelect;
run;
54 / 115
Customizing the Selection Process
Dimensions Table
Dimensions
Number of Effects 278
Number of Effects after Splits 439
Number of Parameters 947
After splitting, 439 split effects are considered for entry or removal at each step of the selection process.
55 / 115
Customizing the Selection Process
Parameter Estimates
Selected Model

Parameter Estimates
Parameter | DF | Estimate | Standard Error | t Value
Intercept 1 2.763669 0.360337 7.67
x9*c2 1 1 6.677365 0.440050 15.17
x9*c2 2 1 13.793766 0.431579 31.96
x9*c2 3 1 21.082776 0.439905 47.93
x5 1 -8.250059 0.353952 -23.31
c1_2 1 6.062842 0.295250 20.53
x1*x2 1 -6.386971 0.519767 -12.29
x17 1 4.801696 0.357801 13.42
c1_5 1 5.053642 0.295384 17.11
x1*x10 1 -1.964001 0.534991 -3.67
The selected model contains only two parameters for c1 instead of all eight levels.
56 / 115
Customizing the Selection Process
Split Model versus Nonsplit Model
The split model provides:
A model that has fewer degrees of freedom (29 versus 10)
Improved prediction performance (ASE on test data: 36.85 versus 36.68)
57 / 115
Customizing the Selection Process
Analysis 6: Stepwise Selection with Internally Partitioned Data and STOP=VALIDATE
proc glmselect data=AnalysisData testdata=TestData;
partition fraction(validate=.25);
class c1(split) c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection=stepwise(stop=validate);
run;
The PARTITION statement randomly reserves one-quarter of the observations in AnalysisData for model validation and the rest for model training.
The STOP=VALIDATE suboption requests that the selection process terminate when adding or dropping any effect increases the average square error on the validation data.
58 / 115
Customizing the Selection Process
Number of Observations for Each Role
Observation Profile for Analysis Data
Number of Observations Read 3395
Number of Observations Used 3395
Number of Observations Used for Training 2576
Number of Observations Used for Validation 819
Observation Profile for Test Data
Number of Observations Read 1605
Number of Observations Used 1605
59 / 115
Customizing the Selection Process
Average Square Errors by Roles
Desirable behavior!
60 / 115
Compare Analysis 1–6
COMPARE ANALYSIS 1-6
61 / 115
Compare Analysis 1–6
Summary Slide of All Analyses
Analysis | SELECTION= | Suboptions | CLASS
1. Full least squares | FORWARD | STOP=NONE | C1 C2 C3
2. Traditional | STEPWISE | SELECT=SL | C1 C2 C3
3. Traditional with leave-one-out CV | STEPWISE | SELECT=SL CHOOSE=PRESS | C1 C2 C3
4. Default | STEPWISE | SELECT=SBC | C1 C2 C3
5. Default with split | STEPWISE | SELECT=SBC | C1(SPLIT) C2 C3
6. Default with split and validate | STEPWISE | SELECT=SBC STOP=VALIDATE | C1(SPLIT) C2 C3
62 / 115
Compare Analysis 1–6
Predictive Performance Comparison
Analysis | Effects | Parms | Exact Effects Contained | Train ASE | Test ASE
1. Full least squares | 274 | 834 | 5 | 26.73 | 49.28
2. Traditional | 26 | 231 | 3 | 32.22 | 40.78
3. Traditional with leave-one-out CV | 14 | 58 | 3 | 34.74 | 37.70
4. Default | 6 | 28 | 2 | 35.65 | 36.85
5. Default with split | 7 | 9 | 5 | 35.77 | 36.68
6. Default with split and validate | 6 | 9 | 5 | 34.72 | 36.78
True | 5 | 8 | 5 | 35.96 | 36.73
63 / 115
Compare Analysis 1–6
Predictive Performance Comparison
Analysis 5 and Analysis 6:
Yield more parsimonious models
Capture all the effects in the true model
Have good predictive performance on the test data set
64 / 115
Compare Analysis 1–6
Careful and Informed Use of Subset Selection Methods Is OK!
Despite the difficulties, careful and informed use of traditional variable selection methods still has its place in data analysis.
Example: Foster and Stine (2004) use a modified version of stepwise selection to build a predictive model for bankruptcy from over 67,000 possible predictors and show that this method yields a model whose predictions compare favorably with those of other recently developed data mining tools.
65 / 115
Compare Analysis 1–6
Subset Selection Is Bad
"Stepwise variable selection has been a very popular technique for many years, but if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing."
—Frank Harrell, Regression Modeling Strategies, 2001
66 / 115
Compare Analysis 1–6
Problems with Subset Selection Methods
Problems arise in both variable selection and coefficient estimation:
Algorithms are greedy. They make the best change at each step, regardless of future effects.
The coefficients and predictions are unstable, especially when there are correlated predictors or the number of input variables greatly exceeds the number of observations.
Alternative: Penalized regression methods
67 / 115
Penalized Regression Methods
PENALIZED REGRESSION METHODS
68 / 115
Penalized Regression Methods
Shrinkage and Penalty Methods
Penalized regression methods often introduce bias, but they improve the prediction accuracy because of the bias–variance trade-off.
Commonly used shrinkage and penalty methods include:
Ridge regression
LASSO (Tibshirani 1996)
Adaptive LASSO (Zou 2006)
Elastic net (Zou et al. 2005)
69 / 115
Penalized Regression Methods
Shrinkage and Penalty Methods
The regression estimate is defined as the value of β that minimizes

∑_{i=1}^{n} ( y_i − (Xβ)_i )^2

Ridge: subject to ∑_{j=1}^{p} β_j^2 ≤ t1

LASSO: subject to ∑_{j=1}^{p} |β_j| ≤ t2

Elastic net: subject to ∑_{j=1}^{p} β_j^2 ≤ t1 and ∑_{j=1}^{p} |β_j| ≤ t2

where t1 and t2 are the penalty parameters.
70 / 115
Penalized Regression Methods
Motivation for Ridge Regression
When the number of input variables exceeds the number of observations, least squares estimation has the following drawbacks:
  - Estimates are not unique.
  - The resulting model heavily overfits the data.
When there are correlated variables, the least squares estimates have high variances.
These call for extended statistical methodologies!
71 / 115
Penalized Regression Methods
Ridge Regression
arg min_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )^2 + λ ∑_{j=1}^{p} β_j^2

There is a one-to-one correspondence between λ and t1.
  - As λ → 0, β_ridge → β_OLS
  - As λ → ∞, β_ridge → 0
Estimates shrink toward zero but never reach zero, so ridge regression does not provide variable selection.
Introduces bias, but reduces the variance of the estimate
Most useful in the presence of multicollinearity
Solution has a closed analytical form
Available in PROC REG
72 / 115
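For one centered predictor, the ridge estimate has the closed form b(λ) = S_xy / (S_xx + λ), which makes the shrinkage behavior easy to see (a Python illustration with made-up data, not from the slides):

```python
# xs are already centered; ys roughly follow y = 2x
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

for lam in [0.0, 1.0, 10.0, 1e6]:
    b = sxy / (sxx + lam)              # ridge estimate for this lambda
    print(lam, round(b, 4))            # shrinks toward 0, never exactly 0
```

At λ = 0 this is the OLS slope; as λ grows the coefficient approaches, but never reaches, zero.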
Penalized Regression Methods
Shrinkage and Penalty Methods in PROC GLMSELECT
LASSO and adaptive LASSO are available in PROC GLMSELECT
Simultaneous estimation and variable selection techniques
Effectively perform variable selection by modifying the coefficient estimation and reducing some coefficients to zero
73 / 115
Penalized Regression Methods
Defining the LASSO
For a given tuning parameter t ≥ 0,

arg min_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )^2   subject to   ∑_{j=1}^{p} |β_j| ≤ t

The parameter t ≥ 0 controls the amount of shrinkage.
t < ∑_{j=1}^{p} |β_j^OLS| causes shrinkage of the solutions toward 0.
74 / 115
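In the special case of an orthonormal design, the LASSO solution is soft thresholding of the OLS coefficients: shrink each coefficient by the penalty and cut it to zero if it would change sign. A Python sketch with hypothetical coefficient values (an illustration only; PROC GLMSELECT actually uses the LARS algorithm):

```python
import math

def soft_threshold(b_ols, lam):
    """LASSO estimate for an orthonormal design: shrink, then cut to zero."""
    mag = max(abs(b_ols) - lam, 0.0)
    return math.copysign(mag, b_ols) if mag > 0 else 0.0

ols = [3.2, -1.5, 0.4, -0.2]           # hypothetical OLS coefficients
lam = 0.5
lasso = [soft_threshold(b, lam) for b in ols]
print([round(b, 4) for b in lasso])    # small coefficients are set exactly to 0
```

This is how the LASSO performs estimation and variable selection simultaneously: coefficients smaller than the penalty are removed from the model.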
Penalized Regression Methods
Geometric Interpretation
The solid blue areas are constraint regions, and the red ellipses are the contours of the least squares error function.
75 / 115
Penalized Regression Methods
Penalty Parameter Must Be Set to Obtain the Final Solution!
How to Determine the Penalty Parameter?
Use criteria based on the likelihood function, such as Adj R-Sq, CP, AIC, AICC, BIC, and SBC.
Use criteria based on estimating the true prediction error, such as a validation data set or cross validation techniques.
76 / 115
Penalized Regression Methods
Prostate Data
The data come from a study by Stamey et al. (1989):
97 observations
The response variable is the level of prostate-specific antigen (lpsa).
Predictors are the following clinical measures:
  - log of cancer volume (lcavol)
  - log of prostate weight (lweight)
  - age (age)
  - log of the amount of benign prostatic hyperplasia (lbph)
  - seminal vesicle invasion (svi)
  - log of capsular penetration (lcp)
  - Gleason score (gleason)
  - percentage of Gleason scores of 4 or 5 (pgg45)
77 / 115
Penalized Regression Methods
Prostate Data
data Prostate;
input lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
datalines;
-0.58 2.769 50 -1.39 0 -1.39 6 0 -0.43
-0.99 3.32 58 -1.39 0 -1.39 6 0 -0.16
-0.51 2.691 74 -1.39 0 -1.39 7 20 -0.16
.
. more data lines
.
2.883 3.774 68 1.558 1 1.558 7 80 5.478
3.472 3.975 68 0.438 1 2.904 7 20 5.583
;
78 / 115
Penalized Regression Methods
LASSO Using PROC GLMSELECT
proc glmselect data=prostate plots=all;
model lpsa=lcavol lweight age lbph svi lcp gleason pgg45
/selection=lasso(stop=none choose=sbc);
run;
Note that the SELECT= suboption is not valid with the LAR and LASSO methods.
The PLOTS=ALL option requests all the plots that come with the analysis.
79 / 115
Penalized Regression Methods
LASSO Coefficients
80 / 115
Penalized Regression Methods
Fit Criteria for LASSO Selection
81 / 115
Penalized Regression Methods
LASSO Estimates
Estimates are calculated using the least angle regression (LARS) algorithm (Efron et al. 2004).
Standard errors of the coefficients are not immediately available.
82 / 115
Penalized Regression Methods
Adaptive LASSO
LASSO has a non-ignorable bias when it estimates the nonzero coefficients (Fan and Li 2001).
Adaptive LASSO produces unbiased estimates by allowing a relatively higher penalty for zero coefficients and a lower penalty for nonzero coefficients (Zou 2006).
83 / 115
Penalized Regression Methods
Adaptive LASSO
Modification of LASSO selection:

arg min_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )^2   subject to   ∑_{j=1}^{p} |β_j| / |β̂_j| ≤ t

Adaptive weights (1/|β̂_j|) are applied to each parameter in forming the LASSO constraint, to control shrinking the zero coefficients more than the nonzero coefficients.
84 / 115
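In the same orthonormal special case used for the LASSO sketch earlier, the adaptive weights turn into coefficient-specific thresholds λ/|β̂_j|, so small OLS coefficients are penalized much harder than large ones. A Python sketch with the same hypothetical coefficients (an illustration, not the PROC GLMSELECT implementation):

```python
import math

def adaptive_soft_threshold(b_ols, lam):
    """Adaptive LASSO in the orthonormal case: the penalty for each
    coefficient is scaled by 1/|b_ols|, so small coefficients are hit harder."""
    if b_ols == 0.0:
        return 0.0
    mag = max(abs(b_ols) - lam / abs(b_ols), 0.0)
    return math.copysign(mag, b_ols) if mag > 0 else 0.0

ols = [3.2, -1.5, 0.4, -0.2]           # the same hypothetical OLS coefficients
lam = 0.5
adaptive = [adaptive_soft_threshold(b, lam) for b in ols]
print([round(b, 4) for b in adaptive]) # large coefficients are barely shrunk
```

Compared with plain soft thresholding, the large coefficients keep almost their full OLS values while the small ones still drop to zero, which is the source of the reduced bias.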
Penalized Regression Methods
Adaptive LASSO Using PROC GLMSELECT
proc glmselect data=prostate plots=all;
model lpsa=lcavol lweight age lbph svi lcp gleason pgg45
/selection=lasso(adaptive stop=none choose=sbc);
run;
For the β̂_j's in the adaptive weights:
By default, ordinary least squares estimates of the parameters in the model are used.
You can use the INEST= option to name a SAS data set that contains estimates for the β̂_j's.
85 / 115
Penalized Regression Methods
Adaptive LASSO Coefficients
Although the solution paths are slightly different, LASSO and adaptive LASSO agree on the final model that is chosen by the SBC.
86 / 115
Penalized Regression Methods
LASSO or Adaptive LASSO?
Adaptive LASSO enjoys the computational advantage of the LASSO.
Because of the variance and bias trade-off, adaptive LASSO might not result in optimal prediction performance (Zou 2006).
Hence LASSO can still be advantageous in difficult prediction problems.
87 / 115
Penalized Regression Methods
Prefer Modern over Traditional Selection Methods
Effectively perform variable selection by modifying the coefficient estimation and reducing some coefficients to 0
Improve stability and prediction
Supported by much theoretical work
88 / 115
Special Methods
SPECIAL METHODS:
MODEL AVERAGING
SELECTION FOR NONPARAMETRIC MODELS WITH SPLINE EFFECTS
89 / 115
Special Methods Model Averaging
MODEL AVERAGING
90 / 115
Special Methods Model Averaging
Model Averaging
Another way to deal with the shortcomings of traditional selection methods is based on model averaging.
Model averaging repeats model selection for multiple training sets and uses the average of the selected models for prediction. It can provide:
More stable inferences
Less selection bias
91 / 115
Special Methods Model Averaging
Model Averaging with the Bootstrap Method
Sample data with replacement.
Select a model for each sample.
Average the predictions across the samples.
Form an average model by averaging parameters across the samples:
  - Predictions with the average model = average of predictions
  - Averaging shrinks infrequently selected parameters
  - Disadvantage: The average model is usually not parsimonious
92 / 115
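The bootstrap recipe above can be sketched as follows (a toy Python illustration, not the MODELAVERAGE statement: two candidate one-predictor models, selection by RSS on each bootstrap sample, with simulated data):

```python
import random

random.seed(5)
n = 80
rows = []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    rows.append((x1, x2, 2 * x1 + 0.3 * x2 + random.gauss(0, 1)))

def fit_one(sample, j):
    """Least squares on predictor j alone; returns (b0, b1) and the RSS."""
    xbar = sum(r[j] for r in sample) / len(sample)
    ybar = sum(r[2] for r in sample) / len(sample)
    sxx = sum((r[j] - xbar) ** 2 for r in sample)
    b1 = sum((r[j] - xbar) * (r[2] - ybar) for r in sample) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((r[2] - (b0 + b1 * r[j])) ** 2 for r in sample)
    return (b0, b1), rss

counts = {0: 0, 1: 0}        # how often each candidate model is selected
preds = []                   # per-sample prediction at the point x = 1
for _ in range(500):
    boot = [random.choice(rows) for _ in range(n)]   # sample with replacement
    fits = [fit_one(boot, j) for j in (0, 1)]
    j = 0 if fits[0][1] < fits[1][1] else 1          # select the better model
    counts[j] += 1
    b0, b1 = fits[j][0]
    preds.append(b0 + b1 * 1.0)

print(counts)                              # selection frequencies
print(round(sum(preds) / len(preds), 2))   # model-averaged prediction at x = 1
```

The selection frequencies play the same role as the model selection frequency table shown later for the prostate data.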
Special Methods Model Averaging
Model Averaging for Prostate Data
A popular way to obtain a parsimonious average model is to form the average of just the frequently selected models.
proc glmselect data=prostate plots=all;
model lpsa=lcavol lweight age lbph svi lcp gleason pgg45
/selection=lasso(stop=none choose=sbc);
modelAverage nsamples=10000 subset(best=1);
run;
SUBSET(BEST=1) specifies that only the most frequently selected model be used in forming the average model.
93 / 115
Special Methods Model Averaging
Model Selection Frequency

  Times     Selection   Number of   Frequency
  Selected  Percentage  Effects     Score      Effects in Model
  1416      14.16       4           1417       Intercept lcavol lweight svi
*  886       8.86       5            886.9     Intercept lcavol lweight svi pgg45
*  882       8.82       6            882.8     Intercept lcavol lweight lbph svi pgg45
*  842       8.42       7            842.8     Intercept lcavol lweight age lbph svi pgg45
*  532       5.32       8            532.7     Intercept lcavol lweight age lbph svi gleason pgg45
*  525       5.25       5            525.9     Intercept lcavol lweight lbph svi
*  524       5.24       7            524.8     Intercept lcavol lweight age lbph svi gleason
*  468       4.68       9            468.7     Intercept lcavol lweight age lbph svi lcp gleason pgg45
*  451       4.51       8            451.7     Intercept lcavol lweight age lbph svi lcp pgg45
*  421       4.21       6            421.8     Intercept lcavol lweight age lbph svi
*  299       2.99       7            299.8     Intercept lcavol lweight lbph svi gleason pgg45
*  270       2.70       5            270.9     Intercept lcavol lweight svi gleason
*  259       2.59       5            259.9     Intercept lcavol lweight age svi
*  254       2.54       6            254.8     Intercept lcavol lweight lbph svi gleason
*  189       1.89       5            189.8     Intercept lcavol lweight svi lcp
*  156       1.56       8            156.7     Intercept lcavol lweight age lbph svi lcp gleason
*  153       1.53       6            153.8     Intercept lcavol lweight svi gleason pgg45
*  148       1.48       6            148.8     Intercept lcavol lweight age svi pgg45
*  136       1.36       7            136.8     Intercept lcavol lweight lbph svi lcp pgg45
*  119       1.19       7            119.7     Intercept lcavol lweight age lbph svi lcp
* Not Used in Model Averaging
Models with the regressors “Intercept, lcavol, lweight, svi” are appropriate for these data.
94 / 115
Special Methods Model Averaging
Average Parameter Estimates

                                                                Estimate Quantiles
Parameter   Number    Non-zero    Mean       Standard
            Non-zero  Percentage  Estimate   Deviation   25%        Median     75%
Intercept   1416      100.00      -0.023901  0.673999    -0.450106  -0.030497  0.462740
lcavol      1416      100.00       0.481555  0.067986     0.435344   0.481252  0.528441
lweight     1416      100.00       0.482178  0.177453     0.351844   0.485739  0.592714
svi         1416      100.00       0.508385  0.189062     0.373536   0.499117  0.640241
95 / 115
Special Methods Model Averaging
[Figure: Parameter Estimate Distributions for lpsa (number of samples = 10,000). Histograms of the 1,416 retained bootstrap estimates for Intercept, lcavol, lweight, and svi.]
You can interpret the range between the 5th and 95th percentiles of each estimate as an approximate 90% confidence interval for that estimate.
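As a small illustration of that percentile interpretation (the numbers below are simulated stand-ins, not the actual lcavol estimates):

```python
import numpy as np

# simulated stand-in for the 1,416 bootstrap estimates of one coefficient
rng = np.random.default_rng(0)
boot = rng.normal(loc=0.48, scale=0.07, size=1416)

lo, hi = np.percentile(boot, [5, 95])     # 5th and 95th percentiles
print(f"approximate 90% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```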
96 / 115
Special Methods Model Averaging
[Figure: Effects Selected in at Least 20% of the Samples for lpsa. Bar chart of selection percentage by effect, in decreasing order: lcavol, lweight, svi, lbph, pgg45, age, gleason, lcp.]
You can build another parsimonious model by using the frequency of effect selection as a measure of effect importance.
97 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
SELECTION FOR NONPARAMETRIC MODELS
WITH SPLINE EFFECTS
98 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Noisy Sinusoidal Data
The true response function might be a complicated nonlinear transformation of the inputs.
99 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Moving beyond Linearity
One way to incorporate this nonlinearity into the model:
Create additional variables that are transformations of the input variables.
Use these additional variables to form the new design matrix.
Use linear models in this new space of derived inputs.
100 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
The EFFECT Statement in PROC GLMSELECT
Enables you to construct special collections of columns for design matrices.
Provides support for splines of any degree, including the cubic B-spline basis (default).
I A spline function is a piecewise polynomial function in which the individual polynomials have the same degree and connect smoothly at join points (knots).
I You can associate local features in your data with particular B-spline basis functions.
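To make the basis concrete, here is a NumPy-only sketch of a clamped cubic B-spline basis built with the Cox-de Boor recursion. It mirrors, but does not reproduce, what the EFFECT statement constructs for four equally spaced internal knots on [0, 1]; the function name and knot choices are illustrative:

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function of the given degree at the
    points x, via the Cox-de Boor recursion. `knots` is a nondecreasing
    knot vector with boundary knots repeated degree+1 times (clamped)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # degree-0 bases: indicators of the half-open knot spans
    B = np.array([(t[i] <= x) & (x < t[i + 1])
                  for i in range(len(t) - 1)], dtype=float)
    for d in range(1, degree + 1):
        B_new = np.zeros((len(t) - d - 1, x.size))
        for i in range(len(t) - d - 1):
            left, right = t[i + d] - t[i], t[i + d + 1] - t[i + 1]
            if left > 0:                  # skip zero-width (repeated) knots
                B_new[i] += (x - t[i]) / left * B[i]
            if right > 0:
                B_new[i] += (t[i + d + 1] - x) / right * B[i + 1]
        B = B_new
    return B.T                            # one column per basis function

# four equally spaced internal knots on [0, 1] -> 8 cubic basis functions
internal = np.linspace(0, 1, 6)[1:-1]
knots = np.concatenate([np.zeros(4), internal, np.ones(4)])
basis = bspline_basis(np.linspace(0, 0.999, 50), knots)
print(basis.shape)                                # (50, 8)
print(bool(np.allclose(basis.sum(axis=1), 1.0)))  # partition of unity: True
```

Each column is nonzero only near its own knots, which is why selection on these columns can pick up local features in the data.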
101 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Selection Using Spline Effects
Spline function bases provide a computationally convenient and flexible way to specify a rich set of basis functions.
Variable selection can be useful for obtaining a parsimonious subset toprevent overfitting.
102 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Smoothing with Spline Effect
proc glmselect data=Sine;
effect spl = spline(x/knotmethod=equal(4) split);
model noisySine = spl;
output out=sineOut p=predicted;
run;
The EFFECT statement creates a constructed effect named “spl” that consists of the eight cubic B-spline basis functions that correspond to the four equally spaced internal knots.
Out of eight B-splines, five are selected.
103 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Smoothing Noisy Sinusoidal Data
A B-spline basis with about four internal knots is appropriate.
104 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Noisy Sinusoidal Data with Bumps
Can you capture the bumps with a finer set of knots?
105 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Noisy Sinusoidal Data with Bumps
You can capture the bumps at the expense of overfitting the data in the regions between the bumps.
106 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Solution: B-Spline Bases at Multiple Scales
The following statements perform effect selection from several sets of B-spline bases that correspond to different scales in the data.
proc glmselect data=DoJoBumps;
effect spl = spline(x/knotmethod=multiscale(endscale=8)
split details);
model noisyBumpsSine = spl;
run;
The ENDSCALE=8 option requests that the finest scale use cubic B-splines defined on 2^8 equally spaced knots in the interval [0, 1].
Out of 548 B-splines, 27 are selected.
107 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Smoothing Noisy Sinusoidal Data with Bumps
Accurately captures both the low-frequency sinusoidal baseline and thebumps, without overfitting the regions between the bumps.
108 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
PROC ADAPTIVEREG in SAS/STAT 12.1
The multivariate adaptive regression splines method is a nonparametric approach for modeling high-dimensional data:
Introduced by Friedman (1991)
Combines both regression splines and model selection methods
Doesn’t require you to specify knots to construct regression spline terms
Automatically models nonlinearities and interactions
109 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
PROC QUANTSELECT in SAS/STAT 12.1 (Experimental)
The QUANTSELECT procedure performs model selection for quantile regression:
Forward, backward, stepwise, and LASSO selection methods
Variable selection criteria: AIC, SBC, AICC, and so on
Variable selection for both quantiles and the quantile process
EFFECT statement for constructed model effects (splines)
Experimental in 12.1
110 / 115
Summary
SUMMARY
111 / 115
Summary
Summary
The GLMSELECT procedure supports a variety of model selection methods for general linear models:
Traditional model selection methods
Modern selection methods
Model averaging
Nonparametric modeling by using spline effects
In doing so, it offers:
Extensive capabilities for customizing the selection
Flexibility and power in specifying complex potential effects
112 / 115
Summary
Back to Learning Objectives
You learned:
Problems with the traditional selection methods
Modern penalty-based methods, including LASSO and adaptive LASSO, as alternatives to traditional methods
Bootstrap-based model averaging to reduce selection bias and improve predictive performance
You learned how to:
Use model selection diagnostics, including graphics, for detecting problems
Use validation data to detect and prevent under-fitting and over-fitting
Customize the selection process using the features of the GLMSELECT procedure
113 / 115
Summary
Useful References
Cohen, R. 2006. “Introducing the GLMSELECT Procedure for Model Selection.” Proceedings of SAS Global Forum 2006 Conference. Cary, NC: SAS Institute Inc.
Cohen, R. 2009. “Applications of GLMSELECT Procedure for Megamodel Selection.” Proceedings of SAS Global Forum 2009 Conference. Cary, NC: SAS Institute Inc.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. 2004. “Least Angle Regression (with Discussion).” Annals of Statistics 32:407–499.
Eilers, P. H. C. and Marx, B. D. 1996. “Flexible Smoothing with B-Splines and Penalties (with Discussion).” Statistical Science 11:89–121.
Fan, J. and Li, R. 2001. “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association 96:1348–1360.
Foster, D. P. and Stine, R. A. 2004. “Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy.” Journal of the American Statistical Association 99:303–313.
Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” Annals of Statistics 19:1–67.
Harrell, F. 2001. Regression Modeling Strategies. New York: Springer-Verlag.
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning. New York: Springer-Verlag.
114 / 115
Summary
Useful References
Miller, A. 2002. Subset Selection in Regression. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B 58:267–288.
Zou, H. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101:1418–1429.
Zou, H. and Hastie, T. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society, Series B 67:301–320.
115 / 115