TRANSCRIPT
Model Selection for Linear Models with SAS/STAT Software
Funda Gunes, SAS Institute Inc.
Outline
Introduction
  - Analysis 1: Full least squares model
Traditional model selection methods
  - Analysis 2: Traditional stepwise selection
Customizing the selection process
  - Analyses 3–6
Compare analyses 1–6
Penalized regression methods
Special methods
  - Model averaging
  - Selection for nonparametric models with spline effects
1 / 115
Learning Objectives
You will learn:
Problems with traditional selection methods
Modern penalty-based methods, including LASSO and adaptive LASSO, as alternatives to traditional methods
Bootstrap-based model averaging to reduce selection bias and improve predictive performance
You will learn how to:
Use model selection diagnostics, including graphics, for detecting problems
Use validation data to detect and prevent underfitting and overfitting
Customize the selection process using the features of the GLMSELECT procedure
2 / 115
Introduction
Introduction
With improvements in data collection technologies, regression problems that have large numbers of candidate predictor variables occur in a wide variety of scientific fields and business problems.
“I’ve got all these variables, but I don’t know which ones to use.”
Statistical model selection seeks an answer to this question.
3 / 115
Introduction
Model Selection and Its Goals
Model Selection: Estimating the performance of different models in order to choose the approximately best model.
Goals of model selection:
Simple and interpretable models
Accurate predictions
Model selection is often a trade-off between bias and variance.
4 / 115
Introduction
Graphical Illustration of Bias and Variance
5 / 115
Introduction
Bias-Variance Trade-Off
Suppose Y = f(X) + ε, where ε ∼ N(0, σ_ε^2).
The expected prediction error at a point x is
E[(Y − f̂(x))^2] = Bias^2 + Variance + Irreducible Error
where f̂ is the fitted model and the irreducible error is σ_ε^2.
6 / 115
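The decomposition above can be checked numerically. The following is a minimal sketch (in Python rather than SAS, and not part of the original slides): a deliberately biased, shrunken estimator of f(x0) is simulated many times, and the Monte Carlo estimate of the expected prediction error is compared with Bias^2 + Variance + σ_ε^2.

```python
import random

random.seed(1)
f_x0 = 2.0       # true f(x0)
sigma = 1.0      # noise standard deviation (sigma_eps)
n_train = 10     # training observations per replicate
shrink = 0.5     # a deliberately biased, shrunken estimator of f(x0)

estimates, errors = [], []
for _ in range(20000):
    train = [f_x0 + random.gauss(0, sigma) for _ in range(n_train)]
    fhat = shrink * sum(train) / n_train          # biased estimate of f(x0)
    estimates.append(fhat)
    y_new = f_x0 + random.gauss(0, sigma)         # independent test response
    errors.append((y_new - fhat) ** 2)

mean_est = sum(estimates) / len(estimates)
bias2 = (mean_est - f_x0) ** 2
variance = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
epe = sum(errors) / len(errors)                   # expected prediction error

# epe should be close to bias2 + variance + sigma**2
print(round(epe, 2), round(bias2 + variance + sigma ** 2, 2))
```

Here the shrinkage factor trades variance for bias; the two printed numbers agree up to Monte Carlo error.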
Introduction
The GLMSELECT Procedure
The GLMSELECT procedure implements statistical model selection in the framework of general linear models for selection from a very large number of effects.
Methods include:
Familiar methods such as forward, backward, and stepwise selection
Newer methods such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996)
7 / 115
Introduction
Difficulties of Model Selection
The implementation of model selection can lead to difficulties:
A model selection technique produces a single answer to the variable selection problem, although several different subsets might be equally good for regression purposes.
Model selection might be unduly affected by outliers.
Selection bias
8 / 115
Introduction
The GLMSELECT Procedure
PROC GLMSELECT can partially mitigate these problems with its
Extensive capabilities for customizing the selection
Flexibility and power in specifying complex potential effects
9 / 115
Introduction
Model Specification
The GLMSELECT procedure provides great flexibility for model specification:
Choice of parameterizations for classification effects
Any degree of interaction (crossed effects) and nested effects
Internal partitioning of data into training, validation, and testing roles
Hierarchy among effects
10 / 115
Introduction
Selection Control
The GLMSELECT procedure provides many options for selection control:
Multiple effect selection methods
Selection from a very large number of effects (tens of thousands)
Selection of individual levels of classification effects
Effect selection based on a variety of selection criteria
Stopping rules based on a variety of model evaluation criteria
Leave-one-out and k-fold cross validation
11 / 115
Introduction
Linear Regression Model
Suppose the data arise from a normal distribution with the following statistical model:
Y = f (x) + ε
In linear regression
f (x) = β0 + β1x1 + β2x2 + · · ·+ βpxp
Least squares is the most popular estimation method; it picks the coefficients β = (β0, β1, . . . , βp) that minimize the residual sum of squares:

RSS(β) = ∑_{i=1}^{N} ( y_i − (β_0 + ∑_{j=1}^{p} x_ij β_j) )^2
12 / 115
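For a single predictor, the minimizer of RSS(β) has a closed form. The following Python sketch (an illustration with made-up data values, not from the slides) computes it directly:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]        # roughly y = 2x

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = sxy / sxx                         # slope that minimizes RSS
b0 = ybar - b1 * xbar                  # intercept that minimizes RSS
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(round(b0, 3), round(b1, 3), round(rss, 4))   # slope near 2, small RSS
```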
Introduction
PROC GLMSELECT with Examples
Learn how to use PROC GLMSELECT in model development withexamples:
1 Simulate data
2 Fit full least squares model
3 Perform model selection by using five different approaches
4 Compare the selected models’ performances
13 / 115
Introduction
Simulate Data
data trainingData testData;
drop i j;
array x{20} x1-x20;
do i=1 to 5000;
/* Continuous predictors */
do j=1 to 20;
x{j} = ranuni(1);
end;
/* Classification variables */
c1 = int(1.5+ranuni(1)*7);
c2 = 1 + mod(i,3);
c3 = int(ranuni(1)*15);
yTrue = 2 + 5*x17 - 8*x5 + 7*x9*c2 - 7*x1*x2 + 6*(c1=2) + 5*(c1=5);
y = yTrue + 6*rannor(1);
if ranuni(1) < 2/3 then output trainingData;
else output testData;
end;
run;
Reserves one-third of the data as test data and the remaining two-thirds as
training data
14 / 115
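For readers without SAS, here is a rough Python analogue of the DATA step above (an approximation: Python's random streams differ from SAS's RANUNI/RANNOR, so the realized values and split will not match exactly):

```python
import random

random.seed(1)
training, test = [], []
for i in range(1, 5001):
    x = [random.random() for _ in range(20)]     # x1-x20 (x[j] here is x{j+1} in SAS)
    c1 = int(1.5 + random.random() * 7)          # classification levels 1..8
    c2 = 1 + i % 3
    c3 = int(random.random() * 15)
    y_true = (2 + 5 * x[16] - 8 * x[4] + 7 * x[8] * c2 - 7 * x[0] * x[1]
              + 6 * (c1 == 2) + 5 * (c1 == 5))
    y = y_true + 6 * random.gauss(0, 1)
    # reserve about one-third of the rows as test data
    (training if random.random() < 2 / 3 else test).append((x, c1, c2, c3, y))

print(len(training), len(test))   # roughly a 2:1 split of the 5000 rows
```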
Introduction
Training and Test Data
Use training data to develop a model.
Use test data to assess your model’s predictive performance.
15 / 115
Introduction
Analysis 1: Full Least Squares Model
proc glmselect data=trainingData testdata=testData plots=asePlot;
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection=forward(stop=none);
run;
Because STOP=NONE is specified, the selection proceeds until all the specified
effects are in the model.
16 / 115
Introduction
Dimensions of Full Least Squares Model
Class Level Information
Class Levels Values
c1 8 1 2 3 4 5 6 7 8
c2 3 1 2 3
c3 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Dimensions
Number of Effects 278
Number of Parameters 947
A full model that contains all main effects and their two-way interactions often leads to a large number of effects.
When the classification variables have many levels, the number of parameters available for selection is even larger.
17 / 115
Introduction
Assess Your Model’s Predictive Performance
You can assess the model's predictive performance by comparing the average square error (ASE) on the test data and the training data.
ASE on the test data:

ASE_test = (1/n_test) ∑_{i=1}^{n_test} ( y_test,i − (β̂_0 + ∑_{j=1}^{p} β̂_j x_test,ij) )^2

where the β̂_j's are the least squares estimates obtained by using the observations in the training data.
18 / 115
Introduction
Average Square Error (ASE) Plot
proc glmselect ... plots=asePlot;
[ASE plot: "Progression of Average Squared Errors by Role for y" — average squared error versus selection step (0–250) for the Training and Test roles, with the selected step marked]
19 / 115
Introduction
Overfitting and Variable Selection
So the more variables the better? NO!
Carefully selected variables can improve model accuracy, but adding too many features can lead to overfitting:
Overfitted models describe random error or noise instead of the underlying relationship.
Overfitted models generally have poor predictive performance.
Model selection can prove useful in finding a parsimonious model that has good predictive performance.
20 / 115
Introduction
Model Assessment
Model assessment aims to
1 Choose the number of predictors for a given technique.
2 Estimate the prediction ability of the chosen model.
For both of these purposes, the best approach is to evaluate the procedure on independent test data, if available.
If possible, one should use different test data for (1) and (2): a validation set for (1) and a test set for (2).
21 / 115
Introduction
Model Selection for Linear Regression Models
Suppose you have only two models to compare. Then you can use the following methods for model comparison:
F test
Likelihood ratio test
AIC, SBC, and so on
Cross validation
However, we usually have more than two models to compare!
For a model selection problem with p predictors, there are 2^p models to compare!
22 / 115
Introduction
Alternatives
Compare all possible subsets (all-subsets regression)
  - Computationally expensive
  - Introduces a large selection bias!
Use search algorithms
  - Traditional selection methods: forward, backward, and stepwise
  - Shrinkage and penalty methods
23 / 115
Traditional Selection Methods
TRADITIONAL SELECTION METHODS
24 / 115
Traditional Selection Methods
Traditional Selection Methods
Forward Selection: Begins with just the intercept and at each step adds the effect that shows the largest contribution to the model.
Backward Elimination: Begins with the full model and at each step deletes the effect that shows the smallest contribution to the model.
Stepwise Selection: Modification of the forward selection technique that differs in that effects already in the model do not necessarily stay there.
PROC GLMSELECT extends these methods as implemented in the REG procedure.
The SELECTION= option of the MODEL statement specifies the model selection method.
25 / 115
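The greedy logic of forward selection can be sketched as follows (a Python illustration with simulated data and a crude F-statistic entry threshold — not the actual PROC GLMSELECT or PROC REG implementation):

```python
import random

def lstsq(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    p = len(X[0])
    A = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    v = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        v[c], v[piv] = v[piv], v[c]
        for r in range(c + 1, p):
            m = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= m * A[c][k]
            v[r] -= m * v[c]
    b = [0.0] * p
    for c in reversed(range(p)):
        b[c] = (v[c] - sum(A[c][k] * b[k] for k in range(c + 1, p))) / A[c][c]
    return b

def rss_for(cols, xs, y):
    """RSS of the least squares fit using an intercept plus the listed columns."""
    X = [[1.0] + [row[j] for j in cols] for row in xs]
    b = lstsq(X, y)
    return sum((yi - sum(bj * xij for bj, xij in zip(b, Xi))) ** 2
               for Xi, yi in zip(X, y))

random.seed(2)
n, p = 200, 5
xs = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * row[0] - 2 * row[3] + random.gauss(0, 1) for row in xs]  # x1, x4 active

selected, remaining = [], list(range(p))
current_rss = rss_for(selected, xs, y)
while remaining:
    # greedy step: try each remaining effect and keep the best improvement
    best = min(remaining, key=lambda j: rss_for(selected + [j], xs, y))
    new_rss = rss_for(selected + [best], xs, y)
    df = n - len(selected) - 2           # residual df after adding one effect
    f_stat = (current_rss - new_rss) / (new_rss / df)
    if f_stat < 4.0:                     # crude entry threshold
        break
    selected.append(best)
    remaining.remove(best)
    current_rss = new_rss

print(selected)   # the truly active predictors (indices 0 and 3) enter early
```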
Traditional Selection Methods
Traditional Selection Methods
In traditional selection methods:
The F statistic and the related p-value reflect an effect's contribution to the model.
You choose the predictors and then estimate coefficients by using standard criteria such as least squares or maximum likelihood.
There are problems with the use of both the F statistic and coefficient estimation!
26 / 115
Traditional Selection Methods
Analysis 2: Traditional Stepwise Selection
proc glmselect data=analysisData testdata=testData
plots=(CoefficientPanel(unpack) asePlot Criteria);
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection = stepwise(select=sl);
run;
The SELECT=SL option uses the significance level criterion to determine the
order in which effects enter or leave the model.
27 / 115
Traditional Selection Methods
Selection Summary

Stepwise Selection Summary
Step | Effect Entered | Effect Removed | Number Effects In | Number Parms In | ASE | Test ASE | F Value | Pr > F
0 Intercept 1 1 81.1588 81.4026 0.00 1.0000
1 x9*c2 2 4 52.7431 52.5172 608.97 <.0001
2 x5*c1 3 12 42.1868 43.7057 105.82 <.0001
3 x1*x2 4 13 39.5841 41.9502 222.37 <.0001
4 x17*c1 5 21 36.7616 37.9282 32.38 <.0001
5 c1 6 28 35.7939 36.9287 13.00 <.0001
6 x1*x10 7 29 35.6546 36.8501 13.15 0.0003
7 x9*c1 8 36 35.4158 36.9118 3.24 0.0020
8 x7*c2 9 39 35.2761 37.0700 4.43 0.0041
9 x1*x9 10 40 35.2053 37.1232 6.74 0.0094
10 x9*x12 11 41 35.1544 37.1811 4.86 0.0276
11 x11*x12 12 42 35.1016 37.2530 5.04 0.0249
12 x10*c3 13 57 34.8201 37.6486 1.80 0.0293
13 x8*x11 14 58 34.7753 37.6597 4.31 0.0381
14 x7*x9 15 59 34.7390 37.7003 3.49 0.0620
15 c1*c3 16 171 33.3332 39.8716 1.21 0.0652
16 x3*c1 17 179 33.1678 40.0014 2.00 0.0422
17 c2*c3 18 209 32.7376 40.5367 1.40 0.0749
18 x7*x12 19 210 32.7107 40.4731 2.62 0.1059
19 x3*x12 20 211 32.6825 40.4880 2.75 0.0974
20 x15*c3 21 226 32.4562 40.6799 1.47 0.1060
21 x9*x15 22 227 32.3769 40.7047 7.75 0.0054
22 x2*x17 23 228 32.3523 40.7012 2.41 0.1207
23 x2*x5 24 229 32.2996 40.7218 5.17 0.0231
24 x3*x4 25 230 32.2755 40.6853 2.36 0.1243
25 x4*x12 26 231 32.2474 40.7480 2.75 0.0973
26 x2*x8 27 232 32.2240 40.7776 2.30 0.1294
28 / 115
Traditional Selection Methods
Stopping Details
Stop Details
Candidate For | Effect | Candidate Significance | Compare Significance
Entry | x1*c2 | 0.1536 | > 0.1500 (SLE)
Removal | x2*x8 | 0.1294 | < 0.1500 (SLS)
The stepwise selection terminates when these two conditions are satisfied:
None of the effects outside the model is significant at the entry significance level (SLE=0.15).
Every effect in the model is significant at the stay significance level (SLS=0.15).
29 / 115
Traditional Selection Methods
The Sequence of p-Values at Each Step
p-Values are not monotone increasing.
30 / 115
Traditional Selection Methods
The Selected Model Overfits the Training Data
The default SLE and SLS values of 0.15 produce a model that overfits the training data.
31 / 115
Traditional Selection Methods
Most Other Criteria Suggest Stopping the Selection before the Significance Level Criterion Is Reached
32 / 115
Customizing the Selection Process
CUSTOMIZING THE SELECTION PROCESS
33 / 115
Customizing the Selection Process
Customize the Selection Process by Using Various Criteria
You can use the following options to customize the selection process:
SELECT= criterion
Specifies the order in which effects enter or leave at each step of the specified selection method
STOP= criterion
Specifies when to stop the selection process
CHOOSE= criterion
Specifies the final model from the list of models in the steps of the selection process
34 / 115
Customizing the Selection Process
Example

model ... / selection=forward(select=sbc stop=aic choose=validate);
35 / 115
Customizing the Selection Process
Criteria Based on Likelihood Function
The following criteria are based on the likelihood function and are available for the SELECT=, STOP=, and CHOOSE= options:
ADJRSQ: Adjusted R-square
CP: Mallows' Cp statistic
Fit criteria
  - AIC: Akaike's information criterion
  - AICC: corrected Akaike's information criterion
  - SBC: Schwarz Bayesian information criterion
36 / 115
Customizing the Selection Process
Criteria Based on Estimating the True Prediction Error
The following criteria are based on estimating the true prediction error and are available for the SELECT=, STOP=, and CHOOSE= options:
If you have enough data, set aside a validation data set:
  - VALIDATE: ASE on validation data
If data are scarce, use cross validation:
  - CV: k-fold cross validation
  - PRESS: leave-one-out cross validation
37 / 115
Customizing the Selection Process
Data Roles: Training, Validation, and Testing
Training data
Always used to find parameter estimates
Can also be used to select effects, stop selection, and choose the final model
Validation data
Play a role in the selection process
Can be used for one or more of the following:
  - selecting effects to add or drop (SELECT=VALIDATE)
  - stopping the selection process (STOP=VALIDATE)
  - choosing the final model (CHOOSE=VALIDATE)
Test data
Used to assess predictive performance of models in the selection process, but do not affect the selection process
38 / 115
Customizing the Selection Process
Specifying Data Sets for Different Roles
PROC GLMSELECT statement options:
DATA= specifies training data
VALDATA= specifies validation data
TESTDATA= specifies test data
39 / 115
Customizing the Selection Process
PARTITION Statement
Another way to specify data sets for different roles is to use the PARTITION statement.
It internally partitions the DATA= data set:
Randomly in specified proportions
partition fraction(validate=0.3 test=0.2);
Based on the formatted value of the ROLEVAR= variable
partition rolevar=myVar (train='group a'
                         validate='group b'
                         test='group c');
40 / 115
Customizing the Selection Process
Cross Validation
When data are scarce, setting aside validation or test data is usually not possible.
Cross validation uses part of the training data to fit the model, and a different part to estimate the prediction error.
41 / 115
Customizing the Selection Process
k-Fold Cross Validation
Split the data into k approximately equal-sized parts.
Reserve one part of the data for validation, and fit the model to the remaining k − 1 parts of the data.
Use this model to calculate the prediction error for the reserved part of the data.
Do this for all k parts, and combine the k estimates of the prediction error.
5-Fold cross validation:
42 / 115
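The k-fold procedure described above can be sketched in a few lines (a Python illustration with a single-predictor least squares fit; the data and seed are made up):

```python
import random

random.seed(3)
n, k = 100, 5
data = []
for _ in range(n):
    x = random.uniform(0, 10)
    data.append((x, 2 * x + random.gauss(0, 1)))   # true slope 2, noise sd 1

random.shuffle(data)
folds = [data[i::k] for i in range(k)]             # k roughly equal parts

fold_errors = []
for held_out in folds:
    # fit on the other k-1 parts
    train = [pt for fold in folds if fold is not held_out for pt in fold]
    xbar = sum(x for x, _ in train) / len(train)
    ybar = sum(y for _, y in train) / len(train)
    sxx = sum((x - xbar) ** 2 for x, _ in train)
    b1 = sum((x - xbar) * (y - ybar) for x, y in train) / sxx
    b0 = ybar - b1 * xbar
    # prediction error on the reserved part
    ase = sum((y - (b0 + b1 * x)) ** 2 for x, y in held_out) / len(held_out)
    fold_errors.append(ase)

cv_error = sum(fold_errors) / k                    # combine the k estimates
print(round(cv_error, 2))                          # near the noise variance, 1.0
```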
Customizing the Selection Process
Choice of k
As an estimator of the true prediction error, cross validation tends to have decreasing bias but increasing variance as k increases.
A typical choice of k is 5 or 10.
43 / 115
Customizing the Selection Process
Leave-One-Out Cross Validation
Leave-one-out cross validation is a special case of k-fold cross validation where k = n and n is the total number of observations in the training data set.
Each omitted part consists of one observation.
The predicted residual sum of squares (PRESS) can be efficiently obtained without refitting the model n times.
It is approximately unbiased for the true prediction error but can have high variance because the n "training sets" are so similar to one another.
You can request leave-one-out cross validation by specifying PRESS instead of CV in the SELECT=, CHOOSE=, and STOP= suboptions.
44 / 115
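The no-refitting property of PRESS can be verified for simple linear regression, where the leave-one-out residual equals the ordinary residual divided by 1 − h_ii (h_ii is the observation's leverage). This Python sketch (an illustration, not SAS code) checks the identity against brute-force refitting:

```python
import random

random.seed(4)
n = 30
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

def fit(xp, yp):
    """Simple linear regression: return (intercept, slope)."""
    xbar = sum(xp) / len(xp)
    ybar = sum(yp) / len(yp)
    sxx = sum((x - xbar) ** 2 for x in xp)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xp, yp)) / sxx
    return ybar - b1 * xbar, b1

b0, b1 = fit(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# PRESS from a single fit: scale each residual by 1 - h_ii (its leverage)
press = sum(((y - (b0 + b1 * x)) / (1 - (1 / n + (x - xbar) ** 2 / sxx))) ** 2
            for x, y in zip(xs, ys))

# Brute-force check: refit n times, each time leaving one observation out
brute = 0.0
for i in range(n):
    b0i, b1i = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    brute += (ys[i] - (b0i + b1i * xs[i])) ** 2

print(abs(press - brute) < 1e-6)   # the two agree, with no refitting needed
```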
Customizing the Selection Process
Analysis 3: Traditional Stepwise Selection withCHOOSE=PRESS
proc glmselect data=analysisData testdata=testData
plots=(asePlot Criteria);
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection = stepwise(select=sl choose=press);
run;
The CHOOSE=PRESS option requests that among the models obtained at each step of the selection process, the final selected model be the model that has the smallest leave-one-out predicted residual sum of squares.
45 / 115
Customizing the Selection Process
Criterion Panel
The final selected model is the model that has the smallest leave-one-out predicted residual sum of squares (PRESS).
46 / 115
Customizing the Selection Process
ASE Plot
By choosing the model at step 14, you limit the overfitting of the training data that occurs when selection proceeds beyond this step.
47 / 115
Customizing the Selection Process
Problems in the Traditional Implementations of Forward, Backward, and Stepwise Selection
Traditional implementations of forward, backward, and stepwise selection methods are based on sequential hypothesis testing at the specified significance level. However:
F statistics might not follow an F distribution.
Hence p-values cannot reliably be viewed as probabilities.
A prespecified significance limit is not a data-driven criterion.
Hence the same significance level can cause overfitting for some data and underfitting for other data.
48 / 115
Customizing the Selection Process
Solution: Modify the Selection Process!
Replace the hypothesis-testing-based approach:
Use data-driven criteria such as information criteria, cross validation, or validation data instead of F statistics.
49 / 115
Customizing the Selection Process
Analysis 4: Default Stepwise Selection (SELECT=SBC)
By default, PROC GLMSELECT uses stepwise selection with the SELECT=SBC and STOP=SBC options.
proc glmselect data=analysisData testdata=testData
plots=(asePlot Coefficients Criteria);
class c1 c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2;
run;
Data Set WORK.ANALYSISDATA
Test Data Set WORK.TESTDATA
Dependent Variable y
Selection Method Stepwise
Select Criterion SBC
Stop Criterion SBC
Effect Hierarchy Enforced None
50 / 115
Customizing the Selection Process
Selection Summary

Stepwise Selection Summary
Step | Effect Entered | Effect Removed | Number Effects In | Number Parms In | SBC | ASE | Test ASE
0 Intercept 1 1 14933.9344 81.1588 81.4026
1 x9*c2 2 4 13495.1664 52.7431 52.5172
2 x5*c1 3 12 12802.0131 42.1868 43.7057
3 x1*x2 4 13 12593.9481 39.5841 41.9502
4 x17*c1 5 21 12407.8500 36.7616 37.9282
5 c1 6 28 12374.1967 35.7939 36.9287
6 x1*x10 7 29 12369.0918* 35.6546 36.8501
* Optimal Value Of Criterion
Stepwise selection terminates when adding or dropping any effect increases the SBC statistic (≈ 12369).
Stop Details
Candidate For | Effect | Candidate SBC | Compare SBC
Entry | x1*x9 | 12370.6006 | > 12369.0918
Removal | x1*x10 | 12374.1967 | > 12369.0918
51 / 115
Customizing the Selection Process
Coefficient Progression Plot
Classification effects join the model along with all their levels.
52 / 115
Customizing the Selection Process
Parameter Estimates for the Classification Variable c1
Only levels 2 and 5 of the classification effect c1 contribute appreciably to the model.
53 / 115
Customizing the Selection Process
Analysis 5: Stepwise Selection with a Split ClassificationVariable
A more parsimonious model that has similar or better predictive power might be obtained if parameters that correspond to the levels of c1 are allowed to enter or leave the model independently:
proc glmselect data=analysisData testdata=testData;
class c1(split) c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/orderSelect;
run;
54 / 115
Customizing the Selection Process
Dimensions Table
Dimensions
Number of Effects 278
Number of Effects after Splits 439
Number of Parameters 947
After splitting, 439 split effects are considered for entry or removal at each step of the selection process.
55 / 115
Customizing the Selection Process
Parameter Estimates
Selected Model

Parameter Estimates
Parameter | DF | Estimate | Standard Error | t Value
Intercept 1 2.763669 0.360337 7.67
x9*c2 1 1 6.677365 0.440050 15.17
x9*c2 2 1 13.793766 0.431579 31.96
x9*c2 3 1 21.082776 0.439905 47.93
x5 1 -8.250059 0.353952 -23.31
c1_2 1 6.062842 0.295250 20.53
x1*x2 1 -6.386971 0.519767 -12.29
x17 1 4.801696 0.357801 13.42
c1_5 1 5.053642 0.295384 17.11
x1*x10 1 -1.964001 0.534991 -3.67
The selected model contains only two parameters for c1 instead of all eight levels.
56 / 115
Customizing the Selection Process
Split Model versus Nonsplit Model
The split model provides:
A model that has fewer degrees of freedom (29 versus 10)
Improved prediction performance (ASE on test data: 36.85 versus 36.68)
57 / 115
Customizing the Selection Process
Analysis 6: Stepwise Selection with Internally Partitioned Data and STOP=VALIDATE
proc glmselect data=AnalysisData testdata=TestData;
partition fraction(validate=.25);
class c1(split) c2 c3;
model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
|x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
/selection=stepwise(stop=validate);
run;
The PARTITION statement randomly reserves one-quarter of the observations in AnalysisData for model validation and the rest for model training.
The STOP=VALIDATE suboption requests that the selection process terminate when adding or dropping any effect increases the average square error on the validation data.
58 / 115
Customizing the Selection Process
Number of Observations for Each Role
Observation Profile for Analysis Data
Number of Observations Read 3395
Number of Observations Used 3395
Number of Observations Used for Training 2576
Number of Observations Used for Validation 819
Observation Profile for Test Data
Number of Observations Read 1605
Number of Observations Used 1605
59 / 115
Customizing the Selection Process
Average Square Errors by Roles
Desirable behavior!
60 / 115
Compare Analysis 1–6
COMPARE ANALYSIS 1-6
61 / 115
Compare Analysis 1–6
Summary Slide of All Analyses
Analysis | SELECTION= | Suboptions | CLASS
1. Full least squares | FORWARD | STOP=NONE | C1 C2 C3
2. Traditional | STEPWISE | SELECT=SL | C1 C2 C3
3. Traditional with leave-one-out CV | STEPWISE | SELECT=SL CHOOSE=PRESS | C1 C2 C3
4. Default | STEPWISE | SELECT=SBC | C1 C2 C3
5. Default with split | STEPWISE | SELECT=SBC | C1(SPLIT) C2 C3
6. Default with split and validate | STEPWISE | SELECT=SBC STOP=VALIDATE | C1(SPLIT) C2 C3
62 / 115
Compare Analysis 1–6
Predictive Performance Comparison
Analysis | Effects | Parms | Exact Effects Contained | Train ASE | Test ASE
1. Full least squares | 274 | 834 | 5 | 26.73 | 49.28
2. Traditional | 26 | 231 | 3 | 32.22 | 40.78
3. Traditional with leave-one-out CV | 14 | 58 | 3 | 34.74 | 37.70
4. Default | 6 | 28 | 2 | 35.65 | 36.85
5. Default with split | 7 | 9 | 5 | 35.77 | 36.68
6. Default with split and validate | 6 | 9 | 5 | 34.72 | 36.78
True | 5 | 8 | 5 | 35.96 | 36.73
63 / 115
Compare Analysis 1–6
Predictive Performance Comparison
Analysis 5 and Analysis 6:
Yield more parsimonious models
Capture all the effects in the true model
Have good predictive performance on the test data set
64 / 115
Compare Analysis 1–6
Careful and Informed Use of Subset Selection Methods Is OK!
Despite the difficulties, careful and informed use of traditional variable selection methods still has its place in data analysis.
Example: Foster and Stine (2004) use a modified version of stepwise selection to build a predictive model for bankruptcy from over 67,000 possible predictors and show that this method yields a model whose predictions compare favorably with those of other recently developed data mining tools.
65 / 115
Compare Analysis 1–6
Subset Selection Is Bad
"Stepwise variable selection has been a very popular technique for many years, but if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing."
—Frank Harrell, Regression Modeling Strategies, 2001
66 / 115
Compare Analysis 1–6
Problems with Subset Selection Methods
Problems arise in both variable selection and coefficient estimation:
Algorithms are greedy. They make the best change at each step, regardless of future effects.
The coefficients and predictions are unstable, especially when there are correlated predictors or the number of input variables greatly exceeds the number of observations.
Alternative: Penalized regression methods
67 / 115
Penalized Regression Methods
PENALIZED REGRESSION METHODS
68 / 115
Penalized Regression Methods
Shrinkage and Penalty Methods
Penalized regression methods often introduce bias, but they improve the prediction accuracy because of the bias–variance trade-off.
Commonly used shrinkage and penalty methods include:
Ridge regression
LASSO (Tibshirani 1996)
Adaptive LASSO (Zou 2006)
Elastic net (Zou et al. 2005)
69 / 115
Penalized Regression Methods
Shrinkage and Penalty Methods
The regression estimate is defined as the value of β that minimizes

∑_{i=1}^{n} ( y_i − (Xβ)_i )^2

Ridge: subject to ∑_{j=1}^{p} β_j^2 ≤ t1

LASSO: subject to ∑_{j=1}^{p} |β_j| ≤ t2

Elastic net: subject to ∑_{j=1}^{p} β_j^2 ≤ t1 and ∑_{j=1}^{p} |β_j| ≤ t2

where t1 and t2 are the penalty parameters.
70 / 115
Penalized Regression Methods
Motivation for Ridge Regression
When the number of input variables exceeds the number of observations, least squares estimation has the following drawbacks:
  - Estimates are not unique.
  - The resulting model heavily overfits the data.
When there are correlated variables, the least squares estimates have high variances.
These call for extended statistical methodologies!
71 / 115
Penalized Regression Methods
Ridge Regression
arg min_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )^2 + λ ∑_{j=1}^{p} β_j^2

There is a one-to-one correspondence between λ and t1.
  - As λ → 0, β_ridge → β_OLS
  - As λ → ∞, β_ridge → 0
Estimates shrink toward zero but never reach zero, so ridge regression does not provide variable selection.
Introduces bias, but reduces the variance of the estimate
Most useful in the presence of multicollinearity
Solution has a closed analytical form
Available in PROC REG
72 / 115
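For one centered predictor, the ridge estimate has the closed form b(λ) = S_xy / (S_xx + λ), which makes the shrinkage behavior easy to see (a Python illustration with made-up data, not from the slides):

```python
# xs are already centered; ys roughly follow y = 2x
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

for lam in [0.0, 1.0, 10.0, 1e6]:
    b = sxy / (sxx + lam)              # ridge estimate for this lambda
    print(lam, round(b, 4))            # shrinks toward 0, never exactly 0
```

At λ = 0 this is the OLS slope; as λ grows the coefficient approaches, but never reaches, zero.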
Penalized Regression Methods
Shrinkage and Penalty Methods in PROC GLMSELECT
LASSO and adaptive LASSO are available in PROC GLMSELECT
Simultaneous estimation and variable selection techniques
Effectively perform variable selection by modifying the coefficient estimation and reducing some coefficients to zero
73 / 115
Penalized Regression Methods
Defining the LASSO
For a given tuning parameter t ≥ 0,

arg min_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )^2   subject to   ∑_{j=1}^{p} |β_j| ≤ t

The parameter t ≥ 0 controls the amount of shrinkage.
t < ∑_{j=1}^{p} |β_j^OLS| causes shrinkage of the solutions toward 0.
74 / 115
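In the special case of an orthonormal design, the LASSO solution is soft thresholding of the OLS coefficients: shrink each coefficient by the penalty and cut it to zero if it would change sign. A Python sketch with hypothetical coefficient values (an illustration only; PROC GLMSELECT actually uses the LARS algorithm):

```python
import math

def soft_threshold(b_ols, lam):
    """LASSO estimate for an orthonormal design: shrink, then cut to zero."""
    mag = max(abs(b_ols) - lam, 0.0)
    return math.copysign(mag, b_ols) if mag > 0 else 0.0

ols = [3.2, -1.5, 0.4, -0.2]           # hypothetical OLS coefficients
lam = 0.5
lasso = [soft_threshold(b, lam) for b in ols]
print([round(b, 4) for b in lasso])    # small coefficients are set exactly to 0
```

This is how the LASSO performs estimation and variable selection simultaneously: coefficients smaller than the penalty are removed from the model.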
Penalized Regression Methods
Geometric Interpretation
The solid blue areas are constraint regions, and the red ellipses are the contours of the least squares error function.
75 / 115
Penalized Regression Methods
Penalty Parameter Must Be Set to Obtain the Final Solution!
How to Determine the Penalty Parameter?
Use criteria based on the likelihood function, such as Adj R-Sq, CP, AIC, AICC, BIC, and SBC.
Use criteria based on estimating the true prediction error, such as a validation data set or cross validation techniques.
76 / 115
Penalized Regression Methods
Prostate Data
The data come from a study by Stamey et al. (1989):
97 observations
The response variable is the level of prostate-specific antigen (lpsa).
Predictors are the following clinical measures:
  - log of cancer volume (lcavol)
  - log of prostate weight (lweight)
  - age (age)
  - log of the amount of benign prostatic hyperplasia (lbph)
  - seminal vesicle invasion (svi)
  - log of capsular penetration (lcp)
  - Gleason score (gleason)
  - percentage of Gleason scores of 4 or 5 (pgg45)
77 / 115
Penalized Regression Methods
Prostate Data
data Prostate;
input lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
datalines;
-0.58 2.769 50 -1.39 0 -1.39 6 0 -0.43
-0.99 3.32 58 -1.39 0 -1.39 6 0 -0.16
-0.51 2.691 74 -1.39 0 -1.39 7 20 -0.16
.
. more data lines
.
2.883 3.774 68 1.558 1 1.558 7 80 5.478
3.472 3.975 68 0.438 1 2.904 7 20 5.583
;
78 / 115
Penalized Regression Methods
LASSO Using PROC GLMSELECT
proc glmselect data=prostate plots=all;
model lpsa=lcavol lweight age lbph svi lcp gleason pgg45
/selection=lasso(stop=none choose=sbc);
run;
Note that the SELECT= suboption is not valid with the LAR and LASSO methods.
The PLOTS=ALL option requests all the plots that come with the analysis.
79 / 115
Penalized Regression Methods
LASSO Coefficients
80 / 115
Penalized Regression Methods
Fit Criteria for LASSO Selection
81 / 115
Penalized Regression Methods
LASSO Estimates
Estimates are calculated using the least angle regression (LARS) algorithm (Efron et al. 2004).
Standard errors of the coefficients are not immediately available.
82 / 115
Penalized Regression Methods
Adaptive LASSO
LASSO has a non-ignorable bias when it estimates the nonzero coefficients (Fan and Li 2001).
Adaptive LASSO produces unbiased estimates by allowing a relatively higher penalty for zero coefficients and a lower penalty for nonzero coefficients (Zou 2006).
83 / 115
Penalized Regression Methods
Adaptive LASSO
Modification of LASSO selection:

arg min_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )^2   subject to   ∑_{j=1}^{p} |β_j| / |β̂_j| ≤ t

Adaptive weights (1/|β̂_j|) are applied to each parameter in forming the LASSO constraint, to control shrinking the zero coefficients more than the nonzero coefficients.
84 / 115
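In the same orthonormal special case used for the LASSO sketch earlier, the adaptive weights turn into coefficient-specific thresholds λ/|β̂_j|, so small OLS coefficients are penalized much harder than large ones. A Python sketch with the same hypothetical coefficients (an illustration, not the PROC GLMSELECT implementation):

```python
import math

def adaptive_soft_threshold(b_ols, lam):
    """Adaptive LASSO in the orthonormal case: the penalty for each
    coefficient is scaled by 1/|b_ols|, so small coefficients are hit harder."""
    if b_ols == 0.0:
        return 0.0
    mag = max(abs(b_ols) - lam / abs(b_ols), 0.0)
    return math.copysign(mag, b_ols) if mag > 0 else 0.0

ols = [3.2, -1.5, 0.4, -0.2]           # the same hypothetical OLS coefficients
lam = 0.5
adaptive = [adaptive_soft_threshold(b, lam) for b in ols]
print([round(b, 4) for b in adaptive]) # large coefficients are barely shrunk
```

Compared with plain soft thresholding, the large coefficients keep almost their full OLS values while the small ones still drop to zero, which is the source of the reduced bias.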
Penalized Regression Methods
Adaptive LASSO Using PROC GLMSELECT
proc glmselect data=prostate plots=all;
model lpsa=lcavol lweight age lbph svi lcp gleason pgg45
/selection=lasso(adaptive stop=none choose=sbc);
run;
For the β̂_j's in the adaptive weights:
By default, ordinary least squares estimates of the parameters in the model are used.
You can use the INEST= option to name a SAS data set that contains estimates for the β̂_j's.
85 / 115
Penalized Regression Methods
Adaptive LASSO Coefficients
Although the solution paths are slightly different, LASSO and adaptive LASSO agree on the final model that is chosen by the SBC.
86 / 115
Penalized Regression Methods
LASSO or Adaptive LASSO?
Adaptive LASSO enjoys the computational advantage of the LASSO.
Because of the variance and bias trade-off, adaptive LASSO might not result in optimal prediction performance (Zou 2006).
Hence LASSO can still be advantageous in difficult prediction problems.
87 / 115
Penalized Regression Methods
Prefer Modern over Traditional Selection Methods
Effectively perform variable selection by modifying the coefficient estimation and reducing some coefficients to 0
Improve stability and prediction
Supported by much theoretical work
88 / 115
Special Methods
SPECIAL METHODS:
MODEL AVERAGING
SELECTION FOR NONPARAMETRIC MODELS WITH SPLINE EFFECTS
89 / 115
Special Methods Model Averaging
MODEL AVERAGING
90 / 115
Special Methods Model Averaging
Model Averaging
Another way to deal with the shortcomings of traditional selection methods is based on model averaging.
Model averaging repeats model selection for multiple training sets and uses the average of the selected models for prediction. It can provide:
More stable inferences
Less selection bias
91 / 115
Special Methods Model Averaging
Model Averaging with the Bootstrap Method
Sample data with replacement.
Select a model for each sample.
Average the predictions across the samples.
Form an average model by averaging parameters across the samples:
  - Predictions with the average model = average of predictions
  - Averaging shrinks infrequently selected parameters
  - Disadvantage: The average model is usually not parsimonious
92 / 115
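The bootstrap recipe above can be sketched as follows (a toy Python illustration, not the MODELAVERAGE statement: two candidate one-predictor models, selection by RSS on each bootstrap sample, with simulated data):

```python
import random

random.seed(5)
n = 80
rows = []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    rows.append((x1, x2, 2 * x1 + 0.3 * x2 + random.gauss(0, 1)))

def fit_one(sample, j):
    """Least squares on predictor j alone; returns (b0, b1) and the RSS."""
    xbar = sum(r[j] for r in sample) / len(sample)
    ybar = sum(r[2] for r in sample) / len(sample)
    sxx = sum((r[j] - xbar) ** 2 for r in sample)
    b1 = sum((r[j] - xbar) * (r[2] - ybar) for r in sample) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((r[2] - (b0 + b1 * r[j])) ** 2 for r in sample)
    return (b0, b1), rss

counts = {0: 0, 1: 0}        # how often each candidate model is selected
preds = []                   # per-sample prediction at the point x = 1
for _ in range(500):
    boot = [random.choice(rows) for _ in range(n)]   # sample with replacement
    fits = [fit_one(boot, j) for j in (0, 1)]
    j = 0 if fits[0][1] < fits[1][1] else 1          # select the better model
    counts[j] += 1
    b0, b1 = fits[j][0]
    preds.append(b0 + b1 * 1.0)

print(counts)                              # selection frequencies
print(round(sum(preds) / len(preds), 2))   # model-averaged prediction at x = 1
```

The selection frequencies play the same role as the model selection frequency table shown later for the prostate data.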
Special Methods Model Averaging
Model Averaging for Prostate Data
A popular way to obtain a parsimonious average model is to form the average of just the frequently selected models.
proc glmselect data=prostate plots=all;
model lpsa=lcavol lweight age lbph svi lcp gleason pgg45
/selection=lasso(stop=none choose=sbc);
modelAverage nsamples=10000 subset(best=1);
run;
SUBSET(BEST=1) specifies that only the most frequently selected model be used in forming the average model.
93 / 115
Special Methods Model Averaging
Model Selection Frequency

  Times     Selection   Number of   Frequency
  Selected  Percentage  Effects     Score      Effects in Model
  1416      14.16       4           1417       Intercept lcavol lweight svi
*  886       8.86       5            886.9     Intercept lcavol lweight svi pgg45
*  882       8.82       6            882.8     Intercept lcavol lweight lbph svi pgg45
*  842       8.42       7            842.8     Intercept lcavol lweight age lbph svi pgg45
*  532       5.32       8            532.7     Intercept lcavol lweight age lbph svi gleason pgg45
*  525       5.25       5            525.9     Intercept lcavol lweight lbph svi
*  524       5.24       7            524.8     Intercept lcavol lweight age lbph svi gleason
*  468       4.68       9            468.7     Intercept lcavol lweight age lbph svi lcp gleason pgg45
*  451       4.51       8            451.7     Intercept lcavol lweight age lbph svi lcp pgg45
*  421       4.21       6            421.8     Intercept lcavol lweight age lbph svi
*  299       2.99       7            299.8     Intercept lcavol lweight lbph svi gleason pgg45
*  270       2.70       5            270.9     Intercept lcavol lweight svi gleason
*  259       2.59       5            259.9     Intercept lcavol lweight age svi
*  254       2.54       6            254.8     Intercept lcavol lweight lbph svi gleason
*  189       1.89       5            189.8     Intercept lcavol lweight svi lcp
*  156       1.56       8            156.7     Intercept lcavol lweight age lbph svi lcp gleason
*  153       1.53       6            153.8     Intercept lcavol lweight svi gleason pgg45
*  148       1.48       6            148.8     Intercept lcavol lweight age svi pgg45
*  136       1.36       7            136.8     Intercept lcavol lweight lbph svi lcp pgg45
*  119       1.19       7            119.7     Intercept lcavol lweight age lbph svi lcp
* Not Used in Model Averaging
Models with the regressors “Intercept, lcavol, lweight, svi” are appropriate for these data.
94 / 115
Special Methods Model Averaging
Average Parameter Estimates

                                                                Estimate Quantiles
Parameter   Number    Non-zero    Mean       Standard
            Non-zero  Percentage  Estimate   Deviation   25%        Median     75%
Intercept   1416      100.00      -0.023901  0.673999    -0.450106  -0.030497  0.462740
lcavol      1416      100.00       0.481555  0.067986     0.435344   0.481252  0.528441
lweight     1416      100.00       0.482178  0.177453     0.351844   0.485739  0.592714
svi         1416      100.00       0.508385  0.189062     0.373536   0.499117  0.640241
95 / 115
Special Methods Model Averaging
[Figure: Parameter Estimate Distributions for lpsa (number of samples = 10,000). Histograms of the 1,416 retained bootstrap estimates for Intercept, lcavol, lweight, and svi.]
You can interpret the range between the 5th and 95th percentiles of each estimate as an approximate 90% confidence interval for that estimate.
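As a small illustration of that percentile interpretation (the numbers below are simulated stand-ins, not the actual lcavol estimates):

```python
import numpy as np

# simulated stand-in for the 1,416 bootstrap estimates of one coefficient
rng = np.random.default_rng(0)
boot = rng.normal(loc=0.48, scale=0.07, size=1416)

lo, hi = np.percentile(boot, [5, 95])     # 5th and 95th percentiles
print(f"approximate 90% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```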
96 / 115
Special Methods Model Averaging
[Figure: Effects Selected in at Least 20% of the Samples for lpsa. Bar chart of selection percentage by effect, in decreasing order: lcavol, lweight, svi, lbph, pgg45, age, gleason, lcp.]
You can build another parsimonious model by using the frequency of effect selection as a measure of effect importance.
97 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
SELECTION FOR NONPARAMETRIC MODELS
WITH SPLINE EFFECTS
98 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Noisy Sinusoidal Data
The true response function might be a complicated nonlinear transformation of the inputs.
99 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Moving beyond Linearity
One way to incorporate this nonlinearity into the model:
Create additional variables that are transformations of the input variables.
Use these additional variables to form the new design matrix.
Use linear models in this new space of derived inputs.
100 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
The EFFECT Statement in PROC GLMSELECT
Enables you to construct special collections of columns for design matrices.
Provides support for splines of any degree, including the cubic B-spline basis (default).
I A spline function is a piecewise polynomial function in which the individual polynomials have the same degree and connect smoothly at join points (knots).
I You can associate local features in your data with particular B-spline basis functions.
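To make the basis concrete, here is a NumPy-only sketch of a clamped cubic B-spline basis built with the Cox-de Boor recursion. It mirrors, but does not reproduce, what the EFFECT statement constructs for four equally spaced internal knots on [0, 1]; the function name and knot choices are illustrative:

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function of the given degree at the
    points x, via the Cox-de Boor recursion. `knots` is a nondecreasing
    knot vector with boundary knots repeated degree+1 times (clamped)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # degree-0 bases: indicators of the half-open knot spans
    B = np.array([(t[i] <= x) & (x < t[i + 1])
                  for i in range(len(t) - 1)], dtype=float)
    for d in range(1, degree + 1):
        B_new = np.zeros((len(t) - d - 1, x.size))
        for i in range(len(t) - d - 1):
            left, right = t[i + d] - t[i], t[i + d + 1] - t[i + 1]
            if left > 0:                  # skip zero-width (repeated) knots
                B_new[i] += (x - t[i]) / left * B[i]
            if right > 0:
                B_new[i] += (t[i + d + 1] - x) / right * B[i + 1]
        B = B_new
    return B.T                            # one column per basis function

# four equally spaced internal knots on [0, 1] -> 8 cubic basis functions
internal = np.linspace(0, 1, 6)[1:-1]
knots = np.concatenate([np.zeros(4), internal, np.ones(4)])
basis = bspline_basis(np.linspace(0, 0.999, 50), knots)
print(basis.shape)                                # (50, 8)
print(bool(np.allclose(basis.sum(axis=1), 1.0)))  # partition of unity: True
```

Each column is nonzero only near its own knots, which is why selection on these columns can pick up local features in the data.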
101 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Selection Using Spline Effects
Spline function bases provide a computationally convenient and flexible way to specify a rich set of basis functions.
Variable selection can be useful for obtaining a parsimonious subset toprevent overfitting.
102 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Smoothing with Spline Effect
proc glmselect data=Sine;
effect spl = spline(x/knotmethod=equal(4) split);
model noisySine = spl;
output out=sineOut p=predicted;
run;
The EFFECT statement creates a constructed effect named “spl” that consists of the eight cubic B-spline basis functions that correspond to the four equally spaced internal knots.
Out of eight B-splines, five are selected.
103 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Smoothing Noisy Sinusoidal Data
A B-spline basis with about four internal knots is appropriate.
104 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Noisy Sinusoidal Data with Bumps
Can you capture the bumps with a finer set of knots?
105 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Noisy Sinusoidal Data with Bumps
You can capture the bumps at the expense of overfitting the data in the regions between the bumps.
106 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Solution: B-Spline Bases at Multiple Scales
The following statements perform effect selection from several sets of B-spline bases that correspond to different scales in the data.
proc glmselect data=DoJoBumps;
effect spl = spline(x/knotmethod=multiscale(endscale=8)
split details);
model noisyBumpsSine = spl;
run;
The ENDSCALE=8 option requests that the finest scale use cubic B-splines defined on 2^8 equally spaced knots in the interval [0, 1].
Out of 548 B-splines, 27 are selected.
107 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
Smoothing Noisy Sinusoidal Data with Bumps
Accurately captures both the low-frequency sinusoidal baseline and thebumps, without overfitting the regions between the bumps.
108 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
PROC ADAPTIVEREG in SAS/STAT 12.1
The multivariate adaptive regression splines method is a nonparametric approach for modeling high-dimensional data:
Introduced by Friedman (1991)
Combines both regression splines and model selection methods
Doesn’t require you to specify knots to construct regression spline terms
Automatically models nonlinearities and interactions
109 / 115
Special Methods Selection for Nonparametric Models with Spline Effects
PROC QUANTSELECT in SAS/STAT 12.1 (Experimental)
The QUANTSELECT procedure performs model selection for quantile regression:
Forward, backward, stepwise, and LASSO selection methods
Variable selection criteria: AIC, SBC, AICC, and so on
Variable selection for both quantiles and the quantile process
EFFECT statement for constructed model effects (splines)
Experimental in 12.1
110 / 115
Summary
SUMMARY
111 / 115
Summary
Summary
The GLMSELECT procedure supports a variety of model selection methods for general linear models:
Traditional model selection methods
Modern selection methods
Model averaging
Nonparametric modeling by using spline effects
In doing so, it offers:
Extensive capabilities for customizing the selection
Flexibility and power in specifying complex potential effects
112 / 115
Summary
Back to Learning Objectives
You learned:
Problems with the traditional selection methods
Modern penalty-based methods, including LASSO and adaptive LASSO, as alternatives to traditional methods
Bootstrap-based model averaging to reduce selection bias and improve predictive performance
You learned how to:
Use model selection diagnostics, including graphics, for detecting problems
Use validation data to detect and prevent under-fitting and over-fitting
Customize the selection process using the features of the GLMSELECT procedure
113 / 115
Summary
Useful References
Cohen, R. 2006. “Introducing the GLMSELECT Procedure for Model Selection.” Proceedings of SAS Global Forum 2006 Conference. Cary, NC: SAS Institute Inc.
Cohen, R. 2009. “Applications of GLMSELECT Procedure for Megamodel Selection.” Proceedings of SAS Global Forum 2009 Conference. Cary, NC: SAS Institute Inc.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. 2004. “Least Angle Regression (with Discussion).” Annals of Statistics 32:407–499.
Eilers, P. H. C. and Marx, B. D. 1996. “Flexible Smoothing with B-Splines and Penalties (with Discussion).” Statistical Science 11:89–121.
Fan, J. and Li, R. 2001. “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association 96:1348–1360.
Foster, D. P. and Stine, R. A. 2004. “Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy.” Journal of the American Statistical Association 99:303–313.
Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” Annals of Statistics 19:1–67.
Harrell, F. 2001. Regression Modeling Strategies. New York: Springer-Verlag.
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning. New York: Springer-Verlag.
114 / 115
Summary
Useful References
Miller, A. 2002. Subset Selection in Regression. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B 58:267–288.
Zou, H. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101:1418–1429.
Zou, H. and Hastie, T. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society, Series B 67:301–320.
115 / 115