Interactive Model Building with JMP Pro · 2019-11-22
TRANSCRIPT
Copyright © 2015 SAS Institute Inc. All rights reserved.
Interactive Model Building with JMP Pro
Clay Barker – [email protected]
Crotty – [email protected]
Stephens – [email protected]
What is JMP (and JMP Pro)?
- Statistical Discovery Software from SAS
- Developed in 1989 as a desktop application
- Comprehensive: basic statistics and graphical summaries, plus advanced tools and techniques
- Extendible: powerful scripting language, application and add-in builders, integrates with Excel, R, MATLAB, and SAS
- Visual, dynamic, and interactive
- JMP Pro: advanced tools for analytics and modeling
Outline
- Introduction to JMP Software
- A Motivating Example: Boston Home Prices
- The Modeling Process
- Simple and Multiple Linear Regression
- Classification and Regression Trees
- Advanced Tree Methods
- Short Break
- Penalized Regression Techniques
- Return to Boston Home Prices
- Discussion and Q&A
The Modeling Process
- Explore data (know the data, identify key features): one variable at a time, two variables at a time, many variables at a time
- Identify potential data quality issues
- Prepare data for modeling: missing values, data cleanup (recode, binning), transformations, create a validation column (holdout sets)
- Model building, selection, and comparison
Example: Boston Home Building Value
Tools used: Graph Builder and geographic mapping, Data Filter, Columns Viewer, Distribution, Multivariate, Create a Validation Column, regression, regression trees, neural networks.
Situation: predict total value for single-family owner-occupied homes in a Boston neighborhood.
Data: 2014 appraisal information on over 25K Boston homes, publicly available from https://data.cityofboston.gov
Classification and Regression Trees
Overview
Why Trees?
Algorithm and Decision Tree Example
Column Contributions
Cross Validation
Bootstrap Forest
Boosted Tree
Why Trees?
- Easy to interpret and explain: not a “black box” model; can be represented with a tree (flow-chart) diagram; interpretable as a series of “if-then” statements that lead to a classification or a prediction.
- Flexible for response types: categorical response => classification tree; continuous response => regression tree.
- Flexible for input factor types: categorical factors get split into two groups of levels; continuous factors get split based on a cutting value.
Algorithm
Build a tree by making optimal splits:
- Examine candidate splits across all factors.
- Split on the factor whose best split optimizes the split criterion.
- Repeat these steps until you decide to stop splitting or a stopping criterion is reached.
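The single best split on one continuous factor can be sketched in a few lines. This toy version (an illustration, not JMP's implementation) scans every candidate cut point and keeps the one that maximizes the between-group sum of squares, which is equivalent to minimizing the within-group SSE:

```python
# Toy split search for a regression tree on one continuous factor.
def best_split(x, y):
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    grand_mean = sum(ys) / n
    best = (None, -1.0)  # (cut value, between-group SS)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot cut between equal x values
        left, right = ys[:i], ys[i:]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        # between-group SS = sum over groups of n_g * (group mean - grand mean)^2
        ss = len(left) * (ml - grand_mean) ** 2 + len(right) * (mr - grand_mean) ** 2
        if ss > best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, ss)
    return best

# Hypothetical data with two clearly separated groups:
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
cut, ss = best_split(x, y)
print(cut)  # 6.5: the cut cleanly separates the low and high groups
```

A full tree-builder would call this recursively on each resulting leaf, which is exactly the "repeat until a stopping criterion" step above.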
Basic Example
Corn data (142 observations). Predict yield from nitrate: what is the optimal split of nitrate values to maximize our ability to predict yield? Equivalently, what split maximizes the ANOVA model sum of squares?
Prediction after one split: for nitrate < 26.32, predict a yield of 6756; for nitrate ≥ 26.32, predict a yield of 8606.
Basic Example
- A single split results in an R2 of 22.8%. Let’s continue splitting to improve our predictions.
- Split the left-side leaf at nitrate < 11.44. R2 increases to 31.3%.
- Keep splitting until we hit the minimum split size…
- Splitting terminates at 21 splits, with R2 up to 45.3%. This leads to an unwieldy model.
- We used all the observations in our data to fit the model. That risks bad predictions for new data because of overfitting, and we can’t test how well our model fits new observations.
Basic Example Tree View
Column Contributions
Simple trees are intuitive; complex trees might not be, and advanced tree methods don’t have a single tree representation. We want a sense of which variables most influence our response variable. This is especially true when doing exploratory (rather than predictive) modeling. The Column Contributions report shows each column’s contribution to the fit.
Cross Validation
We want to avoid overfitting our data, because overfitting leads to bad predictions for new observations.
Cross validation is one way to avoid overfitting.
We focus on one type of cross validation for trees: Holdback sets (Training/Validation/(Test))
Holdback Cross Validation
Cross validation refers to randomly dividing our data set into training and validation sets:
- Use the training set to estimate model parameters.
- Use the validation set to evaluate how well the model fits; this estimates how well the model will fit new observations.
- Retain the model that fits best on the validation set.
The simplest case uses a single training set and a single validation set.
Training and Validation Set
Suppose we have 150 observations. One strategy would be to use the first 100 for training and the last 50 for validation.
For each split on the training data, we measure the tree’s ability to predict on the validation data.
The tree/model with the best R2 for the validation set is our “best” model.
[Figure: the design matrix $X$ and response $y$ partitioned by rows. Rows 1–100 ($x_{1,1} \ldots x_{1,p}$ through $x_{100,1} \ldots x_{100,p}$, with $y_1, \ldots, y_{100}$) form the training set; rows 101–150 ($x_{101,1} \ldots x_{101,p}$ through $x_{150,1} \ldots x_{150,p}$, with $y_{101}, \ldots, y_{150}$) form the validation set.]
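In practice the holdback rows are usually chosen at random rather than taken as the first block. A minimal sketch (names and fraction are illustrative, not JMP's interface):

```python
import random

# Random holdback split: shuffle row indices, hold out a fraction for validation.
def holdback_split(n, frac_validation, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(round(n * frac_validation))
    return idx[n_val:], idx[:n_val]  # (training indices, validation indices)

train, val = holdback_split(150, 1 / 3)
print(len(train), len(val))  # 100 50
```

The returned index lists are disjoint, so each observation is used either to fit the model or to evaluate it, never both.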
Training and Validation Example
Back to the Corn data (142 observations):
- Randomly select 42 obs. for validation; use the rest for training.
- For each additional split, calculate the validation R2 for the model.
- Continue until the validation R2 fails to improve for 10 consecutive splits.
- Select the model defined by the maximum validation R2.
Training, Validation, and Test
Another cross validation option is to divide the data into three sets: a training set, a validation set, and a test set. The training and validation sets are used as previously illustrated. We then evaluate our final model on the test set to get an independent assessment of predictive performance. The test set allows us to compare models against each other on data that has not been used in the development of the models; recall that the validation set was used in model development.
Possible Pitfalls of One Validation Set
Model selection can be sensitive to the particular validation set chosen. This is especially concerning with limited data.
Advanced Tree Methods in JMP Pro
Extension: Bootstrap Forest
Why build only one tree when you can build a forest? The bootstrap forest method builds many trees and averages their predicted values to get the final predictions.
How do we build many trees on one data set?
- Each tree is built on a bootstrap sample (sampled with replacement).
- Each split on each tree considers only a random sample of candidate columns for splitting.
- This is also known as bootstrap aggregating (bagging).
Validation can be used to control the number of trees; using validation allows you to use early stopping rules.
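The two sources of randomness described above can be sketched directly; this illustration (not JMP's implementation) just draws the row sample and the column sample:

```python
import random

# Bootstrap forest randomness: rows sampled WITH replacement per tree,
# columns sampled WITHOUT replacement per split.
def bootstrap_rows(n, rng):
    return [rng.randrange(n) for _ in range(n)]  # some rows repeat, some never appear

def candidate_columns(p, n_sampled, rng):
    return rng.sample(range(p), n_sampled)  # distinct columns considered at one split

rng = random.Random(1)
rows = bootstrap_rows(10, rng)
cols = candidate_columns(6, 2, rng)
print(sorted(set(rows)))  # the distinct rows this tree actually sees
print(cols)
```

Averaging predictions over trees grown on many such samples is what reduces variance relative to a single deep tree.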
Extension: Bootstrap Forest
What are some options we can specify?
- Number of trees (to build and average over)
- Number of columns sampled at each split
- Minimum and maximum splits in each tree
- Early stopping (if using validation)
What output do we get?
- Individual trees can be viewed, but they are hard to interpret visually.
- R2 for validation and test sets.
- The Column Contributions report is useful as well.
- Continuous responses get residual diagnostic plots; categorical responses get ROC and lift curves.
Extension: Bootstrap Forest
Extension: Boosted Tree
Rather than build a forest of trees and average them, why not build trees sequentially and add them together? Boosted trees do just that:
- Build a small tree and get the residuals.
- Then build another tree on those residuals.
- Repeat this process and then add all the small trees together.
- The final prediction is the sum of the estimates for each terminal node.
Note: for categorical responses, JMP only supports responses with 2 levels.
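The fit-to-residuals loop can be shown with a toy one-predictor example. This sketch (an illustration of the idea, not JMP's algorithm) uses a fixed-cut stump as the small tree and a learning rate to damp each stage:

```python
# Toy boosting: each stage fits a stump to the current residuals; the
# final prediction is the learning-rate-scaled sum of all stages.
def fit_stump(x, y):
    cut = sorted(x)[len(x) // 2]  # crude cut at the median x
    left = [yi for xi, yi in zip(x, y) if xi < cut]
    right = [yi for xi, yi in zip(x, y) if xi >= cut]
    ml = sum(left) / len(left) if left else 0.0
    mr = sum(right) / len(right) if right else 0.0
    return lambda xi: ml if xi < cut else mr

def boost(x, y, n_stages=50, rate=0.1):
    stages = []
    resid = list(y)
    for _ in range(n_stages):
        stump = fit_stump(x, resid)       # fit the residuals, not y
        stages.append(stump)
        resid = [r - rate * stump(xi) for xi, r in zip(x, resid)]
    return lambda xi: sum(rate * s(xi) for s in stages)

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
f = boost(x, y)
print(round(f(2), 2), round(f(11), 2))  # approaches the two group means
```

After enough stages the summed stumps converge toward the group means, which is why a small learning rate needs more stages but tends to overfit less per stage.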
Extension: Boosted Tree
What are some options we can specify?
- Number of layers of fits (stages)
- Number of splits per tree
- Learning rate (between 0 and 1)
- Overfit penalty (to avoid predicted probabilities of 0 with categorical responses)
- Minimum split size
- Early stopping (if using validation)
Extension: Boosted Tree
What output do we get?
- R2 for validation and test sets
- Confusion matrix for categorical responses
- Cumulative validation plot (fit statistic vs. stage number)
- Individual trees can be viewed, but that is unwieldy; the Column Contributions report is useful again!
Extension: Boosted Tree
Comparison of Tree Models
Compare results for the Boston Home Prices data:
Parting Thoughts
Trees are flexible and (usually) interpretable, or at least fairly easy to explain conceptually.
JMP offers decision trees. JMP Pro extends them with Bootstrap Forest and Boosted Trees.
All of these methods can be evaluated with the Column Contributions report and the Model Comparison tool.
Short Break
Penalized Regression Techniques
The Diabetes data
Suppose that we want to use information like age, gender, cholesterol, … to predict the progression of diabetes over one year.

$E(Y_i) = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Gender}_i + \beta_3 \text{BMI}_i + \ldots = \boldsymbol{x}_i \beta$
The Diabetes model
We have recorded 10 different attributes. Including interactions and quadratic terms, we end up with a model that contains 65 coefficients. We can use a subset of the data to estimate the 65 regression coefficients and use the remainder of the data to evaluate the model’s performance.
Training set: $X$ is a $266 \times 65$ matrix, $Y$ is a $266 \times 1$ vector. Test set: 64 observations.
Least squares estimation
We often fit our model using least squares (LS) estimation:

$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \boldsymbol{x}_i \beta)^2$

LS is equivalent to maximum likelihood when we assume that our errors are independent and normally distributed. This minimization gives us the familiar estimate $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X$ is our $n \times p$ design matrix ($266 \times 65$ here) and $y$ is our $n \times 1$ observed response vector ($266 \times 1$ here).
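The closed form is easy to check on simulated data. A quick sketch (simulated numbers, not the diabetes data):

```python
import numpy as np

# Check the closed-form LS estimate beta_hat = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])          # true coefficients (made up)
y = X @ beta + rng.normal(scale=0.1, size=n)

# Solve the normal equations rather than inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 1))  # recovers roughly [1.0, -2.0, 0.5]
```

With low noise and many more observations than parameters, the estimate lands very close to the truth; the overfitting trouble on the next slide comes from the opposite regime, 65 terms estimated from 266 rows.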
So how did we do?
For the portion of the data used for estimation, our model fits pretty well. For the set of data not involved in estimation, our model fits pretty poorly. This looks like a classic case of overfitting!
Overfitting
Overfitting occurs when our model is more complex than necessary and it starts to model random noise in the data instead of the underlying relationships.
When we overfit… our model may fit very well on the data we observed, but fail dramatically when predicting new observations.
Did we really need all 65 terms in our model?
Would something other than Least Squares be better?
Prediction error
Suppose that we observe data of the form

$Y_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, n$

where $\epsilon_i \sim N(0, \sigma^2)$, $x_i$ is a $p \times 1$ vector, and $f(\cdot)$ is a function that describes the true relationship between the response and predictors. We fit a model $\hat{f}$ and want to predict a new point $Y_{n+1}$ given $x_{n+1}$. Prediction error is one way to measure how well our model will predict the new point:

$\text{PE}(\hat{f}(x_{n+1})) = E[(y_{n+1} - \hat{f}(x_{n+1}))^2]$
Rewriting the prediction error
We can get the prediction error into a familiar form:

$\text{PE}(\hat{f}(x_{n+1})) = E[(y_{n+1} - \hat{f}(x_{n+1}))^2]$
$= E\{(y_{n+1} - f(x_{n+1}) + f(x_{n+1}) - \hat{f}(x_{n+1}))^2\}$
$= E[(y_{n+1} - f(x_{n+1}))^2] + E[(f(x_{n+1}) - \hat{f}(x_{n+1}))^2] + 2E[(y_{n+1} - f(x_{n+1}))(f(x_{n+1}) - \hat{f}(x_{n+1}))]$

We know that $E[y_{n+1} - f(x_{n+1})] = 0$ because $E[\epsilon_i] = 0$, so the cross term vanishes.

$\Rightarrow \text{PE}(\hat{f}(x_{n+1})) = \sigma^2 + E\{(\hat{f}(x_{n+1}) - f(x_{n+1}))^2\} = \sigma^2 + \text{MSE}(\hat{f}(x_{n+1}))$
Bias variance trade-off
Prediction error breaks down into three familiar pieces. $\sigma^2$ is the true error variance; we’re stuck with it. But we can work with the other two pieces: the bias/variance trade-off. Maybe accepting some bias will yield better predictions?

$\text{PE}(\hat{f}(x_{n+1})) = \sigma^2 + \text{MSE}(\hat{f}(x_{n+1})) = \underbrace{\sigma^2}_{\text{fixed}} + \underbrace{(E[\hat{f}(x_{n+1})] - f(x_{n+1}))^2}_{\text{bias}^2} + \underbrace{\text{var}[\hat{f}(x_{n+1})]}_{\text{variance}}$
Prediction error for LS regression

$\text{PE}(\hat{f}) = \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} \{\underbrace{\text{bias}(\hat{f}(x_i))}_{0}\}^2 + \frac{1}{N}\sum_{i=1}^{N} \text{var}[\hat{f}(x_i)]$

Recall that $\hat{\beta} = (X^T X)^{-1} X^T y$ is unbiased and $\text{cov}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$.

$\Rightarrow \text{PE}(\hat{f}) = \sigma^2 + \frac{1}{N}\,\text{trace}(X (X^T X)^{-1} X^T)\,\sigma^2 = \sigma^2 + \frac{\sigma^2}{N}\,\text{rank}(X^T X) = \sigma^2 + \frac{\sigma^2 p}{N} = \sigma^2\left(1 + \frac{p}{N}\right)$

(assuming $X_{n \times p}$ has full column rank)
Penalized regression
LS is unbiased, but could a biased estimator do better? Penalized regression is one way of adding bias. Ridge regression minimizes a penalized sum of squares:

$\hat{\beta}_{ridge} = \arg\min_{\beta} \sum_{i} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = (X^T X + \lambda I_p)^{-1} X^T y$

Here $\lambda$ is a tuning parameter that controls the magnitude of the parameter estimates: $\lambda = 0$ gives us OLS, and $\lambda \to \infty$ moves us toward a vector of zeros.
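The closed form above is a one-liner, and the shrinkage effect of $\lambda$ is easy to see numerically. A sketch on simulated data (not the diabetes example):

```python
import numpy as np

# Ridge closed form: beta_ridge = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.ones(5) + rng.normal(size=50)  # true coefficients all 1 (made up)

# The coefficient norm shrinks toward zero as lambda grows; lambda = 0 is OLS.
norms = [float(np.linalg.norm(ridge(X, y, lam))) for lam in (0.0, 10.0, 1000.0)]
print(norms)
```

Note that even at very large $\lambda$ the coefficients are small but not exactly zero, which previews the contrast with the lasso later on.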
Ridge regression
Rewrite the ridge estimator as a function of the LS estimator:

$\hat{\beta}_{ridge} = (X^T X + \lambda I_p)^{-1} X^T y$
$= (X^T X [I_p + \lambda (X^T X)^{-1}])^{-1} X^T y$
$= (I_p + \lambda (X^T X)^{-1})^{-1} (X^T X)^{-1} X^T y$
$= (I_p + \lambda (X^T X)^{-1})^{-1} \hat{\beta}_{LS}$

This makes it easier to calculate the MSE. Note that for $\lambda < \infty$, the estimates will be nonzero.
Ridge MSE

$E[\hat{\beta}_{ridge}] = (I_p + \lambda (X^T X)^{-1})^{-1} E[\hat{\beta}_{LS}] = (I_p + \lambda (X^T X)^{-1})^{-1} \beta$ (because LS is unbiased)

$\text{var}(x_i \hat{\beta}_{ridge}) = \sigma^2 x_i^T (X^T X + \lambda I_p)^{-1} (I_p + \lambda (X^T X)^{-1})^{-1} x_i$

When $\lambda > 0$: ridge estimates are biased by shrinking coefficients toward zero, but they are less variable than LS. So can ridge outperform LS?
MSE for LS and Ridge
- Example with $N = 100$, $p = 50$, $\sigma = 1$; $\beta_1$ through $\beta_{10}$ large, $\beta_{11}$ through $\beta_{50}$ small.
- Ridge outperforms LS for a range of $\lambda$ values and then starts to do worse.
- Choosing $\lambda$ carefully is crucial. More on this later…
Back to the diabetes example
Ridge predicts better on the test data than LS did!
Penalized regression
Ridge regression uses a quadratic penalty to introduce bias. There is an entire family of penalized regression techniques that attempt to do the same:

$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} \rho(\beta_j)$

$\rho(x)$ | Technique
$x^2$ | Ridge (L2 norm)
$|x|$ | Lasso (L1 norm)
$I(x \ne 0)$ | Best Subset (L0 norm)
$I(x \le \lambda) + \frac{(a\lambda - x)_+}{(a-1)\lambda}\, I(x > \lambda)$ | Smoothly clipped absolute deviation (SCAD)
The lasso
The lasso is a promising penalized regression technique:

$\hat{\beta}_{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$

Like ridge regression, the lasso biases estimates by shrinking them toward zero. But the lasso also does selection by shrinking some coefficients all the way to zero: least absolute shrinkage and selection operator. Improved prediction and interpretation is a big win!
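The zeroing-out behavior is easiest to see in the textbook special case of an orthonormal design, where each lasso coefficient is a "soft-thresholded" version of the corresponding LS coefficient (the threshold level is proportional to $\lambda$). A sketch with made-up LS coefficients:

```python
# Soft-thresholding operator: shrink toward zero, and zero out anything
# within the threshold. This is the lasso's selection effect in the
# orthonormal-design special case.
def soft_threshold(b, thresh):
    if b > thresh:
        return b - thresh
    if b < -thresh:
        return b + thresh
    return 0.0

beta_ls = [3.0, -0.4, 1.2, 0.1]              # hypothetical LS estimates
beta_lasso = [soft_threshold(b, 0.5) for b in beta_ls]
print(beta_lasso)  # two small terms are dropped entirely
```

Ridge, by contrast, rescales every coefficient but never sets one exactly to zero, which is exactly the contrast drawn on the next slide.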
Ridge vs Lasso penalties
For $\beta_j$ close to zero, the lasso penalty is much stiffer than the ridge penalty. This partially explains why the lasso can zero terms out but ridge cannot.
Ridge and Lasso geometry
Instead of thinking about penalizing the SSE, we can think of these techniques as constrained optimization problems:
- Lasso: $\min \sum_i (y_i - x_i\beta)^2$ such that $\sum_j |\beta_j| \le s$
- Ridge: $\min \sum_i (y_i - x_i\beta)^2$ such that $\sum_j \beta_j^2 \le s$
For a simple two-predictor problem, it is easy to visualize the feasible region for both ridge and lasso.
More geometry
The lasso feasible region has corners which allow for intersections at zero, unlike ridge.
Ridge (left) and Lasso (right) feasible regions with SSE contours.
Lasso and the Diabetes example
The lasso solution and MSE are complicated, but as the penalty increases, bias increases and prediction variance decreases. When only a subset of the predictors is active, we expect the lasso to outperform ridge. The lasso performs best on the Diabetes test set and has only 20 active terms (not 65).
Ridge vs the Lasso
Both are penalized regression techniques that can produce better predictions than least squares.
Ridge: $\hat{\beta}_j \ne 0$ for all $j$ (even when $n < p$); naturally handles collinearity and even linear dependencies.
Lasso: performs variable selection and estimation simultaneously; provides estimates for up to $n$ terms; if $x_1$ and $x_2$ are highly correlated, only one of them tends to enter the model.
Can we get the best of both worlds?
The Elastic Net
The elastic net is a combination of ridge and the lasso. Penalty function:

$\rho(\beta) = \frac{1-\alpha}{2}\beta^2 + \alpha|\beta|, \quad \alpha \in [0, 1]$

Ridge ($\alpha = 0$) and lasso ($\alpha = 1$) are special cases; the tuning parameter $\alpha$ controls the mix of $\ell_1$ and $\ell_2$ penalties. For $\alpha \in (0, 1)$ we get selection and shrinkage, we can handle collinearity and dependencies, and we can estimate more than $n$ coefficients. $\alpha \approx 1$ is a popular choice since we get the lasso with a bit of singularity handling.
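The penalty function itself is a direct translation of the formula above; a quick sketch confirms the two special cases:

```python
# Elastic net penalty rho(beta) = (1 - alpha)/2 * beta^2 + alpha * |beta|,
# interpolating between ridge (alpha = 0) and lasso (alpha = 1).
def elastic_net_penalty(b, alpha):
    return (1 - alpha) / 2 * b ** 2 + alpha * abs(b)

print(elastic_net_penalty(3.0, 0.0))  # 4.5  (pure ridge: b^2 / 2)
print(elastic_net_penalty(3.0, 1.0))  # 3.0  (pure lasso: |b|)
print(elastic_net_penalty(3.0, 0.5))  # 3.75 (an even mix)
```

Because the $\ell_1$ part is still present for any $\alpha > 0$, the mixed penalty keeps the lasso's ability to zero coefficients out.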
Elastic net penalty for various α
The elastic net penalty is a compromise between ridge regression and the lasso.
Elastic net vs Lasso
Suppose we have 10 candidate predictors, and the only truly active predictors are $x_3$ and $x_5$, which are highly correlated. The lasso will likely include only $x_3$ or $x_5$ in the final model; the elastic net will likely include both. This is where the name “elastic net” comes from: it stretches to capture correlated predictors. Which solution is “better” may depend on your objectives.
Back to the Diabetes data again…
The elastic net performs almost identically to the lasso. Because of some highly correlated predictors, the elastic net model has 24 terms whereas the lasso has 20.
But what about Forward Selection?
Forward selection (FS) is a popular variable selection technique that has been around a long time. The FS algorithm usually looks something like:
1. Start with just an intercept.
2. Enter the most significant predictor (Wald test, score test, …).
3. Repeat step 2 until everything enters or we run out of DF.
4. Keep the best model in the sequence.
You can look at FS as a way of approximating best subset regression.
Important: FS does selection, but not shrinkage!
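The greedy loop can be sketched with a squared-error entry criterion instead of a formal test statistic (a simplification of the algorithm above, not JMP's implementation):

```python
import numpy as np

# Forward selection sketch: at each step, enter the candidate column
# that most reduces the residual SSE of the least squares fit.
def forward_selection(X, y, n_terms):
    n, p = X.shape
    active = []
    for _ in range(n_terms):
        best_j, best_sse = None, np.inf
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            Xa = X[:, cols]
            beta = np.linalg.lstsq(Xa, y, rcond=None)[0]
            sse = float(np.sum((y - Xa @ beta) ** 2))
            if sse < best_sse:
                best_j, best_sse = j, sse
        active.append(best_j)
    return active

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 4] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=80)  # cols 4 and 1 active
selected = forward_selection(X, y, 2)
print(selected)  # the two truly active columns enter first
```

Note the entered coefficients are plain least squares fits: nothing shrinks them, which is the "selection but not shrinkage" point above.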
One last look at the diabetes data
Forward selection chooses 12 terms for our final model, but it doesn’t do very well predicting new observations. In general, we should expect shrinkage techniques to yield better predictions than FS.
Shrinkage and selection

 | Variable Selection: Yes | Variable Selection: No
Shrinkage: Yes | Lasso, Elastic Net | Ridge
Shrinkage: No | Forward Selection | OLS, ML

- How we choose to do estimation is crucial!
- Shrinkage techniques have shown great promise in building models that are parsimonious and that predict well.
The Adaptive Lasso
If we knew which predictors were most important, we could penalize their coefficients less severely. This leads us to the adaptive lasso:

$\hat{\beta}_{AL} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} w_j|\beta_j|$

If we choose the weights carefully, we get the oracle property: asymptotically we get the right active set, and asymptotically we predict as well as if we had known the true model in advance. Using the inverse of the LS estimates, $w_j = 1/|\hat{\beta}_{j,LS}|$, gets us the oracle property.
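The weight formula is a one-liner; with made-up LS estimates, the effect is that large coefficients get a light penalty and small ones get a heavy penalty:

```python
# Adaptive lasso weights w_j = 1 / |beta_hat_j_LS| (hypothetical LS estimates).
beta_ls = [3.0, 0.1, -0.5]
weights = [1 / abs(b) for b in beta_ls]
print(weights)  # a large coefficient gets a small weight, and vice versa
```

These weights then multiply the $|\beta_j|$ terms in the penalty above, so a predictor the LS fit already found important is shrunk far less than a borderline one.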
Penalized Generalized Linear Models
So far we have only considered penalized least squares, but the idea extends naturally to generalized linear models (GLMs). For GLMs, we penalize the negative log-likelihood:

$\hat{\beta} = \arg\min_{\beta} -l(\beta|y) + \lambda \sum_{j=1}^{p} \rho(\beta_j)$

The ideas are exactly the same, but estimation can be much trickier. JMP allows us to do lasso/elastic net fits for a handful of distributions: Poisson, binomial, gamma, …
Using a penalized regression model
We have ignored the choice of the tuning parameter $\lambda$ so far. Rather than a single estimate, we end up with a sequence of fits defined by a range of tuning parameter values. In practice, we want to use the value of the tuning parameter that gives us the best fit. (The elastic net also depends on the choice of $\alpha$, but we usually just pick a single value like $\alpha = 0.99$.)

Tuning | $\lambda_1$ | $\lambda_2$ | … | $\lambda_{k-1}$ | $\lambda_k$
Estimate | $\hat{\beta}_1$ | $\hat{\beta}_2$ | … | $\hat{\beta}_{k-1}$ | $\hat{\beta}_k$
The solution path
The solution path is a convenient summary of the sequence of fits.
- Each line represents an estimated coefficient in the model.
- $\lambda$ decreases as we move left to right, allowing more predictors to enter.
- BMI and LTG are the first two terms to enter the model.
- HDL enters with a negative coefficient, but later becomes positive as the penalty is relaxed.
Tuning
Since each value of $\lambda$ leads to a different model, how do we choose a value that leads to a good model? In reality $\lambda$ is continuous, but we break it up into a grid $[\lambda_1, \lambda_2, \ldots, \lambda_{k-1}, \lambda_k]$. Then we can try each value of $\lambda$ and keep the model that fits best. But how do we determine which fit is best?
- Cross-validation (hold-out or k-fold)
- Information criteria (AICc or BIC)
In our diabetes examples so far, we have used training and validation sets to tune our models.
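The grid-and-holdout loop looks like this in outline (a sketch using the ridge closed form on simulated data; JMP's Generalized Regression platform automates the equivalent search):

```python
import numpy as np

# Tune lambda on a grid: fit on the training rows, score on the
# validation rows, keep the lambda with the smallest validation error.
def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 30))
beta = np.r_[np.ones(5), np.zeros(25)]        # only 5 of 30 terms truly active
y = X @ beta + rng.normal(size=120)

Xtr, ytr = X[:80], y[:80]                      # training rows
Xva, yva = X[80:], y[80:]                      # validation rows

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
val_err = [float(np.mean((yva - Xva @ ridge(Xtr, ytr, lam)) ** 2)) for lam in grid]
best_lam = grid[int(np.argmin(val_err))]
print(best_lam, val_err)
```

Swapping the validation-error line for AICc or BIC computed on the training fit gives the information-criterion variant of the same search.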
K-Fold Cross-Validation
We may worry that our model is sensitive to the particular validation set used to choose the tuning parameter. This is especially concerning with limited data. An alternative is to break our data set into K pieces or “folds” of similar size. At each iteration, we treat one of the folds as the validation set and the remaining folds as the training set, and we sum the error over the validation sets. When K = N, we call this leave-one-out or jackknife cross-validation.
5-Fold Cross-Validation Example
For each value of the tuning parameter $\lambda_j$ we do five fits:
- Fit 1: fit the model on Sets 1, 2, 3, and 5, then evaluate it on Set 4, giving $CV_1(\lambda_j)$.
- Fit 2: fit the model on Sets 1, 3, 4, and 5, then evaluate it on Set 2, giving $CV_2(\lambda_j)$.
- …and so on.
Our “best” value of $\lambda$ minimizes $CV_1 + CV_2 + CV_3 + CV_4 + CV_5$.

Set 1 | Set 2 | Set 3 | Set 4 | Set 5
Training | Training | Training | Validation | Training
Training | Validation | Training | Training | Training
Validation | Training | Training | Training | Training
Training | Training | Training | Training | Validation
Training | Training | Validation | Training | Training
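The fold rotation can be sketched as index bookkeeping (a simple round-robin assignment for illustration; JMP and most libraries shuffle before assigning):

```python
# Generate K train/validation splits: each fold plays the validation
# role exactly once, and the other folds form the training set.
def kfold_indices(n, k):
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)   # round-robin fold assignment
    splits = []
    for j in range(k):
        train = [i for f in range(k) if f != j for i in folds[f]]
        splits.append((train, folds[j]))
    return splits

splits = kfold_indices(10, 5)
for train, val in splits:
    print(val)  # each observation appears in exactly one validation fold
```

Summing the validation error over the five splits gives the $CV_1 + \ldots + CV_5$ total that the tuning step minimizes.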
K-Fold for Tuning
So how do we choose K?
- If K is too small, we may not have avoided the drawbacks of a single validation set.
- If K is too large, the training sets are highly correlated, which results in highly variable error estimates. A very large K can also lead to long computation times.
K = 5 and K = 10 are very popular choices.
Is the best really the best?
When building our model, it is tempting to find the model that produces the best CV error or AIC/BIC and stick with it. But if a simpler model performs just as well, should we go smaller? If our goal is to capture the true effects, should we go bigger?
Similar models
AIC and BIC provide guidance on which models are similar to the best model:
- AIC − Best AIC < 4 ⇒ strong evidence supporting the lesser model
- 4 ≤ AIC − Best AIC < 10 ⇒ weak evidence supporting the lesser model
- …and we should probably avoid anything worse than that.
Similar models and k-fold
There is a similar concept when using k-fold. For the best model, the validation error is made up of $K$ pieces:

$CV(\beta) = CV_1(\beta) + CV_2(\beta) + \ldots + CV_K(\beta)$

Taking the sample standard error of the $CV_j(\beta)$ gives us $se(\beta)$. Then if

$CV(\beta) \le CV(\beta_{best}) + se(\beta_{best})$

we should feel good about $\beta$.
Penalized Regression and JMP Pro
Penalized regression tools are found in the Generalized Regression platform in JMP Pro:
- Ridge, lasso, and elastic net (and FS and ML too)
- A variety of response distributions (normal, binomial, Poisson, gamma, negative binomial, …)
- Quantile regression too (but only ML, no selection)
References
Sall, J. (2002), “Monte Carlo Calibration of Distributions of Partition Statistics,” SAS Institute. Retrieved July 29, 2015 from http://www.jmp.com/content/dam/jmp/documents/en/white-papers/montecarlocal.pdf
SAS Institute Inc. 2015. JMP® 12 Specialized Models. Cary, NC: SAS Institute Inc.
SAS Press 2015. Building Better Models with JMP Pro. Cary, NC: SAS Institute Inc.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” JRSS-B, 58-1, pp. 267-288.
Zou, H. and Hastie, T. (2005), “Regularization and variable selection via the elastic net,” JRSS-B, 67-2, pp. 301-320.
Clay Barker – [email protected]
Crotty – [email protected]
Stephens – [email protected]
Discussion and Q&A