Interactive Model Building with JMP Pro

Clay Barker – [email protected]
Michael Crotty – [email protected]
Mia Stephens – [email protected]

Copyright © 2015 SAS Institute Inc. All rights reserved.


What is JMP (and JMP Pro)?

Statistical Discovery Software from SAS

Developed in 1989 as a desktop application

Comprehensive: basic statistics and graphical summaries; advanced tools and techniques

Extendible: powerful scripting language; application and add-in builders; integrates with Excel, R, MATLAB, and SAS

Visual, dynamic and interactive

JMP Pro – Advanced tools for analytics and modeling

Outline

Introduction to JMP Software

A Motivating Example – Boston Home Prices

The Modeling Process

Simple and Multiple Linear Regression

Classification and Regression Trees

Advanced Tree Methods

Short Break

Penalized Regression Techniques

Return to Boston Home Prices

Discussion and Q&A

The Modeling Process

Explore data (know the data, identify key features): one variable at a time, two variables at a time, many variables at a time

Identify potential data quality issues

Prepare data for modeling: missing values, data cleanup (recode, binning), transformations, create a validation column (holdout sets)

Model building, selection, and comparison

Example: Boston Home Building Value

JMP tools used: Graph Builder and Geographic Mapping, Data Filter, Columns Viewer, Distribution, Multivariate, Create a Validation Column, Regression, Regression Trees, Neural Networks

Situation:

Predict total value for single family owner occupied homes in a Boston neighborhood.

Data:

2014 appraisal information on over 25K Boston homes, publicly available from https://data.cityofboston.gov

Classification and Regression Trees

Overview

Why Trees?

Algorithm and Decision Tree Example

Column Contributions

Cross Validation

Bootstrap Forest

Boosted Tree

Why Trees?

Easy to interpret and explain: not a "black box" model; can be represented with a tree (flow-chart) diagram; interpretable as a series of "if-then" statements that lead to a classification or a prediction.

Flexible for response types: categorical response => classification tree; continuous response => regression tree.

Flexible for input factor types: categorical factors get split into two groups of levels; continuous factors get split based on a cutting value.

Algorithm

Build a tree by making optimal splits. Examine candidate splits across all factors. Split on the factor that maximizes the split criterion, using the split that optimizes the criterion.

Repeat the above steps until you decide to stop splitting or until a stopping criterion is reached.

Basic Example

Corn data (142 observations)

Predict yield from nitrate:
What is the optimal split of nitrate values to maximize the ability to predict yield?
What is the optimal split of nitrate values to maximize the ANOVA model SS?

Prediction after one split: for nitrate < 26.32, predict a yield of 6756; for nitrate ≥ 26.32, predict a yield of 8606.

Basic Example

A single split results in an R2 of 22.8%. Let’s continue splitting to improve our predictions.

Split the left side leaf at nitrate < 11.44. R2 increases to 31.3%.

Keep splitting until we hit the minimum size split…

Splitting terminates at 21 splits, with R2 up to 45.3%. This leads to an unwieldy model.

We used all the observations in our data to fit the model. This runs the risk of bad predictions for new data because of overfitting, and we can't test how well our model fits new observations.

Basic Example Tree View

Column Contributions

Simple trees are intuitive. Complex trees might not be intuitive. Advanced tree methods don’t have a single tree representation.

We want to get a sense of which variables most influence our response variable. This is especially true when doing exploratory (rather than predictive) modeling.

This report shows each column’s contribution to the fit.

Cross Validation

We want to avoid overfitting our data, because overfitting leads to bad predictions for new observations.

Cross validation is one way to avoid overfitting.

We focus on one type of cross validation for trees: Holdback sets (Training/Validation/(Test))

Holdback Cross Validation

Cross validation refers to the process of randomly dividing our data set into training and validation sets. Use the training set to estimate our model parameters. Use the validation set to evaluate how well the model fits; this is a measure of how well the model will fit new observations. Retain the model that fits best on our validation set.

The simplest case: use a single training set and a single validation set.

Training and Validation Set

Suppose we have 150 observations. One strategy would be to use the first 100 for training and the last 50 for validation.

For each split on the training data, we measure the tree’s ability to predict on the validation data.

The tree/model with the best R2 for the validation set is our “best” model.

Training rows (1–100) and validation rows (101–150):

$$
X_{\text{train}} = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & \ddots & \vdots \\ x_{100,1} & \cdots & x_{100,p} \end{pmatrix},
\qquad
y_{\text{train}} = \begin{pmatrix} y_1 \\ \vdots \\ y_{100} \end{pmatrix}
$$

$$
X_{\text{val}} = \begin{pmatrix} x_{101,1} & \cdots & x_{101,p} \\ \vdots & \ddots & \vdots \\ x_{150,1} & \cdots & x_{150,p} \end{pmatrix},
\qquad
y_{\text{val}} = \begin{pmatrix} y_{101} \\ \vdots \\ y_{150} \end{pmatrix}
$$

Training and Validation Example

Back to the Corn data (142 observations):

Randomly select 42 obs. for validation; use the rest for training.

For each additional split, calculate the validation R2 for the model.

Continue until the validation R2 fails to improve for 10 consecutive splits.

Select the model defined by the maximum validation R2.
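A hedged sketch of this validation-guided stopping rule, using simulated data shaped like the corn example and scikit-learn's DecisionTreeRegressor (grown one leaf at a time) as a stand-in for JMP's interactive splitting. The 42-observation holdback and the 10-splits-without-improvement rule follow the slides; the data and library choice are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Simulated stand-in with the same shape as the corn data: 142 rows, one predictor.
rng = np.random.default_rng(0)
nitrate = rng.uniform(0, 60, 142)
corn_yield = 6000 + 50 * nitrate + rng.normal(0, 800, 142)
X = nitrate.reshape(-1, 1)

# Hold back 42 observations for validation; train on the remaining 100.
X_tr, X_val, y_tr, y_val = train_test_split(X, corn_yield, test_size=42, random_state=1)

best_r2, best_tree, since_improvement = -np.inf, None, 0
for n_leaves in range(2, 60):                 # n_leaves - 1 splits
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves).fit(X_tr, y_tr)
    r2 = r2_score(y_val, tree.predict(X_val))
    if r2 > best_r2:
        best_r2, best_tree, since_improvement = r2, tree, 0
    else:
        since_improvement += 1
    if since_improvement >= 10:               # stop after 10 splits with no improvement
        break

print("validation R2 of the selected tree:", round(best_r2, 3))
```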

Training, Validation, and Test

Another cross validation option is to divide the data into three sets: a training set, a validation set, and a test set.

The training and validation sets are used as previously illustrated.

Then we evaluate our final model on the test set to get an independent assessment of predictive performance. The test set allows us to compare models against each other on data that has not been used in the development of the models. Recall that the validation set was used in model development.

Possible Pitfalls of One Validation Set

Model selection can be sensitive to the particular validation set chosen.

This is especially concerning with limited data.

Advanced Tree Methods in JMP Pro

Extension: Bootstrap Forest

Why build only one tree when you can build a forest? The bootstrap forest method builds many trees and averages their predicted values to get the final predictions.

How do we build many trees on one data set? Each tree is built using a bootstrap sample (sampled with replacement). Each split on each tree considers only a random sample of candidate columns for splitting. This is also known as bootstrap aggregating (bagging).

Validation can be used to control the number of trees. Using validation allows you to use early stopping rules.
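A conceptual bagging sketch in Python, not the JMP Pro Bootstrap Forest implementation: each tree is fit to a bootstrap sample of the rows and a random subset of the columns, and the final prediction is the average over trees. For brevity the column sampling here is per tree rather than per split as the slide describes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bootstrap_forest(X, y, n_trees=100, n_cols=None, random_state=0):
    """Fit n_trees regression trees, each on a bootstrap sample of rows and a
    random subset of columns, and return an averaging prediction function."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    n_cols = n_cols or max(1, p // 3)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, n)                  # sample rows with replacement
        cols = rng.choice(p, size=n_cols, replace=False)
        tree = DecisionTreeRegressor().fit(X[np.ix_(rows, cols)], y[rows])
        trees.append((tree, cols))

    def predict(X_new):
        # Average the per-tree predictions to get the final prediction.
        return np.mean([t.predict(X_new[:, c]) for t, c in trees], axis=0)

    return predict
```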

Extension: Bootstrap Forest

What are some options we can specify?
# trees (to build and average over)
# columns sampled in each split
Minimum & maximum splits in each tree
Early stopping (if using validation)

What output do we get?
Could view individual trees, but it's hard to interpret visually.
Look at R2 for the validation and test sets.
The Column Contributions report is useful as well.
A continuous response has residual diagnostic plots; a categorical response has ROC and Lift curves.


Extension: Boosted Tree

Rather than build a forest of trees and average them, why not build trees sequentially and add them together?

Boosted trees do just that. Build a small tree and get the residuals. Then build another tree on those residuals. Repeat this process and then add all the small trees together.

Final tree is sum of estimates for each terminal node.

Note: For categorical responses, JMP only supports responses with 2 levels.
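A small Python sketch of the boosting idea (again a conceptual illustration rather than JMP Pro's Boosted Tree platform): fit a shallow tree, compute residuals, fit the next tree to those residuals, and accumulate the learning-rate-scaled predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_trees(X, y, n_stages=50, learning_rate=0.1, max_depth=2):
    """Sequentially fit small trees to the current residuals and sum their
    scaled predictions."""
    pred = np.full(len(y), y.mean())      # start from the overall mean
    stages = []
    for _ in range(n_stages):
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        stages.append(tree)

    def predict(X_new):
        out = np.full(X_new.shape[0], y.mean())
        for tree in stages:
            out += learning_rate * tree.predict(X_new)
        return out

    return predict
```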

Extension: Boosted Tree

What are some options we can specify?
# layers of fits (or stages)
# splits per tree
Learning rate (between 0 and 1)
Overfit penalty (to avoid prob = 0 with categorical responses)
Minimum split size
Early stopping (if using validation)

Extension: Boosted Tree

What output do we get?
R2 for validation and test sets
Confusion matrix for categorical responses
Cumulative validation plot (fit statistic vs. stage number)
Could view individual trees, but that is unwieldy
The Column Contributions report is useful again!


Comparison of Tree Models

Compare results for the Boston Home Prices data:

Parting Thoughts

Trees are flexible and (usually) interpretable, or at least fairly easy to explain conceptually to people.

JMP offers decision trees.

JMP Pro extends decision trees with the Bootstrap Forest and Boosted Tree methods.

All of these methods can be evaluated with the Column Contributions report and the Model Comparison tool.

Short Break

Penalized Regression Techniques

The Diabetes data

Suppose that we want to use information like age, gender, cholesterol, … to predict the progression of diabetes over one year.

$E[Y_i] = \beta_0 + \beta_1\,\text{Age}_i + \beta_2\,\text{Gender}_i + \beta_3\,\text{BMI}_i + \cdots = x_i\beta$

The Diabetes model

We have recorded 10 different attributes.

Including interactions and quadratic terms, we end up with a model that contains 65 coefficients.

We can use a subset of the data to estimate our 65 regression coefficients and use the remainder of the data to evaluate the performance of the model.

Training set: $X$ is a 266 × 65 matrix and $Y$ is a 266 × 1 vector.

Test set: 64 observations.

Least squares estimation

We often fit our model using least squares estimation.

$$\hat\beta = \arg\min_\beta \sum_{i=1}^{n} (y_i - x_i\beta)^2$$

LS is equivalent to maximum likelihood when we assume that our errors are independent and normally distributed.

This minimization gives us the familiar estimate: $\hat\beta = (X^TX)^{-1}X^Ty$

$X$ is our $n \times p$ design matrix (266 × 65 here)

$Y$ is our $n \times 1$ observed response vector (266 × 1 here)
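The closed-form estimate is easy to check numerically. A small sketch with simulated data of the same dimensions as the training set (in practice a least-squares solver is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 266, 65                          # same shape as the diabetes training set
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n)

# Closed form (X'X)^{-1} X'y versus the numerically preferred least-squares solve.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_closed, beta_lstsq))   # True
```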

So how did we do?

For the portion of the data used to do the estimation, our model fits pretty well.

For the set of data not involved in the estimation, our model fits pretty poorly.

This looks like a classic case of overfitting!

Overfitting

Overfitting occurs when our model is more complex than necessary and it starts to model random noise in the data instead of the underlying relationships.

When we overfit… our model may fit very well on the data we observed, but fail dramatically when predicting new observations.

Did we really need all 65 terms in our model?

Would something other than Least Squares be better?

Prediction error

Suppose that we observe data of the form

$$Y_i = f(x_i) + \epsilon_i, \qquad i = 1, \dots, n$$

where $\epsilon_i \sim N(0, \sigma^2)$, $x_i$ is a $p \times 1$ vector, and $f(\cdot)$ is a function that describes the true relationship between the response and predictors.

We fit a model and want to predict a new point $Y_{n+1}$ given $x_{n+1}$.

Prediction error is one way to measure how well our model will predict the new point.

$$\text{Prediction Error}\big(\hat f(x_{n+1})\big) = E\Big[\big(y_{n+1} - \hat f(x_{n+1})\big)^2\Big]$$

Rewriting the prediction error

We can get the prediction error into a familiar form…

$$\text{PE}\big(\hat f(x_{n+1})\big) = E\Big[\big(y_{n+1} - \hat f(x_{n+1})\big)^2\Big]$$

$$= E\Big\{\big[(y_{n+1} - f(x_{n+1})) + (f(x_{n+1}) - \hat f(x_{n+1}))\big]^2\Big\}$$

$$= E\big[(y_{n+1} - f(x_{n+1}))^2\big] + E\big[(f(x_{n+1}) - \hat f(x_{n+1}))^2\big] + 2E\big[(y_{n+1} - f(x_{n+1}))(f(x_{n+1}) - \hat f(x_{n+1}))\big]$$

We know that $E[y_{n+1} - f(x_{n+1})] = 0$ because $E[\epsilon_i] = 0$, and the new error $\epsilon_{n+1}$ is independent of the fitted model, so the cross term vanishes.

$$\Rightarrow \text{PE}\big(\hat f(x_{n+1})\big) = \sigma^2 + E\Big\{\big(f(x_{n+1}) - \hat f(x_{n+1})\big)^2\Big\} = \sigma^2 + \text{MSE}\big(\hat f(x_{n+1})\big)$$

Bias variance trade-off

Prediction error breaks down into three familiar pieces.

σ² is the true error variance; we're stuck with it.

But we can work with the other two pieces: the bias/variance trade-off. Maybe accepting some bias will yield better predictions?

$$\text{Prediction Error}\big(\hat f(x_{n+1})\big) = \sigma^2 + \text{MSE}\big(\hat f(x_{n+1})\big) = \underbrace{\sigma^2}_{\text{fixed}} + \underbrace{\Big(E\big[\hat f(x_{n+1})\big] - f(x_{n+1})\Big)^2}_{\text{bias}^2} + \underbrace{\text{var}\big[\hat f(x_{n+1})\big]}_{\text{variance}}$$

Prediction error for LS regression

$$\text{PE}(\hat f) = \sigma^2 + \frac{1}{N}\sum_{i=1}^{N}\big\{\underbrace{\text{bias}\,\hat f(x_i)}_{0}\big\}^2 + \frac{1}{N}\sum_{i=1}^{N}\text{var}\big[\hat f(x_i)\big]$$

Recall that $\hat\beta = (X^TX)^{-1}X^Ty$ is unbiased and $\text{cov}(\hat\beta) = \sigma^2(X^TX)^{-1}$.

$$\Rightarrow \text{PE}(\hat f) = \sigma^2 + \frac{1}{N}\,\text{trace}\big(X(X^TX)^{-1}X^T\sigma^2\big) = \sigma^2 + \frac{\sigma^2}{N}\,\text{rank}(X^TX) = \sigma^2 + \frac{\sigma^2 p}{N} = \sigma^2\Big(1 + \frac{p}{N}\Big)$$

(for $X_{n \times p}$ with full column rank)

Penalized regression

LS is unbiased, but could a biased estimator do better?

Penalized regression is one way of adding bias.

Ridge regression minimizes a penalized sum of squares:

$$\hat\beta_{\text{ridge}} = \arg\min_\beta \sum_{i}\big(y_i - x_i\beta\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \big(X^TX + \lambda I_p\big)^{-1}X^Ty$$

Here $\lambda$ is a tuning parameter that controls the magnitude of the parameter estimates: $\lambda = 0$ gives us OLS, and as $\lambda \to \infty$ the estimates shrink toward a vector of zeros.
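A minimal numpy sketch of the ridge closed form above (ignoring intercepts and standardization, which real software handles for you):

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge closed form: (X'X + lambda * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks the estimates toward zero.
```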

Ridge regression

Rewrite the ridge estimator as a function of LS:

$$\hat\beta_{\text{ridge}} = \big(X^TX + \lambda I_p\big)^{-1}X^Ty = \big(X^TX\,[I_p + \lambda(X^TX)^{-1}]\big)^{-1}X^Ty = \big[I_p + \lambda(X^TX)^{-1}\big]^{-1}(X^TX)^{-1}X^Ty = \big[I_p + \lambda(X^TX)^{-1}\big]^{-1}\hat\beta_{LS}$$

This makes it easier to calculate the MSE.

Note that for $\lambda < \infty$, the estimates will be nonzero.

Ridge MSE

$$E\big[\hat\beta_{\text{ridge}}\big] = \big[I_p + \lambda(X^TX)^{-1}\big]^{-1}E\big(\hat\beta_{LS}\big) = \big[I_p + \lambda(X^TX)^{-1}\big]^{-1}\beta \quad \text{(because LS is unbiased)}$$

$$\text{var}\big(x_i\hat\beta_{\text{ridge}}\big) = \sigma^2\,x_i^T\big(X^TX + \lambda I_p\big)^{-1}\big[I_p + \lambda(X^TX)^{-1}\big]^{-1}x_i$$

When $\lambda > 0$: ridge estimates are biased (coefficients are shrunk toward zero), and ridge estimates are less variable than LS.

So can ridge outperform LS?

MSE for LS and Ridge

• Example with N = 100, p = 50, σ = 1.
• β₁ through β₁₀ large; β₁₁ through β₅₀ small.
• Ridge outperforms LS for a range of λ values and then starts to do worse.
• Choosing λ carefully is crucial. More on this later…

Back to the diabetes example

Ridge predicts better on the test data than LS did!

Penalized regression

Ridge regression uses a quadratic penalty to introduce bias.

There is an entire family of penalized regression techniques that attempt to do the same.

$$\hat\beta = \arg\min_\beta \sum_{i=1}^{n}\big(y_i - x_i\beta\big)^2 + \lambda\sum_{j=1}^{p}\rho(\beta_j)$$

Penalty ρ(x) and the corresponding technique:
$\rho(x) = x^2$: Ridge (L2 norm)
$\rho(x) = |x|$: Lasso (L1 norm)
$\rho(x) = I(x \ne 0)$: Best Subset (L0 norm)
$\rho(x) = I(x \le \lambda) + \frac{(a\lambda - x)_+}{(a-1)\lambda}\, I(x > \lambda)$: Smoothly clipped absolute deviation (SCAD)

The lasso

Lasso is a promising penalized regression technique.

$$\hat\beta_{\text{lasso}} = \arg\min_\beta \sum_{i=1}^{n}\big(y_i - x_i\beta\big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$

Like ridge regression, lasso biases estimates by shrinking them toward zero.

But lasso also does selection by shrinking some coefficients all the way to zero:

least absolute shrinkage and selection operator

Improved prediction and interpretation is a big win!
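For illustration, scikit-learn ships a diabetes dataset (442 observations with the same 10 baseline predictors) that is closely related to the data used here, although its size and train/test split differ from the deck's JMP table. A hedged sketch of a lasso fit on the degree-2 expansion (10 main effects, 45 interactions, and 10 quadratics, matching the 65-term model above); the penalty value is arbitrary and would normally be tuned, as discussed later:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)           # 442 x 10 baseline predictors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Degree-2 expansion gives 65 columns for 10 inputs (excluding the intercept),
# followed by standardization and a lasso fit with a fixed (untuned) penalty.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Lasso(alpha=1.0, max_iter=50000),
)
model.fit(X_tr, y_tr)
n_active = np.sum(model.named_steps["lasso"].coef_ != 0)
print("test R2:", round(model.score(X_te, y_te), 3), "| active terms:", n_active)
```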

Ridge vs Lasso penalties

• For $\beta_j$ close to zero, the lasso penalty is much stiffer than the ridge penalty. This partially explains why the lasso can zero terms out but ridge cannot.

Ridge and Lasso geometry

Instead of thinking about penalizing the SSE, we could think about these techniques as constrained optimization problems.

Lasso: $\min_\beta \sum_i (y_i - x_i\beta)^2$ such that $\sum_j |\beta_j| \le s$

Ridge: $\min_\beta \sum_i (y_i - x_i\beta)^2$ such that $\sum_j \beta_j^2 \le s$

For a simple two predictor problem, it is easy to visualize the feasible region for both ridge and lasso.

More geometry

The lasso feasible region has corners which allow for intersections at zero, unlike ridge.

Ridge (left) and Lasso (right) feasible regions with SSE contours.

Lasso and the Diabetes example

The lasso solution and MSE are complicated, but…

As the penalty increases, bias increases and prediction variance decreases. When only a subset of the predictors is active, we expect the lasso to outperform ridge.

• The lasso performs best on the Diabetes test set and has only 20 active terms (not 65).

Ridge vs the Lasso

Both are penalized regression techniques that can produce better predictions than least squares.

Ridge: $\hat\beta_j \ne 0$ for all $j$ (even when $n < p$); naturally handles collinearity and even linear dependencies.

Lasso: performs variable selection and estimation simultaneously; provides estimates for up to $n$ terms; if $x_1$ and $x_2$ are highly correlated, only one of them tends to enter the model.

Can we get the best of both worlds?

The Elastic Net

The Elastic Net is a combination of ridge and the lasso.

Penalty function: $\rho(\beta) = \frac{1-\alpha}{2}\beta^2 + \alpha|\beta|, \qquad \alpha \in [0,1]$

Ridge ($\alpha = 0$) and lasso ($\alpha = 1$) are special cases.

The tuning parameter $\alpha$ controls the mix of the ℓ1 and ℓ2 penalties.

For $\alpha \in (0,1)$: we get selection and shrinkage, we can handle collinearity and dependencies, and we can estimate more than $n$ coefficients.

$\alpha \approx 1$ is a popular choice, since we get the lasso with a bit of singularity handling.
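For reference, scikit-learn's ElasticNet uses the same kind of mixing parameter: its l1_ratio plays the role of α above (1 gives the lasso, 0 gives ridge), while its alpha argument plays the role of the overall penalty strength λ, up to the library's internal scaling. A short hedged sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# l1_ratio corresponds to alpha in the penalty above; the constructor's `alpha`
# argument is the overall penalty strength (the lambda of the earlier slides).
enet = make_pipeline(StandardScaler(),
                     ElasticNet(alpha=0.1, l1_ratio=0.99, max_iter=50000))
enet.fit(X, y)
print("nonzero coefficients:", (enet.named_steps["elasticnet"].coef_ != 0).sum())
```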

Elastic net penalty for various $\alpha$

The elastic net penalty is a compromise between ridge regression and the lasso.

Elastic net vs Lasso

Suppose we have 10 candidate predictors.

The only truly active predictors are $x_3$ and $x_5$, which are highly correlated. The lasso will likely include only $x_3$ or $x_5$ in the final model. The elastic net will likely include both $x_3$ and $x_5$ in the final model.

This is where the name "elastic net" comes from: it stretches to capture correlated predictors.

Which solution is “better” may depend on your objectives.

Back to the Diabetes data again…

The Elastic Net performs almost identically to the lasso.

Because of some highly correlated predictors, the elastic net model has 24 terms whereas the lasso has 20.

But what about Forward Selection?

Forward selection is a popular variable selection technique that has been around a long time.

The FS algorithm usually looks something like this:
1. Start with just an intercept.
2. Enter the most significant predictor (Wald test, score test, …).
3. Repeat step 2 until everything enters or we run out of DF.
4. Keep the best model in the sequence.

You can look at FS as a way of approximating Best Subset regression.

Important: FS does selection, but not shrinkage!

One last look at the diabetes data

Forward selection chooses 12 terms for our final model.

• But it doesn’t do very well predicting new observations.• In general, we should expect shrinkage techniques to yield

better predictions than FS.

Shrinkage and selection

Variable selection and shrinkage: Lasso, Elastic Net
Shrinkage but no variable selection: Ridge
Variable selection but no shrinkage: Forward Selection
Neither variable selection nor shrinkage: OLS, ML

• How we choose to do estimation is crucial!
• Shrinkage techniques have shown great promise in building models that are parsimonious and that will predict well.

The Adaptive Lasso

If we knew which predictors were most important, we may penalize their coefficients less severely.

This leads us to the adaptive lasso:

$$\hat\beta_{AL} = \arg\min_\beta \sum_{i=1}^{n}\big(y_i - x_i\beta\big)^2 + \lambda\sum_{j=1}^{p}w_j|\beta_j|$$

If we choose the weights carefully, we get the oracle property: asymptotically we will get the right active set, and asymptotically we will predict as well as if we had known the true model in advance.

Using the inverse of the LS estimates gets us the oracle property: $w_j = 1 / |\hat\beta_{j,LS}|$.
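The adaptive lasso can be fit with ordinary lasso software via a standard reweighting trick: scale column j by 1/w_j, run a plain lasso, and unscale the coefficients. A hedged numpy/scikit-learn sketch (no intercept handling, and sklearn's alpha differs from the λ above by the library's internal scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam):
    """Adaptive lasso via reweighting: with w_j = 1/|beta_j,LS|, scale column j
    by 1/w_j, solve an ordinary lasso, then unscale the coefficients."""
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)    # initial LS fit
    w = 1.0 / np.abs(beta_ls)                          # adaptive weights
    X_scaled = X / w                                   # column j becomes x_j / w_j
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(X_scaled, y)
    return fit.coef_ / w                               # beta_j = theta_j / w_j
```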

Penalized Generalized Linear Models

So far we have only considered penalized least squares, but the idea extends naturally to generalized linear models.

For GLMs, we penalize the negative log-likelihood

$$\hat\beta = \arg\min_\beta \Big\{ -\ell(\beta \mid y) + \lambda\sum_{j=1}^{p}\rho(\beta_j) \Big\}$$

The ideas are the exact same, but estimation can be much trickier.

JMP allows us to do lasso/elastic net fits for a handful of distributions: Poisson, Binomial, Gamma, …

Using a penalized regression model

We have ignored the choice of the tuning parameter $\lambda$ so far.

Rather than a single estimate, we end up with a sequence of fits defined by a range of tuning parameter values.

In practice, we will want to use a value of the tuning parameter that gives us the best fit.

The elastic net also depends on the choice of $\alpha$, but we usually just pick a single value like $\alpha = 0.99$.

Tuning parameter grid: $\lambda_1, \lambda_2, \dots, \lambda_{k-1}, \lambda_k$
Corresponding estimates: $\hat\beta_1, \hat\beta_2, \dots, \hat\beta_{k-1}, \hat\beta_k$

The solution path

The solution path is a convenient summary of the sequence of fits.

• Each line represents an estimated coefficient in the model.
• $\lambda$ decreases as we move left to right, allowing more predictors to enter.

• BMI and LTG are the first two terms to enter the model.

• HDL enters with a negative coefficient, but later becomes positive as the penalty is relaxed.

Tuning

Since each value of 𝜆𝜆 leads to a different model, how do we choose a value that leads to a good model?

In reality $\lambda$ is continuous, but we break it up into a grid $[\lambda_1, \lambda_2, \dots, \lambda_{r-1}, \lambda_r]$.

Then we can try each value of $\lambda$ and keep the model that fits best.

But how do we determine which fit is best? Cross-validation (hold-out or k-fold), or information criteria (AICc or BIC).

In our diabetes examples so far, we have used Training and Validation sets to tune our models.

K-Fold Cross-Validation

We may worry that our model is sensitive to the particular validation set used to choose the tuning parameter. This is especially concerning with limited data.

An alternative is to break our data set into K pieces or “folds” of similar size.

At each iteration, we treat one of the folds as the validation set and the remaining folds as the training set. We sum the error over the validation sets.

When K = N, we call that leave-one-out or jackknife cross-validation.

5-Fold Cross-Validation Example

For each value of our tuning parameter $\lambda_j$ we do five fits:
Fit 1: fit the model on Sets 1, 2, 3, and 5, then evaluate it on Set 4, giving us $CV_1(\lambda_j)$.
Fit 2: fit the model on Sets 1, 3, 4, and 5, then evaluate it on Set 2, giving us $CV_2(\lambda_j)$.
…and so on.
Our "best" value of $\lambda$ minimizes $CV_1 + CV_2 + CV_3 + CV_4 + CV_5$.

        Set 1       Set 2       Set 3       Set 4       Set 5
Fit 1   Training    Training    Training    Validation  Training
Fit 2   Training    Validation  Training    Training    Training
Fit 3   Validation  Training    Training    Training    Training
Fit 4   Training    Training    Training    Training    Validation
Fit 5   Training    Training    Validation  Training    Training
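A sketch of this grid-plus-5-fold scheme, accumulating CV_k(λ_j) for each fold and keeping the λ with the smallest total. The lasso and scikit-learn's diabetes data are stand-ins for whatever model is actually being tuned:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)
lambdas = np.logspace(-3, 1, 20)                   # grid of candidate tuning values
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_error = np.zeros((len(lambdas), 5))             # CV_k(lambda_j), one column per fold
for k, (train_idx, val_idx) in enumerate(kf.split(X)):
    for j, lam in enumerate(lambdas):
        fit = Lasso(alpha=lam, max_iter=50000).fit(X[train_idx], y[train_idx])
        resid = y[val_idx] - fit.predict(X[val_idx])
        cv_error[j, k] = np.sum(resid ** 2)

total_cv = cv_error.sum(axis=1)                    # CV_1 + ... + CV_5 for each lambda
best_lambda = lambdas[np.argmin(total_cv)]
print("best lambda:", best_lambda)
```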

K-Fold for Tuning

So how do we choose K?
If K is too small, we may not have avoided the drawbacks of a single validation set.
If K is too large, we should be concerned about the training sets being highly correlated, which results in highly variable error estimates. A very large K can also lead to long computation times.

K = 5 and K = 10 are very popular choices.

Is the best really the best?

When building our model, it is tempting to find the model that produces the best CV error or AIC/BIC and stick with it.

If a simpler model performs just as well, should we go smaller?

If our goal is to capture the true effects, should we go bigger?

Similar models

AIC and BIC provide guidance on which models are similar to the best model.

AIC − Best AIC < 4 ⇒ strong evidence supporting the lesser model

4 ≤ AIC − Best AIC < 10 ⇒ weak evidence supporting the lesser model

…and we should probably avoid anything worse than that.

Similar models and k-fold

There is a similar concept when using k-fold.

For the best model, the validation error is made up of $K$ pieces:

$$CV(\hat\beta) = CV_1(\hat\beta) + CV_2(\hat\beta) + \dots + CV_K(\hat\beta)$$

Taking the sample standard error of the $CV_j(\hat\beta)$ gives us $se(\hat\beta)$.

Then if $CV(\hat\beta) \le CV(\hat\beta_{best}) + se(\hat\beta_{best})$, we should feel good about $\hat\beta$.
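A small sketch of applying this rule to the per-fold CV errors built in the k-fold sketch earlier; scaling the fold-level standard error up to the summed CV error is one reasonable convention, not necessarily the one JMP uses:

```python
import numpy as np

def one_se_lambda(lambdas, cv_error):
    """Pick the most penalized lambda whose total CV error is within one standard
    error of the best model's. `lambdas` is an array of tuning values and
    `cv_error` has shape [n_lambdas, n_folds], as built in the k-fold sketch."""
    total = cv_error.sum(axis=1)
    best = np.argmin(total)
    k = cv_error.shape[1]
    # Standard error of the summed CV error for the best model.
    se_best = np.std(cv_error[best], ddof=1) * np.sqrt(k)
    within_one_se = total <= total[best] + se_best
    # For the lasso, a larger lambda means a simpler (more penalized) model.
    return np.max(lambdas[within_one_se])
```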

Penalized Regression and JMP Pro

Penalized regression tools are found in the Generalized Regression platform in JMP Pro.

Ridge, Lasso, and Elastic net (and FS and ML too)

A variety of response distributions (normal, binomial, Poisson, gamma, negative binomial,…)

And quantile regression too (but only ML, no selection)

References

Sall, J. (2002), “Monte Carlo Calibration of Distributions of Partition Statistics,” SAS Institute. Retrieved July 29, 2015 from http://www.jmp.com/content/dam/jmp/documents/en/white-papers/montecarlocal.pdf

SAS Institute Inc. 2015. JMP® 12 Specialized Models. Cary, NC: SAS Institute Inc.

SAS Press 2015. Building Better Models with JMP Pro. Cary, NC: SAS Institute Inc.

Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” JRSS-B, 58-1, pp. 267-288.

Zou, H. and Hastie, T. (2005), “Regularization and variable selection via the elastic net,” JRSS-B, 67-2, pp. 301-320.

Clay Barker – [email protected]
Michael Crotty – [email protected]
Mia Stephens – [email protected]

Discussion and Q&A