Interactive Model Building with JMP Pro · 2019-11-22
TRANSCRIPT
Copyright © 2015 SAS Institute Inc. All rights reserved.
Interactive Model Building with JMP Pro
Clay Barker – [email protected]
Crotty – [email protected]
Stephens – [email protected]
What is JMP (and JMP Pro)?
- Statistical Discovery Software from SAS
- Developed in 1989 as a desktop application
- Comprehensive: basic statistics and graphical summaries, plus advanced tools and techniques
- Extendible: powerful scripting language, application and add-in builders, integrates with Excel, R, MATLAB, and SAS
- Visual, dynamic, and interactive
- JMP Pro: advanced tools for analytics and modeling
Outline
- Introduction to JMP Software
- A Motivating Example: Boston Home Prices
- The Modeling Process
- Simple and Multiple Linear Regression
- Classification and Regression Trees
- Advanced Tree Methods
- Short Break
- Penalized Regression Techniques
- Return to Boston Home Prices
- Discussion and Q&A
The Modeling Process
- Explore data (know the data, identify key features): one variable at a time, two variables at a time, many variables at a time
- Identify potential data quality issues
- Prepare data for modeling: missing values, data cleanup (recode, binning), transformations, create a validation column (holdout sets)
- Model building, selection, and comparison
Example: Boston Home Building Value
Tools used: Graph Builder and geographic mapping, Data Filter, Columns Viewer, Distribution, Multivariate, Create a Validation Column, regression, regression trees, neural networks.
Situation: predict total value for single-family owner-occupied homes in a Boston neighborhood.
Data: 2014 appraisal information on over 25K Boston homes, publicly available from https://data.cityofboston.gov
Classification and Regression Trees
Overview
Why Trees?
Algorithm and Decision Tree Example
Column Contributions
Cross Validation
Bootstrap Forest
Boosted Tree
Why Trees?
- Easy to interpret and explain: not a “black box” model; can be represented with a tree (flow-chart) diagram; interpretable as a series of “if-then” statements that lead to a classification or a prediction.
- Flexible for response types: categorical response => classification tree; continuous response => regression tree.
- Flexible for input factor types: categorical factors get split into two groups of levels; continuous factors get split based on a cutting value.
Algorithm
Build a tree by making optimal splits:
- Examine candidate splits across all factors.
- Split on the factor whose best split optimizes the split criterion.
- Repeat these steps until you decide to stop splitting or a stopping criterion is reached.
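The single best split on one continuous factor can be sketched in a few lines. This toy version (an illustration, not JMP's implementation) scans every candidate cut point and keeps the one that maximizes the between-group sum of squares, which is equivalent to minimizing the within-group SSE:

```python
# Toy split search for a regression tree on one continuous factor.
def best_split(x, y):
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    grand_mean = sum(ys) / n
    best = (None, -1.0)  # (cut value, between-group SS)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot cut between equal x values
        left, right = ys[:i], ys[i:]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        # between-group SS = sum over groups of n_g * (group mean - grand mean)^2
        ss = len(left) * (ml - grand_mean) ** 2 + len(right) * (mr - grand_mean) ** 2
        if ss > best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, ss)
    return best

# Hypothetical data with two clearly separated groups:
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
cut, ss = best_split(x, y)
print(cut)  # 6.5: the cut cleanly separates the low and high groups
```

A full tree-builder would call this recursively on each resulting leaf, which is exactly the "repeat until a stopping criterion" step above.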
Basic Example
Corn data (142 observations). Predict yield from nitrate: what is the optimal split of nitrate values to maximize our ability to predict yield? Equivalently, what split maximizes the ANOVA model sum of squares?
Prediction after one split: for nitrate < 26.32, predict a yield of 6756; for nitrate ≥ 26.32, predict a yield of 8606.
Basic Example
- A single split results in an R2 of 22.8%. Let’s continue splitting to improve our predictions.
- Split the left-side leaf at nitrate < 11.44. R2 increases to 31.3%.
- Keep splitting until we hit the minimum split size…
- Splitting terminates at 21 splits, with R2 up to 45.3%. This leads to an unwieldy model.
- We used all the observations in our data to fit the model. That risks bad predictions for new data because of overfitting, and we can’t test how well our model fits new observations.
Basic Example Tree View
Column Contributions
Simple trees are intuitive; complex trees might not be, and advanced tree methods don’t have a single tree representation. We want a sense of which variables most influence our response variable. This is especially true when doing exploratory (rather than predictive) modeling. The Column Contributions report shows each column’s contribution to the fit.
Cross Validation
We want to avoid overfitting our data, because overfitting leads to bad predictions for new observations.
Cross validation is one way to avoid overfitting.
We focus on one type of cross validation for trees: Holdback sets (Training/Validation/(Test))
Holdback Cross Validation
Cross validation refers to randomly dividing our data set into training and validation sets:
- Use the training set to estimate model parameters.
- Use the validation set to evaluate how well the model fits; this estimates how well the model will fit new observations.
- Retain the model that fits best on the validation set.
The simplest case uses a single training set and a single validation set.
Training and Validation Set
Suppose we have 150 observations. One strategy would be to use the first 100 for training and the last 50 for validation.
For each split on the training data, we measure the tree’s ability to predict on the validation data.
The tree/model with the best R2 for the validation set is our “best” model.
[Figure: the design matrix $X$ and response $y$ partitioned by rows. Rows 1–100 ($x_{1,1} \ldots x_{1,p}$ through $x_{100,1} \ldots x_{100,p}$, with $y_1, \ldots, y_{100}$) form the training set; rows 101–150 ($x_{101,1} \ldots x_{101,p}$ through $x_{150,1} \ldots x_{150,p}$, with $y_{101}, \ldots, y_{150}$) form the validation set.]
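In practice the holdback rows are usually chosen at random rather than taken as the first block. A minimal sketch (names and fraction are illustrative, not JMP's interface):

```python
import random

# Random holdback split: shuffle row indices, hold out a fraction for validation.
def holdback_split(n, frac_validation, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(round(n * frac_validation))
    return idx[n_val:], idx[:n_val]  # (training indices, validation indices)

train, val = holdback_split(150, 1 / 3)
print(len(train), len(val))  # 100 50
```

The returned index lists are disjoint, so each observation is used either to fit the model or to evaluate it, never both.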
Training and Validation Example
Back to the Corn data (142 observations):
- Randomly select 42 obs. for validation; use the rest for training.
- For each additional split, calculate the validation R2 for the model.
- Continue until the validation R2 fails to improve for 10 consecutive splits.
- Select the model defined by the maximum validation R2.
Training, Validation, and Test
Another cross validation option is to divide the data into three sets: a training set, a validation set, and a test set. The training and validation sets are used as previously illustrated. We then evaluate our final model on the test set to get an independent assessment of predictive performance. The test set allows us to compare models against each other on data that has not been used in the development of the models; recall that the validation set was used in model development.
Possible Pitfalls of One Validation Set
Model selection can be sensitive to the particular validation set chosen. This is especially concerning with limited data.
Advanced Tree Methods in JMP Pro
Extension: Bootstrap Forest
Why build only one tree when you can build a forest? The bootstrap forest method builds many trees and averages their predicted values to get the final predictions.
How do we build many trees on one data set?
- Each tree is built on a bootstrap sample (sampled with replacement).
- Each split on each tree considers only a random sample of candidate columns for splitting.
- This is also known as bootstrap aggregating (bagging).
Validation can be used to control the number of trees; using validation allows you to use early stopping rules.
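The two sources of randomness described above can be sketched directly; this illustration (not JMP's implementation) just draws the row sample and the column sample:

```python
import random

# Bootstrap forest randomness: rows sampled WITH replacement per tree,
# columns sampled WITHOUT replacement per split.
def bootstrap_rows(n, rng):
    return [rng.randrange(n) for _ in range(n)]  # some rows repeat, some never appear

def candidate_columns(p, n_sampled, rng):
    return rng.sample(range(p), n_sampled)  # distinct columns considered at one split

rng = random.Random(1)
rows = bootstrap_rows(10, rng)
cols = candidate_columns(6, 2, rng)
print(sorted(set(rows)))  # the distinct rows this tree actually sees
print(cols)
```

Averaging predictions over trees grown on many such samples is what reduces variance relative to a single deep tree.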
Extension: Bootstrap Forest
What are some options we can specify?
- Number of trees (to build and average over)
- Number of columns sampled at each split
- Minimum and maximum splits in each tree
- Early stopping (if using validation)
What output do we get?
- Individual trees can be viewed, but they are hard to interpret visually.
- R2 for validation and test sets.
- The Column Contributions report is useful as well.
- Continuous responses get residual diagnostic plots; categorical responses get ROC and lift curves.
Extension: Bootstrap Forest
Extension: Boosted Tree
Rather than build a forest of trees and average them, why not build trees sequentially and add them together? Boosted trees do just that:
- Build a small tree and get the residuals.
- Then build another tree on those residuals.
- Repeat this process and then add all the small trees together.
- The final prediction is the sum of the estimates for each terminal node.
Note: for categorical responses, JMP only supports responses with 2 levels.
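The fit-to-residuals loop can be shown with a toy one-predictor example. This sketch (an illustration of the idea, not JMP's algorithm) uses a fixed-cut stump as the small tree and a learning rate to damp each stage:

```python
# Toy boosting: each stage fits a stump to the current residuals; the
# final prediction is the learning-rate-scaled sum of all stages.
def fit_stump(x, y):
    cut = sorted(x)[len(x) // 2]  # crude cut at the median x
    left = [yi for xi, yi in zip(x, y) if xi < cut]
    right = [yi for xi, yi in zip(x, y) if xi >= cut]
    ml = sum(left) / len(left) if left else 0.0
    mr = sum(right) / len(right) if right else 0.0
    return lambda xi: ml if xi < cut else mr

def boost(x, y, n_stages=50, rate=0.1):
    stages = []
    resid = list(y)
    for _ in range(n_stages):
        stump = fit_stump(x, resid)       # fit the residuals, not y
        stages.append(stump)
        resid = [r - rate * stump(xi) for xi, r in zip(x, resid)]
    return lambda xi: sum(rate * s(xi) for s in stages)

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
f = boost(x, y)
print(round(f(2), 2), round(f(11), 2))  # approaches the two group means
```

After enough stages the summed stumps converge toward the group means, which is why a small learning rate needs more stages but tends to overfit less per stage.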
Extension: Boosted Tree
What are some options we can specify?
- Number of layers of fits (stages)
- Number of splits per tree
- Learning rate (between 0 and 1)
- Overfit penalty (to avoid predicted probabilities of 0 with categorical responses)
- Minimum split size
- Early stopping (if using validation)
Extension: Boosted Tree
What output do we get?
- R2 for validation and test sets
- Confusion matrix for categorical responses
- Cumulative validation plot (fit statistic vs. stage number)
- Individual trees can be viewed, but that is unwieldy; the Column Contributions report is useful again!
Extension: Boosted Tree
Comparison of Tree Models
Compare results for the Boston Home Prices data:
Parting Thoughts
Trees are flexible and (usually) interpretable, or at least fairly easy to explain conceptually.
JMP offers decision trees. JMP Pro extends them with Bootstrap Forest and Boosted Trees.
All of these methods can be evaluated with the Column Contributions report and the Model Comparison tool.
Short Break
Penalized Regression Techniques
The Diabetes data
Suppose that we want to use information like age, gender, cholesterol, … to predict the progression of diabetes over one year.

$E(Y_i) = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Gender}_i + \beta_3 \text{BMI}_i + \ldots = \boldsymbol{x}_i \beta$
The Diabetes model
We have recorded 10 different attributes. Including interactions and quadratic terms, we end up with a model that contains 65 coefficients. We can use a subset of the data to estimate the 65 regression coefficients and use the remainder of the data to evaluate the model’s performance.
Training set: $X$ is a $266 \times 65$ matrix, $Y$ is a $266 \times 1$ vector. Test set: 64 observations.
Least squares estimation
We often fit our model using least squares (LS) estimation:

$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \boldsymbol{x}_i \beta)^2$

LS is equivalent to maximum likelihood when we assume that our errors are independent and normally distributed. This minimization gives us the familiar estimate $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X$ is our $n \times p$ design matrix ($266 \times 65$ here) and $y$ is our $n \times 1$ observed response vector ($266 \times 1$ here).
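The closed form is easy to check on simulated data. A quick sketch (simulated numbers, not the diabetes data):

```python
import numpy as np

# Check the closed-form LS estimate beta_hat = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])          # true coefficients (made up)
y = X @ beta + rng.normal(scale=0.1, size=n)

# Solve the normal equations rather than inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 1))  # recovers roughly [1.0, -2.0, 0.5]
```

With low noise and many more observations than parameters, the estimate lands very close to the truth; the overfitting trouble on the next slide comes from the opposite regime, 65 terms estimated from 266 rows.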
So how did we do?
For the portion of the data used for estimation, our model fits pretty well. For the set of data not involved in estimation, our model fits pretty poorly. This looks like a classic case of overfitting!
Overfitting
Overfitting occurs when our model is more complex than necessary and it starts to model random noise in the data instead of the underlying relationships.
When we overfit… our model may fit very well on the data we observed, but fail dramatically when predicting new observations.
Did we really need all 65 terms in our model?
Would something other than Least Squares be better?
Prediction error
Suppose that we observe data of the form

$Y_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, n$

where $\epsilon_i \sim N(0, \sigma^2)$, $x_i$ is a $p \times 1$ vector, and $f(\cdot)$ is a function that describes the true relationship between the response and predictors. We fit a model $\hat{f}$ and want to predict a new point $Y_{n+1}$ given $x_{n+1}$. Prediction error is one way to measure how well our model will predict the new point:

$\text{PE}(\hat{f}(x_{n+1})) = E[(y_{n+1} - \hat{f}(x_{n+1}))^2]$
Rewriting the prediction error
We can get the prediction error into a familiar form:

$\text{PE}(\hat{f}(x_{n+1})) = E[(y_{n+1} - \hat{f}(x_{n+1}))^2]$
$= E\{(y_{n+1} - f(x_{n+1}) + f(x_{n+1}) - \hat{f}(x_{n+1}))^2\}$
$= E[(y_{n+1} - f(x_{n+1}))^2] + E[(f(x_{n+1}) - \hat{f}(x_{n+1}))^2] + 2E[(y_{n+1} - f(x_{n+1}))(f(x_{n+1}) - \hat{f}(x_{n+1}))]$

We know that $E[y_{n+1} - f(x_{n+1})] = 0$ because $E[\epsilon_i] = 0$, so the cross term vanishes.

$\Rightarrow \text{PE}(\hat{f}(x_{n+1})) = \sigma^2 + E\{(\hat{f}(x_{n+1}) - f(x_{n+1}))^2\} = \sigma^2 + \text{MSE}(\hat{f}(x_{n+1}))$
Bias variance trade-off
Prediction error breaks down into three familiar pieces. $\sigma^2$ is the true error variance; we’re stuck with it. But we can work with the other two pieces: the bias/variance trade-off. Maybe accepting some bias will yield better predictions?

$\text{PE}(\hat{f}(x_{n+1})) = \sigma^2 + \text{MSE}(\hat{f}(x_{n+1})) = \underbrace{\sigma^2}_{\text{fixed}} + \underbrace{(E[\hat{f}(x_{n+1})] - f(x_{n+1}))^2}_{\text{bias}^2} + \underbrace{\text{var}[\hat{f}(x_{n+1})]}_{\text{variance}}$
Prediction error for LS regression

$\text{PE}(\hat{f}) = \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} \{\underbrace{\text{bias}(\hat{f}(x_i))}_{0}\}^2 + \frac{1}{N}\sum_{i=1}^{N} \text{var}[\hat{f}(x_i)]$

Recall that $\hat{\beta} = (X^T X)^{-1} X^T y$ is unbiased and $\text{cov}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$.

$\Rightarrow \text{PE}(\hat{f}) = \sigma^2 + \frac{1}{N}\,\text{trace}(X (X^T X)^{-1} X^T)\,\sigma^2 = \sigma^2 + \frac{\sigma^2}{N}\,\text{rank}(X^T X) = \sigma^2 + \frac{\sigma^2 p}{N} = \sigma^2\left(1 + \frac{p}{N}\right)$

(assuming $X_{n \times p}$ has full column rank)
Penalized regression
LS is unbiased, but could a biased estimator do better? Penalized regression is one way of adding bias. Ridge regression minimizes a penalized sum of squares:

$\hat{\beta}_{ridge} = \arg\min_{\beta} \sum_{i} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = (X^T X + \lambda I_p)^{-1} X^T y$

Here $\lambda$ is a tuning parameter that controls the magnitude of the parameter estimates: $\lambda = 0$ gives us OLS, and $\lambda \to \infty$ moves us toward a vector of zeros.
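The closed form above is a one-liner, and the shrinkage effect of $\lambda$ is easy to see numerically. A sketch on simulated data (not the diabetes example):

```python
import numpy as np

# Ridge closed form: beta_ridge = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.ones(5) + rng.normal(size=50)  # true coefficients all 1 (made up)

# The coefficient norm shrinks toward zero as lambda grows; lambda = 0 is OLS.
norms = [float(np.linalg.norm(ridge(X, y, lam))) for lam in (0.0, 10.0, 1000.0)]
print(norms)
```

Note that even at very large $\lambda$ the coefficients are small but not exactly zero, which previews the contrast with the lasso later on.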
Ridge regression
Rewrite the ridge estimator as a function of the LS estimator:

$\hat{\beta}_{ridge} = (X^T X + \lambda I_p)^{-1} X^T y$
$= (X^T X [I_p + \lambda (X^T X)^{-1}])^{-1} X^T y$
$= (I_p + \lambda (X^T X)^{-1})^{-1} (X^T X)^{-1} X^T y$
$= (I_p + \lambda (X^T X)^{-1})^{-1} \hat{\beta}_{LS}$

This makes it easier to calculate the MSE. Note that for $\lambda < \infty$, the estimates will be nonzero.
Ridge MSE

$E[\hat{\beta}_{ridge}] = (I_p + \lambda (X^T X)^{-1})^{-1} E[\hat{\beta}_{LS}] = (I_p + \lambda (X^T X)^{-1})^{-1} \beta$ (because LS is unbiased)

$\text{var}(x_i \hat{\beta}_{ridge}) = \sigma^2 x_i^T (X^T X + \lambda I_p)^{-1} (I_p + \lambda (X^T X)^{-1})^{-1} x_i$

When $\lambda > 0$: ridge estimates are biased by shrinking coefficients toward zero, but they are less variable than LS. So can ridge outperform LS?
MSE for LS and Ridge
- Example with $N = 100$, $p = 50$, $\sigma = 1$; $\beta_1$ through $\beta_{10}$ large, $\beta_{11}$ through $\beta_{50}$ small.
- Ridge outperforms LS for a range of $\lambda$ values and then starts to do worse.
- Choosing $\lambda$ carefully is crucial. More on this later…
Back to the diabetes example
Ridge predicts better on the test data than LS did!
Penalized regression
Ridge regression uses a quadratic penalty to introduce bias. There is an entire family of penalized regression techniques that attempt to do the same:

$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} \rho(\beta_j)$

$\rho(x)$ | Technique
$x^2$ | Ridge (L2 norm)
$|x|$ | Lasso (L1 norm)
$I(x \ne 0)$ | Best Subset (L0 norm)
$I(x \le \lambda) + \frac{(a\lambda - x)_+}{(a-1)\lambda}\, I(x > \lambda)$ | Smoothly clipped absolute deviation (SCAD)
The lasso
The lasso is a promising penalized regression technique:

$\hat{\beta}_{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$

Like ridge regression, the lasso biases estimates by shrinking them toward zero. But the lasso also does selection by shrinking some coefficients all the way to zero: least absolute shrinkage and selection operator. Improved prediction and interpretation is a big win!
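The zeroing-out behavior is easiest to see in the textbook special case of an orthonormal design, where each lasso coefficient is a "soft-thresholded" version of the corresponding LS coefficient (the threshold level is proportional to $\lambda$). A sketch with made-up LS coefficients:

```python
# Soft-thresholding operator: shrink toward zero, and zero out anything
# within the threshold. This is the lasso's selection effect in the
# orthonormal-design special case.
def soft_threshold(b, thresh):
    if b > thresh:
        return b - thresh
    if b < -thresh:
        return b + thresh
    return 0.0

beta_ls = [3.0, -0.4, 1.2, 0.1]              # hypothetical LS estimates
beta_lasso = [soft_threshold(b, 0.5) for b in beta_ls]
print(beta_lasso)  # two small terms are dropped entirely
```

Ridge, by contrast, rescales every coefficient but never sets one exactly to zero, which is exactly the contrast drawn on the next slide.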
Ridge vs Lasso penalties
For $\beta_j$ close to zero, the lasso penalty is much stiffer than the ridge penalty. This partially explains why the lasso can zero terms out but ridge cannot.
Ridge and Lasso geometry
Instead of thinking about penalizing the SSE, we can think of these techniques as constrained optimization problems:
- Lasso: $\min \sum_i (y_i - x_i\beta)^2$ such that $\sum_j |\beta_j| \le s$
- Ridge: $\min \sum_i (y_i - x_i\beta)^2$ such that $\sum_j \beta_j^2 \le s$
For a simple two-predictor problem, it is easy to visualize the feasible region for both ridge and lasso.
More geometry
The lasso feasible region has corners which allow for intersections at zero, unlike ridge.
Ridge (left) and Lasso (right) feasible regions with SSE contours.
Lasso and the Diabetes example
The lasso solution and MSE are complicated, but as the penalty increases, bias increases and prediction variance decreases. When only a subset of the predictors is active, we expect the lasso to outperform ridge. The lasso performs best on the Diabetes test set and has only 20 active terms (not 65).
Ridge vs the Lasso
Both are penalized regression techniques that can produce better predictions than least squares.
Ridge: $\hat{\beta}_j \ne 0$ for all $j$ (even when $n < p$); naturally handles collinearity and even linear dependencies.
Lasso: performs variable selection and estimation simultaneously; provides estimates for up to $n$ terms; if $x_1$ and $x_2$ are highly correlated, only one of them tends to enter the model.
Can we get the best of both worlds?
The Elastic Net
The elastic net is a combination of ridge and the lasso. Penalty function:

$\rho(\beta) = \frac{1-\alpha}{2}\beta^2 + \alpha|\beta|, \quad \alpha \in [0, 1]$

Ridge ($\alpha = 0$) and lasso ($\alpha = 1$) are special cases; the tuning parameter $\alpha$ controls the mix of $\ell_1$ and $\ell_2$ penalties. For $\alpha \in (0, 1)$ we get selection and shrinkage, we can handle collinearity and dependencies, and we can estimate more than $n$ coefficients. $\alpha \approx 1$ is a popular choice since we get the lasso with a bit of singularity handling.
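The penalty function itself is a direct translation of the formula above; a quick sketch confirms the two special cases:

```python
# Elastic net penalty rho(beta) = (1 - alpha)/2 * beta^2 + alpha * |beta|,
# interpolating between ridge (alpha = 0) and lasso (alpha = 1).
def elastic_net_penalty(b, alpha):
    return (1 - alpha) / 2 * b ** 2 + alpha * abs(b)

print(elastic_net_penalty(3.0, 0.0))  # 4.5  (pure ridge: b^2 / 2)
print(elastic_net_penalty(3.0, 1.0))  # 3.0  (pure lasso: |b|)
print(elastic_net_penalty(3.0, 0.5))  # 3.75 (an even mix)
```

Because the $\ell_1$ part is still present for any $\alpha > 0$, the mixed penalty keeps the lasso's ability to zero coefficients out.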
Elastic net penalty for various α
The elastic net penalty is a compromise between ridge regression and the lasso.
Elastic net vs Lasso
Suppose we have 10 candidate predictors, and the only truly active predictors are $x_3$ and $x_5$, which are highly correlated. The lasso will likely include only $x_3$ or $x_5$ in the final model; the elastic net will likely include both. This is where the name “elastic net” comes from: it stretches to capture correlated predictors. Which solution is “better” may depend on your objectives.
Back to the Diabetes data again…
The elastic net performs almost identically to the lasso. Because of some highly correlated predictors, the elastic net model has 24 terms whereas the lasso has 20.
But what about Forward Selection?
Forward selection (FS) is a popular variable selection technique that has been around a long time. The FS algorithm usually looks something like:
1. Start with just an intercept.
2. Enter the most significant predictor (Wald test, score test, …).
3. Repeat step 2 until everything enters or we run out of DF.
4. Keep the best model in the sequence.
You can look at FS as a way of approximating best subset regression.
Important: FS does selection, but not shrinkage!
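The greedy loop can be sketched with a squared-error entry criterion instead of a formal test statistic (a simplification of the algorithm above, not JMP's implementation):

```python
import numpy as np

# Forward selection sketch: at each step, enter the candidate column
# that most reduces the residual SSE of the least squares fit.
def forward_selection(X, y, n_terms):
    n, p = X.shape
    active = []
    for _ in range(n_terms):
        best_j, best_sse = None, np.inf
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            Xa = X[:, cols]
            beta = np.linalg.lstsq(Xa, y, rcond=None)[0]
            sse = float(np.sum((y - Xa @ beta) ** 2))
            if sse < best_sse:
                best_j, best_sse = j, sse
        active.append(best_j)
    return active

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 4] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=80)  # cols 4 and 1 active
selected = forward_selection(X, y, 2)
print(selected)  # the two truly active columns enter first
```

Note the entered coefficients are plain least squares fits: nothing shrinks them, which is the "selection but not shrinkage" point above.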
One last look at the diabetes data
Forward selection chooses 12 terms for our final model, but it doesn’t do very well predicting new observations. In general, we should expect shrinkage techniques to yield better predictions than FS.
Shrinkage and selection

 | Variable Selection: Yes | Variable Selection: No
Shrinkage: Yes | Lasso, Elastic Net | Ridge
Shrinkage: No | Forward Selection | OLS, ML

- How we choose to do estimation is crucial!
- Shrinkage techniques have shown great promise in building models that are parsimonious and that predict well.
The Adaptive Lasso
If we knew which predictors were most important, we could penalize their coefficients less severely. This leads us to the adaptive lasso:

$\hat{\beta}_{AL} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} w_j|\beta_j|$

If we choose the weights carefully, we get the oracle property: asymptotically we get the right active set, and asymptotically we predict as well as if we had known the true model in advance. Using the inverse of the LS estimates, $w_j = 1/|\hat{\beta}_{j,LS}|$, gets us the oracle property.
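The weight formula is a one-liner; with made-up LS estimates, the effect is that large coefficients get a light penalty and small ones get a heavy penalty:

```python
# Adaptive lasso weights w_j = 1 / |beta_hat_j_LS| (hypothetical LS estimates).
beta_ls = [3.0, 0.1, -0.5]
weights = [1 / abs(b) for b in beta_ls]
print(weights)  # a large coefficient gets a small weight, and vice versa
```

These weights then multiply the $|\beta_j|$ terms in the penalty above, so a predictor the LS fit already found important is shrunk far less than a borderline one.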
Penalized Generalized Linear Models
So far we have only considered penalized least squares, but the idea extends naturally to generalized linear models (GLMs). For GLMs, we penalize the negative log-likelihood:

$\hat{\beta} = \arg\min_{\beta} -l(\beta|y) + \lambda \sum_{j=1}^{p} \rho(\beta_j)$

The ideas are exactly the same, but estimation can be much trickier. JMP allows us to do lasso/elastic net fits for a handful of distributions: Poisson, binomial, gamma, …
Using a penalized regression model
We have ignored the choice of the tuning parameter $\lambda$ so far. Rather than a single estimate, we end up with a sequence of fits defined by a range of tuning parameter values. In practice, we want to use the value of the tuning parameter that gives us the best fit. (The elastic net also depends on the choice of $\alpha$, but we usually just pick a single value like $\alpha = 0.99$.)

Tuning | $\lambda_1$ | $\lambda_2$ | … | $\lambda_{k-1}$ | $\lambda_k$
Estimate | $\hat{\beta}_1$ | $\hat{\beta}_2$ | … | $\hat{\beta}_{k-1}$ | $\hat{\beta}_k$
The solution path
The solution path is a convenient summary of the sequence of fits.
- Each line represents an estimated coefficient in the model.
- $\lambda$ decreases as we move left to right, allowing more predictors to enter.
- BMI and LTG are the first two terms to enter the model.
- HDL enters with a negative coefficient, but later becomes positive as the penalty is relaxed.
Tuning
Since each value of $\lambda$ leads to a different model, how do we choose a value that leads to a good model? In reality $\lambda$ is continuous, but we break it up into a grid $[\lambda_1, \lambda_2, \ldots, \lambda_{k-1}, \lambda_k]$. Then we can try each value of $\lambda$ and keep the model that fits best. But how do we determine which fit is best?
- Cross-validation (hold-out or k-fold)
- Information criteria (AICc or BIC)
In our diabetes examples so far, we have used training and validation sets to tune our models.
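The grid-and-holdout loop looks like this in outline (a sketch using the ridge closed form on simulated data; JMP's Generalized Regression platform automates the equivalent search):

```python
import numpy as np

# Tune lambda on a grid: fit on the training rows, score on the
# validation rows, keep the lambda with the smallest validation error.
def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 30))
beta = np.r_[np.ones(5), np.zeros(25)]        # only 5 of 30 terms truly active
y = X @ beta + rng.normal(size=120)

Xtr, ytr = X[:80], y[:80]                      # training rows
Xva, yva = X[80:], y[80:]                      # validation rows

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
val_err = [float(np.mean((yva - Xva @ ridge(Xtr, ytr, lam)) ** 2)) for lam in grid]
best_lam = grid[int(np.argmin(val_err))]
print(best_lam, val_err)
```

Swapping the validation-error line for AICc or BIC computed on the training fit gives the information-criterion variant of the same search.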
K-Fold Cross-Validation
We may worry that our model is sensitive to the particular validation set used to choose the tuning parameter. This is especially concerning with limited data. An alternative is to break our data set into K pieces or “folds” of similar size. At each iteration, we treat one of the folds as the validation set and the remaining folds as the training set, and we sum the error over the validation sets. When K = N, we call this leave-one-out or jackknife cross-validation.
5-Fold Cross-Validation Example
For each value of the tuning parameter $\lambda_j$ we do five fits:
- Fit 1: fit the model on Sets 1, 2, 3, and 5, then evaluate it on Set 4, giving $CV_1(\lambda_j)$.
- Fit 2: fit the model on Sets 1, 3, 4, and 5, then evaluate it on Set 2, giving $CV_2(\lambda_j)$.
- …and so on.
Our “best” value of $\lambda$ minimizes $CV_1 + CV_2 + CV_3 + CV_4 + CV_5$.

Set 1 | Set 2 | Set 3 | Set 4 | Set 5
Training | Training | Training | Validation | Training
Training | Validation | Training | Training | Training
Validation | Training | Training | Training | Training
Training | Training | Training | Training | Validation
Training | Training | Validation | Training | Training
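The fold rotation can be sketched as index bookkeeping (a simple round-robin assignment for illustration; JMP and most libraries shuffle before assigning):

```python
# Generate K train/validation splits: each fold plays the validation
# role exactly once, and the other folds form the training set.
def kfold_indices(n, k):
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)   # round-robin fold assignment
    splits = []
    for j in range(k):
        train = [i for f in range(k) if f != j for i in folds[f]]
        splits.append((train, folds[j]))
    return splits

splits = kfold_indices(10, 5)
for train, val in splits:
    print(val)  # each observation appears in exactly one validation fold
```

Summing the validation error over the five splits gives the $CV_1 + \ldots + CV_5$ total that the tuning step minimizes.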
K-Fold for Tuning
So how do we choose K?
- If K is too small, we may not have avoided the drawbacks of a single validation set.
- If K is too large, the training sets are highly correlated, which results in highly variable error estimates. A very large K can also lead to long computation times.
K = 5 and K = 10 are very popular choices.
Is the best really the best?
When building our model, it is tempting to find the model that produces the best CV error or AIC/BIC and stick with it. But if a simpler model performs just as well, should we go smaller? If our goal is to capture the true effects, should we go bigger?
Similar models
AIC and BIC provide guidance on which models are similar to the best model:
- AIC − Best AIC < 4 ⇒ strong evidence supporting the lesser model
- 4 ≤ AIC − Best AIC < 10 ⇒ weak evidence supporting the lesser model
- …and we should probably avoid anything worse than that.
Similar models and k-fold
There is a similar concept when using k-fold. For the best model, the validation error is made up of $K$ pieces:

$CV(\beta) = CV_1(\beta) + CV_2(\beta) + \ldots + CV_K(\beta)$

Taking the sample standard error of the $CV_j(\beta)$ gives us $se(\beta)$. Then if

$CV(\beta) \le CV(\beta_{best}) + se(\beta_{best})$

we should feel good about $\beta$.
Penalized Regression and JMP Pro
Penalized regression tools are found in the Generalized Regression platform in JMP Pro:
- Ridge, lasso, and elastic net (and FS and ML too)
- A variety of response distributions (normal, binomial, Poisson, gamma, negative binomial, …)
- Quantile regression too (but only ML, no selection)
References
Sall, J. (2002), “Monte Carlo Calibration of Distributions of Partition Statistics,” SAS Institute. Retrieved July 29, 2015 from http://www.jmp.com/content/dam/jmp/documents/en/white-papers/montecarlocal.pdf
SAS Institute Inc. 2015. JMP® 12 Specialized Models. Cary, NC: SAS Institute Inc.
SAS Press 2015. Building Better Models with JMP Pro. Cary, NC: SAS Institute Inc.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” JRSS-B, 58-1, pp. 267-288.
Zou, H. and Hastie, T. (2005), “Regularization and variable selection via the elastic net,” JRSS-B, 67-2, pp. 301-320.
Clay Barker – [email protected]
Crotty – [email protected]
Stephens – [email protected]
Discussion and Q&A