
High-Dimensional Statistical Learning: Introduction

Ali Shojaie, University of Washington

http://faculty.washington.edu/ashojaie/

August 2018, Tehran, Iran

A Simple Example

- Suppose we have n = 400 people with diabetes, for whom we have p = 3 serum-level measurements (LDL, HDL, GLU).

- We want to predict these people's disease progression after 1 year.

A Simple Example

Notation:

- n is the number of observations.

- p is the number of variables/features/predictors.

- y is an n-vector containing the response/outcome for each of the n observations.

- X is an n × p data matrix.

Linear Regression on a Simple Example

- You can perform linear regression to develop a model to predict progression using LDL, HDL, and GLU:

  y = β0 + β1X1 + β2X2 + β3X3 + ε,

  where y is our continuous measure of disease progression, X1, X2, X3 are our serum-level measurements, and ε is a noise term.

- You can look at the coefficients, p-values, and t-statistics for your linear regression model in order to interpret your results.

- You learned everything (or most of what) you need to analyze this data set in Classical Statistics!
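As a minimal illustration (not from the original slides), this is how such a model could be fit in R, assuming a hypothetical data frame named diab with columns progression, ldl, hdl, and glu:

# `diab` is a hypothetical data frame with columns progression, ldl, hdl, glu
fit <- lm(progression ~ ldl + hdl + glu, data = diab)
summary(fit)   # coefficients, t-statistics, and p-values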


A Relationship Between the Variables?

[Scatterplot matrix of the four variables: progression, LDL, HDL, and GLU.]

Linear Model Output

              Estimate   Std. Error   T-Stat     P-Value
  Intercept    152.928        3.385   45.178    < 2e-16  ***
  LDL           77.057       75.701    1.018      0.309
  HDL         -487.574       75.605   -6.449   3.28e-10  ***
  GLU          477.604       76.643    6.232   1.18e-09  ***

progression measure ≈ 152.9 + 77.1 × LDL − 487.6 × HDL + 477.6 × GLU.

Low-Dimensional Versus High-Dimensional

- The data set that we just saw is low-dimensional: n ≫ p.

- Many of the data sets coming out of modern biological techniques are high-dimensional: n ≈ p or n ≪ p.

- This poses statistical challenges! Classical Statistics no longer applies.

Low Dimensional

High Dimensional

What Goes Wrong in High Dimensions?

- Suppose that we included many additional predictors in our model, such as
  - Age
  - Zodiac symbol
  - Favorite color
  - Mother's birthday, in base 2

- Some of these predictors are useful; others aren't.

- If we include too many predictors, we will overfit the data.

- Overfitting: the model looks great on the data used to develop it, but will perform very poorly on future observations.

- When p ≈ n or p > n, overfitting is guaranteed unless we are very careful.

Why Does Dimensionality Matter?

- Classical statistical techniques, such as linear regression, cannot be applied.

- Even very simple tasks, like identifying variables that are associated with a response, must be done with care.

- High risks of overfitting, false positives, and more.

This course: statistical machine learning tools for big — mostly high-dimensional — data.

Supervised and Unsupervised Learning

- Statistical machine learning can be divided into two main areas: supervised and unsupervised.

- Supervised Learning: use a data set X to predict, or detect association with, a response y.
  - Regression
  - Classification
  - Hypothesis Testing

- Unsupervised Learning: discover the signal in X, or detect associations within X.
  - Dimension Reduction
  - Clustering

Supervised Learning

Unsupervised Learning

"Course Textbook" ... with applications in R

- Available for (free!) download from www.statlearning.com.
- An accessible introduction to statistical machine learning, with an R lab at the end of each chapter.
- We will go through some of these R labs.
- To learn more, go through them on your own!

High-Dimensional Statistical Learning: Bias-Variance Tradeoff and the Test Error

Ali Shojaie, University of Washington

http://faculty.washington.edu/ashojaie/

August 2018, Tehran, Iran

Supervised Learning


Regression Versus Classification

- Regression: predict a quantitative response, such as
  - blood pressure
  - cholesterol level
  - tumor size

- Classification: predict a categorical response, such as
  - tumor versus normal tissue
  - heart disease versus no heart disease
  - subtype of glioblastoma

- This lecture: Regression.

Linear Models

- We have n observations, for each of which we have p predictor measurements and a response measurement.

- We want to develop a model of the form

  yi = β0 + β1 Xi1 + · · · + βp Xip + εi.

- Here εi is a noise term associated with the ith observation.

- We must estimate β0, β1, . . . , βp – i.e., we must fit the model.

Linear Model With p = 2 Predictors


What Makes a Model Linear?

- A linear model is linear in the regression coefficients!

- This is a linear model:

  yi = β1 sin(Xi1) + β2 Xi2 Xi3 + εi.

- This is not a linear model:

  yi = β1^(Xi1) + sin(β2 Xi2) + εi.
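A small sketch (not from the slides) of why linearity in the coefficients is what matters: the first model above can be fit with lm() after transforming the predictors. The data here are simulated purely for illustration.

# simulated data for illustration only
set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
y <- 2 * sin(x1) - 3 * x2 * x3 + rnorm(100)
fit <- lm(y ~ sin(x1) + I(x2 * x3))   # linear in the coefficients, nonlinear in the X's
coef(fit)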

Page 19: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Linear Models in Matrix Form

I For simplicity, ignore the intercept β0.I Assume

∑ni=1 yi =

∑ni=1 Xij = 0; in this case, β0 = 0.

I Alternatively, let the first column of X be a column of 1’s.

I In matrix form, we can write the linear model as

y = Xβ + ε,

i.e.

y1y2...yn

=

X11 X12 . . . X1p

X21 X22 . . . X2p...

.... . .

...Xn1 Xn2 . . . Xnp

β1β2...βp

+

ε1ε2...εn

.

7 / 48

Least Squares Regression

- There are many ways we could fit the model

  y = Xβ + ε.

- The most common approach in classical statistics is least squares:

  minimize over β:  ‖y − Xβ‖²,

  where ‖a‖² ≡ ∑_{i=1}^n ai² for a vector a = (a1, . . . , an).

- We are looking for β1, . . . , βp such that

  ∑_{i=1}^n (yi − (β1 Xi1 + · · · + βp Xip))²

  is as small as possible, or in other words, such that

  ∑_{i=1}^n (yi − ŷi)²

  is as small as possible, where ŷi is the ith predicted value.
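A minimal sketch (not from the slides, simulated data only) showing that this criterion can be minimized explicitly via the normal equations, and that lm() gives the same answer:

# simulated data for illustration only
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0.5) + rnorm(n)
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # minimizes ||y - X beta||^2
fit <- lm(y ~ X - 1)                        # same fit (no intercept)
cbind(beta.hat, coef(fit))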

Page 20: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Let’s Try Out Least Squares in R!

Chapter 3 R labwww.statlearning.com

9 / 48

Least Squares Regression

- When we fit a model, we use a training set of observations.

- We get coefficient estimates β̂1, . . . , β̂p.

- We also get predictions using our model, of the form

  ŷi = β̂1 Xi1 + · · · + β̂p Xip.

- We can evaluate the training error, i.e., the extent to which the model fits the observations used to train it.

- One way to quantify the training error is the mean squared error (MSE):

  MSE = (1/n) ∑_{i=1}^n (yi − ŷi)² = (1/n) ∑_{i=1}^n (yi − (β̂1 Xi1 + · · · + β̂p Xip))².

- The training error is closely related to the R² for a linear model – that is, the proportion of variance explained.

- Big R² ⇔ Small Training Error.
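As a small sketch (not from the slides, simulated data only), the training MSE and R² of a least squares fit can be computed directly from the fitted model:

# simulated data for illustration only
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)
fit <- lm(y ~ x)
mse <- mean((y - fitted(fit))^2)       # training MSE
rsq <- summary(fit)$r.squared          # proportion of variance explained
c(MSE = mse, R2 = rsq)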

Page 24: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Least Squares as More Variables are Included in the Model

I Training error and R2 are not good ways to evaluate amodel’s performance, because they will always improve asmore variables are added into the model.

I The problem? Training error and R2 evaluate the model’sperformance on the training observations.

I If I had an unlimited number of features to use in developinga model, then I could surely come up with a regression modelthat fits the training data perfectly! Unfortunately, this modelwouldn’t capture the true signal in the data.

I We really care about the model’s performance on testobservations – observations not used to fit the model.

11 / 48

Least Squares as More Variables are Included in the Model

I Training error and R2 are not good ways to evaluate amodel’s performance, because they will always improve asmore variables are added into the model.

I The problem? Training error and R2 evaluate the model’sperformance on the training observations.

I If I had an unlimited number of features to use in developinga model, then I could surely come up with a regression modelthat fits the training data perfectly! Unfortunately, this modelwouldn’t capture the true signal in the data.

I We really care about the model’s performance on testobservations – observations not used to fit the model.

11 / 48

Page 25: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Least Squares as More Variables are Included in the Model

I Training error and R2 are not good ways to evaluate amodel’s performance, because they will always improve asmore variables are added into the model.

I The problem? Training error and R2 evaluate the model’sperformance on the training observations.

I If I had an unlimited number of features to use in developinga model, then I could surely come up with a regression modelthat fits the training data perfectly! Unfortunately, this modelwouldn’t capture the true signal in the data.

I We really care about the model’s performance on testobservations – observations not used to fit the model.

11 / 48

Least Squares as More Variables are Included in the Model

I Training error and R2 are not good ways to evaluate amodel’s performance, because they will always improve asmore variables are added into the model.

I The problem? Training error and R2 evaluate the model’sperformance on the training observations.

I If I had an unlimited number of features to use in developinga model, then I could surely come up with a regression modelthat fits the training data perfectly! Unfortunately, this modelwouldn’t capture the true signal in the data.

I We really care about the model’s performance on testobservations – observations not used to fit the model.

11 / 48

Page 26: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

The Problem

As we add more variables into the model...

... the training error decreases and the R2 increases!

12 / 48

Why is this a Problem?

- We really care about the model's performance on observations not used to fit the model!
  - We want a model that will predict the survival time of a new patient who walks into the clinic!
  - We want a model that can be used to diagnose cancer for a patient not used in model training!
  - We want to predict the risk of diabetes for a patient who wasn't used to fit the model!

- What we really care about is

  (ytest − ŷtest)²,

  where

  ŷtest = β̂1 Xtest,1 + · · · + β̂p Xtest,p,

  and (Xtest, ytest) was not used to train the model.

- The test error is the average of (ytest − ŷtest)² over a bunch of test observations.

Training Error versus Test Error

As we add more variables into the model...

... the training error decreases and the R² increases!

But the test error might not!

Why the Number of Variables Matters

- Linear regression will have a very low training error if p is large relative to n.

- A simple example:
  - When n ≤ p, you can always get a perfect model fit to the training data!
  - But the test error will be awful.
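A quick sketch of this phenomenon (not from the slides): with p ≥ n, least squares drives the training error to essentially zero even when the predictors are pure noise.

# simulated data for illustration only
set.seed(1)
n <- 20; p <- 25
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                    # response is unrelated to x
fit <- lm(y ~ x)
mean(residuals(fit)^2)           # training MSE is (numerically) zero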

Page 29: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Model Complexity, Training Error, and Test Error

I In this course, we will consider various types of models.

I We will be very concerned with model complexity: e.g. thenumber of variables used to fit a model.

I As we fit more complex models – e.g. models with morevariables – the training error will always decrease.

I But the test error might not.

I As we will see, the number of variables in the model is not theonly – or even the best – way to quantify model complexity.

17 / 48

An Example In R

xtr <- matrix(rnorm(100*100), ncol=100)      # training X: n = 100, p = 100
xte <- matrix(rnorm(100000*100), ncol=100)   # test X: 100,000 observations
beta <- c(rep(1,10), rep(0,90))              # only the first 10 variables carry signal
ytr <- xtr %*% beta + rnorm(100)
yte <- xte %*% beta + rnorm(100000)
rsq <- trainerr <- testerr <- NULL
for(i in 2:100){
  mod <- lm(ytr ~ xtr[,1:i])                 # fit using the first i variables
  rsq <- c(rsq, summary(mod)$r.squared)
  bhat <- mod$coef[-1]
  bhat[is.na(bhat)] <- 0                     # lm() returns NA for redundant columns once i+1 > n
  intercept <- mod$coef[1]
  trainerr <- c(trainerr, mean((xtr[,1:i] %*% bhat + intercept - ytr)^2))
  testerr  <- c(testerr,  mean((xte[,1:i] %*% bhat + intercept - yte)^2))
}
par(mfrow=c(1,3))
plot(2:100, rsq, xlab="Number of Variables", ylab="R Squared", log="y")
abline(v=10, col="red")
plot(2:100, trainerr, xlab="Number of Variables", ylab="Training Error", log="y")
abline(v=10, col="red")
plot(2:100, testerr, xlab="Number of Variables", ylab="Test Error", log="y")
abline(v=10, col="red")

Page 30: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

A Simulated Example

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●

0 20 40 60 80 100

0.2

0.4

0.6

0.8

1.0

Number of Variables

R S

quar

ed

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 20 40 60 80 1001e

−28

1e−

211e

−14

1e−

071e

+00

Number of Variables

Trai

ning

Err

or

●●●●●

●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●

●●●●

●●●●●●

●●●

●●●

●●●

●●

0 20 40 60 80 100

12

510

2050

100

200

500

Number of Variables

Test

Err

or

I 1st 10 variables are related to response; remaining 90 are not.

I R2 increases and training error decreases as more variables areadded to the model.

I Test error is lowest when only signal variables in model.

19 / 48

Bias and Variance

- As model complexity increases, the bias of β̂ – the average difference between β̂ and β if we were to repeat the experiment a huge number of times – will decrease.

- But as complexity increases, the variance of β̂ – the amount by which the β̂'s will differ across experiments – will increase.

- The test error depends on both the bias and the variance:

  Test Error = Bias² + Variance.

- There is a bias-variance trade-off. We want a model that is sufficiently complex as to have not too much bias, but not so complex that it has too much variance.
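A small simulation sketch of the variance half of this trade-off (not from the slides): across repeated simulated experiments, the variance of the estimate of β1 grows as more irrelevant variables are added to the model.

# simulated data for illustration only
set.seed(1)
n <- 100
betahat1 <- function(p) {
  x <- matrix(rnorm(n * p), n, p)
  y <- x[, 1] + rnorm(n)               # only variable 1 matters
  coef(lm(y ~ x))[2]                   # estimate of beta_1
}
var(replicate(500, betahat1(5)))       # small model: low variance
var(replicate(500, betahat1(60)))      # complex model: higher variance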

A Really Fundamental Picture


Page 33: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Overfitting

I Fitting an overly complex model – a model that has too muchvariance – is known as overfitting.

I In the high-dimensional setting, when p � n, we must workhard not to overfit the data.

I In particular, we must rely not on training error, but on testerror, as a measure of model performance.

I How can we estimate the test error?

22 / 48

Overfitting

I Fitting an overly complex model – a model that has too muchvariance – is known as overfitting.

I In the high-dimensional setting, when p � n, we must workhard not to overfit the data.

I In particular, we must rely not on training error, but on testerror, as a measure of model performance.

I How can we estimate the test error?

22 / 48

Page 34: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Overfitting

I Fitting an overly complex model – a model that has too muchvariance – is known as overfitting.

I In the high-dimensional setting, when p � n, we must workhard not to overfit the data.

I In particular, we must rely not on training error, but on testerror, as a measure of model performance.

I How can we estimate the test error?

22 / 48

Overfitting

I Fitting an overly complex model – a model that has too muchvariance – is known as overfitting.

I In the high-dimensional setting, when p � n, we must workhard not to overfit the data.

I In particular, we must rely not on training error, but on testerror, as a measure of model performance.

I How can we estimate the test error?

22 / 48

Page 35: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Overfitting

I Fitting an overly complex model – a model that has too muchvariance – is known as overfitting.

I In the high-dimensional setting, when p � n, we must workhard not to overfit the data.

I In particular, we must rely not on training error, but on testerror, as a measure of model performance.

I How can we estimate the test error?

22 / 48

Training Set Versus Test Set

- Split the samples into a training set and a test set.

- Fit the model on the training set, and evaluate it on the test set.

Q: Can there ever, under any circumstance, be sample overlap between the training and test sets?
A: Never!

You can't peek at the test set until you are completely done with all aspects of model-fitting on the training set!

Page 38: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Training Set And Test Set

To get an estimate of the test error of a particular model on afuture observation:

1. Split the samples into a training set and a test set.

2. Fit the model on the training set.

3. Evaluate its performance on the test set.

4. The test set error rate is an estimate of the model’sperformance on a future observation.

But remember: no peeking!

25 / 48

Training Set And Test Set

To get an estimate of the test error of a particular model on afuture observation:

1. Split the samples into a training set and a test set.

2. Fit the model on the training set.

3. Evaluate its performance on the test set.

4. The test set error rate is an estimate of the model’sperformance on a future observation.

But remember: no peeking!

25 / 48

Page 39: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Choosing Between Several Models

I In general, we will consider a lot of possible models – e.g.models with different levels of complexity. We must decidewhich model is best.

I We have split our samples into a training set and a test set.But remember: we can’t peek at the test set until we havecompletely finalized our choice of model!

I We must pick a best model based on the training set, but wewant a model that will have low test error!

I How can we estimate test error using only the training set?

1. The validation set approach.2. Leave-one-out cross-validation.3. K -fold cross-validation.

I In what follows, assume we have split the data into a trainingset and a test set, and the training set contains n observations.

26 / 48

Choosing Between Several Models

I In general, we will consider a lot of possible models – e.g.models with different levels of complexity. We must decidewhich model is best.

I We have split our samples into a training set and a test set.But remember: we can’t peek at the test set until we havecompletely finalized our choice of model!

I We must pick a best model based on the training set, but wewant a model that will have low test error!

I How can we estimate test error using only the training set?

1. The validation set approach.2. Leave-one-out cross-validation.3. K -fold cross-validation.

I In what follows, assume we have split the data into a trainingset and a test set, and the training set contains n observations.

26 / 48

Page 40: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Choosing Between Several Models

I In general, we will consider a lot of possible models – e.g.models with different levels of complexity. We must decidewhich model is best.

I We have split our samples into a training set and a test set.But remember: we can’t peek at the test set until we havecompletely finalized our choice of model!

I We must pick a best model based on the training set, but wewant a model that will have low test error!

I How can we estimate test error using only the training set?

1. The validation set approach.2. Leave-one-out cross-validation.3. K -fold cross-validation.

I In what follows, assume we have split the data into a trainingset and a test set, and the training set contains n observations.

26 / 48

Choosing Between Several Models

I In general, we will consider a lot of possible models – e.g.models with different levels of complexity. We must decidewhich model is best.

I We have split our samples into a training set and a test set.But remember: we can’t peek at the test set until we havecompletely finalized our choice of model!

I We must pick a best model based on the training set, but wewant a model that will have low test error!

I How can we estimate test error using only the training set?

1. The validation set approach.2. Leave-one-out cross-validation.3. K -fold cross-validation.

I In what follows, assume we have split the data into a trainingset and a test set, and the training set contains n observations.

26 / 48

Page 41: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Choosing Between Several Models

I In general, we will consider a lot of possible models – e.g.models with different levels of complexity. We must decidewhich model is best.

I We have split our samples into a training set and a test set.But remember: we can’t peek at the test set until we havecompletely finalized our choice of model!

I We must pick a best model based on the training set, but wewant a model that will have low test error!

I How can we estimate test error using only the training set?

1. The validation set approach.2. Leave-one-out cross-validation.3. K -fold cross-validation.

I In what follows, assume we have split the data into a trainingset and a test set, and the training set contains n observations.

26 / 48

Choosing Between Several Models

I In general, we will consider a lot of possible models – e.g.models with different levels of complexity. We must decidewhich model is best.

I We have split our samples into a training set and a test set.But remember: we can’t peek at the test set until we havecompletely finalized our choice of model!

I We must pick a best model based on the training set, but wewant a model that will have low test error!

I How can we estimate test error using only the training set?

1. The validation set approach.2. Leave-one-out cross-validation.3. K -fold cross-validation.

I In what follows, assume we have split the data into a trainingset and a test set, and the training set contains n observations.

26 / 48

Validation Set Approach

Split the n observations into two sets of approximately equal size. Train on one set, and evaluate performance on the other.

For a given model, we perform the following procedure:

1. Split the observations into two sets of approximately equal size, a training set and a validation set.
   a. Fit the model using the training observations. Let β̂^(train) denote the regression coefficient estimates.
   b. For each observation in the validation set, compute the error ei = (yi − xiᵀβ̂^(train))².

2. Calculate the total validation set error by summing the ei's over all of the validation set observations.

Out of a set of candidate models, the "best" one is the one for which the total error is smallest.
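A minimal sketch of this procedure for a single candidate model (not from the slides, simulated data only):

# simulated data for illustration only
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)
dat <- data.frame(x = x, y = x[, 1] + rnorm(n))    # signal in the first variable
val <- sample(n, n / 2)                            # hold out a validation half
fit <- lm(y ~ ., data = dat[-val, ])               # fit on the other half
sum((dat$y[val] - predict(fit, dat[val, ]))^2)     # total validation set error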

Leave-One-Out Cross-Validation

Fit n models, each on n − 1 of the observations. Evaluate each model on the left-out observation.

For a given model, we perform the following procedure:

1. For i = 1, . . . , n:
   a. Fit the model using observations 1, . . . , i − 1, i + 1, . . . , n. Let β̂^(i) denote the regression coefficient estimates.
   b. Compute the error ei = (yi − xiᵀβ̂^(i))².

2. Calculate ∑_{i=1}^n ei, the total CV error.

Out of a set of candidate models, the "best" one is the one for which the total error is smallest.
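A corresponding sketch of leave-one-out cross-validation for a single model (not from the slides, simulated data only):

# simulated data for illustration only
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)
dat <- data.frame(x = x, y = x[, 1] + rnorm(n))
errs <- sapply(1:n, function(i) {
  fit <- lm(y ~ ., data = dat[-i, ])               # fit without observation i
  (dat$y[i] - predict(fit, dat[i, ]))^2            # error on the left-out observation
})
sum(errs)                                          # total LOOCV error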

5-Fold Cross-Validation

Split the observations into 5 sets. Repeatedly train the model on 4 sets and evaluate its performance on the 5th.

K-fold cross-validation

A generalization of leave-one-out cross-validation. For a given model, we perform the following procedure:

1. Split the n observations into K equally-sized folds.

2. For k = 1, . . . , K:
   a. Fit the model using the observations not in the kth fold.
   b. Let ek denote the test error for the observations in the kth fold.

3. Calculate ∑_{k=1}^K ek, the total CV error.

Out of a set of candidate models, the "best" one is the one for which the total error is smallest.
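A sketch of K-fold CV written out by hand (not from the slides; the example on the next slide uses cv.glm() from the boot package instead):

# simulated data for illustration only
set.seed(1)
n <- 100; K <- 5
x <- matrix(rnorm(n * 5), n, 5)
dat <- data.frame(x = x, y = x[, 1] + rnorm(n))
fold <- sample(rep(1:K, length.out = n))           # random fold assignment
errs <- sapply(1:K, function(k) {
  fit <- lm(y ~ ., data = dat[fold != k, ])        # fit on all folds but the kth
  sum((dat$y[fold == k] - predict(fit, dat[fold == k, ]))^2)
})
sum(errs)                                          # total CV error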

An Example In R

library(boot)                                  # for cv.glm()
xtr <- matrix(rnorm(100*100), ncol=100)
beta <- c(rep(1,10), rep(0,90))                # only the first 10 variables carry signal
ytr <- xtr %*% beta + rnorm(100)
cv.err <- NULL
for(i in 2:50){
  dat <- data.frame(x=xtr[,1:i], y=ytr)
  mod <- glm(y~., data=dat)
  cv.err <- c(cv.err, cv.glm(dat, mod, K=6)$delta[1])   # 6-fold CV estimate
}
plot(2:50, cv.err, xlab="Number of Variables",
     ylab="6-Fold CV Error", log="y")
abline(v=10, col="red")

Output of R Code

[Plot: 6-Fold CV Error on a log scale against the Number of Variables (2 to 50), with a vertical red line at 10.]

- Six-fold CV identifies a model with just over ten predictors.

- The first ten predictors contain signal, and the rest are noise.

After Estimating the Test Error on the Training Set...

After we estimate the test error using the training set, we refit the "best" model on all of the available training observations. We then evaluate this model on the test set.

Big Picture


Let's Try Out Cross-Validation in R!

Chapter 5 R lab
First Half: Cross-Validation
www.statlearning.com

Summary: Four-Step Procedure

1. Split the observations into a training set and a test set.

2. Fit a bunch of models on the training set, and estimate the test error, using cross-validation or the validation set approach.

3. Refit the best model on the full training set.

4. Evaluate the model's performance on the test set.
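Putting the four steps together, a minimal end-to-end sketch (not from the slides; simulated data, with the number of variables as the complexity measure and cv.glm() from the boot package for step 2):

# simulated data for illustration only
library(boot)
set.seed(1)
x <- matrix(rnorm(200 * 30), 200, 30)
y <- x[, 1:5] %*% rep(1, 5) + rnorm(200)
test <- sample(200, 100)                           # step 1: hold out a test set
cv.err <- sapply(2:30, function(i) {
  dat <- data.frame(x = x[-test, 1:i], y = y[-test])
  cv.glm(dat, glm(y ~ ., data = dat), K = 5)$delta[1]   # step 2: CV on the training set
})
best <- (2:30)[which.min(cv.err)]                  # model size with the lowest CV error
dat.tr <- data.frame(x = x[-test, 1:best], y = y[-test])
fit <- glm(y ~ ., data = dat.tr)                   # step 3: refit on the full training set
dat.te <- data.frame(x = x[test, 1:best], y = y[test])
mean((dat.te$y - predict(fit, dat.te))^2)          # step 4: test set error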

Why All the Bother?

Q: Why do I need to have a separate test set? Why can't I just estimate the test error using cross-validation or a validation set approach on all the observations, and then be done with it?

A: In general, we are choosing between a whole bunch of models, and we will use the cross-validation or validation set approach to pick between these models. If we use the resulting estimate as a final estimate of test error, then it could be an extreme underestimate, because one model might give a lower estimated test error than the others just by chance. To avoid an extreme underestimate of the test error, we need to evaluate the "best" model on an independent test set. This is particularly important in high dimensions!

Regression in High Dimensions

- We usually cannot perform least squares regression to fit a model in the high-dimensional setting, because we will get zero training error but a terrible test error.

- Instead, we must fit a less complex model, e.g., a model with fewer variables.

- We will consider five ways to fit less complex models:
  1. Variable Pre-Selection
  2. Forward Stepwise Regression
  3. Ridge Regression
  4. Lasso Regression
  5. Principal Components Regression

- These are alternatives to fitting a linear model using least squares.

- Each of these approaches will allow us to choose the level of complexity – e.g., the number of variables in the model.

Regression in High Dimensions

I We usually cannot perform least squares regression to fit amodel in the high-dimensional setting, because we will getzero training error but a terrible test error.

I Instead, we must fit a less complex model, e.g. a model withfewer variables.

If you

I fit your model carelessly;
I do not properly estimate the test error;
I or select a model based on training set rather than test set performance;

then you will woefully overfit your training data, leading to a model that looks good on training data but will perform atrociously on future observations.

Our intuition breaks down in high dimensions, and so rigorous model-fitting is crucial.

The Curse of Dimensionality

Q: A data set with more variables is better than a data set with fewer variables, right?

A: Not necessarily!

Noise variables – such as genes whose expression levels are not truly associated with the response being studied – will simply increase the risk of overfitting, and the difficulty of developing an effective model that will perform well on future observations.

On the other hand, more signal variables – variables that are truly associated with the response being studied – are always useful!
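As a small simulated illustration of the first point (a sketch with arbitrary sizes, not from the slides): adding pure-noise predictors to a fixed set of signal variables usually makes least squares perform worse on new observations.

set.seed(1)
n <- 100
xsig <- matrix(rnorm(2*n*5), ncol=5)          # 5 signal variables
y <- xsig %*% rep(1, 5) + rnorm(2*n)
xnoise <- matrix(rnorm(2*n*60), ncol=60)      # 60 pure-noise variables
train <- 1:n; test <- (n+1):(2*n)
test.mse <- function(x) {
  fit <- lm(y[train] ~ x[train, ])
  mean((y[test] - cbind(1, x[test, ]) %*% coef(fit))^2)
}
test.mse(xsig)                    # signal variables only
test.mse(cbind(xsig, xnoise))     # same signal plus noise variables: typically a larger test error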

Wise Words

In high-dimensional data analysis, common mistakes are simple, and simple mistakes are common.

– Keith Baggerly


Before You’re Done With Your Analysis

I Estimate the test error.
I Do a “sanity check” whenever possible.
  I “Spot-check” the variables that have the largest coefficients in the model.
  I Rewrite your code from scratch. Do you get the same answer again?

Fitting models in high dimensions: one mistake away from disaster!

High-Dimensional Statistical Learning: Regression

Ali Shojaie
University of Washington

http://faculty.washington.edu/ashojaie/

August 2018
Tehran, Iran

Linear Models in High Dimensions

I When p is large, least squares regression will lead to very low training error but terrible test error.

I We will now see some approaches for fitting linear models in high dimensions, p ≫ n.

I These approaches also work well when p ≈ n or n > p.

Motivating example

I We would like to build a model to predict survival time for breast cancer patients using a number of clinical measurements (tumor stage, tumor grade, tumor size, patient age, etc.) as well as some biomarkers.

I For instance, these biomarkers could be:
  I the expression levels of genes measured using a microarray.
  I protein levels.
  I mutations in genes potentially implicated in breast cancer.

I How can we develop a model with low test error in this setting?

Remember

I We have n training observations.

I Our goal is to get a model that will perform well on future test observations.

I We’ll incur some bias in order to reduce variance.

Variable Pre-Selection

The simplest approach for fitting a model in high dimensions:

1. Choose a small set of variables, say the q variables that are most correlated with the response, where q < n and q < p.

2. Use least squares to fit a model predicting y using only these q variables.

This approach is simple and straightforward.

Variable Pre-Selection in R

# Simulate n = 100 observations on p = 100 variables; only the first 10 variables carry signal
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
# Pre-select the variables whose absolute correlation with the response exceeds 0.2
cors <- cor(xtr,ytr)
whichers <- which(abs(cors)>.2)
# Fit least squares using only the pre-selected variables
mod <- lm(ytr~xtr[,whichers])
print(summary(mod))

How Many Variables to Use?

I We need a way to choose q, the number of variables used in the regression model.

I We want q that minimizes the test error.

I For a range of values of q, we can perform the validation set approach, leave-one-out cross-validation, or K-fold cross-validation in order to estimate the test error.

I Then choose the value of q for which the estimated test error is smallest.

Estimating the Test Error For a Given q

This is the right way to estimate the test error using the validation set approach:

1. Split the observations into a training set and a validation set.

2. Using the training set only:
   a. Identify the q variables most associated with the response.
   b. Use least squares to fit a model predicting y using those q variables.
   c. Let β1, . . . , βq denote the resulting coefficient estimates.

3. Use β1, . . . , βq obtained on the training set to predict the response on the validation set, and compute the validation set MSE.
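Here is a minimal, self-contained R sketch of the right way for a single value of q (the simulated data, the 50/50 split, and q = 10 are arbitrary illustrative choices, not prescribed by the slides). Note that the screening step uses the training set only.

set.seed(1)
xtr <- matrix(rnorm(100*100), ncol=100)            # same style of simulated data as above
ytr <- xtr %*% c(rep(1,10), rep(0,90)) + rnorm(100)
q <- 10
train <- sample(1:100, 50)                          # the other 50 observations form the validation set
cors <- cor(xtr[train, ], ytr[train])               # screen variables using the training set only
top  <- order(abs(cors), decreasing=TRUE)[1:q]      # the q variables most associated with the response
fit  <- lm(ytr[train] ~ xtr[train, top])            # least squares on the training set
pred <- cbind(1, xtr[-train, top]) %*% coef(fit)    # predict on the validation set
mean((ytr[-train] - pred)^2)                        # validation set MSE for this value of q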

Estimating the Test Error For a Given q

This is the wrong way to estimate the test error using the validation set approach:

1. Identify the q variables most associated with the response on the full data set.

2. Split the observations into a training set and a validation set.

3. Using the training set only:
   a. Use least squares to fit a model predicting y using those q variables.
   b. Let β1, . . . , βq denote the resulting coefficient estimates.

4. Use β1, . . . , βq obtained on the training set to predict the response on the validation set, and compute the validation set MSE.

Frequently Asked Questions

I Q: Does it really matter how you estimate the test error?
  A: Yes.

I Q: Would anyone make such a silly mistake?
  A: Yes.

A Better Approach

I The variable pre-selection approach is simple and easy to implement — all you need is a way to calculate correlations, and software to fit a linear model using least squares.

I But it might not work well: just because a bunch of variables are correlated with the response doesn’t mean that when used together in a linear model, they will predict the response well.

I What we really want to do: pick the q variables that best predict the response.

Subset Selection

Several Approaches:

I Best Subset Selection: consider all subsets of predictors. A computational mess!

I Stepwise Regression: greedily add/remove predictors. Heuristic and potentially inefficient.

I Modern Penalized Methods

Best Subset Selection

I We would like to consider all possible models using a subset of the p predictors.

I In other words, we’d like to consider all 2^p possible models.

I This is called best subset selection.

I Unfortunately, this is computationally intractable:

  I When p = 3, 2^p = 8.
  I When p = 6, 2^p = 64.
  I When p = 250, there are 2^250 ≈ 10^75 possible models; according to www.universetoday.com, that is within a few orders of magnitude of the number of atoms in the known universe.

I Not feasible to consider so many models!

I Need an efficient way to sift through all of these models: forward stepwise regression.

Forward Stepwise Regression

1. Use least squares to fit p univariate regression models, and select the predictor corresponding to the best model (according to e.g. training set MSE).

2. Use least squares to fit p − 1 models containing that one predictor, and each of the p − 1 other predictors. Select the predictors in the best two-variable model.

3. Now use least squares to fit p − 2 models containing those two predictors, and each of the p − 2 other predictors. Select the predictors in the best three-variable model.

4. And so on....

This gives us a nested set of models, containing the predictors

M1 ⊆ M2 ⊆ M3 ⊆ . . .

Forward Stepwise Regression With p = 10

Example in R

# Simulate the same data: 100 observations, 100 variables, 10 of them with signal
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
# Forward stepwise regression via the leaps package, for models with up to 30 variables
library(leaps)
out <- regsubsets(xtr,ytr,nvmax=30,method="forward")
print(summary(out))
# Coefficients of the 1-variable through 10-variable models along the forward path
print(coef(out,1:10))

Which Value of q is Best?

I This procedure traces out a set of models, containing between 1 and p variables.

I The qth model contains q variables, given by the set Mq.

I Q: Which value of q is best?
  A: The one that minimizes the test error!

I We can select the value of q using cross-validation or the validation set approach.
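Here is a minimal R sketch of that selection step using the validation set approach with regsubsets() from the leaps package (the simulated data, the 50/50 split, and the coefficient-lookup idiom are illustrative assumptions, not from the slides):

library(leaps)
set.seed(1)
xtr <- matrix(rnorm(100*100), ncol=100)
colnames(xtr) <- paste0("V", 1:100)                 # named columns so coefficients can be matched below
ytr <- xtr %*% c(rep(1,10), rep(0,90)) + rnorm(100)
train <- sample(1:100, 50)
out <- regsubsets(xtr[train, ], ytr[train], nvmax=30, method="forward")
val.mat <- cbind("(Intercept)"=1, xtr[-train, ])    # validation design matrix, with an intercept column
val.err <- rep(NA, 30)
for (q in 1:30) {
  coefq <- coef(out, id=q)                          # coefficients of the q-variable model
  pred  <- val.mat[, names(coefq)] %*% coefq        # predictions on the validation set
  val.err[q] <- mean((ytr[-train] - pred)^2)        # validation MSE for each q
}
which.min(val.err)                                  # the value of q with smallest estimated test error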

Drawback of Forward Stepwise Selection

I Forward stepwise selection isn’t guaranteed to give you the best model containing q variables.

I To get the best model with q variables, you’d need to consider every possible one; computationally intractable.

I For instance, suppose that the best model with one variable is

y = β3X3 + ε

and the best model with two variables is

y = β4X4 + β8X8 + ε.

Then forward stepwise selection will not identify the best two-variable model.

I Q: Does this really happen in practice?
  A: Yes.

How To Do Forward Stepwise?

Wrong: Split the data into a training set and a validation set. Perform forward stepwise on the training set, and identify the model with best performance on the validation set. Then, refit the model (using those q variables) on the full data set.

Right: Split the data into a training set and a validation set. Perform forward stepwise on the training set, and identify the value of q corresponding to the best-performing model on the validation set. Then, perform forward stepwise selection in order to obtain a q-variable model on the full data set.

Bottom Line: We estimate the test error in order to choose the correct level of model complexity. Then we refit the model on the full data set.

Let’s Try It Out in R!

Chapter 6 R Lab, Part 1
www.statlearning.com

Ridge Regression and the Lasso

I Best subset and stepwise regression control model complexity by using subsets of the predictors.

I Ridge regression and the lasso instead control model complexity by using an alternative to least squares, by shrinking the regression coefficients.

I This is known as regularization or penalization.

I Hot area in statistical machine learning today.

Crazy Coefficients

I When p > n, some of the variables are highly correlated.

I Why does correlation matter?
  I Suppose that X1 and X2 are highly correlated with each other... assume X1 = X2 for the sake of argument.
  I And suppose that the least squares model is

      y = X1 − 2X2 + 3X3.

  I Then this is also a least squares model:

      y = 100000001X1 − 100000002X2 + 3X3.

I Bottom Line: When there are too many variables, the least squares coefficients can get crazy!

I This craziness is directly responsible for poor test error.

I It amounts to too much model complexity.
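A quick simulated illustration of this point (a sketch with arbitrary numbers, not from the slides): when two predictors are nearly identical, the individual least squares coefficients become wildly unstable, even though their fitted values remain reasonable.

set.seed(1)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd=0.001)   # nearly identical to x1 (an exact duplicate would simply be dropped by lm())
x3 <- rnorm(50)
y  <- x1 - 2*x2 + 3*x3 + rnorm(50)
coef(lm(y ~ x1 + x2 + x3))       # the coefficients on x1 and x2 are typically large and offsetting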

A Solution: Don’t Let the Coefficients Get Too Crazy

I Recall that least squares involves finding β that minimizes

‖y − Xβ‖².

I Ridge regression involves finding β that minimizes

‖y − Xβ‖² + λ ∑j βj².

I Equivalently, find β that minimizes

‖y − Xβ‖²

subject to the constraint that ∑j βj² ≤ s.

Ridge Regression

I Ridge regression coefficient estimates minimize

‖y − Xβ‖² + λ ∑j βj².

I Here λ is a nonnegative tuning parameter that shrinks the coefficient estimates.

I When λ = 0, then ridge regression is just the same as least squares.

I As λ increases, ∑j (βRλ,j)² decreases – i.e. the coefficients become shrunken towards zero.

I When λ = ∞, βRλ = 0.

Ridge Regression As λ Varies

Ridge Regression In Practice

I Perform ridge regression for a very fine grid of λ values.

I Use cross-validation or the validation set approach to select the optimal value of λ – that is, the best level of model complexity.

I Perform ridge on the full data set, using that value of λ.

Example in R

# Simulate the same data: 100 observations, 100 variables, 10 of them with signal
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
# Ridge regression (alpha=0) over a grid of lambda values, with 5-fold cross-validation
library(glmnet)
cv.out <- cv.glmnet(xtr,ytr,alpha=0,nfolds=5)
print(cv.out$cvm)                  # cross-validation error for each lambda
plot(cv.out)
cat("CV Errors", cv.out$cvm,fill=TRUE)
cat("Lambda with smallest CV Error",
    cv.out$lambda[which.min(cv.out$cvm)],fill=TRUE)
cat("Coefficients", as.numeric(coef(cv.out)),fill=TRUE)
# Count coefficients that are numerically zero; ridge typically sets none exactly to zero
cat("Number of Zero Coefficients",
    sum(abs(coef(cv.out))<1e-8),fill=TRUE)

R Output

[Figure: output of plot(cv.out) for ridge regression: cross-validated mean-squared error versus log(Lambda).]


Drawbacks of Ridge

I Ridge regression is a simple idea and has a number of attractive properties: for instance, you can continuously control model complexity through the tuning parameter λ.

I But it suffers in terms of model interpretability, since the final model contains all p variables, no matter what.

I Often want a simpler model involving a subset of the features.

I The lasso involves performing a little tweak to ridge regression so that the resulting model contains mostly zeros.

I In other words, the resulting model is sparse. We say that the lasso performs feature selection.

I The lasso is a very active area of research interest in the statistical community!

The Lasso

I The lasso involves finding β that minimizes

‖y − Xβ‖² + λ ∑j |βj|.

I Equivalently, find β that minimizes

‖y − Xβ‖²

subject to the constraint that ∑j |βj| ≤ s.

I So lasso is just like ridge, except that βj² has been replaced with |βj|.

The Lasso

I Lasso is a lot like ridge:

  I λ is a nonnegative tuning parameter that controls model complexity.
  I When λ = 0, we get least squares.
  I When λ is very large, we get βLλ = 0.

I But unlike ridge, lasso will give some coefficients exactly equal to zero for intermediate values of λ!

Lasso As λ Varies

Lasso In Practice

I Perform lasso for a very fine grid of λ values.

I Use cross-validation or the validation set approach to select the optimal value of λ – that is, the best level of model complexity.

I Perform the lasso on the full data set, using that value of λ.

Example in R

# Simulate the same data: 100 observations, 100 variables, 10 of them with signal
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
# The lasso (alpha=1) over a grid of lambda values, with 5-fold cross-validation
library(glmnet)
cv.out <- cv.glmnet(xtr,ytr,alpha=1,nfolds=5)
print(cv.out$cvm)                  # cross-validation error for each lambda
plot(cv.out)
cat("CV Errors", cv.out$cvm,fill=TRUE)
cat("Lambda with smallest CV Error",
    cv.out$lambda[which.min(cv.out$cvm)],fill=TRUE)
cat("Coefficients", as.numeric(coef(cv.out)),fill=TRUE)
# Unlike ridge, the lasso sets many coefficients exactly to zero
cat("Number of Zero Coefficients",sum(abs(coef(cv.out))<1e-8),
    fill=TRUE)

R Output

[Figure: output of plot(cv.out) for the lasso: cross-validated mean-squared error versus log(Lambda); the numbers along the top of the plot (100, 100, 98, 96, 93, 90, 88, 80, 67, 48, 24, 16, 13, 9, 2) give the number of nonzero coefficients at each value of λ.]


Ridge and Lasso: A Geometric Interpretation

Pros/Cons of Each Approach

Approach            Simplicity?*   Sparsity?**   Predictions?***
Pre-Selection       Good           Yes           So-So
Forward Stepwise    Good           Yes           So-So
Ridge               Medium         No            Great
Lasso               Bad            Yes           Great

* How simple is this model-fitting procedure? If you were stranded on a desert island with pretty limited statistical software, could you fit this model?
** Does this approach perform feature selection, i.e. is the resulting model sparse?
*** How good are the predictions resulting from this model?

No “Best” Approach

I There is no “best” approach to regression in high dimensions.

I Some approaches will work better than others. For instance:
  I Lasso will work well if it’s really true that just a few features are associated with the response.
  I Ridge will do better if all of the features are associated with the response.

I If somebody tells you that one approach is “best”... then they are mistaken. Politely contradict them.

I While no approach is “best”, some approaches are wrong (e.g.: there is a wrong way to do cross-validation)!

Let’s Try It Out in R!

Chapter 6 R Lab, Part 2
www.statlearning.com

Making Linear Regression Less Linear

What if the relationship isn’t linear?

y = 3 sin(x) + ε

y = 2e^x + ε

y = 3x² + 2x + 1 + ε

If we know the functional form we can still use “linear regression”

For example, for y = 3 sin(x) + ε, regress y on the single transformed feature sin(x); for y = 3x² + 2x + 1 + ε, regress y on the two features x and x².
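A minimal R sketch of this (the simulated data and coefficients are arbitrary illustrative choices):

set.seed(1)
x <- runif(100, 0, 6)
y1 <- 3*sin(x) + rnorm(100, sd=0.3)
fit.sin <- lm(y1 ~ sin(x))          # regress y on the known transformation sin(x)
y2 <- 3*x^2 + 2*x + 1 + rnorm(100)
fit.poly <- lm(y2 ~ x + I(x^2))     # regress y on x and x^2
coef(fit.sin)
coef(fit.poly)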

What if we don’t know the right functional form?

Use a flexible basis expansion:

I polynomial basis: expand x into the columns x, x^2, . . . , x^k.

I hockey-stick (spline) basis: expand x into the columns x, (x − t1)+, . . . , (x − tk)+.

[Figure: fitted curves labeled “sin” and “poly” overlaid on data, with x ranging from 0 to 6 and y from −1 to 1.]

For high-dimensional problems, expand each variable x1, x2, . . . , xp into its own set of basis columns (e.g. xj, xj^2, . . . , xj^k),

and use the Lasso on this expanded problem.

k must be small (∼ 5ish)

Spline basis generally outperforms polynomial
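A minimal R sketch of this recipe (the simulated data, the use of poly() for the per-variable expansion, and k = 4 are illustrative assumptions, not prescribed by the slides):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100*20), ncol=20)
y <- 3*sin(x[,1]) + x[,2]^2 + rnorm(100)
k <- 4                                              # keep k small, as suggested above
# expand each of the p variables into k polynomial basis columns
xexp <- do.call(cbind, lapply(1:ncol(x), function(j) poly(x[,j], degree=k)))
cv.out <- cv.glmnet(xexp, y, alpha=1, nfolds=5)     # the lasso on the expanded problem
sum(as.numeric(coef(cv.out)) != 0)                  # how many coefficients (intercept included) are nonzero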

Bottom Line

Much more important than what model you fit is how you fit it.

I Was cross-validation performed properly?

I Did you select a model (or level of model complexity) based on an estimate of test error?

High-Dimensional Statistical Learning: Classification

Ali Shojaie
University of Washington

http://faculty.washington.edu/ashojaie/

August 2018
Tehran, Iran

Classification

I Regression involves predicting a continuous-valued response, like tumor size.

I Classification involves predicting a categorical response:
  I Cancer versus Normal
  I Tumor Type 1 versus Tumor Type 2 versus Tumor Type 3

I Classification problems tend to occur very frequently in the analysis of high-dimensional data.

I Just like regression,
  I Classification cannot be blindly performed in high dimensions because you will get zero training error but awful test error;
  I Properly estimating the test error is crucial; and
  I There are a few tricks to extend classical classification approaches to high dimensions, which we have already seen in the regression context!

Classification

I There are many approaches out there for performingclassification.

I We will discuss a few: logistic regression, K-nearest neighbors, discriminant analysis, and support vector machines.

3 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Logistic Regression

I Logistic regression is the straightforward extension of linear regression to the classification setting.

I For simplicity, suppose y ∈ {0, 1}: a two-class classification problem.

I The simple linear model

y = Xβ + ε

doesn't make sense for classification.

I Instead, the logistic regression model is

P(y = 1 | X) = exp(X^T β) / (1 + exp(X^T β)).

I We usually fit this model using maximum likelihood — like least squares, but for logistic regression.

4 / 50
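For a low-dimensional warm-up, here is a hedged sketch of fitting the logistic model by maximum likelihood with glm() in R; the simulated data, the variables x1 and x2, and the coefficient values are made up for illustration.

## A small sketch of maximum-likelihood logistic regression with glm().
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
p  <- exp(1 + 2 * x1 - x2) / (1 + exp(1 + 2 * x1 - x2))  # true P(y = 1 | X)
y  <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x1 + x2, family = binomial)   # maximum-likelihood fit
summary(fit)                                 # coefficients, z-statistics, p-values
predict(fit, type = "response")[1:5]         # fitted probabilities P(y = 1 | X)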


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Why Not Linear Regression?

[Figure: two panels plotting probability of cancer against biomarker level.]

I Left: linear regression.

I Right: logistic regression.

5 / 50

Page 107: High-Dimensional Statistical Learning: Introduction...Classical Statistics Biological Big Data Supervised and Unsupervised Learning Low-Dimensional Versus High-Dimensional I The data

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Ways to Extend Logistic Regression to High Dimensions

1. Variable Pre-Selection

2. Forward Stepwise Logistic Regression

3. Ridge Logistic Regression

4. Lasso Logistic Regression

How to decide which approach is best, and which tuning parameter value to use for each approach? Cross-validation or the validation set approach.

6 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

What is an appropriate validation measure?

For classification without a probability or score:

I Misclassification rate:

(# of test samples misclassified) / (total # of test samples)

7 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

What is an appropriate validation measure?

For probabilistic classification:

I Can still use the misclassification rate.
I As in continuous regression, could use the SSE (where pi is the predicted probability that yi = 1):

∑_{i ∈ test} (yi − pi)^2

I Often preferable to use the "predictive [log]likelihood":

− log [ ∏_{i ∈ test} pi^{yi} (1 − pi)^{1 − yi} ]

I Can also use an ROC-curve-based metric (e.g. AUC).

Remember, though: all of these must be computed on a separate validation set.

8 / 50
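Here is a rough sketch of these validation measures in R, assuming a lasso logistic fit from glmnet on simulated data; the train/validation split, variable names, and the rank-based AUC shortcut are illustrative choices, not the slides' code.

## A sketch of validation measures for a probabilistic classifier.
library(glmnet)

set.seed(1)
x  <- matrix(rnorm(400 * 20), ncol = 20)
y  <- as.numeric(x %*% c(rep(1, 5), rep(0, 15)) + rnorm(400) >= 0)
tr <- 1:200; te <- 201:400                 # training / validation split

cv.fit <- cv.glmnet(x[tr, ], y[tr], family = "binomial")
p.hat  <- predict(cv.fit, x[te, ], s = "lambda.min", type = "response")[, 1]
y.hat  <- as.numeric(p.hat > 0.5)

mean(y.hat != y[te])                                      # misclassification rate
-sum(y[te] * log(p.hat) + (1 - y[te]) * log(1 - p.hat))   # negative predictive log-likelihood
sum((y[te] - p.hat)^2)                                    # SSE on probabilities
mean(outer(p.hat[y[te] == 1], p.hat[y[te] == 0], ">"))    # AUC: P(case outranks control)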


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Example in R: Lasso Logistic Regression

library(glmnet)                    # for cv.glmnet()

set.seed(1)
xtr  <- matrix(rnorm(1000*20), ncol=20)
beta <- c(rep(1,5), rep(0,15))
ytr  <- as.numeric((xtr %*% beta + .5*rnorm(1000)) >= 0)   # binary (0/1) response

# lasso (alpha=1) logistic regression, with lambda chosen by cross-validation
cv.out <- cv.glmnet(xtr, ytr, family="binomial", alpha=1)
plot(cv.out)                       # CV deviance as a function of lambda

9 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

K -Nearest Neighbors

I Can I take a totally non-parametric (model-free) approach to classification?

I K-nearest neighbors:

1. Identify the K observations whose X values are closest to the observation at which we want to make a prediction.

2. Classify the observation of interest to the most frequent class label of those K nearest neighbors.

10 / 50
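A minimal sketch of KNN classification using knn() from the class package on simulated two-dimensional data; the data, the labels, and the choice K = 10 are all made up for illustration.

## KNN with the class package: knn(train features, test features, training labels, K).
library(class)

set.seed(1)
xtr <- matrix(rnorm(200 * 2), ncol = 2)
ytr <- factor(ifelse(xtr[, 1] + xtr[, 2] + rnorm(200) > 0, "blue", "red"))
xte <- matrix(rnorm(100 * 2), ncol = 2)
yte <- factor(ifelse(xte[, 1] + xte[, 2] + rnorm(100) > 0, "blue", "red"))

pred <- knn(train = xtr, test = xte, cl = ytr, k = 10)
mean(pred != yte)        # test misclassification rate for K = 10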


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

K -Nearest Neighbors

[Figure: simulated two-class training observations plotted against X1 and X2.]

11 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

K -Nearest Neighbors

[Figure: KNN illustration on a small subset of the training observations.]

12 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

K -Nearest Neighbors

[Figure: KNN decision boundary with K = 10, plotted against X1 and X2.]

13 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

K -Nearest Neighbors

[Figure: KNN decision boundaries with K = 1 (left) and K = 100 (right).]

14 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

K -Nearest Neighbors

[Figure: training and test error rates plotted against 1/K.]

15 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

K -Nearest Neighbors

I Simple, intuitive, model-free.

I Good option when p is very small.

I Curse of dimensionality: when p is large, no neighbors are "near". All observations are close to the boundary.

I Do not use in high dimensions!
  I unless you use it with a dimension reduction approach

16 / 50
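A small simulation (not from the slides) that illustrates the curse of dimensionality: as p grows, the nearest neighbor is barely closer than a typical observation, so "local" averaging stops being local.

## Nearest-neighbor distance relative to a typical pairwise distance, as p grows.
set.seed(1)
n <- 100
for (p in c(2, 10, 100, 1000)) {
  x <- matrix(rnorm(n * p), n, p)
  d <- as.matrix(dist(x)); diag(d) <- NA
  cat(sprintf("p = %4d: nearest / median distance = %.2f\n",
              p, mean(apply(d, 1, min, na.rm = TRUE)) / median(d, na.rm = TRUE)))
}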


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Bayes-Based Classifiers

Suppose rather than knowing P(y = j | x)...

we have information on fj(x) = P(x | y = j), the feature distribution within each class.

How do we use this to make predictions?

Using Bayes' Theorem:

P(y = j | x) = fj(x) πj / ∑k fk(x) πk

where πk = P(y = k) is the prior probability of class k.

17 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Estimating the Rule

To apply Bayes' Theorem

P(y = j | x) = fj(x) πj / ∑k fk(x) πk

we need

I fk(x) for k = 1, . . . , K
I πk for k = 1, . . . , K

18 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Estimating πk

πk is generally simple to estimate:

I If your data are a random sample, then we can use the sample proportion

π̂k = #{i : yi = k} / n

I Otherwise we can use outside information (e.g. historical data).

If the population proportions change, it is easy to adjust the rule.

19 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Estimating fk(x)

Estimating fk(x) = P(x | y = k) is more difficult.

This is a density estimation problem.

The tools we discuss for this break down into 3 general categories:

I flexible, non-parametric estimates

I parametric estimates

I shrunken parametric estimates

The above are ordered (more or less) by where they fall on the bias/variance spectrum:

more flexible → less bias / more variance

20 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Parametric fk(x) Estimate

The most well known estimator of this type is Linear/Quadratic Discriminant Analysis:

Here we assume that fk(x) is a Gaussian density, N(µk, Σk).

21 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Discriminant Analysis

There are three main types of unpenalized discriminant analysis:

I Quadratic (QDA)

I Linear (LDA)

I Diagonal (DDA)

These make different assumptions on the covariance structure:

I QDA makes no assumptions

I LDA assumes a pooled covariance matrix: Σk = Σ for all k

I DDA assumes a pooled covariance that is furthermore diagonal (i.e. no correlation among covariates!)

22 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Discriminant Analysis

Why would we choose DDA over QDA?

Remember, flexibility comes at a price!

QDA will have the least bias, but has many more parameters to estimate.

Often, good estimates of the correlations don't improve classification much.

DDA takes into account the scale of each feature, but trades a bit of bias for potentially a large reduction in variance.

23 / 50
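A short sketch comparing the three flavors on simulated data: lda() and qda() from MASS, and naiveBayes() from e1071 as one convenient way to obtain a DDA-style (independent-feature Gaussian) fit. The data, names, and hold-out split are invented for illustration.

## LDA vs. QDA vs. a diagonal (naive Bayes) classifier on simulated data.
library(MASS)
library(e1071)

set.seed(1)
n <- 300
y <- factor(sample(c("A", "B"), n, replace = TRUE))
x <- matrix(rnorm(n * 2), n, 2) + 1.5 * cbind(y == "A", y == "A")  # class A shifted
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
te  <- 201:300                         # hold out a validation set

fit.lda <- lda(y ~ x1 + x2, data = dat[-te, ])
fit.qda <- qda(y ~ x1 + x2, data = dat[-te, ])
fit.dda <- naiveBayes(y ~ x1 + x2, data = dat[-te, ])

mean(predict(fit.lda, dat[te, ])$class != dat$y[te])   # LDA test error
mean(predict(fit.qda, dat[te, ])$class != dat$y[te])   # QDA test error
mean(predict(fit.dda, dat[te, ])       != dat$y[te])   # DDA / naive Bayes test error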


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

LDA for p = 1

I To make this work, we need to estimate the parameters. The ML estimates are given by π̂k = nk/n and

µ̂k = (1/nk) ∑_{i: yi = k} xi ,    σ̂² = (1/(n − K)) ∑_{k=1}^{K} ∑_{i: yi = k} (xi − µ̂k)²

I The picture is very similar if K > 2... or if p > 1.

24 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

LDA for p > 1

25 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

LDA for p > 1

26 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

QDA vs LDA

The level curves for each class look identical with LDA;

QDA allows different classes to have differently shaped ellipsoids...

This results in non-linear (in fact quadratic) decision boundaries.

27 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

DDA

For DDA...

I level curves are spheres (not ellipsoids);

I decision boundaries are still linear;

I it is sometimes called naive Bayes (that doesn't mean it's bad though!);

I with πk = 1/K for all k, and equal variances (i.e. Σ = σI), this is just the nearest centroid classifier.

28 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

-DA vs logistic regression

The Discriminant Analysis model can actually be rewritten as a multinomial logistic model:

Beginning with

P(y = j | x) = fj(x) πj / ∑k fk(x) πk

and

fk(x) ∝ exp[ −(1/2) (x − µk)^T Σk^{-1} (x − µk) ],

substituting and simplifying we get

P(y = j | x) = e^{ηj} / ∑k e^{ηk}

29 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

-DA vs logistic regression

P(y = j | x) = e^{ηj} / ∑k e^{ηk}

where

ηk = β0k + x^T βk + x^T Σk^{-1} x

(with coefficients β0k and βk that depend on the class k).

This is just a multinomial logistic model with quadratic terms and interactions.

In particular, for LDA (where Σk = Σ is pooled) the quadratic terms cancel and we get

ηk = β0k + x^T βk

Simply a linear logistic model.

30 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Shrunken Parametric Estimates

Sometimes the optimal bias/variance tradeoff lies between two parametric classes.

For example: we may not have enough data to estimate completely different covariance matrices for each class (i.e. QDA), but we may not want to use identical covariance matrices either.

In this case we can take a weighted combination of our estimates. This is called regularized discriminant analysis.

This is a type of shrunken parametric estimator.

31 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Regularized Discriminant Analysis

For shrinking between QDA and LDA we use:

Σ̂k^{RDA} = λ Σ̂^{LDA} + (1 − λ) Σ̂k^{QDA}

For shrinking between LDA and naive Bayes we use:

Σ̂^{RDA} = λ Σ̂^{LDA} + (1 − λ) Σ̂^{NB}

λ is a tuning parameter, and is generally selected via CV.

32 / 50
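A bare-bones sketch of the QDA-to-LDA shrinkage above, written directly in base R rather than with a packaged RDA implementation; the function name and the use of the built-in iris data are just for illustration, and λ would in practice be chosen by cross-validation.

## Shrink each class covariance toward the pooled (LDA) covariance.
shrunken.covariances <- function(x, y, lambda) {
  classes <- levels(factor(y))
  sigma.k <- lapply(classes, function(k) cov(x[y == k, , drop = FALSE]))
  names(sigma.k) <- classes
  n.k <- table(y)[classes]
  # pooled (LDA) covariance estimate
  sigma.pooled <- Reduce(`+`, Map(function(S, nk) (nk - 1) * S, sigma.k, n.k)) /
    (nrow(x) - length(classes))
  # Sigma_k^RDA = lambda * Sigma^LDA + (1 - lambda) * Sigma_k^QDA
  lapply(sigma.k, function(S) lambda * sigma.pooled + (1 - lambda) * S)
}

S.rda <- shrunken.covariances(as.matrix(iris[, 1:4]), iris$Species, lambda = 0.7)
str(S.rda, max.level = 1)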

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

DA in High Dimensions

All of the Discriminant Analysis techniques discussed so far use all the features.

For high-dimensional problems this will lead to over-fitting.

One popular solution is to shrink each class-mean estimate µk towards the overall mean µ using element-wise soft-thresholding.

This method is called Nearest Shrunken Centroids (though it should probably more appropriately be called "nearest shrunken DDA").

33 / 50
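A simplified sketch of the shrinkage step: soft-threshold each class mean toward the overall mean. The actual nearest-shrunken-centroids method (implemented in the pamr package) also standardizes each difference by a within-class standard error before thresholding, so treat this as a cartoon of the idea rather than the full method.

## Soft-threshold class centroids toward the overall mean.
soft.threshold <- function(d, delta) sign(d) * pmax(abs(d) - delta, 0)

shrunken.centroids <- function(x, y, delta) {
  overall <- colMeans(x)
  t(sapply(levels(factor(y)), function(k) {
    mu.k <- colMeans(x[y == k, , drop = FALSE])
    overall + soft.threshold(mu.k - overall, delta)   # shrink toward overall mean
  }))
}

## With a large delta, many features have identical centroids in every class
## and therefore no influence on classification.
shrunken.centroids(as.matrix(iris[, 1:4]), iris$Species, delta = 0.5)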


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Regularized DA Methods

I Recall that in PAM, the µjk's are soft-thresholded towards the common mean µj.

I Alternatively, the µjk's can be penalized
  I towards zero, using a lasso penalty

  ∑_j ∑_k |µjk|,

  I or towards each other, using a fused lasso penalty

  ∑_j ∑_{k,k′} |µjk − µjk′|.

  Both of these are implemented in the R package penalizedLDA.

I Another option, which is especially helpful when using QDA, is to penalize the covariance matrices Σk (or their inverses).

34 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Let’s Try It Out in R!

Chapter 4 R Lab: www.statlearning.com

35 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Support Vector Machines

I Developed around 1995.

I Touted as "overcoming the curse of dimensionality."

I Does not (automatically) overcome the curse of dimensionality!

I Fundamentally and numerically very similar to logistic regression.

I But, it is a nice idea.

36 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Separating Hyperplane

37 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Classification Via a Separating Hyperplane

Blue class if β0 + β1X1 + β2X2 > c; red class otherwise.

38 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Maximal Separating Hyperplane

Note that only a few observations are on the margin: these are the support vectors.

39 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

What if There is No Separating Hyperplane?

40 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Support Vector Classifier: Allow for Violations

41 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Support Vector Machine

I The support vector machine is just like the support vector classifier, but it elegantly allows for non-linear expansions of the variables: "non-linear kernels".

I However, linear regression, logistic regression, and other classical statistical approaches can also be applied to non-linear functions of the variables.

I For historical reasons, SVMs are more frequently used with non-linear expansions as compared to other statistical approaches.

42 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Non-Linear Class Structure

This will be hard for a linear classifier!

43 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Try a Support Vector Classifier

Uh-oh!!

44 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Support Vector Machine

Much Better.

45 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Is A Non-Linear Kernel Better?

I Yes, if the true decision boundary between the classes is non-linear, and you have enough observations (relative to the number of features) to accurately estimate the decision boundary.

I No, if you are in a very high-dimensional setting such that estimating a non-linear decision boundary is hopeless.

46 / 50
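A hedged sketch with svm() from the e1071 package, comparing a support vector classifier (linear kernel) with an SVM using a radial kernel on simulated data whose true boundary is non-linear, in the spirit of the figures above; the data, cost, and gamma values are made up, and in practice they would be tuned by cross-validation.

## Linear vs. radial kernel on a circular class boundary.
library(e1071)

set.seed(1)
x <- matrix(rnorm(400 * 2), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "outer", "inner"))  # non-linear classes
tr <- 1:200; te <- 201:400

fit.linear <- svm(x[tr, ], y[tr], kernel = "linear", cost = 1)
fit.radial <- svm(x[tr, ], y[tr], kernel = "radial", cost = 1, gamma = 1)

mean(predict(fit.linear, x[te, ]) != y[te])   # linear kernel: poor on this boundary
mean(predict(fit.radial, x[te, ]) != y[te])   # radial kernel: much better here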


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

SVM vs Other Classification Methods

I The main difference between SVM and other classification methods (e.g. logistic regression) is the loss function used to assess the "fit":

∑_{i=1}^{n} L(f(xi), yi)

I Zero-one loss: I(f(xi) ≠ yi), where I(·) is the indicator function. Not continuous, so hard to work with!!

I Hinge loss: max(0, 1 − yi f(xi))

I Logistic loss: log(1 + exp(−yi f(xi)))

47 / 50
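To make the comparison concrete, here is a quick base-R sketch plotting the three losses as functions of the margin y·f(x), with y coded as ±1; this is only an illustration of the formulas above.

## Zero-one, hinge, and logistic loss as functions of the margin m = y * f(x).
m <- seq(-3, 3, length.out = 200)

zero.one <- as.numeric(m < 0)          # misclassified when the margin is negative
hinge    <- pmax(0, 1 - m)             # SVM loss
logistic <- log(1 + exp(-m))           # logistic-regression loss

matplot(m, cbind(zero.one, hinge, logistic), type = "l", lty = 1,
        xlab = "margin  y * f(x)", ylab = "loss")
legend("topright", c("zero-one", "hinge", "logistic"), col = 1:3, lty = 1)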


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

SVM vs Logistic Regression

I Bottom Line: the support vector classifier and logistic regression aren't that different!

I Neither they nor any other approach can overcome the "curse of dimensionality".

I The "kernel trick" makes things computationally easier, but it does not remove the danger of overfitting.

I SVM uses a non-linear kernel... but you could do that with logistic or linear regression too!

I A disadvantage of SVM (compared to, e.g., logistic regression) is that it does not provide a measure of uncertainty: cases are "classified" to belong to one of the two classes.

48 / 50


Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

Let’s Try It Out in R!

Chapter 9 R Lab: www.statlearning.com

49 / 50

Logistic RegressionK -Nearest Neighbors

Bayes-Based ClassifiersSVM

In High Dimensions...

I In SVMs, a tuning parameter controls the amount of flexibility of the classifier.

I This tuning parameter is like a ridge penalty, both mathematically and conceptually. The SVM decision rule involves all of the variables (the SVM problem can be written as a ridge problem, but with the hinge loss).

I Can get a sparse SVM using a lasso penalty; this yields a decision rule involving only a subset of the features.

I Logistic regression and other classical statistical approaches could be used with non-linear expansions of features. But this makes high-dimensionality issues worse.

50 / 50
