TRANSCRIPT
Classical Statistics | Biological Big Data | Supervised and Unsupervised Learning
High-Dimensional Statistical Learning: Introduction
Ali Shojaie, University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
A Simple Example
I Suppose we have n = 400 people with diabetes for whom we have p = 3 serum-level measurements (LDL, HDL, GLU).
I Want to predict these peoples’ disease progression after 1 year.
A Simple Example
Notation:
I n is the number of observations.
I p is the number of variables/features/predictors.
I y is an n-vector containing the response/outcome for each of the n observations.
I X is an n × p data matrix.
Linear Regression on a Simple Example
I You can perform linear regression to develop a model to predict progression using LDL, HDL, and GLU:
y = β0 + β1X1 + β2X2 + β3X3 + ε
where y is our continuous measure of disease progression, X1, X2, X3 are our serum-level measurements, and ε is a noise term.
I You can look at the coefficients, p-values, and t-statistics for your linear regression model in order to interpret your results.
I You learned everything (or most of what) you need to analyze this data set in Classical Statistics!
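A fit like the one above can be sketched in R with lm(). The data below are simulated stand-ins (the sample size matches the slide, but the values and coefficients are invented), since the original diabetes data set is not included here; with the real data one would call lm() the same way.

```r
# Simulate stand-in data for the n = 400 diabetes example (hypothetical values).
set.seed(1)
n <- 400
dat <- data.frame(LDL = rnorm(n), HDL = rnorm(n), GLU = rnorm(n))
dat$progression <- 150 + 75 * dat$LDL - 480 * dat$HDL + 475 * dat$GLU +
  rnorm(n, sd = 50)

# Fit the linear model and inspect coefficients, t-statistics, and p-values.
fit <- lm(progression ~ LDL + HDL + GLU, data = dat)
summary(fit)
```

The coefficient table printed by summary(fit) is what the "Linear Model Output" slide later in this lecture shows for the real data.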
A Relationship Between the Variables?
[Figure: scatterplot matrix of the four variables – progression, LDL, HDL, and GLU – for the n = 400 observations.]
Linear Model Output
            Estimate   Std. Error   T-Stat   P-Value
Intercept    152.928        3.385   45.178   < 2e-16  ***
LDL           77.057       75.701    1.018   0.309
HDL         -487.574       75.605   -6.449   3.28e-10 ***
GLU          477.604       76.643    6.232   1.18e-09 ***
progression measure ≈ 152.9 + 77.1 × LDL − 487.6 × HDL + 477.6 × GLU.
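Plugging a new patient's serum values into the fitted equation gives a predicted progression. The LDL/HDL/GLU values below are hypothetical, chosen only to illustrate the arithmetic:

```r
# Coefficients from the fitted model above.
coefs <- c(intercept = 152.928, LDL = 77.057, HDL = -487.574, GLU = 477.604)

# Hypothetical serum measurements for a new patient (invented for illustration).
new_patient <- c(LDL = 0.05, HDL = -0.02, GLU = 0.03)

# Predicted progression: intercept plus the weighted sum of the predictors.
pred <- coefs["intercept"] + sum(coefs[names(new_patient)] * new_patient)
unname(pred)  # about 180.9
```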
Low-Dimensional Versus High-Dimensional
I The data set that we just saw is low-dimensional: n ≫ p.
I Lots of the data sets coming out of modern biological techniques are high-dimensional: n ≈ p or n ≪ p.
I This poses statistical challenges! Classical Statistics no longer applies.
Low Dimensional
High Dimensional
What Goes Wrong in High Dimensions?
I Suppose that we included many additional predictors in our model, such as
  I Age
  I Zodiac symbol
  I Favorite color
  I Mother's birthday, in base 2
I Some of these predictors are useful, others aren't.
I If we include too many predictors, we will overfit the data.
I Overfitting: Model looks great on the data used to develop it, but will perform very poorly on future observations.
I When p ≈ n or p > n, overfitting is guaranteed unless we are very careful.
Why Does Dimensionality Matter?
I Classical statistical techniques, such as linear regression, cannot be applied.
I Even very simple tasks, like identifying variables that are associated with a response, must be done with care.
I High risks of overfitting, false positives, and more.
This course: Statistical machine learning tools for big — mostly high-dimensional — data.
Supervised and Unsupervised Learning
I Statistical machine learning can be divided into two main areas: supervised and unsupervised.
I Supervised Learning: Use a data set X to predict or detect association with a response y.
  I Regression
  I Classification
  I Hypothesis Testing
I Unsupervised Learning: Discover the signal in X, or detect associations within X.
  I Dimension Reduction
  I Clustering
Supervised Learning
Unsupervised Learning
“Course Textbook” . . . with applications in R
I Available for (free!) download from www.statlearning.com.
I An accessible introduction to statistical machine learning, with an R lab at the end of each chapter.
I We will go through some of these R labs.
I To learn more, go through them on your own!
High-Dimensional Statistical Learning: Bias-Variance Tradeoff and the Test Error
Ali Shojaie, University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
Supervised Learning
Regression Versus Classification
I Regression: Predict a quantitative response, such as
  I blood pressure
  I cholesterol level
  I tumor size
I Classification: Predict a categorical response, such as
  I tumor versus normal tissue
  I heart disease versus no heart disease
  I subtype of glioblastoma
I This lecture: Regression.
Linear Models
I We have n observations, for each of which we have p predictor measurements and a response measurement.
I Want to develop a model of the form
yi = β0 + β1Xi1 + . . .+ βpXip + εi .
I Here εi is a noise term associated with the ith observation.
I Must estimate β0, β1, . . . , βp – i.e. we must fit the model.
Linear Model With p = 2 Predictors
What Makes a Model Linear?
I A linear model is linear in the regression coefficients!
I This is a linear model:
yi = β1 sin(Xi1) + β2Xi2Xi3 + εi .
I This is not a linear model:
yi = β1^{Xi1} + sin(β2Xi2) + εi .
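As a quick check of the "linear in the coefficients" point, lm() can fit the first model directly by treating sin(X1) and X2·X3 as derived predictors. The data below are simulated purely for illustration:

```r
# Simulate data from the linear-in-coefficients model on the slide:
# y = 2*sin(X1) + 3*X2*X3 + noise (true coefficients 2 and 3 are invented).
set.seed(1)
n <- 200
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
y <- 2 * sin(X1) + 3 * X2 * X3 + rnorm(n)

# lm() handles this because the model is linear in beta1 and beta2;
# the predictors themselves may be arbitrary transformations.
fit <- lm(y ~ sin(X1) + I(X2 * X3) - 1)  # no intercept, matching the slide
coef(fit)  # estimates should be near the true values 2 and 3
```

The second (non-linear) model on the slide has no such trick: β1 and β2 enter non-linearly, so least squares in this form does not apply.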
Linear Models in Matrix Form
I For simplicity, ignore the intercept β0.
I Assume ∑_{i=1}^n yi = ∑_{i=1}^n Xij = 0; in this case, β0 = 0.
I Alternatively, let the first column of X be a column of 1's.
I In matrix form, we can write the linear model as
y = Xβ + ε,
i.e.
[ y1 ]   [ X11 X12 ... X1p ] [ β1 ]   [ ε1 ]
[ y2 ] = [ X21 X22 ... X2p ] [ β2 ] + [ ε2 ]
[ .. ]   [ ...         ... ] [ .. ]   [ .. ]
[ yn ]   [ Xn1 Xn2 ... Xnp ] [ βp ]   [ εn ]
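The matrix form can be written out directly in R. The dimensions and coefficient values below are toy choices for illustration:

```r
# Construct y = X beta + epsilon with toy dimensions.
set.seed(1)
n <- 5; p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # n x p design matrix
beta <- c(1, -2, 0.5)                          # p-vector of coefficients
eps <- rnorm(n)                                # n-vector of noise
y <- X %*% beta + eps                          # n x 1 response vector
dim(y)
```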
Least Squares Regression
I There are many ways we could fit the model
y = Xβ + ε.
I Most common approach in classical statistics is least squares:
minimize_β { ‖y − Xβ‖² }.
Here ‖a‖² ≡ ∑_{i=1}^n a_i².
I We are looking for β̂1, . . . , β̂p such that
∑_{i=1}^n (yi − (β̂1Xi1 + . . . + β̂pXip))²
is as small as possible, or in other words, such that
∑_{i=1}^n (yi − ŷi)²
is as small as possible, where ŷi is the ith predicted value.
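When X has full column rank, the least squares problem above has the closed-form solution β̂ = (XᵀX)⁻¹Xᵀy. This sketch, on simulated data, checks the normal-equations solution against lm():

```r
# Simulated data (toy coefficients chosen for illustration).
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, 2, -1) + rnorm(n)

# Solve the normal equations (X'X) betahat = X'y directly...
betahat <- solve(t(X) %*% X, t(X) %*% y)

# ...and compare with lm() without an intercept: the fits agree.
fit <- lm(y ~ X - 1)
max(abs(betahat - coef(fit)))  # essentially zero
```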
Let’s Try Out Least Squares in R!
Chapter 3 R lab, www.statlearning.com
Least Squares Regression
I When we fit a model, we use a training set of observations.
I We get coefficient estimates β̂1, . . . , β̂p.
I We also get predictions using our model, of the form
ŷi = β̂1Xi1 + . . . + β̂pXip.
I We can evaluate the training error, i.e. the extent to which the model fits the observations used to train it.
I One way to quantify the training error is using the mean squared error (MSE):
MSE = (1/n) ∑_{i=1}^n (yi − ŷi)² = (1/n) ∑_{i=1}^n (yi − (β̂1Xi1 + . . . + β̂pXip))².
I The training error is closely related to the R² for a linear model – that is, the proportion of variance explained.
I Big R² ⇔ Small Training Error.
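The training MSE and its relationship to R² can be checked numerically in R. The data below are simulated for illustration:

```r
# Simulate a simple regression and fit it.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Training MSE: average squared residual on the training observations.
mse <- mean((y - fitted(fit))^2)

# R^2 computed from the same residuals: 1 - RSS/TSS.
rsq <- 1 - sum((y - fitted(fit))^2) / sum((y - mean(y))^2)
c(mse = mse, rsq = rsq)  # rsq matches summary(fit)$r.squared
```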
Least Squares as More Variables are Included in the Model
I Training error and R² are not good ways to evaluate a model's performance, because they will always improve as more variables are added into the model.
I The problem? Training error and R² evaluate the model's performance on the training observations.
I If I had an unlimited number of features to use in developing a model, then I could surely come up with a regression model that fits the training data perfectly! Unfortunately, this model wouldn't capture the true signal in the data.
I We really care about the model's performance on test observations – observations not used to fit the model.
The Problem
As we add more variables into the model...
... the training error decreases and the R2 increases!
Why is this a Problem?
I We really care about the model's performance on observations not used to fit the model!
I We want a model that will predict the survival time of a new patient who walks into the clinic!
I We want a model that can be used to diagnose cancer for a patient not used in model training!
I We want to predict risk of diabetes for a patient who wasn't used to fit the model!
I What we really care about:
(ŷtest − ytest)²,
where
ŷtest = β̂1Xtest,1 + . . . + β̂pXtest,p,
and (Xtest, ytest) was not used to train the model.
I The test error is the average of (ŷtest − ytest)² over a bunch of test observations.
Training Error versus Test Error
As we add more variables into the model...
... the training error decreases and the R2 increases!
But the test error might not!
Why the Number of Variables Matters
I Linear regression will have a very low training error if p is large relative to n.
I A simple example:
  I When n ≤ p, you can always get a perfect model fit to the training data!
  I But the test error will be awful.
Model Complexity, Training Error, and Test Error
I In this course, we will consider various types of models.
I We will be very concerned with model complexity: e.g. the number of variables used to fit a model.
I As we fit more complex models – e.g. models with more variables – the training error will always decrease.
I But the test error might not.
I As we will see, the number of variables in the model is not the only – or even the best – way to quantify model complexity.
An Example In R
# Simulate training data (n = 100) and a large test set; only the first
# 10 of the 100 predictors carry signal.
xtr <- matrix(rnorm(100*100), ncol=100)
xte <- matrix(rnorm(100000*100), ncol=100)
beta <- c(rep(1,10), rep(0,90))
ytr <- xtr %*% beta + rnorm(100)
yte <- xte %*% beta + rnorm(100000)
rsq <- trainerr <- testerr <- NULL
# Fit least squares on the first i predictors, for i = 2, ..., 100.
for(i in 2:100){
  mod <- lm(ytr ~ xtr[,1:i])
  rsq <- c(rsq, summary(mod)$r.squared)
  betahat <- mod$coef[-1]    # estimated coefficients (named so the true
  intercept <- mod$coef[1]   # beta above is not overwritten)
  trainerr <- c(trainerr, mean((xtr[,1:i] %*% betahat + intercept - ytr)^2))
  testerr <- c(testerr, mean((xte[,1:i] %*% betahat + intercept - yte)^2))
}
# Plot R^2, training error, and test error against model size;
# the red vertical line marks the 10 signal variables.
par(mfrow=c(1,3))
plot(2:100, rsq, xlab='Number of Variables', ylab="R Squared", log="y")
abline(v=10, col="red")
plot(2:100, trainerr, xlab='Number of Variables', ylab="Training Error", log="y")
abline(v=10, col="red")
plot(2:100, testerr, xlab='Number of Variables', ylab="Test Error", log="y")
abline(v=10, col="red")
A Simulated Example
[Figure: three panels plotting R Squared, Training Error, and Test Error against Number of Variables, with a red vertical line at 10 variables.]
I The first 10 variables are related to the response; the remaining 90 are not.
I R² increases and training error decreases as more variables are added to the model.
I Test error is lowest when only the signal variables are in the model.
19 / 48
Bias and Variance
I As model complexity increases, the bias of β̂ – the average difference between β̂ and β, if we were to repeat the experiment a huge number of times – will decrease.
I But as complexity increases, the variance of β̂ – the amount by which the β̂’s will differ across experiments – will increase.
I The test error depends on both the bias and variance:
Test Error = Bias² + Variance.
I There is a bias-variance trade-off. We want a model that is sufficiently complex as to have not too much bias, but not so complex that it has too much variance.
20 / 48
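The variance side of this trade-off shows up clearly in a small simulation (a hedged sketch, not from the slides; names and sizes are illustrative). Both models below estimate the same coefficient without bias, but the more complex model's estimates vary far more across repeated experiments:

```r
# Repeat the experiment many times and track the estimate of the first
# coefficient from a 1-variable model versus a 25-variable model.
set.seed(1)
n <- 30
reps <- 500
est.small <- est.big <- numeric(reps)
for (r in 1:reps) {
  x <- matrix(rnorm(n * 25), ncol = 25)
  y <- 2 * x[, 1] + rnorm(n)                 # only x1 carries signal
  est.small[r] <- coef(lm(y ~ x[, 1]))[2]    # simple model
  est.big[r]   <- coef(lm(y ~ x))[2]         # complex model
}
c(var(est.small), var(est.big))              # variance grows with complexity
```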
A Really Fundamental Picture
21 / 48
Overfitting
I Fitting an overly complex model – a model that has too much variance – is known as overfitting.
I In the high-dimensional setting, when p ≫ n, we must work hard not to overfit the data.
I In particular, we must rely not on training error, but on test error, as a measure of model performance.
I How can we estimate the test error?
22 / 48
Training Set Versus Test Set
I Split samples into training set and test set.
I Fit model on training set, and evaluate on test set.
Q: Can there ever, under any circumstance, be sample overlap between the training and test sets?
A: Never!
23 / 48
Training Set Versus Test Set
I Split samples into training set and test set.
I Fit model on training set, and evaluate on test set.
You can’t peek at the test set until you are completely done with all aspects of model-fitting on the training set!
24 / 48
Training Set And Test Set
To get an estimate of the test error of a particular model on a future observation:
1. Split the samples into a training set and a test set.
2. Fit the model on the training set.
3. Evaluate its performance on the test set.
4. The test set error rate is an estimate of the model’s performance on a future observation.
But remember: no peeking!
25 / 48
Choosing Between Several Models
I In general, we will consider a lot of possible models – e.g. models with different levels of complexity. We must decide which model is best.
I We have split our samples into a training set and a test set. But remember: we can’t peek at the test set until we have completely finalized our choice of model!
I We must pick a best model based on the training set, but we want a model that will have low test error!
I How can we estimate test error using only the training set?
1. The validation set approach.
2. Leave-one-out cross-validation.
3. K-fold cross-validation.
I In what follows, assume we have split the data into a training set and a test set, and the training set contains n observations.
26 / 48
Validation Set Approach
Split the n observations into two sets of approximately equal size. Train on one set, and evaluate performance on the other.
27 / 48
Validation Set Approach
For a given model, we perform the following procedure:
1. Split the observations into two sets of approximately equal size, a training set and a validation set.
a. Fit the model using the training observations. Let β̂(train) denote the regression coefficient estimates.
b. For each observation in the validation set, compute the test error, eᵢ = (yᵢ − xᵢᵀβ̂(train))².
2. Calculate the total validation set error by summing the eᵢ’s over all of the validation set observations.
Out of a set of candidate models, the “best” one is the one for which the total error is smallest.
28 / 48
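The procedure above can be sketched in a few lines of R (simulated data; all variable names are illustrative):

```r
# Validation set approach: fit on half the observations, score on the rest.
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- drop(x %*% c(rep(1, 5), rep(0, 5)) + rnorm(100))
dat <- data.frame(x = x, y = y)
train <- sample(100, 50)                       # indices of the training half
mod <- lm(y ~ ., data = dat[train, ])          # fit on the training half
pred <- predict(mod, newdata = dat[-train, ])  # predict the validation half
val.err <- sum((dat$y[-train] - pred)^2)       # total validation set error
print(val.err)
```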
Leave-One-Out Cross-Validation
Fit n models, each on n − 1 of the observations. Evaluate each model on the left-out observation.
29 / 48
Leave-One-Out Cross-Validation
For a given model, we perform the following procedure:
1. For i = 1, . . . , n:
a. Fit the model using observations 1, . . . , i − 1, i + 1, . . . , n. Let β̂(i) denote the regression coefficient estimates.
b. Compute the test error, eᵢ = (yᵢ − xᵢᵀβ̂(i))².
2. Calculate ∑ᵢ₌₁ⁿ eᵢ, the total CV error.
Out of a set of candidate models, the “best” one is the one for which the total error is smallest.
30 / 48
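A direct (if slow) implementation of leave-one-out CV is just a loop over the n observations (a sketch with simulated data; in practice boot::cv.glm or the hat-matrix shortcut avoids refitting n times):

```r
# Leave-one-out CV: fit n models, each leaving out one observation,
# and score each model on its held-out observation.
set.seed(1)
n <- 50
x <- matrix(rnorm(n * 5), ncol = 5)
y <- drop(x %*% rep(1, 5) + rnorm(n))
dat <- data.frame(x = x, y = y)
e <- numeric(n)
for (i in 1:n) {
  mod <- lm(y ~ ., data = dat[-i, ])                      # fit without observation i
  e[i] <- (dat$y[i] - predict(mod, newdata = dat[i, ]))^2 # error on observation i
}
cv.err <- sum(e)                                          # total CV error
print(cv.err)
```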
5-Fold Cross-Validation
Split the observations into 5 sets. Repeatedly train the model on 4 sets and evaluate its performance on the 5th.
31 / 48
K-fold cross-validation
A generalization of leave-one-out cross-validation. For a given model, we perform the following procedure:
1. Split the n observations into K equally-sized folds.
2. For k = 1, . . . ,K :
a. Fit the model using the observations not in the kth fold.
b. Let eₖ denote the test error for the observations in the kth fold.
3. Calculate ∑ₖ₌₁ᴷ eₖ, the total CV error.
Out of a set of candidate models, the “best” one is the one for which the total error is smallest.
32 / 48
An Example In R
library(boot)   # provides cv.glm
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
cv.err <- NULL
for(i in 2:50){
dat <- data.frame(x=xtr[,1:i],y=ytr)
mod <- glm(y~.,data=dat)
cv.err <- c(cv.err, cv.glm(dat,mod,K=6)$delta[1])
}
plot(2:50, cv.err, xlab="Number of Variables",
ylab="6-Fold CV Error", log="y")
abline(v=10, col="red")
33 / 48
Output of R Code
[Figure: 6-Fold CV Error (log scale) versus Number of Variables (10–50), with a red vertical line at 10 variables; the CV error is smallest near 10 variables.]
I Six-fold CV identifies the model with just over ten predictors.
I First ten predictors contain signal, and the rest are noise.
34 / 48
After Estimating the Test Error on the Training Set...
After we estimate the test error using the training set, we refit the “best” model on all of the available training observations. We then evaluate this model on the test set.
35 / 48
Big Picture
36 / 48
Let’s Try Out Cross-Validation in R!
Chapter 5 R lab
First Half: Cross-Validation
www.statlearning.com
41 / 48
Summary: Four-Step Procedure
1. Split observations into training set and test set.
2. Fit a bunch of models on the training set, and estimate the test error, using cross-validation or the validation set approach.
3. Refit the best model on the full training set.
4. Evaluate the model’s performance on the test set.
42 / 48
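Putting the four steps together, here is one hedged end-to-end sketch (simulated data; the candidate "models" differ only in how many of the variables they use, and all names are illustrative):

```r
library(boot)                              # for cv.glm
set.seed(1)
x <- matrix(rnorm(150 * 20), ncol = 20)
y <- drop(x %*% c(rep(1, 5), rep(0, 15)) + rnorm(150))
dat <- data.frame(x = x, y = y)
test <- sample(150, 50)                    # 1. hold out a test set
cv.err <- sapply(1:20, function(q) {       # 2. estimate test error by 5-fold CV
  dq <- dat[-test, c(1:q, 21)]             #    training set, first q variables + y
  cv.glm(dq, glm(y ~ ., data = dq), K = 5)$delta[1]
})
best.q <- which.min(cv.err)                #    pick the best model
mod <- glm(y ~ ., data = dat[-test, c(1:best.q, 21)])  # 3. refit on full training set
pred <- predict(mod, newdata = dat[test, ])
test.mse <- mean((dat$y[test] - pred)^2)   # 4. evaluate on the test set
print(c(best.q, test.mse))
```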
Why All the Bother?
Q: Why do I need to have a separate test set? Why can’t I just estimate test error using cross-validation or a validation set approach using all the observations, and then be done with it?
A: In general, we are choosing between a whole bunch of models, and we will use the cross-validation or validation set approach to pick between these models. If we use the resulting estimate as a final estimate of test error, then this could be an extreme underestimate, because one model might give a lower estimated test error than others by chance. To avoid having an extreme underestimate of test error, we need to evaluate the “best” model obtained on an independent test set. This is particularly important in high dimensions!!
43 / 48
Regression in High Dimensions
I We usually cannot perform least squares regression to fit a model in the high-dimensional setting, because we will get zero training error but a terrible test error.
I Instead, we must fit a less complex model, e.g. a model with fewer variables.
I We will consider five ways to fit less complex models:
1. Variable Pre-Selection
2. Forward Stepwise Regression
3. Ridge Regression
4. Lasso Regression
5. Principal Components Regression
I These are alternatives to fitting a linear model using least squares.
I Each of these approaches will allow us to choose the level of complexity – e.g. the number of variables in the model.
44 / 48
Regression in High Dimensions
I We usually cannot perform least squares regression to fit a model in the high-dimensional setting, because we will get zero training error but a terrible test error.
I Instead, we must fit a less complex model, e.g. a model with fewer variables.
If you
I fit your model carelessly;
I do not properly estimate the test error;
I or select a model based on training set rather than test set performance;
then you will woefully overfit your training data, leading to a model that looks good on training data but will perform atrociously on future observations.
Our intuition breaks down in high dimensions, and so rigorous model-fitting is crucial.
45 / 48
The Curse of Dimensionality
Q: A data set with more variables is better than a data set withfewer variables, right?
A: Not necessarily!
Noise variables – such as genes whose expression levels are not truly associated with the response being studied – will simply increase the risk of overfitting, and the difficulty of developing an effective model that will perform well on future observations.
On the other hand, more signal variables – variables that are truly associated with the response being studied – are always useful!
46 / 48
Wise Words
In high-dimensional data analysis, common mistakes are simple, and simple mistakes are common.
– Keith Baggerly
47 / 48
Before You’re Done Your Analysis
I Estimate the test error.
I Do a “sanity check” whenever possible.
I “Spot-check” the variables that have the largest coefficients in the model.
I Rewrite your code from scratch. Do you get the same answer again?
Fitting models in high dimensions: one mistake away from disaster!
48 / 48
Variable Pre-Selection
Subset Selection
Ridge Regression
Lasso Regression
High-Dimensional Statistical Learning: Regression
Ali Shojaie
University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
1 / 45
Linear Models in High Dimensions
I When p is large, least squares regression will lead to very low training error but terrible test error.
I We will now see some approaches for fitting linear models in high dimensions, p ≫ n.
I These approaches also work well when p ≈ n or n > p.
2 / 45
Motivating example
I We would like to build a model to predict survival time for breast cancer patients using a number of clinical measurements (tumor stage, tumor grade, tumor size, patient age, etc.) as well as some biomarkers.
I For instance, these biomarkers could be:
I the expression levels of genes measured using a microarray.
I protein levels.
I mutations in genes potentially implicated in breast cancer.
I How can we develop a model with low test error in this setting?
3 / 45
Remember
I We have n training observations.
I Our goal is to get a model that will perform well on future test observations.
I We’ll incur some bias in order to reduce variance.
4 / 45
Variable Pre-Selection
The simplest approach for fitting a model in high dimensions:
1. Choose a small set of variables, say the q variables that are most correlated with the response, where q < n and q < p.
2. Use least squares to fit a model predicting y using only these q variables.
This approach is simple and straightforward.
5 / 45
Variable Pre-Selection in R
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
cors <- cor(xtr,ytr)
whichers <- which(abs(cors)>.2)
mod <- lm(ytr~xtr[,whichers])
print(summary(mod))
6 / 45
How Many Variables to Use?
I We need a way to choose q, the number of variables used in the regression model.
I We want the q that minimizes the test error.
I For a range of values of q, we can perform the validation set approach, leave-one-out cross-validation, or K-fold cross-validation in order to estimate the test error.
I Then choose the value of q for which the estimated test error is smallest.
7 / 45
Estimating the Test Error For a Given q
This is the right way to estimate the test error using the validation set approach:
1. Split the observations into a training set and a validation set.
2. Using the training set only:
a. Identify the q variables most associated with the response.
b. Use least squares to fit a model predicting y using those q variables.
c. Let β̂1, . . . , β̂q denote the resulting coefficient estimates.
3. Use β̂1, . . . , β̂q obtained on the training set to predict the response on the validation set, and compute the validation set MSE.
8 / 45
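Here is a hedged sketch of the right way in R (simulated data; q = 5 is an arbitrary illustrative choice). The key point is that the screening step uses the training half only:

```r
# Right way: pre-select variables using ONLY the training observations.
set.seed(1)
x <- matrix(rnorm(100 * 50), ncol = 50)
y <- drop(x %*% c(rep(1, 5), rep(0, 45)) + rnorm(100))
train <- sample(100, 50)
q <- 5
cors <- abs(cor(x[train, ], y[train]))          # screen on the training half only
keep <- order(cors, decreasing = TRUE)[1:q]     # the q most correlated variables
dat <- data.frame(x = x[, keep], y = y)
mod <- lm(y ~ ., data = dat[train, ])           # least squares on those q variables
pred <- predict(mod, newdata = dat[-train, ])
val.mse <- mean((dat$y[-train] - pred)^2)       # validation set MSE
print(val.mse)
```

Moving the two screening lines above the split, so that they see all 100 observations, would reproduce the wrong way described next.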
Estimating the Test Error For a Given q
This is the wrong way to estimate the test error using the validation set approach:
1. Identify the q variables most associated with the response on the full data set.
2. Split the observations into a training set and a validation set.
3. Using the training set only:
a. Use least squares to fit a model predicting y using those q variables.
b. Let β̂1, . . . , β̂q denote the resulting coefficient estimates.
4. Use β̂1, . . . , β̂q obtained on the training set to predict the response on the validation set, and compute the validation set MSE.
9 / 45
Frequently Asked Questions
I Q: Does it really matter how you estimate the test error?
A: Yes.
I Q: Would anyone make such a silly mistake?
A: Yes.
10 / 45
A Better Approach
I The variable pre-selection approach is simple and easy to implement — all you need is a way to calculate correlations, and software to fit a linear model using least squares.
I But it might not work well: just because a bunch of variables are correlated with the response doesn’t mean that, when used together in a linear model, they will predict the response well.
I What we really want to do: pick the q variables that best predict the response.
Subset Selection
Several Approaches:
- Best Subset Selection: consider all subsets of predictors. Computational mess!
- Stepwise Regression: greedily add/remove predictors. Heuristic and potentially inefficient.
- Modern Penalized Methods
Best Subset Selection
- We would like to consider all possible models using a subset of the p predictors.
- In other words, we'd like to consider all 2^p possible models.
- This is called best subset selection.
- Unfortunately, this is computationally intractable:
  - When p = 3, 2^p = 8.
  - When p = 6, 2^p = 64.
  - When p = 250, there are 2^250 ≈ 2 × 10^75 possible models. For comparison, www.universetoday.com puts the number of atoms in the known universe at around 10^80.
- Not feasible to consider so many models!
- Need an efficient way to sift through all of these models: forward stepwise regression.
Forward Stepwise Regression
1. Use least squares to fit p univariate regression models, and select the predictor corresponding to the best model (according to e.g. training set MSE).
2. Use least squares to fit p − 1 models containing that one predictor, and each of the p − 1 other predictors. Select the predictors in the best two-variable model.
3. Now use least squares to fit p − 2 models containing those two predictors, and each of the p − 2 other predictors. Select the predictors in the best three-variable model.
4. And so on....
This gives us a nested set of models, containing the predictors M1 ⊆ M2 ⊆ M3 ⊆ . . . .
Forward Stepwise Regression With p = 10
Example in R
set.seed(1)                                # for reproducibility (added)
xtr <- matrix(rnorm(100*100), ncol=100)    # n = 100 observations, p = 100 predictors
beta <- c(rep(1,10), rep(0,90))            # only the first 10 predictors matter
ytr <- xtr %*% beta + rnorm(100)
library(leaps)
out <- regsubsets(xtr, ytr, nvmax=30, method="forward")
print(summary(out))
print(coef(out, 1:10))
Which Value of q is Best?
- This procedure traces out a set of models, containing between 1 and p variables.
- The qth model contains q variables, given by the set Mq.
- Q: Which value of q is best?
  A: The one that minimizes the test error!
- We can select the value of q using cross-validation or the validation set approach.
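As a sketch, q can be chosen with the validation set approach by computing the validation MSE of each model size fit by regsubsets (simulated data; all names are illustrative):

```r
library(leaps)
set.seed(1)
x <- matrix(rnorm(100 * 50), ncol = 50)
colnames(x) <- paste0("V", 1:50)          # named columns so coef() is easy to match
y <- x[, 1:5] %*% rep(1, 5) + rnorm(100)

train <- sample(100, 50)
fit <- regsubsets(x[train, ], y[train], nvmax = 20, method = "forward")

# Validation MSE for each model size q = 1, ..., 20.
val.mse <- sapply(1:20, function(q) {
  cf   <- coef(fit, id = q)               # intercept + the q selected coefficients
  vars <- names(cf)[-1]                   # names of the selected variables
  pred <- cbind(1, x[-train, vars, drop = FALSE]) %*% cf
  mean((y[-train] - pred)^2)
})
best.q <- which.min(val.mse)
print(best.q)
```

Having chosen best.q this way, one would then rerun forward stepwise on the full data set to obtain the final best.q-variable model.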
Drawback of Forward Stepwise Selection
- Forward stepwise selection isn't guaranteed to give you the best model containing q variables.
- To get the best model with q variables, you'd need to consider every possible one; computationally intractable.
- For instance, suppose that the best model with one variable is
  y = β3X3 + ε
  and the best model with two variables is
  y = β4X4 + β8X8 + ε.
  Then forward stepwise selection will not identify the best two-variable model.
- Q: Does this really happen in practice? A: Yes.
How To Do Forward Stepwise?
Wrong: Split the data into a training set and a validation set. Perform forward stepwise on the training set, and identify the model with best performance on the validation set. Then, refit the model (using those q variables) on the full data set.
Right: Split the data into a training set and a validation set. Perform forward stepwise on the training set, and identify the value of q corresponding to the best-performing model on the validation set. Then, perform forward stepwise selection in order to obtain a q-variable model on the full data set.
Bottom Line: We estimate the test error in order to choose the correct level of model complexity. Then we refit the model on the full data set.
Let’s Try It Out in R!
Chapter 6 R Lab, Part 1: www.statlearning.com
Ridge Regression and the Lasso
- Best subset and stepwise regression control model complexity by using subsets of the predictors.
- Ridge regression and the lasso instead control model complexity by using an alternative to least squares that shrinks the regression coefficients.
- This is known as regularization or penalization.
- Hot area in statistical machine learning today.
Crazy Coefficients
- When p > n, some of the variables are highly correlated.
- Why does correlation matter?
  - Suppose that X1 and X2 are highly correlated with each other... assume X1 = X2 for the sake of argument.
  - And suppose that the least squares model is
    y = X1 − 2X2 + 3X3.
  - Then this is also a least squares model:
    y = 100000001X1 − 100000002X2 + 3X3.
- Bottom Line: When there are too many variables, the least squares coefficients can get crazy!
- This craziness is directly responsible for poor test error.
- It amounts to too much model complexity.
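The duplicated-predictor argument above can be checked numerically; a small R sketch (values illustrative):

```r
# With duplicated predictors, wildly different coefficient vectors
# give (essentially) the same fitted values.
set.seed(1)
x1 <- rnorm(20); x2 <- x1; x3 <- rnorm(20)   # x2 is an exact copy of x1
X  <- cbind(x1, x2, x3)

b.sane  <- c(1, -2, 3)
b.crazy <- c(1 + 1e8, -2 - 1e8, 3)   # shift a huge amount between the duplicates

yhat1 <- X %*% b.sane
yhat2 <- X %*% b.crazy
print(max(abs(yhat1 - yhat2)))        # essentially zero (floating-point rounding only)
```

Both coefficient vectors fit the training data equally well, which is exactly why least squares cannot choose between them when p is large.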
A Solution: Don’t Let the Coefficients Get Too Crazy
- Recall that least squares involves finding β that minimizes
  ‖y − Xβ‖².
- Ridge regression involves finding β that minimizes
  ‖y − Xβ‖² + λ ∑j βj².
- Equivalently, find β that minimizes
  ‖y − Xβ‖²
  subject to the constraint that ∑j βj² ≤ s.
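One reason ridge regression is computationally attractive: unlike best subset selection, its minimizer has a well-known closed form (a standard fact, stated here for reference; it is not on the slide):

```latex
% Setting the gradient of \|y - X\beta\|^2 + \lambda \sum_j \beta_j^2 to zero gives
\hat{\beta}^R_\lambda = (X^\top X + \lambda I)^{-1} X^\top y .
% For \lambda > 0, the matrix X^\top X + \lambda I is always invertible,
% even when p > n; this is one reason ridge regression is well defined
% in high dimensions while ordinary least squares is not.
```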
Ridge Regression
- Ridge regression coefficient estimates minimize
  ‖y − Xβ‖² + λ ∑j βj².
- Here λ is a nonnegative tuning parameter that shrinks the coefficient estimates.
- When λ = 0, ridge regression is just the same as least squares.
- As λ increases, ∑j (βRλ,j)² decreases, i.e. the coefficients become shrunken towards zero.
- When λ = ∞, βRλ = 0.
Ridge Regression As λ Varies
Ridge Regression In Practice
- Perform ridge regression for a very fine grid of λ values.
- Use cross-validation or the validation set approach to select the optimal value of λ, that is, the best level of model complexity.
- Perform ridge on the full data set, using that value of λ.
Example in R
set.seed(1)                                # for reproducibility (added)
xtr <- matrix(rnorm(100*100), ncol=100)
beta <- c(rep(1,10), rep(0,90))
ytr <- xtr %*% beta + rnorm(100)
library(glmnet)
cv.out <- cv.glmnet(xtr, ytr, alpha=0, nfolds=5)   # alpha=0 gives the ridge penalty
print(cv.out$cvm)
plot(cv.out)
cat("CV Errors", cv.out$cvm, fill=TRUE)
cat("Lambda with smallest CV Error",
    cv.out$lambda[which.min(cv.out$cvm)], fill=TRUE)
cat("Coefficients", as.numeric(coef(cv.out)), fill=TRUE)
cat("Number of Zero Coefficients",
    sum(abs(coef(cv.out)) < 1e-8), fill=TRUE)
R Output
[Figure: cross-validated mean-squared error from plot(cv.out), plotted against log(λ).]
Drawbacks of Ridge
- Ridge regression is a simple idea and has a number of attractive properties: for instance, you can continuously control model complexity through the tuning parameter λ.
- But it suffers in terms of model interpretability, since the final model contains all p variables, no matter what.
- Often we want a simpler model involving a subset of the features.
- The lasso involves performing a little tweak to ridge regression so that the resulting model contains mostly zeros.
- In other words, the resulting model is sparse. We say that the lasso performs feature selection.
- The lasso is a very active area of research interest in the statistical community!
The Lasso
- The lasso involves finding β that minimizes
  ‖y − Xβ‖² + λ ∑j |βj|.
- Equivalently, find β that minimizes
  ‖y − Xβ‖²
  subject to the constraint that ∑j |βj| ≤ s.
- So lasso is just like ridge, except that βj² has been replaced with |βj|.
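To see why replacing βj² with |βj| produces exact zeros, consider the textbook special case of an orthonormal design (a simplifying assumption made here for illustration; it is not on the slide). Each lasso coordinate is then a soft-thresholded least squares estimate:

```latex
% With X^\top X = I, the objective separates across coordinates and
\hat{\beta}^L_{\lambda,j}
  = \operatorname{sign}\!\bigl(\hat{\beta}^{OLS}_j\bigr)
    \Bigl(\bigl|\hat{\beta}^{OLS}_j\bigr| - \lambda/2\Bigr)_{+},
% so any coefficient with |\hat{\beta}^{OLS}_j| \le \lambda/2 is set exactly to zero.
% Ridge's analogous solution, \hat{\beta}^{OLS}_j / (1+\lambda),
% shrinks every coefficient but never reaches zero.
```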
The Lasso
- Lasso is a lot like ridge:
  - λ is a nonnegative tuning parameter that controls model complexity.
  - When λ = 0, we get least squares.
  - When λ is very large, we get βLλ = 0.
- But unlike ridge, lasso will give some coefficients exactly equal to zero for intermediate values of λ!
Lasso As λ Varies
Lasso In Practice
- Perform the lasso for a very fine grid of λ values.
- Use cross-validation or the validation set approach to select the optimal value of λ, that is, the best level of model complexity.
- Perform the lasso on the full data set, using that value of λ.
Example in R
set.seed(1)                                # for reproducibility (added)
xtr <- matrix(rnorm(100*100), ncol=100)
beta <- c(rep(1,10), rep(0,90))
ytr <- xtr %*% beta + rnorm(100)
library(glmnet)
cv.out <- cv.glmnet(xtr, ytr, alpha=1, nfolds=5)   # alpha=1 gives the lasso penalty
print(cv.out$cvm)
plot(cv.out)
cat("CV Errors", cv.out$cvm, fill=TRUE)
cat("Lambda with smallest CV Error",
    cv.out$lambda[which.min(cv.out$cvm)], fill=TRUE)
cat("Coefficients", as.numeric(coef(cv.out)), fill=TRUE)
cat("Number of Zero Coefficients",
    sum(abs(coef(cv.out)) < 1e-8), fill=TRUE)
R Output
[Figure: cross-validated mean-squared error from plot(cv.out), plotted against log(λ); the numbers across the top of the plot (100, 100, 98, 96, . . . , 9, 2) give the number of nonzero coefficients at each value of λ.]
Ridge and Lasso: A Geometric Interpretation
Pros/Cons of Each Approach
Approach           Simplicity?*   Sparsity?**   Predictions?***
Pre-Selection      Good           Yes           So-So
Forward Stepwise   Good           Yes           So-So
Ridge              Medium         No            Great
Lasso              Bad            Yes           Great

* How simple is this model-fitting procedure? If you were stranded on a desert island with pretty limited statistical software, could you fit this model?
** Does this approach perform feature selection, i.e. is the resulting model sparse?
*** How good are the predictions resulting from this model?
No “Best” Approach
- There is no "best" approach to regression in high dimensions.
- Some approaches will work better than others. For instance:
  - Lasso will work well if it's really true that just a few features are associated with the response.
  - Ridge will do better if all of the features are associated with the response.
- If somebody tells you that one approach is "best"... then they are mistaken. Politely contradict them.
- While no approach is "best", some approaches are wrong (e.g.: there is a wrong way to do cross-validation)!
Let’s Try It Out in R!
Chapter 6 R Lab, Part 2: www.statlearning.com
Making Linear Regression Less Linear
What if the relationship isn't linear?
y = 3 sin(x) + ε
y = 2e^x + ε
y = 3x² + 2x + 1 + ε
If we know the functional form we can still use "linear regression".
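For instance, the first model above can still be fit by least squares, simply by regressing on the transformed feature; a sketch on simulated data:

```r
# y = 3 sin(x) + noise: "linear regression" with sin(x) as the feature.
set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- 3 * sin(x) + rnorm(200, sd = 0.3)

fit <- lm(y ~ sin(x))   # linear in the transformed feature sin(x)
print(coef(fit))        # the slope estimate should be close to 3
```

The model is nonlinear in x but linear in sin(x), which is all that least squares requires.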
Making Linear Regression Less Linear
y = 3 sin(x) + ε:   x → sin(x)
y = 3x² + 2x + 1 + ε:   x → (x, x²)
Making Linear Regression Less Linear
What if we don’t know the right functional form?
Use a flexible basis expansion:
- polynomial basis: x → (x, x², . . . , x^k)
- hockey-stick (/spline) basis: x → (x, (x − t1)+, . . . , (x − tk)+)
Making Linear Regression Less Linear
[Figure: noisy data from a sine curve, with fitted curves from the sin basis and a polynomial basis (legend: sin, poly).]
Making Linear Regression Less Linear
For high-dimensional problems, expand each variable:
(x1, x2, . . . , xp) → (x1, . . . , x1^k1, x2, . . . , x2^k2, . . . , xp, . . . , xp^kp)
and use the lasso on this expanded problem.
k must be small (∼ 5ish)
Spline basis generally outperforms polynomial
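A sketch of this recipe in R, using a natural-spline basis from the splines package and the lasso via cv.glmnet (simulated data; k = 5 and all names are illustrative):

```r
library(splines)
library(glmnet)
set.seed(1)
n <- 200; p <- 10
x <- matrix(runif(n * p), ncol = p)
y <- 3 * sin(2 * pi * x[, 1]) + x[, 2]^2 + rnorm(n, sd = 0.3)

# Expanded design matrix: k spline basis columns per original variable.
k <- 5
xbig <- do.call(cbind, lapply(1:p, function(j) ns(x[, j], df = k)))

# Lasso on the expanded (n x p*k) problem.
cv.out <- cv.glmnet(xbig, y, alpha = 1, nfolds = 5)
print(sum(coef(cv.out) != 0))   # only a few of the p*k columns should survive
```

The lasso then selects both which variables matter and which basis functions of each variable to keep.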
Bottom Line
Much more important than what model you fit is how you fit it.
- Was cross-validation performed properly?
- Did you select a model (or level of model complexity) based on an estimate of test error?
Logistic Regression | K-Nearest Neighbors | Bayes-Based Classifiers | SVM
High-Dimensional Statistical Learning: Classification
Ali Shojaie, University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
Classification
- Regression involves predicting a continuous-valued response, like tumor size.
- Classification involves predicting a categorical response:
  - Cancer versus Normal
  - Tumor Type 1 versus Tumor Type 2 versus Tumor Type 3
- Classification problems tend to occur very frequently in the analysis of high-dimensional data.
- Just like regression,
  - Classification cannot be blindly performed in high dimensions because you will get zero training error but awful test error;
  - Properly estimating the test error is crucial; and
  - There are a few tricks to extend classical classification approaches to high dimensions, which we have already seen in the regression context!
2 / 50
Classification
I There are many approaches out there for performing classification.
I We will discuss a few: logistic regression, K-nearest neighbors, discriminant analysis, and support vector machines.
3 / 50
Logistic Regression
I Logistic regression is the straightforward extension of linear regression to the classification setting.
I For simplicity, suppose y ∈ {0, 1}: a two-class classification problem.
I The simple linear model

    y = Xβ + ε

  doesn't make sense for classification.
I Instead, the logistic regression model is

    P(y = 1 | X) = exp(X^T β) / (1 + exp(X^T β)).

I We usually fit this model using maximum likelihood — like least squares, but for logistic regression.
4 / 50
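In R, the maximum-likelihood fit above is a one-liner with glm(); the simulated data and names below are illustrative.

```r
# simulate a two-class problem where P(y = 1 | x) follows the logistic model
set.seed(1)
n <- 500
x <- rnorm(n)
p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))
y <- rbinom(n, 1, p)

fit <- glm(y ~ x, family = binomial)  # maximum-likelihood fit
coef(fit)                             # estimates of (beta0, beta1)
```

Note that the fitted probabilities from this model always lie in (0, 1), unlike those from a linear model.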
Why Not Linear Regression?
[Figure: fitted probability of cancer versus biomarker level, two panels.]
I Left: linear regression.
I Right: logistic regression.
5 / 50
Ways to Extend Logistic Regression to High Dimensions
1. Variable Pre-Selection
2. Forward Stepwise Logistic Regression
3. Ridge Logistic Regression
4. Lasso Logistic Regression
How to decide which approach is best, and which tuning parameter value to use for each approach? Cross-validation or the validation set approach.
6 / 50
What is an appropriate validation measure?
For classification without a probability or score:
I Misclassification rate:

    (# test samples misclassified) / (total # of test samples)
7 / 50
What is an appropriate validation measure?
For probabilistic classification:
I Can still use the misclassification rate.
I As in continuous regression, could use the SSE:

    Σ_{i ∈ test} (y_i − p_i)^2

I Often preferable to use the “predictive [log-]likelihood”:

    −log [ Π_{i ∈ test} p_i^{y_i} (1 − p_i)^{1 − y_i} ]

I Can also use an ROC-curve-based metric (e.g. AUC)
Remember though: all of these must be computed on a separate validation set.
8 / 50
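The measures above are easy to compute by hand; yte and pte below are illustrative test-set labels and predicted probabilities.

```r
yte <- c(1, 0, 1, 1, 0)            # true 0/1 labels on the test set
pte <- c(0.9, 0.2, 0.6, 0.4, 0.1)  # predicted probabilities

misclass <- mean((pte >= 0.5) != yte)                        # misclassification rate
sse      <- sum((yte - pte)^2)                               # SSE on probabilities
negll    <- -sum(yte * log(pte) + (1 - yte) * log(1 - pte))  # predictive -log-likelihood
```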
Example in R: Lasso Logistic Regression
library(glmnet)  # for cv.glmnet

xtr <- matrix(rnorm(1000 * 20), ncol = 20)            # 1000 training observations, 20 features
beta <- c(rep(1, 5), rep(0, 15))                      # only the first 5 features matter
ytr <- 1 * ((xtr %*% beta + 0.5 * rnorm(1000)) >= 0)  # binary response
cv.out <- cv.glmnet(xtr, ytr, family = "binomial", alpha = 1)  # lasso (alpha = 1)
plot(cv.out)
9 / 50
K -Nearest Neighbors
I Can I take a totally non-parametric (model-free) approach to classification?
I K-nearest neighbors:
  1. Identify the K observations whose X values are closest to the observation at which we want to make a prediction.
  2. Classify the observation of interest to the most frequent class label of those K nearest neighbors.
10 / 50
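The two steps above are implemented by knn() in the class package (part of the standard R distribution); the data here are simulated and the names illustrative.

```r
library(class)  # provides knn()

set.seed(1)
xtr <- matrix(rnorm(100 * 2), ncol = 2)                   # training features
ytr <- factor(ifelse(xtr[, 1] + xtr[, 2] > 0, "A", "B"))  # training labels
xte <- rbind(c(2, 2), c(-2, -2))                          # two test points

# find the K = 5 nearest training points and take a majority vote
pred <- knn(train = xtr, test = xte, cl = ytr, k = 5)
```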
K -Nearest Neighbors
[Figure: simulated two-class data plotted in (X1, X2).]
11 / 50
K -Nearest Neighbors
[Figure: illustration of a KNN prediction for a test observation.]
12 / 50
K -Nearest Neighbors
[Figure: KNN decision boundary on the (X1, X2) data, K = 10.]
13 / 50
K -Nearest Neighbors
[Figure: KNN decision boundaries with K = 1 (left) and K = 100 (right).]
14 / 50
K -Nearest Neighbors
[Figure: training and test error rates versus 1/K.]
15 / 50
K -Nearest Neighbors
I Simple, intuitive, model-free.
I Good option when p is very small.
I Curse of dimensionality: when p is large, no neighbors are “near”. All observations are close to the boundary.
I Do not use in high dimensions!
  I unless you use it with a dimension reduction approach
16 / 50
Bayes-Based Classifiers
Suppose rather than knowing P(y = j | x)...
we have information on f_j(x) = P(x | y = j), the feature distribution within each class.
How do we use this to make predictions?
Using Bayes' Theorem:

    P(y = j | x) = f_j(x) π_j / Σ_k f_k(x) π_k

here π_k = P(y = k) is the prior probability of class k.
17 / 50
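A numeric sketch of the rule above, with K = 2 Gaussian classes (all values illustrative):

```r
x   <- 1.0                         # point at which we want P(y = j | x)
pi1 <- 0.7; pi2 <- 0.3             # prior probabilities
f1  <- dnorm(x, mean = 0, sd = 1)  # f_1(x): class-1 feature density at x
f2  <- dnorm(x, mean = 2, sd = 1)  # f_2(x): class-2 feature density at x

post1 <- f1 * pi1 / (f1 * pi1 + f2 * pi2)  # Bayes' theorem
```

Here x = 1 is equidistant from the two class means, so f_1(x) = f_2(x) and the posterior reduces to the prior: post1 = 0.7.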
Estimating the Rule
To apply Bayes' Theorem

    P(y = j | x) = f_j(x) π_j / Σ_k f_k(x) π_k

we need
I f_k(x) for k = 1, . . . , K
I π_k for k = 1, . . . , K
18 / 50
Estimating πk
π_k is generally simple to estimate.
I If your data are a random sample, then we can use the sample proportion

    π_k = #{y_i = k} / n

I Otherwise we can use outside information (e.g. historical data)
If the population proportions change, it is easy to adjust the rule.
19 / 50
Estimating fk(x)
Estimating f_k(x) = P(x | y = k) is more difficult.
This is a density estimation problem.
The tools we discuss for this break down into 3 general categories:
I flexible, non-parametric estimates
I parametric estimates
I shrunken parametric estimates
The above are ordered (more or less) by where they fall on the bias/variance spectrum:

    more flexible → less bias / more variance
20 / 50
Parametric fk(x) Estimate
The best-known estimator of this type is Linear/Quadratic Discriminant Analysis:
Here we assume that f_k(x) is a Gaussian density, N(μ_k, Σ_k).
21 / 50
Discriminant Analysis
There are three main types of unpenalized discriminant analysis:
I Quadratic (QDA)
I Linear (LDA)
I Diagonal (DDA)
These make different assumptions on the covariance structure:
I QDA makes no assumptions
I LDA assumes a pooled covariance: Σ_k = Σ for all k
I DDA assumes a pooled covariance, and further that Σ is diagonal (i.e. no correlation among covariates!)
22 / 50
Discriminant Analysis
Why would we choose DDA over QDA?
Remember, flexibility comes at a price!
QDA will have the least bias, but has many more parameters to estimate.
Often good estimates of the correlations don't improve classification much.
DDA takes into account the scale of each feature, and trades a bit of bias for potentially a large reduction in variance.
23 / 50
LDA for p = 1
I To make this work, we need to estimate the parameters. The ML estimates are given by π_k = n_k / n and

    μ_k = (1 / n_k) Σ_{i: y_i = k} x_i,    σ² = (1 / (n − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ_k)²

I The picture is very similar if K > 2... or if p > 1
24 / 50
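The ML estimates above can be computed by hand in base R; the tiny data set below is illustrative.

```r
x <- c(1, 2, 3, 7, 8, 9)  # one feature (p = 1)
y <- c(1, 1, 1, 2, 2, 2)  # class labels, K = 2
n <- length(x); K <- 2

nk  <- table(y)
pik <- nk / n                         # pi_k = n_k / n
muk <- tapply(x, y, mean)             # mu_k: class means
s2  <- sum((x - muk[y])^2) / (n - K)  # pooled variance estimate
```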
LDA for p > 1
25 / 50
LDA for p > 1
26 / 50
QDA vs LDA
The level-curves for each class look identical with LDA;
QDA allows different classes to have differently shaped ellipsoids...
This results in non-linear decision boundaries (quadratic, in fact).
27 / 50
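LDA and QDA are available in the MASS package (part of the standard R distribution); a minimal sketch on simulated two-class data:

```r
library(MASS)  # provides lda() and qda()

set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # class "a"
           matrix(rnorm(100, mean = 2), ncol = 2))  # class "b"
y <- factor(rep(c("a", "b"), each = 50))

fit.lda <- lda(x, grouping = y)  # pooled covariance: linear boundary
fit.qda <- qda(x, grouping = y)  # per-class covariances: quadratic boundary
pred    <- predict(fit.lda, x)$class
```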
DDA
For DDA...
I level curves are spheres (not ellipsoids).
I decision boundaries are still linear
I sometimes called naive Bayes (that doesn't mean it's bad, though!)
I with π_k = 1/K for all k, and equal variances (i.e. Σ = σI), this is just the nearest centroid classifier
28 / 50
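The nearest centroid special case above can be written in a few lines of base R (the function and data names are illustrative):

```r
# assign each test row to the class whose mean vector is closest
nearest.centroid <- function(xtr, ytr, xte) {
  cent <- apply(xtr, 2, function(col) tapply(col, ytr, mean))  # classes x features
  apply(xte, 1, function(z) {
    d2 <- rowSums(sweep(cent, 2, z)^2)  # squared distance to each centroid
    rownames(cent)[which.min(d2)]
  })
}

xtr <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
ytr <- c("a", "a", "b", "b")
nearest.centroid(xtr, ytr, rbind(c(0, 0), c(5, 5)))  # "a" then "b"
```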
-DA vs logistic regression
The Discriminant Analysis model can actually be rewritten as a multinomial logistic model:
Beginning with

    P(y = j | x) = f_j(x) π_j / Σ_k f_k(x) π_k

and

    f_k(x) ∝ exp[ −(1/2) (x − μ_k)^T Σ_k^{−1} (x − μ_k) ]

substituting and simplifying, we get

    P(y = j | x) = e^{η_j} / Σ_k e^{η_k}
29 / 50
-DA vs logistic regression
    P(y = j | x) = e^{η_j} / Σ_k e^{η_k}

where

    η_k = β_0 + x^T β + x^T Σ_k^{−1} x

This is just a multinomial logistic model with quadratic terms and interactions.
In particular, for LDA (where Σ_k = Σ is pooled) we have cancellation and get

    η_k = β_0 + x^T β

Simply a linear logistic model.
30 / 50
Shrunken Parametric Estimates
Sometimes the optimal bias/variance tradeoff is between two parametric classes.
For example: we may not have the data to estimate completely different covariance matrices for each class (i.e. QDA), but we may not want to use identical covariance matrices.
In this case we can take a weighted combination of our estimates. This is called regularized discriminant analysis.
This is a type of shrunken parametric estimator.
31 / 50
Regularized Discriminant Analysis
For shrinking between QDA and LDA we use:

    Σ_k^{RDA} = λ Σ^{LDA} + (1 − λ) Σ_k^{QDA}

For shrinking between LDA and Naive Bayes we use:

    Σ^{RDA} = λ Σ^{LDA} + (1 − λ) Σ^{NB}

λ is a tuning parameter, and is generally selected via CV.
32 / 50
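The shrinkage above is just a convex combination of covariance estimates; a numeric sketch with illustrative 2 x 2 matrices:

```r
Sig.lda <- matrix(c(1, 0.5, 0.5, 1), 2, 2)  # pooled (LDA-style) estimate
Sig.qda <- matrix(c(2, 0, 0, 0.5), 2, 2)    # class-k (QDA-style) estimate
lambda  <- 0.3                              # tuning parameter, chosen by CV

Sig.rda <- lambda * Sig.lda + (1 - lambda) * Sig.qda
```

Setting lambda = 1 recovers the LDA estimate, and lambda = 0 recovers the QDA estimate for this class.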
DA in High Dimensions
All of the Discriminant Analysis techniques discussed so far use all the features.
For high-dimensional problems this will lead to over-fitting.
One popular solution is to shrink each class-mean estimate μ_k towards the overall mean μ using element-wise soft-thresholding.
This method is called Nearest Shrunken Centroids (though it should probably more appropriately be called “nearest shrunken DDA”).
33 / 50
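Element-wise soft-thresholding has a one-line form; the function and values below are a sketch, with delta playing the role of the shrinkage amount:

```r
# move each class-mean coordinate towards the overall mean by delta,
# setting it exactly equal once it is within delta
soft.threshold <- function(mu.k, mu, delta) {
  d <- mu.k - mu
  mu + sign(d) * pmax(abs(d) - delta, 0)
}

soft.threshold(c(3, 0.5, -2), mu = 0, delta = 1)  # 2.0  0.0 -1.0
```

Coordinates shrunk all the way to the overall mean no longer help discriminate, which is how the method selects features.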
Regularized DA Methods
I Recall that in PAM, the μ_jk's are soft-thresholded towards the common mean μ_j.
I Alternatively, the μ_jk's can be penalized
  I towards zero, using a lasso penalty:

      Σ_j Σ_k |μ_jk|,

  I or towards each other, using a fused lasso penalty:

      Σ_j Σ_{k,k′} |μ_jk − μ_jk′|.

  Both of these are implemented in the R package penalizedLDA.
I Another option, which is especially helpful when using QDA, is to penalize the covariance matrices Σ_k (or their inverses).
34 / 50
Let’s Try It Out in R!
Chapter 4 R Lab: www.statlearning.com
35 / 50
Support Vector Machines
I Developed around 1995.
I Touted as “overcoming the curse of dimensionality.”
I Does not (automatically) overcome the curse of dimensionality!
I Fundamentally and numerically very similar to logistic regression.
I But, it is a nice idea.
36 / 50
Separating Hyperplane
37 / 50
Classification Via a Separating Hyperplane
Blue class if β0 + β1X1 + β2X2 > c; red class otherwise.
38 / 50
Maximal Separating Hyperplane
Note that only a few observations are on the margin: these are the support vectors.
39 / 50
What if There is No Separating Hyperplane?
40 / 50
Support Vector Classifier: Allow for Violations
41 / 50
Support Vector Machine
I The support vector machine is just like the support vector classifier, but it elegantly allows for non-linear expansions of the variables: “non-linear kernels”.
I However, linear regression, logistic regression, and other classical statistical approaches can also be applied to non-linear functions of the variables.
I For historical reasons, SVMs are more frequently used with non-linear expansions as compared to other statistical approaches.
42 / 50
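The second bullet above can be made concrete in base R: giving logistic regression a quadratic expansion of the variables lets it capture a non-linear (here circular) boundary, no SVM required. The data and names are illustrative.

```r
set.seed(1)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- as.numeric(x1^2 + x2^2 > 1.5)  # circular class boundary

# logistic regression on a quadratic expansion of (x1, x2)
fit <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2), family = binomial)
acc <- mean((fitted(fit) > 0.5) == y)  # training accuracy
```

Because the true boundary is exactly quadratic in the expanded features, the fit separates the classes almost perfectly (glm may warn about fitted probabilities of 0 or 1; that is expected here).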
Non-Linear Class Structure
This will be hard for a linear classifier!
43 / 50
Try a Support Vector Classifier
Uh-oh!!
44 / 50
Support Vector Machine
Much Better.
45 / 50
Is A Non-Linear Kernel Better?
I Yes, if the true decision boundary between the classes is non-linear, and you have enough observations (relative to the number of features) to accurately estimate the decision boundary.
I No, if you are in a very high-dimensional setting such that estimating a non-linear decision boundary is hopeless.
46 / 50
SVM vs Other Classification Methods
I The main difference between SVM and other classification methods (e.g. logistic regression) is the loss function used to assess the “fit”:

    Σ_{i=1}^{n} L(f(x_i), y_i)

I Zero-one loss: I(f(x_i) ≠ y_i), where I(·) is the indicator function. Not continuous, so hard to work with!!
I Hinge loss: max(0, 1 − f(x_i) y_i)
I Logistic loss: log(1 + exp(−f(x_i) y_i))
47 / 50
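The three losses above are easy to compare as functions of the margin m = f(x_i) y_i, with y coded as ±1 (a common convention; this is a sketch):

```r
zero.one <- function(m) as.numeric(m <= 0)  # 1 when the sign is wrong
hinge    <- function(m) pmax(0, 1 - m)      # SVM loss
logistic <- function(m) log(1 + exp(-m))    # logistic-regression loss

m <- c(-1, 0, 0.5, 2)
rbind(zero.one(m), hinge(m), logistic(m))
```

Hinge and logistic losses are continuous upper bounds on the zero-one loss, which is what makes them tractable to optimize.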
SVM vs Logistic Regression
I Bottom Line: the support vector classifier and logistic regression aren't that different!
I Neither they nor any other approach can overcome the “curse of dimensionality”.
I The “kernel trick” makes things computationally easier, but it does not remove the danger of overfitting.
I SVM uses a non-linear kernel... but we could do that with logistic or linear regression too!
I A disadvantage of SVM (compared to, e.g., logistic regression) is that it does not provide a measure of uncertainty: cases are “classified” to belong to one of the two classes.
48 / 50
Let’s Try It Out in R!
Chapter 9 R Lab: www.statlearning.com
49 / 50
In High Dimensions...
I In SVMs, a tuning parameter controls the amount of flexibility of the classifier.
I This tuning parameter acts like a ridge penalty, both mathematically and conceptually: the SVM problem can be written as a ridge-penalized problem with the hinge loss, so the decision rule involves all of the variables.
I A sparse SVM can be obtained by using a lasso penalty instead; this yields a decision rule involving only a subset of the features.
I Logistic regression and other classical statistical approaches can also be used with non-linear expansions of the features, but this makes high-dimensionality issues worse.
50 / 50
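The lasso-penalized ("sparse") SVM mentioned above can be sketched with a simple proximal subgradient method: take subgradient steps on the mean hinge loss, then soft-threshold the weights for the lasso penalty. This is a minimal standalone illustration (again in Python with only the standard library; the toy data, step size, and penalty level are made-up choices, not part of the slides):

```python
import random

def sparse_svm(X, y, lam=0.3, lr=0.05, epochs=200):
    """Lasso-penalized linear SVM: minimizes mean hinge loss + lam * sum(|w_j|)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point inside the margin: hinge subgradient
                for j in range(p):
                    gw[j] -= yi * xi[j]
                gb -= yi
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
        # soft-thresholding: the proximal step for the lasso penalty,
        # which shrinks small weights all the way to zero
        w = [max(abs(wj) - lr * lam, 0.0) * (1 if wj > 0 else -1) for wj in w]
    return w, b

# toy data: feature 0 separates the classes, feature 1 is pure noise
random.seed(0)
y = [1, -1] * 20
X = [[yi * 2.0, random.gauss(0, 1)] for yi in y]
w, b = sparse_svm(X, y, lam=0.3)
print("weights:", w)  # the noise feature's weight is driven toward zero
```

With the ridge-like penalty of the standard SVM, both weights would typically be nonzero; the lasso penalty instead shrinks the uninformative weight toward exactly zero, so the decision rule depends on only a subset of the features.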