TRANSCRIPT
Classical Statistics | Biological Big Data | Supervised and Unsupervised Learning
High-Dimensional Statistical Learning: Introduction
Ali Shojaie, University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
A Simple Example
I Suppose we have n = 400 people with diabetes for whom we have p = 3 serum-level measurements (LDL, HDL, GLU).
I Want to predict these peoples’ disease progression after 1 year.
A Simple Example
Notation:
I n is the number of observations.
I p is the number of variables/features/predictors.
I y is an n-vector containing the response/outcome for each of the n observations.
I X is an n × p data matrix.
Linear Regression on a Simple Example
I You can perform linear regression to develop a model to predict progression using LDL, HDL, and GLU:
y = β0 + β1X1 + β2X2 + β3X3 + ε
where y is our continuous measure of disease progression, X1, X2, X3 are our serum-level measurements, and ε is a noise term.
I You can look at the coefficients, p-values, and t-statistics for your linear regression model in order to interpret your results.
I You learned everything (or most of what) you need to analyze this data set in Classical Statistics!
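A fit like the one above can be sketched in R with lm(). The data below are simulated stand-ins (the sample size matches the slide, but the values and coefficients are invented), since the original diabetes data set is not included here; with the real data one would call lm() the same way.

```r
# Simulate stand-in data for the n = 400 diabetes example (hypothetical values).
set.seed(1)
n <- 400
dat <- data.frame(LDL = rnorm(n), HDL = rnorm(n), GLU = rnorm(n))
dat$progression <- 150 + 75 * dat$LDL - 480 * dat$HDL + 475 * dat$GLU +
  rnorm(n, sd = 50)

# Fit the linear model and inspect coefficients, t-statistics, and p-values.
fit <- lm(progression ~ LDL + HDL + GLU, data = dat)
summary(fit)
```

The coefficient table printed by summary(fit) is what the "Linear Model Output" slide later in this lecture shows for the real data.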
A Relationship Between the Variables?
[Figure: scatterplot matrix of the four variables – progression, LDL, HDL, and GLU – for the n = 400 observations.]
Linear Model Output
            Estimate   Std. Error   T-Stat   P-Value
Intercept    152.928        3.385   45.178   < 2e-16  ***
LDL           77.057       75.701    1.018   0.309
HDL         -487.574       75.605   -6.449   3.28e-10 ***
GLU          477.604       76.643    6.232   1.18e-09 ***
progression measure ≈ 152.9 + 77.1 × LDL − 487.6 × HDL + 477.6 × GLU.
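Plugging a new patient's serum values into the fitted equation gives a predicted progression. The LDL/HDL/GLU values below are hypothetical, chosen only to illustrate the arithmetic:

```r
# Coefficients from the fitted model above.
coefs <- c(intercept = 152.928, LDL = 77.057, HDL = -487.574, GLU = 477.604)

# Hypothetical serum measurements for a new patient (invented for illustration).
new_patient <- c(LDL = 0.05, HDL = -0.02, GLU = 0.03)

# Predicted progression: intercept plus the weighted sum of the predictors.
pred <- coefs["intercept"] + sum(coefs[names(new_patient)] * new_patient)
unname(pred)  # about 180.9
```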
Low-Dimensional Versus High-Dimensional
I The data set that we just saw is low-dimensional: n ≫ p.
I Lots of the data sets coming out of modern biological techniques are high-dimensional: n ≈ p or n ≪ p.
I This poses statistical challenges! Classical Statistics no longer applies.
Low Dimensional
High Dimensional
What Goes Wrong in High Dimensions?
I Suppose that we included many additional predictors in our model, such as
  I Age
  I Zodiac symbol
  I Favorite color
  I Mother's birthday, in base 2
I Some of these predictors are useful, others aren't.
I If we include too many predictors, we will overfit the data.
I Overfitting: Model looks great on the data used to develop it, but will perform very poorly on future observations.
I When p ≈ n or p > n, overfitting is guaranteed unless we are very careful.
Why Does Dimensionality Matter?
I Classical statistical techniques, such as linear regression, cannot be applied.
I Even very simple tasks, like identifying variables that are associated with a response, must be done with care.
I High risks of overfitting, false positives, and more.
This course: Statistical machine learning tools for big — mostly high-dimensional — data.
Supervised and Unsupervised Learning
I Statistical machine learning can be divided into two main areas: supervised and unsupervised.
I Supervised Learning: Use a data set X to predict or detect association with a response y.
  I Regression
  I Classification
  I Hypothesis Testing
I Unsupervised Learning: Discover the signal in X, or detect associations within X.
  I Dimension Reduction
  I Clustering
Supervised Learning
Unsupervised Learning
“Course Textbook” . . . with applications in R
I Available for (free!) download from www.statlearning.com.
I An accessible introduction to statistical machine learning, with an R lab at the end of each chapter.
I We will go through some of these R labs.
I To learn more, go through them on your own!
High-Dimensional Statistical Learning: Bias-Variance Tradeoff and the Test Error
Ali Shojaie, University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
Supervised Learning
Regression Versus Classification
I Regression: Predict a quantitative response, such as
  I blood pressure
  I cholesterol level
  I tumor size
I Classification: Predict a categorical response, such as
  I tumor versus normal tissue
  I heart disease versus no heart disease
  I subtype of glioblastoma
I This lecture: Regression.
Linear Models
I We have n observations, for each of which we have p predictor measurements and a response measurement.
I Want to develop a model of the form
yi = β0 + β1Xi1 + . . .+ βpXip + εi .
I Here εi is a noise term associated with the ith observation.
I Must estimate β0, β1, . . . , βp – i.e. we must fit the model.
Linear Model With p = 2 Predictors
What Makes a Model Linear?
I A linear model is linear in the regression coefficients!
I This is a linear model:
yi = β1 sin(Xi1) + β2Xi2Xi3 + εi .
I This is not a linear model:
yi = β1^{Xi1} + sin(β2Xi2) + εi .
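As a quick check of the "linear in the coefficients" point, lm() can fit the first model directly by treating sin(X1) and X2·X3 as derived predictors. The data below are simulated purely for illustration:

```r
# Simulate data from the linear-in-coefficients model on the slide:
# y = 2*sin(X1) + 3*X2*X3 + noise (true coefficients 2 and 3 are invented).
set.seed(1)
n <- 200
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
y <- 2 * sin(X1) + 3 * X2 * X3 + rnorm(n)

# lm() handles this because the model is linear in beta1 and beta2;
# the predictors themselves may be arbitrary transformations.
fit <- lm(y ~ sin(X1) + I(X2 * X3) - 1)  # no intercept, matching the slide
coef(fit)  # estimates should be near the true values 2 and 3
```

The second (non-linear) model on the slide has no such trick: β1 and β2 enter non-linearly, so least squares in this form does not apply.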
Linear Models in Matrix Form
I For simplicity, ignore the intercept β0.
I Assume ∑_{i=1}^n yi = ∑_{i=1}^n Xij = 0; in this case, β0 = 0.
I Alternatively, let the first column of X be a column of 1's.
I In matrix form, we can write the linear model as
y = Xβ + ε,
i.e.
[ y1 ]   [ X11 X12 ... X1p ] [ β1 ]   [ ε1 ]
[ y2 ] = [ X21 X22 ... X2p ] [ β2 ] + [ ε2 ]
[ .. ]   [ ...         ... ] [ .. ]   [ .. ]
[ yn ]   [ Xn1 Xn2 ... Xnp ] [ βp ]   [ εn ]
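The matrix form can be written out directly in R. The dimensions and coefficient values below are toy choices for illustration:

```r
# Construct y = X beta + epsilon with toy dimensions.
set.seed(1)
n <- 5; p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # n x p design matrix
beta <- c(1, -2, 0.5)                          # p-vector of coefficients
eps <- rnorm(n)                                # n-vector of noise
y <- X %*% beta + eps                          # n x 1 response vector
dim(y)
```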
Least Squares Regression
I There are many ways we could fit the model
y = Xβ + ε.
I Most common approach in classical statistics is least squares:
minimize_β { ‖y − Xβ‖² }.
Here ‖a‖² ≡ ∑_{i=1}^n a_i².
I We are looking for β̂1, . . . , β̂p such that
∑_{i=1}^n (yi − (β̂1Xi1 + . . . + β̂pXip))²
is as small as possible, or in other words, such that
∑_{i=1}^n (yi − ŷi)²
is as small as possible, where ŷi is the ith predicted value.
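When X has full column rank, the least squares problem above has the closed-form solution β̂ = (XᵀX)⁻¹Xᵀy. This sketch, on simulated data, checks the normal-equations solution against lm():

```r
# Simulated data (toy coefficients chosen for illustration).
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, 2, -1) + rnorm(n)

# Solve the normal equations (X'X) betahat = X'y directly...
betahat <- solve(t(X) %*% X, t(X) %*% y)

# ...and compare with lm() without an intercept: the fits agree.
fit <- lm(y ~ X - 1)
max(abs(betahat - coef(fit)))  # essentially zero
```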
Let’s Try Out Least Squares in R!
Chapter 3 R lab, www.statlearning.com
Least Squares Regression
I When we fit a model, we use a training set of observations.
I We get coefficient estimates β̂1, . . . , β̂p.
I We also get predictions using our model, of the form
ŷi = β̂1Xi1 + . . . + β̂pXip.
I We can evaluate the training error, i.e. the extent to which the model fits the observations used to train it.
I One way to quantify the training error is using the mean squared error (MSE):
MSE = (1/n) ∑_{i=1}^n (yi − ŷi)² = (1/n) ∑_{i=1}^n (yi − (β̂1Xi1 + . . . + β̂pXip))².
I The training error is closely related to the R² for a linear model – that is, the proportion of variance explained.
I Big R² ⇔ Small Training Error.
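The training MSE and its relationship to R² can be checked numerically in R. The data below are simulated for illustration:

```r
# Simulate a simple regression and fit it.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Training MSE: average squared residual on the training observations.
mse <- mean((y - fitted(fit))^2)

# R^2 computed from the same residuals: 1 - RSS/TSS.
rsq <- 1 - sum((y - fitted(fit))^2) / sum((y - mean(y))^2)
c(mse = mse, rsq = rsq)  # rsq matches summary(fit)$r.squared
```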
Least Squares as More Variables are Included in the Model
I Training error and R² are not good ways to evaluate a model's performance, because they will always improve as more variables are added into the model.
I The problem? Training error and R² evaluate the model's performance on the training observations.
I If I had an unlimited number of features to use in developing a model, then I could surely come up with a regression model that fits the training data perfectly! Unfortunately, this model wouldn't capture the true signal in the data.
I We really care about the model's performance on test observations – observations not used to fit the model.
The Problem
As we add more variables into the model...
... the training error decreases and the R2 increases!
Why is this a Problem?
I We really care about the model's performance on observations not used to fit the model!
I We want a model that will predict the survival time of a new patient who walks into the clinic!
I We want a model that can be used to diagnose cancer for a patient not used in model training!
I We want to predict risk of diabetes for a patient who wasn't used to fit the model!
I What we really care about:
(ŷtest − ytest)²,
where
ŷtest = β̂1Xtest,1 + . . . + β̂pXtest,p,
and (Xtest, ytest) was not used to train the model.
I The test error is the average of (ŷtest − ytest)² over a bunch of test observations.
Training Error versus Test Error
As we add more variables into the model...
... the training error decreases and the R2 increases!
But the test error might not!
Why the Number of Variables Matters
I Linear regression will have a very low training error if p is large relative to n.
I A simple example:
  I When n ≤ p, you can always get a perfect model fit to the training data!
  I But the test error will be awful.
Model Complexity, Training Error, and Test Error
I In this course, we will consider various types of models.
I We will be very concerned with model complexity: e.g. the number of variables used to fit a model.
I As we fit more complex models – e.g. models with more variables – the training error will always decrease.
I But the test error might not.
I As we will see, the number of variables in the model is not the only – or even the best – way to quantify model complexity.
An Example In R
# Simulate training data (n = 100) and a large test set; only the first
# 10 of the 100 predictors carry signal.
xtr <- matrix(rnorm(100*100), ncol=100)
xte <- matrix(rnorm(100000*100), ncol=100)
beta <- c(rep(1,10), rep(0,90))
ytr <- xtr %*% beta + rnorm(100)
yte <- xte %*% beta + rnorm(100000)
rsq <- trainerr <- testerr <- NULL
# Fit least squares on the first i predictors, for i = 2, ..., 100.
for(i in 2:100){
  mod <- lm(ytr ~ xtr[,1:i])
  rsq <- c(rsq, summary(mod)$r.squared)
  betahat <- mod$coef[-1]    # estimated coefficients (named so the true
  intercept <- mod$coef[1]   # beta above is not overwritten)
  trainerr <- c(trainerr, mean((xtr[,1:i] %*% betahat + intercept - ytr)^2))
  testerr <- c(testerr, mean((xte[,1:i] %*% betahat + intercept - yte)^2))
}
# Plot R^2, training error, and test error against model size;
# the red vertical line marks the 10 signal variables.
par(mfrow=c(1,3))
plot(2:100, rsq, xlab='Number of Variables', ylab="R Squared", log="y")
abline(v=10, col="red")
plot(2:100, trainerr, xlab='Number of Variables', ylab="Training Error", log="y")
abline(v=10, col="red")
plot(2:100, testerr, xlab='Number of Variables', ylab="Test Error", log="y")
abline(v=10, col="red")
A Simulated Example
[Figure: three panels plotting R Squared, Training Error, and Test Error against Number of Variables, with a red vertical line at 10 variables.]
I The first 10 variables are related to the response; the remaining 90 are not.
I R² increases and training error decreases as more variables are added to the model.
I Test error is lowest when only the signal variables are in the model.
19 / 48
Bias and Variance
I As model complexity increases, the bias of β̂ – the average difference between β̂ and β, if we were to repeat the experiment a huge number of times – will decrease.
I But as complexity increases, the variance of β̂ – the amount by which the β̂’s will differ across experiments – will increase.
I The test error depends on both the bias and variance:
Test Error = Bias² + Variance.
I There is a bias-variance trade-off. We want a model that is sufficiently complex as to have not too much bias, but not so complex that it has too much variance.
20 / 48
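The variance side of this trade-off shows up clearly in a small simulation (a hedged sketch, not from the slides; names and sizes are illustrative). Both models below estimate the same coefficient without bias, but the more complex model's estimates vary far more across repeated experiments:

```r
# Repeat the experiment many times and track the estimate of the first
# coefficient from a 1-variable model versus a 25-variable model.
set.seed(1)
n <- 30
reps <- 500
est.small <- est.big <- numeric(reps)
for (r in 1:reps) {
  x <- matrix(rnorm(n * 25), ncol = 25)
  y <- 2 * x[, 1] + rnorm(n)                 # only x1 carries signal
  est.small[r] <- coef(lm(y ~ x[, 1]))[2]    # simple model
  est.big[r]   <- coef(lm(y ~ x))[2]         # complex model
}
c(var(est.small), var(est.big))              # variance grows with complexity
```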
A Really Fundamental Picture
21 / 48
Overfitting
I Fitting an overly complex model – a model that has too much variance – is known as overfitting.
I In the high-dimensional setting, when p ≫ n, we must work hard not to overfit the data.
I In particular, we must rely not on training error, but on test error, as a measure of model performance.
I How can we estimate the test error?
22 / 48
Training Set Versus Test Set
I Split samples into training set and test set.
I Fit model on training set, and evaluate on test set.
Q: Can there ever, under any circumstance, be sample overlap between the training and test sets?
A: Never!
23 / 48
Training Set Versus Test Set
I Split samples into training set and test set.
I Fit model on training set, and evaluate on test set.
You can’t peek at the test set until you are completely done with all aspects of model-fitting on the training set!
24 / 48
Training Set And Test Set
To get an estimate of the test error of a particular model on a future observation:
1. Split the samples into a training set and a test set.
2. Fit the model on the training set.
3. Evaluate its performance on the test set.
4. The test set error rate is an estimate of the model’s performance on a future observation.
But remember: no peeking!
25 / 48
Choosing Between Several Models
I In general, we will consider a lot of possible models – e.g. models with different levels of complexity. We must decide which model is best.
I We have split our samples into a training set and a test set. But remember: we can’t peek at the test set until we have completely finalized our choice of model!
I We must pick a best model based on the training set, but we want a model that will have low test error!
I How can we estimate test error using only the training set?
1. The validation set approach.
2. Leave-one-out cross-validation.
3. K-fold cross-validation.
I In what follows, assume we have split the data into a training set and a test set, and the training set contains n observations.
26 / 48
Validation Set Approach
Split the n observations into two sets of approximately equal size. Train on one set, and evaluate performance on the other.
27 / 48
Validation Set Approach
For a given model, we perform the following procedure:
1. Split the observations into two sets of approximately equal size, a training set and a validation set.
a. Fit the model using the training observations. Let β̂(train) denote the regression coefficient estimates.
b. For each observation in the validation set, compute the test error, eᵢ = (yᵢ − xᵢᵀβ̂(train))².
2. Calculate the total validation set error by summing the eᵢ’s over all of the validation set observations.
Out of a set of candidate models, the “best” one is the one for which the total error is smallest.
28 / 48
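The procedure above can be sketched in a few lines of R (simulated data; all variable names are illustrative):

```r
# Validation set approach: fit on half the observations, score on the rest.
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- drop(x %*% c(rep(1, 5), rep(0, 5)) + rnorm(100))
dat <- data.frame(x = x, y = y)
train <- sample(100, 50)                       # indices of the training half
mod <- lm(y ~ ., data = dat[train, ])          # fit on the training half
pred <- predict(mod, newdata = dat[-train, ])  # predict the validation half
val.err <- sum((dat$y[-train] - pred)^2)       # total validation set error
print(val.err)
```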
Leave-One-Out Cross-Validation
Fit n models, each on n − 1 of the observations. Evaluate each model on the left-out observation.
29 / 48
Leave-One-Out Cross-Validation
For a given model, we perform the following procedure:
1. For i = 1, . . . , n:
a. Fit the model using observations 1, . . . , i − 1, i + 1, . . . , n. Let β̂(i) denote the regression coefficient estimates.
b. Compute the test error, eᵢ = (yᵢ − xᵢᵀβ̂(i))².
2. Calculate ∑ᵢ₌₁ⁿ eᵢ, the total CV error.
Out of a set of candidate models, the “best” one is the one for which the total error is smallest.
30 / 48
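A direct (if slow) implementation of leave-one-out CV is just a loop over the n observations (a sketch with simulated data; in practice boot::cv.glm or the hat-matrix shortcut avoids refitting n times):

```r
# Leave-one-out CV: fit n models, each leaving out one observation,
# and score each model on its held-out observation.
set.seed(1)
n <- 50
x <- matrix(rnorm(n * 5), ncol = 5)
y <- drop(x %*% rep(1, 5) + rnorm(n))
dat <- data.frame(x = x, y = y)
e <- numeric(n)
for (i in 1:n) {
  mod <- lm(y ~ ., data = dat[-i, ])                      # fit without observation i
  e[i] <- (dat$y[i] - predict(mod, newdata = dat[i, ]))^2 # error on observation i
}
cv.err <- sum(e)                                          # total CV error
print(cv.err)
```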
5-Fold Cross-Validation
Split the observations into 5 sets. Repeatedly train the model on 4 sets and evaluate its performance on the 5th.
31 / 48
K-fold cross-validation
A generalization of leave-one-out cross-validation. For a given model, we perform the following procedure:
1. Split the n observations into K equally-sized folds.
2. For k = 1, . . . ,K :
a. Fit the model using the observations not in the kth fold.
b. Let eₖ denote the test error for the observations in the kth fold.
3. Calculate ∑ₖ₌₁ᴷ eₖ, the total CV error.
Out of a set of candidate models, the “best” one is the one for which the total error is smallest.
32 / 48
An Example In R
library(boot)   # provides cv.glm
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
cv.err <- NULL
for(i in 2:50){
dat <- data.frame(x=xtr[,1:i],y=ytr)
mod <- glm(y~.,data=dat)
cv.err <- c(cv.err, cv.glm(dat,mod,K=6)$delta[1])
}
plot(2:50, cv.err, xlab="Number of Variables",
ylab="6-Fold CV Error", log="y")
abline(v=10, col="red")
33 / 48
Output of R Code
[Figure: 6-Fold CV Error (log scale) versus Number of Variables (10–50), with a red vertical line at 10 variables; the CV error is smallest near 10 variables.]
I Six-fold CV identifies the model with just over ten predictors.
I First ten predictors contain signal, and the rest are noise.
34 / 48
After Estimating the Test Error on the Training Set...
After we estimate the test error using the training set, we refit the “best” model on all of the available training observations. We then evaluate this model on the test set.
35 / 48
Big Picture
36 / 48
Let’s Try Out Cross-Validation in R!
Chapter 5 R lab
First Half: Cross-Validation
www.statlearning.com
41 / 48
Summary: Four-Step Procedure
1. Split observations into training set and test set.
2. Fit a bunch of models on the training set, and estimate the test error, using cross-validation or the validation set approach.
3. Refit the best model on the full training set.
4. Evaluate the model’s performance on the test set.
42 / 48
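Putting the four steps together, here is one hedged end-to-end sketch (simulated data; the candidate "models" differ only in how many of the variables they use, and all names are illustrative):

```r
library(boot)                              # for cv.glm
set.seed(1)
x <- matrix(rnorm(150 * 20), ncol = 20)
y <- drop(x %*% c(rep(1, 5), rep(0, 15)) + rnorm(150))
dat <- data.frame(x = x, y = y)
test <- sample(150, 50)                    # 1. hold out a test set
cv.err <- sapply(1:20, function(q) {       # 2. estimate test error by 5-fold CV
  dq <- dat[-test, c(1:q, 21)]             #    training set, first q variables + y
  cv.glm(dq, glm(y ~ ., data = dq), K = 5)$delta[1]
})
best.q <- which.min(cv.err)                #    pick the best model
mod <- glm(y ~ ., data = dat[-test, c(1:best.q, 21)])  # 3. refit on full training set
pred <- predict(mod, newdata = dat[test, ])
test.mse <- mean((dat$y[test] - pred)^2)   # 4. evaluate on the test set
print(c(best.q, test.mse))
```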
Why All the Bother?
Q: Why do I need to have a separate test set? Why can’t I just estimate test error using cross-validation or a validation set approach using all the observations, and then be done with it?
A: In general, we are choosing between a whole bunch of models, and we will use the cross-validation or validation set approach to pick between these models. If we use the resulting estimate as a final estimate of test error, then this could be an extreme underestimate, because one model might give a lower estimated test error than others by chance. To avoid having an extreme underestimate of test error, we need to evaluate the “best” model obtained on an independent test set. This is particularly important in high dimensions!!
43 / 48
Regression in High Dimensions
I We usually cannot perform least squares regression to fit a model in the high-dimensional setting, because we will get zero training error but a terrible test error.
I Instead, we must fit a less complex model, e.g. a model with fewer variables.
I We will consider five ways to fit less complex models:
1. Variable Pre-Selection
2. Forward Stepwise Regression
3. Ridge Regression
4. Lasso Regression
5. Principal Components Regression
I These are alternatives to fitting a linear model using least squares.
I Each of these approaches will allow us to choose the level of complexity – e.g. the number of variables in the model.
44 / 48
Regression in High Dimensions
I We usually cannot perform least squares regression to fit a model in the high-dimensional setting, because we will get zero training error but a terrible test error.
I Instead, we must fit a less complex model, e.g. a model with fewer variables.
If you
I fit your model carelessly;
I do not properly estimate the test error;
I or select a model based on training set rather than test set performance;
then you will woefully overfit your training data, leading to a model that looks good on training data but will perform atrociously on future observations.
Our intuition breaks down in high dimensions, and so rigorous model-fitting is crucial.
45 / 48
The Curse of Dimensionality
Q: A data set with more variables is better than a data set withfewer variables, right?
A: Not necessarily!
Noise variables – such as genes whose expression levels are not truly associated with the response being studied – will simply increase the risk of overfitting, and the difficulty of developing an effective model that will perform well on future observations.
On the other hand, more signal variables – variables that are truly associated with the response being studied – are always useful!
46 / 48
Wise Words
In high-dimensional data analysis, common mistakes are simple, and simple mistakes are common.
– Keith Baggerly
47 / 48
Before You’re Done Your Analysis
I Estimate the test error.
I Do a “sanity check” whenever possible.
I “Spot-check” the variables that have the largest coefficients in the model.
I Rewrite your code from scratch. Do you get the same answer again?
Fitting models in high dimensions: one mistake away from disaster!
48 / 48
Variable Pre-Selection
Subset Selection
Ridge Regression
Lasso Regression
High-Dimensional Statistical Learning: Regression
Ali Shojaie
University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
1 / 45
Linear Models in High Dimensions
I When p is large, least squares regression will lead to very low training error but terrible test error.
I We will now see some approaches for fitting linear models in high dimensions, p ≫ n.
I These approaches also work well when p ≈ n or n > p.
2 / 45
Motivating example
I We would like to build a model to predict survival time for breast cancer patients using a number of clinical measurements (tumor stage, tumor grade, tumor size, patient age, etc.) as well as some biomarkers.
I For instance, these biomarkers could be:
I the expression levels of genes measured using a microarray.
I protein levels.
I mutations in genes potentially implicated in breast cancer.
I How can we develop a model with low test error in this setting?
3 / 45
Remember
I We have n training observations.
I Our goal is to get a model that will perform well on future test observations.
I We’ll incur some bias in order to reduce variance.
4 / 45
Variable Pre-Selection
The simplest approach for fitting a model in high dimensions:
1. Choose a small set of variables, say the q variables that are most correlated with the response, where q < n and q < p.
2. Use least squares to fit a model predicting y using only these q variables.
This approach is simple and straightforward.
5 / 45
Variable Pre-Selection in R
xtr <- matrix(rnorm(100*100),ncol=100)
beta <- c(rep(1,10),rep(0,90))
ytr <- xtr%*%beta + rnorm(100)
cors <- cor(xtr,ytr)
whichers <- which(abs(cors)>.2)
mod <- lm(ytr~xtr[,whichers])
print(summary(mod))
6 / 45
How Many Variables to Use?
I We need a way to choose q, the number of variables used in the regression model.
I We want the q that minimizes the test error.
I For a range of values of q, we can perform the validation set approach, leave-one-out cross-validation, or K-fold cross-validation in order to estimate the test error.
I Then choose the value of q for which the estimated test error is smallest.
7 / 45
Estimating the Test Error For a Given q
This is the right way to estimate the test error using the validation set approach:
1. Split the observations into a training set and a validation set.
2. Using the training set only:
a. Identify the q variables most associated with the response.
b. Use least squares to fit a model predicting y using those q variables.
c. Let β̂1, . . . , β̂q denote the resulting coefficient estimates.
3. Use β̂1, . . . , β̂q obtained on the training set to predict the response on the validation set, and compute the validation set MSE.
8 / 45
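Here is a hedged sketch of the right way in R (simulated data; q = 5 is an arbitrary illustrative choice). The key point is that the screening step uses the training half only:

```r
# Right way: pre-select variables using ONLY the training observations.
set.seed(1)
x <- matrix(rnorm(100 * 50), ncol = 50)
y <- drop(x %*% c(rep(1, 5), rep(0, 45)) + rnorm(100))
train <- sample(100, 50)
q <- 5
cors <- abs(cor(x[train, ], y[train]))          # screen on the training half only
keep <- order(cors, decreasing = TRUE)[1:q]     # the q most correlated variables
dat <- data.frame(x = x[, keep], y = y)
mod <- lm(y ~ ., data = dat[train, ])           # least squares on those q variables
pred <- predict(mod, newdata = dat[-train, ])
val.mse <- mean((dat$y[-train] - pred)^2)       # validation set MSE
print(val.mse)
```

Moving the two screening lines above the split, so that they see all 100 observations, would reproduce the wrong way described next.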
Estimating the Test Error For a Given q
This is the wrong way to estimate the test error using the validation set approach:
1. Identify the q variables most associated with the response on the full data set.
2. Split the observations into a training set and a validation set.
3. Using the training set only:
a. Use least squares to fit a model predicting y using those q variables.
b. Let β̂1, . . . , β̂q denote the resulting coefficient estimates.
4. Use β̂1, . . . , β̂q obtained on the training set to predict the response on the validation set, and compute the validation set MSE.
9 / 45
Frequently Asked Questions
I Q: Does it really matter how you estimate the test error?
A: Yes.
I Q: Would anyone make such a silly mistake?
A: Yes.
10 / 45
A Better Approach
I The variable pre-selection approach is simple and easy to implement — all you need is a way to calculate correlations, and software to fit a linear model using least squares.
I But it might not work well: just because a bunch of variables are correlated with the response doesn’t mean that, when used together in a linear model, they will predict the response well.
I What we really want to do: pick the q variables that best predict the response.
Subset Selection
Several Approaches:
- Best Subset Selection: consider all subsets of predictors. Computational mess!
- Stepwise Regression: greedily add/remove predictors. Heuristic and potentially inefficient.
- Modern Penalized Methods
Best Subset Selection
- We would like to consider all possible models using a subset of the p predictors.
- In other words, we'd like to consider all 2^p possible models.
- This is called best subset selection.
- Unfortunately, this is computationally intractable:
  - When p = 3, 2^p = 8.
  - When p = 6, 2^p = 64.
  - When p = 250, there are 2^250 ≈ 2 × 10^75 possible models. For comparison, www.universetoday.com puts the number of atoms in the known universe at around 10^80.
- Not feasible to consider so many models!
- Need an efficient way to sift through all of these models: forward stepwise regression.
Forward Stepwise Regression
1. Use least squares to fit p univariate regression models, and select the predictor corresponding to the best model (according to e.g. training set MSE).
2. Use least squares to fit p − 1 models containing that one predictor, and each of the p − 1 other predictors. Select the predictors in the best two-variable model.
3. Now use least squares to fit p − 2 models containing those two predictors, and each of the p − 2 other predictors. Select the predictors in the best three-variable model.
4. And so on....
This gives us a nested set of models, containing the predictors M1 ⊆ M2 ⊆ M3 ⊆ . . . .
Forward Stepwise Regression With p = 10
Example in R
set.seed(1)                                # for reproducibility (added)
xtr <- matrix(rnorm(100*100), ncol=100)    # n = 100 observations, p = 100 predictors
beta <- c(rep(1,10), rep(0,90))            # only the first 10 predictors matter
ytr <- xtr %*% beta + rnorm(100)
library(leaps)
out <- regsubsets(xtr, ytr, nvmax=30, method="forward")
print(summary(out))
print(coef(out, 1:10))
Which Value of q is Best?
- This procedure traces out a set of models, containing between 1 and p variables.
- The qth model contains q variables, given by the set Mq.
- Q: Which value of q is best?
  A: The one that minimizes the test error!
- We can select the value of q using cross-validation or the validation set approach.
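As a sketch, q can be chosen with the validation set approach by computing the validation MSE of each model size fit by regsubsets (simulated data; all names are illustrative):

```r
library(leaps)
set.seed(1)
x <- matrix(rnorm(100 * 50), ncol = 50)
colnames(x) <- paste0("V", 1:50)          # named columns so coef() is easy to match
y <- x[, 1:5] %*% rep(1, 5) + rnorm(100)

train <- sample(100, 50)
fit <- regsubsets(x[train, ], y[train], nvmax = 20, method = "forward")

# Validation MSE for each model size q = 1, ..., 20.
val.mse <- sapply(1:20, function(q) {
  cf   <- coef(fit, id = q)               # intercept + the q selected coefficients
  vars <- names(cf)[-1]                   # names of the selected variables
  pred <- cbind(1, x[-train, vars, drop = FALSE]) %*% cf
  mean((y[-train] - pred)^2)
})
best.q <- which.min(val.mse)
print(best.q)
```

Having chosen best.q this way, one would then rerun forward stepwise on the full data set to obtain the final best.q-variable model.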
Drawback of Forward Stepwise Selection
- Forward stepwise selection isn't guaranteed to give you the best model containing q variables.
- To get the best model with q variables, you'd need to consider every possible one; computationally intractable.
- For instance, suppose that the best model with one variable is
  y = β3X3 + ε
  and the best model with two variables is
  y = β4X4 + β8X8 + ε.
  Then forward stepwise selection will not identify the best two-variable model.
- Q: Does this really happen in practice? A: Yes.
How To Do Forward Stepwise?
Wrong: Split the data into a training set and a validation set. Perform forward stepwise on the training set, and identify the model with best performance on the validation set. Then, refit the model (using those q variables) on the full data set.
Right: Split the data into a training set and a validation set. Perform forward stepwise on the training set, and identify the value of q corresponding to the best-performing model on the validation set. Then, perform forward stepwise selection in order to obtain a q-variable model on the full data set.
Bottom Line: We estimate the test error in order to choose the correct level of model complexity. Then we refit the model on the full data set.
Let’s Try It Out in R!
Chapter 6 R Lab, Part 1: www.statlearning.com
Ridge Regression and the Lasso
- Best subset and stepwise regression control model complexity by using subsets of the predictors.
- Ridge regression and the lasso instead control model complexity by using an alternative to least squares that shrinks the regression coefficients.
- This is known as regularization or penalization.
- Hot area in statistical machine learning today.
Crazy Coefficients
- When p > n, some of the variables are highly correlated.
- Why does correlation matter?
  - Suppose that X1 and X2 are highly correlated with each other... assume X1 = X2 for the sake of argument.
  - And suppose that the least squares model is
    y = X1 − 2X2 + 3X3.
  - Then this is also a least squares model:
    y = 100000001X1 − 100000002X2 + 3X3.
- Bottom Line: When there are too many variables, the least squares coefficients can get crazy!
- This craziness is directly responsible for poor test error.
- It amounts to too much model complexity.
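The duplicated-predictor argument above can be checked numerically; a small R sketch (values illustrative):

```r
# With duplicated predictors, wildly different coefficient vectors
# give (essentially) the same fitted values.
set.seed(1)
x1 <- rnorm(20); x2 <- x1; x3 <- rnorm(20)   # x2 is an exact copy of x1
X  <- cbind(x1, x2, x3)

b.sane  <- c(1, -2, 3)
b.crazy <- c(1 + 1e8, -2 - 1e8, 3)   # shift a huge amount between the duplicates

yhat1 <- X %*% b.sane
yhat2 <- X %*% b.crazy
print(max(abs(yhat1 - yhat2)))        # essentially zero (floating-point rounding only)
```

Both coefficient vectors fit the training data equally well, which is exactly why least squares cannot choose between them when p is large.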
A Solution: Don’t Let the Coefficients Get Too Crazy
- Recall that least squares involves finding β that minimizes
  ‖y − Xβ‖².
- Ridge regression involves finding β that minimizes
  ‖y − Xβ‖² + λ ∑j βj².
- Equivalently, find β that minimizes
  ‖y − Xβ‖²
  subject to the constraint that ∑j βj² ≤ s.
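One reason ridge regression is computationally attractive: unlike best subset selection, its minimizer has a well-known closed form (a standard fact, stated here for reference; it is not on the slide):

```latex
% Setting the gradient of \|y - X\beta\|^2 + \lambda \sum_j \beta_j^2 to zero gives
\hat{\beta}^R_\lambda = (X^\top X + \lambda I)^{-1} X^\top y .
% For \lambda > 0, the matrix X^\top X + \lambda I is always invertible,
% even when p > n; this is one reason ridge regression is well defined
% in high dimensions while ordinary least squares is not.
```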
Ridge Regression
- Ridge regression coefficient estimates minimize
  ‖y − Xβ‖² + λ ∑j βj².
- Here λ is a nonnegative tuning parameter that shrinks the coefficient estimates.
- When λ = 0, ridge regression is just the same as least squares.
- As λ increases, ∑j (βRλ,j)² decreases, i.e. the coefficients become shrunken towards zero.
- When λ = ∞, βRλ = 0.
Ridge Regression As λ Varies
Ridge Regression In Practice
- Perform ridge regression for a very fine grid of λ values.
- Use cross-validation or the validation set approach to select the optimal value of λ, that is, the best level of model complexity.
- Perform ridge on the full data set, using that value of λ.
Example in R
set.seed(1)                                # for reproducibility (added)
xtr <- matrix(rnorm(100*100), ncol=100)
beta <- c(rep(1,10), rep(0,90))
ytr <- xtr %*% beta + rnorm(100)
library(glmnet)
cv.out <- cv.glmnet(xtr, ytr, alpha=0, nfolds=5)   # alpha=0 gives the ridge penalty
print(cv.out$cvm)
plot(cv.out)
cat("CV Errors", cv.out$cvm, fill=TRUE)
cat("Lambda with smallest CV Error",
    cv.out$lambda[which.min(cv.out$cvm)], fill=TRUE)
cat("Coefficients", as.numeric(coef(cv.out)), fill=TRUE)
cat("Number of Zero Coefficients",
    sum(abs(coef(cv.out)) < 1e-8), fill=TRUE)
R Output
[Figure: cross-validated mean-squared error from plot(cv.out), plotted against log(λ).]
Drawbacks of Ridge
- Ridge regression is a simple idea and has a number of attractive properties: for instance, you can continuously control model complexity through the tuning parameter λ.
- But it suffers in terms of model interpretability, since the final model contains all p variables, no matter what.
- Often we want a simpler model involving a subset of the features.
- The lasso involves performing a little tweak to ridge regression so that the resulting model contains mostly zeros.
- In other words, the resulting model is sparse. We say that the lasso performs feature selection.
- The lasso is a very active area of research interest in the statistical community!
The Lasso
- The lasso involves finding β that minimizes
  ‖y − Xβ‖² + λ ∑j |βj|.
- Equivalently, find β that minimizes
  ‖y − Xβ‖²
  subject to the constraint that ∑j |βj| ≤ s.
- So lasso is just like ridge, except that βj² has been replaced with |βj|.
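To see why replacing βj² with |βj| produces exact zeros, consider the textbook special case of an orthonormal design (a simplifying assumption made here for illustration; it is not on the slide). Each lasso coordinate is then a soft-thresholded least squares estimate:

```latex
% With X^\top X = I, the objective separates across coordinates and
\hat{\beta}^L_{\lambda,j}
  = \operatorname{sign}\!\bigl(\hat{\beta}^{OLS}_j\bigr)
    \Bigl(\bigl|\hat{\beta}^{OLS}_j\bigr| - \lambda/2\Bigr)_{+},
% so any coefficient with |\hat{\beta}^{OLS}_j| \le \lambda/2 is set exactly to zero.
% Ridge's analogous solution, \hat{\beta}^{OLS}_j / (1+\lambda),
% shrinks every coefficient but never reaches zero.
```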
The Lasso
- Lasso is a lot like ridge:
  - λ is a nonnegative tuning parameter that controls model complexity.
  - When λ = 0, we get least squares.
  - When λ is very large, we get βLλ = 0.
- But unlike ridge, lasso will give some coefficients exactly equal to zero for intermediate values of λ!
Lasso As λ Varies
Lasso In Practice
- Perform the lasso for a very fine grid of λ values.
- Use cross-validation or the validation set approach to select the optimal value of λ, that is, the best level of model complexity.
- Perform the lasso on the full data set, using that value of λ.
Example in R
set.seed(1)                                # for reproducibility (added)
xtr <- matrix(rnorm(100*100), ncol=100)
beta <- c(rep(1,10), rep(0,90))
ytr <- xtr %*% beta + rnorm(100)
library(glmnet)
cv.out <- cv.glmnet(xtr, ytr, alpha=1, nfolds=5)   # alpha=1 gives the lasso penalty
print(cv.out$cvm)
plot(cv.out)
cat("CV Errors", cv.out$cvm, fill=TRUE)
cat("Lambda with smallest CV Error",
    cv.out$lambda[which.min(cv.out$cvm)], fill=TRUE)
cat("Coefficients", as.numeric(coef(cv.out)), fill=TRUE)
cat("Number of Zero Coefficients",
    sum(abs(coef(cv.out)) < 1e-8), fill=TRUE)
R Output
[Figure: cross-validated mean-squared error from plot(cv.out), plotted against log(λ); the numbers across the top of the plot (100, 100, 98, 96, . . . , 9, 2) give the number of nonzero coefficients at each value of λ.]
Ridge and Lasso: A Geometric Interpretation
Pros/Cons of Each Approach
Approach           Simplicity?*   Sparsity?**   Predictions?***
Pre-Selection      Good           Yes           So-So
Forward Stepwise   Good           Yes           So-So
Ridge              Medium         No            Great
Lasso              Bad            Yes           Great

* How simple is this model-fitting procedure? If you were stranded on a desert island with pretty limited statistical software, could you fit this model?
** Does this approach perform feature selection, i.e. is the resulting model sparse?
*** How good are the predictions resulting from this model?
No “Best” Approach
- There is no "best" approach to regression in high dimensions.
- Some approaches will work better than others. For instance:
  - Lasso will work well if it's really true that just a few features are associated with the response.
  - Ridge will do better if all of the features are associated with the response.
- If somebody tells you that one approach is "best"... then they are mistaken. Politely contradict them.
- While no approach is "best", some approaches are wrong (e.g.: there is a wrong way to do cross-validation)!
Let’s Try It Out in R!
Chapter 6 R Lab, Part 2: www.statlearning.com
Making Linear Regression Less Linear
What if the relationship isn't linear?
y = 3 sin(x) + ε
y = 2e^x + ε
y = 3x² + 2x + 1 + ε
If we know the functional form we can still use "linear regression".
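For instance, the first model above can still be fit by least squares, simply by regressing on the transformed feature; a sketch on simulated data:

```r
# y = 3 sin(x) + noise: "linear regression" with sin(x) as the feature.
set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- 3 * sin(x) + rnorm(200, sd = 0.3)

fit <- lm(y ~ sin(x))   # linear in the transformed feature sin(x)
print(coef(fit))        # the slope estimate should be close to 3
```

The model is nonlinear in x but linear in sin(x), which is all that least squares requires.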
Making Linear Regression Less Linear
y = 3 sin(x) + ε:   x → sin(x)
y = 3x² + 2x + 1 + ε:   x → (x, x²)
Making Linear Regression Less Linear
What if we don’t know the right functional form?
Use a flexible basis expansion:
- polynomial basis: x → (x, x², . . . , x^k)
- hockey-stick (/spline) basis: x → (x, (x − t1)+, . . . , (x − tk)+)
Making Linear Regression Less Linear
[Figure: noisy data from a sine curve, with fitted curves from the sin basis and a polynomial basis (legend: sin, poly).]
Making Linear Regression Less Linear
For high-dimensional problems, expand each variable:
(x1, x2, . . . , xp) → (x1, . . . , x1^k1, x2, . . . , x2^k2, . . . , xp, . . . , xp^kp)
and use the lasso on this expanded problem.
k must be small (∼ 5ish)
Spline basis generally outperforms polynomial
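A sketch of this recipe in R, using a natural-spline basis from the splines package and the lasso via cv.glmnet (simulated data; k = 5 and all names are illustrative):

```r
library(splines)
library(glmnet)
set.seed(1)
n <- 200; p <- 10
x <- matrix(runif(n * p), ncol = p)
y <- 3 * sin(2 * pi * x[, 1]) + x[, 2]^2 + rnorm(n, sd = 0.3)

# Expanded design matrix: k spline basis columns per original variable.
k <- 5
xbig <- do.call(cbind, lapply(1:p, function(j) ns(x[, j], df = k)))

# Lasso on the expanded (n x p*k) problem.
cv.out <- cv.glmnet(xbig, y, alpha = 1, nfolds = 5)
print(sum(coef(cv.out) != 0))   # only a few of the p*k columns should survive
```

The lasso then selects both which variables matter and which basis functions of each variable to keep.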
Bottom Line
Much more important than what model you fit is how you fit it.
- Was cross-validation performed properly?
- Did you select a model (or level of model complexity) based on an estimate of test error?
Logistic Regression | K-Nearest Neighbors | Bayes-Based Classifiers | SVM
High-Dimensional Statistical Learning: Classification
Ali Shojaie, University of Washington
http://faculty.washington.edu/ashojaie/
August 2018, Tehran, Iran
Classification
- Regression involves predicting a continuous-valued response, like tumor size.
- Classification involves predicting a categorical response:
  - Cancer versus Normal
  - Tumor Type 1 versus Tumor Type 2 versus Tumor Type 3
- Classification problems tend to occur very frequently in the analysis of high-dimensional data.
- Just like regression,
  - Classification cannot be blindly performed in high dimensions because you will get zero training error but awful test error;
  - Properly estimating the test error is crucial; and
  - There are a few tricks to extend classical classification approaches to high dimensions, which we have already seen in the regression context!
2 / 50
Classification
I There are many approaches out there for performing classification.
I We will discuss a few: logistic regression, K-nearest neighbors, discriminant analysis, and support vector machines.
3 / 50
Logistic Regression
I Logistic regression is the straightforward extension of linear regression to the classification setting.
I For simplicity, suppose y ∈ {0, 1}: a two-class classification problem.
I The simple linear model

    y = Xβ + ε

  doesn't make sense for classification.
I Instead, the logistic regression model is

    P(y = 1 | X) = exp(X^T β) / (1 + exp(X^T β)).

I We usually fit this model using maximum likelihood — like least squares, but for logistic regression.
4 / 50
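In R, the maximum-likelihood fit above is a one-liner with glm(); the simulated data and names below are illustrative.

```r
# simulate a two-class problem where P(y = 1 | x) follows the logistic model
set.seed(1)
n <- 500
x <- rnorm(n)
p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))
y <- rbinom(n, 1, p)

fit <- glm(y ~ x, family = binomial)  # maximum-likelihood fit
coef(fit)                             # estimates of (beta0, beta1)
```

Note that the fitted probabilities from this model always lie in (0, 1), unlike those from a linear model.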
Why Not Linear Regression?
[Figure: fitted probability of cancer versus biomarker level, two panels.]
I Left: linear regression.
I Right: logistic regression.
5 / 50
Ways to Extend Logistic Regression to High Dimensions
1. Variable Pre-Selection
2. Forward Stepwise Logistic Regression
3. Ridge Logistic Regression
4. Lasso Logistic Regression
How to decide which approach is best, and which tuning parameter value to use for each approach? Cross-validation or the validation set approach.
6 / 50
What is an appropriate validation measure?
For classification without a probability or score:
I Misclassification rate:

    (# test samples misclassified) / (total # of test samples)
7 / 50
What is an appropriate validation measure?
For probabilistic classification:
I Can still use the misclassification rate.
I As in continuous regression, could use the SSE:

    Σ_{i ∈ test} (y_i − p_i)^2

I Often preferable to use the “predictive [log-]likelihood”:

    −log [ Π_{i ∈ test} p_i^{y_i} (1 − p_i)^{1 − y_i} ]

I Can also use an ROC-curve-based metric (e.g. AUC)
Remember though: all of these must be computed on a separate validation set.
8 / 50
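The measures above are easy to compute by hand; yte and pte below are illustrative test-set labels and predicted probabilities.

```r
yte <- c(1, 0, 1, 1, 0)            # true 0/1 labels on the test set
pte <- c(0.9, 0.2, 0.6, 0.4, 0.1)  # predicted probabilities

misclass <- mean((pte >= 0.5) != yte)                        # misclassification rate
sse      <- sum((yte - pte)^2)                               # SSE on probabilities
negll    <- -sum(yte * log(pte) + (1 - yte) * log(1 - pte))  # predictive -log-likelihood
```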
Example in R: Lasso Logistic Regression
library(glmnet)  # for cv.glmnet

xtr <- matrix(rnorm(1000 * 20), ncol = 20)            # 1000 training observations, 20 features
beta <- c(rep(1, 5), rep(0, 15))                      # only the first 5 features matter
ytr <- 1 * ((xtr %*% beta + 0.5 * rnorm(1000)) >= 0)  # binary response
cv.out <- cv.glmnet(xtr, ytr, family = "binomial", alpha = 1)  # lasso (alpha = 1)
plot(cv.out)
9 / 50
K -Nearest Neighbors
I Can I take a totally non-parametric (model-free) approach to classification?
I K-nearest neighbors:
  1. Identify the K observations whose X values are closest to the observation at which we want to make a prediction.
  2. Classify the observation of interest to the most frequent class label of those K nearest neighbors.
10 / 50
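The two steps above are implemented by knn() in the class package (part of the standard R distribution); the data here are simulated and the names illustrative.

```r
library(class)  # provides knn()

set.seed(1)
xtr <- matrix(rnorm(100 * 2), ncol = 2)                   # training features
ytr <- factor(ifelse(xtr[, 1] + xtr[, 2] > 0, "A", "B"))  # training labels
xte <- rbind(c(2, 2), c(-2, -2))                          # two test points

# find the K = 5 nearest training points and take a majority vote
pred <- knn(train = xtr, test = xte, cl = ytr, k = 5)
```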
K -Nearest Neighbors
[Figure: simulated two-class data plotted in (X1, X2).]
11 / 50
K -Nearest Neighbors
[Figure: illustration of a KNN prediction for a test observation.]
12 / 50
K -Nearest Neighbors
[Figure: KNN decision boundary on the (X1, X2) data, K = 10.]
13 / 50
K -Nearest Neighbors
[Figure: KNN decision boundaries with K = 1 (left) and K = 100 (right).]
14 / 50
K -Nearest Neighbors
[Figure: training and test error rates versus 1/K.]
15 / 50
K -Nearest Neighbors
I Simple, intuitive, model-free.
I Good option when p is very small.
I Curse of dimensionality: when p is large, no neighbors are “near”. All observations are close to the boundary.
I Do not use in high dimensions!
  I unless you use it with a dimension reduction approach
16 / 50
Bayes-Based Classifiers
Suppose rather than knowing P(y = j | x)...
we have information on f_j(x) = P(x | y = j), the feature distribution within each class.
How do we use this to make predictions?
Using Bayes' Theorem:

    P(y = j | x) = f_j(x) π_j / Σ_k f_k(x) π_k

here π_k = P(y = k) is the prior probability of class k.
17 / 50
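A numeric sketch of the rule above, with K = 2 Gaussian classes (all values illustrative):

```r
x   <- 1.0                         # point at which we want P(y = j | x)
pi1 <- 0.7; pi2 <- 0.3             # prior probabilities
f1  <- dnorm(x, mean = 0, sd = 1)  # f_1(x): class-1 feature density at x
f2  <- dnorm(x, mean = 2, sd = 1)  # f_2(x): class-2 feature density at x

post1 <- f1 * pi1 / (f1 * pi1 + f2 * pi2)  # Bayes' theorem
```

Here x = 1 is equidistant from the two class means, so f_1(x) = f_2(x) and the posterior reduces to the prior: post1 = 0.7.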
Estimating the Rule
To apply Bayes' Theorem

    P(y = j | x) = f_j(x) π_j / Σ_k f_k(x) π_k

we need
I f_k(x) for k = 1, . . . , K
I π_k for k = 1, . . . , K
18 / 50
Estimating πk
π_k is generally simple to estimate.
I If your data are a random sample, then we can use the sample proportion

    π_k = #{y_i = k} / n

I Otherwise we can use outside information (e.g. historical data)
If the population proportions change, it is easy to adjust the rule.
19 / 50
Estimating fk(x)
Estimating f_k(x) = P(x | y = k) is more difficult.
This is a density estimation problem.
The tools we discuss for this break down into 3 general categories:
I flexible, non-parametric estimates
I parametric estimates
I shrunken parametric estimates
The above are ordered (more or less) by where they fall on the bias/variance spectrum:

    more flexible → less bias / more variance
20 / 50
Parametric fk(x) Estimate
The best-known estimator of this type is Linear/Quadratic Discriminant Analysis:
Here we assume that f_k(x) is a Gaussian density, N(μ_k, Σ_k).
21 / 50
Discriminant Analysis
There are three main types of unpenalized discriminant analysis:
I Quadratic (QDA)
I Linear (LDA)
I Diagonal (DDA)
These make different assumptions on the covariance structure:
I QDA makes no assumptions
I LDA assumes a pooled covariance: Σ_k = Σ for all k
I DDA assumes a pooled covariance, and further that Σ is diagonal (i.e. no correlation among covariates!)
22 / 50
Discriminant Analysis
Why would we choose DDA over QDA?
Remember, flexibility comes at a price!
QDA will have the least bias, but has many more parameters to estimate.
Often good estimates of the correlations don't improve classification much.
DDA takes into account the scale of each feature, and trades a bit of bias for potentially a large reduction in variance.
23 / 50
LDA for p = 1
I To make this work, we need to estimate the parameters. The ML estimates are given by π_k = n_k / n and

    μ_k = (1 / n_k) Σ_{i: y_i = k} x_i,    σ² = (1 / (n − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ_k)²

I The picture is very similar if K > 2... or if p > 1
24 / 50
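The ML estimates above can be computed by hand in base R; the tiny data set below is illustrative.

```r
x <- c(1, 2, 3, 7, 8, 9)  # one feature (p = 1)
y <- c(1, 1, 1, 2, 2, 2)  # class labels, K = 2
n <- length(x); K <- 2

nk  <- table(y)
pik <- nk / n                         # pi_k = n_k / n
muk <- tapply(x, y, mean)             # mu_k: class means
s2  <- sum((x - muk[y])^2) / (n - K)  # pooled variance estimate
```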
LDA for p > 1
25 / 50
LDA for p > 1
26 / 50
QDA vs LDA
The level-curves for each class look identical with LDA;
QDA allows different classes to have differently shaped ellipsoids...
This results in non-linear decision boundaries (quadratic, in fact).
27 / 50
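LDA and QDA are available in the MASS package (part of the standard R distribution); a minimal sketch on simulated two-class data:

```r
library(MASS)  # provides lda() and qda()

set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # class "a"
           matrix(rnorm(100, mean = 2), ncol = 2))  # class "b"
y <- factor(rep(c("a", "b"), each = 50))

fit.lda <- lda(x, grouping = y)  # pooled covariance: linear boundary
fit.qda <- qda(x, grouping = y)  # per-class covariances: quadratic boundary
pred    <- predict(fit.lda, x)$class
```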
DDA
For DDA...
I level curves are spheres (not ellipsoids).
I decision boundaries are still linear
I sometimes called naive Bayes (that doesn't mean it's bad, though!)
I with π_k = 1/K for all k, and equal variances (i.e. Σ = σI), this is just the nearest centroid classifier
28 / 50
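The nearest centroid special case above can be written in a few lines of base R (the function and data names are illustrative):

```r
# assign each test row to the class whose mean vector is closest
nearest.centroid <- function(xtr, ytr, xte) {
  cent <- apply(xtr, 2, function(col) tapply(col, ytr, mean))  # classes x features
  apply(xte, 1, function(z) {
    d2 <- rowSums(sweep(cent, 2, z)^2)  # squared distance to each centroid
    rownames(cent)[which.min(d2)]
  })
}

xtr <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
ytr <- c("a", "a", "b", "b")
nearest.centroid(xtr, ytr, rbind(c(0, 0), c(5, 5)))  # "a" then "b"
```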
-DA vs logistic regression
The Discriminant Analysis model can actually be rewritten as a multinomial logistic model:
Beginning with

    P(y = j | x) = f_j(x) π_j / Σ_k f_k(x) π_k

and

    f_k(x) ∝ exp[ −(1/2) (x − μ_k)^T Σ_k^{−1} (x − μ_k) ]

substituting and simplifying, we get

    P(y = j | x) = e^{η_j} / Σ_k e^{η_k}
29 / 50
-DA vs logistic regression
    P(y = j | x) = e^{η_j} / Σ_k e^{η_k}

where

    η_k = β_0 + x^T β + x^T Σ_k^{−1} x

This is just a multinomial logistic model with quadratic terms and interactions.
In particular, for LDA (where Σ_k = Σ is pooled) we have cancellation and get

    η_k = β_0 + x^T β

Simply a linear logistic model.
30 / 50
Shrunken Parametric Estimates
Sometimes the optimal bias/variance tradeoff is between two parametric classes.
For example: we may not have the data to estimate completely different covariance matrices for each class (i.e. QDA), but we may not want to use identical covariance matrices.
In this case we can take a weighted combination of our estimates. This is called regularized discriminant analysis.
This is a type of shrunken parametric estimator.
31 / 50
Regularized Discriminant Analysis
For shrinking between QDA and LDA we use:

    Σ_k^{RDA} = λ Σ^{LDA} + (1 − λ) Σ_k^{QDA}

For shrinking between LDA and Naive Bayes we use:

    Σ^{RDA} = λ Σ^{LDA} + (1 − λ) Σ^{NB}

λ is a tuning parameter, and is generally selected via CV.
32 / 50
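The shrinkage above is just a convex combination of covariance estimates; a numeric sketch with illustrative 2 x 2 matrices:

```r
Sig.lda <- matrix(c(1, 0.5, 0.5, 1), 2, 2)  # pooled (LDA-style) estimate
Sig.qda <- matrix(c(2, 0, 0, 0.5), 2, 2)    # class-k (QDA-style) estimate
lambda  <- 0.3                              # tuning parameter, chosen by CV

Sig.rda <- lambda * Sig.lda + (1 - lambda) * Sig.qda
```

Setting lambda = 1 recovers the LDA estimate, and lambda = 0 recovers the QDA estimate for this class.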
DA in High Dimensions
All of the Discriminant Analysis techniques discussed so far use all the features.
For high-dimensional problems this will lead to over-fitting.
One popular solution is to shrink each class-mean estimate μ_k towards the overall mean μ using element-wise soft-thresholding.
This method is called Nearest Shrunken Centroids (though it should probably more appropriately be called “nearest shrunken DDA”).
33 / 50
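Element-wise soft-thresholding has a one-line form; the function and values below are a sketch, with delta playing the role of the shrinkage amount:

```r
# move each class-mean coordinate towards the overall mean by delta,
# setting it exactly equal once it is within delta
soft.threshold <- function(mu.k, mu, delta) {
  d <- mu.k - mu
  mu + sign(d) * pmax(abs(d) - delta, 0)
}

soft.threshold(c(3, 0.5, -2), mu = 0, delta = 1)  # 2.0  0.0 -1.0
```

Coordinates shrunk all the way to the overall mean no longer help discriminate, which is how the method selects features.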
Regularized DA Methods
I Recall that in PAM, the μ_jk's are soft-thresholded towards the common mean μ_j.
I Alternatively, the μ_jk's can be penalized
  I towards zero, using a lasso penalty:

      Σ_j Σ_k |μ_jk|,

  I or towards each other, using a fused lasso penalty:

      Σ_j Σ_{k,k′} |μ_jk − μ_jk′|.

  Both of these are implemented in the R package penalizedLDA.
I Another option, which is especially helpful when using QDA, is to penalize the covariance matrices Σ_k (or their inverses).
34 / 50
Let’s Try It Out in R!
Chapter 4 R Lab: www.statlearning.com
35 / 50
Support Vector Machines
I Developed around 1995.
I Touted as “overcoming the curse of dimensionality.”
I Does not (automatically) overcome the curse of dimensionality!
I Fundamentally and numerically very similar to logistic regression.
I But, it is a nice idea.
36 / 50
Separating Hyperplane
37 / 50
Classification Via a Separating Hyperplane
Blue class if β0 + β1X1 + β2X2 > c; red class otherwise.
38 / 50
Maximal Separating Hyperplane
Note that only a few observations are on the margin: these are the support vectors.
39 / 50
What if There is No Separating Hyperplane?
40 / 50
Support Vector Classifier: Allow for Violations
41 / 50
Support Vector Machine
I The support vector machine is just like the support vector classifier, but it elegantly allows for non-linear expansions of the variables: “non-linear kernels”.
I However, linear regression, logistic regression, and other classical statistical approaches can also be applied to non-linear functions of the variables.
I For historical reasons, SVMs are more frequently used with non-linear expansions as compared to other statistical approaches.
42 / 50
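The second bullet above can be made concrete in base R: giving logistic regression a quadratic expansion of the variables lets it capture a non-linear (here circular) boundary, no SVM required. The data and names are illustrative.

```r
set.seed(1)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- as.numeric(x1^2 + x2^2 > 1.5)  # circular class boundary

# logistic regression on a quadratic expansion of (x1, x2)
fit <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2), family = binomial)
acc <- mean((fitted(fit) > 0.5) == y)  # training accuracy
```

Because the true boundary is exactly quadratic in the expanded features, the fit separates the classes almost perfectly (glm may warn about fitted probabilities of 0 or 1; that is expected here).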
Non-Linear Class Structure
This will be hard for a linear classifier!
43 / 50
Try a Support Vector Classifier
Uh-oh!!
44 / 50
Support Vector Machine
Much Better.
45 / 50
Is A Non-Linear Kernel Better?
I Yes, if the true decision boundary between the classes is non-linear, and you have enough observations (relative to the number of features) to accurately estimate the decision boundary.
I No, if you are in a very high-dimensional setting such that estimating a non-linear decision boundary is hopeless.
46 / 50
SVM vs Other Classification Methods
I The main difference between SVM and other classification methods (e.g. logistic regression) is the loss function used to assess the “fit”:

    Σ_{i=1}^{n} L(f(x_i), y_i)

I Zero-one loss: I(f(x_i) ≠ y_i), where I(·) is the indicator function. Not continuous, so hard to work with!!
I Hinge loss: max(0, 1 − f(x_i) y_i)
I Logistic loss: log(1 + exp(−f(x_i) y_i))
47 / 50
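The three losses above are easy to compare as functions of the margin m = f(x_i) y_i, with y coded as ±1 (a common convention; this is a sketch):

```r
zero.one <- function(m) as.numeric(m <= 0)  # 1 when the sign is wrong
hinge    <- function(m) pmax(0, 1 - m)      # SVM loss
logistic <- function(m) log(1 + exp(-m))    # logistic-regression loss

m <- c(-1, 0, 0.5, 2)
rbind(zero.one(m), hinge(m), logistic(m))
```

Hinge and logistic losses are continuous upper bounds on the zero-one loss, which is what makes them tractable to optimize.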
SVM vs Logistic Regression
I Bottom Line: the support vector classifier and logistic regression aren't that different!
I Neither they nor any other approach can overcome the “curse of dimensionality”.
I The “kernel trick” makes things computationally easier, but it does not remove the danger of overfitting.
I SVM uses a non-linear kernel... but we could do that with logistic or linear regression too!
I A disadvantage of SVM (compared to, e.g., logistic regression) is that it does not provide a measure of uncertainty: cases are “classified” to belong to one of the two classes.
48 / 50
Let’s Try It Out in R!
Chapter 9 R Lab: www.statlearning.com
49 / 50
In High Dimensions...
I In SVMs, a tuning parameter controls the amount of flexibility of the classifier.
I This tuning parameter acts like a ridge penalty, both mathematically and conceptually: the SVM problem can be written as a ridge-penalized problem with the hinge loss, so the decision rule involves all of the variables.
I A sparse SVM can be obtained by using a lasso penalty instead; this yields a decision rule involving only a subset of the features.
I Logistic regression and other classical statistical approaches can also be used with non-linear expansions of the features, but this makes high-dimensionality issues worse.
50 / 50
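The lasso-penalized ("sparse") SVM mentioned above can be sketched with a simple proximal subgradient method: take subgradient steps on the mean hinge loss, then soft-threshold the weights for the lasso penalty. This is a minimal standalone illustration (again in Python with only the standard library; the toy data, step size, and penalty level are made-up choices, not part of the slides):

```python
import random

def sparse_svm(X, y, lam=0.3, lr=0.05, epochs=200):
    """Lasso-penalized linear SVM: minimizes mean hinge loss + lam * sum(|w_j|)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point inside the margin: hinge subgradient
                for j in range(p):
                    gw[j] -= yi * xi[j]
                gb -= yi
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
        # soft-thresholding: the proximal step for the lasso penalty,
        # which shrinks small weights all the way to zero
        w = [max(abs(wj) - lr * lam, 0.0) * (1 if wj > 0 else -1) for wj in w]
    return w, b

# toy data: feature 0 separates the classes, feature 1 is pure noise
random.seed(0)
y = [1, -1] * 20
X = [[yi * 2.0, random.gauss(0, 1)] for yi in y]
w, b = sparse_svm(X, y, lam=0.3)
print("weights:", w)  # the noise feature's weight is driven toward zero
```

With the ridge-like penalty of the standard SVM, both weights would typically be nonzero; the lasso penalty instead shrinks the uninformative weight toward exactly zero, so the decision rule depends on only a subset of the features.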