Source: statweb.stanford.edu/~tibs/sta305files/lasso.pdf
TRANSCRIPT
-
The Lasso and related penalization methods
Basic lasso, related approaches, generalizations, novel applications
-
Lasso and related methods 3
Linear regression via the Lasso (Tibshirani, 1996)
• Outcome variable y_i for cases i = 1, 2, …, n; features x_ij, j = 1, 2, …, p
• Minimize ∑_{i=1}^n (y_i − ∑_j x_ij β_j)² + λ ∑_j |β_j|
• Equivalent to minimizing the sum of squares subject to the constraint ∑_j |β_j| ≤ s.
• Similar to ridge regression, which uses the constraint ∑_j β_j² ≤ t
• The lasso does variable selection and shrinkage, while ridge regression only shrinks.
• See also “Basis Pursuit” (Chen, Donoho and Saunders, 1998).
-
Lasso and related methods 4
Picture of Lasso and Ridge regression
[Figure: elliptical contours of the residual sum of squares centered at the least-squares estimate β̂ in the (β_1, β_2) plane, with the lasso constraint region |β_1| + |β_2| ≤ s (a diamond) and the ridge constraint region β_1² + β_2² ≤ t (a disk).]
-
Lasso and related methods 5
Example: Prostate Cancer Data
y_i = log(PSA); the x_ij are clinical measurements on a man and his prostate
[Figure: lasso coefficient profiles for the prostate data, plotted against the shrinkage factor s ∈ [0, 1]; the eight curves are lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45.]
-
Lasso and related methods 6
Algorithms for the lasso
• Standard convex optimizer
• Least angle regression (LAR), Efron et al. (2004): computes the entire path of solutions. State of the art until 2008; see the LARS section in this course.
• Pathwise coordinate descent (new)
-
Lasso and related methods 7
Pathwise coordinate descent for the lasso
• Coordinate descent: optimize one parameter (coordinate) at a time.
• How? Suppose we had only one predictor. The problem is to minimize
∑_i (y_i − x_i β)² + λ|β|
• The solution is the soft-thresholded estimate
sign(β̂)(|β̂| − λ)₊
where β̂ is the usual least-squares estimate.
• Idea: with multiple predictors, cycle through the predictors in turn. For predictor k, compute the partial residuals r_i = y_i − ∑_{j≠k} x_ij β̂_j and apply univariate soft-thresholding, pretending that our data is (x_ik, r_i).
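The soft-thresholding operator itself is one line of NumPy. A minimal sketch (here (·)₊ denotes the positive part; with the criterion scaled as ½∑(y_i − x_iβ)² + λ|β| and ∑_i x_i² = 1, the threshold is exactly λ, while other scalings rescale the threshold by a constant):

```python
import numpy as np

def soft_threshold(b_hat, lam):
    """Soft-thresholding operator S(b, lam) = sign(b) * (|b| - lam)_+."""
    return np.sign(b_hat) * np.maximum(np.abs(b_hat) - lam, 0.0)
```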
Robert Tibshirani
-
Lasso and related methods 8
• Turns out that this is coordinate descent for the lasso criterion
∑_i (y_i − ∑_j x_ij β_j)² + λ ∑_j |β_j|
• Like skiing to the bottom of a hill: going north-south, east-west, north-south, etc.
• Too simple?!
Demonstration:
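As a demonstration, the cycling scheme can be sketched in a few lines of NumPy. This is an illustrative sketch, not the glmnet code: it assumes the columns of X are standardized to unit squared norm and writes the criterion as ½||y − Xβ||² + λ||β||₁, so each coordinate update is a soft-threshold of the inner product of that column with the partial residual.

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2)||y - X beta||^2 + lam*||beta||_1.
    Assumes each column of X has unit squared norm."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for k in range(p):
            # partial residual, leaving predictor k out
            r = y - X @ beta + X[:, k] * beta[k]
            # univariate soft-thresholded update for coordinate k
            beta[k] = soft_threshold(X[:, k] @ r, lam)
    return beta
```

When the columns of X are orthonormal, one cycle already lands on the exact solution S(Xᵀy, λ), which makes a convenient sanity check.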
-
Lasso and related methods 9
A brief history of coordinate descent for the lasso
1997 Tibshirani’s student Wenjiang Fu at U. Toronto develops
the “shooting algorithm” for the lasso. Tibshirani doesn’t fully
appreciate it.
2002 Ingrid Daubechies gives a talk at Stanford, describes a
one-at-a-time algorithm for the lasso. Hastie implements it,
makes an error, and Hastie and Tibshirani conclude that the
method doesn’t work.
2006 Friedman is external examiner at PhD oral of Anita van der
Kooij (Leiden) who uses coordinate descent for elastic net.
Friedman, Hastie + Tibshirani revisit this problem. Others
have too — Krishnapuram and Hartemink (2005), Genkin,
Lewis and Madigan (2007), Wu and Lange (2008), Meier, van
de Geer and Buehlmann (2008).
-
Lasso and related methods 10
Pathwise coordinate descent for the lasso
• Start with a large value of λ (a very sparse model) and slowly decrease it
• Most coordinates that are zero never become non-zero
• The coordinate descent code for the lasso is just 73 lines of Fortran!
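The warm-start strategy might be sketched as follows. This is an illustrative NumPy version, not glmnet's Fortran: it assumes unit-norm standardized columns, a ½-scaled squared-error loss, and a log-spaced grid down to λ_min = 0.05 λ_max (the ratio used in the speed trials later in the deck).

```python
import numpy as np

def lasso_path(X, y, n_lambda=100, lambda_min_ratio=0.05, n_iter=50):
    """Pathwise coordinate descent: solve the lasso on a decreasing grid of
    lambda values, warm-starting each fit from the previous solution.
    Assumes columns of X have unit squared norm."""
    n, p = X.shape
    lam_max = np.max(np.abs(X.T @ y))  # smallest lambda with an all-zero fit
    lambdas = lam_max * np.logspace(0.0, np.log10(lambda_min_ratio), n_lambda)
    beta = np.zeros(p)
    path = []
    for lam in lambdas:                 # decreasing grid
        for _ in range(n_iter):         # warm start: beta carries over
            for k in range(p):
                r = y - X @ beta + X[:, k] * beta[k]  # partial residual
                z = X[:, k] @ r
                beta[k] = np.sign(z) * max(abs(z) - lam, 0.0)
        path.append(beta.copy())
    return lambdas, np.array(path)
```

Because the grid starts at λ_max, the first solution on the path is exactly zero, and most coordinates that start at zero stay there as λ decreases.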
-
Lasso and related methods 11
Extensions
• Pathwise coordinate descent can be generalized to many other models: logistic/multinomial regression for classification, the graphical lasso for undirected graphs, the fused lasso for signals.
• Its speed and simplicity are quite remarkable.
• glmnet R package now available on CRAN
-
Lasso and related methods 12
Logistic regression
• Outcome Y = 0 or 1; logistic regression model
log( Pr(Y = 1) / (1 − Pr(Y = 1)) ) = β0 + β1X1 + β2X2 + …
• The criterion is the binomial log-likelihood plus an absolute-value penalty
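To make the criterion concrete, the sketch below fits ℓ1-penalized logistic regression by proximal gradient descent (ISTA): a gradient step on the smooth log-likelihood part, then a soft-threshold. This is a simple stand-in for illustration, not glmnet's algorithm (glmnet applies coordinate descent to an IRLS quadratic approximation); the 1/n loss scaling, step size, and omitted intercept are assumptions of this sketch.

```python
import numpy as np

def logistic_lasso(X, y, lam, step=0.1, n_iter=2000):
    """l1-penalized logistic regression via proximal gradient (ISTA):
    minimize (1/n) * negative binomial log-likelihood + lam * ||beta||_1.
    Intercept omitted for brevity."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))   # Pr(Y = 1 | x)
        grad = X.T @ (prob - y) / n              # gradient of the smooth part
        b = beta - step * grad                   # gradient step
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # prox
    return beta
```

For λ large enough the solution is exactly zero; for small λ the fit picks up the signal, just as in the linear case.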
-
Lasso and related methods 13
Speed Trials
Competitors:
lars As implemented in R package, for squared-error loss.
glmnet Fortran based R package using coordinate descent — topic
of this talk. Does squared error and logistic (2- and K-class).
l1logreg Lasso-logistic regression package by Koh, Kim and Boyd,
using state-of-the-art interior point methods for convex
optimization.
BBR/BMR Bayesian binomial/multinomial regression package by
Genkin, Lewis and Madigan. Also uses coordinate descent to
compute posterior mode with Laplace prior—the lasso fit.
Based on simulations (next 3 slides) and real data (4th slide).
-
Lasso and related methods 17
Logistic Regression — Real Datasets

Name          Type      N       p        glmnet    l1logreg  BBR/BMR
Dense
  Cancer      14 class  144     16,063   2.5 mins  NA        2.1 hrs
  Leukemia    2 class   72      3571     2.50      55.0      450
Sparse
  Internet ad 2 class   2359    1430     5.0       20.9      34.7
  Newsgroup   2 class   11,314  777,811  2 mins    3.5 hrs

Timings in seconds (unless stated otherwise). For Cancer, Leukemia and
Internet ad, times are for ten-fold cross-validation over 100 λ values; for
Newsgroup we performed a single run with 100 values of λ, with
λ_min = 0.05 λ_max.
-
Lasso and related methods 19
Generalizations of the lasso
• The Elastic net (Zou and Hastie 2005)

min_β ½ ∑_{i=1}^n (y_i − ∑_{j=1}^p x_ij β_j)² + λ_1 ∑_{j=1}^p |β_j| + λ_2 ∑_{j=1}^p β_j²/2    (1)

The squared-norm penalty tends to average the coefficients of
predictors that are correlated, while the L1 norm chooses
among the averaged groups. When p > n, the number of
non-zero coefficients can exceed n, unlike the lasso.
Form of penalty used in glmnet:

∑_{j=1}^p [ ½(1 − α)β_j² + α|β_j| ].
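In glmnet's parametrization, the elastic-net coordinate update has a closed form: soft-threshold at λα, then shrink by the ridge term λ(1 − α). A minimal sketch, assuming a unit-norm predictor and a ½-scaled squared-error loss (exact constants vary with scaling conventions); z is the inner product of the predictor with the partial residual:

```python
import numpy as np

def enet_update(z, lam, alpha):
    """Elastic-net coordinate update, glmnet parametrization:
    soft-threshold at lam*alpha, then divide by 1 + lam*(1 - alpha)."""
    s = np.sign(z) * max(abs(z) - lam * alpha, 0.0)   # lasso part
    return s / (1.0 + lam * (1.0 - alpha))            # ridge shrinkage
```

With α = 1 this reduces to the lasso soft-threshold; with α = 0 it is pure ridge shrinkage.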
-
Lasso and related methods 20
[Figure: three panels of coefficient profiles versus step number (first 10 steps): Lasso, Elastic Net (α = 0.4), and Ridge Regression.]
Leukemia Data, Logistic, N=72, p=3571, first 10 steps shown
-
Lasso and related methods 21
Grouped Lasso
(Yuan and Lin 2005). Here the variables occur in groups (such as
dummy variables for multi-level factors). Suppose X_j is an N × p_j
orthonormal matrix that represents the jth group of p_j variables,
j = 1, …, m, and β_j is the corresponding coefficient vector. The
grouped lasso solves

min_β ||y − ∑_{j=1}^m X_j β_j||₂² + ∑_{j=1}^m λ_j ||β_j||₂,    (2)

where λ_j = λ√p_j. The 2-norm penalty shrinks all the coefficients in a
group down to zero together. Coordinate descent can be applied in both
models.
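The group-wise update is a vector analogue of soft-thresholding. A minimal sketch, assuming an orthonormal group X_j and a ½-scaled squared error, so that the block update is the proximal operator of λ_j||·||₂ applied to s = X_jᵀ r (r the partial residual); constants shift with other loss scalings:

```python
import numpy as np

def group_soft_threshold(s, lam):
    """Group soft-thresholding: the proximal operator of lam * ||.||_2.
    Shrinks the whole block toward zero and zeroes it out entirely
    when ||s||_2 <= lam."""
    norm = np.linalg.norm(s)
    if norm <= lam:
        return np.zeros_like(s)
    return (1.0 - lam / norm) * s
```

Unlike coordinate-wise soft-thresholding, this sets an entire group to zero at once, which is exactly the "all in or all out" selection behavior the grouped lasso is designed for.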