Source: statweb.stanford.edu/~tibs/sta305files/lasso.pdf
TRANSCRIPT
-
The Lasso and related penalization methods
Basic lasso, related approaches, generalizations, novel applications
-
Lasso and related methods 3
Linear regression via the Lasso (Tibshirani, 1996)
• Outcome variable y_i for cases i = 1, 2, …, n; features x_ij, j = 1, 2, …, p
• Minimize ∑_{i=1}^n (y_i − ∑_j x_ij β_j)² + λ ∑_j |β_j|
• Equivalent to minimizing the sum of squares subject to the constraint ∑_j |β_j| ≤ s.
• Similar to ridge regression, which uses the constraint ∑_j β_j² ≤ t
• The lasso does variable selection and shrinkage, while ridge regression only shrinks.
• See also “Basis Pursuit” (Chen, Donoho and Saunders, 1998).
-
Lasso and related methods 4
Picture of Lasso and Ridge regression
[Figure: elliptical contours of the residual sum of squares centered at the least-squares estimate β̂ in the (β_1, β_2) plane, with the lasso constraint region |β_1| + |β_2| ≤ s (a diamond) and the ridge constraint region β_1² + β_2² ≤ t (a disk).]
-
Lasso and related methods 5
Example: Prostate Cancer Data
y_i = log(PSA); the x_ij are clinical measurements on a man and his prostate
[Figure: lasso coefficient profiles for the prostate data, plotted against the shrinkage factor s ∈ [0, 1]; the eight curves are lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45.]
-
Lasso and related methods 6
Algorithms for the lasso
• Standard convex optimizer
• Least angle regression (LAR), Efron et al. (2004): computes the entire path of solutions. State of the art until 2008; see the LARS section in this course.
• Pathwise coordinate descent (new)
-
Lasso and related methods 7
Pathwise coordinate descent for the lasso
• Coordinate descent: optimize one parameter (coordinate) at a time.
• How? Suppose we had only one predictor. The problem is to minimize
∑_i (y_i − x_i β)² + λ|β|
• The solution is the soft-thresholded estimate
sign(β̂)(|β̂| − λ)₊
where β̂ is the usual least-squares estimate.
• Idea: with multiple predictors, cycle through the predictors in turn. For predictor k, compute the partial residuals r_i = y_i − ∑_{j≠k} x_ij β̂_j and apply univariate soft-thresholding, pretending that our data is (x_ik, r_i).
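The soft-thresholding operator itself is one line of NumPy. A minimal sketch (here (·)₊ denotes the positive part; with the criterion scaled as ½∑(y_i − x_iβ)² + λ|β| and ∑_i x_i² = 1, the threshold is exactly λ, while other scalings rescale the threshold by a constant):

```python
import numpy as np

def soft_threshold(b_hat, lam):
    """Soft-thresholding operator S(b, lam) = sign(b) * (|b| - lam)_+."""
    return np.sign(b_hat) * np.maximum(np.abs(b_hat) - lam, 0.0)
```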
Robert Tibshirani
-
Lasso and related methods 8
• Turns out that this is coordinate descent for the lasso criterion
∑_i (y_i − ∑_j x_ij β_j)² + λ ∑_j |β_j|
• Like skiing to the bottom of a hill: going north-south, east-west, north-south, etc.
• Too simple?!
Demonstration:
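As a demonstration, the cycling scheme can be sketched in a few lines of NumPy. This is an illustrative sketch, not the glmnet code: it assumes the columns of X are standardized to unit squared norm and writes the criterion as ½||y − Xβ||² + λ||β||₁, so each coordinate update is a soft-threshold of the inner product of that column with the partial residual.

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2)||y - X beta||^2 + lam*||beta||_1.
    Assumes each column of X has unit squared norm."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for k in range(p):
            # partial residual, leaving predictor k out
            r = y - X @ beta + X[:, k] * beta[k]
            # univariate soft-thresholded update for coordinate k
            beta[k] = soft_threshold(X[:, k] @ r, lam)
    return beta
```

When the columns of X are orthonormal, one cycle already lands on the exact solution S(Xᵀy, λ), which makes a convenient sanity check.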
-
Lasso and related methods 9
A brief history of coordinate descent for the lasso
1997 Tibshirani’s student Wenjiang Fu at U. Toronto develops
the “shooting algorithm” for the lasso. Tibshirani doesn’t fully
appreciate it.
2002 Ingrid Daubechies gives a talk at Stanford, describes a
one-at-a-time algorithm for the lasso. Hastie implements it,
makes an error, and Hastie and Tibshirani conclude that the
method doesn’t work.
2006 Friedman is external examiner at PhD oral of Anita van der
Kooij (Leiden) who uses coordinate descent for elastic net.
Friedman, Hastie + Tibshirani revisit this problem. Others
have too — Krishnapuram and Hartemink (2005), Genkin,
Lewis and Madigan (2007), Wu and Lange (2008), Meier, van
de Geer and Buehlmann (2008).
-
Lasso and related methods 10
Pathwise coordinate descent for the lasso
• Start with a large value of λ (a very sparse model) and slowly decrease it
• Most coordinates that are zero never become non-zero
• The coordinate descent code for the lasso is just 73 lines of Fortran!
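The warm-start strategy might be sketched as follows. This is an illustrative NumPy version, not glmnet's Fortran: it assumes unit-norm standardized columns, a ½-scaled squared-error loss, and a log-spaced grid down to λ_min = 0.05 λ_max (the ratio used in the speed trials later in the deck).

```python
import numpy as np

def lasso_path(X, y, n_lambda=100, lambda_min_ratio=0.05, n_iter=50):
    """Pathwise coordinate descent: solve the lasso on a decreasing grid of
    lambda values, warm-starting each fit from the previous solution.
    Assumes columns of X have unit squared norm."""
    n, p = X.shape
    lam_max = np.max(np.abs(X.T @ y))  # smallest lambda with an all-zero fit
    lambdas = lam_max * np.logspace(0.0, np.log10(lambda_min_ratio), n_lambda)
    beta = np.zeros(p)
    path = []
    for lam in lambdas:                 # decreasing grid
        for _ in range(n_iter):         # warm start: beta carries over
            for k in range(p):
                r = y - X @ beta + X[:, k] * beta[k]  # partial residual
                z = X[:, k] @ r
                beta[k] = np.sign(z) * max(abs(z) - lam, 0.0)
        path.append(beta.copy())
    return lambdas, np.array(path)
```

Because the grid starts at λ_max, the first solution on the path is exactly zero, and most coordinates that start at zero stay there as λ decreases.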
-
Lasso and related methods 11
Extensions
• Pathwise coordinate descent can be generalized to many other models: logistic/multinomial regression for classification, the graphical lasso for undirected graphs, the fused lasso for signals.
• Its speed and simplicity are quite remarkable.
• glmnet R package now available on CRAN
-
Lasso and related methods 12
Logistic regression
• Outcome Y = 0 or 1; logistic regression model
log( Pr(Y = 1) / (1 − Pr(Y = 1)) ) = β0 + β1X1 + β2X2 + …
• The criterion is the binomial log-likelihood plus an absolute-value penalty
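To make the criterion concrete, the sketch below fits ℓ1-penalized logistic regression by proximal gradient descent (ISTA): a gradient step on the smooth log-likelihood part, then a soft-threshold. This is a simple stand-in for illustration, not glmnet's algorithm (glmnet applies coordinate descent to an IRLS quadratic approximation); the 1/n loss scaling, step size, and omitted intercept are assumptions of this sketch.

```python
import numpy as np

def logistic_lasso(X, y, lam, step=0.1, n_iter=2000):
    """l1-penalized logistic regression via proximal gradient (ISTA):
    minimize (1/n) * negative binomial log-likelihood + lam * ||beta||_1.
    Intercept omitted for brevity."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))   # Pr(Y = 1 | x)
        grad = X.T @ (prob - y) / n              # gradient of the smooth part
        b = beta - step * grad                   # gradient step
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # prox
    return beta
```

For λ large enough the solution is exactly zero; for small λ the fit picks up the signal, just as in the linear case.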
-
Lasso and related methods 13
Speed Trials
Competitors:
lars As implemented in R package, for squared-error loss.
glmnet Fortran based R package using coordinate descent — topic
of this talk. Does squared error and logistic (2- and K-class).
l1logreg Lasso-logistic regression package by Koh, Kim and Boyd,
using state-of-the-art interior point methods for convex
optimization.
BBR/BMR Bayesian binomial/multinomial regression package by
Genkin, Lewis and Madigan. Also uses coordinate descent to
compute posterior mode with Laplace prior—the lasso fit.
Based on simulations (next 3 slides) and real data (4th slide).
-
Lasso and related methods 17
Logistic Regression — Real Datasets

Name          Type      N       p        glmnet    l1logreg  BBR/BMR
Dense
  Cancer      14 class  144     16,063   2.5 mins  NA        2.1 hrs
  Leukemia    2 class   72      3571     2.50      55.0      450
Sparse
  Internet ad 2 class   2359    1430     5.0       20.9      34.7
  Newsgroup   2 class   11,314  777,811  2 mins    3.5 hrs

Timings in seconds (unless stated otherwise). For Cancer, Leukemia and
Internet ad, times are for ten-fold cross-validation over 100 λ values; for
Newsgroup we performed a single run with 100 values of λ, with
λ_min = 0.05 λ_max.
-
Lasso and related methods 19
Generalizations of the lasso
• The Elastic net (Zou and Hastie 2005)

min_β ½ ∑_{i=1}^n (y_i − ∑_{j=1}^p x_ij β_j)² + λ_1 ∑_{j=1}^p |β_j| + λ_2 ∑_{j=1}^p β_j²/2    (1)

The squared-norm penalty tends to average the coefficients of
predictors that are correlated, while the L1 norm chooses
among the averaged groups. When p > n, the number of
non-zero coefficients can exceed n, unlike the lasso.
Form of penalty used in glmnet:

∑_{j=1}^p [ ½(1 − α)β_j² + α|β_j| ].
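In glmnet's parametrization, the elastic-net coordinate update has a closed form: soft-threshold at λα, then shrink by the ridge term λ(1 − α). A minimal sketch, assuming a unit-norm predictor and a ½-scaled squared-error loss (exact constants vary with scaling conventions); z is the inner product of the predictor with the partial residual:

```python
import numpy as np

def enet_update(z, lam, alpha):
    """Elastic-net coordinate update, glmnet parametrization:
    soft-threshold at lam*alpha, then divide by 1 + lam*(1 - alpha)."""
    s = np.sign(z) * max(abs(z) - lam * alpha, 0.0)   # lasso part
    return s / (1.0 + lam * (1.0 - alpha))            # ridge shrinkage
```

With α = 1 this reduces to the lasso soft-threshold; with α = 0 it is pure ridge shrinkage.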
-
Lasso and related methods 20
[Figure: three panels of coefficient profiles versus step number (first 10 steps): Lasso, Elastic Net (α = 0.4), and Ridge Regression.]
Leukemia Data, Logistic, N=72, p=3571, first 10 steps shown
-
Lasso and related methods 21
Grouped Lasso
(Yuan and Lin 2005). Here the variables occur in groups (such as
dummy variables for multi-level factors). Suppose X_j is an N × p_j
orthonormal matrix that represents the jth group of p_j variables,
j = 1, …, m, and β_j is the corresponding coefficient vector. The
grouped lasso solves

min_β ||y − ∑_{j=1}^m X_j β_j||₂² + ∑_{j=1}^m λ_j ||β_j||₂,    (2)

where λ_j = λ√p_j. The 2-norm penalty shrinks all the coefficients in a
group down to zero together. Coordinate descent can be applied in both
models.
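The group-wise update is a vector analogue of soft-thresholding. A minimal sketch, assuming an orthonormal group X_j and a ½-scaled squared error, so that the block update is the proximal operator of λ_j||·||₂ applied to s = X_jᵀ r (r the partial residual); constants shift with other loss scalings:

```python
import numpy as np

def group_soft_threshold(s, lam):
    """Group soft-thresholding: the proximal operator of lam * ||.||_2.
    Shrinks the whole block toward zero and zeroes it out entirely
    when ||s||_2 <= lam."""
    norm = np.linalg.norm(s)
    if norm <= lam:
        return np.zeros_like(s)
    return (1.0 - lam / norm) * s
```

Unlike coordinate-wise soft-thresholding, this sets an entire group to zero at once, which is exactly the "all in or all out" selection behavior the grouped lasso is designed for.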