Lecture 1: Variable Selection - LASSO
Hao Helen Zhang
MATH 574M (2019lect13_shrink_lasso.pdf)

Sections: Oracle Properties, Shrinkage Methods, Ridge Penalty
Outline

Modern penalization methods:
- Lq penalty, ridge
- LASSO: computation, solution path, LARS algorithm; parameter tuning, cross validation; theoretical properties
- ridge regression
Notations

- data (Xi, Yi), i = 1, ..., n; n is the sample size
- d: the number of predictors, Xi ∈ R^d
- the full index set S = {1, 2, ..., d}
- the index set selected by a procedure is A; its size is |A|
- the linear coefficient vector β = (β1, ..., βd)^T
- the true linear coefficients β0 = (β10, ..., βd0)^T
- the true model A0 = {j : βj0 ≠ 0, j = 1, ..., d}
Motivations

Shrinkage methods make variable selection a continuous process:
- more stable, not suffering from high variability;
- one unified selection and estimation framework, making inference easy;
- advances in theory: oracle properties, valid asymptotic inference.

In general, we standardize the inputs before applying these methods:
- the X's are centered and scaled to mean 0 and variance 1;
- y is centered to mean 0.
We then often fit a model without an intercept.
Ideal Variable Selection
The best results we can expect from a selection procedure:
- it recovers the correct model structure:
  - keeps all the important variables in the model;
  - filters all the noise variables out of the model;
- it has optimal inferential properties:
  - consistency, with the optimal convergence rate;
  - asymptotic normality, full efficiency.

True model: yi = β00 + Σ_{j=1}^d xij βj0 + εi,
- important index set: A0 = {j : βj0 ≠ 0, j = 1, ..., d}
- unimportant index set: A0^c = {j : βj0 = 0, j = 1, ..., d}
Oracle Properties
An oracle performs as if the true model were known:
1. selection consistency: β̂j ≠ 0 for j ∈ A0; β̂j = 0 for j ∈ A0^c;
2. estimation consistency: √n (β̂_{A0} − β_{A0}) →d N(0, ΣI), where β_{A0} = {βj, j ∈ A0} and ΣI is the covariance matrix obtained when the true model is known.
LASSO / Parameter Tuning / Computation / Consistency Properties
Penalized Regularization Framework
One revolutionary development is the regularization framework:

min_β L(β; y, X) + λ J(β)

- L(β; y, X) is the loss function.
- For OLS, L = Σ_{i=1}^n [yi − β0 − Σ_{j=1}^d βj xij]^2.
- For MLE-based methods, L is the negative log-likelihood.
- For Cox's PH models, L is the (negative log) partial likelihood.
- In supervised learning, L may be the hinge loss (SVM) or the exponential loss (AdaBoost).
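A minimal Python sketch (illustrative, not the lecture's code) of this framework: the loss L and the penalty J are separate plug-in pieces combined in one objective.

```python
# Sketch of min_beta  L(beta; y, X) + lambda * J(beta)
# with squared-error loss and a plug-in penalty J.

def squared_error_loss(beta, y, X):
    # L(beta; y, X) = sum_i (y_i - sum_j x_ij beta_j)^2  (no intercept; inputs standardized)
    return sum((yi - sum(b * xij for b, xij in zip(beta, xi))) ** 2
               for yi, xi in zip(y, X))

def lasso_penalty(beta):          # J_1(beta) = sum_j |beta_j|
    return sum(abs(b) for b in beta)

def ridge_penalty(beta):          # J_2(beta) = sum_j beta_j^2
    return sum(b * b for b in beta)

def objective(beta, y, X, lam, penalty):
    return squared_error_loss(beta, y, X) + lam * penalty(beta)
```

Swapping `penalty` (or the loss) changes the method without touching the rest of the machinery, which is the point of the unified framework.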
Lq Shrinkage Penalty
J(β) is the penalty function. The Lq family is

Jq(β) = ‖β‖q^q = Σ_{j=1}^d |βj|^q, q ≥ 0.

- J0(β) = Σ_{j=1}^d I(βj ≠ 0) (L0 penalty; Donoho and Johnstone, 1988)
- J1(β) = Σ_{j=1}^d |βj| (LASSO; Tibshirani, 1996)
- J2(β) = Σ_{j=1}^d βj^2 (ridge; Hoerl and Kennard, 1970)
- J∞(β) = maxj |βj| (supnorm penalty; Zhang et al., 2008)
[Figure 3.13 of Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning, Chapter 3: contours of constant value of Σj |βj|^q for q = 4, 2, 1, 0.5, 0.1.]
The LASSO
Apply the absolute-value penalty to the coefficients:

min_β Σ_{i=1}^n (yi − Σ_{j=1}^d xij βj)^2 + λ Σ_{j=1}^d |βj|

- λ ≥ 0 is a tuning parameter.
- λ controls the amount of shrinkage: the larger λ, the greater the amount of shrinkage.
- What happens as λ → 0? (no penalty)
- What happens as λ → ∞?
Equivalent formulation for LASSO
min_β Σ_{i=1}^n (yi − Σ_{j=1}^d xij βj)^2, subject to Σ_{j=1}^d |βj| ≤ t.

- Explicitly applies a magnitude constraint on the parameters.
- Making t sufficiently small causes some coefficients to be exactly zero.
- The lasso is thus a kind of continuous subset selection.
[Figure 3.12 of Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning, Chapter 3: estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1^2 + β2^2 ≤ t^2, respectively, while the red ellipses are the contours of the least squares error function.]
LASSO Solution: soft thresholding
For the orthonormal design case with X^T X = I, the LASSO is equivalent to solving, componentwise,

min_{βj} (βj − β̂j^ls)^2 + λ|βj|.

The solution to this problem is

β̂j^lasso = sign(β̂j^ls)(|β̂j^ls| − λ/2)_+
          = β̂j^ls − λ/2,  if β̂j^ls > λ/2
            0,             if |β̂j^ls| ≤ λ/2
            β̂j^ls + λ/2,  if β̂j^ls < −λ/2

- shrinks big coefficients toward zero by the constant λ/2;
- truncates small coefficients to exactly zero.
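The soft-thresholding rule takes only a few lines; the sketch below (illustrative Python, not from the lecture) also includes the proportional ridge rule β̂/(1+λ) for comparison on the next slide.

```python
def soft_threshold(b_ls, lam):
    # LASSO under orthonormal design: sign(b)(|b| - lam/2)_+
    # Shift the coefficient toward zero by lam/2; truncate at zero if it crosses.
    mag = abs(b_ls) - lam / 2.0
    if mag <= 0.0:
        return 0.0
    return mag if b_ls > 0 else -mag

def ridge_shrink(b_ls, lam):
    # Ridge under the same design: smooth proportional shrinkage, never exactly zero.
    return b_ls / (1.0 + lam)
```

Note how `soft_threshold` sets small coefficients exactly to zero (selection), while `ridge_shrink` only scales them down.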
Comparison: Different Methods Under Orthogonal Design
When X is orthonormal, i.e., X^T X = I:
- Best subset (of size M): keeps the M largest coefficients β̂j^ls and sets the rest to zero.
- Ridge regression: β̂j^ls/(1 + λ), a proportional shrinkage.
- Lasso: sign(β̂j^ls)(|β̂j^ls| − λ/2)_+, which shifts each coefficient toward zero by a constant and truncates it at zero below a certain threshold; this "soft thresholding" is used often in wavelet-based smoothing.
How to Choose t or λ
The tuning parameter t or λ should be chosen adaptively to minimize the MSE or prediction error (PE). Common tuning methods estimate this error via cross validation, leave-one-out cross validation (LOOCV), AIC, or BIC.

- If t is chosen larger than t0 = Σ_{j=1}^d |β̂j^ls|, then β̂j^lasso = β̂j^ls.
- If t = t0/2, the least squares coefficients are shrunk by about 50% on average.

In practice, we sometimes standardize t as s = t/t0 ∈ [0, 1] and choose the best value of s from [0, 1].
K-fold Cross Validation
The simplest and most popular way to estimate the prediction error:
1. Randomly split the data into K roughly equal parts.
2. For each k = 1, ..., K: leave the kth part out, fit the model using the other K − 1 parts, and denote the solution by β̂^(−k); compute the prediction error of β̂^(−k) on the left-out kth part.
3. Average the prediction errors.

Define the index function (allocating fold memberships) κ : {1, ..., n} → {1, ..., K}. The K-fold cross-validation estimate of the prediction error (PE) is

CV = (1/n) Σ_{i=1}^n (yi − xi^T β̂^(−κ(i)))^2.

Typical choices: K = 5 or 10.
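The three steps above can be sketched generically (illustrative Python; `fit` and `predict` are placeholder callables standing in for any fitting procedure):

```python
import random

def kfold_cv(x, y, fit, predict, K=5, seed=0):
    """K-fold CV estimate of prediction error:
    CV = (1/n) sum_i (y_i - f^(-kappa(i))(x_i))^2,
    where kappa maps each observation to its fold."""
    n = len(y)
    order = list(range(n))
    random.Random(seed).shuffle(order)          # random split into K parts
    kappa = {i: k % K for k, i in enumerate(order)}  # membership function kappa
    total = 0.0
    for k in range(K):
        train = [i for i in range(n) if kappa[i] != k]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in range(n):                      # error on the left-out kth part
            if kappa[i] == k:
                total += (y[i] - predict(model, x[i])) ** 2
    return total / n
```

For the lasso, `fit` would be the lasso solver at a given λ, and the whole routine would be repeated over a grid of λ values (as on the next slide).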
Choose λ
- Choose a sequence of λ values.
- For each λ, fit the LASSO and denote the solution by β̂λ^lasso.
- Compute the CV(λ) curve,

CV(λ) = (1/n) Σ_{i=1}^n (yi − xi^T β̂λ^(−κ(i)))^2,

which provides an estimate of the test error curve.
- Find the best parameter λ* that minimizes CV(λ).
- Fit the final LASSO model with λ*; the final solution is denoted β̂λ*^lasso.
[Figure from Hastie, Tibshirani & Friedman (2001): prediction error (red) and tenfold cross-validation curve (green), estimated from a single training set, plotted against subset size p.]
One-standard Error for CV
One-standard-error rule:
- choose the most parsimonious model whose error is no more than one standard error above the error of the best model;
- used often in model selection (to favor a smaller model size).
[The same figure again: prediction error (red) and tenfold cross-validation curve (green) versus subset size p, illustrating the one-standard-error rule.]
Computation of LASSO
- Quadratic programming
- Shooting algorithm (Fu, 1998; Zhang and Lu, 2007)
- LARS solution path (Efron et al., 2004)
  - the most efficient algorithm for the LASSO
  - designed for standard linear models
- Coordinate descent algorithm (CDA; Friedman et al., 2010)
  - designed for GLMs
  - suitable for ultra-high-dimensional problems
R package: lars
Provides the entire sequence of coefficients and fits, starting from zero and ending at the least squares fit.

library(lars)
lars(x, y, type = c("lasso", "lar", "forward.stagewise", "stepwise"),
     trace = FALSE, normalize = TRUE, intercept = TRUE, Gram,
     eps = .Machine$double.eps, max.steps, use.Gram = TRUE)

A "lars" object is returned, for which print, plot, predict, coef, and summary methods exist. Details are in Efron, Hastie, Johnstone and Tibshirani (2004). With the "lasso" option, it computes the complete lasso solution simultaneously for ALL values of the shrinkage parameter (λ, t, or s), at the same computational cost as a least squares fit.
Details About LARS function
- x: matrix of predictors
- y: response
- type: one of "lasso", "lar", "forward.stagewise" or "stepwise", all variants of the lasso; default is "lasso"
- trace: if TRUE, lars prints out its progress
- normalize: if TRUE, each variable is standardized to have unit L2 norm, otherwise it is left alone; default is TRUE
- intercept: if TRUE, an intercept is included (and not penalized), otherwise no intercept is included; default is TRUE
- Gram: the X^T X matrix; useful for repeated runs (e.g., bootstrap) where a large X^T X stays the same
- eps: an effective zero
- max.steps: the maximum number of steps taken; default is 8 * min(d, n − intercept), where n is the sample size
[Figure 3.9 of Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning, Chapter 3: profiles of lasso coefficients (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) as the tuning parameter t is varied. Coefficients are plotted versus s = t / Σj |β̂j|. A vertical line is drawn at s = 0.5, the value chosen by cross-validation. Compare Figure 3.7: the lasso profiles hit zero, while those for ridge do not.]
Prediction for LARS
predict(object, newx, s, type = c("fit", "coefficients"),
        mode = c("step", "fraction", "norm", "lambda"), ...)
coef(object, ...)

- object: a fitted lars object
- newx: if type="fit", the x values at which the fit is required; if type="coefficients", newx can be omitted
- s: a value, or vector of values, indexing the path; its interpretation depends on mode. By default (mode="step"), s takes values between 0 and p.
- type: if type="fit", predict returns the fitted values; if type="coefficients", it returns the coefficients
“Mode” for Prediction in LARS
- If mode="step", s indexes the lars step number, and the coefficients corresponding to step s are returned.
- If mode="fraction", s should be a number between 0 and 1, referring to the ratio of the L1 norm of the coefficient vector relative to the norm at the full LS solution.
- If mode="norm", s refers to the L1 norm of the coefficient vector.
- If mode="lambda", s is the lasso regularization parameter λ.
Example: Diabetes Data
library(lars)
data(diabetes)
attach(diabetes)
## fit LASSO
object <- lars(x,y)
plot(object)
summary(object)
predict(object,newx=x)
## fit LARS
object2 <- lars(x,y,type="lar")
detach(diabetes)
R Code to Compute K-CV Error Curve for Lars
cv.lars(x, y, K = 10, index, type = c("lasso", "lar",
        "forward.stagewise", "stepwise"),
        mode = c("fraction", "step"), ...)

- x, y: inputs to lars
- K: number of folds
- index: abscissa values at which the CV curve should be computed. If mode="fraction", this is the fraction of the saturated |β|; the default is index = seq(from = 0, to = 1, length = 100). If mode="step", this is the number of steps in the lars procedure.
R Code: CV.LARS
attach(diabetes)
fit.lasso <- lars(x, y)   ## lasso fit used by predict/coef below
lasso.s <- seq(0, 1, length = 100)
lasso.cv <- cv.lars(x, y, K = 10, index = lasso.s, mode = "fraction")
lasso.mcv <- which.min(lasso.cv$cv)

## minimum-CV rule
bests1 <- lasso.s[lasso.mcv]
lasso.coef1 <- predict(fit.lasso, s = bests1, type = "coef",
                       mode = "fraction")

## one-standard-error rule
bound <- lasso.cv$cv[lasso.mcv] + lasso.cv$cv.error[lasso.mcv]
bests2 <- lasso.s[min(which(lasso.cv$cv < bound))]
lasso.coef2 <- coef(fit.lasso, s = bests2, mode = "fraction")
Consistency - Definition
Let the true parameters be β0 and their estimates β̂n; the subscript n indicates the sample size.

- Estimation consistency: β̂n − β0 →p 0, as n → ∞.
- Model selection consistency: P({j : β̂j ≠ 0} = {j : β0j ≠ 0}) → 1, as n → ∞.
- Sign consistency: P(β̂n =s β0) → 1, as n → ∞, where β̂n =s β0 ⇔ sign(β̂n) = sign(β0).

Sign consistency is stronger than model selection consistency.
Consistency of LASSO Estimator
Knight and Fu (2000) have shown:

- Estimation consistency: the LASSO solution is estimation consistent for fixed p; that is, β̂^lasso(λn) →p β, provided λn = o(n). It is root-n consistent and asymptotically normal.
- Model selection property: for λn ∝ n^{1/2}, as n → ∞, there is a non-vanishing positive probability that the lasso selects the true model.

The l1-penalization approach is called basis pursuit in signal processing (Chen, Donoho, and Saunders, 2001).
Model Selection Consistency of LASSO Estimator
Zhao and Yu (2006). On Model Selection Consistency of Lasso. JMLR, 7, 2541-2563.

- If an irrelevant predictor is highly correlated with the predictors in the true model, the lasso may not be able to distinguish it from the true predictors with any amount of data and any amount of regularization.
- Under the Irrepresentable Condition (IC), the LASSO is model selection consistent in both fixed-p and large-p settings.
- The IC can be written as

|[Xn(1)^T Xn(1)]^{-1} Xn(1)^T Xn(2)| < 1 − η,

i.e., the regression coefficients of the irrelevant covariates Xn(2) on the relevant covariates Xn(1) are constrained. The IC is almost necessary and sufficient for the lasso to select the true model.
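A numerical check of the displayed (sign-free) form of the IC might look as follows; `irrepresentable_ok`, `X1`, and `X2` are illustrative names, with X1 holding the relevant covariates and X2 the irrelevant ones:

```python
import numpy as np

def irrepresentable_ok(X1, X2, eta=0.1):
    # Regression coefficients of each irrelevant covariate (column of X2)
    # on the relevant covariates: C = (X1^T X1)^{-1} X1^T X2.
    C = np.linalg.solve(X1.T @ X1, X1.T @ X2)
    # Sign-free sufficient form of the IC: the L1 norm of each
    # coefficient column must stay below 1 - eta.
    return bool(np.max(np.sum(np.abs(C), axis=0)) < 1 - eta)
```

When an irrelevant column is nearly a linear combination of the relevant ones, the coefficient norm exceeds 1 and the check fails, matching the intuition that the lasso then cannot separate it from the true predictors.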
Special Cases for LASSO to be Selection Consistent
Assume the true model size is |A0| = q.
The underlying model must satisfy a nontrivial condition for lasso variable selection to be consistent (Zou, 2006).
The lasso is always selection consistent in the following special cases:
- when p = 2;
- when the design is orthogonal;
- when the covariates have bounded constant correlation, 0 < rn < 1/(1 + cq) for some c;
- when the design has power-decay correlation, corr(Xi, Xj) = ρn^{|i−j|} with |ρn| ≤ c < 1.
Ridge Penalty / Ridge vs LASSO
Ridge Regression (q = 2)
Apply the squared penalty to the coefficients:

min_β Σ_{i=1}^n (yi − Σ_{j=1}^d xij βj)^2 + λ Σ_{j=1}^d βj^2

- λ ≥ 0 is a tuning parameter.

The solution is

β̂^ridge = (X^T X + λ Id)^{-1} X^T y.

Note the similarity to the ordinary least squares solution, but with the addition of a "ridge" down the diagonal.
- As λ → 0, β̂^ridge → β̂^ols.
- As λ → ∞, β̂^ridge → 0.
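The closed-form solution is one line of linear algebra; a minimal NumPy sketch (illustrative, not the lecture's code):

```python
import numpy as np

def ridge(X, y, lam):
    # beta_ridge = (X^T X + lam * I_d)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With lam = 0 this reduces to the OLS solution (when X^T X is invertible), and for an orthonormal design it reproduces the proportional shrinkage β̂^ols/(1 + λ) of the next slide.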
Ridge Regression Solution
In the special case of an orthonormal design matrix,

β̂j^ridge = β̂j^ols / (1 + λ).

This illustrates the essential "shrinkage" feature of ridge regression:
- applying the ridge penalty shrinks the estimates toward zero;
- the penalty introduces bias but reduces the variance of the estimate.

The ridge estimator does not threshold, since the shrinkage is smooth (proportional to the original coefficient).
Invertibility
If the design matrix X is not of full rank, then
- X^T X is not invertible;
- for the OLS estimator, there is no unique solution β̂^ols.

This problem does not occur with ridge regression.
Theorem: For any design matrix X and any λ > 0, the quantity X^T X + λ Id is always invertible; thus there is always a unique solution β̂^ridge.
Bias and Variance
The variance of the ridge regression estimate is

Var(β̂^ridge) = σ^2 W X^T X W, where W = (X^T X + λ Id)^{-1}.

The bias of the ridge regression estimate is

Bias(β̂^ridge) = −λ W β.

It can be shown that
- the total variance Σ_{j=1}^d Var(β̂j^ridge) is monotone decreasing in λ, while
- the total squared bias Σ_{j=1}^d Bias^2(β̂j^ridge) is monotone increasing in λ.
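These two monotonicity claims can be checked numerically; the sketch below (illustrative, using the dimensionally consistent form of the bias, Bias(β̂^ridge) = −λWβ) evaluates the total variance and total squared bias at two values of λ on a simulated design:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # simulated design, n = 50, d = 3
beta0 = np.array([1.0, -2.0, 0.5])     # "true" coefficients for the check
sigma2 = 1.0

def total_var_bias2(lam):
    W = np.linalg.inv(X.T @ X + lam * np.eye(3))
    total_var = sigma2 * np.trace(W @ X.T @ X @ W)   # sum_j Var(beta_j^ridge)
    bias = -lam * W @ beta0                          # Bias(beta^ridge) = -lam W beta
    return total_var, float(bias @ bias)             # (total variance, total squared bias)

v_small, b_small = total_var_bias2(0.5)
v_large, b_large = total_var_bias2(5.0)
```

In the eigenbasis of X^T X the two totals are Σ σ² d_i/(d_i + λ)² and Σ (λ/(d_i + λ))² c_i², which makes both monotonicities explicit term by term.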
Mean Squared Error (MSE) of Ridge
Existence theorem: there always exists a λ > 0 such that the MSE of β̂^ridge is less than the MSE of β̂^ols.

This is a rather surprising result with somewhat radical implications: even if the model we fit is exactly correct and follows the exact distribution we specify, we can always obtain a better estimator (in terms of MSE) by shrinking toward zero.
Bayesian Interpretation of Ridge Estimation
Penalized regression can be interpreted in a Bayesian context:
- Suppose y = Xβ + ε, where ε ∼ N(0, σ^2 In).
- Given the prior β ∼ N(0, τ^2 Id), the posterior mean of β given the data is

(X^T X + (σ^2/τ^2) Id)^{-1} X^T y,

which is exactly the ridge estimate with λ = σ^2/τ^2.
R package: lm.ridge
library(MASS)
lm.ridge(formula, data, subset, na.action, lambda)

- formula: a formula expression, as for regression models, of the form response ~ predictors
- data: an optional data frame in which to interpret the variables occurring in formula
- subset: an expression saying which subset of the rows of the data should be used in the fit; all observations are included by default
- na.action: a function to filter missing data
- lambda: a scalar or vector of ridge constants

If an intercept is present, its coefficient is not penalized.
Output for lm.ridge
Returns a list with components
- coef: matrix of coefficients, one row for each value of lambda; note that these are not on the original scale and are for use by the coef method
- scales: scalings used on the X matrix
- Inter: was an intercept included?
- lambda: vector of lambda values
- ym: mean of y
- xm: column means of the x matrix
- GCV: vector of GCV values
Example for Ridge Regression
library(MASS)
longley
names(longley)[1] <- "y"
lm.ridge(y ~ ., longley)
plot(lm.ridge(y ~ ., longley,
lambda = seq(0,0.1,0.001)))
(demo)
[Figure 3.7 of Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning, Chapter 3: profiles of ridge coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 4.16, the value chosen by cross-validation.]
Lasso vs Ridge
Ridge regression:
- is a continuous shrinkage method; it achieves better prediction performance than OLS through a bias-variance trade-off;
- cannot produce a parsimonious model, for it always keeps all the predictors in the model.

The lasso:
- does both continuous shrinkage and automatic variable selection simultaneously.
Empirical Comparison
Tibshirani (1996) and Fu (1998) compared the prediction performance of the lasso, ridge, and bridge regression (Frank and Friedman, 1993) and found that
- none of them uniformly dominates the other two;
- however, as variable selection becomes increasingly important in modern data analysis, the lasso is much more appealing owing to its sparse representation.