Lecture 5: Variable Selection and Sparsity
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
High Dimensional Variable Selection
Linear Models
The simplest regression model in the world:
$$y = X\theta^* + \varepsilon.$$
Design Matrix: $X \in \mathbb{R}^{n \times d}$,
Response Vector: $y \in \mathbb{R}^{n}$,
Random Noise: $\varepsilon \sim N(0, \sigma^2 I_n)$.
$n > d$: Ordinary Least Squares Estimator (equivalent to the MLE)
$$\hat{\theta}^{o} = (X^\top X)^{-1} X^\top y.$$
$d \gg n$: $X^\top X$ is not invertible. What to do?
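As a quick numerical illustration (not part of the original slides), the following NumPy sketch contrasts the two regimes on synthetic data: when $n > d$ the normal equations have a unique solution, and when $d \gg n$ the matrix $X^\top X$ is rank-deficient. The dimensions and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Well-posed case: n > d, so X^T X is invertible and the OLS estimator is unique.
n, d = 100, 10
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
print("OLS error:", np.linalg.norm(theta_ols - theta_star))

# High-dimensional case: d >> n, so the d x d matrix X^T X has rank at most n < d.
# It is singular, and the normal equations have infinitely many solutions.
n, d = 20, 100
X = rng.standard_normal((n, d))
print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))  # at most n = 20, far less than d = 100
```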
Motivating Example: Credit Card
Motivating Example: Medical Imaging
Sparsity-Inducing Norm Regularization
[Diagram: the linear model $y = X\theta^* + \varepsilon$ with a wide design matrix $X \in \mathbb{R}^{n \times d}$, $n \ll d$, and a sparse coefficient vector $\theta^*$.]
Sparsity Assumption: $\sum_{j=1}^{d} \mathbb{1}(\theta^*_j \neq 0) = s \ll d$.
Greedy Selection and Ridge Estimator
What we learned in textbooks:
Forward Selection: It always increases the model size.
Backward Selection: It always decreases the model size.
Stepwise Selection: It dynamically adjusts the model size.
Hypothesis Testing: t-test for each coefficient.
Ridge Estimator: The model size is fixed.
This lecture is about: Lasso, Logistic Lasso, Graphical Lasso, Group Lasso, Elastic-net, Dantzig Selector, ...
Lasso and Ridge Regression
Lasso Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 \quad \text{subject to} \quad \|\theta\|_1 \le R,$$
where $R$ is a tuning parameter.
Ridge Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 \quad \text{subject to} \quad \|\theta\|_2 \le R.$$
Geometric Intuition
From Trevor Hastie's useR! 2009 slides — Linear regression via the Lasso (Tibshirani, 1996):
Given observations $\{(y_i, x_{i1}, \ldots, x_{ip})\}_{i=1}^{N}$,
$$\min_{\beta} \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t.$$
Similar to ridge regression, which has the constraint $\sum_{j} \beta_j^2 \le t$.
Lasso does variable selection and shrinkage, while ridge only shrinks.
[Figure: the $\ell_1$ ball (diamond) and the $\ell_2$ ball (circle) intersected with the contours of the least-squares loss.]
Regularized Least Squares Regression
Lasso (Tibshirani, 1996):
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1,$$
where $\lambda > 0$ is the regularization parameter.
Ridge Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2.$$
Remark: The $\ell_1$ norm can trap some coordinates at zero values.
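To see this trapping at zero in practice, here is a minimal scikit-learn sketch on synthetic data (an illustration added to this transcript, not from the slides). Note that scikit-learn's `Lasso` uses the same $\frac{1}{2n}$ scaling of the squared loss as above, while its `Ridge` penalizes $\|\theta\|_2^2$ without the $\frac{1}{2n}$ factor, so the `alpha` values are not directly comparable and are chosen arbitrarily here.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d, s = 100, 200, 5                      # high-dimensional: d > n, sparse truth
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = rng.standard_normal(s)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Lasso: (1/(2n))||y - X theta||_2^2 + alpha * ||theta||_1
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzeros (lasso):", np.sum(lasso.coef_ != 0))   # typically only a few coordinates survive

# Ridge: ||y - X theta||_2^2 + alpha * ||theta||_2^2 -- shrinks but does not zero out coefficients
ridge = Ridge(alpha=1.0).fit(X, y)
print("nonzeros (ridge):", np.sum(ridge.coef_ != 0))   # typically all d coordinates are nonzero
```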
Why does the $\ell_1$ norm work?
Best Subset Selection using the $\ell_0$ regularization:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_0,$$
where $\|\theta\|_0 = \sum_{j=1}^{d} \mathbb{1}(\theta_j \neq 0)$.
Differences:
Discontinuous vs. Continuous
Nonconvex vs. Convex
Unbiased vs. Biased
Why the $\ell_1$ norm works
[Figure: the $\ell_1$ and $\ell_0$ regularizers plotted as functions of $\theta_j$.]
Extensions to Generalized Linear Models
Logistic Lasso (Tibshirani, 1996):
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-y_i x_i^\top \theta)\big) + \lambda\|\theta\|_1.$$
Design Matrix: $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$,
Response Vector: $y = [y_1, \ldots, y_n]^\top \in \{-1, +1\}^n$.
ERM Framework: Loss + $\ell_1$ regularization:
Sparse Support Vector Machine,
Sparse LAD Regression,
...
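A hedged illustration of the logistic Lasso with scikit-learn (added here; the synthetic data and regularization strength are arbitrary). scikit-learn's $\ell_1$-penalized `LogisticRegression` minimizes $C\sum_i \log(1+\exp(-y_i x_i^\top\theta)) + \|\theta\|_1$, so `C` roughly plays the role of $1/(n\lambda)$ in the formulation above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, s = 200, 100, 5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = 2.0
y = np.where(X @ theta_star + 0.5 * rng.standard_normal(n) > 0, 1, -1)  # labels in {-1, +1}

# l1-penalized logistic regression; C is the inverse regularization strength (roughly 1/(n*lambda)).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("selected features:", np.flatnonzero(clf.coef_))   # typically close to the true support {0,...,4}
```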
Extensions to Undirected Graphical Models
Gaussian Graphical Models: $X = (X_1, \ldots, X_d) \in \mathbb{R}^{d} \sim N(0, \Sigma)$. Precision Matrix: $\Omega = \Sigma^{-1}$. $X_j$ and $X_k$ are independent given the other variables if $\Omega_{jk} = 0$. The sparsity pattern of $\Omega$ encodes the conditional independence graph $G = (V, E)$.
Graphical Lasso:
$$\hat{\Omega} = \arg\min_{\Omega} \; -\log|\Omega| + \mathrm{trace}(S^\top\Omega) + \lambda\sum_{j,k}|\Omega_{jk}|,$$
Data Matrix: $X = [x_1, \ldots, x_n]^\top$,
Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,
Empirical Covariance: $S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^\top$.
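A small sketch of the graphical lasso using `sklearn.covariance.GraphicalLasso` (an added illustration; the tridiagonal precision matrix and the value of `alpha` are arbitrary choices, not from the slides).

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
d, n = 10, 500

# Build a sparse (tridiagonal) precision matrix Omega and sample X_i ~ N(0, Omega^{-1}).
Omega = np.eye(d) + 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
Sigma = np.linalg.inv(Omega)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Estimate Omega by the graphical lasso; 'alpha' plays the role of lambda.
model = GraphicalLasso(alpha=0.05).fit(X)
Omega_hat = model.precision_
print("estimated nonzero pattern:")
print((np.abs(Omega_hat) > 1e-4).astype(int))   # ideally recovers the tridiagonal structure
```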
Examples of Undirected Graphical Models
[Figure: the estimated undirected graph for the Arabidopsis dataset; nodes correspond to genes such as MPDC1, FPPS1, HMGR1, DXR, and MECPS, and edges correspond to nonzero entries of the estimated precision matrix.]
Group Lasso
Linear Model with Group Structure:
$$y = \sum_{j=1}^{d} X_{G_j}\theta^*_{G_j} + \varepsilon,$$
where $X_{G_j} \in \mathbb{R}^{n \times m_j}$, $\theta^*_{G_j} \in \mathbb{R}^{m_j}$, and $G_j \cap G_k = \emptyset$.
Group Regularization:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\Big\|y - \sum_{j=1}^{d} X_{G_j}\theta_{G_j}\Big\|_2^2 + \lambda\|\theta\|_{1,p},$$
where $2 \le p \le \infty$ and $\|\theta\|_{1,p} = \sum_{j=1}^{d}\|\theta_{G_j}\|_p$.
Structural Sparsity Assumption: $\|\theta^*\|_{0,p} = s \ll d$.
Region Sparsity of Brain Medical Imaging
Group Regularization
The group regularization yields joint sparsity over each block of coefficients. What is the difference between the Ridge and the $\ell_2$ norm regularization?
[Figure: the $\ell_2$ and $\ell_\infty$ regularization functions.]
Extension to Multitask Regression
Multitask Regression Models:
$$Y = X\Theta^* + W.$$
Response Matrix: $Y \in \mathbb{R}^{n \times m}$,
Regression Coefficient Matrix: $\Theta^* \in \mathbb{R}^{d \times m}$,
Random Noise: $W$ has i.i.d. Gaussian entries.
Regularization Across Tasks:
$$\hat{\Theta} = \arg\min_{\Theta} \frac{1}{2n}\|Y - X\Theta\|_F^2 + \lambda\|\Theta\|_{1,p},$$
where $\|\Theta\|_{1,p} = \sum_{j=1}^{d}\big(\sum_{k=1}^{m}|\Theta_{jk}|^p\big)^{1/p}$.
Structural Sparsity Assumption: $\|\Theta^*\|_{0,p} = s \ll d$.
Elastic-net Regularization
Elastic-net Regularized Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2,$$
where $\lambda_1$ and $\lambda_2$ are regularization parameters.
Remark:
Extra tuning effort
Collinearity
Grouping effects
Eases computation
Elastic-net Regularization
Ridge Regularization:
$$\sum_{j=1}^{d}\theta_j^2 \propto \sum_{j>k}\big[(\theta_j - \theta_k)^2 + (\theta_j + \theta_k)^2\big].$$
The Ridge regularization encourages the shrinkage of the $\theta_j - \theta_k$'s and $\theta_j + \theta_k$'s for highly correlated variables.
Therefore, the elastic-net regularized regression tends to jointly select or remove highly correlated variables.
Extensions: Elastic-net Penalized Logistic/Poisson Regression
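The grouping effect can be seen on synthetic data with two nearly identical columns; the sketch below uses scikit-learn's `ElasticNet` (an added illustration). Note that scikit-learn parameterizes the penalty as $\alpha\big(\rho\|\theta\|_1 + \tfrac{1-\rho}{2}\|\theta\|_2^2\big)$ with `l1_ratio` $\rho$, which maps to $(\lambda_1, \lambda_2)$ only up to this reparameterization; the values used here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)   # two highly correlated columns
theta_star = np.zeros(d)
theta_star[0] = theta_star[1] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Elastic-net: (1/(2n))||y - X theta||_2^2 + alpha * (l1_ratio * ||theta||_1 + 0.5 * (1 - l1_ratio) * ||theta||_2^2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("coefficients on the correlated pair:", enet.coef_[:2])   # tend to be selected (and shrunk) together
```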
Dantzig Selector
Dantzig Selector:
$$\hat{\theta} = \arg\min_{\theta} \|\theta\|_1 \quad \text{subject to} \quad \frac{1}{n}\|X^\top(y - X\theta)\|_\infty \le \lambda.$$
General Form:
$$\hat{\theta} = \arg\min_{\theta} \mathcal{R}(\theta) \quad \text{subject to} \quad \mathcal{R}^*\big(\nabla\mathcal{L}(\theta)\big) \le \lambda.$$
Remark:
Essentially linear optimization
Similar performance
Less popular
Dantzig Selector as Linear Program
Parameter Decomposition: $\theta = \theta^+ - \theta^-$.
Reparametrization:
$$\min_{\theta^+,\theta^-} \; \mathbf{1}^\top\theta^+ + \mathbf{1}^\top\theta^-$$
$$\text{subject to} \quad X^\top(X\theta^+ - X\theta^- - y) \le \lambda\mathbf{1}, \quad -\lambda\mathbf{1} \le X^\top(X\theta^+ - X\theta^- - y), \quad \theta^+ \ge 0, \; \theta^- \ge 0.$$
Remark: Efficiently solved by existing linear programming solvers.
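Since the reparametrized problem is a standard linear program, it can be handed to a generic LP solver. The sketch below (an added illustration, using the $1/n$ scaling from the Dantzig selector definition above) relies on `scipy.optimize.linprog`; the function name and the synthetic data are ours, not from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve the Dantzig selector as a linear program over (theta_plus, theta_minus).

    min 1'theta_plus + 1'theta_minus
    s.t. -lam <= (1/n) X^T (X (theta_plus - theta_minus) - y) <= lam,
         theta_plus >= 0, theta_minus >= 0.
    """
    n, d = X.shape
    G = X.T @ X / n                               # (1/n) X^T X
    g = X.T @ y / n                               # (1/n) X^T y
    c = np.ones(2 * d)                            # objective = ||theta||_1 after the split
    A_ub = np.vstack([np.hstack([G, -G]),         #  (1/n) X^T (X theta - y) <= lam
                      np.hstack([-G, G])])        # -(1/n) X^T (X theta - y) <= lam
    b_ub = np.concatenate([lam + g, lam - g])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    z = res.x
    return z[:d] - z[d:]                          # theta = theta_plus - theta_minus

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
theta_star = np.zeros(20); theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(50)
theta_hat = dantzig_selector(X, y, lam=0.1)
print("estimated support:", np.flatnonzero(np.abs(theta_hat) > 1e-6))
```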
Statistical Properties
Parameter Estimation:
$$\text{Lasso:} \quad \|\hat{\theta} - \theta^*\|_2^2 = O_P\Big(\frac{s\log d}{n}\Big), \qquad \text{Group Lasso:} \quad \|\hat{\theta} - \theta^*\|_2^2 = O_P\Big(\frac{s\log d}{n} + \frac{s\,m_{\max}}{n}\Big).$$
Remark:
Restricted Eigenvalue Conditions
Light Tail Conditions
Scaling: $s\log d/n \to 0$, $s\,m_{\max}/n \to 0$
Statistical Properties
Variable Selection:
$$\text{Lasso:} \quad \mathbb{P}\big(\mathrm{sign}(\hat{\theta}) = \mathrm{sign}(\theta^*)\big) \to 1, \qquad \text{Group Lasso:} \quad \mathbb{P}\big(\mathrm{sign}(\hat{\theta}) = \mathrm{sign}(\theta^*)\big) \to 1.$$
Remark:
Restricted Eigenvalue Conditions + Irrepresentable Conditions
Light Tail Conditions
Scaling: $s\log d/n \to 0$, $s\,m_{\max}/n \to 0$
Statistical Properties
Excess Risk Bound:
$$\text{Lasso:} \quad \mathbb{E}\mathcal{L}(\hat{\theta}) - \mathbb{E}\mathcal{L}(\theta^*) = O_P\Big(\sqrt{\frac{s\log d}{n}}\Big), \qquad \text{Group Lasso:} \quad \mathbb{E}\mathcal{L}(\hat{\theta}) - \mathbb{E}\mathcal{L}(\theta^*) = O_P\Big(\sqrt{\frac{s\log d}{n}} + \sqrt{\frac{s\,m_{\max}}{n}}\Big).$$
Remark:
Statistical Learning Theory vs. Statistics
Bounded Design and Response Conditions
Scaling: $s\log d/n \to 0$, $s\,m_{\max}/n \to 0$
Nonsmooth Convex Optimization
Computational Algorithms
You may have heard:
1 Proximal Gradient Algorithm (Nesterov, 2007)
2 Accelerated Proximal Gradient Algorithm (Beck et al., 2009)
3 Coordinate Descent Algorithm (Friedman et al., 2007)
4 Accelerated Coordinate Descent Algorithm (Lin et al., 2014)
5 Extension to Stochastic Optimization and Parallel Optimization
Proximal Gradient Algorithm
The proximal gradient algorithm is the most fundamental computational algorithm for solving high-dimensional sparse estimation problems (Nesterov, 2007).
$$\hat{\theta} = \arg\min_{\theta} \; \underbrace{\mathcal{L}(\theta) + \mathcal{R}_\lambda(\theta)}_{\mathcal{F}_\lambda(\theta)}.$$
Remark:
Simple and easy to implement
Handles complex regularization
Software packages available in R
Proximal Gradient Algorithm
Given the solution $\theta^{(t)}$, we take
$$\theta^{(t+1)} = \arg\min_{\theta} \; \mathcal{L}(\theta^{(t)}) + (\theta - \theta^{(t)})^\top\nabla\mathcal{L}(\theta^{(t)}) + \frac{1}{2\eta_t}\|\theta - \theta^{(t)}\|_2^2 + \mathcal{R}_\lambda(\theta)$$
$$= \arg\min_{\theta} \; \frac{1}{2}\big\|\theta - \theta^{(t)} + \eta_t\nabla\mathcal{L}(\theta^{(t)})\big\|_2^2 + \eta_t\mathcal{R}_\lambda(\theta),$$
where $\eta_t$ is the step size parameter. Then we have
$$\theta^{(t+1)} = \mathcal{T}_{\eta_t\lambda}\big(\theta^{(t)} - \eta_t\nabla\mathcal{L}(\theta^{(t)})\big).$$
Proximal Gradient Algorithm
Lasso: At the $t$-th iteration,
$$\theta_j^{(t+1)} = \mathrm{sign}\big(\tilde{\theta}_j^{(t+1)}\big)\cdot\max\big\{|\tilde{\theta}_j^{(t+1)}| - \eta\lambda,\, 0\big\}, \quad \text{where} \quad \tilde{\theta}_j^{(t+1)} = \theta_j^{(t)} - \eta\nabla_j\mathcal{L}(\theta^{(t)}).$$
Group Lasso: At the $t$-th iteration,
$$\theta_{G_j}^{(t+1)} = \tilde{\theta}_{G_j}^{(t+1)}\cdot\max\bigg\{1 - \frac{\lambda\eta}{\|\tilde{\theta}_{G_j}^{(t+1)}\|_2},\, 0\bigg\}, \quad \text{where} \quad \tilde{\theta}_{G_j}^{(t+1)} = \theta_{G_j}^{(t)} - \eta\nabla_{G_j}\mathcal{L}(\theta^{(t)}).$$
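Putting the pieces together, here is a minimal NumPy sketch of the proximal gradient (ISTA) iteration for the Lasso using the soft-thresholding operator above, plus the block soft-thresholding operator used by the group Lasso (an added illustration; the fixed step size $1/L$ is one valid choice, and line search would also work).

```python
import numpy as np

def soft_threshold(u, t):
    """Elementwise soft-thresholding: sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_proximal_gradient(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for (1/(2n))||y - X theta||_2^2 + lam * ||theta||_1."""
    n, d = X.shape
    L = np.linalg.norm(X.T @ X / n, 2)              # Lipschitz constant of the smooth part's gradient
    eta = 1.0 / L                                    # fixed step size
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n             # gradient of the least-squares loss
        theta = soft_threshold(theta - eta * grad, eta * lam)
    return theta

def group_soft_threshold(u, t):
    """Block soft-thresholding used by the group lasso: shrink the whole block toward zero."""
    norm = np.linalg.norm(u)
    return max(1.0 - t / norm, 0.0) * u if norm > 0 else u
```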
Convergence Analysis
Sublinear Rate of Convergence:
$$T = O\Big(\frac{L}{\epsilon}\Big) \text{ iterations such that } \mathcal{F}_\lambda(\theta^{(T)}) - \mathcal{F}_\lambda(\hat{\theta}) \le \epsilon.$$
Remark:
$\nabla\mathcal{L}(\cdot)$ is Lipschitz continuous: $\|\nabla\mathcal{L}(\theta') - \nabla\mathcal{L}(\theta)\|_2 \le L\|\theta' - \theta\|_2$.
$L \le 1/\eta \le 2L$ (guaranteed by line search)
Accelerated version: $O\big(\sqrt{L/\epsilon}\big)$
Linear rate of convergence requires strong convexity.
Coordinate Descent Algorithm
The coordinate descent algorithm is the most famous computational algorithm for solving high-dimensional sparse estimation problems (Friedman et al., 2007, 2010).
Simple and easy to implement
Extremely efficient when the solution is sparse
High precision
Decomposable regularization: $\mathcal{R}_\lambda(\theta) = \sum_{j=1}^{d} r_\lambda(\theta_j)$
Randomized Coordinate Descent Algorithm
At the $t$-th iteration, we sample $j$ from $\{1, \ldots, d\}$ with equal probability, and take
$$\theta_j^{(t+1)} = \arg\min_{\theta_j} \; \mathcal{L}(\theta^{(t)}) + \big(\theta_j - \theta_j^{(t)}\big)\nabla_j\mathcal{L}(\theta^{(t)}) + \frac{1}{2\eta_j}\big(\theta_j - \theta_j^{(t)}\big)^2 + \mathcal{R}_\lambda\big(\theta_{\setminus j}^{(t)}\big) + r_\lambda(\theta_j)$$
$$= \arg\min_{\theta_j} \; \frac{1}{2}\big(\theta_j - \theta_j^{(t)} + \eta_j\nabla_j\mathcal{L}(\theta^{(t)})\big)^2 + \eta_j r_\lambda(\theta_j),$$
where $\eta_j$ is the step size parameter. Then we have
$$\theta_j^{(t+1)} = \mathcal{T}_{\eta_j\lambda}\big(\theta_j^{(t)} - \eta_j\nabla_j\mathcal{L}(\theta^{(t)})\big) \quad \text{and} \quad \theta_{\setminus j}^{(t+1)} = \theta_{\setminus j}^{(t)}.$$
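A minimal sketch of randomized coordinate descent for the Lasso (an added illustration). For the least-squares loss, $M_j = \|x_j\|_2^2/n$, each coordinate update is a closed-form soft-thresholding, and the residual $r = y - X\theta$ is maintained incrementally (the partial residual update trick mentioned on the next slide).

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_epochs=100, seed=0):
    """Randomized coordinate descent for (1/(2n))||y - X theta||_2^2 + lam * ||theta||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    col_sq = (X ** 2).sum(axis=0) / n        # M_j = ||x_j||_2^2 / n, the coordinatewise Lipschitz constants
    theta = np.zeros(d)
    r = y.copy()                             # residual y - X theta (theta = 0 initially)
    for _ in range(n_epochs * d):
        j = rng.integers(d)                  # sample a coordinate uniformly at random
        grad_j = -X[:, j] @ r / n            # nabla_j L(theta)
        z = theta[j] - grad_j / col_sq[j]    # gradient step with eta_j = 1 / M_j
        new_j = np.sign(z) * max(abs(z) - lam / col_sq[j], 0.0)   # soft-thresholding
        r += X[:, j] * (theta[j] - new_j)    # partial residual update for the changed coordinate
        theta[j] = new_j
    return theta
```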
Convergence Analysis
Sublinear Rate of Convergence:
$$T = O\Big(\frac{d\max_j M_j}{\epsilon}\Big) \text{ iterations such that } \mathbb{E}\,\mathcal{F}_\lambda(\theta^{(T)}) - \mathcal{F}_\lambda(\hat{\theta}) \le \epsilon.$$
Remark:
$\nabla_j\mathcal{L}(\cdot, \theta_{\setminus j})$ is Lipschitz continuous for all $j = 1, \ldots, d$: $|\nabla_j\mathcal{L}(\theta_j', \theta_{\setminus j}) - \nabla_j\mathcal{L}(\theta_j, \theta_{\setminus j})| \le M_j|\theta_j' - \theta_j|$.
$1/\eta_j = M_j$ (often explicitly calculated)
Accelerated version: $O\big(d\sqrt{\max_j M_j/\epsilon}\big)$
Partial Residual Update Trick.
Warm Start Initialization
Regularization sequence $\{\lambda_K\}_{K=0}^{N}$: $\lambda_0 = \frac{1}{n}\|X^\top y\|_\infty$.
Solutions $\{\hat{\theta}_K\}_{K=0}^{N}$ go from sparse to dense: $\hat{\theta}_0 = \mathbf{0}$.
[Diagram: each problem $\min_\theta \mathcal{L}(\theta) + \mathcal{R}_{\lambda_{K+1}}(\theta)$ with $\lambda_{K+1} = 0.96\,\lambda_K$ is initialized at the previous solution $\hat{\theta}_K$, and so on along the path.]
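A sketch of the warm-start (pathwise) strategy (an added illustration): solve the Lasso along a geometrically decreasing $\lambda$ sequence starting from $\lambda_0 = \frac{1}{n}\|X^\top y\|_\infty$, initializing each problem at the previous solution. The inner solver here is the proximal gradient step from above; the decay ratio 0.96 follows the diagram, while the number of grid points and iterations are arbitrary.

```python
import numpy as np

def lasso_path(X, y, n_lambdas=20, ratio=0.96, n_iter=200):
    """Pathwise Lasso with warm starts: solve along a decreasing lambda sequence,
    initializing each problem at the previous solution (sparse -> dense)."""
    n, d = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / n          # lambda_0 = (1/n)||X^T y||_inf: the solution is 0
    lambdas = lam_max * ratio ** np.arange(n_lambdas)
    L = np.linalg.norm(X.T @ X / n, 2)             # step size 1/L for the proximal gradient steps
    theta = np.zeros(d)                            # warm start for the first (largest) lambda
    path = []
    for lam in lambdas:
        for _ in range(n_iter):
            grad = X.T @ (X @ theta - y) / n
            u = theta - grad / L
            theta = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
        path.append(theta.copy())                  # this solution warm-starts the next lambda
    return lambdas, np.array(path)
```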
Active Set Strategy
[Diagram sequence: the coordinates $1, \ldots, 12$ are partitioned into an active set $A$ and an inactive set $\bar{A}$; updates are restricted to the active coordinates (e.g., $\{2, 5, 8, 11\}$), and the active set is periodically revised until the optimality conditions hold for all coordinates.]
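A rough sketch of the active set strategy for the Lasso (an added illustration, not the course's implementation): coordinate descent is run only over the currently active coordinates, and the inactive coordinates are periodically scanned for violations of the optimality condition $\big|\frac{1}{n}x_j^\top(y - X\theta)\big| \le \lambda$; violators are moved into the active set.

```python
import numpy as np

def lasso_active_set(X, y, lam, max_outer=20, inner_iter=50, tol=1e-6):
    """Active-set sketch: optimize over the active coordinates only, then grow the active set."""
    n, d = X.shape
    theta = np.zeros(d)
    active = np.zeros(d, dtype=bool)
    for _ in range(max_outer):
        # Inner loop: cyclic coordinate descent restricted to the active set.
        for _ in range(inner_iter):
            for j in np.flatnonzero(active):
                r = y - X @ theta + X[:, j] * theta[j]            # partial residual excluding coordinate j
                z = X[:, j] @ r / n
                theta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / ((X[:, j] ** 2).sum() / n)
        # Outer check: does any inactive coordinate violate |(1/n) x_j^T (y - X theta)| <= lambda?
        grad = X.T @ (y - X @ theta) / n
        violations = (~active) & (np.abs(grad) > lam + tol)
        if not violations.any():
            break                                                 # optimality conditions hold everywhere
        active |= violations                                      # grow the active set and repeat
    return theta
```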
Solution Path
[Figure: Lasso and Elastic-net solution paths.]
Proximal Newton Algorithm
Given the solution $\theta^{(t)}$, we take
$$\theta^{(t+0.5)} = \arg\min_{\theta} \; \mathcal{L}(\theta^{(t)}) + (\theta - \theta^{(t)})^\top\nabla\mathcal{L}(\theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^\top\nabla^2\mathcal{L}(\theta^{(t)})(\theta - \theta^{(t)}) + \mathcal{R}_\lambda(\theta).$$
Combined with the backtracking line search, we have
$$\theta^{(t+1)} = \theta^{(t)} + \eta\big(\theta^{(t+0.5)} - \theta^{(t)}\big).$$
Remark: Each subproblem is solved by the coordinate descent algorithm. The proximal Newton algorithm can be much more efficient than the coordinate descent algorithm in practice.
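A hedged sketch of a proximal Newton step for the logistic Lasso (an added illustration): the local quadratic model of the loss is built from the logistic gradient and Hessian, and the quadratic-plus-$\ell_1$ subproblem is solved by cyclic coordinate descent. The small ridge term on the Hessian and the fixed step size (backtracking line search omitted) are simplifications of ours, not from the slides.

```python
import numpy as np
from scipy.special import expit

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def logistic_lasso_prox_newton(X, y, lam, outer=20, inner=50, eta=1.0):
    """Proximal Newton sketch for the l1-penalized logistic loss (labels y in {-1, +1})."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(outer):
        m = y * (X @ theta)
        p = expit(m)                                # probability of the observed label under the model
        grad = -(X.T @ (y * (1.0 - p))) / n         # gradient of the logistic loss
        w = p * (1.0 - p)                           # Hessian weights
        H = (X.T * w) @ X / n + 1e-8 * np.eye(d)    # Hessian of the loss (tiny ridge added for stability)
        # Solve min_z grad'(z - theta) + 0.5 (z - theta)' H (z - theta) + lam ||z||_1 by coordinate descent.
        z = theta.copy()
        for _ in range(inner):
            for j in range(d):
                u = grad[j] + H[j] @ (z - theta) - H[j, j] * (z[j] - theta[j])
                z[j] = soft(theta[j] - u / H[j, j], lam / H[j, j])
        theta = theta + eta * (z - theta)           # Newton-type step (backtracking line search omitted)
    return theta
```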
Software Libraries
Available Packages
glmnet: Lasso, Logistic/Poisson Lasso. Developed by J. Friedman; maintained by T. Hastie.
PICASSO: Lasso, Logistic/Poisson Lasso.
huge: Graphical Lasso.
liblinear: Sparse Support Vector Machine, Logistic Lasso. Developed by C. Jin; maintained by T. Helleputte.
QUIC: Graphical Lasso. Developed by C. Hsieh; maintained by M. Sustik.