Lecture 5: Variable Selection and Sparsity
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
High Dimensional Variable Selection
Linear Models
The simplest regression model in the world:
$$y = X\theta^* + \varepsilon.$$
Design Matrix: $X \in \mathbb{R}^{n \times d}$,
Response Vector: $y \in \mathbb{R}^{n}$,
Random Noise: $\varepsilon \sim N(0, \sigma^2 I_n)$.
$n > d$: Ordinary Least Squares Estimator (equivalent to the MLE)
$$\hat{\theta}^{o} = (X^\top X)^{-1} X^\top y.$$
$d \gg n$: $X^\top X$ is not invertible. What to do?
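As a quick numerical illustration (not part of the original slides), the following NumPy sketch contrasts the two regimes on synthetic data: when $n > d$ the normal equations have a unique solution, and when $d \gg n$ the matrix $X^\top X$ is rank-deficient. The dimensions and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Well-posed case: n > d, so X^T X is invertible and the OLS estimator is unique.
n, d = 100, 10
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
print("OLS error:", np.linalg.norm(theta_ols - theta_star))

# High-dimensional case: d >> n, so the d x d matrix X^T X has rank at most n < d.
# It is singular, and the normal equations have infinitely many solutions.
n, d = 20, 100
X = rng.standard_normal((n, d))
print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))  # at most n = 20, far less than d = 100
```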
Motivating Example: Credit Card
Motivating Example: Medical Imaging
Sparsity-Inducing Norm Regularization
[Diagram: the linear model $y = X\theta^* + \varepsilon$ with a wide design matrix $X \in \mathbb{R}^{n \times d}$, $n \ll d$, and a sparse coefficient vector $\theta^*$.]
Sparsity Assumption: $\sum_{j=1}^{d} \mathbb{1}(\theta^*_j \neq 0) = s \ll d$.
Greedy Selection and Ridge Estimator
What we learned in textbooks:
Forward Selection: It always increases the model size.
Backward Selection: It always decreases the model size.
Stepwise Selection: It dynamically adjusts the model size.
Hypothesis Testing: t-test for each coefficient.
Ridge Estimator: The model size is fixed.
This lecture is about: Lasso, Logistic Lasso, Graphical Lasso, Group Lasso, Elastic-net, Dantzig Selector, ...
Lasso and Ridge Regression
Lasso Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 \quad \text{subject to} \quad \|\theta\|_1 \le R,$$
where $R$ is a tuning parameter.
Ridge Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 \quad \text{subject to} \quad \|\theta\|_2 \le R.$$
Geometric Intuition
From Trevor Hastie's useR! 2009 slides — Linear regression via the Lasso (Tibshirani, 1996):
Given observations $\{(y_i, x_{i1}, \ldots, x_{ip})\}_{i=1}^{N}$,
$$\min_{\beta} \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t.$$
Similar to ridge regression, which has the constraint $\sum_{j} \beta_j^2 \le t$.
Lasso does variable selection and shrinkage, while ridge only shrinks.
[Figure: the $\ell_1$ ball (diamond) and the $\ell_2$ ball (circle) intersected with the contours of the least-squares loss.]
Regularized Least Squares Regression
Lasso (Tibshirani, 1996):
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1,$$
where $\lambda > 0$ is the regularization parameter.
Ridge Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2.$$
Remark: The $\ell_1$ norm can trap some coordinates at zero values.
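To see this trapping at zero in practice, here is a minimal scikit-learn sketch on synthetic data (an illustration added to this transcript, not from the slides). Note that scikit-learn's `Lasso` uses the same $\frac{1}{2n}$ scaling of the squared loss as above, while its `Ridge` penalizes $\|\theta\|_2^2$ without the $\frac{1}{2n}$ factor, so the `alpha` values are not directly comparable and are chosen arbitrarily here.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d, s = 100, 200, 5                      # high-dimensional: d > n, sparse truth
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = rng.standard_normal(s)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Lasso: (1/(2n))||y - X theta||_2^2 + alpha * ||theta||_1
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzeros (lasso):", np.sum(lasso.coef_ != 0))   # typically only a few coordinates survive

# Ridge: ||y - X theta||_2^2 + alpha * ||theta||_2^2 -- shrinks but does not zero out coefficients
ridge = Ridge(alpha=1.0).fit(X, y)
print("nonzeros (ridge):", np.sum(ridge.coef_ != 0))   # typically all d coordinates are nonzero
```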
Why does the $\ell_1$ norm work?
Best Subset Selection using the $\ell_0$ regularization:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_0,$$
where $\|\theta\|_0 = \sum_{j=1}^{d} \mathbb{1}(\theta_j \neq 0)$.
Differences:
Discontinuous vs. Continuous
Nonconvex vs. Convex
Unbiased vs. Biased
Why the $\ell_1$ norm works
[Figure: the $\ell_1$ and $\ell_0$ regularizers plotted as functions of $\theta_j$.]
Extensions to Generalized Linear Models
Logistic Lasso (Tibshirani, 1996):
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-y_i x_i^\top \theta)\big) + \lambda\|\theta\|_1.$$
Design Matrix: $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$,
Response Vector: $y = [y_1, \ldots, y_n]^\top \in \{-1, +1\}^n$.
ERM Framework: Loss + $\ell_1$ regularization:
Sparse Support Vector Machine,
Sparse LAD Regression,
...
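A hedged illustration of the logistic Lasso with scikit-learn (added here; the synthetic data and regularization strength are arbitrary). scikit-learn's $\ell_1$-penalized `LogisticRegression` minimizes $C\sum_i \log(1+\exp(-y_i x_i^\top\theta)) + \|\theta\|_1$, so `C` roughly plays the role of $1/(n\lambda)$ in the formulation above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, s = 200, 100, 5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = 2.0
y = np.where(X @ theta_star + 0.5 * rng.standard_normal(n) > 0, 1, -1)  # labels in {-1, +1}

# l1-penalized logistic regression; C is the inverse regularization strength (roughly 1/(n*lambda)).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("selected features:", np.flatnonzero(clf.coef_))   # typically close to the true support {0,...,4}
```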
Extensions to Undirected Graphical Models
Gaussian Graphical Models: $X = (X_1, \ldots, X_d) \in \mathbb{R}^{d} \sim N(0, \Sigma)$. Precision Matrix: $\Omega = \Sigma^{-1}$. $X_j$ and $X_k$ are independent given the other variables if $\Omega_{jk} = 0$. The sparsity pattern of $\Omega$ encodes the conditional independence graph $G = (V, E)$.
Graphical Lasso:
$$\hat{\Omega} = \arg\min_{\Omega} \; -\log|\Omega| + \mathrm{trace}(S^\top\Omega) + \lambda\sum_{j,k}|\Omega_{jk}|,$$
Data Matrix: $X = [x_1, \ldots, x_n]^\top$,
Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,
Empirical Covariance: $S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^\top$.
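A small sketch of the graphical lasso using `sklearn.covariance.GraphicalLasso` (an added illustration; the tridiagonal precision matrix and the value of `alpha` are arbitrary choices, not from the slides).

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
d, n = 10, 500

# Build a sparse (tridiagonal) precision matrix Omega and sample X_i ~ N(0, Omega^{-1}).
Omega = np.eye(d) + 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
Sigma = np.linalg.inv(Omega)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Estimate Omega by the graphical lasso; 'alpha' plays the role of lambda.
model = GraphicalLasso(alpha=0.05).fit(X)
Omega_hat = model.precision_
print("estimated nonzero pattern:")
print((np.abs(Omega_hat) > 1e-4).astype(int))   # ideally recovers the tridiagonal structure
```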
Examples of Undirected Graphical Models
[Figure: the estimated undirected graph for the Arabidopsis dataset; nodes correspond to genes such as MPDC1, FPPS1, HMGR1, DXR, and MECPS, and edges correspond to nonzero entries of the estimated precision matrix.]
Group Lasso
Linear Model with Group Structure:
$$y = \sum_{j=1}^{d} X_{G_j}\theta^*_{G_j} + \varepsilon,$$
where $X_{G_j} \in \mathbb{R}^{n \times m_j}$, $\theta^*_{G_j} \in \mathbb{R}^{m_j}$, and $G_j \cap G_k = \emptyset$.
Group Regularization:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\Big\|y - \sum_{j=1}^{d} X_{G_j}\theta_{G_j}\Big\|_2^2 + \lambda\|\theta\|_{1,p},$$
where $2 \le p \le \infty$ and $\|\theta\|_{1,p} = \sum_{j=1}^{d}\|\theta_{G_j}\|_p$.
Structural Sparsity Assumption: $\|\theta^*\|_{0,p} = s \ll d$.
Region Sparsity of Brain Medical Imaging
Group Regularization
The group regularization yields joint sparsity over each block of coefficients. What is the difference between the Ridge and the $\ell_2$ norm regularization?
[Figure: the $\ell_2$ and $\ell_\infty$ regularization functions.]
Extension to Multitask Regression
Multitask Regression Models:
$$Y = X\Theta^* + W.$$
Response Matrix: $Y \in \mathbb{R}^{n \times m}$,
Regression Coefficient Matrix: $\Theta^* \in \mathbb{R}^{d \times m}$,
Random Noise: $W$ has i.i.d. Gaussian entries.
Regularization Across Tasks:
$$\hat{\Theta} = \arg\min_{\Theta} \frac{1}{2n}\|Y - X\Theta\|_F^2 + \lambda\|\Theta\|_{1,p},$$
where $\|\Theta\|_{1,p} = \sum_{j=1}^{d}\big(\sum_{k=1}^{m}|\Theta_{jk}|^p\big)^{1/p}$.
Structural Sparsity Assumption: $\|\Theta^*\|_{0,p} = s \ll d$.
Elastic-net Regularization
Elastic-net Regularized Regression:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2,$$
where $\lambda_1$ and $\lambda_2$ are regularization parameters.
Remark:
Extra tuning effort
Collinearity
Grouping effects
Eases computation
Elastic-net Regularization
Ridge Regularization:
$$\sum_{j=1}^{d}\theta_j^2 \propto \sum_{j>k}\big[(\theta_j - \theta_k)^2 + (\theta_j + \theta_k)^2\big].$$
The Ridge regularization encourages the shrinkage of the $\theta_j - \theta_k$'s and $\theta_j + \theta_k$'s for highly correlated variables.
Therefore, the elastic-net regularized regression tends to jointly select or remove highly correlated variables.
Extensions: Elastic-net Penalized Logistic/Poisson Regression
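The grouping effect can be seen on synthetic data with two nearly identical columns; the sketch below uses scikit-learn's `ElasticNet` (an added illustration). Note that scikit-learn parameterizes the penalty as $\alpha\big(\rho\|\theta\|_1 + \tfrac{1-\rho}{2}\|\theta\|_2^2\big)$ with `l1_ratio` $\rho$, which maps to $(\lambda_1, \lambda_2)$ only up to this reparameterization; the values used here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)   # two highly correlated columns
theta_star = np.zeros(d)
theta_star[0] = theta_star[1] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Elastic-net: (1/(2n))||y - X theta||_2^2 + alpha * (l1_ratio * ||theta||_1 + 0.5 * (1 - l1_ratio) * ||theta||_2^2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("coefficients on the correlated pair:", enet.coef_[:2])   # tend to be selected (and shrunk) together
```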
Dantzig Selector
Dantzig Selector:
$$\hat{\theta} = \arg\min_{\theta} \|\theta\|_1 \quad \text{subject to} \quad \frac{1}{n}\|X^\top(y - X\theta)\|_\infty \le \lambda.$$
General Form:
$$\hat{\theta} = \arg\min_{\theta} \mathcal{R}(\theta) \quad \text{subject to} \quad \mathcal{R}^*\big(\nabla\mathcal{L}(\theta)\big) \le \lambda.$$
Remark:
Essentially linear optimization
Similar performance
Less popular
Dantzig Selector as Linear Program
Parameter Decomposition: $\theta = \theta^+ - \theta^-$.
Reparametrization:
$$\min_{\theta^+,\theta^-} \; \mathbf{1}^\top\theta^+ + \mathbf{1}^\top\theta^-$$
$$\text{subject to} \quad X^\top(X\theta^+ - X\theta^- - y) \le \lambda\mathbf{1}, \quad -\lambda\mathbf{1} \le X^\top(X\theta^+ - X\theta^- - y), \quad \theta^+ \ge 0, \; \theta^- \ge 0.$$
Remark: Efficiently solved by existing linear programming solvers.
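Since the reparametrized problem is a standard linear program, it can be handed to a generic LP solver. The sketch below (an added illustration, using the $1/n$ scaling from the Dantzig selector definition above) relies on `scipy.optimize.linprog`; the function name and the synthetic data are ours, not from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve the Dantzig selector as a linear program over (theta_plus, theta_minus).

    min 1'theta_plus + 1'theta_minus
    s.t. -lam <= (1/n) X^T (X (theta_plus - theta_minus) - y) <= lam,
         theta_plus >= 0, theta_minus >= 0.
    """
    n, d = X.shape
    G = X.T @ X / n                               # (1/n) X^T X
    g = X.T @ y / n                               # (1/n) X^T y
    c = np.ones(2 * d)                            # objective = ||theta||_1 after the split
    A_ub = np.vstack([np.hstack([G, -G]),         #  (1/n) X^T (X theta - y) <= lam
                      np.hstack([-G, G])])        # -(1/n) X^T (X theta - y) <= lam
    b_ub = np.concatenate([lam + g, lam - g])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    z = res.x
    return z[:d] - z[d:]                          # theta = theta_plus - theta_minus

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
theta_star = np.zeros(20); theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(50)
theta_hat = dantzig_selector(X, y, lam=0.1)
print("estimated support:", np.flatnonzero(np.abs(theta_hat) > 1e-6))
```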
Statistical Properties
Parameter Estimation:
$$\text{Lasso:} \quad \|\hat{\theta} - \theta^*\|_2^2 = O_P\Big(\frac{s\log d}{n}\Big), \qquad \text{Group Lasso:} \quad \|\hat{\theta} - \theta^*\|_2^2 = O_P\Big(\frac{s\log d}{n} + \frac{s\,m_{\max}}{n}\Big).$$
Remark:
Restricted Eigenvalue Conditions
Light Tail Conditions
Scaling: $s\log d/n \to 0$, $s\,m_{\max}/n \to 0$
Statistical Properties
Variable Selection:
$$\text{Lasso:} \quad \mathbb{P}\big(\mathrm{sign}(\hat{\theta}) = \mathrm{sign}(\theta^*)\big) \to 1, \qquad \text{Group Lasso:} \quad \mathbb{P}\big(\mathrm{sign}(\hat{\theta}) = \mathrm{sign}(\theta^*)\big) \to 1.$$
Remark:
Restricted Eigenvalue Conditions + Irrepresentable Conditions
Light Tail Conditions
Scaling: $s\log d/n \to 0$, $s\,m_{\max}/n \to 0$
Statistical Properties
Excess Risk Bound:
$$\text{Lasso:} \quad \mathbb{E}\mathcal{L}(\hat{\theta}) - \mathbb{E}\mathcal{L}(\theta^*) = O_P\Big(\sqrt{\frac{s\log d}{n}}\Big), \qquad \text{Group Lasso:} \quad \mathbb{E}\mathcal{L}(\hat{\theta}) - \mathbb{E}\mathcal{L}(\theta^*) = O_P\Big(\sqrt{\frac{s\log d}{n}} + \sqrt{\frac{s\,m_{\max}}{n}}\Big).$$
Remark:
Statistical Learning Theory vs. Statistics
Bounded Design and Response Conditions
Scaling: $s\log d/n \to 0$, $s\,m_{\max}/n \to 0$
Nonsmooth Convex Optimization
Computational Algorithms
You may have heard:
1 Proximal Gradient Algorithm (Nesterov, 2007)
2 Accelerated Proximal Gradient Algorithm (Beck et al., 2009)
3 Coordinate Descent Algorithm (Friedman et al., 2007)
4 Accelerated Coordinate Descent Algorithm (Lin et al., 2014)
5 Extension to Stochastic Optimization and Parallel Optimization
Proximal Gradient Algorithm
The proximal gradient algorithm is the most fundamental computational algorithm for solving high-dimensional sparse estimation problems (Nesterov, 2007).
$$\hat{\theta} = \arg\min_{\theta} \; \underbrace{\mathcal{L}(\theta) + \mathcal{R}_\lambda(\theta)}_{\mathcal{F}_\lambda(\theta)}.$$
Remark:
Simple and easy to implement
Handles complex regularization
Software packages available in R
Proximal Gradient Algorithm
Given the solution $\theta^{(t)}$, we take
$$\theta^{(t+1)} = \arg\min_{\theta} \; \mathcal{L}(\theta^{(t)}) + (\theta - \theta^{(t)})^\top\nabla\mathcal{L}(\theta^{(t)}) + \frac{1}{2\eta_t}\|\theta - \theta^{(t)}\|_2^2 + \mathcal{R}_\lambda(\theta)$$
$$= \arg\min_{\theta} \; \frac{1}{2}\big\|\theta - \theta^{(t)} + \eta_t\nabla\mathcal{L}(\theta^{(t)})\big\|_2^2 + \eta_t\mathcal{R}_\lambda(\theta),$$
where $\eta_t$ is the step size parameter. Then we have
$$\theta^{(t+1)} = \mathcal{T}_{\eta_t\lambda}\big(\theta^{(t)} - \eta_t\nabla\mathcal{L}(\theta^{(t)})\big).$$
Proximal Gradient Algorithm
Lasso: At the $t$-th iteration,
$$\theta_j^{(t+1)} = \mathrm{sign}\big(\tilde{\theta}_j^{(t+1)}\big)\cdot\max\big\{|\tilde{\theta}_j^{(t+1)}| - \eta\lambda,\, 0\big\}, \quad \text{where} \quad \tilde{\theta}_j^{(t+1)} = \theta_j^{(t)} - \eta\nabla_j\mathcal{L}(\theta^{(t)}).$$
Group Lasso: At the $t$-th iteration,
$$\theta_{G_j}^{(t+1)} = \tilde{\theta}_{G_j}^{(t+1)}\cdot\max\bigg\{1 - \frac{\lambda\eta}{\|\tilde{\theta}_{G_j}^{(t+1)}\|_2},\, 0\bigg\}, \quad \text{where} \quad \tilde{\theta}_{G_j}^{(t+1)} = \theta_{G_j}^{(t)} - \eta\nabla_{G_j}\mathcal{L}(\theta^{(t)}).$$
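Putting the pieces together, here is a minimal NumPy sketch of the proximal gradient (ISTA) iteration for the Lasso using the soft-thresholding operator above, plus the block soft-thresholding operator used by the group Lasso (an added illustration; the fixed step size $1/L$ is one valid choice, and line search would also work).

```python
import numpy as np

def soft_threshold(u, t):
    """Elementwise soft-thresholding: sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_proximal_gradient(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for (1/(2n))||y - X theta||_2^2 + lam * ||theta||_1."""
    n, d = X.shape
    L = np.linalg.norm(X.T @ X / n, 2)              # Lipschitz constant of the smooth part's gradient
    eta = 1.0 / L                                    # fixed step size
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n             # gradient of the least-squares loss
        theta = soft_threshold(theta - eta * grad, eta * lam)
    return theta

def group_soft_threshold(u, t):
    """Block soft-thresholding used by the group lasso: shrink the whole block toward zero."""
    norm = np.linalg.norm(u)
    return max(1.0 - t / norm, 0.0) * u if norm > 0 else u
```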
Convergence Analysis
Sublinear Rate of Convergence:
$$T = O\Big(\frac{L}{\epsilon}\Big) \text{ iterations such that } \mathcal{F}_\lambda(\theta^{(T)}) - \mathcal{F}_\lambda(\hat{\theta}) \le \epsilon.$$
Remark:
$\nabla\mathcal{L}(\cdot)$ is Lipschitz continuous: $\|\nabla\mathcal{L}(\theta') - \nabla\mathcal{L}(\theta)\|_2 \le L\|\theta' - \theta\|_2$.
$L \le 1/\eta \le 2L$ (guaranteed by line search)
Accelerated version: $O\big(\sqrt{L/\epsilon}\big)$
Linear rate of convergence requires strong convexity.
Coordinate Descent Algorithm
The coordinate descent algorithm is the most famous computational algorithm for solving high-dimensional sparse estimation problems (Friedman et al., 2007, 2010).
Simple and easy to implement
Extremely efficient when the solution is sparse
High precision
Decomposable regularization: $\mathcal{R}_\lambda(\theta) = \sum_{j=1}^{d} r_\lambda(\theta_j)$
Randomized Coordinate Descent Algorithm
At the $t$-th iteration, we sample $j$ from $\{1, \ldots, d\}$ with equal probability, and take
$$\theta_j^{(t+1)} = \arg\min_{\theta_j} \; \mathcal{L}(\theta^{(t)}) + \big(\theta_j - \theta_j^{(t)}\big)\nabla_j\mathcal{L}(\theta^{(t)}) + \frac{1}{2\eta_j}\big(\theta_j - \theta_j^{(t)}\big)^2 + \mathcal{R}_\lambda\big(\theta_{\setminus j}^{(t)}\big) + r_\lambda(\theta_j)$$
$$= \arg\min_{\theta_j} \; \frac{1}{2}\big(\theta_j - \theta_j^{(t)} + \eta_j\nabla_j\mathcal{L}(\theta^{(t)})\big)^2 + \eta_j r_\lambda(\theta_j),$$
where $\eta_j$ is the step size parameter. Then we have
$$\theta_j^{(t+1)} = \mathcal{T}_{\eta_j\lambda}\big(\theta_j^{(t)} - \eta_j\nabla_j\mathcal{L}(\theta^{(t)})\big) \quad \text{and} \quad \theta_{\setminus j}^{(t+1)} = \theta_{\setminus j}^{(t)}.$$
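A minimal sketch of randomized coordinate descent for the Lasso (an added illustration). For the least-squares loss, $M_j = \|x_j\|_2^2/n$, each coordinate update is a closed-form soft-thresholding, and the residual $r = y - X\theta$ is maintained incrementally (the partial residual update trick mentioned on the next slide).

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_epochs=100, seed=0):
    """Randomized coordinate descent for (1/(2n))||y - X theta||_2^2 + lam * ||theta||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    col_sq = (X ** 2).sum(axis=0) / n        # M_j = ||x_j||_2^2 / n, the coordinatewise Lipschitz constants
    theta = np.zeros(d)
    r = y.copy()                             # residual y - X theta (theta = 0 initially)
    for _ in range(n_epochs * d):
        j = rng.integers(d)                  # sample a coordinate uniformly at random
        grad_j = -X[:, j] @ r / n            # nabla_j L(theta)
        z = theta[j] - grad_j / col_sq[j]    # gradient step with eta_j = 1 / M_j
        new_j = np.sign(z) * max(abs(z) - lam / col_sq[j], 0.0)   # soft-thresholding
        r += X[:, j] * (theta[j] - new_j)    # partial residual update for the changed coordinate
        theta[j] = new_j
    return theta
```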
Convergence Analysis
Sublinear Rate of Convergence:
$$T = O\Big(\frac{d\max_j M_j}{\epsilon}\Big) \text{ iterations such that } \mathbb{E}\,\mathcal{F}_\lambda(\theta^{(T)}) - \mathcal{F}_\lambda(\hat{\theta}) \le \epsilon.$$
Remark:
$\nabla_j\mathcal{L}(\cdot, \theta_{\setminus j})$ is Lipschitz continuous for all $j = 1, \ldots, d$: $|\nabla_j\mathcal{L}(\theta_j', \theta_{\setminus j}) - \nabla_j\mathcal{L}(\theta_j, \theta_{\setminus j})| \le M_j|\theta_j' - \theta_j|$.
$1/\eta_j = M_j$ (often explicitly calculated)
Accelerated version: $O\big(d\sqrt{\max_j M_j/\epsilon}\big)$
Partial Residual Update Trick.
Warm Start Initialization
Regularization sequence $\{\lambda_K\}_{K=0}^{N}$: $\lambda_0 = \frac{1}{n}\|X^\top y\|_\infty$.
Solutions $\{\hat{\theta}_K\}_{K=0}^{N}$ go from sparse to dense: $\hat{\theta}_0 = \mathbf{0}$.
[Diagram: each problem $\min_\theta \mathcal{L}(\theta) + \mathcal{R}_{\lambda_{K+1}}(\theta)$ with $\lambda_{K+1} = 0.96\,\lambda_K$ is initialized at the previous solution $\hat{\theta}_K$, and so on along the path.]
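A sketch of the warm-start (pathwise) strategy (an added illustration): solve the Lasso along a geometrically decreasing $\lambda$ sequence starting from $\lambda_0 = \frac{1}{n}\|X^\top y\|_\infty$, initializing each problem at the previous solution. The inner solver here is the proximal gradient step from above; the decay ratio 0.96 follows the diagram, while the number of grid points and iterations are arbitrary.

```python
import numpy as np

def lasso_path(X, y, n_lambdas=20, ratio=0.96, n_iter=200):
    """Pathwise Lasso with warm starts: solve along a decreasing lambda sequence,
    initializing each problem at the previous solution (sparse -> dense)."""
    n, d = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / n          # lambda_0 = (1/n)||X^T y||_inf: the solution is 0
    lambdas = lam_max * ratio ** np.arange(n_lambdas)
    L = np.linalg.norm(X.T @ X / n, 2)             # step size 1/L for the proximal gradient steps
    theta = np.zeros(d)                            # warm start for the first (largest) lambda
    path = []
    for lam in lambdas:
        for _ in range(n_iter):
            grad = X.T @ (X @ theta - y) / n
            u = theta - grad / L
            theta = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
        path.append(theta.copy())                  # this solution warm-starts the next lambda
    return lambdas, np.array(path)
```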
Active Set Strategy
[Diagram sequence: the coordinates $1, \ldots, 12$ are partitioned into an active set $A$ and an inactive set $\bar{A}$; updates are restricted to the active coordinates (e.g., $\{2, 5, 8, 11\}$), and the active set is periodically revised until the optimality conditions hold for all coordinates.]
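A rough sketch of the active set strategy for the Lasso (an added illustration, not the course's implementation): coordinate descent is run only over the currently active coordinates, and the inactive coordinates are periodically scanned for violations of the optimality condition $\big|\frac{1}{n}x_j^\top(y - X\theta)\big| \le \lambda$; violators are moved into the active set.

```python
import numpy as np

def lasso_active_set(X, y, lam, max_outer=20, inner_iter=50, tol=1e-6):
    """Active-set sketch: optimize over the active coordinates only, then grow the active set."""
    n, d = X.shape
    theta = np.zeros(d)
    active = np.zeros(d, dtype=bool)
    for _ in range(max_outer):
        # Inner loop: cyclic coordinate descent restricted to the active set.
        for _ in range(inner_iter):
            for j in np.flatnonzero(active):
                r = y - X @ theta + X[:, j] * theta[j]            # partial residual excluding coordinate j
                z = X[:, j] @ r / n
                theta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / ((X[:, j] ** 2).sum() / n)
        # Outer check: does any inactive coordinate violate |(1/n) x_j^T (y - X theta)| <= lambda?
        grad = X.T @ (y - X @ theta) / n
        violations = (~active) & (np.abs(grad) > lam + tol)
        if not violations.any():
            break                                                 # optimality conditions hold everywhere
        active |= violations                                      # grow the active set and repeat
    return theta
```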
Solution Path
[Figure: Lasso and Elastic-net solution paths.]
Proximal Newton Algorithm
Given the solution $\theta^{(t)}$, we take
$$\theta^{(t+0.5)} = \arg\min_{\theta} \; \mathcal{L}(\theta^{(t)}) + (\theta - \theta^{(t)})^\top\nabla\mathcal{L}(\theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^\top\nabla^2\mathcal{L}(\theta^{(t)})(\theta - \theta^{(t)}) + \mathcal{R}_\lambda(\theta).$$
Combined with the backtracking line search, we have
$$\theta^{(t+1)} = \theta^{(t)} + \eta\big(\theta^{(t+0.5)} - \theta^{(t)}\big).$$
Remark: Each subproblem is solved by the coordinate descent algorithm. The proximal Newton algorithm can be much more efficient than the coordinate descent algorithm in practice.
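A hedged sketch of a proximal Newton step for the logistic Lasso (an added illustration): the local quadratic model of the loss is built from the logistic gradient and Hessian, and the quadratic-plus-$\ell_1$ subproblem is solved by cyclic coordinate descent. The small ridge term on the Hessian and the fixed step size (backtracking line search omitted) are simplifications of ours, not from the slides.

```python
import numpy as np
from scipy.special import expit

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def logistic_lasso_prox_newton(X, y, lam, outer=20, inner=50, eta=1.0):
    """Proximal Newton sketch for the l1-penalized logistic loss (labels y in {-1, +1})."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(outer):
        m = y * (X @ theta)
        p = expit(m)                                # probability of the observed label under the model
        grad = -(X.T @ (y * (1.0 - p))) / n         # gradient of the logistic loss
        w = p * (1.0 - p)                           # Hessian weights
        H = (X.T * w) @ X / n + 1e-8 * np.eye(d)    # Hessian of the loss (tiny ridge added for stability)
        # Solve min_z grad'(z - theta) + 0.5 (z - theta)' H (z - theta) + lam ||z||_1 by coordinate descent.
        z = theta.copy()
        for _ in range(inner):
            for j in range(d):
                u = grad[j] + H[j] @ (z - theta) - H[j, j] * (z[j] - theta[j])
                z[j] = soft(theta[j] - u / H[j, j], lam / H[j, j])
        theta = theta + eta * (z - theta)           # Newton-type step (backtracking line search omitted)
    return theta
```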
Software Libraries
Available Packages
glmnet: Lasso, Logistic/Poisson Lasso. Developed by J. Friedman; maintained by T. Hastie.
PICASSO: Lasso, Logistic/Poisson Lasso.
huge: Graphical Lasso.
liblinear: Sparse Support Vector Machine, Logistic Lasso. Developed by C. Jin; maintained by T. Helleputte.
QUIC: Graphical Lasso. Developed by C. Hsieh; maintained by M. Sustik.