
Page 1:

Distributed Sparse Optimization

Wotao Yin

Computational & Applied Mathematics (CAAM)

Rice University

STATOS 2013 – in memory of Alex Gershman

Zhimin Peng, Ming Yan, Wotao Yin. Parallel and Distributed Sparse Optimization, 2013

Page 2:

Outline

1. Background: Sparse optimization

2. Background: Parallel benefits, speedup, and overhead

3. Parallelize existing algorithms

4. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

5. Parallel greedy coordinate descent

6. Numerical results with big data

Page 3:

Coverage

This talk will not

• survey existing sparse optimization models or algorithms

• cover decentralized algorithms, due to the time limit

This talk will

• illustrate how to apply parallel computing techniques to some large-scale sparse optimization problems

• discuss parallel speedup and overhead along the way

Page 4:

Sparse optimization

A tool that processes “large” data sets for “structured” solutions

Examples of structures: sparse, joint/group sparse, sparse graph, low-rank, smooth, low-dimensional manifold, their combinations, ...

Key points:

• finding the sparsest solution is generally NP-hard¹

• ℓ1 minimization gives the sparsest solution, given sufficient measurements and other conditions²

• ℓ1 minimization is tractable

• it has many interesting applications and generalizations (requiring smart designs of the regularization and the sensing matrix)

¹ Natarajan [1995]
² Donoho and Elad [2003], Donoho [2006], Candes, Romberg, and Tao [2006]

Page 5:

Basic approach: ℓ1 minimization

minimize_x ‖Ψᵀx‖₁,   s.t. Ax = b,                  (1a)

minimize_x λ‖Ψᵀx‖₁ + ½‖Ax − b‖₂²,                  (1b)

minimize_x ‖Ψᵀx‖₁,   s.t. ‖Ax − b‖₂ ≤ δ.           (1c)

The settings:

• Ψ is the sparsifying transform; Ψᵀx is sparse

• A is a linear operator / sensing operator / feature matrix

• when b or Ψᵀx has noise, (1b) or (1c) is used

Other regularizers: joint/group variants, the nuclear norm, ...

Standard form of regularization: minimize ∑ᵢ(atomᵢ), which is good for parallelism

Page 6:

ℓ1 optimization: single-threaded approaches

Off-the-shelf: subgradient descent, reduction to LP/SOCP/SDP

Smoothing: approximate ℓ1 by the ℓ_{1+ε} norm, ∑ᵢ √(xᵢ² + ε), the Huber norm, or augmented ℓ1; then apply gradient-based and quasi-Newton methods.

Splitting: split smooth and nonsmooth terms into simple subproblems

• primal splitting: GPSR/FPC/SpaRSA/FISTA/SPGL1/...

• dual operator splitting: Bregman/ADMM/split Bregman/...

Greedy approaches: OMP and variants, greedy coordinate descent

Non-convex approaches

...

Page 7:

Parallel computing

• a problem is broken into concurrent parts, executed simultaneously and coordinated by a controller

• uses multiple cores/CPUs/networked computers, or combinations of them

• the aim is to solve a problem in less time, or to solve a larger problem

[Figure: a problem is split into tasks t1, t2, ..., tN, each assigned to a CPU]

Page 8:

Benefits of being parallel

• break single-CPU limits

• save time

• process big data

• access non-local resources

• handle concurrent data streams / enable collaboration

Page 9:

Parallel overhead

• computing overhead: start-up time, synchronization wait time, data communication time, termination (data collection) time

• I/O overhead: slow read and write of non-local data

• algorithm overhead: extra variables, data duplication

• coding overhead: language, library, operating system, debug, maintenance

Page 10:

Parallel speedup

• definition:

  speedup = (serial time) / (parallel time)

  where time is measured in the wall-clock sense

• Amdahl's Law: assume infinitely many processors and no overhead

  ideal max speedup = 1 / (1 − P)

  where P = parallelized fraction of the code

[Figure: ideal max speedup vs. the parallel portion of the code]

Page 11:

Parallel speedup

• assume N processors and no overhead

  ideal speedup = 1 / (P/N + (1 − P))

• note: P may depend on the data size, which depends on the input size and N

[Figure: ideal speedup vs. number of processors, for P = 25%, 50%, 90%, 95%]

Page 12:

Parallel speedup

• E := parallel overhead

• in the real world,

  actual speedup = 1 / (P/N + (1 − P) + E)

[Figures: actual speedup vs. number of processors for P = 25%, 50%, 90%, 95%, when E = N (left) and when E = log(N) (right)]
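
A minimal Python sketch of the two speedup formulas above; the overhead model E = 0.01·log2(N) used in the loop is an illustrative assumption, not a number from the talk.

import math

def ideal_speedup(P, N):
    # Amdahl's law with N processors and no overhead
    return 1.0 / (P / N + (1.0 - P))

def actual_speedup(P, N, E):
    # speedup once a parallel-overhead term E enters the denominator
    return 1.0 / (P / N + (1.0 - P) + E)

for N in (1, 8, 64, 512):
    E = 0.01 * math.log2(max(N, 2))   # assumed log-shaped overhead
    print(N, round(ideal_speedup(0.95, N), 1), round(actual_speedup(0.95, N, E), 1))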

Page 13:

Types of communication

Broadcast · Scatter · Gather · Reduce

[Figure: schematics of the four collectives; e.g., a reduce-sum combines the values 1, 2, 3, 3 into 9 at the root]

Page 14:

Allreduce

Allreduce-sum via butterfly communication (8 nodes):

node:     0   1   2   3   4   5   6   7
input:    2   3   6   4   8   1   7   5
stage 1:  5   5  10  10   9   9  12  12
stage 2: 15  15  15  15  21  21  21  21
stage 3: 36  36  36  36  36  36  36  36

log(N) layers, N parallel communications per layer
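
A single-process Python simulation of the butterfly pattern shown above (illustrative only; a real run would call MPI's Allreduce):

def butterfly_allreduce_sum(values):
    # values: one number per node; the node count must be a power of two
    vals = list(values)
    n = len(vals)
    assert n & (n - 1) == 0, "node count must be a power of two"
    stage = 1
    while stage < n:
        # at this layer, node i pairs with node i XOR stage and both add
        vals = [vals[i] + vals[i ^ stage] for i in range(n)]
        stage *= 2
    return vals   # every node ends with the full sum

print(butterfly_allreduce_sum([2, 3, 6, 4, 8, 1, 7, 5]))
# -> [36, 36, 36, 36, 36, 36, 36, 36], reproducing the table above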

Page 15:

Decomposable objectives and constraints enable parallelism

• variables x = [x_1, x_2, . . . , x_N]ᵀ

• embarrassingly parallel (not an interesting case): min ∑ᵢ f_i(x_i)

• consensus minimization, no constraints:

  min f(x) = ∑_{i=1}^{N} f_i(x) + r(x)

  example: f(x) = (λ/2) ∑ᵢ ‖A_(i)x − b_i‖₂² + ‖x‖₁

• coupling variable z:

  min_{x,z} f(x, z) = ∑_{i=1}^{N} [f_i(x_i, z) + r_i(x_i)] + r(z)

Page 16:

• coupling constraints:

  min f(x) = ∑_{i=1}^{N} f_i(x_i) + r_i(x_i)

  subject to

  • equality constraints: ∑_{i=1}^{N} A_i x_i = b

  • inequality constraints: ∑_{i=1}^{N} A_i x_i ≤ b

  • graph-coupling constraints Ax = b, where A is sparse

• combinations of independent variables, coupling variables, and constraints

  example:

  min_{x,w,z} ∑_{i=1}^{N} f_i(x_i, z)
  s.t. ∑ᵢ A_i x_i ≤ b
       B_i x_i ≤ w, ∀i
       w ∈ Ω

• variable coupling ⟷ constraint coupling:

  min ∑_{i=1}^{N} f_i(x_i, z)   ⟶   min ∑_{i=1}^{N} f_i(x_i, z_i)   s.t. z_i = z ∀i

Page 17:

Approaches toward parallelism

• Keep an existing serial algorithm, parallelize its expensive part(s)

• Introduce a new parallel algorithm by

• formulating an equivalent model

• replacing the model by an approximate model that’s easier to parallelize

Page 18:

Outline

1. Background: Sparse optimization

2. Background: Parallel benefits, speedup, and overhead

3. Parallelize existing algorithms

4. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

5. Parallel greedy coordinate descent

6. Numerical results with big data

Page 19:

Example: parallel ISTA

minimize_x λ‖x‖₁ + f(x)

• Prox-linear approximation:

  x^{k+1} = arg min_x λ‖x‖₁ + ⟨∇f(x^k), x − x^k⟩ + (1/(2δ_k))‖x − x^k‖₂²

• Iterative soft-thresholding algorithm (ISTA):

  x^{k+1} = shrink(x^k − δ_k ∇f(x^k), λδ_k)

• shrink(y, t) has a simple closed form: max(abs(y)-t,0).*sign(y)

• the dominant computation is ∇f(x)

• examples: f(x) = ½‖Ax − b‖₂², the logistic loss of Ax − b, the hinge loss, ..., where A is often dense and can be very large-scale
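
A minimal NumPy sketch of the shrink operator and of ISTA for the quadratic choice f(x) = ½‖Ax − b‖₂² (the fixed step size and the function names are assumptions for illustration, not the talk's code):

import numpy as np

def shrink(y, t):
    # soft-thresholding: max(abs(y) - t, 0) .* sign(y)
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista(A, b, lam, delta, iters=500):
    # minimize lam*||x||_1 + 0.5*||Ax - b||_2^2 with a fixed step size delta
    x = np.zeros(A.shape[1])
    Atb = A.T @ b                      # pre-compute A^T b
    for _ in range(iters):
        grad = A.T @ (A @ x) - Atb     # gradient of the smooth part at x
        x = shrink(x - delta * grad, lam * delta)
    return x

A step size delta ≤ 1/‖AᵀA‖₂ keeps this iteration stable for the quadratic f.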

Page 20:

Parallel gradient computing of f(x) = g(Ax + b): row partition

Let A be very big and dense, and let g(y) = ∑ᵢ g_i(y_i). Then f(x) = g(Ax + b) = ∑ᵢ g_i(A_(i)x + b_i)

• let A_(i) and b_i be the i-th row blocks of A and b, respectively

[Figure: A partitioned into row blocks A_(1), A_(2), ..., A_(M)]

• store A_(i) and b_i on node i

• goal: compute ∇f(x) = ∑ᵢ A_(i)ᵀ ∇g_i(A_(i)x + b_i), or similarly ∂f(x)

Page 21:

Parallel gradient computing of f(x) = g(Ax + b): row partition

Steps

1. compute A_(i)x + b_i for all i, in parallel;
2. compute g_i = ∇g_i(A_(i)x + b_i) for all i, in parallel;
3. compute A_(i)ᵀ g_i for all i, in parallel;
4. allreduce ∑_{i=1}^{M} A_(i)ᵀ g_i.

• steps 1, 2, and 3 are local computations and communication-free;

• step 4 requires a collective communication.
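
A hedged mpi4py sketch of these four steps with one process per row block; it assumes the separable choice g_i(y_i) = ½‖y_i‖² (so ∇g_i is the identity) and random illustrative data — it is not the authors' code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
i = comm.Get_rank()                        # this process owns row block i

rng = np.random.default_rng(i)
A_i = rng.standard_normal((50, 200))       # local row block A_(i)
b_i = rng.standard_normal(50)              # local piece of b
x = np.zeros(200)                          # the full x is replicated on every node

y_i = A_i @ x + b_i                        # step 1: local, communication-free
g_i = y_i                                  # step 2: gradient of g_i(y) = 0.5*||y||^2
partial = A_i.T @ g_i                      # step 3: local product
grad = np.empty_like(partial)
comm.Allreduce(partial, grad, op=MPI.SUM)  # step 4: the one collective communication

Launched with something like mpiexec -n 4 python row_grad.py (the file name is illustrative), every process ends up holding the full gradient.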

Page 22:

Parallel gradient computing of f(x) = g(Ax + b): column partition

[Figure: A partitioned into column blocks A_1, A_2, ..., A_N]

Ax = ∑_{i=1}^{N} A_i x_i   ⟹   f(x) = g(Ax + b) = g(∑ᵢ A_i x_i + b)

∇f(x) = [A_1ᵀ ∇g(Ax + b); A_2ᵀ ∇g(Ax + b); . . . ; A_Nᵀ ∇g(Ax + b)],   and similarly for ∂f(x)

Page 23:

Parallel gradient computing of f(x) = g(Ax + b): column partition

Steps

1. compute A_i x_i + (1/N) b for all i, in parallel;
2. allreduce Ax + b = ∑_{i=1}^{N} (A_i x_i + (1/N) b);
3. evaluate ∇g(Ax + b) in each process;
4. compute A_iᵀ ∇g(Ax + b) for all i, in parallel.

• each processor keeps A_i and x_i, as well as a copy of b;

• step 2 requires a collective communication;

• step 3 involves duplicated computation in each process.
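
The same kind of sketch for the column partition, under the same assumed g(y) = ½‖y‖²; note that the allreduce now happens before the gradient of g is evaluated.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
i, N = comm.Get_rank(), comm.Get_size()

A_i = np.random.default_rng(i).standard_normal((300, 40))  # local column block
x_i = np.zeros(40)                                          # local piece of x
b = np.random.default_rng(0).standard_normal(300)           # identical copy of b everywhere

local = A_i @ x_i + b / N                  # step 1: local; the b/N pieces sum to b
Axb = np.empty_like(local)
comm.Allreduce(local, Axb, op=MPI.SUM)     # step 2: every node now holds Ax + b
g = Axb                                    # step 3: gradient of g, duplicated per process
grad_i = A_i.T @ g                         # step 4: local block of the gradient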

Page 24:

Two-way partition

[Figure: A partitioned into M × N blocks A_{i,j}, with row-block index i = 1, ..., M and column-block index j = 1, ..., N]

f(x) = ∑ᵢ g_i(A_(i)x + b_i) = ∑ᵢ g_i(∑ⱼ A_{i,j} x_j + b_i)

∇f(x) = ∑ᵢ A_(i)ᵀ ∇g_i(A_(i)x + b_i) = ∑ᵢ A_(i)ᵀ ∇g_i(∑ⱼ A_{i,j} x_j + b_i)

Page 25:

Two-way partition

Steps

1. compute A_{i,j} x_j + (1/N) b_i for all (i, j), in parallel;
2. allreduce ∑_{j=1}^{N} (A_{i,j} x_j + (1/N) b_i) = A_(i)x + b_i for every i;
3. compute g_i = ∇g_i(A_(i)x + b_i) for all i, in parallel;
4. compute A_{i,j}ᵀ g_i for all (i, j), in parallel;
5. allreduce ∑_{i=1}^{M} A_{i,j}ᵀ g_i for every j.

• processor (i, j) keeps A_{i,j}, x_j, and b_i;

• steps 2 and 5 require collective communications.

Page 26:

Example: parallel ISTA for LASSO

LASSO

  min f(x) = λ‖x‖₁ + ½‖Ax − b‖₂²

algorithm

  x^{k+1} = shrink(x^k − δ_k Aᵀ A x^k + δ_k Aᵀ b, λδ_k)

Serial pseudo-code

1: initialize x = 0 and δ;
2: pre-compute Aᵀb
3: while not converged do
4:   g = AᵀAx − Aᵀb
5:   x = shrink(x − δg, λδ);
6: end while

Page 27:

Example: parallel ISTA for LASSO

distribute blocks of rows to M nodes

[Figure: the rows of A distributed as blocks A_(1), A_(2), ..., A_(M), one block per node]

1: initialize x = 0 and δ;
2: i = find my processor id
3: processor i loads A_(i)
4: pre-compute δAᵀb
5: while not converged do
6:   processor i computes c_i = A_(i)ᵀ A_(i) x
7:   c = allreduce(c_i, SUM)
8:   x = shrink(x − δc + δAᵀb, λδ);
9: end while

• assume A ∈ R^{m×n}

• one allreduce per iteration

• speedup ≈ 1 / (P/M + (1 − P) + O(log M))

• P is close to 1

• requires synchronization

Page 28:

Example: parallel ISTA for LASSO

distribute blocks of columns to N nodes

[Figure: the columns of A distributed as blocks A_1, A_2, ..., A_N, one block per node]

1: initialize x = 0 and δ;
2: i = find my processor id
3: processor i loads A_i
4: pre-compute δA_iᵀb
5: while not converged do
6:   processor i computes y_i = A_i x_i
7:   y = allreduce(y_i, SUM)
8:   processor i computes z_i = A_iᵀ y
9:   x_i = shrink(x_i − δz_i + δA_iᵀb, λδ);
10: end while

• one allreduce per iteration

• speedup ≈ 1 / (P/N + (1 − P) + O((m/n) log N))

• P is close to 1

• requires synchronization

• if m ≪ n, this approach is faster

Page 29:

Outline

1. Background: Sparse optimization

2. Background: Parallel benefits, speedup, and overhead

3. Parallelize existing algorithms

4. Primal decomposition / dual decomposition³

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

5. Parallel greedy coordinate descent

6. Numerical results with big data

³ Dantzig and Wolfe [1960], Spingarn [1985], Bertsekas and Tsitsiklis [1997]

Page 30:

Unconstrained minimization with coupling variables

x = [x_1, x_2, . . . , x_N]ᵀ.

z is the coupling/bridging/complicating variable.

  min_{x,z} f(x, z) = ∑_{i=1}^{N} f_i(x_i, z)

equivalently, a separable objective with coupling constraints:

  min_{x,z} ∑_{i=1}^{N} f_i(x_i, z_i)   s.t. z_i = z ∀i

Examples

• network utility maximization (NUM), extended to multiple layers⁴

• domain decomposition (z are the overlapping boundary variables)

• system of equations: A is block-diagonal but with a few dense columns

⁴ See the tutorial Palomar and Chiang [2006]

Page 31:

Primal decomposition (bi-level optimization)

Introduce easier subproblems

  g_i(z) = min_{x_i} f_i(x_i, z),   i = 1, . . . , N

then

  min_{x,z} ∑_{i=1}^{N} f_i(x_i, z)   ⟺   min_z ∑ᵢ g_i(z)

the right-hand side is called the master problem

parallel computing: iterate (see the sketch below)

1. fix z, update the x_i and compute g_i(z) in parallel

2. update z using a standard method (typically, subgradient descent)

• good for small-dimensional z

• fast if the master problem converges fast
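
A toy, purely illustrative sketch of that iterate; the quadratic f_i below is an assumption chosen so that both the subproblems and g_i have closed forms.

# f_i(x_i, z) = 0.5*(x_i - a_i)^2 + 0.5*(x_i - z)^2, so the subproblem minimizer
# is x_i = (a_i + z)/2 and g_i(z) = 0.25*(z - a_i)^2 with gradient 0.5*(z - a_i).
import numpy as np

a = np.array([1.0, 4.0, -2.0, 7.0])    # one private datum per subproblem
z = 0.0
for _ in range(200):
    x = (a + z) / 2.0                  # step 1: solve the N subproblems (parallelizable)
    g_master = 0.5 * np.sum(z - a)     # step 2: (sub)gradient of the master objective
    z -= 0.1 * g_master                # gradient step on the master variable z
print(z, a.mean())                     # z converges to the average of a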

Page 32:

Dual decomposition

Consider

  min_{x,z} ∑ᵢ f_i(x_i, z_i)   s.t. ∑ᵢ A_i z_i = b

Introduce

  g_i(z) = min_{x_i} f_i(x_i, z)

  g*_i(y) = max_z { yᵀz − g_i(z) }   // convex conjugate

Dualization:

  ⇔ min_{x,z} max_y ∑ᵢ f_i(x_i, z_i) + yᵀ(b − ∑ᵢ A_i z_i)

  (generalized minimax thm.) ⇔ max_y min_{x_i, z_i} { ∑ᵢ (f_i(x_i, z_i) − yᵀA_i z_i) + yᵀb }

  ⇔ max_y { min_{z_i} ∑ᵢ (g_i(z_i) − yᵀA_i z_i) + yᵀb }

  ⇔ min_y ∑ᵢ g*_i(A_iᵀ y) − yᵀb

where g_i(z) = min_{x_i} f_i(x_i, z) and g*_i(y) = max_z { yᵀz − g_i(z) } is the conjugate function of g_i.

Page 33:

Example: consensus and exchange problems

Consensus problem

  min_z ∑ᵢ g_i(z)

Reformulate as

  min ∑ᵢ g_i(z_i)   s.t. z_i = z ∀i

Its dual problem is the exchange problem

  min_y ∑ᵢ g*_i(y_i)   s.t. ∑ᵢ y_i = 0

Page 34:

Primal-dual objective properties

Constrained separable problem

  min_z ∑ᵢ g_i(z_i)   s.t. ∑ᵢ A_i z_i = b

Dual problem

  min_y ∑ᵢ g*_i(A_iᵀ y) − bᵀy

• If g_i is proper, closed, and convex, g*_i(a_iᵀy) is sub-differentiable

• If g_i is strictly convex, g*_i(a_iᵀy) is differentiable

• If g_i is strongly convex, g*_i(a_iᵀy) is differentiable and has a Lipschitz gradient

• Easily obtaining z* from y* generally requires strict convexity of g_i or strict complementary slackness

Examples:

• g_i(z_i) = |z_i| and g*_i(a_iᵀy) = ι{−1 ≤ a_iᵀy ≤ 1}

• g_i(z_i) = ½|z_i|² and g*_i(a_iᵀy) = ½|a_iᵀy|²

• g_i(z_i) = exp(z_i) and g*_i(a_iᵀy) = (a_iᵀy) ln(a_iᵀy) − a_iᵀy + ι{a_iᵀy ≥ 0}

Page 35:

Outline

1. Background: Sparse optimization

2. Background: Parallel benefits, speedup, and overhead

3. Parallelize existing algorithms

4. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman⁵)

• Parallel ADMM

5. Parallel greedy coordinate descent

6. Numerical results with big data

⁵ Yin, Osher, Goldfarb, and Darbon [2008]

Page 36:

Example: augmented ℓ1

Change |z_i| into the strongly convex g_i(z_i) = |z_i| + (1/(2α))|z_i|²

Dual problem

  min_y ∑ᵢ g*_i(A_iᵀy) − bᵀy = ∑ᵢ (α/2)|shrink(A_iᵀy)|² − bᵀy,   with h_i(y) := (α/2)|shrink(A_iᵀy)|²

Dual gradient iteration (equivalent to linearized Bregman):

  y^{k+1} = y^k − t_k (∑ᵢ ∇h_i(y^k) − b),   where ∇h_i(y) = α A_i shrink(A_iᵀy).

• The dual is C¹ and unconstrained. One can apply (accelerated) gradient descent, line search, quasi-Newton methods, etc.

• Global linear convergence.⁶

• Easy to parallelize. One collective communication per iteration.

• Recover x*_i = α shrink(A_iᵀ y*).

⁶ Lai and Yin [2011]
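
A minimal serial NumPy sketch of this dual gradient iteration; the constant step size t = 1/(α‖A‖₂²) is an assumed safe choice based on the Lipschitz constant of the dual gradient, and shrink uses the implicit threshold 1.

import numpy as np

def shrink(u, t=1.0):
    return np.maximum(np.abs(u) - t, 0.0) * np.sign(u)

def lbreg(A, b, alpha, iters=2000):
    # dual gradient ascent (linearized Bregman) for
    # min ||x||_1 + (1/(2*alpha))*||x||_2^2  s.t.  Ax = b
    m, n = A.shape
    y = np.zeros(m)
    t = 1.0 / (alpha * np.linalg.norm(A, 2) ** 2)   # assumed safe step size
    for _ in range(iters):
        x = alpha * shrink(A.T @ y)                 # x = alpha * shrink(A^T y)
        y -= t * (A @ x - b)                        # gradient step: sum_i grad h_i(y) - b
    return x, y

In a distributed run, each node would form its own α A_i shrink(A_iᵀy) and one allreduce per iteration would assemble Ax − b, matching the “one collective communication per iteration” remark above.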

Page 37:

Augmented ℓ1: exact regularization⁷

Theorem

There exists a finite α₀ > 0 such that whenever α > α₀, the solution to

  (L1+LS)   min ‖x‖₁ + (1/(2α))‖x‖₂²   s.t. Ax = b

is also a solution to

  (L1)      min ‖x‖₁   s.t. Ax = b.

[Figure: the (L1) and (L1+LS) solutions]

⁷ Friedlander and Tseng [2007], Yin [2010]

Page 38:

Augmented ℓ1: CS recovery guarantees

• if min ‖x‖₁ gives exact or stable recovery provided that

  #measurements ≥ C · f(signal dim, signal sparsity),

• then adding (1/(2α))‖x‖₂² changes the condition to

  #measurements ≥ (C + O(1/(2α))) · f(signal dim, signal sparsity).

• in theory and practice, α ≥ 10‖x^o‖∞ suffices⁸

⁸ Lai and Yin [2011]

Page 39:

Augmented ℓ1: a simple test

• Gaussian random A

• x^o has 10% nonzero entries

• recover x^o from the under-sampled b = Ax^o

[Figure: recovered solutions of min{‖x‖₁ : Ax = b}, min{‖x‖₂² : Ax = b}, and min{‖x‖₁ + (1/25)‖x‖₂² : Ax = b}; the last one is exactly the same as the ℓ1 solution]

Page 40:

Outline

1. Background: Sparse optimization

2. Background: Parallel benefits, speedup, and overhead

3. Parallelize existing algorithms

4. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM⁹

5. Parallel greedy coordinate descent

6. Numerical results with big data

⁹ Bertsekas and Tsitsiklis [1997]

Page 41:

Alternating direction method of multipliers (ADMM)

step 1: turn the problem into the form

  min_{x,y} f(x) + g(y)   s.t. Ax + By = b.

f and g are convex, maybe nonsmooth, and can incorporate constraints

Page 42:

Alternating direction method of multipliers (ADMM)

step 1: turn the problem into the form

  min_{x,y} f(x) + g(y)   s.t. Ax + By = b.

f and g are convex, maybe nonsmooth, and can incorporate constraints

step 2: iterate

1. x^{k+1} ← arg min_x f(x) + g(y^k) + (β/2)‖Ax + By^k − b − z^k‖₂²,

2. y^{k+1} ← arg min_y f(x^{k+1}) + g(y) + (β/2)‖Ax^{k+1} + By − b − z^k‖₂²,

3. z^{k+1} ← z^k − (Ax^{k+1} + By^{k+1} − b).

history: dates back to the 1960s, formalized in the 1980s; parallel versions appeared in the late 1980s; revived very recently, with new convergence results appearing recently
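
To make the iteration concrete, a hedged instance specialized to LASSO written as f(x) = ½‖Mx − c‖₂², g(y) = λ‖y‖₁ with the constraint x − y = 0 (so A = I, B = −I, b = 0 in the notation above); M, c, and the fixed β are illustrative.

import numpy as np

def shrink(u, t):
    return np.maximum(np.abs(u) - t, 0.0) * np.sign(u)

def admm_lasso(M, c, lam, beta=1.0, iters=500):
    n = M.shape[1]
    y = np.zeros(n)
    z = np.zeros(n)
    Q = np.linalg.inv(M.T @ M + beta * np.eye(n))   # the x-subproblem is a fixed linear system
    Mtc = M.T @ c
    for _ in range(iters):
        x = Q @ (Mtc + beta * (y + z))   # step 1: x-update
        y = shrink(x - z, lam / beta)    # step 2: y-update (soft-thresholding)
        z = z - (x - y)                  # step 3: z-update; Ax + By - b = x - y here
    return y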

Page 43:

ADMM and parallel/distributed computing

Two approaches:

• parallelize the serial algorithm(s) for the ADMM subproblem(s),

• turn the problem into a parallel-ready form¹⁰

  min_{x,y} ∑ᵢ f_i(x_i) + g(y)

  s.t. diag(A_1, . . . , A_M) [x_1; . . . ; x_M] + [B_1; . . . ; B_M] y = [b_1; . . . ; b_M],

  i.e., A_i x_i + B_i y = b_i for i = 1, . . . , M;

consequently, computing x^{k+1} reduces to parallel subproblems

  x_i^{k+1} ← arg min_{x_i} f_i(x_i) + (β/2)‖A_i x_i + B_i y^k − b_i − z_i^k‖₂²,   i = 1, . . . , M

Issue: to have a block-diagonal A and a simple B, additional variables are often needed, thus increasing the problem size and the parallel overhead.

¹⁰ A recent survey: Boyd, Parikh, Chu, Peleato, and Eckstein [2011]

Page 44:

Parallel linearized Bregman vs. ADMM on basis pursuit

Basis pursuit:

  min_x ‖x‖₁   s.t. Ax = b.

• example with a dense A ∈ R^{1024×2048};

• distribute A to N computing nodes as A = [A_1 A_2 · · · A_N];

• compare

  1. the parallel ADMM in the survey paper Boyd, Parikh, Chu, Peleato, and Eckstein [2011];

  2. parallel linearized Bregman (dual gradient descent), un-accelerated.

• tested N = 1, 2, 4, . . . , 32 computing nodes;

Page 45:

Parallel ADMM vs. parallel linearized Bregman

[Figure: wall-clock time vs. number of cores for ADMM and parallel linearized Bregman, on log-log axes]

parallel linearized Bregman scales much better

Page 46:

Outline

1. Background: Sparse optimization

2. Background: Parallel benefits, speedup, and overhead

3. Parallelize existing algorithms

4. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

5. Parallel greedy coordinate descent

6. Numerical results with big data

Page 47:

(Block-)coordinate descent/update

Definition: update one block of variables at a time, keeping the others fixed

Advantage: the update is simple

Disadvantages:

• more iterations (with exceptions)

• may get stuck at non-stationary points if the problem is non-convex or non-smooth (with exceptions)

Block selection: cyclic (Gauss-Seidel), parallel (Jacobi), random, greedy

Specific to sparse optimization: the greedy approach¹¹ can be exceptionally effective (because, most of the time, the correct variables are selected for updating; most zero variables are never touched, “reducing” the problem to a much smaller one).

¹¹ Li and Osher [2009]

Page 48:

GRock: greedy coordinate-block descent

Example:

  min λ‖x‖₁ + f(Ax − b)

• decompose Ax = ∑ⱼ A_j x_j; block A_j and x_j are kept on node j

Parallel GRock:

1. (parallel) compute a merit value for each coordinate i:

   d_i = arg min_d λ · r(x_i + d) + g_i d + ½d²,   where g_i = a_iᵀ ∇f(Ax − b) and a_i is the i-th column of A

2. (parallel) compute a merit value for each block j:

   m_j = max{|d_i| : d_i is an element of d_j};

   let s_j be the index of the maximal coordinate within block j

3. (allreduce) 𝒫 ← select the P blocks with the largest m_j, 2 ≤ P ≤ N

4. (parallel) update x_{s_j} ← x_{s_j} + d_{s_j} for all j ∈ 𝒫

5. (allreduce) update Ax
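
A serial NumPy simulation of these five steps for the LASSO instance f(Ax − b) = ½‖Ax − b‖₂², with the columns of A assumed to have unit 2-norm as in the theorem on the next slide; block sizes, data, and names are illustrative, and a distributed run would replace the two marked steps by allreduce calls.

import numpy as np

def shrink(u, t):
    return np.maximum(np.abs(u) - t, 0.0) * np.sign(u)

def grock(A, b, lam, n_blocks, P, iters=300):
    m, n = A.shape
    blocks = np.array_split(np.arange(n), n_blocks)
    x, Ax = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        g = A.T @ (Ax - b)                        # coordinate gradients g_i
        d = shrink(x - g, lam) - x                # step 1: coordinate merit values d_i
        best = [blk[np.argmax(np.abs(d[blk]))] for blk in blocks]   # step 2: best coordinate s_j per block
        m_j = np.abs(d[best])                     #         block merit values m_j
        chosen = np.argsort(-m_j)[:P]             # step 3: the P blocks with largest merit (allreduce)
        for j in chosen:                          # step 4: update one coordinate per chosen block
            s = best[j]
            Ax += A[:, s] * d[s]                  # step 5: refresh Ax (allreduce in a distributed run)
            x[s] += d[s]
    return x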

Page 49:

How large can P be?

P depends on the block spectral radius

  ρ_P = max_{M ∈ 𝓜} ρ(M),

where 𝓜 is the set of all P × P submatrices of AᵀA obtained by selecting exactly one column from each of the P chosen blocks.

Theorem

Assume each column of A has unit 2-norm. If ρ_P < 2, GRock with P parallel updates per iteration gives

  F(x + d) − F(x) ≤ ((ρ_P − 2)/2) β ‖d‖₂²;

moreover,

  F(x^k) − F(x*) ≤ [2C² (2L + β √(N/P))²] / [(2 − ρ_P) β] · (1/k).

Page 50:

Compare different block selection rules

[Figures: objective gap f − f_min (left) and relative error (right) vs. the normalized number of coordinates calculated, for cyclic CD, greedy CD, mixed CD, and GRock]

• LASSO test with A ∈ R^{512×1024}, N = 64 column blocks

• GRock uses P = 8 updates in each iteration

• greedy CD¹² uses P = 1

• mixed CD¹³ selects P = 8 random blocks and the best coordinate in each, per iteration

¹² Li and Osher [2009]
¹³ Scherrer, Tewari, Halappanavar, and Haglin [2012]

Page 51:

Outline

1. Background: Sparse optimization

2. Background: Parallel benefits, speedup, and overhead

3. Parallelize existing algorithms

4. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

5. Parallel greedy coordinate descent

6. Numerical results with big data

Page 52:

Test on Rice cluster STIC

specs:

• 170 Appro Greenblade E5530 nodes, each with two quad-core 2.4 GHz Xeon (Nehalem) CPUs

• each node has 12 GB of memory shared by all 8 cores

• # of processes used on each node = 8

test dataset:

              A type     A size        λ      sparsity of x*
dataset I     Gaussian   1024 × 2048   0.1    100
dataset II    Gaussian   2048 × 4096   0.01   200

Page 53:

iterations vs cores

[Figures: number of iterations vs. number of cores (log-log) for parallel D-ADMM, parallel FISTA, and GRock, on dataset I (left) and dataset II (right)]

Note: we set P = number of cores

Page 54:

time vs cores

[Figures: wall-clock time (s) vs. number of cores for parallel D-ADMM, parallel FISTA, and GRock, on dataset I (left, with one point annotated P = 16) and dataset II (right)]

Note: we set P = number of cores

Page 55:

(% communication time) vs cores

[Figures: fraction of time spent on communication vs. number of cores for parallel D-ADMM, parallel FISTA, and GRock, on dataset I (left) and dataset II (right)]

Note: we set P = number of cores

Page 56:

big data LASSO on Amazon EC2

Amazon EC2 is an elastic, pay-as-you-go cluster;

advantage: no hardware investment, and everyone can have an account

test dataset

• A: a dense matrix with 20 billion entries (170 GB)

• x: 200K entries, 4K nonzeros, Gaussian values

requested system

• 20 “high-memory quadruple extra-large instances”

• each instance has 8 cores and 60GB memory

code

• written in C using GSL (for matrix-vector multiplication) and MPI

Page 57:

parallel Dual-ADMM vs FISTA¹⁴ vs GRock

                              p D-ADMM   p FISTA   GRock
estimate stepsize (min.)      n/a        1.6       n/a
matrix factorization (min.)   51         n/a       n/a
iteration time (min.)         105        40        1.7
number of iterations          2500       2500      104
communication time (min.)     30.7       9.5       0.5
stopping relative error       1E-1       1E-3      1E-5
total time (min.)             156        41.6      1.7
Amazon charge                 $85        $22.6     $0.93

• ADMM's performance depends on the penalty parameter β; we picked β as the best out of only a few trials (we could not afford more)

• parallel Dual-ADMM and FISTA were capped at 2500 iterations

• GRock used adaptive P and stopped at relative error 1E-5

¹⁴ Beck and Teboulle [2009]

Page 58:

Software code can be found on the author's website.

Funding agencies: US NSF, DoD

Acknowledgements: Qing Ling (USTC) and Zaiwen Wen (SJTU)

References:

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.

D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100:2197–2202, 2003.

D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

G. B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations Research, 8:101–111, 1960.

J. E. Spingarn. Applications of the method of partial inverses to convex programming: decomposition. Mathematical Programming, 32(2):199–223, 1985.

D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, Second Edition. Athena Scientific, 1997.

Page 59:

D. P. Palomar and M. Chiang. A tutorial on decomposition methods for network utility maximization. IEEE Journal on Selected Areas in Communications, 24(8):1439–1451, 2006.

W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143–168, 2008.

M.-J. Lai and W. Yin. Augmented ℓ1 and nuclear-norm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences, 6(2):1059–1091, 2011.

M. P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM Journal on Optimization, 18(4):1326–1350, 2007.

W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM Journal on Imaging Sciences, 3(4):856–877, 2010.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Y. Li and S. Osher. Coordinate descent optimization for ℓ1 minimization with application to compressed sensing: a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.

C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for accelerating parallel coordinate descent. In NIPS, pages 28–36, 2012.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.