
Page 1: Parallel and Distributed Sparse Optimization

Parallel and Distributed Sparse Optimization

Zhimin Peng Ming Yan Wotao Yin

Computational and Applied Mathematics

Rice University

May 2013

Page 2: Parallel and Distributed Sparse Optimization

outline

1. sparse optimization background

2. model and data distribution

3. parallel and distributed algorithms

parallel/distributed prox-linear iteration

GRock: greedy coordinate-block descent

parallel/distributed ADMM

4. numerical experiments

Rice cluster STIC

Amazon EC2

5. conclusions

Page 3: Parallel and Distributed Sparse Optimization

sparse optimization

processes “large” data sets for “structured” solutions

key points:

• finding the sparsest solution is generally NP-hard

• under some conditions (e.g., enough equations, dual certificate, incoherence, RIP, etc.), the minimal ℓ1 solution is the sparsest solution¹

• most of the time, ℓ1 minimization in R^n is quicker than solving an n × n dense system

• has many interesting generalizations in signal/image processing, machine

learning, data mining, etc.

¹ Donoho [2006]; Candes, Romberg, and Tao [2006]

Page 4: Parallel and Distributed Sparse Optimization

current challenges

the size, complexity, and diversity of sparse optimization problems are growing significantly

examples:

• video, 4D CT images, dynamical systems, higher-dimensional tensors

• (the Internet) distributed data, streaming data, cloud computing

Page 5: Parallel and Distributed Sparse Optimization

single-threaded approaches for sparse optimization

off-the-shelf: subgradient descent, LP/SOCP/SDP

smoothing: approximate ℓ1 by

• the ℓ_{1+ε} norm

• ∑_i √(x_i² + ε)

• the Huber norm

and apply gradient-based and quasi-Newton methods.

splitting: split smooth and nonsmooth terms into simple subproblems

• operator splitting: GPSR/FPC/SpaRSA/FISTA/SPGL1/...²

• dual operator splitting: Bregman/ADMM/split Bregman/...³

all operations take the forms Ax and A^T y

greedy approaches: GRock, OMP, CoSaMP, ...⁴

² Hale, Yin, and Zhang [2008]; Beck and Teboulle [2009]; van den Berg and Friedlander [2008]
³ Yin [2010]; Yang and Zhang [2011]
⁴ Tropp and Gilbert [2007]; Needell and Tropp [2009]

Page 6: Parallel and Distributed Sparse Optimization

parallel/distributed sparse optimization

should be capable of handling very large-scale instances

this set of slides covers

• parallel/distributed alternating direction method of multipliers (ADMM)

• parallel/distributed prox-linear approaches

• (coming ...) parallel implementation of linearized Bregman

• GRock: parallel greedy coordinate-block descent

going down the list, the algorithms assume more structure but are also faster

Page 7: Parallel and Distributed Sparse Optimization

Outline

1. sparse optimization background

2. model and data distribution

3. parallel and distributed algorithms

parallel/distributed prox-linear iteration

GRock: greedy coordinate-block descent

parallel/distributed ADMM

4. numerical experiments

Rice cluster STIC

Amazon EC2

5. conclusions

Page 8: Parallel and Distributed Sparse Optimization

general model

minimize_{x ∈ R^n}   F(x) = λ · R(x) + L(Ax, b)

• R(x) is a regularizer such as ‖x‖₁, ‖Ψ^T x‖₁, ‖x‖₂,₁, etc.

• L(Ax, b) is a fitting measure such as the squared loss, logistic loss, etc.

• we say f(x) is separable if

f(x) = ∑_{s=1}^{S} f_s(x_s)
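To make the notation concrete, here is a hedged NumPy sketch (not from the slides) of the general model specialized to the LASSO, i.e., R(x) = ‖x‖₁ with the squared loss; the function name is illustrative.

```python
import numpy as np

def lasso_objective(A, b, x, lam):
    """F(x) = lam * ||x||_1 + 0.5 * ||Ax - b||_2^2  (squared loss as the fitting measure)."""
    return lam * np.sum(np.abs(x)) + 0.5 * np.sum((A @ x - b) ** 2)

# note: the l1 regularizer is separable across coordinates (or coordinate blocks),
# ||x||_1 = sum_s ||x_s||_1, which is the structure the distributed algorithms exploit
```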

Page 9: Parallel and Distributed Sparse Optimization

distributing big data A

reasons:

• A is too large to store centrally

• A is collected in a distributed manner

we consider three distribution schemes:

a. row blocks: stacking row blocks A_(1), A_(2), …, A_(M) forms A

b. column blocks: A = [A_1 A_2 · · · A_N]

c. A is partitioned into an M × N grid of sub-blocks; the (i, j)th block is A_{i,j}.

[Figure: the three partition schemes — (a) row blocks A_(1), …, A_(M); (b) column blocks A_1, …, A_N; (c) an M × N grid of sub-blocks A_{i,j}]

assume each computing node stores a block

Page 10: Parallel and Distributed Sparse Optimization

computing Ax and A^T y: broadcast then parallel

computing

Ax = [A_(1)x; A_(2)x; … ; A_(M)x]   in scenario a

and

A^T y = [A_1^T y; A_2^T y; … ; A_N^T y]   in scenario b

each takes two steps

1. x and y are broadcast to, or synchronized among, all the nodes

2. every node independently computes A_(i)x (scenario a) or A_j^T y (scenario b)
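A minimal mpi4py sketch of the "broadcast then parallel" pattern, assuming each rank already holds its row block as a NumPy array; the sizes and the names `A_blk`, `x` are illustrative.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1000                              # number of columns (illustrative)
A_blk = np.random.randn(50, n)        # this rank's row block A_(i)

# step 1: broadcast (synchronize) x among all the nodes
x = np.random.randn(n) if rank == 0 else np.empty(n)
comm.Bcast(x, root=0)

# step 2: every node independently computes its local piece A_(i) x
y_blk = A_blk @ x
```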

Page 11: Parallel and Distributed Sparse Optimization

computing Ax and A^T y: parallel then reduce

computing

A^T y = ∑_{i=1}^{M} A_(i)^T y_i   in scenario a

and

Ax = ∑_{j=1}^{N} A_j x_j   in scenario b

each takes two steps

1. every node computes A_(i)^T y_i (scenario a) or A_j x_j (scenario b) independently

2. a reduce operation computes the sum

(allreduce also returns the sum to every node)
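The companion sketch for "parallel then reduce" in scenario a, again with illustrative sizes; `Allreduce` leaves the summed result A^T y on every node.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

n = 1000
A_blk = np.random.randn(50, n)        # row block A_(i) stored on this rank
y_blk = np.random.randn(50)           # corresponding subvector y_i

# step 1: every node computes A_(i)^T y_i independently
local = A_blk.T @ y_blk

# step 2: allreduce sums the partial results and returns A^T y to every node
ATy = np.empty(n)
comm.Allreduce(local, ATy, op=MPI.SUM)
```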

Page 12: Parallel and Distributed Sparse Optimization

broadcast, parallel, then reduce

in scenario c, either one of Ax and A^T y requires both broadcast and reduce

example of Ax:

Ax = [A_(1)x; A_(2)x; … ; A_(M)x] = [∑_{j=1}^{N} A_{1,j}x_j; ∑_{j=1}^{N} A_{2,j}x_j; … ; ∑_{j=1}^{N} A_{M,j}x_j]

1. for every j, x_j is broadcast to nodes (1, j), (2, j), …, (M, j)

2. compute A_{i,j} x_j in parallel over all i, j

3. for every i, reduce over nodes (i, 1), (i, 2), …, (i, N) to compute

A_(i)x = ∑_{j=1}^{N} A_{i,j} x_j
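One way to realize this broadcast, parallel, then reduce pattern is an M × N process grid with split communicators; the sketch below is illustrative only (it assumes exactly M·N ranks and that x_j initially lives on the first grid row).

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
M, N = 2, 2                                # grid shape (illustrative)
i, j = divmod(comm.Get_rank(), N)          # this rank holds block A_{i,j}

col_comm = comm.Split(color=j, key=i)      # nodes (1, j), ..., (M, j)
row_comm = comm.Split(color=i, key=j)      # nodes (i, 1), ..., (i, N)

mi, nj = 80, 100                           # block dimensions (illustrative)
A_ij = np.random.randn(mi, nj)

# 1. broadcast x_j down column j (root = the node in the first grid row)
x_j = np.random.randn(nj) if i == 0 else np.empty(nj)
col_comm.Bcast(x_j, root=0)

# 2. compute A_{i,j} x_j in parallel on every node
local = A_ij @ x_j

# 3. reduce across row i to obtain A_(i) x = sum_j A_{i,j} x_j
Ax_i = np.empty(mi)
row_comm.Reduce(local, Ax_i, op=MPI.SUM, root=0)
```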

Page 13: Parallel and Distributed Sparse Optimization

node speed and block size

• most clusters have identical nodes

• but clouds can have different nodes

• most parallel algorithms require synchronized computation

faster (or less busy) nodes wait for slower (or busier) nodes

• equal block sizes + heterogeneous nodes ⇒ faster nodes wait for slower ones

a better solution: assign larger blocks to faster (or less busy) nodes
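A tiny, hypothetical helper illustrating the load-balancing idea: block sizes proportional to measured (or estimated) node speeds, so faster nodes receive more rows or columns.

```python
def proportional_blocks(n, speeds):
    """Split n rows/columns into len(speeds) blocks, roughly proportional to node speed."""
    total = float(sum(speeds))
    sizes = [int(n * s / total) for s in speeds]
    sizes[-1] += n - sum(sizes)          # hand any rounding remainder to the last node
    return sizes

# example: a node that is 4x faster receives a 4x larger block
print(proportional_blocks(1000, [1.0, 1.0, 2.0, 4.0]))   # [125, 125, 250, 500]
```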

Page 14: Parallel and Distributed Sparse Optimization

Outline

1. sparse optimization background

2. model and data distribution

3. parallel and distributed algorithms

parallel/distributed prox-linear iteration

GRock: greedy coordinate-block descent

parallel/distributed ADMM

4. numerical experiments

Rice cluster STIC

Amazon EC2

5. conclusions

Page 15: Parallel and Distributed Sparse Optimization

background: prox-linear iteration

original model:

minimize_{x ∈ R^n}   F(x) = λ · R(x) + L(Ax, b)

linearize L(Ax, b) at x^k:

x^{k+1} ← argmin_x  λ · R(x) + ⟨x, A^T∇L(Ax^k, b)⟩ + (1/(2δ_k)) ‖x − x^k‖₂²,

advantages: well understood and its solution

x^{k+1} = prox_{λR}(x^k − δ_k A^T∇L(Ax^k, b)),

is often in closed form

example: if R(x) = ‖x‖₁, prox_{λR} is known as soft-thresholding

algorithms: FPC/SpaRSA/ISTA/FISTA ...
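For R(x) = ‖x‖₁ and the squared loss, one prox-linear step reduces to a gradient step followed by soft-thresholding; a minimal NumPy sketch with illustrative function names follows.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1: elementwise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_linear_step(x, A, b, lam, delta):
    """One step x^{k+1} = prox_{delta*lam*||.||_1}(x - delta * A^T grad L(Ax, b))
    for the squared loss L(Ax, b) = 0.5*||Ax - b||^2."""
    grad = A.T @ (A @ x - b)
    return soft_threshold(x - delta * grad, delta * lam)
```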

Page 16: Parallel and Distributed Sparse Optimization

scenario a

assume separability L(z) = ∑_i L_i(z_i)

storage: node i keeps row block A_(i), subvector b_i, and the current x^k

each iteration: compute

x^{k+1} = prox_{λR}(x^k − δ_k A^T∇L(Ax^k, b))   (1)

in three steps, noticing

A^T∇L(Ax^k, b) = ∑_{i=1}^{M} A_(i)^T ∇L_i(A_(i)x^k; b_i)

1. parallel: every node i computes A_(i)^T ∇L_i(A_(i)x^k; b_i)

2. allreduce: A^T∇L(Ax^k, b) = ∑_{i=1}^{M} A_(i)^T ∇L_i(A_(i)x^k; b_i) for all nodes

3. parallel: every node gets the identical x^{k+1} following (1)
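A hedged mpi4py sketch of these three steps, specialized to λ‖x‖₁ plus the squared loss so that both the local gradients and the prox are explicit; every rank returns the same x^{k+1}.

```python
from mpi4py import MPI
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_linear_step_row_blocks(comm, x, A_blk, b_blk, lam, delta):
    # 1. parallel: node i computes A_(i)^T grad L_i(A_(i) x, b_i) for the squared loss
    local = A_blk.T @ (A_blk @ x - b_blk)
    # 2. allreduce: sum the partial gradients so A^T grad L(Ax, b) is on every node
    grad = np.empty_like(x)
    comm.Allreduce(local, grad, op=MPI.SUM)
    # 3. parallel: every node forms the identical x^{k+1} via the prox step (1)
    return soft_threshold(x - delta * grad, delta * lam)
```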

Page 17: Parallel and Distributed Sparse Optimization

scenario b

assume separability R(x) = ∑_j R_j(x_j)

consequently:

storage: node j keeps column block A_j, the entire b, and subvector x_j^k

each iteration: computes

x^{k+1} = prox_{λR}(x^k − δ_k A^T∇L(Ax^k, b))

⇒ for each j,  x_j^{k+1} = prox_{λR_j}(x_j^k − δ_k A_j^T ∇L(Ax^k, b))   (2)

in three steps

• parallel: every node j computes A_j x_j^k

• allreduce: Ax^k = ∑_{j=1}^{N} A_j x_j^k for all nodes

• parallel: every node computes A_j^T ∇L(Ax^k, b) and then x_j^{k+1} following (2)
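The analogous sketch for scenario b (same assumptions as above); the difference is that the allreduce now assembles Ax^k = ∑_j A_j x_j^k, after which each node updates only its own block x_j.

```python
from mpi4py import MPI
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_linear_step_col_blocks(comm, x_blk, A_blk, b, lam, delta):
    # parallel: node j computes A_j x_j^k
    local = A_blk @ x_blk
    # allreduce: Ax^k = sum_j A_j x_j^k, available on every node
    Ax = np.empty_like(b)
    comm.Allreduce(local, Ax, op=MPI.SUM)
    # parallel: node j computes A_j^T grad L(Ax^k, b) and its own x_j^{k+1} via (2)
    grad_blk = A_blk.T @ (Ax - b)
    return soft_threshold(x_blk - delta * grad_blk, delta * lam)
```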

Page 18: Parallel and Distributed Sparse Optimization

scenario c

assume separability R(x) = ∑_j R_j(x_j) and L(z) = ∑_i L_i(z_i)

two allreduce operations at each iteration, remaining operations are parallel

(details left to the reader)

Page 19: Parallel and Distributed Sparse Optimization

Outline

1. sparse optimization background

2. model and data distribution

3. parallel and distributed algorithms

parallel/distributed prox-linear iteration

GRock: greedy coordinate-block descent

parallel/distributed ADMM

4. numerical experiments

Rice cluster STIC

Amazon EC2

5. conclusions

Page 20: Parallel and Distributed Sparse Optimization

GRock: greedy coordinate-block descent

general messages about coordinate (block) descent:

• update one coordinate (block) at a time

• coordinate selection: cyclic (Gauss-Seidel), random, greedy, or mixtures

• advantage: update is easy

• disadvantage: (usually but not always) more iterations; may fail on nonsmooth and nonconvex problems (getting stuck at non-stationary points)

but GRock converges in fewer iterations for sparse optimization when the columns of A are nearly orthogonal

Page 21: Parallel and Distributed Sparse Optimization

GRock iteration for scenario b

1. assign a merit value to each coordinate i

d_i = argmin_d  λ · r(x_i + d) + g_i d + ½ d²

• g_i is the ith component of g(x) = A^T∇L(Ax; b)

• a larger |d_i| means that updating x_i reduces the objective more

2. assign a merit value to each block

m_j = max{|d_i| : i ∈ block j};  let s_j be the coordinate attaining the maximum, so m_j = |d_{s_j}|

3. P ← the P blocks with the largest m_j, 2 ≤ P ≤ N

4. parallel update: x_{s_j} ← x_{s_j} + d_{s_j} for all j ∈ P
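A single-machine NumPy sketch of one GRock iteration for r = |·| with the squared loss, so that d_i has the closed form shrink(x_i − g_i, λ) − x_i; blocks are given as lists of coordinate indices and all names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def grock_step(x, A, b, lam, blocks, P):
    g = A.T @ (A @ x - b)                       # gradient of the smooth part at x
    d = soft_threshold(x - g, lam) - x          # d_i = argmin_d lam*|x_i+d| + g_i*d + d^2/2
    best = [blk[np.argmax(np.abs(d[blk]))] for blk in blocks]   # s_j: best coordinate in block j
    merits = np.array([abs(d[s]) for s in best])                # block merit m_j = |d_{s_j}|
    chosen = np.argsort(merits)[-P:]            # the P blocks with the largest merits
    x_new = x.copy()
    for j in chosen:
        x_new[best[j]] += d[best[j]]            # update coordinate s_j in each selected block
    return x_new
```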

Page 22: Parallel and Distributed Sparse Optimization

how large P can be

[Figure, animated across Pages 22–28 of the original deck: iterate paths for a 2D toy problem; the left panel (P = 1) converges while the right panel (P = 3) diverges]

Page 29: Parallel and Distributed Sparse Optimization

near orthogonality enables P ≥ 2

assume each column of A has unit 2-norm

define the block spectral radius

ρ_P = max_{M ∈ M} ρ(M),

where M is the set of all P × P submatrices of A^T A obtained by selecting exactly one column from each of the P chosen blocks

Theorem

if ρ_P < 2, then GRock with P parallel updates per iteration gives

F(x + d) − F(x) ≤ ((ρ_P − 2)/2) β ‖d‖₂²

moreover,

F(x^k) − F(x^*) ≤ [2C² (2L + β√(NP))²] / [(2 − ρ_P) β] · (1/k).
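For toy problems, ρ_P can be checked by brute force directly from this definition; the sketch below is illustrative only (it assumes unit-norm columns and blocks given as index lists, and its cost grows exponentially in P).

```python
import itertools
import numpy as np

def block_spectral_radius(A, blocks, P):
    """Max spectral radius over P x P submatrices of A^T A formed by taking
    exactly one column index from each of P distinct blocks."""
    G = A.T @ A
    rho = 0.0
    for chosen in itertools.combinations(blocks, P):
        for cols in itertools.product(*chosen):
            sub = G[np.ix_(cols, cols)]
            rho = max(rho, np.max(np.abs(np.linalg.eigvals(sub))))
    return rho
```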

Page 30: Parallel and Distributed Sparse Optimization

compare different block selection rules

[Figure: objective error (f − f_min) and relative error vs. normalized # of coordinates calculated, comparing Cyclic CD, Greedy CD, Mixed CD, and GRock]

• LASSO test with A ∈ R^{512×1024}, N = 64 column blocks

• GRock uses P = 8 updates each iteration

• greedy CD⁵ uses P = 1

• mixed CD⁶ selects P = 8 random blocks and the best coordinate in each block, per iteration

⁵ Li and Osher [2009]
⁶ Scherrer, Tewari, Halappanavar, and Haglin [2012]

Page 31: Parallel and Distributed Sparse Optimization

real CMU data⁷

A is 6205 × 24820 with about 0.4% nonzero entries

[Figure: objective error (f − f_min) and relative error vs. iteration number for FISTA, Greedy CD, Mixed CD, and GRock]

GRock is often (but not always) quicker than FISTA

the update of g in GRock is about 50% cheaper than in FISTA

GRock requires near orthogonality to work, but FISTA does not

⁷ Bradley, Kyrola, Bickson, and Guestrin [2011]

Page 32: Parallel and Distributed Sparse Optimization

Why GRock rocks:

• coordinate selection tends to pick coordinates corresponding to nonzero entries in the solution

• such selection keeps most entries of x at zero throughout the iterations

• hence, the problem dimension is effectively reduced

• sparse updates also make the gradient cheaper to maintain

prox-linear iterations typically have dense intermediate solutions at the beginning and can take many iterations before the iterates finally become sparse

Page 33: Parallel and Distributed Sparse Optimization

Outline

1. sparse optimization background

2. model and data distribution

3. parallel and distributed algorithms

parallel/distributed prox-linear iteration

GRock: greedy coordinate-block descent

parallel/distributed ADMM

4. numerical experiments

Rice cluster STIC

Amazon EC2

5. conclusions

Page 34: Parallel and Distributed Sparse Optimization

alternating direction method of multipliers (ADMM)

step 1: turn the problem into the form

min_{x,y}  f(x) + g(y)
s.t.  Ax + By = b.

f and g are convex, possibly nonsmooth, and can incorporate constraints

step 2: iterate

1. x^{k+1} ← argmin_x  f(x) + g(y^k) + (β/2)‖Ax + By^k − b − z^k‖₂²,

2. y^{k+1} ← argmin_y  f(x^{k+1}) + g(y) + (β/2)‖Ax^{k+1} + By − b − z^k‖₂²,

3. z^{k+1} ← z^k − (Ax^{k+1} + By^{k+1} − b).

history: dates back to the 1960s, formalized in the 1980s, parallel versions appeared in the late 1990s, revived very recently, and convergence results have been enriched recently
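As a concrete instance of step 2 (not from the slides), here is a hedged NumPy sketch of ADMM applied to the LASSO, written as min ½‖Mx − b‖² + λ‖y‖₁ s.t. x − y = 0, so A = I, B = −I, b = 0 in the template above; β is the penalty parameter and z the scaled multiplier.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(M, b, lam, beta=1.0, iters=200):
    n = M.shape[1]
    x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
    MtM, Mtb = M.T @ M, M.T @ b
    chol = np.linalg.cholesky(MtM + beta * np.eye(n))   # factor once, reuse each iteration
    for _ in range(iters):
        # 1. x-step: solve (MtM + beta I) x = Mtb + beta (y + z)
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, Mtb + beta * (y + z)))
        # 2. y-step: prox of lam*||.||_1
        y = soft_threshold(x - z, lam / beta)
        # 3. multiplier update: z <- z - (x - y)
        z = z - (x - y)
    return y
```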

Page 35: Parallel and Distributed Sparse Optimization

parallel/distributed ADMM⁸

idea: form a separable f = ∑_i f_i(x_i), a block-diagonal A, and simple g and B

min_{x,y}  ∑_i f_i(x_i) + g(y)
s.t.  blkdiag(A_1, …, A_M) [x_1; …; x_M] + [B_1; …; B_M] y = [b_1; …; b_M],

i.e., A_i x_i + B_i y = b_i for i = 1, …, M

consequently, computing x^{k+1} reduces to parallel subproblems

x_i^{k+1} ← min_{x_i}  f_i(x_i) + (β/2)‖A_i x_i + B_i y^k − b_i − z_i^k‖₂²,  i = 1, …, M

⁸ Boyd, Parikh, Chu, Peleato, and Eckstein [2011]

Page 36: Parallel and Distributed Sparse Optimization

scenario a: primal ADMM

assume separability L(z) = ∑_i L_i(z_i)

reformulate

min_x  ∑_i L_i(A_(i)x, b_i) + R(x)

into

min_{x, x_(1), …, x_(M)}  ∑_i L_i(A_(i)x_(i), b_i) + R(x)
s.t.  x_(i) − x = 0,  i = 1, …, M

(in matrix form: blkdiag(I, …, I) [x_(1); …; x_(M)] − [I; …; I] x = 0)

note: the x_(i) are not subvectors of x but local copies of x

Page 37: Parallel and Distributed Sparse Optimization

scenario a: primal ADMM

storage: node i keeps row block A_(i), subvector b_i, and the current x_(i)^k, x^k, z_i^k

each iteration:

1. parallel: x_(i)^{k+1} ← argmin_{x_(i)}  L_i(A_(i)x_(i), b_i) + (β/2)‖x_(i) − x^k − z_i^k‖₂², for every i

2. allreduce: (1/M) ∑_i (x_(i)^{k+1} − z_i^k) for all nodes

3. parallel: x^{k+1} ← argmin_x  R(x) + (βM/2)‖x − (1/M) ∑_i (x_(i)^{k+1} − z_i^k)‖₂²

4. parallel: z_i^{k+1} ← z_i^k − (x_(i)^{k+1} − x^{k+1}) for every i

note: steps 2 & 3 are due to

(β/2) ‖blkdiag(I, …, I) [x_(1); …; x_(M)] − [I; …; I] x − [z_1; …; z_M]‖₂² = (βM/2) ‖x − (1/M) ∑_i (x_(i) − z_i)‖₂² + C

where C is independent of x
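A hedged mpi4py sketch of one such iteration, specialized to L_i = ½‖A_(i)x_(i) − b_i‖² and R = λ‖·‖₁ so that steps 1 and 3 have explicit solutions; all names are illustrative, and each rank keeps its own x_(i) and z_i.

```python
from mpi4py import MPI
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def primal_admm_step(comm, A_blk, b_blk, x, xi, zi, lam, beta):
    M = comm.Get_size()
    n = x.size
    # 1. parallel: x_(i) update, a regularized least-squares subproblem
    xi = np.linalg.solve(A_blk.T @ A_blk + beta * np.eye(n),
                         A_blk.T @ b_blk + beta * (x + zi))
    # 2. allreduce: average of (x_(i)^{k+1} - z_i^k) over all nodes
    avg = np.empty(n)
    comm.Allreduce(xi - zi, avg, op=MPI.SUM)
    avg /= M
    # 3. parallel: consensus variable x^{k+1} = prox_{R/(beta*M)}(average) = shrinkage
    x = soft_threshold(avg, lam / (beta * M))
    # 4. parallel: dual update z_i^{k+1} = z_i^k - (x_(i)^{k+1} - x^{k+1})
    zi = zi - (xi - x)
    return x, xi, zi
```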

Page 38: Parallel and Distributed Sparse Optimization

dual ADMM

reformulate

min_x  λR(x) + L(Ax, b)

into

min_{x,y}  λR(x) + L(y, b)
s.t.  Ax − y = 0

dualization:

⇒ min_{x,y} max_z {λR(x) + L(y, b) + z^T(Ax − y)}

⇒ max_z min_{x,y} {λR(x) + L(y, b) + z^T(Ax − y)}

⇒ max_z  min_x{λR(x) + z^T Ax} + min_y{L(y, b) − z^T y}

⇒ max_z  G(A^T z) + H(z)

where G(z) := min_x{λR(x) + z^T x} and H(z) := min_y{L(y, b) − z^T y}

Page 39: Parallel and Distributed Sparse Optimization

scenario b: dual ADMM

assume separability R(x) = ∑_j R_j(x_j); then G is also separable:

G(A^T z) = ∑_j G_j(A_j^T z)

reformulate the dual problem as:

min_{z, z_(1), …, z_(N), y}  −(1/N) ∑_j H(z_(j)) − ∑_j G_j(y_j)
s.t.  A_j^T z_(j) − y_j = 0,  j = 1, …, N
      z_(j) − z = 0,  j = 1, …, N

(in matrix form: blkdiag(A_1^T, …, A_N^T) [z_(1); …; z_(N)] − blkdiag(I, …, I) [y_1; …; y_N] = 0 and blkdiag(I, …, I) [z_(1); …; z_(N)] − [I; …; I] z = 0)

note: the z_(j) are not subvectors of z but local copies of z

Page 40: Parallel and Distributed Sparse Optimization

scenario b: dual ADMM

storage: node j keeps column block A_j, vector b, and the current y_j^k, z^k, z_(j)^k, as well as the dual variables x_j^k, p_j^k of the dual problem

each iteration:

1. parallel: y_j^{k+1} ← argmin_{y_j}  −G_j(y_j) + (α/2)‖A_j^T z_(j)^k − y_j − x_j^k‖₂², for every j

2. parallel: z_(j)^{k+1} ← argmin_{z_(j)}  −(1/N) H(z_(j)) + (α/2)‖A_j^T z_(j) − y_j^{k+1} − x_j^k‖₂² + (β/2)‖z_(j) − z^k − p_j^k‖₂², for every j

3. allreduce: z^{k+1} ← (1/N) ∑_{j=1}^{N} (z_(j)^{k+1} − p_j^k) for all nodes

4. parallel: x_j^{k+1} ← x_j^k − γ(A_j^T z_(j)^{k+1} − y_j^{k+1}) for every j

5. parallel: p_j^{k+1} ← p_j^k − γ(z_(j)^{k+1} − z^{k+1}) for every j

Page 41: Parallel and Distributed Sparse Optimization

Outline

1. sparse optimization background

2. model and data distribution

3. parallel and distributed algorithms

parallel/distributed prox-linear iteration

GRock: greedy coordinate-block descent

parallel/distributed ADMM

4. numerical experiments

Rice cluster STIC

Amazon EC2

5. conclusions

Page 42: Parallel and Distributed Sparse Optimization

test on STIC, a cluster at Rice University

specs:

• 170 Appro Greenblade E5530 nodes, each with two quad-core 2.4 GHz Xeon (Nehalem) CPUs

• each node has 12 GB of memory shared by all 8 cores

• # of processes used on each node = 8

test dataset:

             A type     A size        λ      sparsity of x∗
dataset I    Gaussian   1024 × 2048   0.1    100
dataset II   Gaussian   2048 × 4096   0.01   200

Page 43: Parallel and Distributed Sparse Optimization

cores vs iterations

[Figure: number of iterations vs. number of cores for p D-ADMM, p FISTA, and GRock; dataset I (left), dataset II (right)]

Page 44: Parallel and Distributed Sparse Optimization

cores vs time

[Figure: time (s) vs. number of cores (P = 16) for p D-ADMM, p FISTA, and GRock; dataset I (left), dataset II (right)]

Page 45: Parallel and Distributed Sparse Optimization

cores vs (% communication/total time)

[Figure: communication time as a fraction of total time vs. number of cores for p D-ADMM, p FISTA, and GRock; dataset I (left), dataset II (right)]

Page 46: Parallel and Distributed Sparse Optimization

big data LASSO on Amazon EC2

Amazon EC2 is an elastic, pay-as-you-go cluster

advantage: no hardware investment, everyone can have an account

test dataset

• A: a dense matrix with 20 billion entries, about 170 GB in size

• x: 200K entries, 4K nonzeros, Gaussian values

requested system

• 20 “high-memory quadruple extra-large instances”

• each instance has 8 cores and 60GB memory

code

• written in C using GSL (for matrix-vector multiplication) and MPI

Page 47: Parallel and Distributed Sparse Optimization

parallel dual ADMM vs GRock

                              p D-ADMM    p FISTA    GRock
estimate stepsize (min.)      n/a         1.6        n/a
matrix factorization (min.)   51          n/a        n/a
iteration time (min.)         105         40         1.7
number of iterations          2500        2500       104
communication time (min.)     30.7        9.5        0.5
stopping relative error       1E-1        1E-3       1E-5
total time (min.)             156         41.6       1.7
Amazon charge                 $85         $22.6      $0.93

• ADMM's performance depends on the penalty parameter β; we picked the best β out of only a few trials (we could not afford more)

• parallel dual ADMM was stopped after 2500 iterations

• GRock stopped at relative error 1E-5

Page 48: Parallel and Distributed Sparse Optimization

conclusions

summary of results

• ADMM is flexible but does not scale well; its iteration count increases as the data are distributed over more nodes

• separable objectives enable parallel and distributed prox-linear iterations,

which scale very well

• nearly orthogonal A further enables parallel GRock, which scales even

better

• in general, model structures and data properties of sparse optimization

enable very efficient parallel and distributed approaches

acknowledgements: NSF DMS and ECCS, ARO/ARL MURI

open resources: report, slides, and codes available on our websites

Page 49: Parallel and Distributed Sparse Optimization

References:

David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

Emmanuel J. Candes, Justin Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

Elaine T. Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for ℓ1-minimization: methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008. doi: 10.1137/080714488. URL http://link.aip.org/link/?SCE/31/890.

Wotao Yin. Analysis and generalizations of the linearized Bregman method. SIAM Journal on Imaging Sciences, 3(4):856–877, 2010.

Junfeng Yang and Yin Zhang. Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250–278, 2011.

Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

Deanna Needell and Joel A. Tropp. CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

Page 50: Parallel and Distributed Sparse Optimization

Yingying Li and Stanley Osher. Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.

Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, and David Haglin. Feature clustering for accelerating parallel coordinate descent. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Leon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 28–36, 2012.

Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for ℓ1-regularized loss minimization. In ICML, pages 321–328, 2011.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.