

Sparse Optimization

Lecture: Parallel and Distributed Sparse Optimization

Instructor: Wotao Yin

July 2013

online discussions on piazza.com

Those who complete this lecture will know

• basics of parallel computing

• how to parallelize a bottleneck of an existing sparse optimization method

• primal and dual decomposition

• GRock: greedy coordinate-block descent, and its parallel implementation

1 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

4. Parallel greedy coordinate descent

5. Numerical results with big data

2 / 74


Serial computing

Traditional computing is serial

• a single CPU, one instruction is executed at a time

• a problem is broken into a series of instructions, executed one after another

[Figure: a problem decomposed into instructions t1, t2, ..., tN executed one at a time on a single CPU]

7 / 74


Parallel computing

• a problem is broken into concurrent parts, executed simultaneously, coordinated by a controller

• uses multiple cores/CPUs/networked computers, or their combinations

• expected to solve a problem in less time, or a larger problem

[Figure: a problem decomposed into parts t1, t2, ..., tN executed simultaneously on multiple CPUs]

8 / 74


Commercial applications

Examples: internet data mining, logistics/supply chain, finance, online ads, recommendation, sales data analysis, virtual reality, computer graphics (cartoon movies), medicine design, medical imaging, network video, ...

9 / 74


Applications in science and engineering

• simulation: many real-world events happen concurrently and are interrelated; examples: galaxy, ocean, weather, traffic, assembly line, queues

• use in science/engineering: environment, physics, bioscience, chemistry, geoscience, mechanical engineering, mathematics (computerized proofs), defense, intelligence

10 / 74


Benefits of being parallel

• break single-CPU limits

• save time

• process big data

• access non-local resource

• handle concurrent data streams / enable collaboration

11 / 74


Parallel overhead

• computing overhead: start-up time, synchronization wait time, data communication time, termination (data collection) time

• I/O overhead: slow read and write of non-local data

• algorithm overhead: extra variables, data duplication

• coding overhead: language, library, operating system, debugging, maintenance

12 / 74


Different parallel computers

• the basic element is like a single computer, a 70-year-old architecture; components: CPU (control/arithmetic units), memory, input/output

• SIMD: single instruction, multiple data; GPU-like. Applications: image filtering, PDEs on a grid, ...

• MISD: multiple instruction, single data; rarely seen. Conceivable applications: cryptography attacks, global optimization

• MIMD: multiple instruction, multiple data; most of today's supercomputers, clusters, clouds, and multi-core PCs

13 / 74


Tech jargon

• HPC: high-performance computing

• node/CPU/processor/core

• shared memory: either all processors access a common physical memory, or parallel tasks have direct addressing of a logical memory

• SMP: multiple processors sharing a single memory; shared-memory computing

• distributed memory: parallel tasks see local memory and use communication to access remote memory

• communication: exchange of data or synchronization controls, either through shared memory or over a network

• synchronization: coordination; one processor awaits others

14 / 74


Tech jargon

• granularity: coarse means more work between communication events; fine means less work between communication events

• embarrassingly parallel: main tasks are independent, with little need for coordination

• speed scalability: how speedup can increase with additional resources; data scalability: how problem size can increase with additional resources. Scalability is affected by: problem model, algorithm, memory bandwidth, network speed, parallel overhead. Sometimes adding more processors/memory may not save time!

• memory architectures: uniform or non-uniform shared, distributed, hybrid

• parallel models: shared memory, message passing, data parallel, SPMD, MPMD, etc.

15 / 74


Parallel speedup

• definition:

  speedup = (serial time) / (parallel time)

  where time is measured in the wall-clock sense

• Amdahl's Law: assume infinitely many processors and no overhead; then

  ideal max speedup = 1 / (1 − ρ)

  where ρ = percentage of parallelized code, which can depend on the input size

[Figure: ideal max speedup vs. the parallel portion of code ρ]

16 / 74


Parallel speedup

• assume N processors and no overhead:

  ideal speedup = 1 / (ρ/N + (1 − ρ))

• ρ may also depend on N

[Figure: ideal speedup vs. number of processors, for parallel portions ρ = 25%, 50%, 90%, 95%]

17 / 74


Parallel speedup

• ε := parallel overhead

• in the real world,

  actual speedup = 1 / (ρ/N + (1 − ρ) + ε)

[Figures: actual speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%, when ε = N (left) and ε = log(N) (right)]
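Below is a minimal Python sketch of the two speedup formulas above; the overhead values fed to actual_speedup (the 1e-3·log(N) model) are an illustrative assumption, not the constants behind the plotted curves.

```python
import math

def ideal_speedup(rho, N):
    """Amdahl-style ideal speedup: N processors, parallel fraction rho, no overhead."""
    return 1.0 / (rho / N + (1.0 - rho))

def actual_speedup(rho, N, eps):
    """Speedup with a parallel-overhead term eps (in normalized time units)."""
    return 1.0 / (rho / N + (1.0 - rho) + eps)

for N in (1, 8, 64, 512):
    # assume the overhead grows like log(N); the 1e-3 constant is arbitrary
    print(N, round(ideal_speedup(0.95, N), 2),
          round(actual_speedup(0.95, N, 1e-3 * math.log(N + 1)), 2))
```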

18 / 74


Basic types of communication

[Figure: the four basic collectives: broadcast, scatter, gather, and reduce (e.g., a sum-reduce of 1, 2, 3, 3 yields 9)]

19 / 74


Allreduce

Allreduce sum via butterfly communication

node:     0   1   2   3   4   5   6   7
start:    2   3   6   4   8   1   7   5
round 1:  5   5  10  10   9   9  12  12
round 2: 15  15  15  15  21  21  21  21
round 3: 36  36  36  36  36  36  36  36

log(N) layers, N parallel communications per layer
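A minimal single-process Python simulation of this butterfly pattern (assuming the number of nodes is a power of two): in round k, node i pairs with node i XOR 2^k and both keep the sum.

```python
def butterfly_allreduce_sum(values):
    """Simulate allreduce-sum over len(values) nodes via butterfly exchanges."""
    vals = list(values)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "number of nodes must be a power of two"
    step = 1
    while step < n:                                   # log2(n) rounds
        # each node i adds the value held by its partner i XOR step
        vals = [vals[i] + vals[i ^ step] for i in range(n)]
        step <<= 1
    return vals

print(butterfly_allreduce_sum([2, 3, 6, 4, 8, 1, 7, 5]))  # [36, 36, ..., 36]
```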

20 / 74


Additional important topics

• synchronization types: barrier, lock, synchronous communication

• load balancing: static or dynamic distribution of tasks/data so that all processors are kept busy all the time

• granularity: fine grain vs. coarse grain

• I/O: a traditional parallelism killer, but modern databases/parallel file systems alleviate the problem (e.g., the Hadoop Distributed File System (HDFS))

21 / 74


Example: π, area computation

[Figure: random points sampled uniformly in the unit box [−1/2, 1/2]², with the inscribed disc of radius 1/2]

• box area is 1 × 1

• disc area is π/4

• ratio of areas ≈ fraction of points inside the disc

• π ≈ 4 · (# points in the disc) / (# of all points)

22 / 74


Serial pseudo-code

total = 100000
in_disc = 0
for k = 1 to total do
    x ← a uniformly random point in [−1/2, 1/2]²
    if x is in the disc then
        in_disc++
    end if
end for
return 4 * in_disc / total

23 / 74


Parallel π computation

[Figure: random points in the unit box with the inscribed disc]

• each sample is independent of the others

• use multiple processors to run the for-loop

• use SPMD; process #1 collects the results and returns π

24 / 74


SPMD pseudo-code

total = 100000
in_disc = 0
P = number of parallel processors
i = my processor id
total = total / P
for k = 1 to total do
    x ← a uniformly random point in [−1/2, 1/2]²
    if x is in the disc then
        in_disc++
    end if
end for
if i == 1 then
    total_in_disc = reduce(in_disc, SUM)
    return 4 * total_in_disc / (total * P)
end if

• requires only one collective communication;

• speedup = 1 / (1/N + O(log N)) = N / (1 + cN log N), where c is a very small number;

• the main loop does not require any synchronization.
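For concreteness, here is a small runnable Python version of this Monte Carlo estimate; multiprocessing.Pool stands in for the SPMD processes and the final sum plays the role of reduce(in_disc, SUM) (this mapping is an illustration, not the MPI-style setup assumed on the slides).

```python
import random
from multiprocessing import Pool

def count_in_disc(samples):
    """Count samples landing inside the disc of radius 1/2 centered at the origin."""
    hits = 0
    for _ in range(samples):
        x = random.uniform(-0.5, 0.5)
        y = random.uniform(-0.5, 0.5)
        if x * x + y * y <= 0.25:
            hits += 1
    return hits

if __name__ == "__main__":
    total, P = 100_000, 4                          # total samples, number of workers
    with Pool(P) as pool:                          # each worker runs its own loop
        counts = pool.map(count_in_disc, [total // P] * P)
    print(4.0 * sum(counts) / total)               # reduce(SUM) + final estimate
```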

25 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

4. Parallel greedy coordinate descent

5. Numerical results with big data

26 / 74


Decomposable objectives and constraints enable parallelism

• variables x = [x_1, x_2, . . . , x_N]^T

• embarrassingly parallel (not an interesting case): min Σ_i f_i(x_i)

• consensus minimization, no constraints:

  min f(x) = Σ_{i=1}^N f_i(x) + r(x)

  LASSO example: f(x) = (λ/2) Σ_i ‖A_(i)x − b_i‖₂² + ‖x‖₁

• coupling variable z:

  min_{x,z} f(x, z) = Σ_{i=1}^N [f_i(x_i, z) + r_i(x_i)] + r(z)

27 / 74


• coupling constraints:

  min f(x) = Σ_{i=1}^N [f_i(x_i) + r_i(x_i)]

  subject to

  • equality constraints: Σ_{i=1}^N A_i x_i = b

  • inequality constraints: Σ_{i=1}^N A_i x_i ≤ b

  • graph-coupling constraints Ax = b, where A is sparse

• combinations of independent variables, coupling variables, and constraints; example:

  min_{x,w,z} Σ_{i=1}^N f_i(x_i, z)
  s.t. Σ_i A_i x_i ≤ b;  B_i x_i ≤ w, ∀i;  w ∈ Ω

• variable coupling ←→ constraint coupling:

  min Σ_{i=1}^N f_i(x_i, z)  −→  min Σ_{i=1}^N f_i(x_i, z_i)  s.t.  z_i = z ∀i

28 / 74


Approaches toward parallelism

Keep an existing serial algorithm, parallelize its expensive part(s)

Introduce a new parallel algorithm by

• formulating an equivalent model

• replacing the model by an approximate model

where the new model lends itself to parallel computing

29 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

4. Parallel greedy coordinate descent

5. Numerical results with big data

30 / 74


Example: parallel ISTA

minimize_x  λ‖x‖₁ + f(x)

Prox-linear approximation:

  arg min_x  λ‖x‖₁ + ⟨∇f(x^k), x − x^k⟩ + (1/(2δ_k))‖x − x^k‖₂²

Iterative soft-thresholding algorithm (ISTA):

  x^{k+1} = shrink(x^k − δ_k ∇f(x^k), λδ_k)

• shrink(y, t) has a simple closed form: max(abs(y) − t, 0) .* sign(y)

• the dominating computation is ∇f(x); examples: f(x) = (1/2)‖Ax − b‖₂², the logistic loss of Ax − b, the hinge loss, ..., where A is often dense and can be very large-scale
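As a quick illustration (a NumPy sketch, not code from the slides), the shrink operator and one ISTA step for the choice f(x) = ½‖Ax − b‖₂² look like this:

```python
import numpy as np

def shrink(y, t):
    """Soft-thresholding: max(|y| - t, 0) * sign(y), applied elementwise."""
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_step(x, A, b, lam, delta):
    """One ISTA iteration for lam*||x||_1 + 0.5*||Ax - b||_2^2."""
    grad = A.T @ (A @ x - b)        # the dominating computation
    return shrink(x - delta * grad, lam * delta)
```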

31 / 74


Parallel gradient computing of f(x) = g(Ax+ b): row partition

Assumption: A is big and dense; g(y) = Σ g_i(y_i) is decomposable.

Then f(x) = g(Ax + b) = Σ g_i(A_(i)x + b_i)

• let A_(i) and b_i be the i-th row block of A and b, respectively

[Figure: A partitioned into row blocks A_(1), A_(2), . . . , A_(M)]

• store A_(i), b_i on node i

• goal: compute ∇f(x) = Σ_i A_(i)^T ∇g_i(A_(i)x + b_i), or similarly ∂f(x)

32 / 74


Parallel gradient computing of f(x) = g(Ax+ b): row partition

Steps:

1. compute A_(i)x + b_i for all i, in parallel;
2. compute g_i = ∇g_i(A_(i)x + b_i) for all i, in parallel;
3. compute A_(i)^T g_i for all i, in parallel;
4. allreduce Σ_{i=1}^M A_(i)^T g_i.

• steps 1, 2, and 3 are local computations and communication-free;

• step 4 requires an allreduce-sum communication.
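A minimal mpi4py sketch of these four steps for the concrete choice g(y) = ½‖y‖₂² (so ∇g_i is just the local vector itself); the matrix sizes and random data are illustrative assumptions, mpi4py stands in for the MPI code of the slides, and the script would be launched with mpirun.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
i = comm.Get_rank()                       # this node's id

# node i holds only its row block A_(i), b_i (random data for illustration)
rng = np.random.default_rng(i)
A_i = rng.standard_normal((50, 200))
b_i = rng.standard_normal(50)
x = np.zeros(200)                         # every node keeps the full iterate x

y_i = A_i @ x + b_i                       # step 1: local A_(i) x + b_i
g_i = y_i                                 # step 2: grad of 0.5*||y||^2 is y itself
local = A_i.T @ g_i                       # step 3: local A_(i)^T g_i
grad = np.empty_like(local)
comm.Allreduce(local, grad, op=MPI.SUM)   # step 4: allreduce sum -> full gradient
```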

33 / 74


Parallel gradient computing of f(x) = g(Ax+ b): column partition

[Figure: A partitioned into column blocks A_1, A_2, . . . , A_N]

  Ax = Σ_{i=1}^N A_i x_i  ⟹  f(x) = g(Ax + b) = g(Σ_i A_i x_i + b)

  ∇f(x) = [A_1^T ∇g(Ax + b); A_2^T ∇g(Ax + b); . . . ; A_N^T ∇g(Ax + b)],  and similarly for ∂f(x)

34 / 74


Parallel gradient computing of f(x) = g(Ax + b): column partition

Steps:

1. compute A_i x_i + (1/N)b for all i, in parallel;
2. allreduce Ax + b = Σ_{i=1}^N (A_i x_i + (1/N)b);
3. evaluate ∇g(Ax + b) in each process;
4. compute A_i^T ∇g(Ax + b) for all i, in parallel.

• each processor keeps A_i and x_i, as well as a copy of b;

• step 2 requires an allreduce-sum communication;

• step 3 involves duplicated computation in each process (hopefully, g is simple).
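A single-process NumPy simulation of this column-partition pattern (the Python loop over blocks stands in for the parallel nodes and the running sum for the allreduce; g(y) = ½‖y‖₂² and the problem sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 60, 200, 4                       # rows, columns, number of column blocks
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
blocks = np.array_split(np.arange(n), N)   # column indices owned by each "node"

# steps 1-2: each node forms A_i x_i + b/N; summing them is the allreduce
Axb = sum(A[:, idx] @ x[idx] + b / N for idx in blocks)

# step 3: every node evaluates grad g at the shared vector (duplicated work)
gradg = Axb                                # grad of 0.5*||y||^2 is y

# step 4: each node computes its own gradient block A_i^T grad g
grad_blocks = [A[:, idx].T @ gradg for idx in blocks]
assert np.allclose(np.concatenate(grad_blocks), A.T @ (A @ x + b))
```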

35 / 74


Two-way partition

[Figure: A partitioned into M × N blocks A_{i,j}, i = 1, . . . , M, j = 1, . . . , N]

  f(x) = Σ_i g_i(A_(i)x + b_i) = Σ_i g_i(Σ_j A_{i,j} x_j + b_i)

  ∇f(x) = Σ_i A_(i)^T ∇g_i(A_(i)x + b_i) = Σ_i A_(i)^T ∇g_i(Σ_j A_{i,j} x_j + b_i)

36 / 74


Two-way partition

Steps:

1. compute A_{i,j} x_j + (1/N)b_i for all (i, j), in parallel;
2. allreduce to obtain A_(i)x + b_i = Σ_{j=1}^N (A_{i,j} x_j + (1/N)b_i) for every i;
3. compute g_i = ∇g_i(A_(i)x + b_i) for all i, in parallel;
4. compute A_{i,j}^T g_i for all (i, j), in parallel;
5. allreduce Σ_{i=1}^M A_{i,j}^T g_i for every j.

• processor (i, j) keeps A_{i,j}, x_j, and b_i;

• steps 2 and 5 require parallel allreduce-sum communications.

37 / 74


Example: parallel ISTA for LASSO

LASSO:

  min f(x) = λ‖x‖₁ + (1/2)‖Ax − b‖₂²

Algorithm:

  x^{k+1} = shrink(x^k − δ_k A^T A x^k + δ_k A^T b, λδ_k)

Serial pseudo-code:

initialize x = 0 and δ
pre-compute A^T b
while not converged do
    g = A^T A x − A^T b
    x = shrink(x − δg, λδ)
end while
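A self-contained NumPy version of this serial pseudo-code (the random test problem, the step size δ = 1/‖A^T A‖₂, and the stopping test are illustrative assumptions):

```python
import numpy as np

def shrink(y, t):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_lasso(A, b, lam, max_iter=5000, tol=1e-8):
    """Minimize lam*||x||_1 + 0.5*||Ax - b||_2^2 by ISTA."""
    delta = 1.0 / np.linalg.norm(A.T @ A, 2)   # constant step size 1/L
    Atb = A.T @ b                              # pre-compute A^T b
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T @ (A @ x) - Atb                # gradient of the smooth part
        x_new = shrink(x - delta * g, lam * delta)
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            return x_new
        x = x_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 400))
x_true = np.zeros(400); x_true[:10] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(100)
print(np.count_nonzero(np.abs(ista_lasso(A, b, lam=0.5)) > 1e-6))  # recovered support size
```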

38 / 74


Example: parallel ISTA for LASSO

Distribute blocks of rows to M nodes.

[Figure: A partitioned into row blocks A_(1), A_(2), . . . , A_(M)]

initialize x = 0 and δ
i = my processor id
processor i loads A_(i)
pre-compute δA^T b
while not converged do
    processor i computes c_i = A_(i)^T A_(i) x
    c = allreduce(c_i, SUM)
    x = shrink(x − δc + δA^T b, λδ)
end while

• assume A ∈ R^{m×n}

• one allreduce per iteration

• speedup ≈ 1 / (ρ/M + (1 − ρ) + O(log M))

• ρ is close to 1

• requires synchronization

39 / 74


Example: parallel ISTA for LASSO

Distribute blocks of columns to N nodes.

[Figure: A partitioned into column blocks A_1, A_2, . . . , A_N]

initialize x = 0 and δ
i = my processor id
processor i loads A_i
pre-compute δA_i^T b
while not converged do
    processor i computes y_i = A_i x_i
    y = allreduce(y_i, SUM)
    processor i computes z_i = A_i^T y
    x_i^{k+1} = shrink(x_i^k − δz_i + δA_i^T b, λδ)
end while

• one allreduce per iteration

• speedup ≈ 1 / (ρ/N + (1 − ρ) + O((m/n) log N))

• ρ is close to 1

• requires synchronization

• if m ≪ n, this approach is faster

40 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition³

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

4. Parallel greedy coordinate descent

5. Numerical results with big data

³Dantzig and Wolfe [1960], Spingarn [1985], Bertsekas and Tsitsiklis [1997]

41 / 74


Unconstrained minimization with coupling variables

x = [x_1, x_2, . . . , x_N]^T.  z is the coupling/bridging/complicating variable.

  min_{x,z} f(x, z) = Σ_{i=1}^N f_i(x_i, z)

equivalently, a separable objective with coupling constraints:

  min_{x,z} Σ_{i=1}^N f_i(x_i, z_i)   s.t.  z_i = z ∀i

Examples:

• network utility maximization (NUM), extended to multiple layers⁴

• domain decomposition (z are overlapping boundary variables)

• systems of equations: A is block-diagonal but with a few dense columns

⁴See the tutorial: Palomar and Chiang [2006]

42 / 74


Primal decomposition (bi-level optimization)

Introduce easier subproblems

  g_i(z) = min_{x_i} f_i(x_i, z),   i = 1, . . . , N

then

  min_{x,z} Σ_{i=1}^N f_i(x_i, z)  ⟺  min_z Σ_i g_i(z)

The right-hand side is called the master problem.

Parallel computing: iterate

1. fix z; update x_i and compute g_i(z) in parallel
2. update z using a standard method (typically, subgradient descent)

• good for small-dimensional z

• fast if the master problem converges fast

43 / 74


Dual decomposition

Consider

  min_{x,z} Σ_i f_i(x_i, z_i)   s.t.  Σ_i A_i z_i = b

Introduce

  g_i(z) = min_{x_i} f_i(x_i, z)

  g_i*(y) = max_z { y^T z − g_i(z) }   // the convex conjugate

Dualization:

  ⟺ min_{x,z} max_y Σ_i f_i(x_i, z_i) + y^T (b − Σ_i A_i z_i)

  ⟺ max_y min_{x_i,z_i} Σ_i ( f_i(x_i, z_i) − y^T A_i z_i ) + y^T b   (generalized minimax theorem)

  ⟺ max_y min_{z_i} Σ_i ( g_i(z_i) − y^T A_i z_i ) + y^T b

  ⟺ min_y Σ_i g_i*(A_i^T y) − y^T b

where g_i(z) = min_{x_i} f_i(x_i, z) and g_i* is its convex conjugate.

44 / 74


Example: consensus and exchange problems

Consensus problem:

  min_z Σ_i g_i(z)

Reformulate as

  min_z Σ_i g_i(z_i)   s.t.  z_i = z ∀i

Its dual problem is the exchange problem:

  min_y Σ_i g_i*(y_i)   s.t.  Σ_i y_i = 0

45 / 74


Primal-dual objective properties

Constrained separable problem:

  min_z Σ_i g_i(z_i)   s.t.  Σ_i A_i z_i = b

Dual problem:

  min_y Σ_i g_i*(A_i^T y) − b^T y

• If g_i is proper, closed, and convex, then g_i*(a_i^T y) is sub-differentiable.

• If g_i is strictly convex, then g_i*(a_i^T y) is differentiable.

• If g_i is strongly convex, then g_i*(a_i^T y) is differentiable and has a Lipschitz gradient.

• Obtaining z* from y* generally requires strict convexity of g_i or strict complementary slackness.

Examples:

• g_i(z) = |z_i| and g_i*(a_i^T y) = ι_{−1 ≤ a_i^T y ≤ 1}

• g_i(z) = (1/2)|z_i|² and g_i*(a_i^T y) = (1/2)|a_i^T y|²

• g_i(z) = exp(z_i) and g_i*(a_i^T y) = (a_i^T y) ln(a_i^T y) − a_i^T y + ι_{a_i^T y ≥ 0}

46 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman⁵)

• Parallel ADMM

4. Parallel greedy coordinate descent

5. Numerical results with big data

⁵Yin, Osher, Goldfarb, and Darbon [2008]

47 / 74


Example: augmented ℓ1

Change |z_i| into the strongly convex g_i(z_i) = |z_i| + (1/(2α))|z_i|².

Dual problem:

  min_y Σ_i g_i*(A_i^T y) − b^T y = Σ_i (α/2)|shrink(A_i^T y)|² − b^T y,   with h_i(y) := (α/2)|shrink(A_i^T y)|²

Dual gradient iteration (equivalent to linearized Bregman):

  y^{k+1} = y^k − t_k (Σ_i ∇h_i(y^k) − b),   where ∇h_i(y) = αA_i shrink(A_i^T y).

• The dual is C¹ and unconstrained. One can apply (accelerated) gradient descent, line search, quasi-Newton methods, etc.

• Recover x* via x_i* = α shrink(A_i^T y*).

• Easy to parallelize: node i computes ∇h_i(y); one collective communication per iteration.
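A serial NumPy sketch of this dual gradient iteration for basis pursuit (Ax = b); the unit shrink threshold follows the augmented ℓ1 formulation above, while the value of α, the constant step size t = 1/(α‖A‖₂²), the iteration count, and the random test problem are assumptions made for illustration.

```python
import numpy as np

def shrink(y, t=1.0):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def linearized_bregman(A, b, alpha=10.0, iters=5000):
    """Dual gradient descent on sum_i (alpha/2)*shrink(A_i^T y)^2 - b^T y."""
    y = np.zeros(A.shape[0])
    t = 1.0 / (alpha * np.linalg.norm(A, 2) ** 2)   # safe constant step size
    for _ in range(iters):
        x = alpha * shrink(A.T @ y)                  # primal recovery / gradient pieces
        y -= t * (A @ x - b)                         # y <- y - t*(sum_i grad h_i - b)
    return alpha * shrink(A.T @ y)

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 120))
x_true = np.zeros(120); x_true[rng.choice(120, 5, replace=False)] = rng.standard_normal(5)
b = A @ x_true
x_hat = linearized_bregman(A, b)
print(np.linalg.norm(x_hat - x_true))   # error (smaller with larger alpha / more iterations)
```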

48 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM⁸

4. Parallel greedy coordinate descent

5. Numerical results with big data

⁸Bertsekas and Tsitsiklis [1997]

56 / 74


Alternating direction method of multipliers (ADMM)

Step 1: turn the problem into the form

  min_{x,y} f(x) + g(y)   s.t.  Ax + By = b,

where f and g are convex, possibly nonsmooth, and can incorporate constraints.

Step 2: iterate

1. x^{k+1} ← arg min_x f(x) + g(y^k) + (β/2)‖Ax + By^k − b − z^k‖₂²,

2. y^{k+1} ← arg min_y f(x^{k+1}) + g(y) + (β/2)‖Ax^{k+1} + By − b − z^k‖₂²,

3. z^{k+1} ← z^k − (Ax^{k+1} + By^{k+1} − b).

History: dates back to the 1960s, formalized in the 1980s; parallel versions appeared in the late 1980s; revived very recently, with new convergence results.
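To make the iteration concrete, here is a small NumPy sketch of ADMM applied to the LASSO, using the splitting f(x) = ½‖Ax − b‖₂², g(y) = λ‖y‖₁ with the constraint x − y = 0 (this specific instance, its closed-form updates, and the test data are an illustration, not the parallel variants discussed next):

```python
import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def admm_lasso(A, b, lam, beta=1.0, iters=500):
    """ADMM for min 0.5*||Ax-b||^2 + lam*||y||_1  s.t.  x - y = 0."""
    n = A.shape[1]
    x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
    L = np.linalg.cholesky(A.T @ A + beta * np.eye(n))   # factor the x-update system once
    Atb = A.T @ b
    for _ in range(iters):
        rhs = Atb + beta * (y + z)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # x-update (linear solve)
        y = shrink(x - z, lam / beta)                      # y-update (soft-threshold)
        z = z - (x - y)                                    # dual update
    return y

rng = np.random.default_rng(2)
A = rng.standard_normal((80, 200))
b = A @ np.r_[np.ones(5), np.zeros(195)] + 0.01 * rng.standard_normal(80)
print(np.count_nonzero(np.abs(admm_lasso(A, b, lam=0.5)) > 1e-6))
```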

57 / 74


ADMM and parallel/distributed computing

Two approaches:

• parallelize the serial algorithm(s) for ADMM subproblem(s),

• turn the problem into a parallel-ready form⁹

  min_{x,y} Σ_i f_i(x_i) + g(y)
  s.t. diag(A_1, . . . , A_M) [x_1; . . . ; x_M] + [B_1; . . . ; B_M] y = [b_1; . . . ; b_M]

  consequently, computing x^{k+1} reduces to the parallel subproblems

  x_i^{k+1} ← arg min_{x_i} f_i(x_i) + (β/2)‖A_i x_i + B_i y^k − b_i − z_i^k‖₂²,   i = 1, . . . , M

Issue: to obtain a block-diagonal A and a simple B, additional variables are often needed, which increases the problem size and the parallel overhead.

⁹See the recent survey Boyd, Parikh, Chu, Peleato, and Eckstein [2011]

58 / 74


Parallel Linearized Bregman vs ADMM on basis pursuit

Basis pursuit:

  min_x ‖x‖₁   s.t.  Ax = b.

• example with a dense A ∈ R^{1024×2048};

• distribute A to N computing nodes: A = [A_1 A_2 · · · A_N];

• compare

  1. parallel ADMM as in the survey paper Boyd, Parikh, Chu, Peleato, and Eckstein [2011];

  2. parallel linearized Bregman (dual gradient descent), un-accelerated;

• tested on N = 1, 2, 4, . . . , 32 computing nodes.

59 / 74


parallel ADMM vs. parallel linearized Bregman

parallel linearized Bregman scales much better

60 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

4. Parallel greedy coordinate descent

5. Numerical results with big data

61 / 74


(Block)-coordinate descent/update

Definition: update one block of variables at a time, keeping the others fixed.

Advantage: each update is simple.

Disadvantages:

• more iterations (with exceptions)

• may get stuck at non-stationary points if the problem is non-convex or non-smooth (with exceptions)

Block selection: cyclic (Gauss-Seidel), parallel (Jacobi), random, greedy

Specific to sparse optimization: the greedy approach¹⁰ can be exceptionally effective (because most of the time the correct variables are selected for update and most zero variables are never touched, "reducing" the problem to a much smaller one).

¹⁰Li and Osher [2009]

62 / 74


GRock: greedy coordinate-block descent

Example:

  min λ‖x‖₁ + f(Ax − b)

Decompose Ax = Σ_j A_j x_j; block A_j and x_j are kept on node j.

Parallel GRock:

1. (parallel) compute a candidate update for each coordinate i:
   d_i = arg min_d λ·r(x_i + d) + g_i d + (1/2)d²,   where g_i = a_(i)^T ∇f(Ax − b)
2. (parallel) compute a merit value for each block j:
   m_j = max{|d| : d is an element of d_j};
   let s_j be the index of the maximal coordinate within block j
3. (allreduce) P ← select the P blocks with the largest m_j, 2 ≤ P ≤ N
4. (parallel) update x_{s_j} ← x_{s_j} + d_{s_j} for all j ∈ P
5. (allreduce) update Ax
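A serial NumPy sketch of one GRock iteration for the LASSO (r = |·| and f = ½‖·‖₂², so the per-coordinate subproblem above has the soft-threshold solution d_i = shrink(x_i − g_i, λ) − x_i); the Python loop over chosen blocks stands in for the parallel nodes, and the columns of A are assumed to have unit norm.

```python
import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def grock_iteration(x, Ax, A, b, blocks, lam, P):
    """One GRock step for min lam*||x||_1 + 0.5*||Ax - b||_2^2 (unit-norm columns)."""
    g = A.T @ (Ax - b)                         # step 1: coordinate gradients
    d = shrink(x - g, lam) - x                 # step 1: candidate coordinate updates
    merits = np.array([np.max(np.abs(d[idx])) for idx in blocks])     # step 2: m_j
    best = [idx[np.argmax(np.abs(d[idx]))] for idx in blocks]         # step 2: s_j
    chosen = np.argsort(merits)[-P:]           # step 3: P blocks with largest merit
    for j in chosen:                           # step 4: update one coordinate per block
        i = best[j]
        x[i] += d[i]
        Ax += d[i] * A[:, i]                   # step 5: maintain Ax
    return x, Ax
```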

63 / 74


How large P can be?

P depends on the block spectral radius

  ρ_P = max_{M ∈ 𝓜} ρ(M),

where 𝓜 is the set of all P × P submatrices of A^T A obtained by selecting exactly one column from each of P blocks.

Theorem

Assume each column of A has unit 2-norm. If ρ_P < 2, GRock with P parallel updates per iteration gives

  F(x + d) − F(x) ≤ ((ρ_P − 2)β/2) ‖d‖₂²;

moreover,

  F(x^k) − F(x*) ≤ [2C²(2L + β√(NP))²] / [(2 − ρ_P)β] · (1/k).
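The exact maximum over all such submatrices is combinatorial; the sketch below (an illustration, not a method from the slides) estimates a lower bound on ρ_P by sampling P blocks, one column from each, and taking the largest eigenvalue of the resulting P × P principal submatrix of A^T A.

```python
import numpy as np

def estimate_rho_P(A, blocks, P, trials=2000, seed=0):
    """Sampled lower bound on the block spectral radius rho_P.

    blocks: list of arrays of column indices (one array per block);
    columns of A are assumed to have unit 2-norm.
    """
    rng = np.random.default_rng(seed)
    rho = 0.0
    for _ in range(trials):
        picked = rng.choice(len(blocks), size=P, replace=False)   # P distinct blocks
        cols = [rng.choice(blocks[j]) for j in picked]            # one column per block
        G = A[:, cols]
        rho = max(rho, np.linalg.eigvalsh(G.T @ G)[-1])           # largest eigenvalue
    return rho
```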

64 / 74


Compare different block selection rules

[Figures: objective error f − f_min (left) and relative error (right) vs. the normalized number of coordinates calculated, for cyclic CD, greedy CD, mixed CD, and GRock]

• LASSO test with A ∈ R^{512×1024}, N = 64 column blocks

• GRock uses P = 8 updates in each iteration

• greedy CD¹¹ uses P = 1

• mixed CD¹² selects P = 8 random blocks and the best coordinate in each block per iteration

¹¹Li and Osher [2009]   ¹²Scherrer, Tewari, Halappanavar, and Haglin [2012]

65 / 74


Outline

1. Background of parallel computing: benefits, speedup, and overhead

2. Parallelize existing algorithms

3. Primal decomposition / dual decomposition

• Parallel dual gradient ascent (linearized Bregman)

• Parallel ADMM

4. Parallel greedy coordinate descent

5. Numerical results with big data

66 / 74


Test on Rice cluster STIC

Specs:

• 170 Appro GreenBlade E5530 nodes, each with two quad-core 2.4 GHz Xeon (Nehalem) CPUs

• each node has 12 GB of memory shared by all 8 cores

• number of processes used on each node = 8

Test datasets:

  dataset      A type     A size        λ      sparsity of x*
  dataset I    Gaussian   1024 × 2048   0.1    100
  dataset II   Gaussian   2048 × 4096   0.01   200

67 / 74


iterations vs cores

[Figures: number of iterations vs. number of cores for p D-ADMM, p FISTA, and GRock, on dataset I (left) and dataset II (right)]

Note: we set P = number of cores

68 / 74


time vs cores

[Figures: wall-clock time (s) vs. number of cores for p D-ADMM, p FISTA, and GRock, on dataset I (left, with a "P = 16" annotation) and dataset II (right)]

Note: we set P = number of cores

69 / 74


(% communication time) vs cores

[Figures: communication-time percentage vs. number of cores for p D-ADMM, p FISTA, and GRock, on dataset I (left) and dataset II (right)]

Note: we set P = number of cores

70 / 74


big data LASSO on Amazon EC2

Amazon EC2 is an elastic, pay-as-you-use cluster; advantage: no hardware investment, and everyone can have an account.

Test dataset:

• A: dense matrix, 20 billion entries, 170 GB in size

• x: 200K entries, 4K nonzeros, Gaussian values

Requested system:

• 20 "high-memory quadruple extra-large instances"

• each instance has 8 cores and 60 GB of memory

Code:

• written in C using GSL (for matrix-vector multiplication) and MPI

71 / 74


parallel Dual-ADMM vs FISTA¹³ vs GRock

                               p D-ADMM   p FISTA   GRock
  estimate stepsize (min.)     n/a        1.6       n/a
  matrix factorization (min.)  51         n/a       n/a
  iteration time (min.)        105        40        1.7
  number of iterations         2500       2500      104
  communication time (min.)    30.7       9.5       0.5
  stopping relative error      1E-1       1E-3      1E-5
  total time (min.)            156        41.6      1.7
  Amazon charge                $85        $22.6     $0.93

• ADMM's performance depends on the penalty parameter β; we picked β as the best out of only a few trials (we could not afford more)

• parallel Dual-ADMM and FISTA were capped at 2500 iterations

• GRock used an adaptive P and stopped at relative error 1E-5

¹³Beck and Teboulle [2009]

72 / 74


Software codes can be found on the instructor's website.

Acknowledgements: Zhimin Peng (Rice), Ming Yan (UCLA). Also, Qing Ling (USTC) and Zaiwen Wen (SJTU).

References:

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.

D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100:2197–2202, 2003.

D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

E.J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

G.B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations Research, 8:101–111, 1960.

J.E. Spingarn. Applications of the method of partial inverses to convex programming: decomposition. Mathematical Programming, 32(2):199–223, 1985.

73 / 74


D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, Second Edition. Athena Scientific, 1997.

D.P. Palomar and M. Chiang. A tutorial on decomposition methods for network utility maximization. IEEE Journal on Selected Areas in Communications, 24(8):1439–1451, 2006.

W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143–168, 2008.

M.P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM Journal on Optimization, 18(4):1326–1350, 2007.

W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM Journal on Imaging Sciences, 3(4):856–877, 2010.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Y. Li and S. Osher. Coordinate descent optimization for ℓ1 minimization with application to compressed sensing: a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.

C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for accelerating parallel coordinate descent. In NIPS, pages 28–36, 2012.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

74 / 74