
Asynchronous Parallel Computing in Signal Processing and Machine Learning

Wotao Yin (UCLA Math)

joint with Zhimin Peng (UCLA), Yangyang Xu (IMA), Ming Yan (MSU)

Optimization and Parsimonious Modeling – IMA, Jan 25, 2016

1 / 49

Do we need parallel computing?

2 / 49

Back in 1993

3 / 49

2006

4 / 49

2015

5 / 49

35 Years of CPU Trend

[Figure: trends from 1995 to 2015 in the number of CPUs, performance per core, and cores per CPU]

D. Henty. Emerging Architectures and Programming Models for Parallel Computing, 2012.

• In May 2004, Intel cancelled its single-core Tejas project and announced a new multi-core project.

6 / 49

Today: 4× AMD 16-core 3.5 GHz CPUs (64 cores total)

7 / 49

Today: Tesla K80 GPU (2496 cores)

8 / 49

Today: Octa-Core Handsets

9 / 49

The free lunch is over

• before 2005: a single-threaded algorithm automatically got faster with each new CPU

• now, new algorithms must be developed for faster speeds by

• exploiting problem structures

• taking advantage of dataset properties

• using all the cores available

10 / 49

How to use all the cores available?

11 / 49

Parallel computing

[Diagram: one problem split among agents working in parallel, with per-agent times t1, t2, . . . , tN]

12 / 49

Parallel speedup

• definition: speedup = (serial time) / (parallel time), where time is in the wall-clock sense

• Amdahl's Law: N agents, no overhead, ρ = fraction of the computation that runs in parallel

  ideal speedup = 1 / (ρ/N + (1 − ρ))

[Plot: ideal speedup versus the number of processors (10^0 to 10^8) for parallel fractions ρ = 25%, 50%, 90%, 95%]

13 / 49

Parallel speedup

• ε := parallel overhead (startup, synchronization, collection)

• in the real world:

  actual speedup = 1 / (ρ/N + (1 − ρ) + ε)

[Plots: actual speedup versus the number of processors (10^0 to 10^8) for ρ = 25%, 50%, 90%, 95%; left panel: when ε = N, right panel: when ε = log(N)]
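As a quick numerical companion to the two formulas above, here is a minimal C++ sketch (not from the talk) that tabulates ideal and actual speedup; the overhead model passed as overhead is an assumption, e.g. a scaled log(N).

#include <cmath>
#include <cstdio>
#include <functional>

// Ideal speedup from Amdahl's Law: 1 / (rho/N + (1 - rho)).
double ideal_speedup(double rho, double N) {
    return 1.0 / (rho / N + (1.0 - rho));
}

// Actual speedup with a parallel-overhead term epsilon = overhead(N).
double actual_speedup(double rho, double N,
                      const std::function<double(double)>& overhead) {
    return 1.0 / (rho / N + (1.0 - rho) + overhead(N));
}

int main() {
    auto eps = [](double N) { return 1e-6 * std::log(N); };  // assumed overhead scale
    for (double rho : {0.25, 0.50, 0.90, 0.95})
        for (double N : {1e0, 1e2, 1e4, 1e6, 1e8})
            std::printf("rho=%.2f  N=%.0e  ideal=%7.2f  actual=%7.2f\n",
                        rho, N, ideal_speedup(rho, N), actual_speedup(rho, N, eps));
    return 0;
}

Even with ρ = 95%, the ideal speedup saturates near 1/(1 − ρ) = 20, which matches the plots above.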

14 / 49

Sync-parallel versus async-parallel

Synchronous (wait for the slowest): Agents 1–3 sit idle until every agent finishes the current iteration.

Asynchronous (non-stop, no wait): Agents 1–3 keep computing without idling.

[Diagram: timelines of Agents 1–3 under the two schemes, with idle gaps only in the synchronous case]

15 / 49

Async-parallel coordinate updates

16 / 49

Fixed point iteration and its parallel version

• H = H1 × · · · × Hm

• original iteration: x^{k+1} = T x^k = (I − ηS) x^k

• all agents do in parallel:

  agent 1: x_1^{k+1} ← T_1(x^k) = x_1^k − η S_1(x^k)
  agent 2: x_2^{k+1} ← T_2(x^k) = x_2^k − η S_2(x^k)
  ...
  agent m: x_m^{k+1} ← T_m(x^k) = x_m^k − η S_m(x^k)

• assumptions:
  1. coordinate friendliness: cost of S_i x ≈ (1/m) × cost of S x
  2. synchronization after each iteration
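For concreteness, a minimal OpenMP sketch (not the authors' code) of one synchronous sweep of the iteration above; the callback Si is an assumed coordinate-friendly evaluation of the i-th component of S.

#include <omp.h>
#include <functional>
#include <vector>

// One synchronous parallel sweep of x^{k+1} = x^k - eta * S(x^k), with
// coordinate i handled by one thread. Si(x, i) is assumed to be a
// user-supplied, coordinate-friendly evaluation of the i-th component of S(x).
void sync_fixed_point_step(std::vector<double>& x, double eta,
                           const std::function<double(const std::vector<double>&, int)>& Si) {
    const int m = static_cast<int>(x.size());
    std::vector<double> x_new(m);
    #pragma omp parallel for
    for (int i = 0; i < m; ++i)
        x_new[i] = x[i] - eta * Si(x, i);   // every thread reads the same x^k
    x = x_new;                              // runs after the implicit barrier
}

The implicit barrier at the end of the parallel for is exactly the synchronization assumed in item 2 above.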

17 / 49

Comparison

Synchronous: a new iteration begins only when all agents finish.

Asynchronous: a new iteration begins whenever any agent finishes.

[Diagram: timelines of Agents 1–3 over t0, t1, . . . , t10 under the two schemes]

18 / 49

ARock¹: Async-parallel coordinate update

• H = H1 × · · · × Hm

• p agents, possibly p ≠ m

• each agent randomly picks i ∈ {1, . . . , m} and updates just x_i:

  x_1^{k+1} ← x_1^k
  ...
  x_i^{k+1} ← x_i^k − η_k S_i(x^{k−d_k})
  ...
  x_m^{k+1} ← x_m^k

• 0 ≤ d_k ≤ τ, where τ is the maximum delay

¹ Peng, Xu, Yan, Yin 2015

19 / 49

Two ways to model x^{k−d_k}

definitions: let x^0, . . . , x^k, . . . be the states of x in the memory

1. x^{k−d_k} is consistent if d_k is a scalar
2. x^{k−d_k} is possibly inconsistent if d_k is a vector, i.e., different components are delayed by different amounts

ARock allows both consistent and inconsistent reads.
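A small C++ illustration (an assumption, not ARock's implementation) of the two read modes: a global mutex yields a consistent snapshot, while a lock-free coordinate-by-coordinate copy may mix components from different iterates. A production lock-free reader would typically use atomic coordinates; the plain reads below are only meant to show where inconsistency comes from.

#include <cstddef>
#include <mutex>
#include <vector>

std::vector<double> x_shared;   // shared iterate, written by all agents
std::mutex x_lock;              // single global lock (illustration only)

// Consistent read: the copy equals some iterate x^{k-d_k} that actually
// existed in memory, because no writer can interleave with the copy.
std::vector<double> consistent_read() {
    std::lock_guard<std::mutex> guard(x_lock);
    return x_shared;
}

// Possibly inconsistent read: coordinates are copied one at a time while
// other agents keep writing, so different components may come from
// different iterates, i.e., the delay d_k is a vector.
std::vector<double> inconsistent_read() {
    std::vector<double> snapshot(x_shared.size());
    for (std::size_t i = 0; i < x_shared.size(); ++i)
        snapshot[i] = x_shared[i];   // no lock: writers may interleave here
    return snapshot;
}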

20 / 49

Memory lock illustration

Agent 1 reads [0, 0, 0, 0]^T = x^0: a consistent read.

Agent 1 reads [0, 0, 0, 2]^T ∉ {x^0, x^1, x^2}: an inconsistent read (the snapshot never existed in memory as a whole).

21 / 49

History and recent literature

22 / 49

Brief history of async-parallel algorithms (mostly worst-case analysis)

• 1969 – a linear equation solver by Chazan and Miranker;

• 1978 – extended to the fixed-point problem by Baudet under the absolute-contraction² type of assumption.

• For the following 20–30 years, applied mainly to linear, nonlinear, and differential equations, by many authors.

• 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis. 2000 – review by Frommer and Szyld.

• 1991 – gradient-projection iteration assuming a local linear error bound, by Tseng.

• 2001 – domain decomposition assuming strong convexity, by Tai and Tseng.

² An operator T : R^n → R^n is absolute-contractive if |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

23 / 49

Absolute-contraction

• Absolute-contractive operator T : R^n → R^n: |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

• Interpretation: a series of nested rectangular boxes for x^{k+1} = Tx^k

• Applications:
  • diagonally dominant A for Ax = b
  • diagonally dominant ∇²f for min_x f(x) (strong convexity alone is not enough)
  • some network flow problems

24 / 49

Recent work (stochastic analysis)

• AsySCD for convex smooth and composite minimization by Liu et al. '14 and Liu-Wright '14. Async dual CD (regression problems) by Hsieh et al. '15

• Async randomized (splitting/distributed/incremental) methods: Wei-Ozdaglar '13, Iutzeler et al. '13, Zhang-Kwok '14, Hong '14, Chang et al. '15

• Async SGD: Hogwild!, Lian '15, etc.

• Async operator sample and CD: SMART, Davis '15

25 / 49

Random coordinate selection

• select x_i to update with probability p_i, where min_i p_i > 0

• drawbacks:
  • agents cannot cache data
  • either global memory or communication is required
  • pseudo-random number generation takes time

• benefits:
  • often faster than a fixed cyclic order
  • automatic load balance
  • simplifies certain analysis
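Regarding the cost of random selection, one common workaround (an implementation note, not from the slides) is a per-thread generator, so that sampling i stays cheap and contention-free; the uniform case p_i ≡ 1/m looks like this in C++:

#include <random>

// Pick a coordinate index uniformly from {0, ..., m-1}.
// A thread_local engine avoids contention between agents and keeps the
// per-pick cost small.
int pick_coordinate(int m) {
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, m - 1);
    return dist(gen);
}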

26 / 49

Convergence summary

27 / 49

Convergence guarantees

m is the number of coordinates, τ is the maximum delay, uniform selection p_i ≡ 1/m.

Theorem (almost sure convergence). Assume that T is nonexpansive and has a fixed point. Use step sizes η_k ∈ [ε, 1/(2τ/√m + 1)), ∀k, for some ε > 0. Then, with probability one, x^k ⇀ x* ∈ Fix T.

In addition, convergence rates can be derived.

Consequence: the step size is O(1) if τ ∼ √m. Assuming equal agents and update costs, linear speedup is attained with p = O(√m) agents; p can be larger if T is sparse.

28 / 49

Sketch of proof

• typical inequality:

  ‖x^{k+1} − x*‖² ≤ ‖x^k − x*‖² − c ‖Tx^k − x^k‖² + harmful terms(x^{k−1}, . . . , x^{k−τ})

• descent inequality under a new metric:

  E( ‖x^{k+1} − x*‖²_M | X^k ) ≤ ‖x^k − x*‖²_M − c ‖Tx^k − x^k‖²

  where
  • X^k is the history up to iteration k
  • x^k = (x^k, x^{k−1}, . . . , x^{k−τ}) ∈ H^{τ+1}, k ≥ 0, is the augmented iterate
  • x* = (x*, x*, . . . , x*) ∈ X* ⊆ H^{τ+1} is any augmented fixed point
  • M is a positive definite matrix
  • c = c(η_k, m, τ)

29 / 49

• apply the Robbins-Siegmund theorem:

  E(α_{k+1} | F_k) + v_k ≤ (1 + ξ_k) α_k + η_k,

  where all quantities are nonnegative, α_k is random, and ξ_k, η_k are summable; then α_k converges a.s.

• prove that weak cluster points are fixed points;

• assume H is separable and apply results of [Combettes, Pesquet 2014].

30 / 49

Applications and numerical results

31 / 49

Linear equations (asynchronous Jacobi)

• require: invertible square matrix A with nonzero diagonal entries

• let D be the diagonal part of A; then

  Ax = b ⇐⇒ Tx = x, where Tx := (I − D^{−1}A)x + D^{−1}b

• T is nonexpansive if ‖I − D^{−1}A‖₂ ≤ 1, i.e., A is diagonally dominant

• x^{k+1} = Tx^k recovers the Jacobi algorithm

32 / 49

Algorithm 1: ARock for linear equations

Input: shared variable x ∈ R^n, K > 0; set the global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
    sample i ∈ {1, . . . , m} uniformly at random;
    add −(η_k / a_ii) (Σ_j a_ij x_j^k − b_i) to the shared variable x_i;
    update the global counter k ← k + 1;
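A possible C++/OpenMP realization of Algorithm 1 (a sketch under assumptions, not the authors' released code): dense row-major storage for A, a fixed step size eta, and an atomic counter standing in for the global iteration counter k.

#include <omp.h>
#include <random>
#include <vector>

// Asynchronous Jacobi in the spirit of Algorithm 1: every thread keeps
// sampling a row i and adds -eta/a_ii * (sum_j a_ij x_j - b_i) to the shared
// x_i, with no barrier anywhere.
void arock_jacobi(const std::vector<std::vector<double>>& A,
                  const std::vector<double>& b,
                  std::vector<double>& x,        // shared iterate
                  double eta, long max_updates) {
    const int n = static_cast<int>(b.size());
    long k = 0;                                  // global update counter
    #pragma omp parallel shared(A, b, x, k)
    {
        std::mt19937 gen(std::random_device{}());          // per-thread generator
        std::uniform_int_distribution<int> pick(0, n - 1);
        while (true) {
            long my_k;
            #pragma omp atomic capture
            my_k = k++;                          // claim one global iteration
            if (my_k >= max_updates) break;
            int i = pick(gen);                   // uniform coordinate choice
            double ri = -b[i];
            for (int j = 0; j < n; ++j)
                ri += A[i][j] * x[j];            // x may be read while others write
            #pragma omp atomic
            x[i] -= (eta / A[i][i]) * ri;        // lock-free coordinate update
        }
    }
}

Each thread reads x without any lock, so the sum over j may mix old and new coordinates; this is exactly the (possibly inconsistent) delayed read x^{k−d_k} that the convergence theory allows.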

33 / 49

Sample code

loadData(A, data_file_name);
loadData(b, label_file_name);
#pragma omp parallel num_threads(p) shared(A, b, x, para)
{   // A, b, x, and para are passed by reference
    // each thread calls either the synchronous or the asynchronous worker:
    Jacobi(A, b, x, para);   // or: ARock(A, b, x, para);
}

• p: the number of threads
• A, b, x: shared variables
• para: other parameters

34 / 49

Jacobi worker function

for (int itr = 0; itr < max_itr; itr++) {
    // compute the update for the assigned x[i]
    // ...
    #pragma omp barrier      // wait until every thread has finished computing
    {
        // write x[i] to global memory
    }
    #pragma omp barrier      // wait until every thread has finished writing
}

Jacobi needs the barrier directives for synchronization

35 / 49

ARock worker function

for (int itr = 0; itr < max_itr; itr++) {
    // pick i at random
    // compute the update for x[i]
    // ...
    // write x[i] to global memory (no barrier)
}

ARock has no synchronization barrier directive

36 / 49

Minimizing smooth functions

• require: convex and Lipschitz differentiable function f

• if ∇f is L-Lipschitz, then

  minimize_x f(x) ⇐⇒ x = Tx, where T := I − (2/L)∇f is nonexpansive

• ARock will be efficient when ∇_{x_i} f(x) is easy to compute
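A minimal sketch of the corresponding ARock coordinate update (illustrative only; grad_i is an assumed user-supplied, coordinate-friendly partial derivative of f):

#include <functional>
#include <vector>

// One ARock-style coordinate step for minimizing a smooth convex f via the
// nonexpansive map T = I - (2/L) grad f, i.e.
// x_i <- x_i - eta * (2/L) * grad_i f(x), evaluated at a possibly stale x.
void smooth_coordinate_step(std::vector<double>& x, int i, double L, double eta,
                            const std::function<double(const std::vector<double>&, int)>& grad_i) {
    x[i] -= eta * (2.0 / L) * grad_i(x, i);
}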

37 / 49

Minimizing composite functions

• require: convex smooth g(·) and convex (possibly nonsmooth) f(·)

• proximal map: prox_{γf}(y) = argmin_x f(x) + (1/(2γ)) ‖x − y‖²

• minimize_x f(x) + g(x) ⇐⇒ x = Tx, where T := prox_{γf} ∘ (I − γ∇g)

• ARock will be fast if
  • ∇_{x_i} g(x) is easy to compute
  • f(·) is separable (e.g., ℓ1 and ℓ1,2)
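When f = λ‖·‖₁, the prox is the componentwise soft-thresholding map, so one ARock coordinate step only needs the i-th coordinate of the forward step. A minimal sketch (illustrative names; grad_i is an assumed coordinate-friendly partial gradient of g):

#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Soft-thresholding: the proximal map of t*|.| applied to one coordinate.
double shrink(double y, double t) {
    return std::copysign(std::max(std::abs(y) - t, 0.0), y);
}

// One ARock-style coordinate step for minimizing lambda*||x||_1 + g(x) with
// the nonexpansive map T = prox_{gamma f} o (I - gamma grad g). Because the
// l1 term is separable, only the i-th coordinate of the prox is needed.
void l1_forward_backward_step(std::vector<double>& x, int i,
                              double gamma, double lambda, double eta,
                              const std::function<double(const std::vector<double>&, int)>& grad_i) {
    double forward = x[i] - gamma * grad_i(x, i);       // gradient (forward) step
    double Ti      = shrink(forward, gamma * lambda);   // prox (backward) step
    x[i] -= eta * (x[i] - Ti);                          // x_i <- x_i - eta * S_i(x)
}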

38 / 49

Example: sparse logistic regression

• n features, N labeled samples

• each sample a_i ∈ R^n has its label b_i ∈ {1, −1}

• ℓ1-regularized logistic regression:

  minimize_{x ∈ R^n}  λ‖x‖₁ + (1/N) Σ_{i=1}^N log(1 + exp(−b_i · a_i^T x)),    (1)

• compare sync-parallel and ARock (async-parallel) on two datasets:

  Name     N (# samples)   n (# features)   # nonzeros in {a_1, . . . , a_N}
  rcv1     20,242          47,236           1,498,952
  news20   19,996          1,355,191        9,097,916
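For problem (1), the per-coordinate gradient stays cheap if the inner products a_i^T x are maintained alongside x; the sketch below (an illustration, not necessarily the paper's data structures) stores each feature column sparsely and refreshes the maintained products after a coordinate change.

#include <cmath>
#include <cstddef>
#include <vector>

// Sparse column j of the data matrix: which samples have a nonzero entry for
// feature j, and the corresponding values (illustrative layout).
struct SparseColumn {
    std::vector<int>    sample;   // sample indices touching feature j
    std::vector<double> value;    // a_{i,j} for those samples
};

// Partial gradient of (1/N) sum_i log(1 + exp(-b_i a_i^T x)) with respect to
// x_j, using maintained inner products atx[i] = a_i^T x, so the cost is
// proportional to the number of nonzeros in column j.
double logistic_grad_j(const SparseColumn& col, const std::vector<double>& b,
                       const std::vector<double>& atx, int N) {
    double g = 0.0;
    for (std::size_t s = 0; s < col.sample.size(); ++s) {
        int i = col.sample[s];
        double sigma = 1.0 / (1.0 + std::exp(b[i] * atx[i]));  // sigma(-b_i a_i^T x)
        g += -b[i] * col.value[s] * sigma;
    }
    return g / N;
}

// After x_j changes by delta, refresh the maintained products a_i^T x.
void update_atx(const SparseColumn& col, std::vector<double>& atx, double delta) {
    for (std::size_t s = 0; s < col.sample.size(); ++s)
        atx[col.sample[s]] += delta * col.value[s];
}

The cost of one coordinate update is then proportional to the number of nonzeros in that feature column, which is the coordinate-friendliness assumed earlier.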

39 / 49

Speedup tests

• implemented in C++ and OpenMP
• 32-core shared-memory machine

             rcv1                               news20
  #cores     Time (s)         Speedup           Time (s)         Speedup
             async    sync    async    sync     async    sync    async    sync
  1          122.0    122.0   1.0      1.0      591.1    591.3   1.0      1.0
  2          63.4     104.1   1.9      1.2      304.2    590.1   1.9      1.0
  4          32.7     83.7    3.7      1.5      150.4    557.0   3.9      1.1
  8          16.8     63.4    7.3      1.9      78.3     525.1   7.5      1.1
  16         9.1      45.4    13.5     2.7      41.6     493.2   14.2     1.2
  32         4.9      30.3    24.6     4.0      22.6     455.2   26.1     1.3

40 / 49

More applications

41 / 49

Minimizing composite functions

• require: both f and g are convex (possibly nonsmooth) functions

  minimize f(x) + g(x) ⇐⇒ z = T_PRS(z), where T_PRS := refl_{γf} ∘ refl_{γg};
  recover x = prox_{γg}(z)

• T_PRS is known as the Peaceman-Rachford splitting operator³

• also, the Douglas-Rachford splitting operator: (1/2)I + (1/2)T_PRS

• ARock runs fast when
  • refl_{γf} is separable
  • (refl_{γg})_i is easy to compute/maintain

³ Reflective proximal map: refl_{γf} := 2 prox_{γf} − I. The maps refl_{γf}, refl_{γg}, and thus refl_{γf} ∘ refl_{γg}, are nonexpansive.

42 / 49

Parallel/distributed ADMM

• require: m convex functions f_i

• consensus problem:

  minimize_x  Σ_{i=1}^m f_i(x) + g(x)

  ⇐⇒  minimize_{x_1,...,x_m, y}  Σ_{i=1}^m f_i(x_i) + g(y)   subject to x_i − y = 0, i = 1, . . . , m

  (the constraints can be written with a block-diagonal matrix of identities acting on (x_1, . . . , x_m) and a stacked column of identities acting on y)

• applying Douglas-Rachford ARock to the dual problem ⇒ async-parallel ADMM:
  • the m subproblems are solved in the async-parallel fashion
  • y and z_i (dual variables) are updated in global memory (no lock)

43 / 49

Decentralized computing

• n agents in a connected network G = (V,E) with bi-directional links E

• each agent i has a private function fi

• problem: find a consensus solution x∗ to

  minimize_{x ∈ R^p}  f(x) := Σ_{i=1}^n f_i(A_i x_i)   subject to x_i = x_j, ∀i, j.

• challenges: no center, only between-neighbor communication

• benefits: fault tolerance, no long-distance communication, privacy

44 / 49

Async-parallel decentralized ADMM

• a graph of connected agents: G = (V,E).

• decentralized consensus optimization problem:

  minimize_{x_i ∈ R^d, i ∈ V}  f(x) := Σ_{i∈V} f_i(x_i)
  subject to x_i = x_j, ∀(i, j) ∈ E

• ADMM reformulation: constraints x_i = y_{ij}, x_j = y_{ij}, ∀(i, j) ∈ E

• apply
  • ARock version 1: nodes asynchronously activate
  • ARock version 2: edges asynchronously activate

no global clock, no central controller; each agent keeps f_i private and talks only to its neighbors

45 / 49

notation:

• E(i): all edges of agent i, E(i) = L(i) ∪ R(i)

• L(i): neighbors j of agent i with j < i

• R(i): neighbors j of agent i with j > i

Algorithm 2: ARock for the decentralized consensus problem

Input: each agent i sets x_i^0 ∈ R^d and dual variables z_{e,i}^0 for e ∈ E(i); K > 0.
while k < K, any activated agent i do
    receive z_{li,l}^k from neighbors l ∈ L(i) and z_{ir,r}^k from neighbors r ∈ R(i);
    update the local x_i^k, z_{li,i}^{k+1}, and z_{ir,i}^{k+1} according to (2a)–(2c), respectively;
    send z_{li,i}^{k+1} to neighbors l ∈ L(i) and z_{ir,i}^{k+1} to neighbors r ∈ R(i).

x_i^k ∈ argmin_{x_i}  f_i(x_i) + ⟨ Σ_{l∈L(i)} z_{li,l}^k + Σ_{r∈R(i)} z_{ir,r}^k , x_i ⟩ + (γ/2) |E(i)| ‖x_i‖²,    (2a)

z_{ir,i}^{k+1} = z_{ir,i}^k − η_k ( (z_{ir,i}^k + z_{ir,r}^k)/2 + γ x_i^k ),  ∀r ∈ R(i),    (2b)

z_{li,i}^{k+1} = z_{li,i}^k − η_k ( (z_{li,i}^k + z_{li,l}^k)/2 + γ x_i^k ),  ∀l ∈ L(i).    (2c)

46 / 49

Summary

47 / 49

Summary of async-parallel coordinate descent

benefits:

• eliminate idle time

• reduce communication / memory-access congestion

• random job selection: load balance

mathematics: analysis of disorderly (partial) updates

• asynchronous delay

• inconsistent read and write

48 / 49

Thank you!

Acknowledgements: NSF DMS
Reference: Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin. UCLA CAM Report 15-37.
Website: http://www.math.ucla.edu/~wotaoyin/arock

49 / 49
