Asynchronous Parallel Computing in Signal Processing and Machine Learning
Wotao Yin (UCLA Math)
joint with Zhimin Peng (UCLA), Yangyang Xu (IMA), Ming Yan (MSU)
Optimization and Parsimonious Modeling – IMA, Jan 25, 2016
1 / 49
Do we need parallel computing?
Back in 1993
2006
2015
35 Years of CPU Trend
[Chart: number of CPUs, performance per core, and cores per CPU, 1995–2015]
D. Henty. Emerging Architectures and Programming Models for Parallel Computing, 2012.
• In May 2004, Intel cancelled its single-core Tejas project and announced a new multi-core project.
Today: 4x AMD 16-core 3.5GHz CPUs (64 cores total)
Today: Tesla K80 GPU (2496 cores)
Today: Octa-Core Handsets
The free lunch is over
• before 2005: a single-threaded algorithm automatically got faster with each new CPU
• now, new algorithms must be developed for faster speeds by
  • exploiting problem structures
  • taking advantage of dataset properties
  • using all the available cores
How to use all the cores available?
Parallel computing
[Diagram: multiple agents work in parallel on pieces t_1, t_2, . . . , t_N of one problem]
Parallel speedup
• definition: speedup = (serial time) / (parallel time); time is in the wall-clock sense
• Amdahl's Law: N agents, no overhead, ρ = fraction of the computation that is parallel

  ideal speedup = 1 / (ρ/N + (1 − ρ))
[Plot: ideal speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%]
Parallel speedup
• ε := parallel overhead (startup, synchronization, collection)
• in the real world:

  actual speedup = 1 / (ρ/N + (1 − ρ) + ε)
[Plots: actual speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%; left: ε = N, right: ε = log(N)]
Sync-parallel versus async-parallel
[Diagram: three agents. Synchronous (wait for the slowest): every iteration waits for the slowest agent, so the others sit idle. Asynchronous (non-stop, no wait): agents run continuously.]
Async-parallel coordinate updates
Fixed point iteration and its parallel version
• H = H_1 × · · · × H_m
• original iteration: x^{k+1} = Tx^k =: (I − ηS)x^k
• all agents do in parallel:
  agent 1: x_1^{k+1} ← T_1(x^k) = x_1^k − ηS_1(x^k)
  agent 2: x_2^{k+1} ← T_2(x^k) = x_2^k − ηS_2(x^k)
  ...
  agent m: x_m^{k+1} ← T_m(x^k) = x_m^k − ηS_m(x^k)
• assumptions:
  1. coordinate friendliness: cost of S_i x ∼ (1/m) × cost of Sx
  2. synchronization after each iteration
Comparison
Synchronous: a new iteration starts when all agents finish
[Timeline: agents 1–3 step through iterations t_0, t_1, . . . , t_10 in lockstep]
Asynchronous: a new iteration starts when any agent finishes
ARock1: Async-parallel coordinate update
• H = H_1 × · · · × H_m
• p agents, possibly p ≠ m
• each agent randomly picks i ∈ {1, . . . , m} and updates just x_i:

  x_1^{k+1} ← x_1^k
  ...
  x_i^{k+1} ← x_i^k − η_k S_i x^{k−d_k}
  ...
  x_m^{k+1} ← x_m^k

• 0 ≤ d_k ≤ τ, where τ is the maximum delay

¹ Peng-Xu-Yan-Yin '15
Two ways to model xk−dk
definitions: let x^0, . . . , x^k, . . . be the states of x in the memory
1. x^{k−d_k} is consistent if d_k is a scalar: all components are equally delayed
2. x^{k−d_k} is possibly inconsistent if d_k is a vector: different components are delayed by different amounts

ARock allows both consistent and inconsistent reads.
Memory lock illustration
Agent 1 reads [0, 0, 0, 0]^T = x^0 — a consistent read
Agent 1 reads [0, 0, 0, 2]^T ∉ {x^0, x^1, x^2} — an inconsistent read
History and recent literature
Brief history of async-parallel algorithms (mostly worst-case analysis)
• 1969 – a linear equation solver by Chazan and Miranker;
• 1978 – extended to the fixed-point problem by Baudet under an absolute-contraction² type of assumption.
• For 20–30 years, used by many people mainly to solve linear, nonlinear, and differential equations.
• 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis. 2000 – Review by Frommer and Szyld.
• 1991 – gradient-projection iteration assuming a local linear-error bound by Tseng
• 2001 – domain decomposition assuming strong convexity by Tai & Tseng

² An operator T : R^n → R^n is absolute-contractive if |T(x) − T(y)| ≤ P|x − y|, component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, P ∈ R_+^{n×n}, and ρ(P) < 1.
Absolute-contraction
• Absolute-contractive operator T : R^n → R^n: |T(x) − T(y)| ≤ P|x − y|, component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, P ∈ R_+^{n×n}, and ρ(P) < 1.
• Interpretation: a series of nested rectangular boxes for x^{k+1} = Tx^k
• Applications:
  • diagonally dominant A for Ax = b
  • diagonally dominant ∇²f for min_x f(x) (strong convexity alone is not enough)
  • some network flow problems
Recent work (stochastic analysis)
• AsySCD for convex smooth and composite minimization by Liu et al. '14 and Liu-Wright '14. Async dual CD (regression problems) by Hsieh et al. '15
• Async randomized (splitting/distributed/incremental) methods: Wei-Ozdaglar '13, Iutzeler et al. '13, Zhang-Kwok '14, Hong '14, Chang et al. '15
• Async SGD: Hogwild!, Lian '15, etc.
• Async operator sample and CD: SMART by Davis '15
Random coordinate selection
• select x_i to update with probability p_i, where min_i p_i > 0
• drawbacks:
  • agents cannot cache data
  • either global memory or communication is required
  • pseudo-random number generation takes time
• benefits:
  • often faster than a fixed cyclic order
  • automatic load balance
  • simplifies certain analysis
Convergence summary
Convergence guarantees
m is the number of coordinates, τ is the maximum delay, uniform selection p_i ≡ 1/m

Theorem (almost sure convergence). Assume that T is nonexpansive and has a fixed point. Use step sizes η_k ∈ [ε, 1/(2m^{−1/2}τ + 1)), ∀k. Then, with probability one, x^k ⇀ x* ∈ Fix T.

In addition, rates can be derived. Consequence: the step size is O(1) if τ ∼ √m. Under equal agents and updates, linear speedup is attained with p = O(√m) agents. p can be bigger if T is sparse.
Sketch of proof
• typical inequality:

  ‖x^{k+1} − x*‖² ≤ ‖x^k − x*‖² − c‖Tx^k − x^k‖² + harmful terms(x^{k−1}, . . . , x^{k−τ})

• descent inequality under a new metric:

  E(‖x^{k+1} − x*‖²_M | X^k) ≤ ‖x^k − x*‖²_M − c‖Tx^k − x^k‖²

  where
  • X^k is the history up to iteration k
  • x^k = (x^k, x^{k−1}, . . . , x^{k−τ}) ∈ H^{τ+1}, k ≥ 0
  • any x* = (x*, x*, . . . , x*) ∈ X* ⊆ H^{τ+1}
  • M is a positive definite matrix
  • c = c(η_k, m, τ)
• apply the Robbins-Siegmund theorem:
  E(α_{k+1} | F_k) + v_k ≤ (1 + ξ_k)α_k + η_k

  where all quantities are nonnegative, α_k is random, and ξ_k, η_k are summable. Then α_k converges a.s.

• prove that weak cluster points are fixed points;
• assume H is separable and apply results of [Combettes, Pesquet 2014].
Applications and numerical results
Linear equations (asynchronous Jacobi)
• require: invertible square matrix A with nonzero diagonal entries
• let D be the diagonal part of A; then

  Ax = b ⇐⇒ (I − D^{−1}A)x + D^{−1}b = x,  where Tx := (I − D^{−1}A)x + D^{−1}b

• T is nonexpansive if ‖I − D^{−1}A‖₂ ≤ 1, e.g., when A is diagonally dominant
• x^{k+1} = Tx^k recovers the Jacobi algorithm
Algorithm 1: ARock for linear equations
Input: shared variable x ∈ R^n, K > 0;
set the global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
    sample i ∈ {1, . . . , m} uniformly at random;
    add −(η_k/a_ii)(Σ_j a_ij x_j^k − b_i) to the shared variable x_i;
    update the global counter k ← k + 1;
Sample code
loadData(A, data_file_name);
loadData(b, label_file_name);
#pragma omp parallel num_threads(p) shared(A, b, x, para)
{ // A, b, x, and para are passed by reference
    call Jacobi(A, b, x, para) or ARock(A, b, x, para);
}

• p: the number of threads
• A, b, x: shared variables
• para: other parameters
Jacobi worker function
for (int itr = 0; itr < max_itr; itr++) {
    // compute the update for the assigned x[i]
    // ...
    #pragma omp barrier
    {
        // write x[i] in global memory
    }
    #pragma omp barrier
}

Jacobi needs the barrier directive for synchronization.
ARock worker function
for (int itr = 0; itr < max_itr; itr++) {
    // pick i at random
    // compute the update for x[i]
    // ...
    // write x[i] in global memory
}

ARock has no synchronization barrier directive.
Minimizing smooth functions
• require: convex and Lipschitz differentiable function f
• if ∇f is L-Lipschitz, then

  minimize_x f(x) ⇐⇒ x = (I − (2/L)∇f)(x) =: Tx,

  where T is nonexpansive
• ARock will be efficient when ∇_{x_i} f(x) is easy to compute
Minimizing composite functions
• require: convex smooth g(·) and convex (possibly nonsmooth) f(·)
• proximal map: prox_{γf}(y) = argmin_x f(x) + (1/(2γ))‖x − y‖²

  minimize_x f(x) + g(x) ⇐⇒ x = (prox_{γf} ∘ (I − γ∇g))(x) =: Tx

• ARock will be fast if
  • ∇_{x_i} g(x) is easy to compute
  • f(·) is separable (e.g., ℓ₁ and ℓ₁,₂)
Example: sparse logistic regression
• n features, N labeled samples
• each sample ai ∈ Rn has its label bi ∈ {1,−1}
• `1 regularized logistic regression:
minimizex∈Rn
λ‖x‖1 + 1N
N∑i=1
log(1 + exp(−bi · aTi x)
), (1)
• compare sync-parallel and ARock (async-parallel) on two datasets:
Name N (#samples) n (# features) # nonzeros in {a1, . . . , aN}rcv1 20,242 47,236 1,498,952
news20 19,996 1,355,191 9,097,916
Speedup tests
• implemented in C++ and OpenMP
• 32-core shared memory machine

          ---------- rcv1 -----------   ---------- news20 ----------
  #cores  Time (s)        Speedup       Time (s)        Speedup
          async    sync   async  sync   async    sync   async  sync
  1       122.0   122.0    1.0    1.0   591.1   591.3    1.0    1.0
  2        63.4   104.1    1.9    1.2   304.2   590.1    1.9    1.0
  4        32.7    83.7    3.7    1.5   150.4   557.0    3.9    1.1
  8        16.8    63.4    7.3    1.9    78.3   525.1    7.5    1.1
  16        9.1    45.4   13.5    2.7    41.6   493.2   14.2    1.2
  32        4.9    30.3   24.6    4.0    22.6   455.2   26.1    1.3
More applications
Minimizing composite functions
• require: both f and g are convex (possibly nonsmooth) functions

  minimize f(x) + g(x) ⇐⇒ z = (refl_{γf} ∘ refl_{γg})(z) =: T_{PRS}(z),  recover x = prox_{γg}(z)

• T_{PRS} is known as the Peaceman-Rachford splitting operator³
• also, the Douglas-Rachford splitting operator: (1/2)I + (1/2)T_{PRS}
• ARock runs fast when
  • refl_{γf} is separable
  • (refl_{γg})_i is easy to compute/maintain

³ reflective proximal map: refl_{γf} := 2 prox_{γf} − I. The maps refl_{γf}, refl_{γg}, and thus refl_{γf} ∘ refl_{γg} are nonexpansive.
Parallel/distributed ADMM
• require: m convex functions f_i
• consensus problem:

  minimize_x Σ_{i=1}^m f_i(x) + g(x)
  ⇐⇒ minimize_{x_i, y} Σ_{i=1}^m f_i(x_i) + g(y) subject to x_i − y = 0, i = 1, . . . , m
  (in matrix form: diag(I, . . . , I)(x_1; . . . ; x_m) − (I; . . . ; I)y = 0)

• Douglas-Rachford-ARock applied to the dual problem ⇒ async-parallel ADMM:
  • m subproblems are solved in the async-parallel fashion
  • y and z_i (dual variables) are updated in global memory (no lock)
Decentralized computing
• n agents in a connected network G = (V, E) with bi-directional links E
• each agent i has a private function f_i
• problem: find a consensus solution x* to

  minimize_{x∈R^p} f(x) := Σ_{i=1}^n f_i(A_i x_i) subject to x_i = x_j, ∀i, j

• challenges: no center, only between-neighbor communication
• benefits: fault tolerance, no long-distance communication, privacy
Async-parallel decentralized ADMM
• a graph of connected agents: G = (V, E)
• decentralized consensus optimization problem:

  minimize_{x_i∈R^d, i∈V} f(x) := Σ_{i∈V} f_i(x_i)
  subject to x_i = x_j, ∀(i, j) ∈ E

• ADMM reformulation: constraints x_i = y_{ij}, x_j = y_{ij}, ∀(i, j) ∈ E
• apply
  • ARock version 1: nodes activate asynchronously
  • ARock version 2: edges activate asynchronously
• no global clock, no central controller; each agent keeps f_i private and talks only to its neighbors
notation:
• E(i): all edges of agent i, E(i) = L(i) ∪ R(i)
• L(i): neighbors j of agent i with j < i
• R(i): neighbors j of agent i with j > i

Algorithm 2: ARock for the decentralized consensus problem
Input: each agent i sets x_i^0 ∈ R^d, dual variables z_{e,i}^0 for e ∈ E(i), K > 0.
while k < K, any activated agent i do
    receive z_{li,l}^k from neighbors l ∈ L(i) and z_{ir,r}^k from neighbors r ∈ R(i);
    update local x_i^k, z_{li,i}^{k+1} and z_{ir,i}^{k+1} according to (2a)–(2c), respectively;
    send z_{li,i}^{k+1} to neighbors l ∈ L(i) and z_{ir,i}^{k+1} to neighbors r ∈ R(i).

x_i^k ∈ argmin_{x_i} f_i(x_i) + (Σ_{l∈L(i)} z_{li,l}^k + Σ_{r∈R(i)} z_{ir,r}^k)^T x_i + (γ/2)|E(i)|‖x_i‖²,  (2a)
z_{ir,i}^{k+1} = z_{ir,i}^k − η_k((z_{ir,i}^k + z_{ir,r}^k)/2 + γx_i^k), ∀r ∈ R(i),  (2b)
z_{li,i}^{k+1} = z_{li,i}^k − η_k((z_{li,i}^k + z_{li,l}^k)/2 + γx_i^k), ∀l ∈ L(i).  (2c)
Summary
Summary of async-parallel coordinate descent
benefits:
• eliminate idle time
• reduce communication / memory-access congestion
• random job selection: load balance
mathematics: analysis of disorderly (partial) updates
• asynchronous delay
• inconsistent read and write
Thank you!
Acknowledgements: NSF DMS
Reference: Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin. UCLA CAM 15-37.
Website: http://www.math.ucla.edu/~wotaoyin/arock