lossy data compression reduces communication time in ... › weiser › papers ›...

Noname manuscript No.(will be inserted by the editor)

Lossy data compression reduces communication time inhybrid time-parallel integrators

Lisa Fischer · Sebastian Gotschel · Martin Weiser

To appear in Computing and Visualization in Science, DOI 10.1007/s00791-018-0293-2

Abstract Parallel-in-time methods for solving initial

value problems are a means to increase the parallelism

of numerical simulations. Hybrid parareal schemes in-

terleaving the parallel-in-time iteration with an itera-

tive solution of the individual time steps are among

the most efficient methods for general nonlinear prob-

lems. Despite the hiding of communication time be-

hind computation, communication has in certain sit-

uations a significant impact on the total runtime. Here

we present strict, yet not sharp, error bounds for hy-

brid parareal methods with inexact communication due

to lossy data compression, and derive theoretical esti-

mates of the impact of compression on parallel efficiency

of the algorithms. These and some computational ex-

periments suggest that compression is a viable method

to make hybrid parareal schemes robust with respect to

low bandwidth setups.

1 Introduction

Nowadays, the computing speed of single CPU cores

barely increases, such that performance gains are mostly

due to increasing parallelism: number of compute nodes,

number of CPU cores per node, graphics cards, and

vectorization. Correspondingly, parallelization of algo-

rithms such as PDE solvers for initial value problems

is constantly gaining importance. In addition to estab-

lished spatial domain decomposition methods [33], the

surprisingly old idea of parallelization in time has seen

growing interest in the last decade [12].

One of the prototypical parallel-in-time algorithms

for initial value problems is the parareal scheme [23],

L. Fischer · S. Gotschel · M. Weiser

Zuse Institute Berlin, Takustr. 7, 14195 Berlin, GermanyE-mail: lisa.fischer | goetschel | [email protected]

computing N subtrajectories in parallel by an exact or

fine propagator and transporting the resulting jumps

over the whole time horizon by means of a suitable, se-

quential fast or coarse propagator, the latter being in-

dispensable for efficiency. Explicit error estimates guar-

antee r-linear [4, 8, 13] or q-linear [15, 36] convergence

rates.

As after N steps the exact solution has been com-

puted, both in sequential and in parareal schemes, the

parallel efficiency is bounded by 1/Jh, the inverse of the

number of iterations needed. A fast convergence within

a small number of iterations independent of N usually

requires a rather accurate and hence expensive coarse

propagator, which in turn limits parallel efficiency.

A step towards higher parallel efficiency are hybrid

parareal methods computing the fine propagator in an

iterative way. This allows to interleave the parareal it-

eration with each fine propagator iteration and hence to

perform the global transport of corrections more often.

Examples for iterative fine propagator schemes used in

hybrid parareal methods are multigrid solvers for im-

plicit Runge-Kutta methods [25] or spectral deferred

correction methods [9,34] for solving higher order collo-

cation systems [7,10,26]. Convergence theory for hybrid

parareal schemes is less well covered, see, e.g., [6,24,25].

Apart from convergence speed, the repeated com-

munication necessary in parareal schemes for large scale

problems such as discretized partial differential equa-

tions affects wall clock time and therefore parallel effi-

ciency. Its impact has been reduced by adaptive work

scheduling [3,27] and pipelining computations, such that

communication occurs mostly parallel to computation

and both transmission time and latency are hidden to

some extent [7,10], even though this appears to be diffi-

cult to achieve in practice with current message passing

interface (MPI) implementations. Hiding communica-

2

tion behind computation is also difficult because what is

called computation might be a spatially parallelized do-

main decomposition solver, performing inter-node com-

munication and possibly saturating the communication

network on its own. The impact of communication is

moreover growing, not only because of increasing par-

allelism, but also because computing power tends to

grow faster than communication bandwidth [20].

Approaches for reducing communication include Nie-

vergelt type parallel-in-time integration [5] mainly for

linear equations, and the concept of guided simulations

in case several similar equations are to be solved [32],

e.g., for parameter studies. Especially for hyperbolic

equations, the so called swept rule [1, 2] can be used

to reduce not the amount of data transmitted, but the

number of required messages, which is of interest in

systems with high latency.

Besides methods for such specific problem settings,

compression of MPI messages has gained interest in the

last decade, aiming at methods which can be used re-

gardless of the actual application. Mostly, lossless com-

pression algorithms are considered in order not to spoil

the results of computations, such that only a small re-

duction in size can be achieved. Nevertheless, reduc-

tions of computation times are still achieved [11,19,28].

Let us also note that data compression may also re-

duce the energy spent on communication – an inter-

esting option for HPC systems approaching the power

wall [22, 29,31].

Here, we explore the reduction of communication in

hybrid parareal schemes by lossy data compression. On

the one hand, lossy compression, if applicable, achieves

higher compression factors than lossless compression,

e.g., for audio (MP3) and images (JPEG). On the other

hand, it affects the convergence of the parallel-in-time

algorithm. We estimate the influence of the quantiza-

tion error onto convergence, such that a suitable tol-

erance with only minor impact can be chosen. In the

course of estimating the effect of inexact communica-

tion, we also derive a rigorous, though not sharp, er-

ror bound for hybrid parareal methods. For estimating

compression efficiency and in the numerical tests we em-

ploy transform coding tailored towards finite element

coefficient vectors [16,35].

In Section 2 we specify the abstract problem set-

ting and state the general assumptions. For compari-

son purposes, we give a sequential iterative time step-

ping scheme as a reference in Section 3, using the same

stationary iteration as will be employed for the hybrid

parareal method in Section 4. Besides an a priori error

bound that takes inexact communication into account,

parallel efficiency is estimated. Section 5 briefly recalls

lossy compression before investigating its expected im-

pact on parallel efficiency in several situations. Finally,

actual computation results are presented in Section 6.

2 Problem setting

Let us begin with stating the abstract problem to be

solved and introducing the associated notation. We are

interested in numerically solving the initial value prob-

lem

u = f(u), u(0) = v0∗

for given starting value v0∗ on the global time inter-

val [0, T ], where we assume that the exact solution

u ∈ C([0, T ], V ) exists and assumes values in some

Hilbert space V . Without loss of generality, we restrict

the attention to autonomous problems.

For a time-parallel solution, the global interval is

subdivided into N equidistant local time intervals In =

[nh, (n+ 1)h], n = 0, . . . , N − 1, with length h = T/N .

On each local interval In, the solution subtrajectory

un = u|In is approximated by an element of a sub-

space Un ⊂ C(In, V ) equipped with the max-norm

‖un‖ := maxt∈In ‖un(t)‖. Furthermore, assume that at

least affine functions of time are contained in Un, i.e.

P1(In, V ) ⊂ Un, with Pk denoting the space of polyno-

mials of order up to k.

Abstracting from the details of time stepping, we

assume the availability of convergent fixed point iter-

ations in terms of Fn(u; v), Fn : Un × V → Un for

solving the initial value problems u = f(u), u(nh) = v

on In for initial value v ∈ V . We accept the fixed point

un∗ (v) := Fn(un∗ (v); v) as a function of the initial value v

as the “exact” solution on In, even if it is only a particu-lar approximation, e.g., a collocation solution. Concrete

examples of such iterative solvers are spectral deferred

correction methods converging towards the collocation

solution, where Un = Pk(In, V ), or multigrid V-cycles

converging towards an implicit Euler step, where V is

a finite element space and Un(In, V ) = P1(In, V ) is

affine in time. The global trajectory u∗ over the global

interval [0, T ] is then defined by chaining the local fixed

points, i.e. u∗|In = un∗ (vn∗ ) via the continuity condition

vn+1∗ = un∗ ((n+1)h), n = 0, . . . , N−1. We simply write

un∗ for un∗ (vn∗ ) if no ambiguity arises.

In other words, unj denotes the approximate subtra-

jectory on the n-th time interval at iteration j, defined

in terms of the initial value vnj and the previous iterate

unj−1. The subscript ∗ always indicates the limit point

for j →∞.

Assumption 1 Let the exact evolution have a Lipschitz

constant of L ≥ 1, i.e. ‖un∗ (v) − un∗ (v)‖ ≤ L‖v − v‖for arbitrary v, v ∈ V , the fixed point iterations Fn

3

have a uniform contraction factor 0 < ρ < 1, and

the right hand side be bounded such that ‖un∗ (v)((n+

1)h) − v‖ ≤ ch. Moreover, assume there is K < ∞such that ‖un∗ (v) − un∗ (v) − (v − v)‖ ≤ η‖v − v‖ with

η := hK exp(hK) holds for any v and v.

Remark 1 Already from the choice of the max-norm on

Un, the bound L ≥ 1 is implied. This excludes the

proper treatment of purely dissipative problems, which,

however, are anyways not particularly interesting in the

autonomous setting.

Note that usually both L and ρ depend on h, which

is important to keep in mind when looking at strong

scalability with T = const and growing N . The last as-

sumption is natural as it is a direct result for any rea-

sonable discretization defining un∗ if the right hand side

f is Lipschitz continuous due to the following lemma.

Lemma 1 Assume that u(v) ∈ C([0, T ], V ) satisfies

u(v)(t) = f(u(v)(t)) on [0, h]

for any initial value v and that the right hand side f is

Lipschitz continuous with constant K. Then,

‖u(v)−u(v)− (v− v)‖ ≤ η‖v− v‖, η := hK exp(hK)

Proof Let ξ(t) := (u(v) − u(v))(t) − (v − v). Then, ξ

satisfies

‖ξ(t)‖ =

∥∥∥∥∫ t

τ=0

(f(u(v)(τ))− f(u(v)(τ))) dτ

∥∥∥∥≤∫ t

τ=0

K‖u(v)(τ)− u(v)(τ)‖ dτ

≤∫ t

τ=0

K(‖ξ(τ)‖+ ‖v − v‖) dτ

= tK‖v − v‖+

∫ t

τ=0

K‖ξ‖ dτ.

Gronwall’s inequality now yields

‖ξ(t)‖ ≤ tK‖v − v‖ exp(tK) ≤ hK exp(hK)‖v − v‖

and thus the claim. ut

3 Sequential reference

Judging the performance and efficiency bounds of hy-

brid parareal schemes requires the comparison with a

suitable reference. This will be provided by the obvi-

ous method to compute a complete trajectory within

the same problem setting. That is the sequential ap-

proach of stepping through the intervals one after the

other, performing a constant number J of fixed point

iterations on each interval:

Algorithm 1 Sequential iteration

(i) given starting value

v0j = v0∗ for j = 0, . . . , J

(ii) constant initialization

un0 (t) = vn0 for n = 0, . . . , N − 1, t ∈ In

(iii) approximate integration

unj = Fn(unj−1; vnj )

for n = 0, . . . , N − 1, j = 1, . . . , J

(iv) continuity

vnj = un−1J (nh)

for n = 1, . . . , N − 1, j = 0, . . . , J

Let us introduce the function ν : [1,∞[→ R by

νn(x) :=

n−1∑i=0

xi (1)

and gather some of its elemental properties for later

use.

Lemma 2 νn(x) is monotonically increasing in x and

bounded by

xn ≤ νn+1(x) ≤ (n+ 1)xn.

Moreover, 1 + xνn(x) = νn+1(x) holds.

With that, we derive an error estimate for the se-

quential Algorithm 1 defined above.

Theorem 1 Let Assumption 1 hold. Then the total er-

ror enj of the sequential iterative Algorithm 1 with Jiterations in each interval In is bounded by

‖unJ − un∗‖ ≤ enJ := chρjνi+1(L). (2)

Proof By the triangle inequality and Assumption 1 we

obtain Lady Windermere’s fan as

‖unJ − un∗‖ ≤ ‖unJ − un∗ (vnJ )‖+ ‖un∗ (vnJ )− un∗‖≤ ρJ‖un0 − un∗ (vnj )‖+ L‖vnJ − vn∗ ‖≤ ρJch+ L‖un−1j − un−1∗ ‖≤ chρJ + Len−1J

= chρJ(1 + Lνn(L)).

Applying Lemma 2 and induction over n yields the

claim. ut

Assume we ask for an accuracy of TOL relative to

the condition of the problem, i.e.

‖unJ−un∗‖ ≤ TOL chνi+1(L) for n = 0, . . . , N − 1. (3)

4

simulated time

wall

cloc

k tim

e

simulated time

wall

cloc

k tim

e

Fig. 1: Schematic representation of time stepping with iterative schemes. The number Js = 2 of iterations is deliberately chosen for

illustration purposes only. Left: Sequential method. Right: Hybrid parareal scheme.

Then the number of iterations required by Algorithm 1

to reach this accuracy is at most

Js ≤⌈

log TOL

log ρ

⌉. (4)

Let tF denote the computing time for a single evalu-

ation of Fn, e.g., for one SDC sweep or a single V-

cycle, assumed to be the same for all n. The total

time for computing the final value uN−1Jsis bounded

by ts ≤ NJstF , see Fig. 1, left.

4 An inexact hybrid parareal algorithm

The hybrid parareal scheme interleaves the time step-

ping with the stationary iteration on each local inter-

val In, propagating the result of each iteration on to

the next interval. As passing the information on by

only one interval per iteration leads to slow conver-

gence if N is large, sequential but fast coarse propaga-

tors Gn : V → V are used to distribute the correction

quickly through the remaining time intervals, see Fig. 1,

right.

The analysis follows the line of [13], but includes

the hybrid parareal fixed point iteration and considers

the communication of corrections of the initial value

from one time interval to the next to be subject to

errors due to lossy compression. The modification of the

initial value is represented by communication operators

Cnj : V → V .

Algorithm 2 Hybrid parallel iteration

(i) given starting value

v0j = v0∗ for j = 0, . . . , J

(ii) coarse propagation

vn0 = Gn−1(vn−10 ) for n = 1, . . . , N

(iii) piecewise linear initialization

un0 (t) = vn0 + (t/h− n)(vn+10 − vn0 )

for n = 0, . . . , N − 1

(iv) approximate integration

unj = Fn(unj−1 + vnj − vnj−1; vnj )

for n = 0, . . . , N − 1, j = 1, . . . , J

(v) continuity

vnj = vnj−1 + Cnj

(un−1j−1 (nh)− vnj−1

+Gn−1(vn−1j )−Gn−1(vn−1j−1 ))

for n = 1, . . . , N − 1, j = 1, . . . , J

In contrast to [24] we do not assume the fixed point

iterations on each interval to converge unaffected by

changing initial values, but treat this perturbation ex-

plicitly and consider inexact communication as well.

Theorem 2 In addition to Assumption 1, let the coarse

propagators Gn satisfy

‖un0 − un∗ (vn0 )‖ ≤ chγ0, (5)

‖(un∗ (v)− un∗ (v))((n+ 1)h)− (Gn(v)−Gn(v))‖≤ γ‖v − v‖, (6)

and

‖Gn(v)−Gn(v)‖ ≤ L‖v − v‖ (7)

5

for arbitrary v, v ∈ V and some γ0, γ < ∞. Let the

communication error be bounded by

‖v − Cnj (v)‖ ≤ ∆C‖v‖

for some ∆C <∞. Then the error estimate

‖vnj − vn∗ ‖ ≤ δnj := chγ0αρj−1νn+1(βj) (8)

holds with α = 1+∆C

1−∆C/ρand

βj = α

(γ

ρ+ L+ η(1 + ρ−1)j

).

Before proving the theorem, let us stress that in gen-

eral γ in (6) is of order one unless the coarse propagator

has a consistency order of at least 0, which is sufficient

for γ = O(h). This is, however, not satisfied by the com-

mon realization of coarse propagators acting on coarser

grids in PDE problems. But it is necessary to guarantee

fast convergence for large N , since otherwise correction

components that cannot be represented on the coarse

grid and hence not be propagated by G are handed on

by just one interval per iteration. The related impact

of insufficient coarse propagator accuracy in time has

been investigated in [30] for advection-diffusion equa-

tions. The practical success of coarse grid propagators

is due to the usually small amplitude of such error com-

ponents and a quick reduction of high frequent error

components in dissipative systems.

Proof In addition to (8) we will prove

‖unj−un∗ (vnj )‖ ≤ εnj := chγ0ρj(1+αη(1+ρ−1)jνn+1(βj)).

Starting the induction at n = 0 yields

‖v0j − v0∗‖ = 0 ≤ δ0j and ‖u0j − u0∗‖ ≤ chγ0 ρj ≤ ε0jdue to linear fixed point contraction with contraction

factor 0 < ρ < 1. On the other hand, j = 0 provides

‖un0 − un∗ (vn0 )‖ ≤ chγ0 ≤ εn0 by (5) and

‖vn+10 − vn+1

∗ ‖ ≤ ‖un0 − un∗‖≤ ‖un0 − un∗ (vn0 )‖+ ‖un∗ (vn0 )− un∗‖

≤ chγ0 + L‖vn0 − vn∗ ‖ ≤ chγ0νn+1(L) ≤ δn+10 .

For general n, j > 0 we let vn+1j+1 := unj ((n + 1)h) +

Gn(vnj+1)−Gn(vnj ) and obtain

‖vn+1j+1 − v

n+1∗ ‖ ≤ ‖vn+1

j + Cn+1j+1 (vn+1

j+1 − vn+1j )− vn+1

j+1 ‖+ ‖vn+1

j+1 − vn+1∗ ‖

≤ ∆C‖vn+1j+1 − v

n+1j ‖+ ‖vn+1

j+1 − vn+1∗ ‖

≤ ∆C

(‖vn+1j+1 − v

n+1∗ ‖+ ‖vn+1

∗ − vn+1j ‖

)+ ‖vn+1

j+1 − vn+1∗ ‖

≤ ∆C‖vn+1∗ − vn+1

j ‖+ (1 +∆C)‖vn+1

j+1 − vn+1∗ ‖. (9)

With the definitions of vn+1j+1 and εnj , the last term in (9)

can be bounded by

‖vn+1j+1 − v

n+1∗ ‖

= ‖unj ((n+ 1)h) +Gn(vnj+1)

−Gn(vnj )− un∗ (vn∗ )((n+ 1)h)‖≤ ‖unj − un∗ (vnj )‖

+ ‖(un∗ (vnj )− un∗ (vn∗ ))((n+ 1)h)

− (Gn(vnj )−Gn(vn∗ ))‖+ ‖Gn(vnj+1)−Gn(vn∗ )‖≤ εnj + γ‖vnj − vn∗ ‖+ L‖vnj+1 − vn∗ ‖.

Using the monotonicity of νn+1 to bound

δnj =νn+1(βj)

ρνn+1(βj+1)δnj+1 ≤

1

ρδnj+1, (10)

which is only asymptotically sharp, we can further es-

timate

εnj + γ‖vnj − vn∗ ‖+ L‖vnj+1 − vn∗ ‖≤ εnj + γδnj + Lδnj+1

≤ εnj +

(γ

ρ+ L

)δnj+1

≤ chγ0ρj(

1 + αη(1 + ρ−1)jνn+1(βj)

+

(γ

ρ+ L

)ανn+1(βj+1)

)≤ chγ0ρj

(1 + α

(η(1 + ρ−1)j +

γ

ρ+ L

)νn+1(βj+1)

)≤ chγ0ρj(1 + βj+1νn+1(βj+1))

= chγ0ρjνn+2(βj+1)

≤ 1

αδn+1j+1 .

Inserting this into (9) yields

‖vn+1j+1 − v

n+1∗ ‖ ≤ ∆Cδ

n+1j + (1 +∆C)‖vn+1

j+1 − vn+1∗ ‖

≤ ∆C

ρδn+1j+1 + (1 +∆C)

1−∆C/ρ

1 +∆Cδn+1j+1

≤ δn+1j+1 .

Moreover, due to the approximate integration by Fn in

step (iv) of Algorithm 2, we have

‖unj − u(vnj )‖≤ ρ‖unj−1 + vnj − vnj−1 − un∗ (vnj )‖≤ ρ(‖unj−1 − un∗ (vnj−1)‖

+ ‖un∗ (vnj−1)− un∗ (vnj )− (vnj−1 − vnj )‖).

6

Using η as given in Assumption 1 we get

ρ(‖unj−1 − un∗ (vnj−1)‖+ ‖un∗ (vnj−1)− un∗ (vnj )− (vnj−1 − vnj )‖

)≤ ρ(εnj−1 + η‖vnj − vnj−1‖)≤ ρ

(εnj−1 + η(‖vnj − vn∗ ‖+ ‖vnj−1 − vn∗ ‖)

)≤ ρ(εnj−1 + η(δnj + δnj−1))

≤ ρ(εnj−1 + η(1 + ρ−1)δnj )

≤ chγ0ρj(1 + αη(1 + ρ−1)(j − 1)νn+1(βj−1)

+ αη(1 + ρ−1)νn+1(βj))

≤ chγ0ρj(1 + αη(1 + ρ−1)jνn+1(βj)) = εnj ,

which completes the induction. ut

The asymptotic convergence rate is ρ, as in the se-

quential SDC algorithm, independent of communica-

tion accuracy and fast propagator.

Let us investigate the error bound in a straightfor-

ward but nontrivial example situation.

Example 1 We consider the harmonic oscillator

x = y, y = −x

with initial value x = 0, y = 1 and time horizon T = 3π,

subdivided into N = 16 intervals.

As a stationary fine propagator iteration we use

SDC on a two-point Gauss collocation grid with a ex-

plicit Euler base method, converging towards a fourth

order energy preserving collocation scheme. Consequent-

ly, L = 1 and η = 0.58 hold. As fast propagator we con-

sider (i) the zero propagator Gn(v) = 0 yielding just a

parallel SDC method [18] and (ii) the explicit Euler

propagator Gn(v) = v + hf(v).

In that setting, the iterates and hence right hand

sides are bounded in Assumption 1 with (i) c = 1186

and (ii) c = 18, respectively, and the fast propagation

errors can be bounded by (i) γ0 = γ = 0.58 and (ii)

γ0 = γ = 0.17. The observed SDC contraction rate is

ρ = 0.20.

Resulting iteration errors and error bounds are de-

picted in Fig. 2. While the global error behavior is re-

produced rather well by the bound (8), the bound is

far from being sharp, in particular for larger n. On one

hand this is due to the rather general assumptions on

the stationary iteration, with the observed average case

being much better than the bounded worst case. On the

other hand, the estimates in the proof are not sharp,

in particular (10). For that reason we provide a fitted

error estimate as well, where the constants in (8) are

obtained by a least squares fit to the actual errors. Of

course, this reproduces the actual errors much better.

The impact of communication errors is simulated by

adding normally distributed noise of relative standard

deviation ∆C = 0.1, which amounts to less than four

bits per coefficient. Comparing the results in the last

row of Fig. 2 with the second row corresponding to

∆C = 0 suggests that the impact of even such a large

relative communication error on convergence is minor.

Parallel efficiency. Next we estimate the parallel effi-

ciency of the hybrid parareal method based on the error

bound (8). As this is overestimating the error, the effi-

ciency estimates will be rather pessimistic.

For a relative accuracy of TOL as specified in (3),

the number Jh of required iterations is bounded by

Jh ≤ 1+log(

TOL νN+1(L)αγ0νN+1(βJh

)

)log ρ

≤ 1+

log

(TOLLN

αγ0(N+1)βNJh

)log ρ

.

Assuming L > 1 and N rather large, this can be ap-

proximated by

Jh ≈ 1 +log TOL

log ρ− N log(βJh/L) + log(αγ0(N + 1))

log ρ

≈ 1 + Js

−N log

(α(γLρ + 1 + η

L (1 + ρ−1)Jh

))log ρ

− log(αγ0(N + 1))

log ρ.

With α = (1+∆C)/(1−∆C/ρ) and ∆C 1, we obtain

logα ≈ (1 + ρ−1)∆C and thus

Jh ≈ 1 + Js −N log

(α(γLρ + 1 + η

L (1 + ρ−1)Jh

))log ρ

− (1 + ρ−1)∆C + log(γ0(N + 1))

log ρ

≤ 1 + Js −N(γLρ + η

L (1 + ρ−1)Jh

)log ρ

− (N + 1)(1 + ρ−1)∆C + log(γ0(N + 1))

log ρ.

Thus, Jh grows only logarithmically with N , if both

∆C = O(N−1) and

γ

Lρ+η

L(1 + ρ−1)Jh = O(N−1)

hold. The former is rather easy to satisfy with a small,

i.e. logarithmic, increase in communication time. The

latter condition, however, requires in particular γ and

η to be of order N−1. As γ denotes the consistency error

of the coarse propagator, e.g., an Euler step with O(h2),

and η is of orderO(h), this meansNh = O(1). Actually,

7

log10 ‖vnj − vn∗ ‖

jn


jn


jn

log10 δnj

jn

log10 δnj

jn

log10 δnj

jn

log10 δnj

jn

log10 δnj

jn

log10 δnj

jn

Fig. 2: Errors and bounds for Example 1 in logarithmic scale. Left: errors ‖vnj − vn∗ ‖. Mid: error bound (8). Right: error bound withconstants estimated from actual errors. Top: zero fast propagator. Mid: Euler fast propagator. Bottom: Euler fast propagator with

communication error ∆C = 0.1.

this restriction is to be expected, since the theory covers

ill-conditioned and chaotic systems, where an extension

of the integration interval cannot be expected to keep

the iteration number constant.

As is apparent from Fig. 1, the total computation

time is th = Jh(tF +tG)+(N−1)(tG+tC), which yields

a parallel efficiency of

Eh =tsNth

=JstF

Jh(tF + tG) + (N − 1)(tG + tC). (11)

Obviously, a large efficiency close to one requires Jh ≈Js, which in turn requires TOL very small, or both

∆C and h small. Additionally, efficiency requires tG tF , and N(tG + tC) JhtF . The former condition is

rather easy to satisfy, in fact, using the same fixed point

iteration for both fine and coarse propagators would

push the efficiency only down to 0.5 due to the Jh(tF +

tG) term in the denominator. The second condition is

much harder to satisfy, in particular for highly parallel

computations with large N .

The relation of ∆c and tC and their impact on par-

allel efficiency Eh is discussed in the next section.

5 Lossy data compression

First we recall lossy compression methods for coefficient

vectors y ∈ Rm of spatial discretizations, in particular

of finite element approaches. The results are applicable

as well to similar techniques like finite difference and

finite volume schemes.

The simplest compression (rounding of coefficients

to the required accuracy) provides a compression ratio

S = − log2∆C

64(12)

8

relative to uncompressed double precision binary repre-

sentation. Note that transform coding with hierarchical

basis or wavelet transform and entropy coding yields

even better results [35]. For the sake of simplicity, how-

ever, we will stick with (12).

For very small ∆C , the message size is usually boun-

ded from above by the size of the raw binary repre-

sentation of double accuracy floating point values, i.e.

S ≤ 1. Given a time tT required to transmit the un-

compressed data set and a latency tL independent of

∆C that includes the time for encoding and decoding,

if performed, as well as usual communication latencies,

we arrive at a communication time tC = tL + StT . In-

serting this into (11) yields the parallel efficiency model

Eh =JstF

Jh(tF + tG) + (N − 1)(tG + tL − tT log2∆C/64),

(13)

which allows to investigate the impact of data compres-

sion. This impact and the optimal value of ∆C depends

on the values of N, Js, tF , Jh, tG, tL, and tT characteriz-

ing the problem to be solved, the coarse and fine prop-

agators, and the computing system used.

In the following we will explore several scenarios in

order to get an idea in which settings data compres-

sion may have a significant impact on parallel efficiency.

Starting from a nominal setting with SDC of order six

as stationary iteration, and first order Euler as fast

propagator, other scenarios modify the parameters in

a particular direction as outlined below.

Nominal scenario. This is a basic setting characterized

by moderate parallelism, iteration counts that are

typically observed in SDC methods for engineer-

ing tolerances, a coarse propagator time that could

come from a single Euler step compared to a full

SDC sweep, and communication times that are plau-

sible for current standard hardware. Times are given

relative to the fine propagator time tF . Values for

the nominal scenario are indicated by ·. N = 16,

L = 1.5, TOL = 10−6, ρ = 0.2, γ = 0.05, γ0 = 0.1,

η = 0.05, tF = 1, tG = 0.1, tT = 0.1. For the la-

tency tL = tL,e/d+tL,c, we assume a contribution of

tL,e/d = 0.005 from encoding/decoding if performed

and the same amount tL,c = 0.005 from the com-

munication system. With these parameters, (4), (8)

and (11) yield estimates of Js = 9, Jh = 28, and

Eh = 0.28, respectively.

Different tolerances. Aiming at a different tolerance

TOL = s ˆTOL for s ∈ [10−4, 102] usually goes along

with choosing a different order of collocation dis-

cretization and, at the same time, a different time

step size. Smaller time steps tend to decrease ρ,

whereas higher order SDC tends to increase ρ. For

simplicity, we assume the contraction ρ = ρ to be

unaffected by s. In contrast, for a nominal order six

we assume the time step h to scale like s1/6 and

hence obtain L = 1 + (L − 1)s1/6 and η = s1/6η.

Moreover, the Euler consistency error depends on

the time step as well, such that we assume γ0 =

s1/3γ0 and γ = s1/3γ.

Scalability. The natural mode for parallel-in-time inte-

gration is to have only time interval per processor,

and hence scaling means in general weak scaling

with growing time horizon T = sT and N = sN .

Thousands of intervals have been used for time-

parallel computations [14], such that we consider

s ∈ [0.25, 512]. Except for N , no parameter entering

the parallel efficiency estimate is assumed to change.

Commodity network hardware. Here we assume slower

communication links, due to cheaper network in-

frastructure, which might be encountered in work-

station clusters, cloud computing, or even DSL or

WiFi connections. Slower communication affects in

particular the bandwidth, i.e. tC = stC , but to some

extent also latency tL =√stL,c + tL,e/d, with s ∈

[10−2, 10].

In each scenario, and for each value of the scaling

parameter s, an optimal compression accuracy ∆C has

been computed numerically, maximizing the parallel ef-

ficiency based on (8). Figs. 3 and 4 show the paral-

lel efficiency of both the uncompressed (∆C = 0 and

S = 1) method and the lossy compression (∆C = opt,

S = − log2∆C/64).

In every setup, the optimal quantization error is

mostly in the range of 10−3 to 10−2, and improves par-

allel efficiency compared to the uncompressed version.

The effect, however, is rather small, predicted by theory

to be less than 5%. For decreasing tolerance, the parallel

efficiency grows as expected, see Fig. 3 left. However,

the benefit of compression, that could be assumed to

grow with larger tolerances, decreases. This is due to an

actually increasing number of iterations in the parallel

scheme due to larger step sizes affecting the effectivity

of the fast propagator, see Fig. 3 right. The number of

iterations in the sequential scheme, on the other hand,

decreases as expected, and is virtually unaffected by the

compression.

The theoretically predicted scalability is rather un-

satisfactory, see Fig. 4 left, due to the exponential in-

crease of ν in N and the corresponding growth of Jhfor growing number of time intervals. Correspondingly,

the benefit of compression that is assumed to improve

only the single sequential startup-phase becomes rela-

tively negligible, even though in absolute numbers it

increases linearly with N .

9

1e-11 1e-10 1e-9 1e-8 1e-7 1e-6 1e-5 1e-4 1e-30.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

1e-11 1e-10 1e-9 1e-8 1e-7 1e-6 1e-5 1e-4 1e-35

10

15

20

25

30

35

TOL TOL

Eh J

compressed

uncompressed

parallel

sequential

Fig. 3: Impact of lossy compression for varying tolerance as described in the Different tolerances scenario. The nominal setting ismarked in both plots. Left: Parallel efficiency Eh as estimated in (11) and (13) is improved slightly by compression. Right: Iteration

numbers Jh and Js grow with the tolerance for the parallel scheme due to larger step sizes decreasing the accuracy of the fast

propagator. In contrast, faster SDC contraction leads to fewer iterations in the sequential scheme.

1e+0 1e+1 1e+2 1e+30

0.1

0.2

0.3

0.4

0.5

0.6

0.7

1e-2 1e-1 1e+0 1e+10

0.05

0.1

0.15

0.2

0.25

0.3

0.35

N tT

Eh Eh

compressed

uncompressed

compressed

uncompressed

Fig. 4: Impact of lossy compression on parallel efficiency Eh. Left: Varying number N of time intervals. Right: Varying communicationtime tT . The nominal setting is marked in both plots.

As expected, a varying bandwidth has a significantly

larger effect on the compression benefit, see Fig. 4 right.

For transmission times tT that are at least three times

as high as assumed in the nominal scenario, the parallel

efficiency of the uncompressed scheme decreases rather

quickly, while compression allows to maintain the effi-

ciency for much longer communication times.

6 Computational Examples

Here we consider the inhomogeneous heat equation

ut −∆u = f in Ω × (0, T )

∂νu = 0 on ∂Ω × (0, T )

u(·, 0) = 0 in Ω,

(14)

with Ω = (0, 1)2 and source term

f(x, t) = (8π2 + exp(−t)− 8π2 exp(−t)) cos(2πx1) cos(2πx2).

The analytical solution to this equation is given by

u(x, t) = (1− exp(−t)) cos(2πx1) cos(2πx2).

For the numerical solution we discretize in space first,

using linear finite elements on a uniformly refined trian-

gular mesh. The arising system of ODEs is solved using

the Parallel Full Approximation Scheme in Space and

Time (PFASST) [10] as an instance of hybrid parareal

methods. For implementation we use PFASST++ [21]

in combination with the finite element toolbox Kas-

kade 7 [17].

10

6.1 Two-rank setup

First we consider a minimal setup with two time inter-

vals distributed on two workstations connected by Eth-

ernet, in order to measure communication times with

as few disturbances as possible. We use a time interval

size of 0.125, so T = 0.25. For the coarse propagator, we

use SDC with 3 Gauss-Lobatto collocation nodes; the

fine propagator is SDC on 5 Gauss-Lobatto nodes. In

each case, there are 263 169 spatial degrees of freedom.

Overall we perform 10 iterations. This setting allows

to investigate the effect of compression on the runtimes

of the method as well as on the accuracy, without be-

ing influenced too much by factors like shared access to

network and compute resources.

In Table 1 we report the wall-clock times (averaged

over 5 runs) for uncompressed and compressed commu-

nication with a prescribed relative tolerance of 10−8,

leading to an overall compression factor of 3.7.

Note that in the PFASST algorithm, as well as in

the PFASST++ library used here, communication is

performed in the fine and coarse propagator separately,

where the fine propagator can use interleaving of com-

putation and communication, leading to very small com-

munication times. The coarse propagator in contrast

has to wait for the send/receive to finish before com-

putations can continue. As it uses the same spatial dis-

cretization, communication times are more significant

there. With the time required for compression being

larger than the send/receive times of the fine propaga-

tor, this suggests to use compression only for the coarse

propagators communication.

The achieved accuracy for this setting, using a quan-

tization tolerance of 10−8, is shown in Table 2. Us-

ing compression only for the coarse propagator leads

to similar accuracy as in the uncompressed case, us-

ing compression in fine and coarse propagators leads to

a slightly larger residual. The overall communication

times were reduced to 83% and 81%, respectively.

Despite the simplified setup, computation times vary

between different runs, making them hard to compare.

Besides the shared access to resources, there are other

factors influencing the results. E.g., switching off com-

pression for the fine propagator leads to longer wait

times for the coarse propagators communication, thus

increasing the communication time there.

6.2 More than two ranks

Here we consider T = 1 and use N = 16 time inter-

vals and processors, distributed across different work-

stations equipped with Intel Xeon E3-1245 v5 CPUs

clocked at 3.5GHz and connected by gigabit Ethernet.

On each macro time interval of size 0.0625 we again use

SDC with 3 and 5 Gauss-Lobatto quadrature nodes as

coarse and fine propagator. For each, 66 049 spatial de-

grees of freedom were used.

We stop the iterations on time interval n when the

maximum norm of the correction on this interval is be-

low 10−6, and the previous time step n − 1 is already

converged. This leads to 12 iterations on the last time

step. Using compression with relative quantization tol-

erance of 10−8 (leading to a rather small compression

factor of about 2) takes the same number of iterations,

but reduces the overall communication time (for all pro-

cessors) from 160.1s to 73.9s (time per sweep: 0.46s and

0.21s respectively), a reduction by 53.9%. In this test,

we did not use compression for sending the very first

update to the initial value, as the quantization error

there had larger impact on convergence and commu-

nication times than the following incremental updates.

The overall runtime of the PFASST algorithm is re-

duced by 10%, from 103.9s to 93.2s. In both cases we

get a relative L2([0, T ] × Ω)-error of 2.1 · 10−4 in the

final iterate compared to the analytical solution. Note

that the stopping criterion in this experiment differs

from the standard stopping criterion of the PFASST

method, where usually the norm of the residual of the

collocation formulation

r(t) := u(0)−(u(t)−

∫ t

0

f(u(τ)) dτ)

is used. This is not feasible here: Using lossy compres-

sion for sending updates to initial values typically leads

to spatially oscillating compression errors in the ap-

proximate solution, which are amplified by the Laplace

operator in the original equation. In the two-rank test

case above, the computation was stopped after a fixed

number of sweeps and not iterated until convergence;

as the corrections and residuals are still large enough,

this error amplification is not an issue there.

Increasing the compression factor did not yield any

further improvements for the run times. One reason for

this might be that communication and computation is

not as well interleaved as before, leading to longer wait

times in the communication. Note that not only com-

putation times but also communication times strongly

depend on algorithmic parameters like the number of

quadrature nodes for the coarse and fine propagators.

The interplay of compression, algorithmic options and

other ingredients of PFASST, like the FAS-correction,

requires further investigation to facilitate an optimal

choice of these parameters.

Moreover, as the tests are run on a standard TCP/IP

over Ethernet connection not optimized for high perfor-

mance computing, there are various sources influencing

11

rank task uncompressed compress fine and coarse compress only coarse

0 sweeps fine 104.7 104.6 105.4sweeps coarse 52.3 52.2 52.7

comm fine 0.002 1.3 0.002comm coarse 11.1 8.9 9.1

1 sweeps fine 105.9 106.3 106.1

sweeps coarse 53.0 53.1 52.9comm fine 0.002 0.5 0.002

comm coarse 3.6 1.5 2.8

Table 1: Wall clock times in seconds for the two-rank setup for the individual steps.

rank uncompressed compress fine and coarse compress only coarse

0 9.1 · 10−6 9.1 · 10−6 9.1 · 10−6

1 1.7 · 10−4 2.5 · 10−5 1.8 · 10−5

Table 2: Final relative residuals.

communication times besides shared access to the net-

work, for example configuration of switches or firewalls,

varying Ethernet adapters, or other, unknown ones. In

our test runs, communication times varied from work-

station to workstation, despite having the same proces-

sors, memory, and software. While the available net-

work bandwidth can be measured to some extent, in

the actual computations it was drastically lower than

predicted. In settings like this, or even more severely in

cloud computing, robustness with respect to the band-

width is important; in this respect using lossy compres-

sion for communication can be quite beneficial.

Considering scaling experiments on a state of the

art compute cluster, we fixed a final time T = 2 and

macro time step 0.0625 and used PFASST (as described

above) to solve the equation on 1, 8, 16 and 32 pro-

cessors on the HLRN supercomputer (Cray XC30/40

with Aries interconnect of the North-German Super-

computing Alliance, www.hlrn.de). The sequential ver-

sion required a total of 348 iterations, an average of

10.875 per time interval. In the parallel case, on the

last rank an overall 50, 25, 12 iterations were required

for 8, 16, 32 processors (each computing 4, 2, 1 macro

time intervals), respectively. Using compressed commu-

nication with a quantization tolerance of 10−8 did not

influence the number of iterations; also the overall com-

putation time was only influenced insignificantly. This

indicates that in this experiment communication band-

width and latency are not an issue. In realistic, large-

scale applications, using domain decomposition in space

as well as time-parallel integration, we expect this to be

different, as memory bandwidth would already be sat-

urated due to communication required by the spatial

domain decomposition.

Conclusions

A complete and rather general convergence theory for

hybrid parareal schemes with inexact communication

has been presented. While the error estimates are not

in the least sharp, the qualitative error behavior is cap-

tured quite well and allows to estimate the impact of

inexact communication due to lossy compression on the

iteration count and hence on the parallel efficiency. The

theoretical results, supported by computational exper-

iments, indicate that lossy compression improves the

efficiency, but the amount of improvement depends on

the setup. Compression is particularly effective if the

available communication bandwidth is small. This in-

cludes commodity systems, clusters and cloud comput-

ing, where bandwidth is often low or varying, but also

high performance systems where the communication

network is already saturated due to concurrent com-

munication going on.

Acknowledgement. Partial funding by BMBF Project

SOAK and DFG Project WE 2937/6-1 is gratefully

acknowledged. The authors would also like to thank

Th. Steinke, F. Wende, A. Kammeyer and the North-

German Supercomputing Alliance (HLRN) for support-

ing the numerical experiments.

References

1. M. Alhubail, Q. Wang, and J. Williams. The swept rule

for breaking the latency barrier in time advancing two-dimensional PDEs. Preprint arXiv:1602.07558, 2016.

2. Maitham Alhubail and Qiqi Wang. The swept rule for break-

ing the latency barrier in time advancing PDEs. J. Comput.

Phys., 307:110–121, 2016.3. E. Aubanel. Scheduling of tasks in the parareal algorithm.

Parallel Computing, 37:172–182, 2011.

12

4. G. Bal. On the convergence and the stability of the pararealalgorithm to solve partial differential equations. In R. Ko-

rnhuber, R. Hoppe, J. Periaux, O. Pironneau, O. Widlund,

and J. Xu, editors, Proceedings of DD15, volume 40 of Lec-ture Notes in Computational Science and Engineering, pages

425–432. Springer, 2004.5. A.T. Barker. A minimal communication approach to parallel

time integration. Int. J. Comp. Math., 91(3):601–615, 2014.6. M. Bolten, D. Moser, and R. Speck. Asymptotic convergence

of the parallel full approximation scheme in space and time

for linear problems. Preprint arXiv:1703.07120, 2017.7. A.J. Christlieb, C.B. Macdonald, and B.W. Ong. Parallel

high-order integrators. SIAM J. Sci. Comput., 32(2):818–

835, 2010.8. M. Duarte, M. Massot, and S. Descombes. Parareal operator

splitting techniques for multi-scale reaction waves: Numerical

analysis and strategies. M2AN, 45(5):825–852, 2011.9. A. Dutt, L. Greengard, and V. Rokhlin. Spectral deferred

correction methods for ordinary differential equations. BIT,

40(2):241–266, 2000.10. M. Emmett and M.L. Minion. Toward an efficient parallel in

time method for partial differential equations. Comm. Appl.

Math. Comp. Sci., 7(1):105–132, 2012.11. R. Filgueira, D.E. Singh, J. Carretero, A. Calderon, and

F. Garcıa. Adaptive-compi: Enhancing MPI-based applica-tions’ performance and scalability by using adaptive com-

pression. The International Journal of High Performance

Computing Applications, 25(1):93–114, 2011.12. M. Gander. 50 years of time parallel time integration. In

T. Carraro, M. Geiger, S. Korkel, and R. Rannacher, ed-itors, Multiple Shooting and Time Domain Decomposition

Methods, volume 9 of Contributions in Mathematical and

Computational Sciences, pages 69–113. Springer, 2015.13. M.J. Gander and E. Hairer. Nonlinear convergence analy-

sis for the parareal algorithm. In U. Langer, M. Discacciati,

D.E. Keyes, O.B. Widlund, and W. Zulehner, editors, Do-

main Decomposition Methods in Science and EngineeringXVII, pages 45–56. Springer Berlin Heidelberg, 2008.

14. M.J. Gander and M. Neumuller. Analysis of a new space-time

parallel multigrid algorithm for parabolic problems. SIAM

J. Sci. Comput., 38(4):A2173–A2208, 2016.15. M.J. Gander and S. Vandewalle. Analysis of the parareal

time-parallel time-integration method. SIAM J. Sci. Com-put., 29(2):556–578, 2007.

16. S. Gotschel and Weiser. M. Lossy compression for PDE-

constrained optimization: Adaptive error control. Comput.

Optim. Appl., 62:131–155, 2015.17. S. Gotschel, M. Weiser, and A. Schiela. Solving optimal con-

trol problems with the Kaskade 7 finite element toolbox. InA. Dedner, B. Flemisch, and R. Klofkorn, editors, Advances

in DUNE, pages 101–112. Springer, 2012.18. D. Guibert and D. Tromeur-Dervout. Parallel deferred cor-

rection method for CFD problems. In J.-H. Kwon, J. Periaux,P. Fox, N. Satofuka, and A. Ecer, editors, Parallel Compu-

tational Fluid Dynamics 2006: Parallel Computing and its

Applications, pages 131–136, 2007.19. J. Ke, M. Burtscher, and E. Speight. Runtime compression of

MPI messages to improve the performance and scalability of

parallel applications. In Supercomputing, 2004. Proceedings

of the ACM/IEEE SC2004 Conference, page 59, 2004.20. S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and

D. Glasco. Gpus and the future of parallel computing. IEEEMicro, 31(5):7–17, 2011.

21. T. Klatt, M. Emmett, D. Ruprecht, R. Speck, andS. Terzi. PFASST++. http://www.parallelintime.org/

codes/pfasst.html, 2015. Retrieved: 16 50, May 04, 2017(GMT).

22. S. Leyffer, S.M. Wild, M. Fagan, M. Snir, K. Palem,

K. Yoshii, and H. Finkel. Doing Moore with less - Leapfrog-ging Moore’s law with inexactness for supercomputing.

CoRR, abs/1610.02606, 2016.23. J.-L. Lions, Y. Maday, and G. Turinici. A parareal in time

discretization of pdes. C.R. Acad. Sci. Paris, Serie I,,

332:661–668, 2001.24. J. Liu, Y. Wang, and R. Li. A hybrid algorithm based on

optimal quadratic spline collocation and parareal deferred

correction for parabolic PDEs. Math. Probl. Eng., 2016. Ar-ticle ID 6943079.

25. E. McDonald and A.J. Wathen. A simple proposal for paral-

lel computation over time of an evolutionary process with im-plicit time stepping. In Proceedings of the ENUMATH 2015,

Lecture Notes in Computational Science and Engineering.

Springer, to appear.26. M.L. Minion. A hybrid parareal spectral deferred correc-

tions method. Comm. Appl. Math. Comp. Sci., 5(2):265–301, 2010.

27. A.S. Nielsen, G. Brunner, and J.S. Hesthaven.

Communication-aware adaptive parareal with applica-tion to a nonlinear hyperbolic system of partial differential

equations. EPFL-ARTICLE 228189, EPFL, 2017.

28. A. Nunez, R. Filgueira, and M.G. Merayo. SANComSim: Ascalable, adaptive and non-intrusive framework to optimize

performance in computational science applications. Procedia

Computer Science, 18:230 – 239, 2013.29. T. Palmer. Build imprecise supercomputers. Nature, 526:32–

33, 2015.

30. D. Ruprecht. Wave propagation characteristics of parareal.Preprint arXiv:1701.01359v1 [math.NA], 2017.

31. K.P. Saravanan, P.M. Carpenter, and A. Ramirez. A per-formance perspective on energy efficient hpc links. In ICS

’14 Proceedings of the 28th ACM international conference

on Supercomputing, pages 313–322, 2014.32. A. Srinivasan and N. Chandra. Latency tolerance through

parallelization of time in scientific applications. Parallel

Computing, 31:777–796, 2005.33. A. Toselli and O.B. Widlund. Domain Decomposition Meth-

ods – Algorithms and Theory, volume 34 of Computational

Mathematics. Springer, 2005.34. M. Weiser. Faster SDC convergence on non-equidistant grids

by DIRK sweeps. BIT Numerical Mathematics, 55(4):1219–

1241, 2015.35. M. Weiser and S. Gotschel. State trajectory compression for

optimal control with parabolic PDEs. SIAM J. Sci. Comput.,34(1):A161–A184, 2012.

36. S.-L. Wu and T. Zhou. Convergence analysis for three

parareal solvers. SIAM J. Sci. Comput., 37(2):A970–A992,2015.

lossy data compression reduces communication time in ... › weiser › papers ›...

Documents