lossy data compression reduces communication time in ... › weiser › papers ›...
TRANSCRIPT
Noname manuscript No.(will be inserted by the editor)
Lossy data compression reduces communication time inhybrid time-parallel integrators
Lisa Fischer · Sebastian Gotschel · Martin Weiser
To appear in Computing and Visualization in Science, DOI 10.1007/s00791-018-0293-2
Abstract Parallel-in-time methods for solving initial
value problems are a means to increase the parallelism
of numerical simulations. Hybrid parareal schemes in-
terleaving the parallel-in-time iteration with an itera-
tive solution of the individual time steps are among
the most efficient methods for general nonlinear prob-
lems. Despite the hiding of communication time be-
hind computation, communication has in certain sit-
uations a significant impact on the total runtime. Here
we present strict, yet not sharp, error bounds for hy-
brid parareal methods with inexact communication due
to lossy data compression, and derive theoretical esti-
mates of the impact of compression on parallel efficiency
of the algorithms. These and some computational ex-
periments suggest that compression is a viable method
to make hybrid parareal schemes robust with respect to
low bandwidth setups.
1 Introduction
Nowadays, the computing speed of single CPU cores
barely increases, such that performance gains are mostly
due to increasing parallelism: number of compute nodes,
number of CPU cores per node, graphics cards, and
vectorization. Correspondingly, parallelization of algo-
rithms such as PDE solvers for initial value problems
is constantly gaining importance. In addition to estab-
lished spatial domain decomposition methods [33], the
surprisingly old idea of parallelization in time has seen
growing interest in the last decade [12].
One of the prototypical parallel-in-time algorithms
for initial value problems is the parareal scheme [23],
L. Fischer · S. Gotschel · M. Weiser
Zuse Institute Berlin, Takustr. 7, 14195 Berlin, GermanyE-mail: lisa.fischer | goetschel | [email protected]
computing N subtrajectories in parallel by an exact or
fine propagator and transporting the resulting jumps
over the whole time horizon by means of a suitable, se-
quential fast or coarse propagator, the latter being in-
dispensable for efficiency. Explicit error estimates guar-
antee r-linear [4, 8, 13] or q-linear [15, 36] convergence
rates.
As after N steps the exact solution has been com-
puted, both in sequential and in parareal schemes, the
parallel efficiency is bounded by 1/Jh, the inverse of the
number of iterations needed. A fast convergence within
a small number of iterations independent of N usually
requires a rather accurate and hence expensive coarse
propagator, which in turn limits parallel efficiency.
A step towards higher parallel efficiency are hybrid
parareal methods computing the fine propagator in an
iterative way. This allows to interleave the parareal it-
eration with each fine propagator iteration and hence to
perform the global transport of corrections more often.
Examples for iterative fine propagator schemes used in
hybrid parareal methods are multigrid solvers for im-
plicit Runge-Kutta methods [25] or spectral deferred
correction methods [9,34] for solving higher order collo-
cation systems [7,10,26]. Convergence theory for hybrid
parareal schemes is less well covered, see, e.g., [6,24,25].
Apart from convergence speed, the repeated com-
munication necessary in parareal schemes for large scale
problems such as discretized partial differential equa-
tions affects wall clock time and therefore parallel effi-
ciency. Its impact has been reduced by adaptive work
scheduling [3,27] and pipelining computations, such that
communication occurs mostly parallel to computation
and both transmission time and latency are hidden to
some extent [7,10], even though this appears to be diffi-
cult to achieve in practice with current message passing
interface (MPI) implementations. Hiding communica-
2
tion behind computation is also difficult because what is
called computation might be a spatially parallelized do-
main decomposition solver, performing inter-node com-
munication and possibly saturating the communication
network on its own. The impact of communication is
moreover growing, not only because of increasing par-
allelism, but also because computing power tends to
grow faster than communication bandwidth [20].
Approaches for reducing communication include Nie-
vergelt type parallel-in-time integration [5] mainly for
linear equations, and the concept of guided simulations
in case several similar equations are to be solved [32],
e.g., for parameter studies. Especially for hyperbolic
equations, the so called swept rule [1, 2] can be used
to reduce not the amount of data transmitted, but the
number of required messages, which is of interest in
systems with high latency.
Besides methods for such specific problem settings,
compression of MPI messages has gained interest in the
last decade, aiming at methods which can be used re-
gardless of the actual application. Mostly, lossless com-
pression algorithms are considered in order not to spoil
the results of computations, such that only a small re-
duction in size can be achieved. Nevertheless, reduc-
tions of computation times are still achieved [11,19,28].
Let us also note that data compression may also re-
duce the energy spent on communication – an inter-
esting option for HPC systems approaching the power
wall [22, 29,31].
Here, we explore the reduction of communication in
hybrid parareal schemes by lossy data compression. On
the one hand, lossy compression, if applicable, achieves
higher compression factors than lossless compression,
e.g., for audio (MP3) and images (JPEG). On the other
hand, it affects the convergence of the parallel-in-time
algorithm. We estimate the influence of the quantiza-
tion error onto convergence, such that a suitable tol-
erance with only minor impact can be chosen. In the
course of estimating the effect of inexact communica-
tion, we also derive a rigorous, though not sharp, er-
ror bound for hybrid parareal methods. For estimating
compression efficiency and in the numerical tests we em-
ploy transform coding tailored towards finite element
coefficient vectors [16,35].
In Section 2 we specify the abstract problem set-
ting and state the general assumptions. For compari-
son purposes, we give a sequential iterative time step-
ping scheme as a reference in Section 3, using the same
stationary iteration as will be employed for the hybrid
parareal method in Section 4. Besides an a priori error
bound that takes inexact communication into account,
parallel efficiency is estimated. Section 5 briefly recalls
lossy compression before investigating its expected im-
pact on parallel efficiency in several situations. Finally,
actual computation results are presented in Section 6.
2 Problem setting
Let us begin with stating the abstract problem to be
solved and introducing the associated notation. We are
interested in numerically solving the initial value prob-
lem
u = f(u), u(0) = v0∗
for given starting value v0∗ on the global time inter-
val [0, T ], where we assume that the exact solution
u ∈ C([0, T ], V ) exists and assumes values in some
Hilbert space V . Without loss of generality, we restrict
the attention to autonomous problems.
For a time-parallel solution, the global interval is
subdivided into N equidistant local time intervals In =
[nh, (n+ 1)h], n = 0, . . . , N − 1, with length h = T/N .
On each local interval In, the solution subtrajectory
un = u|In is approximated by an element of a sub-
space Un ⊂ C(In, V ) equipped with the max-norm
‖un‖ := maxt∈In ‖un(t)‖. Furthermore, assume that at
least affine functions of time are contained in Un, i.e.
P1(In, V ) ⊂ Un, with Pk denoting the space of polyno-
mials of order up to k.
Abstracting from the details of time stepping, we
assume the availability of convergent fixed point iter-
ations in terms of Fn(u; v), Fn : Un × V → Un for
solving the initial value problems u = f(u), u(nh) = v
on In for initial value v ∈ V . We accept the fixed point
un∗ (v) := Fn(un∗ (v); v) as a function of the initial value v
as the “exact” solution on In, even if it is only a particu-lar approximation, e.g., a collocation solution. Concrete
examples of such iterative solvers are spectral deferred
correction methods converging towards the collocation
solution, where Un = Pk(In, V ), or multigrid V-cycles
converging towards an implicit Euler step, where V is
a finite element space and Un(In, V ) = P1(In, V ) is
affine in time. The global trajectory u∗ over the global
interval [0, T ] is then defined by chaining the local fixed
points, i.e. u∗|In = un∗ (vn∗ ) via the continuity condition
vn+1∗ = un∗ ((n+1)h), n = 0, . . . , N−1. We simply write
un∗ for un∗ (vn∗ ) if no ambiguity arises.
In other words, unj denotes the approximate subtra-
jectory on the n-th time interval at iteration j, defined
in terms of the initial value vnj and the previous iterate
unj−1. The subscript ∗ always indicates the limit point
for j →∞.
Assumption 1 Let the exact evolution have a Lipschitz
constant of L ≥ 1, i.e. ‖un∗ (v) − un∗ (v)‖ ≤ L‖v − v‖for arbitrary v, v ∈ V , the fixed point iterations Fn
3
have a uniform contraction factor 0 < ρ < 1, and
the right hand side be bounded such that ‖un∗ (v)((n+
1)h) − v‖ ≤ ch. Moreover, assume there is K < ∞such that ‖un∗ (v) − un∗ (v) − (v − v)‖ ≤ η‖v − v‖ with
η := hK exp(hK) holds for any v and v.
Remark 1 Already from the choice of the max-norm on
Un, the bound L ≥ 1 is implied. This excludes the
proper treatment of purely dissipative problems, which,
however, are anyways not particularly interesting in the
autonomous setting.
Note that usually both L and ρ depend on h, which
is important to keep in mind when looking at strong
scalability with T = const and growing N . The last as-
sumption is natural as it is a direct result for any rea-
sonable discretization defining un∗ if the right hand side
f is Lipschitz continuous due to the following lemma.
Lemma 1 Assume that u(v) ∈ C([0, T ], V ) satisfies
u(v)(t) = f(u(v)(t)) on [0, h]
for any initial value v and that the right hand side f is
Lipschitz continuous with constant K. Then,
‖u(v)−u(v)− (v− v)‖ ≤ η‖v− v‖, η := hK exp(hK)
Proof Let ξ(t) := (u(v) − u(v))(t) − (v − v). Then, ξ
satisfies
‖ξ(t)‖ =
∥∥∥∥∫ t
τ=0
(f(u(v)(τ))− f(u(v)(τ))) dτ
∥∥∥∥≤∫ t
τ=0
K‖u(v)(τ)− u(v)(τ)‖ dτ
≤∫ t
τ=0
K(‖ξ(τ)‖+ ‖v − v‖) dτ
= tK‖v − v‖+
∫ t
τ=0
K‖ξ‖ dτ.
Gronwall’s inequality now yields
‖ξ(t)‖ ≤ tK‖v − v‖ exp(tK) ≤ hK exp(hK)‖v − v‖
and thus the claim. ut
3 Sequential reference
Judging the performance and efficiency bounds of hy-
brid parareal schemes requires the comparison with a
suitable reference. This will be provided by the obvi-
ous method to compute a complete trajectory within
the same problem setting. That is the sequential ap-
proach of stepping through the intervals one after the
other, performing a constant number J of fixed point
iterations on each interval:
Algorithm 1 Sequential iteration
(i) given starting value
v0j = v0∗ for j = 0, . . . , J
(ii) constant initialization
un0 (t) = vn0 for n = 0, . . . , N − 1, t ∈ In
(iii) approximate integration
unj = Fn(unj−1; vnj )
for n = 0, . . . , N − 1, j = 1, . . . , J
(iv) continuity
vnj = un−1J (nh)
for n = 1, . . . , N − 1, j = 0, . . . , J
Let us introduce the function ν : [1,∞[→ R by
νn(x) :=
n−1∑i=0
xi (1)
and gather some of its elemental properties for later
use.
Lemma 2 νn(x) is monotonically increasing in x and
bounded by
xn ≤ νn+1(x) ≤ (n+ 1)xn.
Moreover, 1 + xνn(x) = νn+1(x) holds.
With that, we derive an error estimate for the se-
quential Algorithm 1 defined above.
Theorem 1 Let Assumption 1 hold. Then the total er-
ror enj of the sequential iterative Algorithm 1 with Jiterations in each interval In is bounded by
‖unJ − un∗‖ ≤ enJ := chρjνi+1(L). (2)
Proof By the triangle inequality and Assumption 1 we
obtain Lady Windermere’s fan as
‖unJ − un∗‖ ≤ ‖unJ − un∗ (vnJ )‖+ ‖un∗ (vnJ )− un∗‖≤ ρJ‖un0 − un∗ (vnj )‖+ L‖vnJ − vn∗ ‖≤ ρJch+ L‖un−1j − un−1∗ ‖≤ chρJ + Len−1J
= chρJ(1 + Lνn(L)).
Applying Lemma 2 and induction over n yields the
claim. ut
Assume we ask for an accuracy of TOL relative to
the condition of the problem, i.e.
‖unJ−un∗‖ ≤ TOL chνi+1(L) for n = 0, . . . , N − 1. (3)
4
simulated time
wall
cloc
k tim
e
simulated time
wall
cloc
k tim
e
Fig. 1: Schematic representation of time stepping with iterative schemes. The number Js = 2 of iterations is deliberately chosen for
illustration purposes only. Left: Sequential method. Right: Hybrid parareal scheme.
Then the number of iterations required by Algorithm 1
to reach this accuracy is at most
Js ≤⌈
log TOL
log ρ
⌉. (4)
Let tF denote the computing time for a single evalu-
ation of Fn, e.g., for one SDC sweep or a single V-
cycle, assumed to be the same for all n. The total
time for computing the final value uN−1Jsis bounded
by ts ≤ NJstF , see Fig. 1, left.
4 An inexact hybrid parareal algorithm
The hybrid parareal scheme interleaves the time step-
ping with the stationary iteration on each local inter-
val In, propagating the result of each iteration on to
the next interval. As passing the information on by
only one interval per iteration leads to slow conver-
gence if N is large, sequential but fast coarse propaga-
tors Gn : V → V are used to distribute the correction
quickly through the remaining time intervals, see Fig. 1,
right.
The analysis follows the line of [13], but includes
the hybrid parareal fixed point iteration and considers
the communication of corrections of the initial value
from one time interval to the next to be subject to
errors due to lossy compression. The modification of the
initial value is represented by communication operators
Cnj : V → V .
Algorithm 2 Hybrid parallel iteration
(i) given starting value
v0j = v0∗ for j = 0, . . . , J
(ii) coarse propagation
vn0 = Gn−1(vn−10 ) for n = 1, . . . , N
(iii) piecewise linear initialization
un0 (t) = vn0 + (t/h− n)(vn+10 − vn0 )
for n = 0, . . . , N − 1
(iv) approximate integration
unj = Fn(unj−1 + vnj − vnj−1; vnj )
for n = 0, . . . , N − 1, j = 1, . . . , J
(v) continuity
vnj = vnj−1 + Cnj
(un−1j−1 (nh)− vnj−1
+Gn−1(vn−1j )−Gn−1(vn−1j−1 ))
for n = 1, . . . , N − 1, j = 1, . . . , J
In contrast to [24] we do not assume the fixed point
iterations on each interval to converge unaffected by
changing initial values, but treat this perturbation ex-
plicitly and consider inexact communication as well.
Theorem 2 In addition to Assumption 1, let the coarse
propagators Gn satisfy
‖un0 − un∗ (vn0 )‖ ≤ chγ0, (5)
‖(un∗ (v)− un∗ (v))((n+ 1)h)− (Gn(v)−Gn(v))‖≤ γ‖v − v‖, (6)
and
‖Gn(v)−Gn(v)‖ ≤ L‖v − v‖ (7)
5
for arbitrary v, v ∈ V and some γ0, γ < ∞. Let the
communication error be bounded by
‖v − Cnj (v)‖ ≤ ∆C‖v‖
for some ∆C <∞. Then the error estimate
‖vnj − vn∗ ‖ ≤ δnj := chγ0αρj−1νn+1(βj) (8)
holds with α = 1+∆C
1−∆C/ρand
βj = α
(γ
ρ+ L+ η(1 + ρ−1)j
).
Before proving the theorem, let us stress that in gen-
eral γ in (6) is of order one unless the coarse propagator
has a consistency order of at least 0, which is sufficient
for γ = O(h). This is, however, not satisfied by the com-
mon realization of coarse propagators acting on coarser
grids in PDE problems. But it is necessary to guarantee
fast convergence for large N , since otherwise correction
components that cannot be represented on the coarse
grid and hence not be propagated by G are handed on
by just one interval per iteration. The related impact
of insufficient coarse propagator accuracy in time has
been investigated in [30] for advection-diffusion equa-
tions. The practical success of coarse grid propagators
is due to the usually small amplitude of such error com-
ponents and a quick reduction of high frequent error
components in dissipative systems.
Proof In addition to (8) we will prove
‖unj−un∗ (vnj )‖ ≤ εnj := chγ0ρj(1+αη(1+ρ−1)jνn+1(βj)).
Starting the induction at n = 0 yields
‖v0j − v0∗‖ = 0 ≤ δ0j and ‖u0j − u0∗‖ ≤ chγ0 ρj ≤ ε0jdue to linear fixed point contraction with contraction
factor 0 < ρ < 1. On the other hand, j = 0 provides
‖un0 − un∗ (vn0 )‖ ≤ chγ0 ≤ εn0 by (5) and
‖vn+10 − vn+1
∗ ‖ ≤ ‖un0 − un∗‖≤ ‖un0 − un∗ (vn0 )‖+ ‖un∗ (vn0 )− un∗‖
≤ chγ0 + L‖vn0 − vn∗ ‖ ≤ chγ0νn+1(L) ≤ δn+10 .
For general n, j > 0 we let vn+1j+1 := unj ((n + 1)h) +
Gn(vnj+1)−Gn(vnj ) and obtain
‖vn+1j+1 − v
n+1∗ ‖ ≤ ‖vn+1
j + Cn+1j+1 (vn+1
j+1 − vn+1j )− vn+1
j+1 ‖+ ‖vn+1
j+1 − vn+1∗ ‖
≤ ∆C‖vn+1j+1 − v
n+1j ‖+ ‖vn+1
j+1 − vn+1∗ ‖
≤ ∆C
(‖vn+1j+1 − v
n+1∗ ‖+ ‖vn+1
∗ − vn+1j ‖
)+ ‖vn+1
j+1 − vn+1∗ ‖
≤ ∆C‖vn+1∗ − vn+1
j ‖+ (1 +∆C)‖vn+1
j+1 − vn+1∗ ‖. (9)
With the definitions of vn+1j+1 and εnj , the last term in (9)
can be bounded by
‖vn+1j+1 − v
n+1∗ ‖
= ‖unj ((n+ 1)h) +Gn(vnj+1)
−Gn(vnj )− un∗ (vn∗ )((n+ 1)h)‖≤ ‖unj − un∗ (vnj )‖
+ ‖(un∗ (vnj )− un∗ (vn∗ ))((n+ 1)h)
− (Gn(vnj )−Gn(vn∗ ))‖+ ‖Gn(vnj+1)−Gn(vn∗ )‖≤ εnj + γ‖vnj − vn∗ ‖+ L‖vnj+1 − vn∗ ‖.
Using the monotonicity of νn+1 to bound
δnj =νn+1(βj)
ρνn+1(βj+1)δnj+1 ≤
1
ρδnj+1, (10)
which is only asymptotically sharp, we can further es-
timate
εnj + γ‖vnj − vn∗ ‖+ L‖vnj+1 − vn∗ ‖≤ εnj + γδnj + Lδnj+1
≤ εnj +
(γ
ρ+ L
)δnj+1
≤ chγ0ρj(
1 + αη(1 + ρ−1)jνn+1(βj)
+
(γ
ρ+ L
)ανn+1(βj+1)
)≤ chγ0ρj
(1 + α
(η(1 + ρ−1)j +
γ
ρ+ L
)νn+1(βj+1)
)≤ chγ0ρj(1 + βj+1νn+1(βj+1))
= chγ0ρjνn+2(βj+1)
≤ 1
αδn+1j+1 .
Inserting this into (9) yields
‖vn+1j+1 − v
n+1∗ ‖ ≤ ∆Cδ
n+1j + (1 +∆C)‖vn+1
j+1 − vn+1∗ ‖
≤ ∆C
ρδn+1j+1 + (1 +∆C)
1−∆C/ρ
1 +∆Cδn+1j+1
≤ δn+1j+1 .
Moreover, due to the approximate integration by Fn in
step (iv) of Algorithm 2, we have
‖unj − u(vnj )‖≤ ρ‖unj−1 + vnj − vnj−1 − un∗ (vnj )‖≤ ρ(‖unj−1 − un∗ (vnj−1)‖
+ ‖un∗ (vnj−1)− un∗ (vnj )− (vnj−1 − vnj )‖).
6
Using η as given in Assumption 1 we get
ρ(‖unj−1 − un∗ (vnj−1)‖+ ‖un∗ (vnj−1)− un∗ (vnj )− (vnj−1 − vnj )‖
)≤ ρ(εnj−1 + η‖vnj − vnj−1‖)≤ ρ
(εnj−1 + η(‖vnj − vn∗ ‖+ ‖vnj−1 − vn∗ ‖)
)≤ ρ(εnj−1 + η(δnj + δnj−1))
≤ ρ(εnj−1 + η(1 + ρ−1)δnj )
≤ chγ0ρj(1 + αη(1 + ρ−1)(j − 1)νn+1(βj−1)
+ αη(1 + ρ−1)νn+1(βj))
≤ chγ0ρj(1 + αη(1 + ρ−1)jνn+1(βj)) = εnj ,
which completes the induction. ut
The asymptotic convergence rate is ρ, as in the se-
quential SDC algorithm, independent of communica-
tion accuracy and fast propagator.
Let us investigate the error bound in a straightfor-
ward but nontrivial example situation.
Example 1 We consider the harmonic oscillator
x = y, y = −x
with initial value x = 0, y = 1 and time horizon T = 3π,
subdivided into N = 16 intervals.
As a stationary fine propagator iteration we use
SDC on a two-point Gauss collocation grid with a ex-
plicit Euler base method, converging towards a fourth
order energy preserving collocation scheme. Consequent-
ly, L = 1 and η = 0.58 hold. As fast propagator we con-
sider (i) the zero propagator Gn(v) = 0 yielding just a
parallel SDC method [18] and (ii) the explicit Euler
propagator Gn(v) = v + hf(v).
In that setting, the iterates and hence right hand
sides are bounded in Assumption 1 with (i) c = 1186
and (ii) c = 18, respectively, and the fast propagation
errors can be bounded by (i) γ0 = γ = 0.58 and (ii)
γ0 = γ = 0.17. The observed SDC contraction rate is
ρ = 0.20.
Resulting iteration errors and error bounds are de-
picted in Fig. 2. While the global error behavior is re-
produced rather well by the bound (8), the bound is
far from being sharp, in particular for larger n. On one
hand this is due to the rather general assumptions on
the stationary iteration, with the observed average case
being much better than the bounded worst case. On the
other hand, the estimates in the proof are not sharp,
in particular (10). For that reason we provide a fitted
error estimate as well, where the constants in (8) are
obtained by a least squares fit to the actual errors. Of
course, this reproduces the actual errors much better.
The impact of communication errors is simulated by
adding normally distributed noise of relative standard
deviation ∆C = 0.1, which amounts to less than four
bits per coefficient. Comparing the results in the last
row of Fig. 2 with the second row corresponding to
∆C = 0 suggests that the impact of even such a large
relative communication error on convergence is minor.
Parallel efficiency. Next we estimate the parallel effi-
ciency of the hybrid parareal method based on the error
bound (8). As this is overestimating the error, the effi-
ciency estimates will be rather pessimistic.
For a relative accuracy of TOL as specified in (3),
the number Jh of required iterations is bounded by
Jh ≤ 1+log(
TOL νN+1(L)αγ0νN+1(βJh
)
)log ρ
≤ 1+
log
(TOLLN
αγ0(N+1)βNJh
)log ρ
.
Assuming L > 1 and N rather large, this can be ap-
proximated by
Jh ≈ 1 +log TOL
log ρ− N log(βJh/L) + log(αγ0(N + 1))
log ρ
≈ 1 + Js
−N log
(α(γLρ + 1 + η
L (1 + ρ−1)Jh
))log ρ
− log(αγ0(N + 1))
log ρ.
With α = (1+∆C)/(1−∆C/ρ) and ∆C 1, we obtain
logα ≈ (1 + ρ−1)∆C and thus
Jh ≈ 1 + Js −N log
(α(γLρ + 1 + η
L (1 + ρ−1)Jh
))log ρ
− (1 + ρ−1)∆C + log(γ0(N + 1))
log ρ
≤ 1 + Js −N(γLρ + η
L (1 + ρ−1)Jh
)log ρ
− (N + 1)(1 + ρ−1)∆C + log(γ0(N + 1))
log ρ.
Thus, Jh grows only logarithmically with N , if both
∆C = O(N−1) and
γ
Lρ+η
L(1 + ρ−1)Jh = O(N−1)
hold. The former is rather easy to satisfy with a small,
i.e. logarithmic, increase in communication time. The
latter condition, however, requires in particular γ and
η to be of order N−1. As γ denotes the consistency error
of the coarse propagator, e.g., an Euler step with O(h2),
and η is of orderO(h), this meansNh = O(1). Actually,
7
log10 ‖vnj − vn∗ ‖
jn
log10 ‖vnj − vn∗ ‖
jn
log10 ‖vnj − vn∗ ‖
jn
log10 δnj
jn
log10 δnj
jn
log10 δnj
jn
log10 δnj
jn
log10 δnj
jn
log10 δnj
jn
Fig. 2: Errors and bounds for Example 1 in logarithmic scale. Left: errors ‖vnj − vn∗ ‖. Mid: error bound (8). Right: error bound withconstants estimated from actual errors. Top: zero fast propagator. Mid: Euler fast propagator. Bottom: Euler fast propagator with
communication error ∆C = 0.1.
this restriction is to be expected, since the theory covers
ill-conditioned and chaotic systems, where an extension
of the integration interval cannot be expected to keep
the iteration number constant.
As is apparent from Fig. 1, the total computation
time is th = Jh(tF +tG)+(N−1)(tG+tC), which yields
a parallel efficiency of
Eh =tsNth
=JstF
Jh(tF + tG) + (N − 1)(tG + tC). (11)
Obviously, a large efficiency close to one requires Jh ≈Js, which in turn requires TOL very small, or both
∆C and h small. Additionally, efficiency requires tG tF , and N(tG + tC) JhtF . The former condition is
rather easy to satisfy, in fact, using the same fixed point
iteration for both fine and coarse propagators would
push the efficiency only down to 0.5 due to the Jh(tF +
tG) term in the denominator. The second condition is
much harder to satisfy, in particular for highly parallel
computations with large N .
The relation of ∆c and tC and their impact on par-
allel efficiency Eh is discussed in the next section.
5 Lossy data compression
First we recall lossy compression methods for coefficient
vectors y ∈ Rm of spatial discretizations, in particular
of finite element approaches. The results are applicable
as well to similar techniques like finite difference and
finite volume schemes.
The simplest compression (rounding of coefficients
to the required accuracy) provides a compression ratio
S = − log2∆C
64(12)
8
relative to uncompressed double precision binary repre-
sentation. Note that transform coding with hierarchical
basis or wavelet transform and entropy coding yields
even better results [35]. For the sake of simplicity, how-
ever, we will stick with (12).
For very small ∆C , the message size is usually boun-
ded from above by the size of the raw binary repre-
sentation of double accuracy floating point values, i.e.
S ≤ 1. Given a time tT required to transmit the un-
compressed data set and a latency tL independent of
∆C that includes the time for encoding and decoding,
if performed, as well as usual communication latencies,
we arrive at a communication time tC = tL + StT . In-
serting this into (11) yields the parallel efficiency model
Eh =JstF
Jh(tF + tG) + (N − 1)(tG + tL − tT log2∆C/64),
(13)
which allows to investigate the impact of data compres-
sion. This impact and the optimal value of ∆C depends
on the values of N, Js, tF , Jh, tG, tL, and tT characteriz-
ing the problem to be solved, the coarse and fine prop-
agators, and the computing system used.
In the following we will explore several scenarios in
order to get an idea in which settings data compres-
sion may have a significant impact on parallel efficiency.
Starting from a nominal setting with SDC of order six
as stationary iteration, and first order Euler as fast
propagator, other scenarios modify the parameters in
a particular direction as outlined below.
Nominal scenario. This is a basic setting characterized
by moderate parallelism, iteration counts that are
typically observed in SDC methods for engineer-
ing tolerances, a coarse propagator time that could
come from a single Euler step compared to a full
SDC sweep, and communication times that are plau-
sible for current standard hardware. Times are given
relative to the fine propagator time tF . Values for
the nominal scenario are indicated by ·. N = 16,
L = 1.5, TOL = 10−6, ρ = 0.2, γ = 0.05, γ0 = 0.1,
η = 0.05, tF = 1, tG = 0.1, tT = 0.1. For the la-
tency tL = tL,e/d+tL,c, we assume a contribution of
tL,e/d = 0.005 from encoding/decoding if performed
and the same amount tL,c = 0.005 from the com-
munication system. With these parameters, (4), (8)
and (11) yield estimates of Js = 9, Jh = 28, and
Eh = 0.28, respectively.
Different tolerances. Aiming at a different tolerance
TOL = s ˆTOL for s ∈ [10−4, 102] usually goes along
with choosing a different order of collocation dis-
cretization and, at the same time, a different time
step size. Smaller time steps tend to decrease ρ,
whereas higher order SDC tends to increase ρ. For
simplicity, we assume the contraction ρ = ρ to be
unaffected by s. In contrast, for a nominal order six
we assume the time step h to scale like s1/6 and
hence obtain L = 1 + (L − 1)s1/6 and η = s1/6η.
Moreover, the Euler consistency error depends on
the time step as well, such that we assume γ0 =
s1/3γ0 and γ = s1/3γ.
Scalability. The natural mode for parallel-in-time inte-
gration is to have only time interval per processor,
and hence scaling means in general weak scaling
with growing time horizon T = sT and N = sN .
Thousands of intervals have been used for time-
parallel computations [14], such that we consider
s ∈ [0.25, 512]. Except for N , no parameter entering
the parallel efficiency estimate is assumed to change.
Commodity network hardware. Here we assume slower
communication links, due to cheaper network in-
frastructure, which might be encountered in work-
station clusters, cloud computing, or even DSL or
WiFi connections. Slower communication affects in
particular the bandwidth, i.e. tC = stC , but to some
extent also latency tL =√stL,c + tL,e/d, with s ∈
[10−2, 10].
In each scenario, and for each value of the scaling
parameter s, an optimal compression accuracy ∆C has
been computed numerically, maximizing the parallel ef-
ficiency based on (8). Figs. 3 and 4 show the paral-
lel efficiency of both the uncompressed (∆C = 0 and
S = 1) method and the lossy compression (∆C = opt,
S = − log2∆C/64).
In every setup, the optimal quantization error is
mostly in the range of 10−3 to 10−2, and improves par-
allel efficiency compared to the uncompressed version.
The effect, however, is rather small, predicted by theory
to be less than 5%. For decreasing tolerance, the parallel
efficiency grows as expected, see Fig. 3 left. However,
the benefit of compression, that could be assumed to
grow with larger tolerances, decreases. This is due to an
actually increasing number of iterations in the parallel
scheme due to larger step sizes affecting the effectivity
of the fast propagator, see Fig. 3 right. The number of
iterations in the sequential scheme, on the other hand,
decreases as expected, and is virtually unaffected by the
compression.
The theoretically predicted scalability is rather un-
satisfactory, see Fig. 4 left, due to the exponential in-
crease of ν in N and the corresponding growth of Jhfor growing number of time intervals. Correspondingly,
the benefit of compression that is assumed to improve
only the single sequential startup-phase becomes rela-
tively negligible, even though in absolute numbers it
increases linearly with N .
9
1e-11 1e-10 1e-9 1e-8 1e-7 1e-6 1e-5 1e-4 1e-30.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
1e-11 1e-10 1e-9 1e-8 1e-7 1e-6 1e-5 1e-4 1e-35
10
15
20
25
30
35
TOL TOL
Eh J
compressed
uncompressed
parallel
sequential
Fig. 3: Impact of lossy compression for varying tolerance as described in the Different tolerances scenario. The nominal setting ismarked in both plots. Left: Parallel efficiency Eh as estimated in (11) and (13) is improved slightly by compression. Right: Iteration
numbers Jh and Js grow with the tolerance for the parallel scheme due to larger step sizes decreasing the accuracy of the fast
propagator. In contrast, faster SDC contraction leads to fewer iterations in the sequential scheme.
1e+0 1e+1 1e+2 1e+30
0.1
0.2
0.3
0.4
0.5
0.6
0.7
1e-2 1e-1 1e+0 1e+10
0.05
0.1
0.15
0.2
0.25
0.3
0.35
N tT
Eh Eh
compressed
uncompressed
compressed
uncompressed
Fig. 4: Impact of lossy compression on parallel efficiency Eh. Left: Varying number N of time intervals. Right: Varying communicationtime tT . The nominal setting is marked in both plots.
As expected, a varying bandwidth has a significantly
larger effect on the compression benefit, see Fig. 4 right.
For transmission times tT that are at least three times
as high as assumed in the nominal scenario, the parallel
efficiency of the uncompressed scheme decreases rather
quickly, while compression allows to maintain the effi-
ciency for much longer communication times.
6 Computational Examples
Here we consider the inhomogeneous heat equation
ut −∆u = f in Ω × (0, T )
∂νu = 0 on ∂Ω × (0, T )
u(·, 0) = 0 in Ω,
(14)
with Ω = (0, 1)2 and source term
f(x, t) = (8π2 + exp(−t)− 8π2 exp(−t)) cos(2πx1) cos(2πx2).
The analytical solution to this equation is given by
u(x, t) = (1− exp(−t)) cos(2πx1) cos(2πx2).
For the numerical solution we discretize in space first,
using linear finite elements on a uniformly refined trian-
gular mesh. The arising system of ODEs is solved using
the Parallel Full Approximation Scheme in Space and
Time (PFASST) [10] as an instance of hybrid parareal
methods. For implementation we use PFASST++ [21]
in combination with the finite element toolbox Kas-
kade 7 [17].
10
6.1 Two-rank setup
First we consider a minimal setup with two time inter-
vals distributed on two workstations connected by Eth-
ernet, in order to measure communication times with
as few disturbances as possible. We use a time interval
size of 0.125, so T = 0.25. For the coarse propagator, we
use SDC with 3 Gauss-Lobatto collocation nodes; the
fine propagator is SDC on 5 Gauss-Lobatto nodes. In
each case, there are 263 169 spatial degrees of freedom.
Overall we perform 10 iterations. This setting allows
to investigate the effect of compression on the runtimes
of the method as well as on the accuracy, without be-
ing influenced too much by factors like shared access to
network and compute resources.
In Table 1 we report the wall-clock times (averaged
over 5 runs) for uncompressed and compressed commu-
nication with a prescribed relative tolerance of 10−8,
leading to an overall compression factor of 3.7.
Note that in the PFASST algorithm, as well as in
the PFASST++ library used here, communication is
performed in the fine and coarse propagator separately,
where the fine propagator can use interleaving of com-
putation and communication, leading to very small com-
munication times. The coarse propagator in contrast
has to wait for the send/receive to finish before com-
putations can continue. As it uses the same spatial dis-
cretization, communication times are more significant
there. With the time required for compression being
larger than the send/receive times of the fine propaga-
tor, this suggests to use compression only for the coarse
propagators communication.
The achieved accuracy for this setting, using a quan-
tization tolerance of 10−8, is shown in Table 2. Us-
ing compression only for the coarse propagator leads
to similar accuracy as in the uncompressed case, us-
ing compression in fine and coarse propagators leads to
a slightly larger residual. The overall communication
times were reduced to 83% and 81%, respectively.
Despite the simplified setup, computation times vary
between different runs, making them hard to compare.
Besides the shared access to resources, there are other
factors influencing the results. E.g., switching off com-
pression for the fine propagator leads to longer wait
times for the coarse propagators communication, thus
increasing the communication time there.
6.2 More than two ranks
Here we consider T = 1 and use N = 16 time inter-
vals and processors, distributed across different work-
stations equipped with Intel Xeon E3-1245 v5 CPUs
clocked at 3.5GHz and connected by gigabit Ethernet.
On each macro time interval of size 0.0625 we again use
SDC with 3 and 5 Gauss-Lobatto quadrature nodes as
coarse and fine propagator. For each, 66 049 spatial de-
grees of freedom were used.
We stop the iterations on time interval n when the
maximum norm of the correction on this interval is be-
low 10−6, and the previous time step n − 1 is already
converged. This leads to 12 iterations on the last time
step. Using compression with relative quantization tol-
erance of 10−8 (leading to a rather small compression
factor of about 2) takes the same number of iterations,
but reduces the overall communication time (for all pro-
cessors) from 160.1s to 73.9s (time per sweep: 0.46s and
0.21s respectively), a reduction by 53.9%. In this test,
we did not use compression for sending the very first
update to the initial value, as the quantization error
there had larger impact on convergence and commu-
nication times than the following incremental updates.
The overall runtime of the PFASST algorithm is re-
duced by 10%, from 103.9s to 93.2s. In both cases we
get a relative L2([0, T ] × Ω)-error of 2.1 · 10−4 in the
final iterate compared to the analytical solution. Note
that the stopping criterion in this experiment differs
from the standard stopping criterion of the PFASST
method, where usually the norm of the residual of the
collocation formulation
r(t) := u(0)−(u(t)−
∫ t
0
f(u(τ)) dτ)
is used. This is not feasible here: Using lossy compres-
sion for sending updates to initial values typically leads
to spatially oscillating compression errors in the ap-
proximate solution, which are amplified by the Laplace
operator in the original equation. In the two-rank test
case above, the computation was stopped after a fixed
number of sweeps and not iterated until convergence;
as the corrections and residuals are still large enough,
this error amplification is not an issue there.
Increasing the compression factor did not yield any
further improvements for the run times. One reason for
this might be that communication and computation is
not as well interleaved as before, leading to longer wait
times in the communication. Note that not only com-
putation times but also communication times strongly
depend on algorithmic parameters like the number of
quadrature nodes for the coarse and fine propagators.
The interplay of compression, algorithmic options and
other ingredients of PFASST, like the FAS-correction,
requires further investigation to facilitate an optimal
choice of these parameters.
Moreover, as the tests are run on a standard TCP/IP
over Ethernet connection not optimized for high perfor-
mance computing, there are various sources influencing
11
rank task uncompressed compress fine and coarse compress only coarse
0 sweeps fine 104.7 104.6 105.4sweeps coarse 52.3 52.2 52.7
comm fine 0.002 1.3 0.002comm coarse 11.1 8.9 9.1
1 sweeps fine 105.9 106.3 106.1
sweeps coarse 53.0 53.1 52.9comm fine 0.002 0.5 0.002
comm coarse 3.6 1.5 2.8
Table 1: Wall clock times in seconds for the two-rank setup for the individual steps.
rank uncompressed compress fine and coarse compress only coarse
0 9.1 · 10−6 9.1 · 10−6 9.1 · 10−6
1 1.7 · 10−4 2.5 · 10−5 1.8 · 10−5
Table 2: Final relative residuals.
communication times besides shared access to the net-
work, for example configuration of switches or firewalls,
varying Ethernet adapters, or other, unknown ones. In
our test runs, communication times varied from work-
station to workstation, despite having the same proces-
sors, memory, and software. While the available net-
work bandwidth can be measured to some extent, in
the actual computations it was drastically lower than
predicted. In settings like this, or even more severely in
cloud computing, robustness with respect to the band-
width is important; in this respect using lossy compres-
sion for communication can be quite beneficial.
Considering scaling experiments on a state of the
art compute cluster, we fixed a final time T = 2 and
macro time step 0.0625 and used PFASST (as described
above) to solve the equation on 1, 8, 16 and 32 pro-
cessors on the HLRN supercomputer (Cray XC30/40
with Aries interconnect of the North-German Super-
computing Alliance, www.hlrn.de). The sequential ver-
sion required a total of 348 iterations, an average of
10.875 per time interval. In the parallel case, on the
last rank an overall 50, 25, 12 iterations were required
for 8, 16, 32 processors (each computing 4, 2, 1 macro
time intervals), respectively. Using compressed commu-
nication with a quantization tolerance of 10−8 did not
influence the number of iterations; also the overall com-
putation time was only influenced insignificantly. This
indicates that in this experiment communication band-
width and latency are not an issue. In realistic, large-
scale applications, using domain decomposition in space
as well as time-parallel integration, we expect this to be
different, as memory bandwidth would already be sat-
urated due to communication required by the spatial
domain decomposition.
Conclusions
A complete and rather general convergence theory for
hybrid parareal schemes with inexact communication
has been presented. While the error estimates are not
in the least sharp, the qualitative error behavior is cap-
tured quite well and allows to estimate the impact of
inexact communication due to lossy compression on the
iteration count and hence on the parallel efficiency. The
theoretical results, supported by computational exper-
iments, indicate that lossy compression improves the
efficiency, but the amount of improvement depends on
the setup. Compression is particularly effective if the
available communication bandwidth is small. This in-
cludes commodity systems, clusters and cloud comput-
ing, where bandwidth is often low or varying, but also
high performance systems where the communication
network is already saturated due to concurrent com-
munication going on.
Acknowledgement. Partial funding by BMBF Project
SOAK and DFG Project WE 2937/6-1 is gratefully
acknowledged. The authors would also like to thank
Th. Steinke, F. Wende, A. Kammeyer and the North-
German Supercomputing Alliance (HLRN) for support-
ing the numerical experiments.
References
1. M. Alhubail, Q. Wang, and J. Williams. The swept rule
for breaking the latency barrier in time advancing two-dimensional PDEs. Preprint arXiv:1602.07558, 2016.
2. Maitham Alhubail and Qiqi Wang. The swept rule for break-
ing the latency barrier in time advancing PDEs. J. Comput.
Phys., 307:110–121, 2016.3. E. Aubanel. Scheduling of tasks in the parareal algorithm.
Parallel Computing, 37:172–182, 2011.
12
4. G. Bal. On the convergence and the stability of the pararealalgorithm to solve partial differential equations. In R. Ko-
rnhuber, R. Hoppe, J. Periaux, O. Pironneau, O. Widlund,
and J. Xu, editors, Proceedings of DD15, volume 40 of Lec-ture Notes in Computational Science and Engineering, pages
425–432. Springer, 2004.5. A.T. Barker. A minimal communication approach to parallel
time integration. Int. J. Comp. Math., 91(3):601–615, 2014.6. M. Bolten, D. Moser, and R. Speck. Asymptotic convergence
of the parallel full approximation scheme in space and time
for linear problems. Preprint arXiv:1703.07120, 2017.7. A.J. Christlieb, C.B. Macdonald, and B.W. Ong. Parallel
high-order integrators. SIAM J. Sci. Comput., 32(2):818–
835, 2010.8. M. Duarte, M. Massot, and S. Descombes. Parareal operator
splitting techniques for multi-scale reaction waves: Numerical
analysis and strategies. M2AN, 45(5):825–852, 2011.9. A. Dutt, L. Greengard, and V. Rokhlin. Spectral deferred
correction methods for ordinary differential equations. BIT,
40(2):241–266, 2000.10. M. Emmett and M.L. Minion. Toward an efficient parallel in
time method for partial differential equations. Comm. Appl.
Math. Comp. Sci., 7(1):105–132, 2012.11. R. Filgueira, D.E. Singh, J. Carretero, A. Calderon, and
F. Garcıa. Adaptive-compi: Enhancing MPI-based applica-tions’ performance and scalability by using adaptive com-
pression. The International Journal of High Performance
Computing Applications, 25(1):93–114, 2011.12. M. Gander. 50 years of time parallel time integration. In
T. Carraro, M. Geiger, S. Korkel, and R. Rannacher, ed-itors, Multiple Shooting and Time Domain Decomposition
Methods, volume 9 of Contributions in Mathematical and
Computational Sciences, pages 69–113. Springer, 2015.13. M.J. Gander and E. Hairer. Nonlinear convergence analy-
sis for the parareal algorithm. In U. Langer, M. Discacciati,
D.E. Keyes, O.B. Widlund, and W. Zulehner, editors, Do-
main Decomposition Methods in Science and EngineeringXVII, pages 45–56. Springer Berlin Heidelberg, 2008.
14. M.J. Gander and M. Neumuller. Analysis of a new space-time
parallel multigrid algorithm for parabolic problems. SIAM
J. Sci. Comput., 38(4):A2173–A2208, 2016.15. M.J. Gander and S. Vandewalle. Analysis of the parareal
time-parallel time-integration method. SIAM J. Sci. Com-put., 29(2):556–578, 2007.
16. S. Gotschel and Weiser. M. Lossy compression for PDE-
constrained optimization: Adaptive error control. Comput.
Optim. Appl., 62:131–155, 2015.17. S. Gotschel, M. Weiser, and A. Schiela. Solving optimal con-
trol problems with the Kaskade 7 finite element toolbox. InA. Dedner, B. Flemisch, and R. Klofkorn, editors, Advances
in DUNE, pages 101–112. Springer, 2012.18. D. Guibert and D. Tromeur-Dervout. Parallel deferred cor-
rection method for CFD problems. In J.-H. Kwon, J. Periaux,P. Fox, N. Satofuka, and A. Ecer, editors, Parallel Compu-
tational Fluid Dynamics 2006: Parallel Computing and its
Applications, pages 131–136, 2007.19. J. Ke, M. Burtscher, and E. Speight. Runtime compression of
MPI messages to improve the performance and scalability of
parallel applications. In Supercomputing, 2004. Proceedings
of the ACM/IEEE SC2004 Conference, page 59, 2004.20. S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and
D. Glasco. Gpus and the future of parallel computing. IEEEMicro, 31(5):7–17, 2011.
21. T. Klatt, M. Emmett, D. Ruprecht, R. Speck, andS. Terzi. PFASST++. http://www.parallelintime.org/
codes/pfasst.html, 2015. Retrieved: 16 50, May 04, 2017(GMT).
22. S. Leyffer, S.M. Wild, M. Fagan, M. Snir, K. Palem,
K. Yoshii, and H. Finkel. Doing Moore with less - Leapfrog-ging Moore’s law with inexactness for supercomputing.
CoRR, abs/1610.02606, 2016.23. J.-L. Lions, Y. Maday, and G. Turinici. A parareal in time
discretization of pdes. C.R. Acad. Sci. Paris, Serie I,,
332:661–668, 2001.24. J. Liu, Y. Wang, and R. Li. A hybrid algorithm based on
optimal quadratic spline collocation and parareal deferred
correction for parabolic PDEs. Math. Probl. Eng., 2016. Ar-ticle ID 6943079.
25. E. McDonald and A.J. Wathen. A simple proposal for paral-
lel computation over time of an evolutionary process with im-plicit time stepping. In Proceedings of the ENUMATH 2015,
Lecture Notes in Computational Science and Engineering.
Springer, to appear.26. M.L. Minion. A hybrid parareal spectral deferred correc-
tions method. Comm. Appl. Math. Comp. Sci., 5(2):265–301, 2010.
27. A.S. Nielsen, G. Brunner, and J.S. Hesthaven.
Communication-aware adaptive parareal with applica-tion to a nonlinear hyperbolic system of partial differential
equations. EPFL-ARTICLE 228189, EPFL, 2017.
28. A. Nunez, R. Filgueira, and M.G. Merayo. SANComSim: Ascalable, adaptive and non-intrusive framework to optimize
performance in computational science applications. Procedia
Computer Science, 18:230 – 239, 2013.29. T. Palmer. Build imprecise supercomputers. Nature, 526:32–
33, 2015.
30. D. Ruprecht. Wave propagation characteristics of parareal.Preprint arXiv:1701.01359v1 [math.NA], 2017.
31. K.P. Saravanan, P.M. Carpenter, and A. Ramirez. A per-formance perspective on energy efficient hpc links. In ICS
’14 Proceedings of the 28th ACM international conference
on Supercomputing, pages 313–322, 2014.32. A. Srinivasan and N. Chandra. Latency tolerance through
parallelization of time in scientific applications. Parallel
Computing, 31:777–796, 2005.33. A. Toselli and O.B. Widlund. Domain Decomposition Meth-
ods – Algorithms and Theory, volume 34 of Computational
Mathematics. Springer, 2005.34. M. Weiser. Faster SDC convergence on non-equidistant grids
by DIRK sweeps. BIT Numerical Mathematics, 55(4):1219–
1241, 2015.35. M. Weiser and S. Gotschel. State trajectory compression for
optimal control with parabolic PDEs. SIAM J. Sci. Comput.,34(1):A161–A184, 2012.
36. S.-L. Wu and T. Zhou. Convergence analysis for three
parareal solvers. SIAM J. Sci. Comput., 37(2):A970–A992,2015.