wavefield compression for adjoint methods in full-waveform...

Wavefield compression for adjoint methodsin full-waveform inversion

Christian Boehm1, Mauricio Hanzich2, Josep de la Puente2, and Andreas Fichtner1

ABSTRACT

Adjoint methods are a key ingredient of gradient-basedfull-waveform inversion schemes. While being conceptuallyelegant, they face the challenge of massive memory require-ments caused by the opposite time directions of forward andadjoint simulations and the necessity to access both wave-fields simultaneously for the computation of the sensitivitykernel. To overcome this bottleneck, we have developedlossy compression techniques that significantly reduce thememory requirements with only a small computational over-head. Our approach is tailored to adjoint methods and usesthe fact that the computation of a sufficiently accurate sen-sitivity kernel does not require the fully resolved forwardwavefield. The collection of methods comprises reinterpo-lation with a coarse temporal grid as well as adaptivelychosen polynomial degree and floating-point precision to re-present spatial snapshots of the forward wavefield on hier-archical grids. Furthermore, the first arrivals of adjointwaves are used to identify “shadow zones” that do not con-tribute to the sensitivity kernel. Numerical experiments showthe high potential of this approach achieving an effectivecompression factor of three orders of magnitude with onlya minor reduction in the rate of convergence. Moreover, it iscomputationally cheap and straightforward to integrate infinite-element wave propagation codes with possible exten-sions to finite-difference methods.

INTRODUCTION

Adjoint methods offer a powerful tool to solve problems in full-waveform inversion (FWI). They are essential to efficiently computethe first and second derivatives of the misfit functional and have been

applied successfully to solve tomography problems onmany differentscales (e.g., Igel et al., 1996; Pratt et al., 1998; Fichtner et al., 2009;Virieux and Operto, 2009; Tape et al., 2010; Zhu et al., 2012; Rickerset al., 2013; Méivier et al., 2013). However, the time-reversed natureof the adjoint simulations introduces a severe bottleneck by the ne-cessity to access forward and adjoint wavefields simultaneously dur-ing the computation of sensitivity kernels. Hence, in addition torequiring a huge amount of computational resources, FWI on 3D datasets also demands massive storage capabilities.In general, there exist two opposing strategies to deal with this

challenge that show a tradeoff between memory requirements andincreasing the computational cost. On the one hand, saving thewhole forward wavefield to the disk yields a significant overheadto store and process the 4D space-time evolution of the seismicwavefield. Moreover, because this can easily exceed tens or hun-dreds of terabytes, it necessitates using the slow levels in memoryhierarchy and limitations of the memory bandwidth degrade the per-formance significantly, especially on GPU clusters. On the otherhand, checkpointing techniques (Griewank, 1992; Walther andGriewank, 2004) can trade an almost arbitrary amount of memoryrequirements for additional computations. Although the optimalplacement of the checkpoints has been studied and fine-tuned toits application in reverse time migration and seismic tomography(Symes, 2007; Anderson et al., 2012), this approach yields a sig-nificant computational overhead, which is typically in the order ofone additional forward simulation and can only be reduced at thecost of increasing the memory requirements. A related strategy isproposed by Tromp et al. (2008) and requires storing only the boun-dary values and the final state of the forward wavefield. In a secondstep, an auxiliary equation — the so-called backward wave equa-tion — is solved during the adjoint run to propagate the savedforward wavefield in reversed time direction. Although its computa-tional overhead is comparable to checkpointing techniques, this ap-proach is limited to symmetric time stepping schemes and whenattenuation in the Earth is weak.

Manuscript received by the Editor 23 November 2015; revised manuscript received 28 April 2016; published online 13 September 2016.1ETH Zurich, Department of Earth Sciences, Switzerland. E-mail: [email protected]; [email protected] Applications in Science and Engineering, Barcelona Supercomputing Center — Centro Nacional de Supercomputación, Spain. E-mail: mauricio.

[email protected]; [email protected].© 2016 Society of Exploration Geophysicists. All rights reserved.

R385

GEOPHYSICS, VOL. 81, NO. 6 (NOVEMBER-DECEMBER 2016); P. R385–R397, 9 FIGS., 6 TABLES.10.1190/GEO2015-0653.1

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

http://crossmark.crossref.org/dialog/?doi=10.1190%2Fgeo2015-0653.1&domain=pdf&date_stamp=2016-09-13

In summary, storing the whole forward wavefield is prohibitivelyexpensive for large data sets and potential remedies such as check-pointing introduce a significant computational overhead. Hence, theamount of auxiliary data that has to be transferred to and from diskand related input/output (I/O) operations are key challenges for FWI.In this paper, we propose an alternative approach that strikes a

balance between memory requirements and the need for additionalcomputations. This is based on techniques for the lossy compres-sion of the forward wavefield that fulfill the following importantrequirements:

1) The error in the inexact gradients can be controlled such that thelossy compression does not significantly affect the invertedresults.

2) The I/O overhead is reduced substantially without the need forextra simulations.

3) The computational cost for compressing/decompressing thewavefield is cheap compared to solving the elastic waveequation.

4) Using approximate sensitivity kernels resulting from the com-pressed wavefield does not significantly slow down the rate ofconvergence of an iterative minimization scheme to solve theinverse problem.

Our method combines (1) storing coarse temporal snapshots andreinterpolation with a sliding window cubic spline, (2) spatial com-pression based on an adaptively chosen floating-point precision andlocal representation of the wavefield by lower dimensional polyno-mials on hierarchical grids, and (3) identification of “shadowzones,” where the forward and adjoint wavefields do not overlap.Related ideas have been proposed in previous work by Unat et al.

(2009), Weiser and Götschel (2012), Hanzich et al. (2013), and Göt-schel and Weiser (2015), achieving compression factors between 8and 213 for different applications. However, we can obtain signifi-cantly higher compression factors by tailoring the methods to thecomputation of sensitivity kernels in high-order finite-elementmethods for time-domain FWI.Alternative compression strategies based on the discrete cosine

transform have been proposed and studied by Hanzich et al.(2014) and Unat et al. (2009). A rigorous comparison with wave-let-based compression methods is beyond the scope of this paper.Intermediate results indicate, however, that wavelet-based methodstend to achieve a slightly higher compression factor given a fixedaccuracy. However, the methods presented in this paper have a sig-nificantly smaller computational overhead for compression and de-compression which makes them a particularly favorable alternativeto checkpointing techniques.This paper is organized as follows: First, we develop error esti-

mates for inexact sensitivity kernels that result from a (de)com-pressed forward wavefield. Then, we describe the compressiontechniques in detail and conclude with numerical examples inthe last section.

ITERATIVE INVERSION SCHEMES WITHINEXACT GRADIENTS

In this section, we briefly recall nonlinear minimization methodsto solve inverse problems and introduce important notation. Aclassical Newton-type line-search method iteratively updates themodel m by setting

mkþ1 ¼ mk þ αksk; (1)

with a search direction sk, along which the misfit decreases, and asuitable step length αk > 0, until the sequence mk converged to a(local) minimum. sk is obtained by minimizing a quadratic approxi-mation of the misfit functional χ around the current model mk,which gives

sk ¼ −Bkgk; (2)

where gk denotes the gradient of χ evaluated at mk and Bk is a pos-itive definite matrix that approximates the inverse Hessian of χ atmk. Typical choices for Bk are given by the BFGS or limited-memory BFGS (LBFGS) method (Nocedal and Wright, 2006),and for the particular case of using the identity operator as Bk, weobtain the steepest descent method.To compute search directions to update the model, we require the

Fréchet derivative of χ with respect to m. This sensitivity can beefficiently computed using adjoints. Here, it is important to notethat all derivatives with respect to different structural parameterssuch as density, bulk, or shear moduli, have a similar structure thatinvolves multiplication of forward and adjoint velocity fields orstrains, respectively, and integration in time. For instance, in caseof the elastic wave equation for isotropic media, the Fréchet deriv-atives with respect to density ρ and shear modulus μ are given by

KρðxÞ ¼ −Z

T

0

utðx; tÞ · u†t ðx; tÞdt; (3)

KμðxÞ ¼Z

T

0

εðuÞðx; tÞ∶εðu†Þðx; tÞdt; (4)

where u† is the adjoint wavefield and εðuÞ and εðu†Þ denote thestrain tensors of forward and adjoint fields, respectively. More gen-erally, the Fréchet derivatives with respect to structural parametershave the following form:

KðxÞ ¼Z

T

0

ðDuÞðx; tÞ · ðDu†Þðx; tÞdt; (5)

where D is a first-order differential operator that extracts spatial ortemporal derivatives of the wavefield. Transforming K to the modelspace then yields the gradient g. For further examples see Fichtner(2011) or Tromp et al. (2005). As mentioned before, the oppositetime directions of forward and adjoint simulations cause the mainchallenge of computing the Fréchet derivatives.In what follows, the key idea is to compress Du during the for-

ward simulation and to decompress it again during the adjoint run; i.e., we replace Du in equation 5 by an approximate field fDu ≈ Du.This yields an approximate gradient ~gk and — in case of a quasi-Newton method — also a modified ~Bk in equation 2. Althoughonly the forward wavefield is compressed, all derived quantities thatinvolve fDu will be denoted with ~· to indicate the inexactness.Note that we substitute fDu for Du only during the computation

of the sensitivity kernel, i.e., after the forward run. Hence, the com-pression affects neither the calculation of the misfits nor the adjointsources. This implies, in particular, that also the adjoint state u† hasthe full accuracy and does not introduce an additional compres-sion error.

R386 Boehm et al.

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

As pointed out above, we aim for a good approximation of thesensitivity kernel rather than an accurate reconstruction of the for-ward wavefield itself. In the next step, we show that by carefullycontrolling the approximation error, an inexactly computed sensi-tivity kernel is sufficient to ensure convergence of an iterative min-imization method to solve the inverse problem. Therefore, we haveto ensure that ~gk ≈ gk and also ~sk ≈ sk. In particular, we require thatall ~sk are directions along which χ decreases. Using the structure ofthe sensitivity kernels as indicated in equation 5, we can bound theabsolute error in the gradient for each structural parameter field bythe estimate

k ~g − gkL∞ ≤ const ·Z

T

0

kðfDuÞðx; tÞ − ðDuÞðx; tÞkL2

· kðDu†Þðx; tÞkL2dt: (6)

Here, the particular choice of the norm depends on the finite-element discretization and would typically correspond to some dis-crete Sobolev space norm, see Boehm and Ulbrich (2015) for detail.From inequality 6, we derive that the error in the gradient isbounded by the approximation error of the compressed forwardwavefield kðfDuÞðx; tÞ − ðDuÞðx; tÞk. Controlling the latter at thetime of compression is key to our method. An estimate of the ad-joint state would additionally allow for time-dependent compres-sion thresholds, but such estimate is not yet available during theforward simulation.Intuitively, we would expect similar model updates if the error

k~g − gk is small. More precisely, we can ensure convergence ofthe iterative minimization scheme if the inexactly computed searchdirections ~sk satisfy the so-called angle condition; i.e.,

ðgk; ~skÞ ≤ −βkgkk · k~skk; (7)

holds with some β > 0 for all k, see Nocedal and Wright (2006).The geometric interpretation of inequality 7 is that the search direc-tions ~sk enclose an angle of strictly less than 90° with the exact neg-ative gradient gk at the current iterate. This ensures, in particular,that all ~sk are directions of descent.For the steepest descent method, we have ~sk ¼ ~gk and it can

easily be shown that the angle condition is satisfied if the relativeerror of the inexact gradient is smaller than 0.5. Indeed, ifk~gk − gkk < δk~gkk for some δ ∈ ½0; 0.5Þ, we obtain

ðgk; ~skÞ ¼ ðgk;− ~gkÞ ≤ k ~gk − gkk · kgkk − kgkk2≤ δk ~gkk · kgkk − ð1 − δÞk ~gkk · kgkk; (8)

which gives the angle condition, inequality 7, for the steepestdescent method with β ¼ 1 − 2δ. Now, we can use the estimatefrom inequality 6 to ensure that the error in the approximate gra-dient is sufficiently small, i.e., k~gk − gkk < δk~gkk.In the case of a quasi-Newton method such as (L)BFGS, the in-

exact search direction ~sk computed with the approximations ~Bk and~gk can be interpreted as a solution to a perturbed version of the lin-ear system in equation 2. Here, the Dennis-Moré conditions (Dennisand Moré, 1977) have to be satisfied to ensure a superlinear rate ofconvergence.Before introducing the compression methods, we should discuss

three main challenges when using inexact gradients in an iterativedescent scheme. First, convergence to the same local minimum as

with exact gradients cannot be guaranteed. Second, the rate of con-vergence might be reduced; i.e., it might take more iterations to ob-tain the same reduction in the objective function. Furthermore,depending on the level of inexactness, the approximated gradientmight not be a direction of descent and iterations may get stuck.With the above derivation using the angle condition and the error

bound from inequality 6, convergence of the inexact LBFGSmethod to a local minimizer can be guaranteed provided that theinexactness of the gradient is sufficiently small. Of course, withoutany further assumptions on the problem structure or the initialmodel, it is in general neither possible to ensure convergence towarda global minimizer nor toward the same local minimizer. This is acommon limitation of nonconvex optimization problems, which ap-plies to all iterative descent schemes. However, there is no guaranteethat the iterations using exact gradients will converge to a betterminimum.To address the second and third challenges, it is important to en-

sure that more accurate gradients are used once a local minimizer isapproached. This is achieved by adaptively choosing the compres-sion criteria, which will be introduced in the next session. Thereby,we can guarantee that the accuracy of the approximate gradient im-proves when its norm decreases. With carefully chosen compressionthresholds, we expect that all variants of descent methods tailored toFWI can also be applied successfully with inexact gradients.

LOSSY COMPRESSION TECHNIQUES

In this section, we present several methods for lossy compressionof the forward wavefield. All of the following methods require suit-able criteria to steer the quality of the compression. For the spatialcompression, we recall from inequality 6 that the error in the gra-dient can be bounded by kðfDuÞðx; tÞ − ðDuÞðx; tÞk, which gives alocalized quantity that can be computed at the time of compression.Hence, we propose using thresholds on the absolute and relativepoint-wise difference between the decompressed and the fullyresolved wavefield. More specifically, for given tolerances εabs2 >εabs1 > 0 and εrel, we will ensure that one of the following condi-tions holds for all snapshots of the decompressed wavefield:

1) C-1: The maximum absolute error is smaller than εabs1 .2) C-2: The maximum relative error is smaller than εrel and the

maximum absolute error is smaller than εabs2 .

Specific configurations of these thresholds can be found in the sec-tion on numerical results.In the following, we use the term “forward wavefield” in a

generic way, which is motivated by the fact that the techniquesapply to all kinds of vector fields, e.g., displacements, velocities,strains, pressure, or velocity divergence. Furthermore, we denotethe wavefield by u instead of Du to simplify the notation.

Field requantization

As a first method for the spatial compression of the wavefield, wepropose to use an adaptive number of bits to represent the values atthe grid points; see Hanzich et al. (2013) for prior work in this field.The main idea is to adapt the floating-point precision for storing thewavefield to the local amplitudes. This enables us to represent val-ues by fewer bits in areas with a small range of amplitudes and,consequently, to reduce the memory requirements. We borrow theterm quantization from image processing, which refers to a transfer

Wavefield compression for FWI R387

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

function that maps a continuous interval of values ½vmin; vmax� onto afinite set fv0; v1; v2; : : : ; vkg with increments of Δv ¼ viþ1 − vi.Because all entities are already quantized in our case — for in-stance, in single precision floating-point format — and onlythe number of discrete values will change, we use the term requan-tization.Compressing and decompressing the wavefield mean to trans-

form the higher precision representation into the lower precisionone and vice versa, which are summarized in Algorithms 1 and2, respectively. In addition to store the values at all grid points withthe reduced number of bits, the offset value uo and the spacing s ¼Δv need to be stored. Due to this overhead, we cannot pick thefloating-point precision individually for every grid point. Instead,we divide the computational domain Ω into smaller subdomainsΩi, i ¼ 1; : : : ; K. Within every subdomain, we use a constant num-ber of bits to store the values of its grid points and, thus, only twoadditional values for offset and spacing are required per subdomain.Hence, the size of the subdomains should be not only small enoughto take advantage of the locally varying amplitudes of the wavefield,but also large enough such that the memory overhead introduced byoffset and spacing is not significant. For instance, in a spectral-element code, the finite elements — with 125 grid points forfourth-order shape functions in 3D — can serve as partitioning,but larger subdomains can be used as well.From Algorithms 1 and 2, we directly deduce that the point-wise

compression error is bounded by

j ~uj − ujj ¼ juoi þ si · uj − ujj

¼��uoi − uj þ si

�ðuj − uoi Þsi

þ 0.5

�� ≤ 0.5si: (9)

Hence, it is straightforward to apply the criteria (C-1) and (C-2)presented at the beginning of this section to determine the requirednumber of bits in every subdomain.

p-Coarsening

This second technique for the spatial compression of the wave-field is tailored specifically to spectral-element methods, which arewidely used in numerical wave propagation codes such as SPEC-FEM3D (Peter et al., 2011) or SES3D (Gokhberg and Fichtner,2016). Here, the wavefield is spatially represented by higher ordershape functions, most commonly with polynomials of order 4. Notethat we restrict the following presentation to fourth-order polyno-mials, but the method can easily be extended to other high-orderfinite-element methods. A straightforward way to approximatethe wavefield with fewer degrees of freedom is to adaptively reducethe polynomial degree within the spectral elements. This requires todownsample the wavefield locally onto a lower dimensional space.Based on similar concepts for adaptive mesh refinement in finite-element methods, we call this approach p-coarsening.Spectral-element methods represent the wavefield on a single

element by the Galerkin projection of the following form:

uðx; tÞ ¼X4i;j;k¼0

uijk4 ðtÞψ ijkðxÞ; (10)

with shape functions ψ ijk and time-dependent coefficients uijk4 ðtÞthat approximate the wavefield at the grid points. Here, ψ ijk arethe tensorized Lagrange polynomials of degree 4 with control pointsgiven by the Gauss-Lobatto-Legendre (GLL) quadrature rule. Asthe number of collocation points quickly increases from 8 (degree1) to 27 (degree 2) and 125 (degree 4), a lower polynomial degreereduces the memory requirements significantly.In the following, we denote the fourth-order coefficients by uijk4

and the lower dimensional representation of u using polynomials oforder p ∈ f0; 1; 2; 3; 4g by up. Here, u0 is a constant function de-fined by the average value of the wavefield at the collocation pointsof p ¼ 1. The projection of u4 onto the lower dimensional subspacecan be computed by solving a least-squares problem, which requiresthe solution of a linear system with ðpþ 1Þ3 unknowns. As acheaper alternative, we can determine up by evaluating u4 at thelower order collocation points. Moreover, we can exploit the hier-archical structure of the GLL points (see Figure 1) which means thatthe projection requires only memory access to the correspondingindices of u4 but no interpolation for p ∈ f0; 1; 2; 4g. Because thisapproach does not incorporate information from the higher-ordercollocation points, this approximation is slightly worse comparedto the least-squares projection, but it proved to be acceptable inour numerical tests.The local polynomial degree is determined based on the criteria

(C-1) and (C-2) introduced at the beginning of this section. Toestimate the pointwise errors and to decompress the wavefield,we need to plug in up into equation 10 and to evaluate u at thefourth-order collocation points.

Algorithm 1. Requantization and compression.

Require: uncompressed raw wavefield u, subdomains Ωi, vector ofbit-resolutions b

1) for all Ωi do

2) Set offset uoi ¼ minfuðxÞ∶x ∈ Ωig.3) Set spacing si ¼ 1

2bi−1 · ðmaxfuðxÞ∶x ∈ Ωig − uoi Þ.4) for all xj ∈ Ωi do

5) Set uj ¼ bðuj−uoi Þsiþ 0.5c.

6) end for

7) end for

Ensure: compressed wavefield u, vector of offsets uo andspacings s.

Algorithm 2. Requantization and decompression.

Require: Compressed wavefield u, offsets uo, spacings s.

1) for all Ωi do

2) for all xj ∈ Ωi do

3) ~uj ¼ uoi þ si · uj4) end for

5) end for

Ensure: Decompressed wavefield ~u.

R388 Boehm et al.

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

Requantization and p-coarsening on hierarchical grids

Although requantization and p-coarsening can be used independ-ently, a combination of both achieves the highest compression rate.The main idea is to use a hierarchy of grids on which the lower orderinformation acts as a predictor for the values of the next finer gridlevel. More specifically, on each level, we compute the differencebetween the values predicted by the lower order shape functions andthe actual values of the wavefield and requantize these residuals.This is conceptually similar to MPEG video compression, see Sul-livan and Wiegand (2005). For spectral-element methods, we ex-ploit again the hierarchy of the GLL points and denote the setof collocation points on level l as Pp½l� with p ¼ ð0; 1; 2; 4Þ. Hence,we have Pp½l� ⊂ Pp½lþ1� and P0 ¼ ∅. Now, we use Lagrange poly-nomials of order p½l� and interpolating Pp½l�, to predict the values atthe points in Pp½lþ1�. Algorithm 3 describes the compression in de-tail. The error bound given by inequality 9 determines the numberof bits that are required to requantize the residuals at the collocation

points of the next higher level such that criteria (C-1) and (C-2) aresatisfied, see line 5 in Algorithm 3. This number is determined inde-pendently for each level. Note that we use the approximated values~uijkp½l� from the previous level instead of the actual wavefield uijkp½l� inline 7 in Algorithm 3. Hence, we only use information that is avail-able at the time of decompression. This ensures that the approxi-mation error does not increase if we move to higher levels andwe indeed have control over point-wise absolute and relative errorsafter decompression.The procedure is described in Figure 2 for a 2D element. Note

that in 3D, the number of collocation points rapidly increases from 8(P1 \ P0) to 19 (P2 \ P1) and 98 (P4 \ P2). Furthermore, the mag-nitude of the residuals typically decreases from level to level, whichyields high compression factors.This approach is straightforward to use with finite-element meth-

ods, but it is important to note that the patch-wise partitioning of themesh can be done independently of the discretization scheme. We

Figure 1. Hierarchical order of the GLL points in 2D (same struc-ture as in 3D). For degrees 1, 2, and 4, every collocation point of thelower order polynomial is also a collocation point of the higher or-der polynomials.

Algorithm 3. Combined requantization and p-coarsening.

Require: uncompressed raw wavefield on a subdomain, thresholdsεabs1 , εabs2 , and εrel.

1) Set ~u0 ¼ u0.

2) for l ¼ 0; 1; 2 do

3) for all xijk ∈ Pp½lþ1� \ Pp½l� do4) Set rijk ¼ u4ðxijkÞ − ~up½l�ðxijkÞ.5) Determine number of bits bl to requantize rijk based on

εabs1 , εabs2 , and εrel.

6) Apply Algorithm 1 to requantize rijk and obtain rijk andsl.

7) Set ~uijkp½lþ1� ¼ ~uijkp½l� þ rijk.

8) end for

9) end for

10) Set u ¼ r.

Ensure: compressed wavefield u, spacings sl and number of bits blfor every level.

−10

1−1

01

0

0.4

0.8

Degree 4

Am

plitu

de

−10

1−1

01

0

0.4

0.8

Am

plitu

de

Degree 0

−10

1−1

01

0

0.5

Residuals: Degree 0 → 1

−10

1−1

01

0

0.4

0.8

Degree 1

Am

plitu

de

−10

1−1

01

0

0.5


−10

1−1

01

0

0.4

0.8

y

Degree 2

x

Am

plitu

de

−10

1−1

01

0

0.5

y


x

Requantized residualsHigher order residuals

Figure 2. A 2D sketch of requantization for a reference element of½−1; 1� × ½−1; 1� on a hierarchy of grids. The current interpolant pre-dicts the values at the collocation points of the higher polynomialdegree. Only the residuals of the next level are requantized (bluelines). These residuals are then used together with the current in-terpolant to construct the prediction on the next higher level. Thisprocedure is repeated for degrees 0, 1, and 2.


Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

will comment on possible extensions to finite-difference methods inthe discussion.

Temporal compression

Explicit time stepping schemes, which are widely used in seismicwave propagation codes, require much shorter time steps due to theCourant–Friedrichs–Lewy (CFL) condition than the sample ratepredicted by the Nyquist-Shannon theorem. This motivates usinga coarser temporal grid for storing the wavefield, and FWI codestypically save only every kth time step with k in the order of 10to 30; see Fichtner et al. (2009) for a derivation of the sampling rate.Instead of using a piecewise constant extrapolation of the forward

wavefield during the adjoint run, however, we propose a cubicspline interpolation to capture the smoothness of the signal in time.This comes at low computational cost but improves the approxima-tion significantly and enables us to reduce the sampling rate of thesnapshots. Because it would be prohibitively expensive to retrieveand access all snapshots simultaneously, we suggest to use a slidingwindow of four consecutive snapshots to sequentially compute thespline interpolant. This procedure is illustrated in Figure 3. Thewavefield in the central subinterval between snapshots 2 and 3 isapproximated with the cubic spline. Afterward, the oldest snapshotis released and the next one is retrieved from disk to continue withthe reconstruction in the next subinterval. Here, spatial decompres-sion is carried out in a first step. Afterward, the spline coefficientsare determined by solving a tridiagonal linear system, which can bedone analytically because only two unknowns remain after applyingnatural boundary conditions. Although every grid point requires itsown cubic spline and spatial dependencies are not taken into ac-count, the tridiagonal matrix only depends on the time increments

between the snapshots, which makes the computation of the splinecoefficients comparatively cheap. The evaluation of the spline atintermediate time steps requires three summations and three multi-plications per grid point. Note that we can further reduce the com-putational overhead by evaluating the spline only every second orthird time step when calculating the sensitivity kernel.

Adjoint shadow zones

The Fréchet derivatives with respect to different structural param-eters have in common that the forward wavefield, or its temporal orspatial derivatives, respectively, is multiplied with the correspond-ing adjoint field, see equation 5. Hence, only time steps at whichforward and adjoint waves locally overlap contribute to the sensi-tivity kernel. In particular, there is no need to store any informationlocally at time steps prior to the first arrival of the forward wavefieldor later than the first arrival of the adjoint waves. This observation isremarkably useful in the latter case because it works independentlyof the energy and the amplitudes of the forward wavefield. Thus, itenables us to disregard parts of the forward wavefield, which wouldnot have been identified by the previously discussed methods.Moreover, this compression is lossless with regard to the sensitivitykernel. Figure 4 visualizes the evolution of the shadow zones for asingle source-receiver pair in a 2D domain.As has been pointed out above, the shadow zones require lower

bounds on the arrival times of forward and adjoint waves for everysubdomain. This can be done in a preprocessing step prior to theinversion, which computes the respective first and last time steps foreach of the subdomains. As a rough estimate for the last time step atwhich the forward wavefield needs to be stored locally, we computethe minimum distance of the subdomain to any seismic station and

divide it by the maximum P-wave velocity. Ofcourse, more sophisticated methods can be ap-plied to estimate the travel times. For instance,we could run a single adjoint simulation andtrack the first arrivals of the adjoint waves to ob-tain a more accurate prediction. The same tech-nique can also be applied for the first arrivals ofthe forward wavefield.

RESULTS

In this section, we present numerical examplesfor our compression methods for a single param-eter gradient and compress the strain field duringthe forward run. All numerical tests have beencarried out on Piz Daint at the Swiss NationalSupercomputing Center (CSCS). Details on theimplementation of the spectral-element codecan be found in Boehm (2015). All computationsare in double-precision floating-point arithmetic.Clearly, the actual compression factor depends

on the geometry of the domain, the locations ofsources and receivers, and the underlying veloc-ity model. However, most of them share thecommon property of sources and receivers lo-cated at or near the surface and a velocity profilethat increases with depth. Hence, the results pre-sented in this section provide a good indicationof the capabilities of the method. For simplicity,

−20

0

20 Decompressed signal

Am

plitu

de

0 1 2 3 4 5 6

−20

0

20 Interpolation error

Am

plitu

de

Time (s)

3 3.1 3.2

−20

0

20

Time (s)

Am

plitu

de True signal Cubic spline Piecewise constant Snapshots

a)

b)

c)

Figure 3. Cubic spline interpolation using a sliding window with four consecutive snap-shots. (a) Reconstructing signals using snapshots at the red markers with either a piece-wise constant extrapolation (dashed line) or a cubic spline (blue line). (b) Interpolationerrors for both methods. (c) Zoom to sliding window. Snapshots at the four time stepswith the filled red markers determine a cubic spline that we use to interpolate the signalin the central subinterval.

R390 Boehm et al.

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

we always use the L2-norm as misfit functional, but, of course, moresophisticated measures can be used as well. Furthermore, we applya free-surface condition at the top boundary and dashpot absorbingboundary conditions (Epanomeritakis et al., 2008) on all other facesof the computational domain.We use two different criteria to assess the quality of the inexactly

computed gradient using the compressed wavefield. On the onehand, we consider the angular difference θ that denotes the angleenclosed by exact and inexactly computed gradients; see the con-dition given by inequality 7:

cos θ ¼ ð ~g; gÞk ~gk · kgk : (11)

If the angular difference is small, the gradient computed withcompressed information points into a similar direction as the fullyresolved gradient, and, thus, the model update will be similar, too.Furthermore, we measure the difference between both gradients bythe structural similarity index (SSIM) developed by Wang et al.(2004) which is widely used in image processing. The SSIM rates

the perceived similarity of images with values between zero and onewith one being the best score. It is important to note that all exam-ples use the consistent discrete gradient as result of the adjoint sim-ulation without any modification (e.g., clipping or smoothing). Inparticular, even higher compression factors can be achieved if wecompare smoothed gradients.

Example 1: Spherical wave problem

As a first example, we consider a simple spherical wave problemon a 3D domain of 20 × 20 × 18 km with a homogeneous elasticmaterial ρ ¼ 2000 kg∕m3, vp ¼ 2500 m∕s, and a Poisson ratio of0.25. The simulation time is 15 s. Furthermore, we use a singlesource-receiver pair at 6 km depth with a horizontal distance of10 km. The source-time function is a Ricker wavelet with a dom-inant frequency of 3 Hz and we reinject the recorded displacementsas adjoint source.To assess the quality of our methods, we compare sensitivity ker-

nels that result from different configurations of the compressionthresholds with the exact kernel. A detailed quantitative analysisis given in Table 1. Here, we indicate the effective compression fac-tor that we define as the ratio of the total memory required for the

Figure 4. Time evolution of adjoint shadow zones for a single source-receiver pair on a 12 × 20 km domain. The rows depict the curl (red) anddivergence (blue) of the forward and the adjoint wavefield as well as the interaction field, i.e., the product of both wavefields, which gives theinstantaneous contribution to the sensitivity kernel. The gray-shaded areas indicate the shadow zones, where either the forward or the adjointwavefield is zero. Obviously, also the interaction field is zero in these areas and there is no need to store the wavefield. The right column showsthe current memory usage of the compressed wavefield relative to storing all values with full precision.


Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

uncompressed wavefield and the total memory required by the com-pressed wavefield. Furthermore, we show the minimum instantane-ous spatial compression factor that varies between 8 and 16 in allconfigurations and mainly depends on εabs1 . The quality of theapproximation is measured by the angular difference and the SSIM.In addition, Figure 5 shows vertical slices through the center of thedomain.The memory requirements can be reduced by three orders of

magnitude without hardly any differences in the visual appearanceor the direction of the gradient. As expected, the quality of the ap-proximated gradient declines with an increasing compression factor.However, even with a factor of 4714, we obtain an approximatekernel that is at least qualitatively similar and, more importantly,that is still a direction of descent with an angular difference ofroughly 46°. Note that gzip (version 1.3.12 with option –9 forthe highest compression) as a standard black-box loss-less compres-sion software yields only a compression factor of 1.24 for this par-

ticular wavefield. This shows the necessity and also the great benefitof tailoring lossy compression methods to their application in FWI.Figure 6 shows the temporal evolution of the spatial compression.

Throughout the simulation, we achieve at least an instantaneousspatial compression factor of 15.8. Furthermore, the increase ofthe compression factor toward the end of the simulation indicatesthe effect of the adjoint shadows.Table 2 states the overhead in CPU time that is required to com-

press the forward strain field and to decompress it again during theadjoint run. The total overhead adds only about 2%–10% to the costof the forward simulation itself. Hence, the compression methodsproposed in this paper are much less expensive than checkpointingtechniques or solving the backward wave equation, which wouldtypically require additional computations in the order of one for-ward simulation. Moreover, the actual overhead in wall-clock timethat is needed for compression and decompression is typically evenless, as these tasks can be carried out in separate threads overlapping

Table 1. Statistics for various configurations of the compression algorithm. εabs1 , εabs2 , and εrel refer to the spatial compressionthresholds and sr denotes the sampling rate of the spline snapshots. cf is the overall compression factor, miscf the minimuminstantaneous spatial compression factor, θ the angular difference between exact and approximated gradient. Columns 9–11indicate the SSIM for slices through the center of the domain and orthogonal to each of the coordinate axis and the last columnstates the overhead in CPU time — relative to a forward simulation without I/O — for spatial compression anddecompression and the computation of the spline coefficients.

εabs1 εabs2 εrel sr cf miscf θ SSIMx SSIMy SSIMz ohd

(1) 0.001 0.1 0.005 30 544 8.6 3.1 0.9991 0.9983 0.9998 9.26%

(2) 0.01 0.1 0.02 30 1043 16.0 3.1 0.9991 0.9983 0.9998 9.09%

(3) 0.01 0.5 0.1 50 1708 15.8 19.7 0.9497 0.9638 0.9164 5.95%

(4) 0.001 1 0.005 125 2100 8.2 25.2 0.8967 0.9099 0.8499 2.64%

(5) 0.01 1 0.1 75 2506 15.6 33.0 0.8976 0.8015 0.9400 4.27%

(6) 0.01 10 0.3 100 3277 15.5 39.0 0.8843 0.6808 0.9191 3.28%

(7) 0.01 10 0.3 150 4714 15.1 46.4 0.5975 0.8909 0.5334 2.28%

–1 0 1

w/o compression

0

6

12

Dep

th(k

m) (1)

cf-544

(2)

cf-1043

(3)

cf-1708

(4)

cf-21000 5 10 15 20

0

6

12

Length (km)

Dep

th(k

m) (5)

cf-25060 5 10 15 20

Length (km)

(6)

cf-32770 5 10 15 20

Length (km)

(7)

cf-47140 5 10 15 20

Length (km)

Figure 5. Gallery of normalized sensitivity kernels with a compressed forward wavefield for the configurations of the thresholds indicated inTable 1. The top left image shows the sensitivity kernel without compressing the forward wavefield; the numbers in the lower right cornerindicate the compression factor. All images have been truncated at 12 km depth and show a vertical slice through the center of the domain thatcorresponds to SSIMx.

R392 Boehm et al.

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

with the simulation. Furthermore, Table 2 shows that the compres-sion factor and the computational overhead are inversely propor-tional. This is due to the fact, that a higher sampling raterequires fewer spline coefficients and fewer snapshots to be spa-tially compressed and decompressed. In addition, higher compres-sion thresholds not only increase the compression factor, but alsoallow for skipping more steps in the prediction-correction phase.

Example 2: 3D inversion in a cube

We consider a cubic domain of 4 × 4 × 4 km with a homo-geneous background medium with ρ ¼ 2000 kg∕m3 and vp ¼2500 m∕s, and a strong reflector near the bottom with vp ¼3000 m∕s. Again, we assume a constant Poisson ratio of 0.25and invert only for vp. The true model contains two ball-shapedanomalies with�10% deviations from the background model. Syn-thetic test data are generated by five point sources using a Rickerwavelet with a dominant frequency of 5 Hz. An array of 441 equi-distantly aligned receivers is placed below the surface. Figure 7ashows the geometry of the setup as well as the true velocity model.The initial model considered for the inversion contains the two

layers but not the ball-shaped anomalies. We run LBFGS severaltimes using either the fully resolved forward wavefield or differentsettings of the compression thresholds. Two reconstructions and theevolution of the misfit for several configurations are shown in Fig-ure 7. Even with an average compression factor of more than 3000per source, the iterations look very similar and converge to the samemodel. The models from scenarios 1 and 2 after 30 iterations ofLBFGS are visually indistinguishable from the reconstruction with-out compression. As expected, the performance of LBFGS declineswhen too much information is lost during the compression. How-ever, this can be circumvented by adaptively lowering the compres-sion thresholds and restarting the iterations. This is the case inscenario 4 where after 18 iterations, the inexact LBFGS search di-rection is not a direction of descent, see Figure 7b for details. Afterlowering the compression thresholds during the iterations for sce-narios 3 and 4, all tests eventually converge to the same model.In summary, LBFGS performs well using inexact gradients that

result from compressing and decompressing the wavefield. In par-ticular, neither the rate of convergence nor the reconstruction

changes significantly when the memory requirements are reducedby three orders of magnitude.

Example 3: SEG overthrust model

In this last example, we consider a more complex geology anduse a subset of the SEG overthrust model (Aminzadeh et al., 1996).Here, we use a domain of 8 × 8 × 3.2 km, where we extract the cen-tral parts of the model in the x- and y-directions. We consider anarray of 6561 receivers equidistantly distributed on a plane 8 m be-neath the surface with a lateral and longitudinal spacing of 100 m.We use two configurations for the source. Setup 1 (S1) is a singlepoint source in the center of the x-y-plane at 40 m depth. Setup 2(S2) is an encoded source with 6400 simultaneous sources equidis-tantly aligned on an array with a spacing of 100 m at 40 m depth.The encoding weights are chosen as independent samples of Rade-macher’s distribution, i.e., eitherþ1 or −1with equal probability. Inboth cases, we use a Ricker wavelet with a dominant frequency of10 Hz as source-time function and a simulation time of 6 s.We generate synthetic data with the SEG overthrust model and

evaluate the gradient using the L2-norm as misfit functional for twodifferent models. Model A (MA) is a 1D model that is constant inthe x-y-plane with the average value of the true model at that depth.For model B (MB), we apply a Gaussian smoothing filter to the truemodel, which yields a model that is already close to the true model.Figure 8 shows vertical slices of the true model, as well as models Aand B.Table 3 compares the accuracy of the gradient with respect to

both the Lamé coefficients for both source setups and both models.Similar to the previous examples, we achieve a compression factorof three orders of magnitude with only a small angular difference forthe point source. The 6400 simultaneously firing point sources insetup 2 create a more complicated wavefield. Hence, for fixed errorthresholds, the achievable compression factor is smaller than for asingle point source. However, for an angular difference of 20°–32°,the memory requirements can still be reduced by three orders ofmagnitude.Furthermore, Figure 9 shows that the inexact gradient still pre-

serves fine-scale information if the compression thresholds arechosen carefully. Here, we show slices of the inexact gradient withrespect to the shear modulus accumulated from 20 encoded sourceswith an average compression factor of 1364. Although the level ofdetail decreases with the increasing depth, some fine-scale struc-tures remain visible in the inexact gradient. This confirms thatthe compression methods are applicable to models with complex

0 5 10 150

50

100

150

200

Spa

tial c

ompr

essi

on fa

ctor

Time (s)

Temporal evolution

Figure 6. Evolution of the spatial compression factor during thesimulation for scenario 3 in Table 1. Note that the vertical axishas been truncated to a factor of 200. The spatial compression factortends to infinity toward the beginning and the end of the simulationdue to the shadow zones. The minimum instantaneous spatial com-pression factor is 15.8.

Table 2. CPU time of the different components of thecompression methods relative to one forward simulation.Results are shown for the different configurations of thecompression thresholds from Table 1.

(1) (2) (3) (4) (5) (6) (7)

Spatial compression (%) 3.80 3.66 2.43 1.10 1.74 1.35 0.93

Spatial decompression (%) 1.63 1.60 1.02 0.44 0.73 0.56 0.38

Spline coefficients (%) 3.83 3.83 2.49 1.10 1.80 1.38 0.96

Total overhead (ohd) (%) 9.26 9.09 5.95 2.64 4.27 3.28 2.28


Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

geology. With carefully chosen compression thresholds, the inexactgradients can be used with any iterative descent method. We willcomment on this in more detail in the next section.

DISCUSSION

The compression methods proposed in this paper offer a prom-ising alternative to checkpointing techniques. As indicated in theprevious section, sufficiently accurate approximations of the sensi-tivity kernel can be obtained with only a small computational over-head, adding only about 2%–10% to the CPU time of a forwardsimulation, which is significantly less than checkpointing methodseven if they are tailored to FWI (Anderson et al., 2012). Further-more, the practical performance is significantly better than withblack-box loss-less compression tools such as gzip that are not tail-ored to the application in FWI. Note that even higher compression

factors can be achieved if the sensitivity kernels are smoothed be-fore the model is updated.A notable feature of the compressed kernels displayed in Figure 5

is the successive disappearance of the higher Fresnel zones. Withincreasing compression factor, kernels resemble “fat rays,” pro-posed as a compromise between ray theory and accurate finite-fre-quency kernels that are more expensive to compute, see Husen andKissling (2001) and Yoshizawa and Kennett (2002). Higher Fresnelzones are relevant only when tomographic resolution is sufficient toensure that structure as small as their width can indeed be recovered(e.g., Van Der Hilst and De Hoop, 2005). This implies that com-pression factors may be further increased in regions of lower cover-age where only the first Fresnel zone contributes significantly toresolution.With an increasing resolution of the numerical mesh, even the

reduced memory requirements through compression might becomeprohibitively expensive at some point. In this case, compression can

0 5 10 15 20 25 3010

−3

10−2

10−1

100

Iterations

Rel

ativ

e m

isfit

(1) cf−1279(2) cf−3153(3) cf−4182(4) cf−6141(5) cf−4182w/o compression

a)

b)

c)

d)

Figure 7. (a) Computational setup with the true model for vp, receiver locations marked as black dots and the source locations indicated by redsquares. Slices through the center of the ball-shaped anomalies are projected to the rear and bottom. (b) Misfit reduction with and withoutcompression. The numbers cf-X indicate the average compression factor per source. Note that the thresholds were lowered in scenario 4 after18 iterations. (c) Reconstruction without compression illustrating the vp deviation from the initial model. (d) Reconstruction of scenario1 using the compressed forward wavefield.

R394 Boehm et al.

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

complement checkpointing techniques by spatially compressing thesnapshots. This allows us to store more checkpoints and, thus, itreduces the computational overhead of checkpointing techniques.The analysis of iterative inversion schemes using inexact gradient

information shows that convergence to a (local) minimum can beguaranteed. It is, in general, not possible to ensure that the iterateswith and without compression converge to the same model,although we did not encounter such problems in our numerical tests.This common limitation of nonconvex problems can be partiallymitigated by choosing the compression thresholds adaptively basedon the norm of the gradient, thus gradually including more infor-mation. Similar strategies in the context of stochastic gradient meth-ods have been applied successfully in FWI (Li et al., 2012;Moghaddam et al., 2013), in which randomly chosen subsets ofsources introduce inexactness in the gradient. It has been observedby van Leeuwen and Herrmann (2013, 2014) that the level of in-exactness can be quite high during the first iterations and only needsto be refined once a local minimum is approached.In the remainder of this section, we comment on several aspects

of the implementation and possible extensions of the methods.A multiparameter inversion requires strains for the sensitivity

kernels with respect to the Lamé coefficients as well as the velocityfield for the density kernel, see equations 3 and 4. The error thresh-olds steering the compression can be chosen independently for eachfield. Compressed strains are stored element-wise, which are nec-

Figure 8. Vertical slices of the models to compare the gradients.(a) True model that is used to generate synthetic data. (b) ModelA: A 1D model with average value of the x-y-plane. (c) Model B:Gaussian smoothing filter applied to the true model. All images showthe P-wave velocity.

Table 3. Statistics for various tests with the SEG overthrustmodel. The first column indicates the source type, i.e., S1 fora single point source or S2 for an encoded source, and themodel for which the gradient is computed, i.e., MA for the1D model and MB for the smoothed model.

εabs1 εabs2 εrel sr cf miscf θ ohd (%)

S1-MA 0.001 0.01 0.05 50 4462 44.3 7.8 6.62

0.001 0.01 0.05 100 8895 44.3 46.2 3.36

S1-MB 0.001 0.01 0.05 50 4025 41.0 3.1 6.75

0.001 0.01 0.05 100 8020 41.0 21.4 3.36

S2-MA 0.0001 0.01 0.05 50 306 5.0 4.6 6.73

0.01 0.1 0.1 100 1371 9.7 32.3 3.48

S2-MB 0.0001 0.01 0.05 50 302 5.0 3.4 6.75

0.01 0.1 0.1 100 1350 9.7 20.8 3.45

Figure 9. Sections of the accumulated gradient from 20 encodedsources at horizontal slices through the domain for model B. Theleft column shows the relative deviation from the true shear modu-lus and the right column the normalized gradient with regard to theshear modulus for each slice.


Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

essary as strains are discontinuous across element boundaries in aspectral-element discretization. However, velocities are continuousthroughout the whole domain, and we can store the velocity fieldusing global vectors. This reduces the memory requirements by thenumber of grid points that are shared among two or more elements.Note that storing only the displacement field instead of strains andvelocities is disadvantageous for two reasons. First, recomputingtemporal and spatial derivatives of the wavefield during the adjointrun would require significant computing time. Second, a small com-pression error in the displacement does not necessarily yield a smalldifference in the derivatives, which makes it very difficult to controlthe accuracy of the approximation for the strains.All of the compression techniques are spatially localized and do

not require MPI communication. Furthermore, compression, de-compression, and I/O operations can be carried out asynchronouslyto the simulation and in multiple threads, which further reduce thecomputational overhead. Due to the significantly reduced memoryrequirements, using a local scratch file system is feasible and highlydesirable in case it is available on the HPC architecture.Possible extensions of the compression methods can combine p-

and h-coarsening in a similar fashion as in hp-adaptivity (Demko-wicz et al., 2002). This would not only enable us to adaptivelymerge neighboring elements and to further reduce the memory re-quirements but also introduce an additional computational overheadto transform the mesh.Although this paper primarily targets finite-element methods, the

same ideas for wavefield compression can be applied for finite-dif-ference methods as well. First of all, the temporal compression ofthe wavefield and reinterpolation with cubic splines works com-pletely independently of the spatial discretization as long as the spa-tial grid does not change with time. Likewise, requantizationrequires only a partitioning of the spatial domain in smaller subsetson which the same number of bits is used but no additional geo-metric information. The main requirement to apply requantizationin combination with p-coarsening is a hierarchical structure of thegrid. Although this may not be given naturally for a finite-differencediscretization, we can easily split up any rectilinear grid into smallpatches with 2d grid points per dimension and compress the wave-field on this subdomain using our method. Although for high-orderfinite-element methods, the partitioning of the mesh is intrinsicallytied to the discretization, locally interpolating the wavefield at thegrid points by the Lagrange polynomials is possible for arbitraryrectilinear grids. This hierarchical subdivision of the domain ap-pears in many other compression approaches as well, for instance,MPEG or wavelet-based techniques, and is not limited to finite-element methods.This paper focuses on adjoint methods to compute sensitivity ker-

nels, but the techniques for wavefield compression are useful inother areas as well, e.g., in the context of scattering-integral meth-ods (Chen et al., 2007). Furthermore, we currently investigate ex-tensions for the adjoint-based computation of Hessian-vectorproducts that are required, for instance, in a Newton-CG method(Boehm and Ulbrich, 2015; Götschel and Weiser, 2015). Althoughit is conceptually straightforward to apply the same compressionmethods to the forward and adjoint wavefields for the computationof Hessian-vector products, the two main challenges are errorpropagation and accuracy. Because the decompressed wavefieldsalso appear as a distributed source term in the two auxiliary waveequations, errors are propagated to the perturbed wavefields as well.

However, the inexact Newton steps resulting from the decom-pressed wavefields must still yield significantly better updates thanthe LBFGS step to compensate for the additional simulations re-quired for the CG iterations. We carried out some preliminary testsapplying the previously discussed compression techniques to theforward and adjoint wavefield in a trust-region Newton-PCG(Boehm and Ulbrich, 2015) method. Obtaining descent directionswas not a problem in our tests; however, the inexact Newton stepsprovided only a small improvement compared with LBFGS up-dates. Of course, more accurate steps can be obtained by reducingthe compression thresholds, but this significantly lowers the achiev-able compression factor. Tailoring the compression methods to Hes-sian-vector products is subject of future research.

CONCLUSION

In this paper, we proposed compression techniques tailored toFWI, in which the forward wavefield is compressed and retrievedduring the adjoint run to compute an approximate gradient. All ofthe presented methods are easy to implement and computationallycheap, adding an overhead of only a few percent to a simulation.Numerical examples show that the methods are capable of reduc-

ing the memory requirements by three orders of magnitude withoutaffecting the iterations to solve the inverse problem. Although themethods are tailored to spectral-element methods in the time domain,similar ideas can be applied for other numerical techniques as well.

ACKNOWLEDGMENTS

The authors would like to thank Editor John Etgen, AssociateEditor Anatoly Baumstein, and three anonymous reviewers for theirvaluable comments and excellent suggestions to improve the paper.C. B. and A. F. gratefully acknowledge support by the Swiss Na-tional Supercomputing Centre (CSCS) in the form of CHRONOSproject ch1 and PASC project GeoScale. M.H. and J.P. would like toacknowledge funding from the European Union’s Horizon 2020Programme (2014–2020) and from Brazilian Ministry of Science,Technology and Innovation through Rede Nacional de Pesquisa(RNP) under the HPC4E Project (www.hpc4e.eu), grant agreementno 689772. Furthermore, this project has received funding from theEuropean Union’s Horizon 2020, research and innovation pro-gramme under the Marie Sklodowska-Curie grant agreementNo. 644202 and from Repsol. C. B. acknowledges funding fromShell within the project “Boosting full-waveform inversion.”

REFERENCES

Aminzadeh, F., N. Burkhard, J. Long, T. Kunz, and P. Duclos, 1996, Threedimensional SEG/EAEG models — An update: The Leading Edge, 15,131–134, doi: 10.1190/1.1437283.

Anderson, J. E., L. Tan, and D. Wang, 2012, Time-reversal checkpointingmethods for RTM and FWI: Geophysics, 77, no. 4, S93–S103, doi: 10.1190/geo2011-0114.1.

Boehm, C., 2015, Efficient inversion methods for constrained parameteridentification in full-waveform seismic tomography: Ph.D. thesis, Tech-nische Universität München.

Boehm, C., and M. Ulbrich, 2015, A semismooth Newton-CG method forconstrained parameter identification in seismic tomography: SIAM Jour-nal on Scientific Computing, 37, S334–S364, doi: 10.1137/140968331.

Chen, P., T. H. Jordan, and L. Zhao, 2007, Full three-dimensional tomog-raphy: A comparison between the scattering-integral and adjoint-wave-field methods: Geophysical Journal International, 170, 175–181, doi:10.1111/j.1365-246X.2007.03429.x.

R396 Boehm et al.

Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

www.hpc4e.eu

www.hpc4e.eu

www.hpc4e.eu

http://dx.doi.org/10.1190/1.1437283

http://dx.doi.org/10.1190/1.1437283

http://dx.doi.org/10.1190/1.1437283

http://dx.doi.org/10.1190/geo2011-0114.1



http://dx.doi.org/10.1137/140968331

http://dx.doi.org/10.1137/140968331

http://dx.doi.org/10.1111/j.1365-246X.2007.03429.x






Demkowicz, L., W. Rachowicz, and P. Devloo, 2002, A fully automatic hp-adaptivity: Journal of Scientific Computing, 17, 117–142, doi: 10.1023/A:1015192312705.

Dennis, J., Jr., and J. Moré, 1977, Quasi-Newton methods, motivation andtheory: SIAM Review, 19, 46–89, doi: 10.1137/1019005.

Epanomeritakis, I., V. Akçelik, O. Ghattas, and J. Bielak, 2008, A Newton-CG method for large-scale three-dimensional elastic full-waveform seis-mic inversion: Inverse Problems, 24, 034015, doi: 10.1088/0266-5611/24/3/034015.

Fichtner, A., 2011, Full seismic waveform modelling and inversion:Springer.

Fichtner, A., B. L. N. Kennett, H. Igel, and H.-P. Bunge, 2009, Full seismicwaveform tomography for upper-mantle structure in the Australasian re-gion using adjoint methods: Geophysical Journal International, 179,1703–1725, doi: 10.1111/j.1365-246X.2009.04368.x.

Gokhberg, A., and A. Fichtner, 2016, Full-waveform inversion on hetero-geneous HPC systems: Computers & Geosciences, 89, 260–268, doi: 10.1016/j.cageo.2015.12.013.

Götschel, S., and M. Weiser, 2015, Lossy compression for PDE-constrainedoptimization: Adaptive error control: Computational Optimization andApplications, 62, 131–155, doi: 10.1007/s10589-014-9712-6.

Griewank, A., 1992, Achieving logarithmic growth of temporal and spatialcomplexity in reverse automatic differentiation: Optimization Methodsand Software, 1, 35–54, doi: 10.1080/10556789208805505.

Hanzich, M., F. Rubio, G. Aguilar, and N. Gutierrez, 2013, Efficient lossycompression for seismic processing: 75th EAGE Conference and Exhibi-tion, EAGE, Expanded Abstracts, Th-16-08.

Hanzich, M., F. Rubio, J. de la Puente, and N. Gutierrez, 2014, Lossy datacompression with DCT transforms: Workshop on High PerformanceComputing for Upstream, EAGE, Expanded Abstracts, HPC30.

Husen, S., and E. Kissling, 2001, Local earthquake tomography betweenrays and waves: Fat ray tomography: Physics of the Earth and PlanetaryInteriors, 123, 127–147, doi: 10.1016/S0031-9201(00)00206-5.

Igel, H., H. Djikpesse, and A. Tarantola, 1996, Waveform inversion ofmarine reflection seismograms for P-impedance and Poisson’s ratio: Geo-physical Journal International, 124, 363–371, doi: 10.1111/j.1365-246X.1996.tb07026.x.

Li, X., A. Y. Aravkin, T. van Leeuwen, and F. J. Herrmann, 2012, Fast ran-domized full-waveform inversion with compressive sensing: Geophysics,77, no. 3, A13–A17, doi: 10.1190/geo2011-0410.1.

Métivier, L., R. Brossier, J. Virieux, and S. Operto, 2013, Full waveforminversion and the truncated Newton method: SIAM Journal on ScientificComputing, 35, B401–B437, doi: 10.1137/120877854.

Moghaddam, P., H. Keers, F. Herrmann, and W. Mulder, 2013, A new opti-mization approach for source-encoding full-waveform inversion: Geo-physics, 78, no. 3, R125–R132, doi: 10.1190/geo2012-0090.1.

Nocedal, J., and S. Wright, 2006, Numerical optimization: Springer.Peter, D., D. Komatitsch, Y. Luo, R. Martin, N. Le Goff, E. Casarotti, P.Le Loher, F. Magnoni, Q. Liu, C. Blitz, T. Nissen-Meyer, P. Basini,and J. Tromp, 2011, Forward and adjoint simulations of seismic wavepropagation on fully unstructured hexahedral meshes: GeophysicalJournal International, 186, 721–739, doi: 10.1111/j.1365-246X.2011.05044.x.

Pratt, R. G., C. Shin, and G. J. Hick, 1998, Gauss-Newton and full Newtonmethods in frequency-space seismic waveform inversion: GeophysicalJournal International, 133, 341–362, doi: 10.1046/j.1365-246X.1998.00498.x.

Rickers, F., A. Fichtner, and J. Trampert, 2013, The Iceland–Jan Mayenplume system and its impact on mantle dynamics in the North Atlanticregion: Evidence from full-waveform inversion: Earth and Planetary Sci-ence Letters, 367, 39–51, doi: 10.1016/j.epsl.2013.02.022.

Sullivan, G., and T. Wiegand, 2005, Video compression - from concepts tothe H.264/AVC standard: Proceedings of the IEEE, 93, 18–31, doi: 10.1109/JPROC.2004.839617.

Symes, W. W., 2007, Reverse time migration with optimal checkpointing:Geophysics, 72, no. 5, SM213–SM221, doi: 10.1190/1.3238367.

Tape, C., Q. Liu, A. Maggi, and J. Tromp, 2010, Seismic tomography ofthe southern California crust based on spectral-element and adjointmethods: Geophysical Journal International, 180, 433–462, doi: 10.1111/j.1365-246X.2009.04429.x.

Tromp, J., D. Komatitsch, and Q. Liu, 2008, Spectral-element and adjoint meth-ods in seismology: Communications in Computational Physics, 3, 1–32.

Tromp, J., C. Tape, and Q. Liu, 2005, Seismic tomography, adjoint methods,time reversal and banana-doughnut kernels: Geophysical JournalInternational, 160, 195–216, doi: 10.1111/j.1365-246X.2004.02453.x.

Unat, D., T. Hromadka, and S. B. Baden, 2009, An adaptive sub-samplingmethod for in-memory compression of scientific data: Data CompressionConference, IEEE, DCC’09, 262–271.

Van Der Hilst, R. D., and M. V. De Hoop, 2005, Banana-doughnut kernelsand mantle tomography: Geophysical Journal International, 163, 956–961, doi: 10.1111/j.1365-246X.2005.02817.x.

van Leeuwen, T., and F. J. Herrmann, 2013, Fast waveform inversion with-out source-encoding: Geophysical Prospecting, 61, 10–19, doi: 10.1111/j.1365-2478.2012.01096.x.

van Leeuwen, T., and F. J. Herrmann, 2014, 3D frequency-domain seismicinversion with controlled sloppiness: SIAM Journal on Scientific Com-puting, 36, S192–S217, doi: 10.1137/130918629.

Virieux, J., and S. Operto, 2009, An overview of full-waveform inversion inexploration geophysics: Geophysics, 74, no. 6, WCC1–WCC26, doi: 10.1190/1.3238367.

Walther, A., and A. Griewank, 2004, Advantages of binomial checkpointingfor memory-reduced adjoint calculations, in M. Feistauer, V. Dolejší, P.Knobloch, and K. Najzar eds., Numerical mathematics and advanced ap-plications, ENUMATH 2003: Springer, 834–843.

Wang, Z., A. Bovik, H. Sheikh, and E. Simoncelli, 2004, Image quality as-sessment: from error visibility to structural similarity: IEEE Transactionson Image Processing, 13, 600–612.

Weiser, M., and S. Götschel, 2012, State trajectory compression for optimalcontrol with parabolic PDEs: SIAM Journal on Scientific Computing, 34,A161–A184, doi: 10.1137/11082172X.

Yoshizawa, K., and B. L. N. Kennett, 2002, Determination of the influencezone for surface wave paths: Geophysical Journal International, 149, 440–453, doi: 10.1046/j.1365-246X.2002.01659.x.

Zhu, H., E. Bozdağ, D. Peter, and J. Tromp, 2012, Structure of the Europeanupper mantle revealed by adjoint tomography: Nature Geoscience, 5,493–498, doi: 10.1038/ngeo1501.


Dow

nloa

ded

01/0

5/17

to 1

95.1

76.1

10.2

14. R

edis

trib

utio

n su

bjec

t to

SEG

lice

nse

or c

opyr

ight

; see

Ter

ms

of U

se a

t http

://lib

rary

.seg

.org

/

http://dx.doi.org/10.1023/A:1015192312705

http://dx.doi.org/10.1023/A:1015192312705

http://dx.doi.org/10.1023/A:1015192312705

http://dx.doi.org/10.1137/1019005

http://dx.doi.org/10.1137/1019005

http://dx.doi.org/10.1088/0266-5611/24/3/034015

http://dx.doi.org/10.1088/0266-5611/24/3/034015

http://dx.doi.org/10.1088/0266-5611/24/3/034015







http://dx.doi.org/10.1016/j.cageo.2015.12.013






http://dx.doi.org/10.1007/s10589-014-9712-6

http://dx.doi.org/10.1007/s10589-014-9712-6

http://dx.doi.org/10.1080/10556789208805505

http://dx.doi.org/10.1080/10556789208805505

http://dx.doi.org/10.1016/S0031-9201(00)00206-5

http://dx.doi.org/10.1016/S0031-9201(00)00206-5

http://dx.doi.org/10.1111/j.1365-246X.1996.tb07026.x









http://dx.doi.org/10.1137/120877854

http://dx.doi.org/10.1137/120877854
















http://dx.doi.org/10.1016/j.epsl.2013.02.022






http://dx.doi.org/10.1109/JPROC.2004.839617




http://dx.doi.org/10.1190/1.3238367

http://dx.doi.org/10.1190/1.3238367

http://dx.doi.org/10.1190/1.3238367



















http://dx.doi.org/10.1111/j.1365-2478.2012.01096.x

http://dx.doi.org/10.1111/j.1365-2478.2012.01096.x

http://dx.doi.org/10.1111/j.1365-2478.2012.01096.x

http://dx.doi.org/10.1111/j.1365-2478.2012.01096.x

http://dx.doi.org/10.1111/j.1365-2478.2012.01096.x

http://dx.doi.org/10.1111/j.1365-2478.2012.01096.x

http://dx.doi.org/10.1137/130918629

http://dx.doi.org/10.1137/130918629

http://dx.doi.org/10.1190/1.3238367

http://dx.doi.org/10.1190/1.3238367

http://dx.doi.org/10.1190/1.3238367

http://dx.doi.org/10.1137/11082172X

http://dx.doi.org/10.1137/11082172X







http://dx.doi.org/10.1038/ngeo1501

http://dx.doi.org/10.1038/ngeo1501

wavefield compression for adjoint methods in full-waveform...

Documents