
  • University of British Columbia SLIM

    Bas Peters, Felix J. Herrmann

    Workshop W-12, The limit of FWI in subsurface parameter recovery. SEG 85th annual meeting, Friday, 23 October

    Simultaneous estimation of wavefields & medium parameters: reduced-space versus full-space waveform inversion

    Released to public domain under Creative Commons license type BY (https://creativecommons.org/licenses/by/4.0). Copyright (c) 2015 SLIM group @ The University of British Columbia.

  • Motivation

    Full-Waveform Inversion (FWI) works well when a good start model and/or low-frequency data is available.

    2

    $H(m) \in \mathbb{C}^{N\times N}$: discrete PDE; $m \in \mathbb{R}^{N}$: medium parameters; $P \in \mathbb{R}^{m\times N}$: selects the field at the receivers; $u \in \mathbb{C}^{N}$: field; $d \in \mathbb{C}^{m}$: observed data; $q \in \mathbb{C}^{N}$: source.

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$
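    A minimal sketch of evaluating this reduced objective on a toy 1-D Helmholtz problem; the grid, discretization and receiver layout below are illustrative assumptions, not the setup of the talk:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

N, h, freq = 200, 10.0, 5.0                   # grid points, spacing [m], frequency [Hz]
omega = 2.0 * np.pi * freq

def H(m):
    """Toy Helmholtz matrix H(m) = -Laplacian - omega^2 diag(m), m = slowness squared."""
    L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(N, N)) / h**2
    return (-L - omega**2 * sp.diags(m)).tocsc()

m_true = np.full(N, 1.0 / 2000.0**2)          # 2000 m/s background
q = np.zeros(N); q[10] = 1.0                  # point source
P = sp.eye(N, format="csr")[150:160, :]       # selects 10 receiver locations
d = P @ spla.spsolve(H(m_true), q)            # "observed" data

def fwi_objective(m):
    u = spla.spsolve(H(m), q)                 # one PDE solve per source term
    return 0.5 * np.linalg.norm(P @ u - d) ** 2

print(fwi_objective(m_true))                  # ~0 at the true model
```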

  • Motivation

    Full-Waveform Inversion (FWI) works well when a good start model and/or low-frequency data is available.

    If iterative solvers in the frequency domain are used, we cannot compute:

    $H(m)^{-1}q = u$

    Instead we obtain:

    $\hat{u} = H(m)^{-1}(q + r_u) = H(m)^{-1}r_u + u$

    3

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$

  • Motivation

    Full-Waveform Inversion (FWI) works well when a good start model and/or low-frequency data is available.

    If iterative solvers in the frequency domain are used, we cannot compute:

    $H(m)^{-1}q = u$

    Instead we obtain:

    $\hat{u} = H(m)^{-1}(q + r_u) = \underbrace{H(m)^{-1}r_u}_{\text{error}} + u$, with $r_u$ the solver residual.

    4

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$
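    The identity above is easy to verify numerically. A small sketch (toy matrix and names are assumptions) that solves one system with a loose GMRES tolerance and checks that the field error equals $H^{-1}r_u$:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

N = 500
H = sp.diags([1.0, -2.1, 1.0], [-1, 0, 1], shape=(N, N)).tocsc()  # toy "PDE" matrix
q = np.random.default_rng(0).standard_normal(N)

u = spla.spsolve(H, q)                        # accurate solve: u = H^{-1} q
u_hat, _ = spla.gmres(H, q, rtol=1e-2)        # sloppy solve ('tol=' on older SciPy)

r_u = H @ u_hat - q                           # residual the solver controls
print(np.allclose(u_hat - u, spla.spsolve(H, r_u)))  # True: field error = H^{-1} r_u
```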

  • Example

    Illustration of what happens when PDEs are solved inaccurately

    • 3-15 Hz
    • good start model
    • full-offset source & receiver array
    • noise-free data

    5

  • 6

    [Figure: true velocity model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • 7

    [Figure: initial velocity model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using very accurate iterative solver & LBFGS

    8

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using accurate iterative solver & LBFGS

    9

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using less accurate iterative solver & LBFGS

    10

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using inaccurate iterative solver & LBFGS

    11

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using very inaccurate iterative solver & LBFGS

    12

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • reduced-space (includes FWI):
    • error in objective function value
    • error in gradient
    • error in Hessian
    • error in medium parameter update
    [A. Tarantola, 1984; E. Haber et al., 2000; I. Epanomeritakis et al., 2008]

    • storage as low as two fields at a time
    • dense reduced Hessian
    • requires extra safeguards/accuracy control
    [T. van Leeuwen & F.J. Herrmann, 2014]

    13

    Inexact PDE solves

  • Goal

    This talk is about deriving an algorithm which:
    • allows for inexact solutions of linear systems
    • enjoys similar parallelism and memory requirements as FWI

    14

  • For the true medium parameters and true fields we know that:

    $H(m)u = q \qquad \& \qquad Pu = d$

    Many ways to use these equations to form:
    • objectives
    • constraints
    • algorithms

    Adjoint-state based FWI is just one algorithm.

    15

    Data, PDEs and constraints

  • 16

    Data, PDEs and constraints

    $\underbrace{\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2}_{\text{objective}} \quad \text{s.t.} \quad \underbrace{H(m)u = q}_{\text{constraint}}$

  • 17

    Data, PDEs and constraints

    $\underbrace{\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2}_{\text{objective}} \quad \text{s.t.} \quad \underbrace{H(m)u = q}_{\text{constraint}}$

    $\min_{m,u}\ \tfrac{1}{2}\|H(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d$

  • 18

    Data, PDEs and constraints

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad H(m)u = q$

    $\min_{m,u}\ \tfrac{1}{2}\|H(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d$

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$

  • 19

    Data, PDEs and constraints

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad H(m)u = q$

    $\min_{m,u}\ \tfrac{1}{2}\|H(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d$

    $\min_{m,u}\ \|H(m)u - q\|_2^2 \quad \text{s.t.} \quad \|Pu - d\|_2^2 \le \sigma$

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$


  • Multi-experiment structure:

    21

    $\begin{pmatrix}P_1&&&\\&P_2&&\\&&\ddots&\\&&&P_k\end{pmatrix}\begin{pmatrix}u_1\\u_2\\\vdots\\u_k\end{pmatrix} - \begin{pmatrix}d_1\\d_2\\\vdots\\d_k\end{pmatrix} \qquad\qquad \begin{pmatrix}H_1&&&\\&H_2&&\\&&\ddots&\\&&&H_k\end{pmatrix}\begin{pmatrix}u_1\\u_2\\\vdots\\u_k\end{pmatrix} - \begin{pmatrix}q_1\\q_2\\\vdots\\q_k\end{pmatrix}$

    $Pu - d \qquad\qquad H(m)u - q$

    $k = n_{\text{src}} \times n_{\text{freq}}$; $k \times N$ field parameters
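    A sketch of how this block structure can be assembled with SciPy; the per-experiment matrices below are random stand-ins for $H_i$ and $P_i$ with assumed shapes, for illustration only:

```python
import numpy as np
import scipy.sparse as sp

N, n_src, n_freq = 100, 4, 3
k = n_src * n_freq                                      # number of experiments

Hs = [sp.eye(N) + sp.random(N, N, density=0.05) for _ in range(n_freq)]
P1 = sp.eye(N, format="csr")[::10, :]                   # one receiver-sampling operator

H_blocks = [Hs[f] for f in range(n_freq) for _ in range(n_src)]  # H_i repeats per source
H_big = sp.block_diag(H_blocks, format="csc")                    # (k N) x (k N)
P_big = sp.block_diag([P1] * k, format="csr")

u = np.ones(k * N)
print(P_big.shape, (H_big @ u).shape)                   # residuals Pu - d, Hu - q act per experiment
```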

  • 22

    A quadratic-penalty based full space method

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|H(m)u - q\|_2^2$

    $\min_{m,u}\ \|H(m)u - q\|_2^2 \ \text{s.t.}\ \|Pu - d\|_2^2 \le \sigma$: the $\sigma$–$\lambda$ relation is known [W. Gander, 1980; Å. Björck, 1996]

    22
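    Given the current fields $u$ and model $m$, this objective needs only sparse matrix-vector products, no PDE solves. A minimal sketch (names follow the slide; inputs are assumed to be SciPy sparse matrices and NumPy vectors):

```python
import numpy as np

def penalty_objective(H, P, u, q, d, lam):
    """phi(m, u) = 1/2 ||P u - d||^2 + lam^2/2 ||H(m) u - q||^2, H already built at m."""
    r_data = P @ u - d
    r_pde = H @ u - q
    return 0.5 * np.vdot(r_data, r_data).real + 0.5 * lam**2 * np.vdot(r_pde, r_pde).real
```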

  • 23

    A quadratic-penalty based full space method

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|H(m)u - q\|_2^2$

    Newton's method:

    $\begin{pmatrix}P^*P + \lambda^2 H^*H & \nabla^2_{u,m}\phi\\ \nabla^2_{m,u}\phi & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 H^*(Hu - q)\\ \lambda^2 G_m^*(Hu - q)\end{pmatrix}$

    updates for the working subset of fields & medium parameters

    23

  • update fields & medium parameters simultaneously
    • function value, gradient & Hessian evaluation is ~free & exact
    • sparse Hessian
    • theory allows for inexact update computations
    • requires storage of working subset of fields + working memory (gradients, Hessian & update)
    • update computation is challenging

    24

    $\begin{pmatrix}P^*P + \lambda^2 H^*H & \nabla^2_{u,m}\phi\\ \nabla^2_{m,u}\phi & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 H^*(Hu - q)\\ \lambda^2 G_m^*(Hu - q)\end{pmatrix}$

    A quadratic-penalty based full space method

  • Approximate: block diagonal & positive (semi)definite
    • give up some of Newton's method properties
    • update computation intrinsically parallel per field
    • no need to form off-diagonal blocks

    philosophy: more & cheaper iterations

    25

    A quadratic-penalty based full space method

    $\begin{pmatrix}P^*P + \lambda^2 H^*H & 0\\ 0 & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 H^*(Hu - q)\\ \lambda^2 G_m^*(Hu - q)\end{pmatrix}$
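    A sketch of one decoupled update from this block-diagonal system. The field block $(P^*P + \lambda^2 H^*H)\,\delta u = -\nabla_u\phi$ is the normal-equations form of a sparse least-squares problem, so LSQR on the stacked operator $[\lambda H;\ P]$ is one stable way to solve it inexactly; $G$ is the Jacobian $\partial(H(m)u)/\partial m$. All function names are assumptions:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def field_update(H, P, u, q, d, lam, tol=1e-6):
    """Solve (P*P + lam^2 H*H) du = -grad_u phi via LSQR on [lam H; P]."""
    A = sp.vstack([lam * H, P]).tocsc()
    rhs = np.concatenate([lam * (q - H @ u), d - P @ u])
    return spla.lsqr(A, rhs, atol=tol, btol=tol)[0]     # inexact solves are allowed

def model_update(G, H, u, q, damp=1e-8):
    """Solve G*G dm = -G*(H u - q); lam^2 cancels. Damping guards semidefiniteness."""
    M = (G.conj().T @ G + damp * sp.eye(G.shape[1])).tocsc()
    return spla.spsolve(M, -(G.conj().T @ (H @ u - q)))
```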

  • Memory requirements

    save fields for the working subset of frequencies & sources

    can be distributed over multiple nodes

    feasible? need
    • parallel computing
    • simultaneous sources (redrawing is possible)
    • small frequency batches

    26

  • Notes on full-space quadratic-penalty methods for PDE-constrained optimization
    Bas Peters
    Seismic Laboratory for Imaging and Modeling (SLIM), University of British Columbia

    Abstract

    Introduction

    We investigate different ways to successfully estimate multiple parameters from PDEs, with a focus on the multi-parameter Helmholtz equation. The first-order coupled system of equations describing acoustic wave motion is given by equation (1).

    The pressure field is indicated by $p$, the vector force source term is $f_k$, where $k$ indicates the component, $v$ is the particle velocity and $q$ the scalar volume injection source term. The unknown parameters are the scalar buoyancy $b$ and the scalar compressibility $\kappa$. These are the parameters we want to estimate. In an acoustic medium, the compressibility, buoyancy and velocity ($c$) are related as $\kappa = b/c^2$.

    Throughout this paper, we will use the discretize-then-optimize formalism. This means equation (1) is the only continuous equation. After discretization of (1), we restrict ourselves to the case where there are only volume injection sources, $f_k = 0$, and rewrite the discrete version of equation (1) as a second-order partial differential equation (PDE). The only field in this equation is the pressure field. This is the Helmholtz equation in a lossless and isotropic medium. A toy discretization of this operator is sketched below.
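    A minimal sketch of assembling such a two-parameter operator on a 1-D grid, $A(b,\kappa) = \omega^2\,\mathrm{diag}(\kappa) - D^T\mathrm{diag}(b)\,D$ with a forward-difference $D$ and $b$ on the staggered grid; the discretization details are assumptions for illustration, not the one used in the paper:

```python
import numpy as np
import scipy.sparse as sp

def helmholtz_two_param(b, kappa, h, omega):
    """A(b, kappa) ~ omega^2 kappa p + div(b grad p); b lives on the n-1 cell edges."""
    n = kappa.size
    D = sp.diags([-1.0, 1.0], [0, 1], shape=(n - 1, n)) / h   # forward difference
    return (omega**2 * sp.diags(kappa) - D.T @ sp.diags(b) @ D).tocsc()

# e.g. a homogeneous medium with c = 2000 m/s and rho = 1000 kg/m^3:
n, h, omega = 200, 10.0, 2.0 * np.pi * 5.0
b = np.full(n - 1, 1.0 / 1000.0)                  # buoyancy = 1/rho
kappa = np.full(n, (1.0 / 1000.0) / 2000.0**2)    # kappa = b / c^2
A = helmholtz_two_param(b, kappa, h, omega)
```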

    To obtain estimates for the unknown parameters, we would like to match experimentally observed data $d$ to predicted data $Pu$ while the predicted field $u$ satisfies the PDE. $P$ is a linear operator which projects the predicted field to the receiver locations. This problem is of the general PDE-constrained optimization type of equation (2).

    $P$ is a linear projection operator onto the receiver locations, $A$ is the discretized two-parameter Helmholtz matrix, $u$ is the pressure wavefield, $q$ is the source term and $d$ is the measured data. The two parameters to be estimated are the buoyancy $b$ and the compressibility $\kappa$.

    This paper is concerned with solution methods for problem (2), specifically suitable for the situation where we want to estimate multiple parameters occurring in the same PDE. We explore methods based on a quadratic-penalty formulation of a constrained optimization problem. Most research on solving problem (2) uses a Lagrangian formulation. We will show that the penalty formulation has some interesting advantages and disadvantages compared to the Lagrangian formulation, and we will derive some algorithms which exploit the advantages of a penalty formulation but do not suffer too much from the disadvantages. We will not discuss details of the memory consumption and computational requirements of the proposed algorithms compared to Lagrangian-based methods. We do, however, show that the proposed algorithms are in the same ballpark as their Lagrangian counterparts, memory- and computation-wise.

    Memory and computational cost will be evaluated from the direct-solver point of view; QR and Cholesky factorizations will be used to solve the linear systems occurring in this paper. This does not imply we favor direct methods over iterative solvers, but it does change the entire point of view on what is computationally efficient and what is not. The fastest algorithms for PDE-constrained optimization may very well be a combination of direct and iterative solvers. Some remarks are made about challenges for iterative solvers and places where iterative solvers could be very efficient compared to direct solvers.

    Context, motivation and challenges

    In parameter estimation problems, we almost exclusively work with multi-experiment data. In the frequency domain, which we consider in this work, the multiple experiments originate from the source(s), which excite the fields from multiple locations and do so at multiple frequencies. The multiple frequencies may be due to exciting the fields multiple times, each time at a different frequency, or to Fourier transforming a time-domain signal. In the multi-experiment setting, the matrices and vectors in problem (2) obtain the block structure of equation (3),

    where the subscript indicates the experiment index. Solving problem (2) using the PDE in equation (1) is difficult for a number of reasons. Besides the usual numerical challenges of solving large linear and possibly ill-conditioned systems, limited availability of data and the ill-posedness of the inverse problem, there is the difficulty of having two parameters to be estimated, occurring in either a coupled first-order PDE system or a second-order PDE. Intuitively, this will introduce a certain 'trade-off' between (in this case) compressibility and buoyancy in the parameter estimation sense. This could also be described as non-uniqueness/uncertainty of the estimation process. This is a particularly problematic issue in this case. We can verify this with a simple example.

    Example 1:

    We generate 'observed' data in a model with a buoyancy and compressibility perturbation from the homogeneous background model, which serves as the initial guess. We invert this data using the most commonly used approach in geophysical exploration: eliminating the constraints in (2) and solving the unconstrained problem by using just the reduced gradient to minimize the objective with the L-BFGS algorithm (insert citations). (Gradient-descent type algorithms are also widely used in exploration geophysics.) The results show that we managed to fit over 95% of the data, but the estimated parameters are obviously nowhere near the true model. Moreover, only the compressibility is changed compared to the initial guess; the buoyancy is almost untouched. This example shows that by inverting (effectively) for only one of the two parameters, we are able to fit over 95% of the data anyway. This is very different in the single-parameter estimation problem. To show this, we repeat the previous exercise, but this time we generate data using only a perturbation in the compressibility and we only invert for the compressibility. Again, we are able to fit over 95% of the data, but this time the estimated model is very close to the true model.

    Example 1 explicitly shows that using just the gradient in a quasi-Newton scheme is not up to the task of estimating the parameters. We observe that the norms of the gradients of the objective, corresponding to the different parameters, are very different. This causes only one of the two parameters to be updated. One could simply scale the gradients by modifying the quasi-Newton or gradient-descent algorithm, but the question is: how to scale these two? Putting them on equal footing may sound reasonable, but this is still a completely arbitrary and unjustified choice. Scalings of this type, but inspired by the physics of wavefields, are used in (cite). This type of approach may work well, but it requires problem-dependent specification of the scaling. Another approach to mitigate the challenges of multi-parameter estimation is to use a hierarchical strategy: estimate one parameter while keeping the other fixed, and do this in an alternating fashion, possibly just once. This strategy is used in (cite tarantola). This can also yield satisfactory results, but it requires a lot of parameter tweaking and fine-tuning, to be repeated for every new problem. This strategy requires choosing which parameter to start with, and for how many iterations or down to which objective function value. Yet another way to mitigate the non-uniqueness and the different magnitudes of the gradients is the incorporation of a priori information into the problem. (meju gallardo) use so-called 'cross-gradient' regularization of the inverse problem. This requires the different estimated parameters to exhibit 'structural similarity', i.e., they have to vary in a similar fashion. A priori information would be very useful, but it has to be available first. This is often not the case, because we need a priori information about the relation between compressibility and buoyancy, not about each parameter separately (although this can be included as well). In the case of acoustic inverse problems with the earth as the medium, we actually know that the buoyancy and compressibility may sometimes vary in a similar way, but sometimes their relation can be completely different. (example bg model). There is also some literature available concerned with which parameterization to choose in multi-parameter seismic problems (cite). These works analyse the gradient direction of the reduced Lagrangian objective or partial derivatives of the PDE (the matrices $G$ of equation (9)). Based on those pieces of information, it is decided which parameterization (density-velocity, compressibility-density, etc.) should give the best parameter estimates. This is also purely focused on gradient-descent or quasi-Newton methods, just like the other approaches described above. This totally bypasses physical information which is available in the Hessian. See (pratt) for a physical interpretation of the Hessian in the reduced Lagrangian. From an optimization point of view, it is well known that Newton's method has the affine transformation invariance property (refs) and is in principle robust to ill-conditioned problems. Quasi-Newton and gradient-descent methods do not have these properties.

    Our approach is therefore to use the Hessian and approximations of the Hessian to estimate multiple parameters. We will evaluate whether the use of the Hessian allows simultaneous estimation of multiple parameters, without using prior information about the coupling of the parameters in the medium to be estimated, without hierarchical strategies and without manually rescaling the different parts of the gradient of the objective. We will thus focus on Newton and Gauss-Newton type algorithms.

    Penalty methods for PDE-constrained optimization

    A penalty formulation for solving problem (2) was presented in Leeuwen and Herrmann (2013b) and used for single-parameter seismic waveform inversion in Leeuwen and Herrmann (2013a). Earlier, a penalty formulation with a fixed rule for balancing the two terms was used in (cite modified gradient and contrast source). We use the same formulation, but now for two unknown parameters (equation (4)).

    Problem (4) is a combination of data misfit and PDE misfit, where the scalar $\lambda$ balances the two. It is important to note that the balancing between the two terms is the essence of a penalty method. Unlike a Lagrangian formulation of an equality-constrained problem, the penalty formulation for a fixed $\lambda$ treats the original objective to be minimized as well as the constraints in the same way. In other words, there is no longer a notion of what are constraints and what should be minimized. The value of $\lambda$ determines what will happen. A large $\lambda$ will provide results approaching the solutions to problem (2), while very small values of $\lambda$ will provide solutions to a problem similar to (2), but with objective and constraints switched. Of course, the standard algorithms for quadratic-penalty methods honour the notion of minimization objective and constraints by solving a sequence of problems of the form of (4), where each problem is solved with a different value of $\lambda$. This guarantees convergence to a solution of (2); see Theorem 17.1 in Nocedal and Wright (2000). In this paper we only consider a single value of $\lambda$, so we do not solve a sequence of problems, and see how well it works. The main reasons for this approach are promising results in (cite our papers and abstracts) using a single value of $\lambda$, as well as the cost of solving PDE-constrained optimization problems. For large-scale problems, solving once is already computationally expensive; solving a sequence of problems would seriously decrease the competitiveness of penalty-based methods.

    We can introduce the objective function for problem (4) as equation (5), which defines an unconstrained minimization problem. We can also write this objective in its separable form, equation (6), where $i$ is the index of the experiment number. This explicitly shows we can accumulate the objective in parallel as a running sum. This is not the case for the gradient and Hessian in general. Another way to rewrite this objective is in a block form, organized per angular frequency, as in equation (7), where $U_{\omega_i}$ collects the fields for frequency $\omega_i$. This structure is intuitive in case direct factorization methods are used for the solution of the linear systems arising in the optimization algorithms, because it groups all blocks in terms of the PDE discretization matrices per frequency. This form is also convenient in the simultaneous-source scenario, also known as randomized trace estimation (cite eldad, felix).

    The first-order necessary conditions for a local minimizer of this unconstrained problem are given in equation (8), where the partial derivative matrices are given by equation (9) and have the multi-experiment block structure of equation (10). The second-order necessary condition is, in addition to satisfying the first-order conditions, that the Hessian is positive semidefinite (equation (11)).

    Whereas we will use the first-order necessary conditions to design our optimization algorithms, the second-order necessary condition is used as a diagnostic.
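    For the toy discretization sketched earlier, $A(b,\kappa)u$ is linear in each parameter, so the matrices of equation (9) have closed forms: $G_\kappa = \omega^2\,\mathrm{diag}(u)$ and $G_b = -D^T\mathrm{diag}(Du)$. A sketch under that assumption (not the paper's discretization):

```python
import scipy.sparse as sp

def jacobians(u, D, omega):
    """G_kappa = d(Au)/d kappa, G_b = d(Au)/d b for A = omega^2 diag(kappa) - D^T diag(b) D."""
    G_kappa = (omega**2 * sp.diags(u)).tocsc()
    G_b = (-D.T @ sp.diags(D @ u)).tocsc()   # since diag(b) (D u) = diag(D u) b
    return G_kappa, G_b
```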

    With these standard optimality conditions defined, we can formulate several algorithms to minimize objective (5), each of which has different computational challenges, memory requirements, advantages and disadvantages.

    First, we return to the case of a sufficiently small $\lambda$: when there is enough emphasis on the term $\frac{1}{2}\|Pu - d\|_2^2$ to fit the observed data $d$ approximately, we essentially obtain an approximation to the problem with the objective and constraints switched, equation (12).

    For now we assume that the observed data is noise free, so fitting it exactly is no concern. In the realistic case with noise, the constraint should be relaxed to inexact data fitting. The motivation for bringing up this interpretation of the quadratic-penalty method for PDE-constrained optimization with a small $\lambda$ is to explain why we still use the quadratic-penalty form, rather than solve problem (12), or a related problem with inexact data-fit constraints, using a Lagrangian form. This Lagrangian form is equation (13),

    with $\gamma$ as the vector of Lagrange multipliers for the equality constraints. There are no obstacles which keep us from solving this constrained problem, but this form does pose some limitations on the algorithms that can be developed.

    The first-order necessary conditions for a minimum of the Lagrangian in (13) are $\nabla_u L = A(m)^*(A(m)u - q) + P^*\gamma = 0$, $\nabla_b L = G_b^*(A(m)u - q) = 0$, $\nabla_\kappa L = G_\kappa^*(A(m)u - q) = 0$ and $\nabla_\gamma L = Pu - d = 0$.

    Variant 1: Full Newton

    This variant is the classical Newton method to minimize objective (5). At each iteration we compute the step from the Newton system $H\,\delta x = -g$, with $g$ the gradient and $H$ the Hessian of $\phi$; this is equation (14). In more explicit form this iteration can be written as equation (15).

    This method can be seen as the quadratic-penalty equivalent of the Lagrangian 'all-at-once' method (cite), because this method updates both the medium parameters (compressibility and buoyancy) and the fields. This can be an advantage, because the problem has a lot of parameters to work with when minimizing the objective. Other main advantages are the availability of an exact gradient vector and exact Hessian matrix essentially for free: no computations (PDE solves) are required, contrary to reduced-space methods as described below and in (cite ghattas, haber, leeuwen). This immediately points at the disadvantages as well; to be able to update the fields, they have to be stored in memory. The feasibility of this is highly dependent on the number of grid points and the number of experiments. Even for modest-sized problems with a relatively small number of experiments, the Hessian will most likely be too large for direct factorization to compute the Newton step. Direct factorization is used by Grote et al. (2011) to solve interior-point systems originating from a Lagrangian approach to handle the constraints. Iterative methods need to be used when the required hardware is not available for direct factorization. This has been the topic of many papers (cite) related to preconditioning and solving the Newton system for the Lagrangian formulation (the KKT system). We are not aware of research dedicated to solving systems arising from the penalty form of the PDE-constrained optimization problem. Although some blocks of our Newton system are the same as in the KKT system for the Lagrangian form (see Haber), a main difference is the $\nabla^2_{u,u}\phi$ block. This block is the combination of the normal-equations form of a PDE system matrix and the linear operator selecting the receiver locations. This has a particularly unfavorable eigenvalue distribution and a roughly squared condition number. No literature has been published dedicated to solving this block on its own using iterative methods. In this work we will therefore not consider iterative solutions of the full Newton method, only direct factorizations on small problems. When efficient iterative preconditioning and solution methods are available, this Newton method can enjoy inexact solves of the Newton system to save a significant number of iterations. See Nocedal and Wright (2000) for some theorems about the required tolerance at every iteration to preserve a desired rate of convergence of the Newton method.

    The Newton iteration uses the solution of (14) as the update for fields and parameters. To ensure global convergence, this update can be used in a trust-region or line-search algorithm. Only line-search algorithms are considered in this work, without specifying the details or advocating the use of a line search over a trust-region approach.

    Variant 2: Uncoupling fields and medium parameters in Newton’s method

    This method gives up the rate of convergence of Newton's method, as well as possibly some robustness with regard to local minima. This last statement is very difficult to quantify in the non-convex setting of PDE-constrained optimization, especially when we are dealing with oscillatory fields. It requires numerical experiments to test whether this is true or not. Furthermore, theorems about the convergence rate of inexact Newton methods no longer hold.

    Uncoupling the fields and medium parameters amounts to setting to zero the Hessian blocks which couple medium parameters to fields. This leads to the Newton-type system (16), which needs to be solved at every iteration.

    This shows the benefits of this approximation of the Hessian. The update for the fields can now be computed in parallel: each block in the block-diagonal $\nabla^2_{u,u}\phi$ block can be factored separately. This algorithm allows the solution of the Newton system using direct solvers for problems a lot larger than variant 1. The approximation to the Hessian in this variant does not approximate the relation between the medium parameters directly, but only indirectly, because the removal of the coupling between fields and medium parameters does affect the Newton direction at every iterate. This Hessian approximation can be written a bit more explicitly in the factored form shown after equation (16),

    which reveals it is a Hermitian matrix, positive-semidefinite or positive-definite (depending on the rank of the factor) with real eigenvalues. The discretization of the PDE determines $G_\kappa$ and $G_b$, and thus the (semi-)definiteness of the Hessian approximation in this section. In case we use the discretization of (cite), $A$ is full rank and the factor is also full rank, with no linear dependence between the rows of $P$ and $A$. Because the update for the fields $\delta u$ and the medium parameter updates $\delta\kappa$, $\delta b$ are uncoupled, we can solve every linear system in the $\nabla^2_{u,u}\phi$ block in parallel. For the medium parameter updates we can use either a direct factorization or an iterative method. Which one is most suitable depends on a number of things: first of all the PDE itself, but also the mesh type (regular grid, grids with hard refinement, etc.) and the type of finite differences/finite elements which are used determine the bandwidth, symmetry properties, eigenvalue distribution and condition number. Therefore it is not possible to make a general recommendation about the solution method for the medium parameter updates.

    Variant 3: Block-diagonal Hessian approximation Newton-type method.

    In this variant we set all off-diagonal blocks of the Hessian corresponding to the full Newton step for objective (5) to zero. This gives the Hessian approximation of equation (17).

    In addition to the things we gave up to arrive at variant 2, we now also give up the coupling between the medium parameters. We only maintain the action of the Hessian blocks on each class of medium parameters separately. Note that this can be more than just a scaling. The buoyancy, for example, is inside the div-grad operator, giving the corresponding Hessian block a differential-operator structure. This can be seen in the section about the heuristics on the effects of the various Hessian approximations. The additional advantage compared to variant 2 is that even the updates for the medium parameters can be computed in parallel. This allows this variant to be used for even larger problems when relying on direct factorizations to solve for the field and medium parameter updates. By ignoring the blocks coupling the medium parameters, we reduce the bandwidth of the system to be solved for the medium parameter updates. This cuts back on the memory requirements due to fill-in during the factorization process.

    Variant 4: Reduced-space sparse Hessian approximation

    The main argument for this variant is its very limited memory use related to the storage of the fields corresponding to all the experiments. We depart from the variants which store and update the fields; instead we solve for the fields at every iteration of a Newton-type algorithm. This was presented in Leeuwen and Herrmann (2013b). The first step is to make sure our objective satisfies the first-order optimality condition with respect to the fields. To achieve this, we solve $\nabla_u\phi(m,u,\lambda) = 0$, which corresponds to the least-squares problem (18).

    In Leeuwen and Herrmann (2013b) it is also noted that this can be interpreted as a variational projection as in Aravkin and Leeuwen (2012). This least-squares form explicitly shows that the observed-data information is augmented to the PDE. The solution will satisfy this observed data if the trade-off parameter $\lambda$ is chosen sufficiently small. We proceed by applying block-Gaussian elimination to the full Newton system (14). To make notation simple, partition the Hessian from equation (15) as in equation (19).
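    A sketch of this field solve via the normal equations of (18), $(\lambda^2 A^*A + P^*P)\,\bar u = \lambda^2 A^*q + P^*d$, factored with a direct method; the names follow the paper, the solver choice is an assumption:

```python
import scipy.sparse.linalg as spla

def solve_field(A, P, q, d, lam):
    """u_bar = argmin_u || [lam A; P] u - [lam q; d] ||_2, via the normal equations."""
    E = (lam**2 * (A.conj().T @ A) + P.T @ P).tocsc()
    rhs = lam**2 * (A.conj().T @ q) + P.T @ d
    return spla.spsolve(E, rhs)
```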

    Eliminating the $E$ block results in equation (20), in which the matrix can be recognized as a Schur complement. When we substitute $\nabla_u\phi(m,\bar u,\lambda) = 0$, the above simplifies to equation (21).

    These are the reduced Hessian and reduced gradient for the reduced objective (22), where $\bar u$ is the solution to the least-squares problem (18). The value of the variable-projection interpretation is that it proves (insert theorem) that the stationary points of the reduced objective (22) are also stationary points of the original full objective (5). This does not imply that the solutions generated by minimizing each of the objectives will always be the same.

    This reduced objective shows one of the main disadvantages of reduced-space methods: function evaluations require the solution of least-squares problems. This is equivalent to the reduced-space objectives based on a Lagrangian, which require PDE solves for function evaluations (Haber et al., 2000). This is a problem in line-search or trust-region algorithms. To keep the computational cost under control in a backtracking line-search algorithm, it is common to limit the number of backtracking steps and to reduce the step size considerably to make sure a step is found quickly. This may cause the algorithm to achieve a sub-optimal rate of convergence. In variants 1, 2 and 3 this is not a problem, because function evaluations are cheap; they merely take some matrix-vector products. These variants should be able to find a near-optimal step length at every iteration of the Newton-type algorithm. A second major disadvantage of reduced-space methods is that the block elimination turned the large and sparse full Hessian into a dense and small reduced Hessian. In certain optimization problems this may be a favourable trade, but in the PDE-constrained optimization setting, especially when dealing with wave phenomena, the 'small' dense Hessian becomes too large to store in memory.

    One way to proceed is to compute matrix-free matrix-vector products with the dense Hessian $D - CE^{-1}B$ by solving a linear system with the $E$ block at every iteration of an iterative algorithm to compute the Newton-type step. The linear systems in the $E$ block can be solved using a direct factorization method, but the computation of the Newton-type step is then restricted to iterative methods. Therefore this method is only a feasible option if the Hessian is sufficiently well conditioned or has some useful eigenvalue clustering. This is unfortunately not the case here; an efficient preconditioner is required, but unavailable at this time. This is a significant disadvantage of the used formulation of the PDE-constrained problem using the quadratic-penalty function. It must be noted, however, that in the Lagrangian case where the reduced Hessian is used (Haber et al. (2000), Métivier et al. (2013), ghattas; see Haber et al., 2000 for the reduced Hessian based on a Lagrangian objective), the extra effort compared to quasi-Newton methods does not always pay off in the single-parameter waveform-inversion problem (Métivier et al., 2013). (ghattas) describes this method in the multi-parameter waveform inversion context, but does not show an example. It remains to be investigated how this method will perform in the multi-parameter case. The reduced Hessian in the Lagrangian setting differs from the one in the quadratic-penalty setting in the sense that the Lagrangian-based reduced Hessian requires a series of two linear-system solves to compute a matrix-free Hessian-vector product, whereas the penalty-based reduced Hessian requires only one. A series of linear systems is less suitable for inexact iterative methods, because the error from solving the first linear system is effectively multiplied by the approximate inverse of the second linear system. We will not investigate the use of the reduced Hessian in this work; a matrix-free product with it is sketched below for completeness.
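    A sketch of such a matrix-free product with $D - CE^{-1}B$: factor the sparse $E$ block once, then each Hessian-vector product costs one sparse forward/back solve. Block names follow equation (19); the use of `splu` and `LinearOperator` is an assumption:

```python
import scipy.sparse.linalg as spla

def reduced_hessian(E, B, C, D):
    """Matrix-free LinearOperator for the dense reduced Hessian D - C E^{-1} B."""
    E_solve = spla.splu(E.tocsc()).solve      # factor E once, reuse per product
    n = D.shape[0]
    return spla.LinearOperator((n, n), matvec=lambda x: D @ x - C @ E_solve(B @ x))

# e.g. dm, info = spla.cg(reduced_hessian(E, B, C, D), -g_m)
```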

    The structure of the reduced Hessian in equation (21) allows an interesting alternative which still lets us use the reduced Hessian. Leeuwen and Herrmann (2013b) propose to use just the sparse part, $D$ (equation (23)). The accuracy of this approximation depends on the value of $\lambda$; a smaller value increases the accuracy. It must be noted that we are not primarily interested in how well $D$ approximates $D - CE^{-1}B$, but in how well the computed update direction using $D$ approximates the direction which would follow when $D - CE^{-1}B$ was used as the Hessian.

    In the single-parameter estimation problem, the use of this approximation has been shown to be approximately as powerful as the L-BFGS quasi-Newton method. Here we are primarily interested in the question whether the approximate Hessian properly balances the updates for the multiple parameters, rather than in the rate of convergence. We will rely on numerical experiments to show this.

    Variant 5: Reduced-space sparse block-diagonal Hessian approximation

    A further simplification of variant 4 is to approximate the Hessian even further by dropping the off-diagonal blocks of $D$. This uncouples the updates for the different parameter types and allows the updates to be computed in parallel. As in variant 3, this allows larger problems to be solved using a direct factorization method.

    Memory requirements

    One of the main drawbacks of algorithms which update the fields, rather than solving for them, is that all fields need to be in memory, compared to one or two fields for the other class of algorithms. In present-day parallel computing environments, things have changed a bit. When we solve for the fields, the objective function and its gradient can be computed in a running sum over the multiple experiments. That is why the memory requirements are so low. It is very common, however, to parallelize over frequencies: split the separable objective function over the multiple frequencies and compute one frequency per core/node. This is a form of trivially parallel computing with maximum efficiency. Memory-wise, the total storage of fields has then increased correspondingly. This is equivalent to the algorithm of variant 2, where the objective is not intrinsically separable, but where, by uncoupling the fields from the medium parameters, we achieve the same parallelism as in the variants where we solve for the fields. Concluding, variant 2 is an algorithm which updates the fields, but is memory-wise similar to algorithms which solve for the fields. Note that this is not possible in the full Newton method (variant 1), because computing updates for the fields in a parallel fashion would require communication between the solution processes for the medium parameter updates and the field updates.

    Known deficiencies of quadratic penalty methods in the PDE-constrained optimization setting

    Boundedness from below: in general, the objective of a quadratic-penalty formulation of an equality-constrained problem may not be bounded from below (Nocedal and Wright, 2000). In our case the objective is the sum of two norms, therefore the objective function value is bounded from below by $0$.

    Acknowledgements

    This work was financially supported in part by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN 261641-06) and the Collaborative Research and Development Grant DNOISE II (CRDPJ 375142-08). This research was carried out as part of the SINBAD II project with support from the following organizations: BG Group, BGP, BP, Chevron, ConocoPhillips, CGG, ION GXT, Petrobras, PGS, Statoil, Total SA, WesternGeco, Woodside.

    References

    Aravkin, A. Y., and Leeuwen, T. van, 2012, Estimating nuisance parameters in inverse problems: Inverse Problems, 28, 115016.

    Grote, M., Huber, J., and Schenk, O., 2011, Interior point methods for the inverse medium problem on massively parallel architectures: Procedia Computer Science, 4, 1466–1474. doi:http://dx.doi.org/10.1016/j.procs.2011.04.159

    Haber, E., Ascher, U. M., and Oldenburg, D., 2000, On optimization techniques for solving nonlinear inverse problems: Inverse Problems, 16, 1263–1280. doi:10.1088/0266-5611/16/5/309

    Leeuwen, T. van, and Herrmann, F. J., 2013a, Mitigating local minima in full-waveform inversion by expanding the search space: Geophysical Journal International, 195, 661–667. doi:10.1093/gji/ggt258

    Leeuwen, T. van, and Herrmann, F. J., 2013b, A penalty method for PDE-constrained optimization: UBC. Retrieved from https://www.slim.eos.ubc.ca/Publications/Private/Submitted/2013/vanLeeuwen2013Penalty2/vanLeeuwen2013Penalty2.pdf

    Métivier, L., Brossier, R., Virieux, J., and Operto, S., 2013, Full waveform inversion and the truncated Newton method: SIAM Journal on Scientific Computing, 35, B401–B437. doi:10.1137/120877854

    Nocedal, J., and Wright, S. J., 2000, Numerical optimization: Springer.

    Equations and algorithms

    The numbered equations and the algorithm listings referenced in the text are collected here.

    The acoustic first-order system in the frequency domain:

    $\partial_k p + i\omega\,\rho_{kr} v_r = f_k, \qquad \partial_r v_r + i\omega\,\kappa\, p = q.$  (1)

    The PDE-constrained data-fitting problem:

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad A(m)u = q.$  (2)

    The multi-experiment block structure:

    $u = \begin{pmatrix}u_1\\u_2\\\vdots\\u_k\end{pmatrix},\quad d = \begin{pmatrix}d_1\\d_2\\\vdots\\d_k\end{pmatrix},\quad q = \begin{pmatrix}q_1\\q_2\\\vdots\\q_k\end{pmatrix},\quad A = \begin{pmatrix}A_1&&&\\&A_2&&\\&&\ddots&\\&&&A_k\end{pmatrix},\quad P = \begin{pmatrix}P_1&&&\\&P_2&&\\&&\ddots&\\&&&P_k\end{pmatrix}.$  (3)

    The quadratic-penalty formulation and its objective:

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|A(m)u - q\|_2^2,$  (4)

    $\phi(m,u,\lambda) = \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|A(m)u - q\|_2^2,$  (5)

    in separable form

    $\phi(m,u,\lambda) = \sum_i \tfrac{1}{2}\|P_i u_i - d_i\|_2^2 + \tfrac{\lambda^2}{2}\|A_i(m)u_i - q_i\|_2^2,$  (6)

    and organized per angular frequency

    $\phi(m,u,\lambda) = \sum_{\omega_i} \tfrac{1}{2}\|P_{\omega_i}U_{\omega_i} - D_{\omega_i}\|_2^2 + \tfrac{\lambda^2}{2}\|A_{\omega_i}(m)U_{\omega_i} - Q_{\omega_i}\|_2^2,$  (7)

    with $U_{\omega_i} = (u_1\ u_2\ \cdots\ u_k)$ collecting the fields for frequency $\omega_i$.

    First-order necessary conditions:

    $\nabla_b\phi = \lambda^2 G_b^*(A(m)u - q) = 0,\qquad \nabla_\kappa\phi = \lambda^2 G_\kappa^*(A(m)u - q) = 0,\qquad \nabla_u\phi = P^*(Pu - d) + \lambda^2 A(m)^*(A(m)u - q) = 0,$  (8)

    where

    $G_b = \partial A(b,\kappa)u/\partial b, \qquad G_\kappa = \partial A(b,\kappa)u/\partial\kappa,$  (9)

    with the multi-experiment block structure

    $G = \begin{pmatrix}G_1\\G_2\\\vdots\\G_k\end{pmatrix}.$  (10)

    Second-order necessary condition (positive semidefiniteness of the Hessian):

    $\nabla^2\phi(m,u,\lambda) \succeq 0.$  (11)

    The problem with objective and constraints switched, and its Lagrangian:

    $\min_{m,u}\ \tfrac{1}{2}\|A(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d,$  (12)

    $L(b,\kappa,u) = \tfrac{1}{2}\|A(m)u - q\|_2^2 + \gamma^*(Pu - d).$  (13)

    The Newton system for objective (5):

    $\begin{pmatrix}\nabla^2_{u,u}\phi & \nabla^2_{u,m}\phi\\ \nabla^2_{m,u}\phi & \nabla^2_{m,m}\phi\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}\nabla_u\phi\\ \nabla_m\phi\end{pmatrix},$  (14)

    or, in more explicit form,

    $\begin{pmatrix}P^*P + \lambda^2 A^*A & \lambda^2(R_m + A^*G_m)\\ \nabla^2_{m,u}\phi & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 A^*(Au - q)\\ \lambda^2 G_m^*(Au - q)\end{pmatrix}.$  (15)

    Algorithm 1: Full Newton for the quadratic-penalty form.
    while not converged do
        1. form Hessian and gradient                      // form (14) (~free)
        2. solve (14) for the Newton direction $p_{gn}$   // solve
        3. find step length $\alpha$ using a line search  // evaluate $\phi$ (~free)
        4. $m = m + \alpha p_{gn}$                        // update model
    end

    The uncoupled system of variant 2:

    $\begin{pmatrix}P^*P + \lambda^2 A^*A & 0\\ 0 & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 A^*(Au - q)\\ \lambda^2 G_m^*(Au - q)\end{pmatrix},$  (16)

    whose Hessian approximation factors as

    $\begin{pmatrix}\begin{bmatrix}\lambda A\\ P\end{bmatrix}^* & 0\\ 0 & \begin{bmatrix}\lambda G_\kappa & \lambda G_b\end{bmatrix}^*\end{pmatrix}\begin{pmatrix}\begin{bmatrix}\lambda A\\ P\end{bmatrix} & 0\\ 0 & \begin{bmatrix}\lambda G_\kappa & \lambda G_b\end{bmatrix}\end{pmatrix}.$

    Algorithm 2: field/medium-parameter uncoupled Newton for the quadratic-penalty form.
    0. construct initial guess for the medium $m$ and for each field $u_i$
    while not converged do
        1. form Hessian and gradient                                     // form $\phi$ (~free)
        2. ignore the $\nabla^2_{u,m}\phi$, $\nabla^2_{m,u}\phi$ blocks  // approximate
        3. find $\delta m$ & each $\delta u_i$ in parallel               // solve
        4. find step length $\alpha$ using a line search                 // evaluate $\phi$ (~free)
        5. $m = m + \alpha\,\delta m$ & $u = u + \alpha\,\delta u$       // update model and fields
    end

    The block-diagonal system of variant 3:

    $\begin{pmatrix}\nabla^2_{u,u}\phi & 0 & 0\\ 0 & \nabla^2_{\kappa,\kappa}\phi & 0\\ 0 & 0 & \nabla^2_{b,b}\phi\end{pmatrix}\begin{pmatrix}\delta u\\ \delta\kappa\\ \delta b\end{pmatrix} = -\begin{pmatrix}\nabla_u\phi\\ \nabla_\kappa\phi\\ \nabla_b\phi\end{pmatrix}.$  (17)

    Algorithm 3: block-diagonal approximation of the Hessian for the quadratic-penalty form.
    while not converged do
        1. form Hessian and gradient      // form $\phi$ (~free)
        2. ignore the $\nabla^2_{u,\kappa}\phi$, $\nabla^2_{u,b}\phi$, $\nabla^2_{\kappa,u}\phi$, $\nabla^2_{b,u}\phi$, $\nabla^2_{b,\kappa}\phi$, $\nabla^2_{\kappa,b}\phi$ blocks   // approximate
        3. solve for the field updates in parallel & solve for the medium parameter updates to obtain $p$
        4. find step length $\alpha$ using a line search   // evaluate $\phi$ (~free)
        5. $m = m + \alpha p$             // update model
    end

    The field projection of variants 4 and 5: setting

    $\nabla_u\phi(m,u,\lambda) = P^*(Pu - d) + \lambda^2 A(m)^*(A(m)u - q) = 0$

    corresponds to the least-squares problem

    $\bar u = \arg\min_u \left\|\begin{pmatrix}\lambda A(m)\\ P\end{pmatrix}u - \begin{pmatrix}\lambda q\\ d\end{pmatrix}\right\|^2.$  (18)

    The partitioned Hessian and its block elimination:

    $\begin{pmatrix}\nabla^2_{u,u}\phi & \nabla^2_{u,\kappa}\phi & \nabla^2_{u,b}\phi\\ \nabla^2_{\kappa,u}\phi & \nabla^2_{\kappa,\kappa}\phi & \nabla^2_{\kappa,b}\phi\\ \nabla^2_{b,u}\phi & \nabla^2_{b,\kappa}\phi & \nabla^2_{b,b}\phi\end{pmatrix} = \begin{pmatrix}E & B\\ C & D\end{pmatrix},$  (19)

    $(D - CE^{-1}B)\begin{pmatrix}\delta\kappa\\ \delta b\end{pmatrix} = \begin{pmatrix}\nabla_\kappa\phi\\ \nabla_b\phi\end{pmatrix} - CE^{-1}\nabla_u\phi(\bar u,\kappa,b),$  (20)

    which, after substituting $\nabla_u\phi(m,\bar u,\lambda) = 0$, simplifies to

    $(D - CE^{-1}B)\begin{pmatrix}\delta\kappa\\ \delta b\end{pmatrix} = \begin{pmatrix}\nabla_\kappa\phi\\ \nabla_b\phi\end{pmatrix}.$  (21)

    The reduced objective:

    $\bar\phi(b,\kappa,\lambda) = \tfrac{1}{2}\|P\bar u - d\|_2^2 + \tfrac{\lambda^2}{2}\|A(m)\bar u - q\|_2^2.$  (22)

    The sparse part of the reduced Hessian:

    $D = \begin{pmatrix}\nabla^2_{\kappa,\kappa}\phi & \nabla^2_{\kappa,b}\phi\\ \nabla^2_{b,\kappa}\phi & \nabla^2_{b,b}\phi\end{pmatrix} = \lambda^2\begin{pmatrix}G_\kappa^*G_\kappa & G_\kappa^*G_b\\ G_b^*G_\kappa & G_b^*G_b\end{pmatrix}.$  (23)
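    A compact toy loop in the spirit of Algorithms 2 and 3, reusing the helper sketches from earlier (`penalty_objective`, `field_update`, `model_update`); `helmholtz` and `jacobian` are assumed callables returning $H(m)$ and $G = \partial(H(m)u)/\partial m$. Illustrative only: a single experiment and no convergence safeguards:

```python
def penalty_iterations(helmholtz, jacobian, m, u, P, q, d, lam, n_iter=20):
    for _ in range(n_iter):
        H = helmholtz(m)
        du = field_update(H, P, u, q, d, lam)     # parallel per experiment in general
        dm = model_update(jacobian(m, u), H, u, q)
        # backtracking line search on phi: evaluations are ~free (mat-vecs only)
        alpha, phi0 = 1.0, penalty_objective(H, P, u, q, d, lam)
        while alpha > 1e-4 and penalty_objective(helmholtz(m + alpha * dm), P,
                                                 u + alpha * du, q, d, lam) >= phi0:
            alpha *= 0.5
        m, u = m + alpha * dm, u + alpha * du
    return m, u
```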

  • FWI using very inaccurate iterative solver & LBFGS

    28

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • Full-space method of this talk using very inaccurate iterative solver

    29

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • Related work

    30

    The presented algorithm is a quadratic-penalty version of Lagrangian-based all-at-once algorithms:

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad H(m)u = q$

    $L(m,u,v) = \tfrac{1}{2}\|Pu - d\|_2^2 + v^*\big(H(m)u - q\big), \qquad G = \frac{\partial H(m)u}{\partial m}$

    solve (inexactly) at every iteration the Newton-KKT system:

    $\begin{pmatrix}\ast & \ast & G^*\\ \ast & P^*P & H^*\\ G & H & 0\end{pmatrix}\begin{pmatrix}\delta m\\ \delta u\\ \delta v\end{pmatrix} = -\begin{pmatrix}G^*v\\ H^*v + P^*(Pu - d)\\ Hu - q\end{pmatrix}$

    [E. Haber & U.M. Ascher, 2001; G. Biros & O. Ghattas, 2005; Grote et al., 2011]

  • Related work

    31

    Lagrangian-based full-space methods also store the multipliers + corresponding gradient & Hessian blocks.

    no intrinsic parallel structure

    $\begin{pmatrix}\ast & \ast & G^*\\ \ast & P^*P & H^*\\ G & H & 0\end{pmatrix}\begin{pmatrix}\delta m\\ \delta u\\ \delta v\end{pmatrix} = -\begin{pmatrix}G^*v\\ H^*v + P^*(Pu - d)\\ Hu - q\end{pmatrix}$

    number of field variables: $2 \times n_{\text{src}} \times n_{\text{freq}} \times n_{\text{grid}}$

    ($\ast$: higher-order terms)

    [E. Haber & U.M. Ascher, 2001; G. Biros & O. Ghattas, 2005; Grote et al., 2011]

  • reduced-space (FWI):
    • error in objective function value
    • error in gradient
    • error in Hessian

    full-space (this talk):
    • objective function value always exact
    • gradient always exact
    • Hessian always exact

    32

    Inexact PDE solves – full-space vs reduced-space

    error in medium parameter update

    globally convergent inexact Newton methods [S.C. Eisenstat & H.F. Walker, 1994]

  • Full vs Reduced-space

                                        Reduced-space                     Full-space
    Hessian, gradient &
    function evaluation: cost           solve PDEs                        ~free
    Hessian, gradient &
    function evaluation: accuracy       inexact                           exact
    Hessian                             dense                             sparse
    memory for fields                   2 fields per parallel process     working subset of simultaneous
                                                                          source fields in memory (can be
                                                                          distributed over nodes)
    working memory                      1 gradient & update direction     update directions & gradients
                                                                          in memory

    ~free = sparse matrix-vector products

    33

  • Conclusions

    Constructed a quadratic-penalty based full-space method which:

    • updates fields & medium parameters simultaneously
    • main computations are intrinsically parallel
    • suitable for frequency-domain waveform inversion with iterative solvers
    • con: need to store a working subset of simultaneous-source fields
    • but, less storage needed compared to Lagrangian full-space methods

    34

  • Acknowledgements

    This work was financially supported by SINBAD Consortium members BG Group, BGP, CGG, Chevron, ConocoPhillips, DownUnder GeoSolutions, Hess, Petrobras, PGS, Schlumberger, Statoil, Sub Salt Solutions and Woodside; and by the Natural Sciences and Engineering Research Council of Canada via NSERC Collaborative Research and Development Grant DNOISE II (CRDPJ 375142-08).

    Thanks  to  our  sponsors

  • 1. Eisenstat, Stanley C., and Homer F. Walker. "Globally convergent inexact Newton methods." SIAM Journal on Optimization 4.2 (1994): 393-422.

    2. Tristan van Leeuwen and Felix J. Herrmann, 3D frequency-domain seismic inversion with controlled sloppiness, SIAM Journal on Scientific Computing, 36 (2014), pp. S192–S217.

    3. B. Peters, F.J. Herrmann, T. van Leeuwen. Wave-equation based inversion with the penalty method: adjoint-state versus wavefield-reconstruction inversion. 76th EAGE Conference, 2014.

    4. M.J. Grote, J. Huber, and O. Schenk, Interior point methods for the inverse medium problem on massively parallel architectures, Procedia Computer Science, 4 (2011), pp. 1466–1474. Proceedings of the International Conference on Computational Science, ICCS 2011.

    5. Eldad Haber, Uri M. Ascher, and Doug Oldenburg, On optimization techniques for solving nonlinear inverse problems, Inverse Problems, 16 (2000), pp. 1263–1280.

    6. E. Haber and U.M. Ascher, Preconditioned all-at-once methods for large, sparse parameter estimation problems, Inverse Problems, 17 (2001), p. 1847.

    7. I. Epanomeritakis, V. Akcelik, O. Ghattas, and J. Bielak. A Newton-CG method for large-scale three-dimensional elastic full-waveform seismic inversion. Inverse Problems, 24(3):034015, June 2008.

    8. George Biros and Omar Ghattas, Parallel Lagrange–Newton–Krylov–Schur methods for PDE-constrained optimization. Part I: the Krylov–Schur solver, SIAM Journal on Scientific Computing, 27 (2005), pp. 687–713.

    9. R.E. Kleinman and P.M. van den Berg, A modified gradient method for two-dimensional problems in tomography, Journal of Computational and Applied Mathematics, 42 (1992), pp. 17–35.

    References

    36


  • 10. Åke Björck, Numerical methods for least squares problems, SIAM, 1996.

    11. Walter Gander, Least squares with a quadratic constraint, Numerische Mathematik, 36 (1980), pp. 291–307.

    References

    37