
  • University of British Columbia SLIM

    Bas Peters, Felix J. Herrmann

    Workshop W-12, The limit of FWI in subsurface parameter recovery. SEG 85th annual meeting, Friday, 23 October

    Simultaneous estimation of wavefields & medium parameters: reduced-space versus full-space waveform inversion

    Released to public domain under Creative Commons license type BY (https://creativecommons.org/licenses/by/4.0). Copyright (c) 2015 SLIM group @ The University of British Columbia.

  • Motivation

    Full-Waveform Inversion (FWI) works well when a good start model and/or low-frequency data is available.

    2

    $H(m) \in \mathbb{C}^{N\times N}$: discrete PDE; $m \in \mathbb{R}^{N}$: medium parameters; $P \in \mathbb{R}^{m\times N}$: selects the field at the receivers; $u \in \mathbb{C}^{N}$: field; $d \in \mathbb{C}^{m}$: observed data; $q \in \mathbb{C}^{N}$: source.

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$
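    A minimal sketch of evaluating this reduced objective on a toy 1-D Helmholtz problem; the grid, discretization and receiver layout below are illustrative assumptions, not the setup of the talk:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

N, h, freq = 200, 10.0, 5.0                   # grid points, spacing [m], frequency [Hz]
omega = 2.0 * np.pi * freq

def H(m):
    """Toy Helmholtz matrix H(m) = -Laplacian - omega^2 diag(m), m = slowness squared."""
    L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(N, N)) / h**2
    return (-L - omega**2 * sp.diags(m)).tocsc()

m_true = np.full(N, 1.0 / 2000.0**2)          # 2000 m/s background
q = np.zeros(N); q[10] = 1.0                  # point source
P = sp.eye(N, format="csr")[150:160, :]       # selects 10 receiver locations
d = P @ spla.spsolve(H(m_true), q)            # "observed" data

def fwi_objective(m):
    u = spla.spsolve(H(m), q)                 # one PDE solve per source term
    return 0.5 * np.linalg.norm(P @ u - d) ** 2

print(fwi_objective(m_true))                  # ~0 at the true model
```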

  • Motivation

    Full-Waveform Inversion (FWI) works well when a good start model and/or low-frequency data is available.

    If iterative solvers in the frequency domain are used, we cannot compute:

    $H(m)^{-1}q = u$

    Instead we obtain:

    $\hat{u} = H(m)^{-1}(q + r_u) = H(m)^{-1}r_u + u$

    3

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$

  • Motivation

    Full-Waveform Inversion (FWI) works well when a good start model and/or low-frequency data is available.

    If iterative solvers in the frequency domain are used, we cannot compute:

    $H(m)^{-1}q = u$

    Instead we obtain:

    $\hat{u} = H(m)^{-1}(q + r_u) = \underbrace{H(m)^{-1}r_u}_{\text{error}} + u$, with $r_u$ the solver residual.

    4

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$
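    The identity above is easy to verify numerically. A small sketch (toy matrix and names are assumptions) that solves one system with a loose GMRES tolerance and checks that the field error equals $H^{-1}r_u$:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

N = 500
H = sp.diags([1.0, -2.1, 1.0], [-1, 0, 1], shape=(N, N)).tocsc()  # toy "PDE" matrix
q = np.random.default_rng(0).standard_normal(N)

u = spla.spsolve(H, q)                        # accurate solve: u = H^{-1} q
u_hat, _ = spla.gmres(H, q, rtol=1e-2)        # sloppy solve ('tol=' on older SciPy)

r_u = H @ u_hat - q                           # residual the solver controls
print(np.allclose(u_hat - u, spla.spsolve(H, r_u)))  # True: field error = H^{-1} r_u
```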

  • Example

    Illustration of what happens when PDEs are solved inaccurately

    • 3-15 Hz
    • good start model
    • full-offset source & receiver array
    • noise-free data

    5

  • 6

    [Figure: true velocity model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • 7

    [Figure: initial velocity model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using very accurate iterative solver & LBFGS

    8

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using accurate iterative solver & LBFGS

    9

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using less accurate iterative solver & LBFGS

    10

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using inaccurate iterative solver & LBFGS

    11

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • FWI using very inaccurate iterative solver & LBFGS

    12

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • reduced-space (includes FWI):
    • error in objective function value
    • error in gradient
    • error in Hessian
    • error in medium parameter update
    [A. Tarantola, 1984; E. Haber et al., 2000; I. Epanomeritakis et al., 2008]

    • storage as low as two fields at a time
    • dense reduced Hessian
    • requires extra safeguards/accuracy control
    [T. van Leeuwen & F.J. Herrmann, 2014]

    13

    Inexact PDE solves

  • Goal

    This talk is about deriving an algorithm which:
    • allows for inexact solutions of linear systems
    • enjoys similar parallelism and memory requirements as FWI

    14

  • For the true medium parameters and true fields we know that:

    $H(m)u = q \qquad \& \qquad Pu = d$

    Many ways to use these equations to form:
    • objectives
    • constraints
    • algorithms

    Adjoint-state based FWI is just one algorithm.

    15

    Data, PDEs and constraints

  • 16

    Data, PDEs and constraints

    $\underbrace{\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2}_{\text{objective}} \quad \text{s.t.} \quad \underbrace{H(m)u = q}_{\text{constraint}}$

  • 17

    Data, PDEs and constraints

    $\underbrace{\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2}_{\text{objective}} \quad \text{s.t.} \quad \underbrace{H(m)u = q}_{\text{constraint}}$

    $\min_{m,u}\ \tfrac{1}{2}\|H(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d$

  • 18

    Data, PDEs and constraints

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad H(m)u = q$

    $\min_{m,u}\ \tfrac{1}{2}\|H(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d$

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$

  • 19

    Data, PDEs and constraints

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad H(m)u = q$

    $\min_{m,u}\ \tfrac{1}{2}\|H(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d$

    $\min_{m,u}\ \|H(m)u - q\|_2^2 \quad \text{s.t.} \quad \|Pu - d\|_2^2 \le \sigma$

    $\min_m\ \tfrac{1}{2}\|PH(m)^{-1}q - d\|_2^2 = \tfrac{1}{2}\|d_{\text{pred}}(m) - d_{\text{obs}}\|_2^2$


  • Multi-experiment structure:

    21

    $\begin{pmatrix}P_1&&&\\&P_2&&\\&&\ddots&\\&&&P_k\end{pmatrix}\begin{pmatrix}u_1\\u_2\\\vdots\\u_k\end{pmatrix} - \begin{pmatrix}d_1\\d_2\\\vdots\\d_k\end{pmatrix} \qquad\qquad \begin{pmatrix}H_1&&&\\&H_2&&\\&&\ddots&\\&&&H_k\end{pmatrix}\begin{pmatrix}u_1\\u_2\\\vdots\\u_k\end{pmatrix} - \begin{pmatrix}q_1\\q_2\\\vdots\\q_k\end{pmatrix}$

    $Pu - d \qquad\qquad H(m)u - q$

    $k = n_{\text{src}} \times n_{\text{freq}}$; $k \times N$ field parameters
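    A sketch of how this block structure can be assembled with SciPy; the per-experiment matrices below are random stand-ins for $H_i$ and $P_i$ with assumed shapes, for illustration only:

```python
import numpy as np
import scipy.sparse as sp

N, n_src, n_freq = 100, 4, 3
k = n_src * n_freq                                      # number of experiments

Hs = [sp.eye(N) + sp.random(N, N, density=0.05) for _ in range(n_freq)]
P1 = sp.eye(N, format="csr")[::10, :]                   # one receiver-sampling operator

H_blocks = [Hs[f] for f in range(n_freq) for _ in range(n_src)]  # H_i repeats per source
H_big = sp.block_diag(H_blocks, format="csc")                    # (k N) x (k N)
P_big = sp.block_diag([P1] * k, format="csr")

u = np.ones(k * N)
print(P_big.shape, (H_big @ u).shape)                   # residuals Pu - d, Hu - q act per experiment
```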

  • 22

    A quadratic-penalty based full space method

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|H(m)u - q\|_2^2$

    $\min_{m,u}\ \|H(m)u - q\|_2^2 \ \text{s.t.}\ \|Pu - d\|_2^2 \le \sigma$: the $\sigma$–$\lambda$ relation is known [W. Gander, 1980; Å. Björck, 1996]

    22
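    Given the current fields $u$ and model $m$, this objective needs only sparse matrix-vector products, no PDE solves. A minimal sketch (names follow the slide; inputs are assumed to be SciPy sparse matrices and NumPy vectors):

```python
import numpy as np

def penalty_objective(H, P, u, q, d, lam):
    """phi(m, u) = 1/2 ||P u - d||^2 + lam^2/2 ||H(m) u - q||^2, H already built at m."""
    r_data = P @ u - d
    r_pde = H @ u - q
    return 0.5 * np.vdot(r_data, r_data).real + 0.5 * lam**2 * np.vdot(r_pde, r_pde).real
```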

  • 23

    A quadratic-penalty based full space method

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|H(m)u - q\|_2^2$

    Newton's method:

    $\begin{pmatrix}P^*P + \lambda^2 H^*H & \nabla^2_{u,m}\phi\\ \nabla^2_{m,u}\phi & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 H^*(Hu - q)\\ \lambda^2 G_m^*(Hu - q)\end{pmatrix}$

    updates for the working subset of fields & medium parameters

    23

  • update fields & medium parameters simultaneously
    • function value, gradient & Hessian evaluation is ~free & exact
    • sparse Hessian
    • theory allows for inexact update computations
    • requires storage of working subset of fields + working memory (gradients, Hessian & update)
    • update computation is challenging

    24

    $\begin{pmatrix}P^*P + \lambda^2 H^*H & \nabla^2_{u,m}\phi\\ \nabla^2_{m,u}\phi & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 H^*(Hu - q)\\ \lambda^2 G_m^*(Hu - q)\end{pmatrix}$

    A quadratic-penalty based full space method

  • Approximate: block diagonal & positive (semi)definite
    • give up some of Newton's method properties
    • update computation intrinsically parallel per field
    • no need to form off-diagonal blocks

    philosophy: more & cheaper iterations

    25

    A quadratic-penalty based full space method

    $\begin{pmatrix}P^*P + \lambda^2 H^*H & 0\\ 0 & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 H^*(Hu - q)\\ \lambda^2 G_m^*(Hu - q)\end{pmatrix}$
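    A sketch of one decoupled update from this block-diagonal system. The field block $(P^*P + \lambda^2 H^*H)\,\delta u = -\nabla_u\phi$ is the normal-equations form of a sparse least-squares problem, so LSQR on the stacked operator $[\lambda H;\ P]$ is one stable way to solve it inexactly; $G$ is the Jacobian $\partial(H(m)u)/\partial m$. All function names are assumptions:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def field_update(H, P, u, q, d, lam, tol=1e-6):
    """Solve (P*P + lam^2 H*H) du = -grad_u phi via LSQR on [lam H; P]."""
    A = sp.vstack([lam * H, P]).tocsc()
    rhs = np.concatenate([lam * (q - H @ u), d - P @ u])
    return spla.lsqr(A, rhs, atol=tol, btol=tol)[0]     # inexact solves are allowed

def model_update(G, H, u, q, damp=1e-8):
    """Solve G*G dm = -G*(H u - q); lam^2 cancels. Damping guards semidefiniteness."""
    M = (G.conj().T @ G + damp * sp.eye(G.shape[1])).tocsc()
    return spla.spsolve(M, -(G.conj().T @ (H @ u - q)))
```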

  • Memory requirements

    save fields for the working subset of frequencies & sources

    can be distributed over multiple nodes

    feasible? need
    • parallel computing
    • simultaneous sources (redrawing is possible)
    • small frequency batches

    26

  • Notes on full-space quadratic-penalty methods for PDE-constrained optimization
    Bas Peters
    Seismic Laboratory for Imaging and Modeling (SLIM), University of British Columbia

    Abstract

    Introduction

    We investigate different ways to successfully estimate multiple parameters from PDEs, with a focus on the multi-parameter Helmholtz equation. The first-order coupled system of equations describing acoustic wave motion is given by equation (1).

    The pressure field is indicated by $p$, the vector force source term is $f_k$, where $k$ indicates the component, $v$ is the particle velocity and $q$ the scalar volume injection source term. The unknown parameters are the scalar buoyancy $b$ and the scalar compressibility $\kappa$. These are the parameters we want to estimate. In an acoustic medium, the compressibility, buoyancy and velocity ($c$) are related as $\kappa = b/c^2$.

    Throughout this paper, we will use the discretize-then-optimize formalism. This means equation (1) is the only continuous equation. After discretization of (1), we restrict ourselves to the case where there are only volume injection sources, $f_k = 0$, and rewrite the discrete version of equation (1) as a second-order partial differential equation (PDE). The only field in this equation is the pressure field. This is the Helmholtz equation in a lossless and isotropic medium. A toy discretization of this operator is sketched below.
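    A minimal sketch of assembling such a two-parameter operator on a 1-D grid, $A(b,\kappa) = \omega^2\,\mathrm{diag}(\kappa) - D^T\mathrm{diag}(b)\,D$ with a forward-difference $D$ and $b$ on the staggered grid; the discretization details are assumptions for illustration, not the one used in the paper:

```python
import numpy as np
import scipy.sparse as sp

def helmholtz_two_param(b, kappa, h, omega):
    """A(b, kappa) ~ omega^2 kappa p + div(b grad p); b lives on the n-1 cell edges."""
    n = kappa.size
    D = sp.diags([-1.0, 1.0], [0, 1], shape=(n - 1, n)) / h   # forward difference
    return (omega**2 * sp.diags(kappa) - D.T @ sp.diags(b) @ D).tocsc()

# e.g. a homogeneous medium with c = 2000 m/s and rho = 1000 kg/m^3:
n, h, omega = 200, 10.0, 2.0 * np.pi * 5.0
b = np.full(n - 1, 1.0 / 1000.0)                  # buoyancy = 1/rho
kappa = np.full(n, (1.0 / 1000.0) / 2000.0**2)    # kappa = b / c^2
A = helmholtz_two_param(b, kappa, h, omega)
```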

    To obtain estimates for the unknown parameters, we would like to match experimentally observed data $d$ to predicted data $Pu$ while the predicted field $u$ satisfies the PDE. $P$ is a linear operator which projects the predicted field to the receiver locations. This problem is of the general PDE-constrained optimization type of equation (2).

    $P$ is a linear projection operator onto the receiver locations, $A$ is the discretized two-parameter Helmholtz matrix, $u$ is the pressure wavefield, $q$ is the source term and $d$ is the measured data. The two parameters to be estimated are the buoyancy $b$ and the compressibility $\kappa$.

    This paper is concerned with solution methods for problem (2), specifically suitable for the situation where we want to estimate multiple parameters occurring in the same PDE. We explore methods based on a quadratic-penalty formulation of a constrained optimization problem. Most research on solving problem (2) uses a Lagrangian formulation. We will show that the penalty formulation has some interesting advantages and disadvantages compared to the Lagrangian formulation, and we will derive some algorithms which exploit the advantages of a penalty formulation but do not suffer too much from the disadvantages. We will not discuss details of the memory consumption and computational requirements of the proposed algorithms compared to Lagrangian-based methods. We do, however, show that the proposed algorithms are in the same ballpark as their Lagrangian counterparts, memory- and computation-wise.

    Memory and computational cost will be evaluated from the direct-solver point of view; QR and Cholesky factorizations will be used to solve the linear systems occurring in this paper. This does not imply we favor direct methods over iterative solvers, but it does change the entire point of view on what is computationally efficient and what is not. The fastest algorithms for PDE-constrained optimization may very well be a combination of direct and iterative solvers. Some remarks are made about challenges for iterative solvers and places where iterative solvers could be very efficient compared to direct solvers.

    Context, motivation and challenges

    In parameter estimation problems, we almost exclusively work with multi-experiment data. In the frequency domain, which we consider in this work, the multiple experiments originate from the source(s), which excite the fields from multiple locations and do so at multiple frequencies. The multiple frequencies may be due to exciting the fields multiple times, each time at a different frequency, or to Fourier transforming a time-domain signal. In the multi-experiment setting, the matrices and vectors in problem (2) obtain the block structure of equation (3),

    where the subscript indicates the experiment index. Solving problem (2) using the PDE in equation (1) is difficult for a number of reasons. Besides the usual numerical challenges of solving large linear and possibly ill-conditioned systems, limited availability of data and the ill-posedness of the inverse problem, there is the difficulty of having two parameters to be estimated, occurring in either a coupled first-order PDE system or a second-order PDE. Intuitively, this will introduce a certain 'trade-off' between (in this case) compressibility and buoyancy in the parameter estimation sense. This could also be described as non-uniqueness/uncertainty of the estimation process. This is a particularly problematic issue in this case. We can verify this with a simple example.

    Example 1:

    We generate 'observed' data in a model with a buoyancy and compressibility perturbation from the homogeneous background model, which serves as the initial guess. We invert this data using the most commonly used approach in geophysical exploration: eliminating the constraints in (2) and solving the unconstrained problem by using just the reduced gradient to minimize the objective with the L-BFGS algorithm (insert citations). (Gradient-descent type algorithms are also widely used in exploration geophysics.) The results show that we managed to fit over 95% of the data, but the estimated parameters are obviously nowhere near the true model. Moreover, only the compressibility is changed compared to the initial guess; the buoyancy is almost untouched. This example shows that by inverting (effectively) for only one of the two parameters, we are able to fit over 95% of the data anyway. This is very different in the single-parameter estimation problem. To show this, we repeat the previous exercise, but this time we generate data using only a perturbation in the compressibility and we only invert for the compressibility. Again, we are able to fit over 95% of the data, but this time the estimated model is very close to the true model.

    Example 1 explicitly shows that using just the gradient in a quasi-Newton scheme is not up to the task of estimating the parameters. We observe that the norms of the gradients of the objective, corresponding to the different parameters, are very different. This causes only one of the two parameters to be updated. One could simply scale the gradients by modifying the quasi-Newton or gradient-descent algorithm, but the question is: how to scale these two? Putting them on equal footing may sound reasonable, but this is still a completely arbitrary and unjustified choice. Scalings of this type, but inspired by the physics of wavefields, are used in (cite). This type of approach may work well, but it requires problem-dependent specification of the scaling. Another approach to mitigate the challenges of multi-parameter estimation is to use a hierarchical strategy: estimate one parameter while keeping the other fixed, and do this in an alternating fashion, possibly just once. This strategy is used in (cite tarantola). This can also yield satisfactory results, but it requires a lot of parameter tweaking and fine-tuning, to be repeated for every new problem. This strategy requires choosing which parameter to start with, and for how many iterations or down to which objective function value. Yet another way to mitigate the non-uniqueness and the different magnitudes of the gradients is the incorporation of a priori information into the problem. (meju gallardo) use so-called 'cross-gradient' regularization of the inverse problem. This requires the different estimated parameters to exhibit 'structural similarity', i.e., they have to vary in a similar fashion. A priori information would be very useful, but it has to be available first. This is often not the case, because we need a priori information about the relation between compressibility and buoyancy, not about each parameter separately (although this can be included as well). In the case of acoustic inverse problems with the earth as the medium, we actually know that the buoyancy and compressibility may sometimes vary in a similar way, but sometimes their relation can be completely different. (example bg model). There is also some literature available concerned with which parameterization to choose in multi-parameter seismic problems (cite). These works analyse the gradient direction of the reduced Lagrangian objective or partial derivatives of the PDE (the matrices $G$ of equation (9)). Based on those pieces of information, it is decided which parameterization (density-velocity, compressibility-density, etc.) should give the best parameter estimates. This is also purely focused on gradient-descent or quasi-Newton methods, just like the other approaches described above. This totally bypasses physical information which is available in the Hessian. See (pratt) for a physical interpretation of the Hessian in the reduced Lagrangian. From an optimization point of view, it is well known that Newton's method has the affine transformation invariance property (refs) and is in principle robust to ill-conditioned problems. Quasi-Newton and gradient-descent methods do not have these properties.

    Our approach is therefore to use the Hessian and approximations of the Hessian to estimate multiple parameters. We will evaluate whether the use of the Hessian allows simultaneous estimation of multiple parameters, without using prior information about the coupling of the parameters in the medium to be estimated, without hierarchical strategies and without manually rescaling the different parts of the gradient of the objective. We will thus focus on Newton and Gauss-Newton type algorithms.

    Penalty methods for PDE-constrained optimization

    A penalty formulation for solving problem (2) was presented in Leeuwen and Herrmann (2013b) and used for single-parameter seismic waveform inversion in Leeuwen and Herrmann (2013a). Earlier, a penalty formulation with a fixed rule for balancing the two terms was used in (cite modified gradient and contrast source). We use the same formulation, but now for two unknown parameters (equation (4)).

    Problem (4) is a combination of data misfit and PDE misfit, where the scalar $\lambda$ balances the two. It is important to note that the balancing between the two terms is the essence of a penalty method. Unlike a Lagrangian formulation of an equality-constrained problem, the penalty formulation for a fixed $\lambda$ treats the original objective to be minimized as well as the constraints in the same way. In other words, there is no longer a notion of what are constraints and what should be minimized. The value of $\lambda$ determines what will happen. A large $\lambda$ will provide results approaching the solutions to problem (2), while very small values of $\lambda$ will provide solutions to a problem similar to (2), but with objective and constraints switched. Of course, the standard algorithms for quadratic-penalty methods honour the notion of minimization objective and constraints by solving a sequence of problems of the form of (4), where each problem is solved with a different value of $\lambda$. This guarantees convergence to a solution of (2); see Theorem 17.1 in Nocedal and Wright (2000). In this paper we only consider a single value of $\lambda$, so we do not solve a sequence of problems, and see how well it works. The main reasons for this approach are promising results in (cite our papers and abstracts) using a single value of $\lambda$, as well as the cost of solving PDE-constrained optimization problems. For large-scale problems, solving once is already computationally expensive; solving a sequence of problems would seriously decrease the competitiveness of penalty-based methods.

    We can introduce the objective function for problem (4) as equation (5), which defines an unconstrained minimization problem. We can also write this objective in its separable form, equation (6), where $i$ is the index of the experiment number. This explicitly shows we can accumulate the objective in parallel as a running sum. This is not the case for the gradient and Hessian in general. Another way to rewrite this objective is in a block form, organized per angular frequency, as in equation (7), where $U_{\omega_i}$ collects the fields for frequency $\omega_i$. This structure is intuitive in case direct factorization methods are used for the solution of the linear systems arising in the optimization algorithms, because it groups all blocks in terms of the PDE discretization matrices per frequency. This form is also convenient in the simultaneous-source scenario, also known as randomized trace estimation (cite eldad, felix).

    The first-order necessary conditions for a local minimizer of this unconstrained problem are given in equation (8), where the partial derivative matrices are given by equation (9) and have the multi-experiment block structure of equation (10). The second-order necessary condition is, in addition to satisfying the first-order conditions, that the Hessian is positive semidefinite (equation (11)).

    Whereas we will use the first-order necessary conditions to design our optimization algorithms, the second-order necessary condition is used as a diagnostic.
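    For the toy discretization sketched earlier, $A(b,\kappa)u$ is linear in each parameter, so the matrices of equation (9) have closed forms: $G_\kappa = \omega^2\,\mathrm{diag}(u)$ and $G_b = -D^T\mathrm{diag}(Du)$. A sketch under that assumption (not the paper's discretization):

```python
import scipy.sparse as sp

def jacobians(u, D, omega):
    """G_kappa = d(Au)/d kappa, G_b = d(Au)/d b for A = omega^2 diag(kappa) - D^T diag(b) D."""
    G_kappa = (omega**2 * sp.diags(u)).tocsc()
    G_b = (-D.T @ sp.diags(D @ u)).tocsc()   # since diag(b) (D u) = diag(D u) b
    return G_kappa, G_b
```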

    With these standard optimality conditions defined, we can formulate several algorithms to minimize objective (5), each of which has different computational challenges, memory requirements, advantages and disadvantages.

    First, we return to the case of a sufficiently small $\lambda$: when there is enough emphasis on the term $\frac{1}{2}\|Pu - d\|_2^2$ to fit the observed data $d$ approximately, we essentially obtain an approximation to the problem with the objective and constraints switched, equation (12).

    For now we assume that the observed data is noise free, so fitting it exactly is no concern. In the realistic case with noise, the constraint should be relaxed to inexact data fitting. The motivation for bringing up this interpretation of the quadratic-penalty method for PDE-constrained optimization with a small $\lambda$ is to explain why we still use the quadratic-penalty form, rather than solve problem (12), or a related problem with inexact data-fit constraints, using a Lagrangian form. This Lagrangian form is equation (13),

    with $\gamma$ as the vector of Lagrange multipliers for the equality constraints. There are no obstacles which keep us from solving this constrained problem, but this form does pose some limitations on the algorithms that can be developed.

    The first-order necessary conditions for a minimum of the Lagrangian in (13) are $\nabla_u L = A(m)^*(A(m)u - q) + P^*\gamma = 0$, $\nabla_b L = G_b^*(A(m)u - q) = 0$, $\nabla_\kappa L = G_\kappa^*(A(m)u - q) = 0$ and $\nabla_\gamma L = Pu - d = 0$.

    Variant 1: Full Newton

    This variant is the classical Newton method to minimize objective (5). At each iteration we compute the step from the Newton system $H\,\delta x = -g$, with $g$ the gradient and $H$ the Hessian of $\phi$; this is equation (14). In more explicit form this iteration can be written as equation (15).

    This method can be seen as the quadratic-penalty equivalent of the Lagrangian 'all-at-once' method (cite), because this method updates both the medium parameters (compressibility and buoyancy) and the fields. This can be an advantage, because the problem has a lot of parameters to work with when minimizing the objective. Other main advantages are the availability of an exact gradient vector and exact Hessian matrix essentially for free: no computations (PDE solves) are required, contrary to reduced-space methods as described below and in (cite ghattas, haber, leeuwen). This immediately points at the disadvantages as well; to be able to update the fields, they have to be stored in memory. The feasibility of this is highly dependent on the number of grid points and the number of experiments. Even for modest-sized problems with a relatively small number of experiments, the Hessian will most likely be too large for direct factorization to compute the Newton step. Direct factorization is used by Grote et al. (2011) to solve interior-point systems originating from a Lagrangian approach to handle the constraints. Iterative methods need to be used when the required hardware is not available for direct factorization. This has been the topic of many papers (cite) related to preconditioning and solving the Newton system for the Lagrangian formulation (the KKT system). We are not aware of research dedicated to solving systems arising from the penalty form of the PDE-constrained optimization problem. Although some blocks of our Newton system are the same as in the KKT system for the Lagrangian form (see Haber), a main difference is the $\nabla^2_{u,u}\phi$ block. This block is the combination of the normal-equations form of a PDE system matrix and the linear operator selecting the receiver locations. This has a particularly unfavorable eigenvalue distribution and a roughly squared condition number. No literature has been published dedicated to solving this block on its own using iterative methods. In this work we will therefore not consider iterative solutions of the full Newton method, only direct factorizations on small problems. When efficient iterative preconditioning and solution methods are available, this Newton method can enjoy inexact solves of the Newton system to save a significant number of iterations. See Nocedal and Wright (2000) for some theorems about the required tolerance at every iteration to preserve a desired rate of convergence of the Newton method.

    The Newton iteration uses the solution of (14) as the update for fields and parameters. To ensure global convergence, this update can be used in a trust-region or line-search algorithm. Only line-search algorithms are considered in this work, without specifying the details or advocating the use of a line search over a trust-region approach.

    Variant 2: Uncoupling fields and medium parameters in Newton’s method

    This method gives up the rate of convergence of Newton's method, as well as possibly some robustness with regard to local minima. This last statement is very difficult to quantify in the non-convex setting of PDE-constrained optimization, especially when we are dealing with oscillatory fields. It requires numerical experiments to test whether this is true or not. Furthermore, theorems about the convergence rate of inexact Newton methods no longer hold.

    Uncoupling the fields and medium parameters amounts to setting to zero the Hessian blocks which couple medium parameters to fields. This leads to the Newton-type system (16), which needs to be solved at every iteration.

    This shows the benefits of this approximation of the Hessian. The update for the fields can now be computed in parallel: each block in the block-diagonal $\nabla^2_{u,u}\phi$ block can be factored separately. This algorithm allows the solution of the Newton system using direct solvers for problems a lot larger than variant 1. The approximation to the Hessian in this variant does not approximate the relation between the medium parameters directly, but only indirectly, because the removal of the coupling between fields and medium parameters does affect the Newton direction at every iterate. This Hessian approximation can be written a bit more explicitly in the factored form shown after equation (16),

    which reveals it is a Hermitian matrix, positive-semidefinite or positive-definite (depending on the rank of the factor) with real eigenvalues. The discretization of the PDE determines $G_\kappa$ and $G_b$, and thus the (semi-)definiteness of the Hessian approximation in this section. In case we use the discretization of (cite), $A$ is full rank and the factor is also full rank, with no linear dependence between the rows of $P$ and $A$. Because the update for the fields $\delta u$ and the medium parameter updates $\delta\kappa$, $\delta b$ are uncoupled, we can solve every linear system in the $\nabla^2_{u,u}\phi$ block in parallel. For the medium parameter updates we can use either a direct factorization or an iterative method. Which one is most suitable depends on a number of things: first of all the PDE itself, but also the mesh type (regular grid, grids with hard refinement, etc.) and the type of finite differences/finite elements which are used determine the bandwidth, symmetry properties, eigenvalue distribution and condition number. Therefore it is not possible to make a general recommendation about the solution method for the medium parameter updates.

    Variant 3: Block-diagonal Hessian approximation Newton-type method.

    In this variant we set all off-diagonal blocks of the Hessian corresponding to the full Newton step for objective (5) to zero. This gives the Hessian approximation of equation (17).

    In addition to the things we gave up to arrive at variant 2, we now also give up the coupling between the medium parameters. We only maintain the action of the Hessian blocks on each class of medium parameters separately. Note that this can be more than just a scaling. The buoyancy, for example, is inside the div-grad operator, giving the corresponding Hessian block a differential-operator structure. This can be seen in the section about the heuristics on the effects of the various Hessian approximations. The additional advantage compared to variant 2 is that even the updates for the medium parameters can be computed in parallel. This allows this variant to be used for even larger problems when relying on direct factorizations to solve for the field and medium parameter updates. By ignoring the blocks coupling the medium parameters, we reduce the bandwidth of the system to be solved for the medium parameter updates. This cuts back on the memory requirements due to fill-in during the factorization process.

    Variant 4: Reduced-space sparse Hessian approximation

    The main argument for this variant is its very limited memory use related to the storage of the fields corresponding to all the experiments. We depart from the variants which store and update the fields; instead we solve for the fields at every iteration of a Newton-type algorithm. This was presented in Leeuwen and Herrmann (2013b). The first step is to make sure our objective satisfies the first-order optimality condition with respect to the fields. To achieve this, we solve $\nabla_u\phi(m,u,\lambda) = 0$, which corresponds to the least-squares problem (18).

    In Leeuwen and Herrmann (2013b) it is also noted that this can be interpreted as a variational projection as in Aravkin and Leeuwen (2012). This least-squares form explicitly shows that the observed-data information is augmented to the PDE. The solution will satisfy this observed data if the trade-off parameter $\lambda$ is chosen sufficiently small. We proceed by applying block-Gaussian elimination to the full Newton system (14). To make notation simple, partition the Hessian from equation (15) as in equation (19).
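    A sketch of this field solve via the normal equations of (18), $(\lambda^2 A^*A + P^*P)\,\bar u = \lambda^2 A^*q + P^*d$, factored with a direct method; the names follow the paper, the solver choice is an assumption:

```python
import scipy.sparse.linalg as spla

def solve_field(A, P, q, d, lam):
    """u_bar = argmin_u || [lam A; P] u - [lam q; d] ||_2, via the normal equations."""
    E = (lam**2 * (A.conj().T @ A) + P.T @ P).tocsc()
    rhs = lam**2 * (A.conj().T @ q) + P.T @ d
    return spla.spsolve(E, rhs)
```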

    Eliminating the $E$ block results in equation (20), in which the matrix can be recognized as a Schur complement. When we substitute $\nabla_u\phi(m,\bar u,\lambda) = 0$, the above simplifies to equation (21).

    These are the reduced Hessian and reduced gradient for the reduced objective (22), where $\bar u$ is the solution to the least-squares problem (18). The value of the variable-projection interpretation is that it proves (insert theorem) that the stationary points of the reduced objective (22) are also stationary points of the original full objective (5). This does not imply that the solutions generated by minimizing each of the objectives will always be the same.

    This reduced objective shows one of the main disadvantages of reduced-space methods: function evaluations require the solution of least-squares problems. This is equivalent to the reduced-space objectives based on a Lagrangian, which require PDE solves for function evaluations (Haber et al., 2000). This is a problem in line-search or trust-region algorithms. To keep the computational cost under control in a backtracking line-search algorithm, it is common to limit the number of backtracking steps and to reduce the step size considerably to make sure a step is found quickly. This may cause the algorithm to achieve a sub-optimal rate of convergence. In variants 1, 2 and 3 this is not a problem, because function evaluations are cheap; they merely take some matrix-vector products. These variants should be able to find a near-optimal step length at every iteration of the Newton-type algorithm. A second major disadvantage of reduced-space methods is that the block elimination turned the large and sparse full Hessian into a dense and small reduced Hessian. In certain optimization problems this may be a favourable trade, but in the PDE-constrained optimization setting, especially when dealing with wave phenomena, the 'small' dense Hessian becomes too large to store in memory.

    One way to proceed is to compute matrix-free matrix-vector products with the dense Hessian $D - CE^{-1}B$ by solving a linear system with the $E$ block at every iteration of an iterative algorithm to compute the Newton-type step. The linear systems in the $E$ block can be solved using a direct factorization method, but the computation of the Newton-type step is then restricted to iterative methods. Therefore this method is only a feasible option if the Hessian is sufficiently well conditioned or has some useful eigenvalue clustering. This is unfortunately not the case here; an efficient preconditioner is required, but unavailable at this time. This is a significant disadvantage of the used formulation of the PDE-constrained problem using the quadratic-penalty function. It must be noted, however, that in the Lagrangian case where the reduced Hessian is used (Haber et al. (2000), Métivier et al. (2013), ghattas; see Haber et al., 2000 for the reduced Hessian based on a Lagrangian objective), the extra effort compared to quasi-Newton methods does not always pay off in the single-parameter waveform-inversion problem (Métivier et al., 2013). (ghattas) describes this method in the multi-parameter waveform inversion context, but does not show an example. It remains to be investigated how this method will perform in the multi-parameter case. The reduced Hessian in the Lagrangian setting differs from the one in the quadratic-penalty setting in the sense that the Lagrangian-based reduced Hessian requires a series of two linear-system solves to compute a matrix-free Hessian-vector product, whereas the penalty-based reduced Hessian requires only one. A series of linear systems is less suitable for inexact iterative methods, because the error from solving the first linear system is effectively multiplied by the approximate inverse of the second linear system. We will not investigate the use of the reduced Hessian in this work; a matrix-free product with it is sketched below for completeness.
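    A sketch of such a matrix-free product with $D - CE^{-1}B$: factor the sparse $E$ block once, then each Hessian-vector product costs one sparse forward/back solve. Block names follow equation (19); the use of `splu` and `LinearOperator` is an assumption:

```python
import scipy.sparse.linalg as spla

def reduced_hessian(E, B, C, D):
    """Matrix-free LinearOperator for the dense reduced Hessian D - C E^{-1} B."""
    E_solve = spla.splu(E.tocsc()).solve      # factor E once, reuse per product
    n = D.shape[0]
    return spla.LinearOperator((n, n), matvec=lambda x: D @ x - C @ E_solve(B @ x))

# e.g. dm, info = spla.cg(reduced_hessian(E, B, C, D), -g_m)
```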

    The structure of the reduced Hessian in equation (21) allows an interesting alternative which still lets us use the reduced Hessian. Leeuwen and Herrmann (2013b) propose to use just the sparse part, $D$ (equation (23)). The accuracy of this approximation depends on the value of $\lambda$; a smaller value increases the accuracy. It must be noted that we are not primarily interested in how well $D$ approximates $D - CE^{-1}B$, but in how well the computed update direction using $D$ approximates the direction which would follow when $D - CE^{-1}B$ was used as the Hessian.

    In the single-parameter estimation problem, the use of this approximation has been shown to be approximately as powerful as the L-BFGS quasi-Newton method. Here we are primarily interested in the question whether the approximate Hessian properly balances the updates for the multiple parameters, rather than in the rate of convergence. We will rely on numerical experiments to show this.

    Variant 5: Reduced-space sparse block-diagonal Hessian approximation

    A further simplification of variant 4 is to approximate the Hessian even further by dropping the off-diagonal blocks of $D$. This uncouples the updates for the different parameter types and allows the updates to be computed in parallel. As in variant 3, this allows larger problems to be solved using a direct factorization method.

    Memory requirements

    One of the main drawbacks of algorithms which update the fields, rather than solving for them, is that all fields need to be in memory, compared to one or two fields for the other class of algorithms. In present-day parallel computing environments, things have changed a bit. When we solve for the fields, the objective function and its gradient can be computed in a running sum over the multiple experiments. That is why the memory requirements are so low. It is very common, however, to parallelize over frequencies: split the separable objective function over the multiple frequencies and compute one frequency per core/node. This is a form of trivially parallel computing with maximum efficiency. Memory-wise, the total storage of fields has then increased correspondingly. This is equivalent to the algorithm of variant 2, where the objective is not intrinsically separable, but where, by uncoupling the fields from the medium parameters, we achieve the same parallelism as in the variants where we solve for the fields. Concluding, variant 2 is an algorithm which updates the fields, but is memory-wise similar to algorithms which solve for the fields. Note that this is not possible in the full Newton method (variant 1), because computing updates for the fields in a parallel fashion would require communication between the solution processes for the medium parameter updates and the field updates.

    Known deficiencies of quadratic penalty methods in the PDE-constrained optimization setting

    Boundedness from below: in general, the objective of a quadratic-penalty formulation of an equality-constrained problem may not be bounded from below (Nocedal and Wright, 2000). In our case the objective is the sum of two norms, therefore the objective function value is bounded from below by $0$.

    Acknowledgements

    This work was financially supported in part by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN 261641-06) and the Collaborative Research and Development Grant DNOISE II (CRDPJ 375142-08). This research was carried out as part of the SINBAD II project with support from the following organizations: BG Group, BGP, BP, Chevron, ConocoPhillips, CGG, ION GXT, Petrobras, PGS, Statoil, Total SA, WesternGeco, Woodside.

    References

    Aravkin, A. Y., and Leeuwen, T. van, 2012, Estimating nuisance parameters in inverse problems: Inverse Problems, 28, 115016.

    Grote, M., Huber, J., and Schenk, O., 2011, Interior point methods for the inverse medium problem on massively parallel architectures: Procedia Computer Science, 4, 1466–1474. doi:http://dx.doi.org/10.1016/j.procs.2011.04.159

    Haber, E., Ascher, U. M., and Oldenburg, D., 2000, On optimization techniques for solving nonlinear inverse problems: Inverse Problems, 16, 1263–1280. doi:10.1088/0266-5611/16/5/309

    Leeuwen, T. van, and Herrmann, F. J., 2013a, Mitigating local minima in full-waveform inversion by expanding the search space: Geophysical Journal International, 195, 661–667. doi:10.1093/gji/ggt258

    Leeuwen, T. van, and Herrmann, F. J., 2013b, A penalty method for PDE-constrained optimization: UBC. Retrieved from https://www.slim.eos.ubc.ca/Publications/Private/Submitted/2013/vanLeeuwen2013Penalty2/vanLeeuwen2013Penalty2.pdf

    Métivier, L., Brossier, R., Virieux, J., and Operto, S., 2013, Full waveform inversion and the truncated Newton method: SIAM Journal on Scientific Computing, 35, B401–B437. doi:10.1137/120877854

    Nocedal, J., and Wright, S. J., 2000, Numerical optimization: Springer.

    Equations and algorithms

    The numbered equations and the algorithm listings referenced in the text are collected here.

    The acoustic first-order system in the frequency domain:

    $\partial_k p + i\omega\,\rho_{kr} v_r = f_k, \qquad \partial_r v_r + i\omega\,\kappa\, p = q.$  (1)

    The PDE-constrained data-fitting problem:

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad A(m)u = q.$  (2)

    The multi-experiment block structure:

    $u = \begin{pmatrix}u_1\\u_2\\\vdots\\u_k\end{pmatrix},\quad d = \begin{pmatrix}d_1\\d_2\\\vdots\\d_k\end{pmatrix},\quad q = \begin{pmatrix}q_1\\q_2\\\vdots\\q_k\end{pmatrix},\quad A = \begin{pmatrix}A_1&&&\\&A_2&&\\&&\ddots&\\&&&A_k\end{pmatrix},\quad P = \begin{pmatrix}P_1&&&\\&P_2&&\\&&\ddots&\\&&&P_k\end{pmatrix}.$  (3)

    The quadratic-penalty formulation and its objective:

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|A(m)u - q\|_2^2,$  (4)

    $\phi(m,u,\lambda) = \tfrac{1}{2}\|Pu - d\|_2^2 + \tfrac{\lambda^2}{2}\|A(m)u - q\|_2^2,$  (5)

    in separable form

    $\phi(m,u,\lambda) = \sum_i \tfrac{1}{2}\|P_i u_i - d_i\|_2^2 + \tfrac{\lambda^2}{2}\|A_i(m)u_i - q_i\|_2^2,$  (6)

    and organized per angular frequency

    $\phi(m,u,\lambda) = \sum_{\omega_i} \tfrac{1}{2}\|P_{\omega_i}U_{\omega_i} - D_{\omega_i}\|_2^2 + \tfrac{\lambda^2}{2}\|A_{\omega_i}(m)U_{\omega_i} - Q_{\omega_i}\|_2^2,$  (7)

    with $U_{\omega_i} = (u_1\ u_2\ \cdots\ u_k)$ collecting the fields for frequency $\omega_i$.

    First-order necessary conditions:

    $\nabla_b\phi = \lambda^2 G_b^*(A(m)u - q) = 0,\qquad \nabla_\kappa\phi = \lambda^2 G_\kappa^*(A(m)u - q) = 0,\qquad \nabla_u\phi = P^*(Pu - d) + \lambda^2 A(m)^*(A(m)u - q) = 0,$  (8)

    where

    $G_b = \partial A(b,\kappa)u/\partial b, \qquad G_\kappa = \partial A(b,\kappa)u/\partial\kappa,$  (9)

    with the multi-experiment block structure

    $G = \begin{pmatrix}G_1\\G_2\\\vdots\\G_k\end{pmatrix}.$  (10)

    Second-order necessary condition (positive semidefiniteness of the Hessian):

    $\nabla^2\phi(m,u,\lambda) \succeq 0.$  (11)

    The problem with objective and constraints switched, and its Lagrangian:

    $\min_{m,u}\ \tfrac{1}{2}\|A(m)u - q\|_2^2 \quad \text{s.t.} \quad Pu = d,$  (12)

    $L(b,\kappa,u) = \tfrac{1}{2}\|A(m)u - q\|_2^2 + \gamma^*(Pu - d).$  (13)

    The Newton system for objective (5):

    $\begin{pmatrix}\nabla^2_{u,u}\phi & \nabla^2_{u,m}\phi\\ \nabla^2_{m,u}\phi & \nabla^2_{m,m}\phi\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}\nabla_u\phi\\ \nabla_m\phi\end{pmatrix},$  (14)

    or, in more explicit form,

    $\begin{pmatrix}P^*P + \lambda^2 A^*A & \lambda^2(R_m + A^*G_m)\\ \nabla^2_{m,u}\phi & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 A^*(Au - q)\\ \lambda^2 G_m^*(Au - q)\end{pmatrix}.$  (15)

    Algorithm 1: Full Newton for the quadratic-penalty form.
    while not converged do
        1. form Hessian and gradient                      // form (14) (~free)
        2. solve (14) for the Newton direction $p_{gn}$   // solve
        3. find step length $\alpha$ using a line search  // evaluate $\phi$ (~free)
        4. $m = m + \alpha p_{gn}$                        // update model
    end

    The uncoupled system of variant 2:

    $\begin{pmatrix}P^*P + \lambda^2 A^*A & 0\\ 0 & \lambda^2 G_m^*G_m\end{pmatrix}\begin{pmatrix}\delta u\\ \delta m\end{pmatrix} = -\begin{pmatrix}P^*(Pu - d) + \lambda^2 A^*(Au - q)\\ \lambda^2 G_m^*(Au - q)\end{pmatrix},$  (16)

    whose Hessian approximation factors as

    $\begin{pmatrix}\begin{bmatrix}\lambda A\\ P\end{bmatrix}^* & 0\\ 0 & \begin{bmatrix}\lambda G_\kappa & \lambda G_b\end{bmatrix}^*\end{pmatrix}\begin{pmatrix}\begin{bmatrix}\lambda A\\ P\end{bmatrix} & 0\\ 0 & \begin{bmatrix}\lambda G_\kappa & \lambda G_b\end{bmatrix}\end{pmatrix}.$

    Algorithm 2: field/medium-parameter uncoupled Newton for the quadratic-penalty form.
    0. construct initial guess for the medium $m$ and for each field $u_i$
    while not converged do
        1. form Hessian and gradient                                     // form $\phi$ (~free)
        2. ignore the $\nabla^2_{u,m}\phi$, $\nabla^2_{m,u}\phi$ blocks  // approximate
        3. find $\delta m$ & each $\delta u_i$ in parallel               // solve
        4. find step length $\alpha$ using a line search                 // evaluate $\phi$ (~free)
        5. $m = m + \alpha\,\delta m$ & $u = u + \alpha\,\delta u$       // update model and fields
    end

    The block-diagonal system of variant 3:

    $\begin{pmatrix}\nabla^2_{u,u}\phi & 0 & 0\\ 0 & \nabla^2_{\kappa,\kappa}\phi & 0\\ 0 & 0 & \nabla^2_{b,b}\phi\end{pmatrix}\begin{pmatrix}\delta u\\ \delta\kappa\\ \delta b\end{pmatrix} = -\begin{pmatrix}\nabla_u\phi\\ \nabla_\kappa\phi\\ \nabla_b\phi\end{pmatrix}.$  (17)

    Algorithm 3: block-diagonal approximation of the Hessian for the quadratic-penalty form.
    while not converged do
        1. form Hessian and gradient      // form $\phi$ (~free)
        2. ignore the $\nabla^2_{u,\kappa}\phi$, $\nabla^2_{u,b}\phi$, $\nabla^2_{\kappa,u}\phi$, $\nabla^2_{b,u}\phi$, $\nabla^2_{b,\kappa}\phi$, $\nabla^2_{\kappa,b}\phi$ blocks   // approximate
        3. solve for the field updates in parallel & solve for the medium parameter updates to obtain $p$
        4. find step length $\alpha$ using a line search   // evaluate $\phi$ (~free)
        5. $m = m + \alpha p$             // update model
    end

    The field projection of variants 4 and 5: setting

    $\nabla_u\phi(m,u,\lambda) = P^*(Pu - d) + \lambda^2 A(m)^*(A(m)u - q) = 0$

    corresponds to the least-squares problem

    $\bar u = \arg\min_u \left\|\begin{pmatrix}\lambda A(m)\\ P\end{pmatrix}u - \begin{pmatrix}\lambda q\\ d\end{pmatrix}\right\|^2.$  (18)

    The partitioned Hessian and its block elimination:

    $\begin{pmatrix}\nabla^2_{u,u}\phi & \nabla^2_{u,\kappa}\phi & \nabla^2_{u,b}\phi\\ \nabla^2_{\kappa,u}\phi & \nabla^2_{\kappa,\kappa}\phi & \nabla^2_{\kappa,b}\phi\\ \nabla^2_{b,u}\phi & \nabla^2_{b,\kappa}\phi & \nabla^2_{b,b}\phi\end{pmatrix} = \begin{pmatrix}E & B\\ C & D\end{pmatrix},$  (19)

    $(D - CE^{-1}B)\begin{pmatrix}\delta\kappa\\ \delta b\end{pmatrix} = \begin{pmatrix}\nabla_\kappa\phi\\ \nabla_b\phi\end{pmatrix} - CE^{-1}\nabla_u\phi(\bar u,\kappa,b),$  (20)

    which, after substituting $\nabla_u\phi(m,\bar u,\lambda) = 0$, simplifies to

    $(D - CE^{-1}B)\begin{pmatrix}\delta\kappa\\ \delta b\end{pmatrix} = \begin{pmatrix}\nabla_\kappa\phi\\ \nabla_b\phi\end{pmatrix}.$  (21)

    The reduced objective:

    $\bar\phi(b,\kappa,\lambda) = \tfrac{1}{2}\|P\bar u - d\|_2^2 + \tfrac{\lambda^2}{2}\|A(m)\bar u - q\|_2^2.$  (22)

    The sparse part of the reduced Hessian:

    $D = \begin{pmatrix}\nabla^2_{\kappa,\kappa}\phi & \nabla^2_{\kappa,b}\phi\\ \nabla^2_{b,\kappa}\phi & \nabla^2_{b,b}\phi\end{pmatrix} = \lambda^2\begin{pmatrix}G_\kappa^*G_\kappa & G_\kappa^*G_b\\ G_b^*G_\kappa & G_b^*G_b\end{pmatrix}.$  (23)
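    A compact toy loop in the spirit of Algorithms 2 and 3, reusing the helper sketches from earlier (`penalty_objective`, `field_update`, `model_update`); `helmholtz` and `jacobian` are assumed callables returning $H(m)$ and $G = \partial(H(m)u)/\partial m$. Illustrative only: a single experiment and no convergence safeguards:

```python
def penalty_iterations(helmholtz, jacobian, m, u, P, q, d, lam, n_iter=20):
    for _ in range(n_iter):
        H = helmholtz(m)
        du = field_update(H, P, u, q, d, lam)     # parallel per experiment in general
        dm = model_update(jacobian(m, u), H, u, q)
        # backtracking line search on phi: evaluations are ~free (mat-vecs only)
        alpha, phi0 = 1.0, penalty_objective(H, P, u, q, d, lam)
        while alpha > 1e-4 and penalty_objective(helmholtz(m + alpha * dm), P,
                                                 u + alpha * du, q, d, lam) >= phi0:
            alpha *= 0.5
        m, u = m + alpha * dm, u + alpha * du
    return m, u
```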

  • FWI using very inaccurate iterative solver & LBFGS

    28

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • Full-space method of this talk using very inaccurate iterative solver

    29

    [Figure: estimated model; x = 0–5000 m, z = 0–2000 m, velocity 1500–4500 m/s]

  • Related work

    30

    The presented algorithm is a quadratic-penalty version of Lagrangian-based all-at-once algorithms:

    $\min_{m,u}\ \tfrac{1}{2}\|Pu - d\|_2^2 \quad \text{s.t.} \quad H(m)u = q$

    $L(m,u,v) = \tfrac{1}{2}\|Pu - d\|_2^2 + v^*\big(H(m)u - q\big), \qquad G = \frac{\partial H(m)u}{\partial m}$

    solve (inexactly) at every iteration the Newton-KKT system:

    $\begin{pmatrix}\ast & \ast & G^*\\ \ast & P^*P & H^*\\ G & H & 0\end{pmatrix}\begin{pmatrix}\delta m\\ \delta u\\ \delta v\end{pmatrix} = -\begin{pmatrix}G^*v\\ H^*v + P^*(Pu - d)\\ Hu - q\end{pmatrix}$

    [E. Haber & U.M. Ascher, 2001; G. Biros & O. Ghattas, 2005; Grote et al., 2011]

  • Related work

    31

    Lagrangian-based full-space methods also store the multipliers + corresponding gradient & Hessian blocks.

    no intrinsic parallel structure

    $\begin{pmatrix}\ast & \ast & G^*\\ \ast & P^*P & H^*\\ G & H & 0\end{pmatrix}\begin{pmatrix}\delta m\\ \delta u\\ \delta v\end{pmatrix} = -\begin{pmatrix}G^*v\\ H^*v + P^*(Pu - d)\\ Hu - q\end{pmatrix}$

    number of field variables: $2 \times n_{\text{src}} \times n_{\text{freq}} \times n_{\text{grid}}$

    ($\ast$: higher-order terms)

    [E. Haber & U.M. Ascher, 2001; G. Biros & O. Ghattas, 2005; Grote et al., 2011]

  • reduced-space (FWI):
    • error in objective function value
    • error in gradient
    • error in Hessian

    full-space (this talk):
    • objective function value always exact
    • gradient always exact
    • Hessian always exact

    32

    Inexact PDE solves – full-space vs reduced-space

    error in medium parameter update

    globally convergent inexact Newton methods [S.C. Eisenstat & H.F. Walker, 1994]

  • Full vs Reduced-space

                                        Reduced-space                     Full-space
    Hessian, gradient &
    function evaluation: cost           solve PDEs                        ~free
    Hessian, gradient &
    function evaluation: accuracy       inexact                           exact
    Hessian                             dense                             sparse
    memory for fields                   2 fields per parallel process     working subset of simultaneous
                                                                          source fields in memory (can be
                                                                          distributed over nodes)
    working memory                      1 gradient & update direction     update directions & gradients
                                                                          in memory

    ~free = sparse matrix-vector products

    33

  • Conclusions

    Constructed a quadratic-penalty based full-space method which:

    • updates fields & medium parameters simultaneously
    • main computations are intrinsically parallel
    • suitable for frequency-domain waveform inversion with iterative solvers
    • con: need to store a working subset of simultaneous-source fields
    • but, less storage needed compared to Lagrangian full-space methods

    34

  • Acknowledgements

    This work was financially supported by SINBAD Consortium members BG Group, BGP, CGG, Chevron, ConocoPhillips, DownUnder GeoSolutions, Hess, Petrobras, PGS, Schlumberger, Statoil, Sub Salt Solutions and Woodside; and by the Natural Sciences and Engineering Research Council of Canada via NSERC Collaborative Research and Development Grant DNOISE II (CRDPJ 375142-08).

    Thanks  to  our  sponsors

  • 1. Eisenstat, Stanley C., and Homer F. Walker. "Globally convergent inexact Newton methods." SIAM Journal on Optimization 4.2 (1994): 393-422.

    2. Tristan van Leeuwen and Felix J. Herrmann, 3D frequency-domain seismic inversion with controlled sloppiness, SIAM Journal on Scientific Computing, 36 (2014), pp. S192–S217.

    3. B. Peters, F.J. Herrmann, T. van Leeuwen. Wave-equation based inversion with the penalty method: adjoint-state versus wavefield-reconstruction inversion. 76th EAGE Conference, 2014.

    4. M.J. Grote, J. Huber, and O. Schenk, Interior point methods for the inverse medium problem on massively parallel architectures, Procedia Computer Science, 4 (2011), pp. 1466–1474. Proceedings of the International Conference on Computational Science, ICCS 2011.

    5. Eldad Haber, Uri M. Ascher, and Doug Oldenburg, On optimization techniques for solving nonlinear inverse problems, Inverse Problems, 16 (2000), pp. 1263–1280.

    6. E. Haber and U.M. Ascher, Preconditioned all-at-once methods for large, sparse parameter estimation problems, Inverse Problems, 17 (2001), p. 1847.

    7. I. Epanomeritakis, V. Akcelik, O. Ghattas, and J. Bielak. A Newton-CG method for large-scale three-dimensional elastic full-waveform seismic inversion. Inverse Problems, 24(3):034015, June 2008.

    8. George Biros and Omar Ghattas, Parallel Lagrange–Newton–Krylov–Schur methods for PDE-constrained optimization. Part I: the Krylov–Schur solver, SIAM Journal on Scientific Computing, 27 (2005), pp. 687–713.

    9. R.E. Kleinman and P.M. van den Berg, A modified gradient method for two-dimensional problems in tomography, Journal of Computational and Applied Mathematics, 42 (1992), pp. 17–35.

    References

    36


  • 10. Åke Björck, Numerical methods for least squares problems, SIAM, 1996.

    11. Walter Gander, Least squares with a quadratic constraint, Numerische Mathematik, 36 (1980), pp. 291–307.

    References

    37