tarek el moselhy & youssef marzouk...tarek el moselhy & youssef marzouk massachusetts...

Tarek El Moselhy & Youssef Marzouk Massachusetts Institute of Technology

Department of Aeronautics & Astronautics

Inverse problems

•  Infer model parameters from indirect, noisy, and limited observations

•  Problems are often ill-posed and high-dimensional

•  Example: estimate subsurface properties from observations of pressure and transport (here, saturation)

Inverse problems



•  Example: infer electrochemical kinetic parameters and pathways from system-level data

Inverse problems



⇒ An essential step in predictive simulation: endow parameters and subsequent predictions with quantified uncertainties

•  Inversion as statistical inference — a Bayesian approach –  Model parameters represented by random variable x. Data d. –  Apply Bayes’ rule:

posterior density

likelihood function L()

prior density

Statistical inference

•  The posterior density is the full Bayesian solution to the inference problem –  Not just a single value for x, but a probability density –  A complete description of uncertainty –  An input to future simulations — prediction and experimental design

d x( ) p x d( )

p x d( ) = p d x( ) p x( )p d x( ) p x( )dx

Computational issues

•  Extracting information from the posterior –  Means, variances, higher moments; marginal distributions;

realizations; posterior predictions; decisions (expected utility) –  Posterior evaluations may be expensive (forward model = PDE) –  Parameter x may be high-dimensional

•  Usual approaches involve Markov chain Monte Carlo (MCMC) sampling; very useful, but… –  Generates a stream of correlated samples –  Proposal design is difficult; potential for poor mixing –  No clear convergence criteria –  Requires a large number of forward model evaluations—though

surrogate and reduced models (e.g., polynomial chaos, state projection) can help

–  Not recursive

An alternative viewpoint

forward model y(x) + error model

+ data d [i.e., likelihood]

prior knowledge

posterior knowledge

f

Z = f ( X )

“prior random variable”

“posterior random variable”

Z X d ~ d

Random variable transformation

Prior density p

Posterior density πd

transformation

X

X

Z

Z

Z = f X( )

Can we compute an appropriate map f ?

Properties of the map

•  Map “pushes forward” the prior measure to the posterior measure

•  Potential advantages and desiderata –  Generate arbitrary numbers of independent posterior samples, without

additional forward solves –  Clear convergence criterion –  Analytical expressions for posterior moments –  Computationally less expensive? Opportunities for parallelization? –  Propagate posterior through subsequent models? Apply recursively? –  Compute posterior normalizing constant (evidence or marginal

likelihood, for use in model selection)

Outline

1  Motivation and concept

2  Formulation

3  Solution methods

4  Numerical examples

Formulation

•  Some notation: –  Map is (we will discuss the functional form of f later)

–  Forward model is

–  Start with the posterior density as

d (z) =

L z;d( ) p z( )

L z;d( ) p z( )

y :n nobs

f :n n

•  Assume we know an invertible f such that •  Perform a transformation from the posterior to the prior to get a

probability density for X

Z = f X( )

Formulation

q(x) is a probability density for the prior random variable X, parameterized by f

Jacobian of the transformation q x; f( ) =

L f x( );d( ) p f x( )( )

detdfdx

(z) =

L z;d( ) p z( )

Formulation

•  But we already know the density of X, namely the prior

•  The transformed distribution should then satisfy

•  Find f(x) such that q is close to p –  For instance, minimize Kullback-Leibler (KL) divergence,

Hellinger distance, etc –  Some analogy with variational Bayes (though different!)

q x; f( ) = p x( )

p x( )

prior rv X

posterior rv Z

transformation πd

p q

Find transformation such that p = q

q

Formulation schematic

Formulation

•  Putting q close to p… –  Kullback-Leibler divergence

–  Thus T must be constant in x –  Same result holds true for Hellinger and other “distances” –  T = constant also obtained by pointwise equality p(x) = q(x), but

reveals the posterior normalizing constant β

–  As a byproduct of inference, we will calculate the evidence!

DKL p q( ) = log

p x( )q x( ) p x( )dx = 0 E exp T ( X ; f )( ) = exp E T ( X ; f ) ( )

where T x; f( ) log L f x( );d( )( ) + log p f x( )( )( ) + log det

dfdx

log p x( )( )

T x; f( ) = log L f x( );d( )( ) + log p f x( )( )( ) + log det dfdx

log p x( )( ) = log

Optimization problems

1.  Minimize variance

2.  Pointwise equality (T = constant in the L2 sense)

3.  Explicitly minimize KL-divergence

•  Note: expectation and variance are all with respect to the prior distribution p

T xi; f( ) = E T ( X ; f ) , i = 1…N

min

fVar T X ; f( )

min

fDKL p(x) q(x)( )

Example of T

•  In the case of a Gaussian prior (identity covariance) and additive Gaussian noise, the expression for T(x; f) is

T x; f( ) = 12

y f x( )( ) d( )T n1 y f x( )( ) d( ) 1

2f x( )T f x( ) + log det df

dx

+12

xTx

T x; f( ) = log L f x( );d( )( ) + log p f x( )( )( ) + log det dfdx

log p x( )( )

L f x( );d( ) = exp 12y f x( )( ) d( )T n1 y f x( )( ) d( )

p x( ) = exp 12xTx

Existence and uniqueness

•  In general, map exists but is not unique –  Example: linear forward model, additive Gaussian noise,

zero-mean Gaussian prior:

–  Posterior distribution is Gaussian and known in closed form:

–  Any affine transformation such that represents a valid map

–  L is not uniquely defined: Cholesky, matrix square root, etc

y(x) = Ax, d = y + , X ~ N (0, I ), ~ N (0,n )

z = I + ATn

1A( )1, µz = z AT n

1 dZ ~ N µz ,z( )

Z = f X( ) = µZ + LX

Existence and uniqueness

•  Connections with optimal transport theory [Caffarelli, McCann, … ]: –  Use distance minimization to guarantee existence and

uniqueness of an invertible map

–  Map is the gradient of a convex scalar function

•  Full formulation becomes, e.g.,

minfE f (X ) X( )T f (X ) X( )

minfVar T + E f (X ) X( )T f (X ) X( )

Implementation issues

•  Optimization problems –  For use Newton and variants

–  Take full advantage of forward adjoints and Hessians to compute derivates wrt degrees of freedom in f

–  Expectation/variance computed using samples from the prior

–  For T(xi) = constant again use prior samples –  Nonlinear least squares; again uses adjoint information

min

fVar T X ; f( )

Implementation issues

•  Represent f using an orthogonal polynomial expansion (e.g., Hermite chaos) –  Number of coefficients is equal to the dimension of the

parameter space times the number of coefficients

•  Evaluation of T can use efficient surrogate forward models y(z) –  Compute, for example, using prior gPC expansion…

•  Important open question: can we represent monotone functions (or convex functions) efficiently?

Vector of orthogonal polynomials

Matrix of unknown coefficients

Outline

1  Motivation and concept

2  Formulation

3  Solution methods

4  Numerical examples

Simple linear example

•  100 dimensional problem •  A is randomly generated •  Gaussian posterior

y x( ) = Ax, d = y +

X ~ N (0, I ), ~ N (0,n )

•  Start from identity map f(X) = X •  Convergence to exact solution in 12 iterations

A8100; y, d 8

D(p||q) and var(T) evidence p(d)

Reaction kinetics

•  Five late-time observations of A; truth is k1 = 1, k2 = 2

•  Gaussian prior •  Infer k1 and k2

convergence: DKL < 10-5

prior posterior

Reaction kinetics: map

•  7th order polynomial map •  Transformation Jacobian is positive definite except at a few

points at the tail of the prior

First component of posterior

Second component of posterior

Nonlinear PDE

•  Reaction-diffusion equation •  Gaussian prior covariance for log-

diffusivity, (Lc = 0.25, σ = 1.25)

p( ) = 50p 1 p( )

log 0( ) ~ GP 0,C( )

C r1,r2( ) = 2 exp r1 r2

2

2Lc2

•  3481 spatial elements, prior parameterized with 16 K-L modes, 20 observations at random locations

•  Third order map •  At solution: KL-divergence = 0.0075, Var[T] = 0.0136

p = 0 p = 1

p = 1

p = 0

unit square

Results de

t(df/d

x)

xiT xi

Marginal posterior of K-L mode weights

Results

Map captures posterior dependencies among K-L mode weights

Elliptic PDE with high-dim inputs

•  Elliptic PDE in two dimensions, 61x61 spatial grid •  Log-normal prior with an exponential covariance

kernel and large variance: 58 K-L modes, σ = 1.75, Lc=2

•  200 observation points •  Map up to 3rd order

p( ) = 0log 0( ) ~ GP 0,C( )

C r1,r2( ) = 2 exp r1 r2Lc

Posterior median and 0.1/0.9 quantiles

Elliptic PDE in high dimensions

true log-permeability

“gold-standard” MCMC (107 samples), 4.6 hours

yet effective sample size < 2500

map/optimization, 28 minutes


map/optimization = asterisk; MCMC = diamond; long chain MCMC = line

•  Compare map-inference to MCMC at equivalent wall-clock time

Conclusions

•  A new map-based approach to Bayesian inference –  Find a function that pushes forward the prior measure to the posterior –  Connections with optimal transport theory –  Full posterior now obtained by solving an optimization problem –  Clear convergence criterion –  Evidence computed “for free” –  Should be recursive and easily parallelizable

•  Many open issues… –  Favorable performance comparison with MCMC on ill-posed PDE-

based problems –  But, overall scaling with dimension, size/quality of data –  Efficient optimization approaches; better ways to parameterize the map

•  Support from DOE/Office of Advanced Scientific Computing Research (ASCR)

tarek el moselhy & youssef marzouk...tarek el moselhy & youssef marzouk massachusetts...

Documents