
MCMC Methods for Sampling Function Space

Alexandros Beskos and Andrew Stuart∗

Abstract. Applied mathematics is concerned with developing models with predictive capability, and with probing those models to obtain qualitative and quantitative insight into the phenomena being modelled. Statistics is data-driven and is aimed at the development of methodologies to optimize the information derived from data. The increasing complexity of phenomena that scientists and engineers wish to model, together with our increased ability to gather, store and interrogate data, mean that the subjects of applied mathematics and statistics are increasingly required to work in conjunction in order to significantly progress understanding.

This article is concerned with a research program at the interface between these two disciplines, aimed at problems in differential equations where profusion of data and the sophisticated model combine to produce the mathematical problem of obtaining information from a probability measure on function space. In this context there is an array of problems with a common mathematical structure, namely that the probability measure in question is a change of measure from a Gaussian. We illustrate the wide-ranging applicability of this structure. For problems whose solution is determined by a probability measure on function space, information about the solution can be obtained by sampling from this probability measure. One way to do this is through the use of Markov chain Monte-Carlo (MCMC) methods. We show how the common mathematical structure of the aforementioned problems can be exploited in the design of effective MCMC methods.

Mathematics Subject Classification (2000). Primary 35R30; Secondary 65C40.

Keywords. Bayes’s formula, inverse problem, change of measure from Gaussian, Langevin SPDEs, MCMC.

1. Introduction

The Bayesian approach to inverse problems is natural in many situations where data and model must be integrated with one another to provide maximal information about the system. When the object of interest is a function then the posterior measure from Bayes’s formula is a measure on a function space. In this article we introduce a range of applied problems where this viewpoint is natural, and which all possess a common mathematical framework: the posterior measure on function space, π, has density with respect to a Gaussian reference measure;

∗Zeeman Building, University of Warwick, Coventry CV4 7AL, UK


see Section 2. In Section 3 we describe a general approach for writing down π-invariant stochastic partial differential equations (SPDEs). It is important to be able to sample the posterior measure to get information about it. This is the topic of Section 4 where we introduce Markov chain Monte-Carlo (MCMC) methods and describe the Metropolis-Hastings variant. Section 5 contains statements of theoretical results concerning the complexity of these MCMC methods, when applied to (finite-dimensional approximations of) the target measures of interest in this article; proofs are contained in the Appendix. Section 6 contains a summary and directions for further research.

2. Measures on Function Space

In this section we give several illustrations of problems whose solution requires sampling of a measure on function space. For simplicity we confine our analysis to the situation where the functions are in a Hilbert space H. In all cases we will see that the target measure π has Radon-Nikodym derivative with respect to a reference Gaussian measure π0, so that we can write

(dπ/dπ0)(x) ∝ exp(−Φ(x)).   (1)

For future reference we will assume that π0 has mean m and covariance operator C. Adopting standard notation we will write π0 ∼ N(m, C). For expression (1) to make sense we require that the potential Φ : H → R is defined π0-almost surely. Informally¹, it is instructive to write the density for the Gaussian measure as

π0(x) ∝ exp( −(1/2)⟨x − m, C⁻¹(x − m)⟩ ).   (2)

The inverse of −C is known as the precision operator and will be denoted by L. Using this notation and combining (1) and (2) we get the following informal expression for the density π(x):

π(x) ∝ exp( −Φ(x) + (1/2)⟨x − m, L(x − m)⟩ ).   (3)

In many of our applications L will be a differential operator. Note that the density (3) is maximized at solutions of the equation

L(x − m) − DΦ(x) = 0.

This is a first hint at the difficulties inherent in sampling measures on function space: even locating places of high probability involves the solution of differential equations. Sampling the entire measure will typically be even more difficult.
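As a minimal finite-dimensional illustration of this point, the following Python sketch locates a maximizer of a discretized density of the form (3) by gradient ascent on the functional −Φ(x) + (1/2)⟨x − m, L(x − m)⟩, i.e. by iterating towards a solution of L(x − m) − DΦ(x) = 0; the matrix L, mean m, potential gradient and step size below are hypothetical stand-ins rather than quantities taken from any of the applications that follow.

```python
import numpy as np

def map_point(L, m, grad_Phi, x0, step=1e-2, iters=5000):
    """Gradient ascent on log pi(x) = -Phi(x) + 0.5*<x - m, L(x - m)>.

    L        : (n, n) symmetric negative-definite precision matrix (a discretized operator)
    m        : (n,) mean of the Gaussian reference measure
    grad_Phi : callable returning DPhi(x)
    """
    x = x0.copy()
    for _ in range(iters):
        g = L @ (x - m) - grad_Phi(x)      # gradient of log pi; zero at the mode
        x = x + step * g
    return x

# Toy stand-in: identity-like precision and a quartic potential Phi(x) = sum(x**4).
n = 50
L = -np.eye(n)
m = np.zeros(n)
grad_Phi = lambda x: 4.0 * x**3
x_map = map_point(L, m, grad_Phi, x0=np.ones(n))
```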

¹In finite dimensions formula (2) gives the density of a Gaussian measure N(m, C) with respect to Lebesgue measure. On a general Hilbert space there is no analogue of Lebesgue measure, so the formula should be viewed simply as a useful heuristic, which is helpful for understanding the ideas in this article. For economy of notation we use the symbol π for both a measure and its density.


2.1. Molecular Dynamics. In the mathematical description of molecules a commonly used model is that of Brownian dynamics, in which the atomic positions x undergo thermally activated motion in a potential V:

dx/dt = −∇V(x) + √(2/β) dB/dt.   (4)

Here x denotes the vector of atomic positions in R^{Nd} where N is the number of atoms and d the spatial dimension. The process B is a standard Brownian motion in R^{Nd} and β the inverse temperature. When the temperature is small (β ≫ 1) the solution of this stochastic differential equation (SDE) spends most of its time near the minima of the potential V. Transitions between different minima are then rare events. Simply solving the SDE starting from one of the minima will be a computationally infeasible way of generating sample paths which jump between minima, since the time to make a transition is exponentially large in β [12]. Instead we may condition on this rare event occurring. Let x± denote two minima of the potential and consider the boundary conditions

x(0) = x− and x(T) = x+.   (5)

If we now view the Brownian motion as a control, we see that it may be chosen to drive the solution of (4) from one minimum to the other. Since paths of Brownian motion carry a probability measure, which induces a measure on paths x, we have a mechanism to construct a probability measure on a function space of paths which respect (5). We now make these ideas more precise.

The probability measure π governing the stochastic boundary value problem (4), (5) has density with respect to the Brownian bridge (Gaussian) measure π0 arising in the case V ≡ 0. Girsanov’s theorem, together with Itô’s formula [17, 25], gives that

(dπ/dπ0)(x) ∝ exp( −(β/2) ∫_0^T G(x; β) dt )

where

G(x; β) = (1/2)|∇V(x)|² − (1/β)∆V(x).

We have thus established a particular instance of (1). It is useful conceptually to write the Brownian bridge probability density function with respect to an infinite dimensional Lebesgue measure, as is frequently done in the physics literature [7]; the desired expression, which may be found by discretization and passage to the limit (see [31] for example), is

exp( −(β/4) ∫_0^T |dx/dt|² dt ),

together with boundary conditions enforcing (5). The rigorous interpretation of this expression for π0 is that in this case the precision operator is L = (β/2) d²/dt², equipped with homogeneous Dirichlet boundary conditions on t ∈ [0, T], and the mean is the function

m(t) = (t/T) x+ + ((T − t)/T) x−.

Figure 1. Crystal lattice with vacancy. We condition on the red atom moving into the vacancy.

Thus, we may think of the probability density for π as being proportional to

exp( −(β/4) ∫_0^T |dx/dt|² dt − (β/2) ∫_0^T G(x; β) dt )

with the boundary conditions (5) enforced. This is an explicit example of the general structure (3).
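To make this structure concrete, the following Python sketch evaluates a straightforward grid discretization of the log-density above, namely −(β/4)∫|dx/dt|² dt − (β/2)∫G(x; β) dt, for a user-supplied potential; the scalar double-well potential, grid size and parameter values below are illustrative assumptions rather than the lattice potential of the application discussed next.

```python
import numpy as np

def bridge_log_density(x, dt, beta, dV, d2V):
    """Discretized log-density of the bridge measure for a path x on a grid of
    spacing dt (endpoints included): quadratic part plus the Girsanov potential
    G(x; beta) = 0.5*|V'(x)|^2 - (1/beta)*V''(x)."""
    kinetic = 0.25 * beta * np.sum(np.diff(x) ** 2) / dt      # (beta/4) int |dx/dt|^2 dt
    G = 0.5 * dV(x) ** 2 - d2V(x) / beta
    potential = 0.5 * beta * np.sum(G) * dt                   # (beta/2) int G dt
    return -(kinetic + potential)

# Illustrative scalar double-well V(x) = (x^2 - 1)^2, bridging x(0) = -1 to x(T) = +1.
dV = lambda x: 4.0 * x * (x**2 - 1.0)
d2V = lambda x: 12.0 * x**2 - 4.0
T, n, beta = 10.0, 200, 5.0
t = np.linspace(0.0, T, n)
x = -1.0 + 2.0 * t / T                                        # the mean function m(t)
print(bridge_log_density(x, dt=T / (n - 1), beta=beta, dV=dV, d2V=d2V))
```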

A typical application from molecular dynamics is illustrated in Figure 1. The figure shows a crystal lattice of atoms in two dimensions, with an atom removed from one site. The potential is a sum of pairwise potentials between atoms which has an r^{−12} repulsive singularity, r being the distance between a pair of atoms. The lattice should be viewed as spatially extended to the whole of Z² by periodicity. Removal of an atom creates a vacancy which, under thermal activation as in (4), will diffuse around the lattice: the vacancy will move to a different lattice site whenever one of the neighboring atoms moves into the current vacancy position. This motion of the vacancy is a rare event; we can now condition our model on this event occurring. The solution of such rare event problems arising in chemistry and physics is an active area of research. See [4] for an overview of the subject and [11] for an approach which is useful in the zero temperature limit or close to it.

In summary, we have defined a probability measure for x = x(t) in the Hilbert space H = L²([0, T], R^{Nd}) which we term the diffusion bridge measure. This measure describes the distribution of sample paths of the SDE (4) conditioned to link two points in phase space R^{Nd} within a specified time period, as in (5). Solving problems of this form has wide application, not only in chemistry and physics, but also in areas such as econometrics where it is frequently of interest to augment discrete time data driven by an SDE [5, 6].


2.2. Signal Processing. It is often of interest to identify an underlying signal {x(t)}_{0≤t≤T}, given some observation {y(t)}_{0≤t≤T}. In the context of SDEs this can be formulated via a pair of coupled equations:

dx/dt = f(x) + dB1/dt,   x(0) ∼ ζ,   (6)
dy/dt = g(x, y) + σ dB2/dt,   y(0) = 0.   (7)

The filtering problem [24] is to find, for each t ∈ [0, T], the probability distribution of x(t) ∈ R^m given y only at times up to t: {y(s)}_{0≤s≤t}. In contrast, the smoothing problem is to find the distribution of x(t) given all observations {y(s)}_{0≤s≤T}; the smoothing problem can be viewed as finding the probability measure on the entire path {x(s)}_{0≤s≤T}, conditioned on {y(s)}_{0≤s≤T}. The filtering and smoothing distributions on x(T) are the same but differ on x(t) for any t ∈ (0, T).

The smoothing problem can be formulated as determining a probability measure on L²([0, T], R^m) of the form

(dπ/dπ0)(x) ∝ exp( −∫_0^T G(x; y) dt )

where the observation y appears as fixed data in the probability measure for x. Here π0 is again a Gaussian measure, known as the Kalman-Bucy smoother, derived from the original problem in the case where f and g are set to zero and ζ is Gaussian. The inverse of the covariance operator is again a second order differential operator, as for the bridge diffusion in the previous example; details may be found in [15, 17]. Once again we have established a particular instance of the general framework (1). Figure 2 illustrates the set-up.

Figure 2. The upper panel shows the original signal (a sample path from (6)) together with the mean and standard deviation of the posterior measure on x given y (shaded band). The lower panel shows a single draw from the posterior measure on x given y. (The posterior measure is sampled using the SPDE defined later in Section 3.2.)
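For concreteness, the following Python sketch generates a synthetic signal and observation pair from an Euler-Maruyama discretization of (6)-(7); the drift functions f and g, the noise level σ and the initial law ζ below are illustrative choices and are not those used to produce Figure 2.

```python
import numpy as np

def euler_maruyama_pair(f, g, sigma, zeta_sample, T, n, rng):
    """Simulate the coupled SDEs (6)-(7) on [0, T] with n Euler-Maruyama steps."""
    dt = T / n
    x = np.empty(n + 1)
    y = np.empty(n + 1)
    x[0], y[0] = zeta_sample(rng), 0.0
    for k in range(n):
        x[k + 1] = x[k] + f(x[k]) * dt + np.sqrt(dt) * rng.standard_normal()
        y[k + 1] = y[k] + g(x[k], y[k]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x, y

# Illustrative double-well drift for the signal and a linear observation function.
rng = np.random.default_rng(0)
f = lambda x: x - x**3
g = lambda x, y: x
x_path, y_path = euler_maruyama_pair(f, g, sigma=0.2,
                                     zeta_sample=lambda r: r.standard_normal(),
                                     T=100.0, n=10_000, rng=rng)
```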

2.3. Lagrangian Data Assimilation. Understanding oceans is fundamental in the atmospheric and environmental sciences, and for both commercial and military purposes. One way of probing the oceans is by placing “floats” (at a specified depth) or “drifters” (on the surface) in the ocean and allowing them to act as Lagrangian tracers in the flow. These tracers broadcast GPS data concerning their positions which can be used to make inference about the oceans themselves. The natural mathematical formulation is that of an inverse problem. We derive such a formulation, providing at the same time a straightforward illustration of the Bayesian approach to inverse problems. In so doing we show that Lagrangian data assimilation is yet another example of a problem which inherits the structure (1).

As a concrete model of this situation we consider the incompressible forced Navier-Stokes equations written in the form:

∂v/∂t + v·∇v = ν∆v − ∇p + f,   (x, t) ∈ Ω × [0, ∞),
∇·v = 0,   (x, t) ∈ Ω × [0, ∞).

Here Ω is the unit square and ν the viscosity. Also, we impose periodic boundary conditions on the velocity field v and the pressure p. We assume that f has zero average over Ω; note that this implies the same for v(x, t), provided that we require that the initial velocity field u(x) = v(x, 0) has zero average.

Our objective is to find the initial velocity field u(x) ∈ H where H is here the Hilbert space found as the closure in L²(T², R²) of the space of periodic, divergence-free, smooth functions on T², with zero average. We assume that we are given noisy observations of Lagrangian tracers with position z solving

dz/dt = v(z, t).

The issue of minimal regularity assumptions on u and f so that Lagrangian tracers are well defined is discussed in [9]. For simplicity assume that we observe a single tracer z at a set of times {t_k}_{k=1}^K:

y_k = z(t_k) + ξ_k,   k = 1, . . . , K,

where the ξ_k’s are zero mean Gaussian random variables. Concatenating data we may write

y = z + ξ

where y = (y_1, . . . , y_K), z = (z(t_1), . . . , z(t_K)) and ξ ∼ N(0, Σ) for some covariance matrix Σ. Figure 3 illustrates the set-up, showing a snapshot of the flow field streamlines for v(x, t) and the tracer particles z(t) for some fixed time instance t.

We now construct the probability measure of interest, namely the probability of u given y. The first step is to assign a prior measure on u. We choose this to


Figure 3. An example configuration of the velocity field at a given time instance. The small circles correspond to a number of Lagrangian tracers.

be the Gaussian measure with mean zero and precision operator which is minus the square of the Stokes operator A on H [29]. We now condition this prior on the observations, to find the posterior measure on u. We observe that z is a (complicated) function G of u, the initial condition, so we may write

y = G(u) + ξ.

Thus the probability of y given u is

P(y | u) ∝ exp( −(1/2)|y − G(u)|²_Σ )

where |·|²_Σ = |Σ^{−1/2}·|² and |·| is the standard finite-dimensional Euclidean norm. By Bayes’s rule we deduce that

(dπ/dπ0)(u) ∝ exp( −(1/2)|y − G(u)|²_Σ )

where π0 is the prior Gaussian measure. We have now determined another example of the probability density structure (1). Informally we may write

π(u) ∝ exp( −(1/2)|y − G(u)|²_Σ − (1/2)‖Au‖²_H ),

where ‖·‖_H is the norm induced by the inner-product on H. This expression provides another example of the general structure (3). The model is a very simple one, but more realistic models, in complex geometries and for coupled evolution of velocity, temperature and other fields (and with multiple observations), have a similar mathematical structure.


2.4. Geophysics. An important problem in subsurface geophysical applications, of interest to both petroleum engineers and hydrologists, is the determination of subsurface properties, in particular the permeability field (also known as the hydraulic conductivity). Making direct subsurface measurements is hard, so the primary observation is via indirect measurements of flow and transport through the medium. The following model for this set-up is taken from [10, 22]. The forward problem contains two unknown scalar fields: the water saturation S (volume fraction of water in an oil-water mixture) and pressure p. We study the problem in a bounded open set Ω ⊂ R^d (typically d = 2 or 3). By means of Darcy’s law we define the velocity field v = −λ(S)K∇p, where K is a permeability tensor field and the scalar λ(S) determines the effect of saturation on permeability. In terms of the velocity field v, mass conservation and scalar advection respectively give the equations

−∇·v = h,   (x, t) ∈ Ω × [0, ∞),
∂S/∂t + v·∇f(S) = 0,   (x, t) ∈ Ω × [0, ∞).

Here h is a source term and f the flux function. Boundary conditions are given for the pressure, or its gradient in the normal direction, on ∂Ω. One way to understand the equations is as follows: Darcy’s law together with mass conservation determines p, given S; the scalar advection equation is then a non-local hyperbolic conservation law for S, for which boundary conditions are specified on the inflow boundary ∂Ω^in ⊂ ∂Ω. We set ∂Ω^out = ∂Ω \ ∂Ω^in. The initial condition for the saturation is S = 0 and the boundary condition on the inflow boundary is S = 1. In physical terms, the subsurface rock is assumed to be saturated entirely with oil at time t = 0, and water is then pumped in at the boundaries.

For simplicity we assume that the tensor K has the simple form K = kI, where k is the scalar permeability field. The inverse problem is to find the permeability field k from noisy measurements of what is known as the fractional flow or oil cut F(t), a measurement which quantifies the fraction of oil produced at the outflow boundary ∂Ω^out as water is pumped in through ∂Ω^in. Specifically

F(t) = 1 − ( ∫_{∂Ω^out} f(S) v_n dl ) / ( ∫_{∂Ω^out} v_n dl )

where v_n is the component of v normal to the boundary and dl denotes integration along the boundary. Assume that we make measurements of F at times {t_k}_{k=1}^K subject to Gaussian noise. So, the data are as follows:

y_k = F(t_k) + ξ_k,   k = 1, . . . , K,

where the ξ_k’s are zero mean Gaussian random variables. Concatenating data we may write

y = F + ξ

where y = (y_1, . . . , y_K), F = (F(t_1), . . . , F(t_K)) and ξ ∼ N(0, Σ) for some covariance matrix Σ encapsulating measurement errors.


Figure 4. A realization from the prior distribution on the log-permeability field.

It is physically and mathematically important that k be positive in order to ensure that the elliptic equation for the pressure is well-posed. We thus write k = exp(u) and consider the problem of determining u. We observe that F is a (complicated) function of u and so we may write

y = G(u) + ξ

as in the previous data assimilation application. The prior is a zero mean Gaussian measure on u, usually specified through a covariance function c(x, y) concerning which there is direct experimental information. The covariance operator C is defined by

(Cu)(x) = ∫_Ω c(x, y) u(y) dy.

Applying a zero mean Gaussian prior on u with this covariance operator gives rise to what is termed a log-normal permeability. We then have

(dπ/dπ0)(u) ∝ exp( −(1/2)|y − G(u)|²_Σ )

where π0 is the prior Gaussian measure N(0, C). This provides another explicit example of the structure (1). A typical sample from the prior distribution on a permeability field is shown in Figure 4.
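As an illustration of how such a prior might be realized numerically, the sketch below draws a sample of the log-permeability field on a regular grid by assembling the covariance matrix from a covariance function c(x, y) and applying its Cholesky factor to white noise; the exponential covariance and grid used here are hypothetical stand-ins for whatever experimentally determined c(·, ·) is available.

```python
import numpy as np

def sample_log_permeability(grid_points, c, rng, jitter=1e-10):
    """Draw u ~ N(0, C) at the given points, with C_ij = c(x_i, x_j)."""
    n = grid_points.shape[0]
    C = np.array([[c(grid_points[i], grid_points[j]) for j in range(n)]
                  for i in range(n)])
    chol = np.linalg.cholesky(C + jitter * np.eye(n))   # jitter for numerical stability
    return chol @ rng.standard_normal(n)

# Illustrative exponential covariance on a 20 x 20 grid over the unit square.
rng = np.random.default_rng(1)
xs = np.linspace(0.0, 1.0, 20)
grid = np.array([(x1, x2) for x1 in xs for x2 in xs])
c = lambda p, q: np.exp(-np.linalg.norm(p - q) / 0.2)
u = sample_log_permeability(grid, c, rng)
k = np.exp(u)                                           # the log-normal permeability field
```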

For this problem the precision operator L is not necessarily a differential operator; in fact, it is typically a non-local operator. Informally we may write the desired probability via a density of the form

π(u) ∝ exp( −(1/2)|y − G(u)|²_Σ − (1/2)‖(−L)^{1/2} u‖²_{L²(Ω)} ),

providing another explicit example of the structure (3). We have only described a simplistic model. However, more involved models, in complex geometries and for the coupled evolution of multiple phases of oil, water and gas, with multiple injection and production sites (for generation and measurement of data), and determination of a tensor permeability, share a similar mathematical structure.


2.5. Mathematical Structure of the Posterior Measure. The examples given in this section suggest a general approach to the rigorous mathematical formulation of a range of problems defined through a probability measure on function space. A fundamental step in such a formulation is a choice of prior measure π0 for which Φ in (1) is π0-measurable and the Radon-Nikodym derivative (1) is π0-integrable. In the data assimilation application this is intimately connected with the question of determining sufficient regularity on the initial velocity field u so that Lagrangian tracers are well-defined. Similarly, in the geophysics application, it is necessary to specify sufficient regularity on the log-permeability field to ensure that the coupled equations for pressure and water saturation have a unique solution. The regularity of samples from a Gaussian measure on function space can be understood in terms of the rate of decay of eigenvalues of the covariance operator C, via the Karhunen-Loève expansion. In this context it is natural in many applications to specify C through a precision operator L = −C⁻¹ which is a differential operator, as then the full power of spectral theory for differential equations can be used. In many applications the primary role of the prior measure will indeed be to specify regularity information. However in the geophysics application the situation is somewhat different as there exists direct experimental evidence concerning the covariance function c(·, ·) which must also be combined with regularity issues to determine the prior. In both the data assimilation and geophysics applications this complete rigorous mathematical formulation is not carried out in this article, but is left for future study. It is our belief that there are a wide range of problems which will benefit from such an analytical investigation. A rigorous formulation of the first two examples, from molecular dynamics and signal processing, is undertaken in [17].
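The Karhunen-Loève expansion mentioned above also gives a practical route to sampling the Gaussian reference measure itself: truncate the expansion after n terms and draw the coefficients independently. A minimal Python sketch is given below; it assumes the leading eigenpairs of C are available in closed form, and uses those of the Brownian bridge covariance on [0, 1] purely as a concrete example.

```python
import numpy as np

def kl_sample(mean, eigvals, eigfuns, t, rng):
    """Truncated Karhunen-Loeve draw: m(t) + sum_i sqrt(mu_i) xi_i phi_i(t), xi_i ~ N(0, 1)."""
    xi = rng.standard_normal(len(eigvals))
    return mean(t) + sum(np.sqrt(mu) * x * phi(t)
                         for mu, x, phi in zip(eigvals, xi, eigfuns))

# Brownian bridge on [0, 1]: mu_i = (i*pi)^{-2}, phi_i(t) = sqrt(2) sin(i*pi*t).
n_terms = 100
eigvals = [(i * np.pi) ** -2 for i in range(1, n_terms + 1)]
eigfuns = [lambda t, i=i: np.sqrt(2.0) * np.sin(i * np.pi * t) for i in range(1, n_terms + 1)]
t = np.linspace(0.0, 1.0, 500)
sample = kl_sample(lambda s: np.zeros_like(s), eigvals, eigfuns, t, np.random.default_rng(2))
```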

3. Langevin Stochastic PDEs

Underpinning the probability measure π on H given by (1) is a stochastic partial differential equation for which π is invariant. This is an infinite dimensional Langevin equation. In terms of the precision operator L = −C⁻¹ this Langevin equation may be written as an SDE on Hilbert space with the form

dx/ds = L(x − m) − DΦ(x) + √2 dW/ds   (8)

where W is an H-valued Brownian motion. This equation is written down in [16] and can be given a rigorous interpretation in many concrete situations: see [15, 17]. It corresponds to a noisy gradient flow for the functional found as the logarithm of the formal expression (3) for the probability density on function space. We now give several explicit instances of this Langevin SPDE connected with the examples introduced in the previous section. For general background concerning SPDEs see [8, 13, 30].

3.1. Molecular Dynamics. In the bridge diffusion case of (4), (5), arising in Brownian dynamics models of thermally activated atomic motion, the Langevin equation takes the form

∂x/∂s = (β/2) ∂²x/∂t² − (β/2) ∇G(x; β) + √2 ∂W/∂s,   (9a)
x(0, s) = x− and x(T, s) = x+,   (9b)
x(t, 0) = x0(t).   (9c)

Here the last term in (9a) is space-time white noise. This SPDE is derived in [17, 25, 31]. Notice that t, the spatial variable in the SPDE, represents the real time in (4) whereas s, the time-like variable in the SPDE, is an artificial “algorithmic” time.
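A minimal sketch of how (9a)-(9c) might be discretized is given below, using second-order finite differences in t and an explicit Euler step of size ds in the algorithmic time s (so that stability requires ds to be small relative to dt²); the scalar double-well potential, step sizes and iteration count are illustrative assumptions.

```python
import numpy as np

def spde_step(x, dt, ds, beta, grad_G, rng):
    """One explicit Euler step in s for the SPDE (9a), holding the Dirichlet
    boundary values (9b) fixed; the forcing is discretized space-time white noise."""
    lap = (x[2:] - 2.0 * x[1:-1] + x[:-2]) / dt**2                    # d^2 x / dt^2
    drift = 0.5 * beta * lap - 0.5 * beta * grad_G(x[1:-1])
    noise = np.sqrt(2.0 * ds / dt) * rng.standard_normal(x.size - 2)
    x_new = x.copy()
    x_new[1:-1] += ds * drift + noise
    return x_new

# Illustrative scalar double-well V(x) = (x^2 - 1)^2, bridging -1 to +1.
beta, T, n = 5.0, 10.0, 200
dt, ds = T / n, 1e-4
dV = lambda x: 4.0 * x * (x**2 - 1.0)
d2V = lambda x: 12.0 * x**2 - 4.0
d3V = lambda x: 24.0 * x
grad_G = lambda x: dV(x) * d2V(x) - d3V(x) / beta                     # derivative of G(x; beta)
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, n + 1)                                     # initial path x0(t)
for _ in range(10_000):
    x = spde_step(x, dt, ds, beta, grad_G, rng)
```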

3.2. Signal Processing. In the signal processing case the objective is to sample a path of x from (6) given a single realization of the observation y from (7). The SPDE which is invariant with respect to this conditional distribution of x is as follows:

∂x/∂s = ∂²x/∂t² − (∇f(x) − ∇f(x)^T) ∂x/∂t − ∇_x F(x) + √2 ∂W/∂s
         + dg(x, y)^T (σσ^T)⁻¹ ( dy/dt − g(x, y) ) − (1/2) ∇_x ( ∇_y · g(x, y) ),

∂x/∂t = f(x) − ∇_x ln ζ(x) at t = 0,   ∂x/∂t = f(x) at t = T,

x = x0 at s = 0.

Here

F(x) = (1/2)|f(x)|² + (1/2) ∇·f(x).

This SPDE is derived in [17, 31].

3.3. Lagrangian Data Assimilation. Recall that in this case we take L to be minus the square of the Stokes operator and

Φ(u) = (1/2)|y − G(u)|²_Σ

where G maps the initial data for the velocity field into the positions of a Lagrangian tracer. We have

DΦ(u) = −DG(u)^T Σ⁻¹ (y − G(u)).

Note that DG requires knowledge of the derivative of the Navier-Stokes equations with respect to initial data. The Langevin stochastic PDE is

∂u/∂s = −ν²∆²u − ∇p + DG(u)^T Σ⁻¹ (y − G(u)) + √2 ∂W/∂s,   (11a)
∇·u = 0,   (11b)

together with periodic boundary conditions on Ω and a divergence free initial condition. Here the last term in (11a) is space-time white noise in H. As in the Navier-Stokes equations themselves, the pressure p is a Lagrange multiplier which acts to enforce the incompressibility condition.


3.4. Geophysics. The geophysical application and the Lagrangian data assimilation problem share a common mathematical structure, with the exception of the choice of the precision operator L. Consequently the Langevin stochastic differential equation in this case is

∂u/∂s = Lu + DG(u)^T Σ⁻¹ (y − G(u)) + √2 ∂W/∂s   (12)

together with an initial condition. Here G maps the log-permeability into the fractional flow at the boundary, hence its derivative will be a complex object. The operator L is not necessarily a differential operator in this application: it may be a non-local operator. So, in this case equation (12) is not necessarily an SPDE.

3.5. Mathematical Structure of the Langevin Equation. Many outstanding questions remain concerning the rigorous formulation of the above Langevin SDEs. Such questions have been resolved for the bridge diffusion measure arising in the molecular dynamics example in [17], and for the signal processing problem for some limited choice of vector fields (f, g): the pair should be the sum of a linear function plus a gradient [17]. For the general signal processing problem there are still open questions [16]. Similarly, checking that the SPDEs for data assimilation and for the geophysics application are well-posed remains an open question. As we will see, discretizations of the Langevin SPDE provide good proposals for MCMC methods, and in this context development of the rigorous underpinnings of the subject revolves around showing that the MCMC methods can be defined on function space. Doing so is intimately bound up with the construction of efficient MCMC methods, as shown in [2]. It is to the subject of MCMC methods that we now turn.

4. Metropolis-Hastings Methods

We have illustrated that a wide range of problems can be written in a single unifying framework: that of a probability measure on Hilbert space with Radon-Nikodym derivative with respect to a Gaussian measure. Formulating the problems in this way is, of course, simply the first step in their resolution. The second step is to develop methods to interrogate the probability measure and thereby extract information from it. In practice we must discretize the function space (via finite differences, finite elements or spectral methods for example), leading to a high dimensional measure on R^n with n ≫ 1. Sampling probability measures in high dimensions is notoriously hard. A generic approach to sampling that has seen spectacular success in recent years is the Markov chain Monte-Carlo (MCMC) methodology [21, 26]. A particular variant of this approach, which we will employ for our problems, is the Metropolis-Hastings method [23, 19]. In the next section we overview the analysis of such algorithms, when applied to measures arising from discretization of the structure (1), and show how our set-up fits into a broader context concerning the analysis of Metropolis-Hastings methods in high or infinite dimensions. In this section we give the necessary background concerning the MCMC methodology.

We start by discussing a variety of forms of target measure that have been studied in the literature, introducing a hierarchy of increasing complexity which eventually leads to discretizations of (1). We then explain how the Metropolis-Hastings method works in general, illustrating that the key tunable parameters arise through the choice of the proposal distribution. Finally, we introduce a range of proposal distributions appropriate for sampling measures such as (1) and its discretizations.

4.1. Structure of the Target. The following hierarchy of target measures will be central in our discussion of the computational complexity of Metropolis-Hastings methods in high dimensions.

• IID Product in R^n. The earliest attempts to understand the behaviour of MCMC methods in R^n, n ≫ 1, concentrated on measures of product form in which each component is independent and identically distributed with density proportional to f (see [14] and references therein to the physics literature which preceded that work). Clearly, such measures are not intrinsically high dimensional as only one component need be sampled accurately to determine the entire measure. However the Metropolis-Hastings algorithm couples the different components, through the proposal, so study of these measures does provide an interesting starting point for analysis of MCMC methods in high dimensions. The structure of the target distribution π is now

π(x) = Π_{i=1}^n f(x_i).

• Scaled Product in R^n. An interesting variant of the IID product is the case where independence is retained but the independent components are no longer identical. Specifically they are all derived by scaling a single measure on R with density f. The target measure is now

π(x) = Π_{i=1}^n (1/λ_i) f(x_i/λ_i).

Assuming for simplicity that the measure on R has mean 0 and unit variance, the variance of each component is λ_i².

• Change of Measure from Product in R^n. Product measures are intrinsically limiting for applications. Change of measure from product is a far more general setting. We will now consider target measures of the form

π(x) ∝ exp(−Φ_n(x)) Π_{i=1}^n (1/λ_i) f(x_i/λ_i).   (13)

Here we allow for dependency among the different components of x via the presence of Φ_n. We will show that, under certain conditions on Φ_n as n → ∞, the behavior of Metropolis-Hastings methods on targets like (13) can be very similar to that arising in the scaled product case. We will give some motivation for these results in the sequel.

• Change of Measure from Gaussian in R^n. If f(x) = exp(−x²/2) then the product measure is Gaussian and the form (13) becomes

π(x) ∝ exp( −Φ_n(x) + (1/2)⟨x, L_n x⟩ )   (14)

with L_n a diagonal matrix with entries −1/λ_i². More generally the structure (14) is of interest for any negative definite precision matrix L_n. Viewed in this context, we see that the structure (14) is exactly what will arise from an approximation of the measure (1) which is of interest to us in this article.

4.2. Metropolis-Hastings Algorithm. The basic idea of MCMC is to generate a sequence {x_j}_{j=1}^J which, for large J, produces a set of approximate draws from a given target measure π. This is done by creating a Markov chain for which π is invariant. The approximate samples x_j from π are correlated. The MCMC method is very flexible, allowing for the construction of a wide range of methods with the aforementioned properties. A key issue is the construction of methods which minimize correlation amongst samples, thereby increasing efficiency.

The Metropolis algorithm, a particular MCMC method, was introduced in [23] where it was used by physicists aiming at calculating averages under the Boltzmann distribution. It was later generalized by Hastings in [19]. The algorithm has proven particularly effective in a range of applications; we will concentrate on this variant of MCMC methods here.

The goal is to sample π : R^n → R⁺. The idea of the method is, given an approximate sample x_j, to propose a new sample y from some Markov chain with transition kernel q(x_j, ·). This proposal is then accepted (x_{j+1} = y) with probability a(x_j, y) and rejected (x_{j+1} = x_j) otherwise. The composition of proposal from a Markov kernel and the accept-reject criterion gives a modified Markov chain. If

a(x, y) = min( 1, π(y) q(y, x) / ( π(x) q(x, y) ) )   (15)

then the resulting Markov chain for the sequence {x_j}_{j=1}^J is π-invariant and will, for large J, generate samples from π under mild ergodicity hypotheses [21, 26]. The following piece of pseudo-code defines the algorithm:

Algorithm 1.

1. Set j = 0. Pick x0 ∈ R^n.

2. Given x_j propose y ∼ q(x_j, ·).

3. Calculate a(x_j, y).

4. Set x_{j+1} = y with probability a(x_j, y).

5. Otherwise set x_{j+1} = x_j.

6. Set j = j + 1 and return to 2.
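A compact Python transcription of Algorithm 1, with the (unnormalized) target density and proposal kernel supplied by the user, might look as follows; the Gaussian random-walk proposal in the usage example corresponds to (16) in the next subsection, and the standard Gaussian target is purely illustrative.

```python
import numpy as np

def metropolis_hastings(log_pi, propose, log_q, x0, n_steps, rng):
    """Algorithm 1: log_pi is the log target density (up to an additive constant),
    propose(x, rng) draws y ~ q(x, .), and log_q(x, y) evaluates log q(x, y)."""
    x = np.asarray(x0, dtype=float)
    samples = [x]
    for _ in range(n_steps):
        y = propose(x, rng)
        log_a = (log_pi(y) + log_q(y, x)) - (log_pi(x) + log_q(x, y))
        if np.log(rng.uniform()) < min(0.0, log_a):   # accept with probability a(x, y)
            x = y
        samples.append(x)
    return np.array(samples)

# Usage: random-walk proposal on a standard Gaussian target in R^n.
n, ds = 10, 0.1
rng = np.random.default_rng(4)
log_pi = lambda x: -0.5 * np.dot(x, x)
propose = lambda x, r: x + np.sqrt(2.0 * ds) * r.standard_normal(x.size)
log_q = lambda x, y: 0.0                              # symmetric proposal: q cancels in (15)
chain = metropolis_hastings(log_pi, propose, log_q, np.zeros(n), 5000, rng)
```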

We mentioned that key to the success of the algorithm is minimizing correlation in the generated sequence. From this point of view, the acceptance probability is clearly a key object of interest: if it is small (on average) then the sequence will be highly correlated. In the high dimensional case that we study here our focus will be on defining appropriate proposals which ensure that the acceptance probability is bounded away from zero, on average, as the dimension grows n → ∞. We now turn to the class of proposals which effect this.

4.3. Proposals for Metropolis-Hastings. Consider a target density π : R^n → R⁺. A commonly used family of proposals are random walks, for which q(x, y) is the transition kernel associated with the proposal

y = x + √(2∆s) ξ   (16)

where ξ ∼ N(0, I) is a standard Gaussian random variable in R^n. These proposals are very simple to implement but, as we will see, can suffer from a (relatively) high rejection rate due to the fact that they contain no information about π. For what comes next it is instructive to note that the proposal (16) can be seen as a discretization of the SDE

dx/ds = √2 dW/ds.

This SDE contains no information about the target π. In contrast, the Langevin SDE

dx/ds = ∇ log π(x) + √2 dW/ds   (17)

is π-invariant if W is an R^n-valued Brownian motion; a straightforward calculation with the Fokker-Planck equation will show this. Equation (8) is an infinite dimensional version of this SDE, applied to the formal density (3). If we could sample exactly from the transition density for equation (17) over some time-increment ∆s, we would obtain a perfect proposal: it would be accepted with probability 1, and a large enough choice of ∆s would ensure lack of correlation among samples. Unfortunately it is not possible, in general, to sample from this transition density. However we can discretize the equation in s to obtain proposals which approximate this distribution and hence, for small ∆s, should deliver reasonable acceptance probability. We now pursue this idea further.

It turns out that there is a whole family of equations, including (17) as a special case, which are π-invariant. For any positive-definite self-adjoint matrix A the SDE

dx/ds = A ∇ log π(x) + √(2A) dW/ds   (18)

is π-invariant². Many of the proposals we consider below arise from discretization of equations of this type.
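As an illustration, a proposal based on one Euler step of (18) can be plugged directly into the Metropolis-Hastings routine sketched in subsection 4.2; the code below is a minimal version, assuming the gradient of log π and a diagonal preconditioner A are available. With A equal to the identity this is the Langevin proposal corresponding to β = 1 in Theorem 1 below, while other diagonal choices of A give the preconditioned proposals of Theorem 2.

```python
import numpy as np

def langevin_proposal(grad_log_pi, A_diag, ds):
    """Proposal from one Euler step of (18) with a diagonal preconditioner A."""
    def propose(x, rng):
        drift = A_diag * grad_log_pi(x)
        return x + ds * drift + np.sqrt(2.0 * ds * A_diag) * rng.standard_normal(x.size)

    def log_q(x, y):
        # Gaussian transition density N(x + ds*A*grad log pi(x), 2*ds*A), up to a constant.
        mean = x + ds * A_diag * grad_log_pi(x)
        return -0.25 * np.sum((y - mean) ** 2 / (ds * A_diag))

    return propose, log_q
```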

²Making these assertions about π-invariance rigorous in infinite dimensions requires being much more specific about the problem; for the set-up of subsections 2.1 and 2.2 such a task is carried out in [17].


5. Computational Complexity

We now explain a heuristic approach for selecting the time-step ∆s in the proposals mentioned above, with a view toward optimizing the acceptance probability. We will choose the time-step as an inverse power of the dimension n of the state-space so that

∆s = n^{−γ}.   (19)

Note that the proposal y is now a function of: (i) the current state x; (ii) the parameter γ through the time-step scaling above; and (iii) the noise ξ which will appear in all the proposals that we consider. Thus y = y(x, ξ; γ). We would like γ to be as small as possible, so that the chain will be making large steps and decorrelation amongst samples will be maximised. However, we would additionally like to ensure that the acceptance probability does not degenerate to 0 as n → ∞, also to prevent high correlation amongst samples. To that end we define γ0 as follows:

γ0 = min{ γ_c ≥ 0 : lim inf_{n→∞} E a(x, y) > 0 for all γ ∈ [γ_c, ∞) }.

Here the expectation is with respect to x distributed according to π and y chosen from the proposal distribution. In other words, we take the largest possible time-steps, as a function of n, constrained by asking that the average acceptance probability is bounded away from zero, uniformly in n. The resulting time-step restriction (19) is reminiscent of a Courant restriction arising in the numerical solution of PDEs.

Carrying this analogy further, we introduce the heuristic that the number of steps required to reach stationarity is given by

M(n) = n^{γ0}.

As we will discuss below, this heuristic can be given a firm foundation in a number of cases. Here we simply note that, in these cases, the Markov chain arising from the Metropolis-Hastings method approximates a Langevin SDE; one could think of the Markov chain as traveling with time-step ∆s on the paths of the Langevin SDE. It takes O(1) time for the limiting SDE to reach stationarity, so in terms of the time-step ∆s we obtain the expression for M(n) above. We give more details on this point in the sequel.

Our goal now is to understand how M(n) depends on the structure of the target distribution and the choice of proposal distribution. At our disposal are the form of the discretization and the form of A. We will carry out such a study for the hierarchy of target distributions introduced in subsection 4.1. We require a regularity condition on the density f.

Condition 1. (i) All moments of f are finite. (ii) log f is infinitely differentiable; log f and all its derivatives have a polynomial growth bound.



All results are obtained under Condition 1, which we assume to hold throughout without further mention. For clarity of exposition all the proofs are collected in the Appendix; within this section we confine ourselves to a brief discussion of the results. In this article we make strong conditions on the scalings λ_i and the change of measure Φ_n in order to simplify the proofs. Weaker conditions, and stronger theoretical results, are given in [3].

5.1. IID Products. Here we consider the case of target density with the form

π(x) = Π_{i=1}^n f(x_i).

We discuss two different proposals y = y(x, ξ) found by setting β = 0 and β = 1 in the following formula:

(y − x)/∆s = β ∇ log π(x) + √(2/∆s) ξ,   ξ ∼ N(0, I).

The choice β = 0 corresponds to the random walk proposal (16) whereas β = 1 corresponds to an Euler-Maruyama discretization of the Langevin SDE (17).

Theorem 1.

• If β = 0 then M(n) = O(n).

• If β = 1 then M(n) = O(n^{1/3}).

We provide a direct proof of Theorem 1 only for completeness, since these results are implicit in the pair of papers [14, 27] (see also the survey [28]). In fact in these papers the much stronger result of convergence, as n → ∞, of any scalar component of the n-dimensional Markov chain to that of a Langevin diffusion is demonstrated. To be more precise, if x_1^{(i)}, x_2^{(i)}, . . . is the trajectory of the ith scalar component, then by appropriately tuning ∆s ∝ n^{−γ0}, the continuous-time process {x^{(i)}_{[s n^{γ0}]}; s ≥ 0} converges to a Langevin diffusion. Such a result justifies the statement that the number of steps to reach stationarity is of the order M(n) = n^{γ0}. The basic take-home message of Theorem 1 is that using steepest ascent information in the proposal, which, for small ∆s, suggests moves in the direction of modes of the distribution, positively impacts the computational complexity of Metropolis-Hastings algorithms for iid target densities in high dimension. We now take this idea further.
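Theorem 1 is easy to probe numerically: the sketch below estimates the average acceptance probability of the random-walk (β = 0) and Langevin (β = 1) proposals on a standard Gaussian iid target as n grows, using ∆s = n^{−γ} with the exponents suggested by the theorem; the Gaussian choice of f, the chain length and the starting point are illustrative assumptions.

```python
import numpy as np

def average_acceptance(n, beta, gamma, n_steps=2000, seed=0):
    """Mean Metropolis-Hastings acceptance probability for the iid target prod_i N(0, 1)."""
    rng = np.random.default_rng(seed)
    ds = float(n) ** (-gamma)
    log_pi = lambda x: -0.5 * np.dot(x, x)
    grad = lambda x: -x
    x = rng.standard_normal(n)                      # start (approximately) in stationarity
    acc = []
    for _ in range(n_steps):
        y = x + ds * beta * grad(x) + np.sqrt(2.0 * ds) * rng.standard_normal(n)
        if beta == 0:
            log_q_ratio = 0.0                       # symmetric random walk
        else:                                       # ratio of Gaussian transition densities
            log_q_ratio = (-0.25 / ds * np.sum((x - y - ds * grad(y)) ** 2)
                           + 0.25 / ds * np.sum((y - x - ds * grad(x)) ** 2))
        a = min(1.0, np.exp(log_pi(y) - log_pi(x) + log_q_ratio))
        acc.append(a)
        if rng.uniform() < a:
            x = y
    return np.mean(acc)

for n in (10, 100, 1000):
    print(n, average_acceptance(n, beta=0, gamma=1.0), average_acceptance(n, beta=1, gamma=1.0 / 3.0))
```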

5.2. Scaled Products. Now consider the target density of the form

π(x) = Π_{i=1}^n (1/λ_i) f(x_i/λ_i)   (20)

with λ_i = i^{−κ} for some κ > 0. Thus, the target measure is of product form with the ith component having variance i^{−2κ} times a constant. We saw in the previous theorem that including steepest descent information improves complexity. For this reason we will henceforth work only with proposals arising from discretizations which include the ∇ log π term. Specifically, we employ discretizations of the Langevin equation in the form (18), giving

(y − x)/∆s = A ∇ log π(x) + √(2A/∆s) ξ,   ξ ∼ N(0, I).   (21)

We define the diagonal matrix C_n = diag{λ_1², · · · , λ_n²}.

Theorem 2.

• If A = I then M(n) = O(n^{2κ+1/3}).

• If A = C_n then M(n) = O(n^{1/3}).

The matrix A can be viewed as a preconditioner which, in the case A = C_n, acts by placing different components on the same scale. By doing so, it is possible to optimize the time-step ∆s for all components of the proposal, resulting in a substantial improvement in computational complexity. Thus, the take-home message from this theorem is that preconditioning positively impacts the complexity of Metropolis-Hastings algorithms. The proof of this result is given in the Appendix. It should be noted however that the result can be proved by a straightforward generalization of the ideas in [27]. The theorem is readily extended to the case where the λ_i are replaced by λ_{i,n} satisfying algebraic upper and lower bounds in i, uniformly in n; see [3]. Related results, for scalings somewhat different in nature from those considered here, may be found in [1].

5.3. Change of Measure. In both of the previous sections the target measure was of product type and hence not fundamentally high dimensional, as each component could be considered separately. We now move away from this restrictive assumption and consider targets of the form

π(x) ∝ exp(−Φ_n(x)) π0(x),   π0(x) = Π_{i=1}^n (1/λ_i) f(x_i/λ_i).

Similarly to the previous section, we assume that λ_i = i^{−κ}. We use a family of proposals which, in the product case, coincides with the proposal (21):

(y − x)/∆s = A ∇ log π0(x) + √(2A/∆s) ξ,   ξ ∼ N(0, I).   (22)

We also assume the following uniform bound on Φ_n:

sup_{n∈Z⁺, x∈R^n} |Φ_n(x)| < ∞.

Defining C_n as in the previous section we have the following result.


Theorem 3.

• If A = I then M(n) = O(n^{2κ+1/3}).

• If A = C_n then M(n) = O(n^{1/3}).

We prove this theorem in the Appendix. The take-home message from this theorem is that the change of measure does not affect the computational complexity. The boundedness assumption on Φ_n is very severe and is mostly made for clarity of exposition. Weaker and more pragmatic conditions, based on Lipschitz properties of a limiting Φ on Hilbert space, may be found in [3]. The intuition behind all the results concerning change of measure, both here and in [3], is that we work under conditions on the Φ_n under which the reference product measure structure dominates in the tails; such a situation arises naturally when approximating infinite dimensional measures with Radon-Nikodym derivative (1) with respect to a product measure π0.

Note that a proposal derived from the discretization of the Langevin SDE (18) would take the form

(y − x)/∆s = A ( ∇ log π0(x) − ∇Φ_n(x) ) + √(2A/∆s) ξ,   ξ ∼ N(0, I)   (23)

instead of (22). However we have omitted the term ∇Φ_n(x) in (22) to simplify the proof of the complexity results in the above theorem, and because the resulting proposal suffices (under the stated conditions on Φ_n) to deliver an algorithm which has the same computational complexity as the one corresponding to product targets in Theorem 2; this is in some sense (and apart from extraordinary choices of Φ_n) the best one can expect. However, whilst use of the proposal (23) might not improve the asymptotic computational complexity in n, when compared with the results obtained for the proposal (22), it can have a significant positive effect in terms of the constant in the asymptotic cost, and in other measures of efficiency.

5.4. Change of Measure from Gaussian. In the previous section we made the useful step of considering settings which are no longer of product form, taking us into a family of problems with practical application. Here we take a further step in the direction of applicability, by assuming that the reference measure π0 is Gaussian, so that the target has the form

π(x) ∝ exp( −Φ_n(x) + (1/2)⟨x, L_n x⟩ ).   (24)

We have used L_n = −C_n⁻¹ = diag{−λ_1⁻², · · · , −λ_n⁻²}. We consider a family of proposals parameterised by θ ∈ [0, 1] which, in the Gaussian reference measure case, is identical to that from the previous section when θ = 0. When θ ∈ (0, 1) the family corresponds to using an implicit discretization of (18):

(y − x)/∆s = A ( θ L_n y + (1 − θ) L_n x ) + √(2A/∆s) ξ,   ξ ∼ N(0, I).

We make the same assumption on Φ_n as in the previous section.


Theorem 4.

• If θ = 1/2 and A = I or A = C_n then M(n) = O(1).

• If θ = 0 and A = C_n then M(n) = O(n^{1/3}).

• If θ = 0 and A = I then M(n) = O(n^{2κ+1/3}).

Thus the take-home message from this theorem is that implicitness in the proposal can positively impact computational complexity. It turns out that the choice θ = 1/2 is crucial to obtaining n-independent estimates on M(n). This is due to the fact that θ = 1/2 is the unique choice of θ for which the Metropolis-Hastings method is well-defined on the limiting (as n → ∞) infinite dimensional Hilbert space H. This result is proved in [2]; for numerical illustrations of the effect of θ see that paper and [18].

The results of Theorem 4 are directly relevant to the infinite dimensional models of interest in this paper, characterised by the general density structure π in (3) and the π-invariant SPDE (8). The target π_n in (24) should be viewed as an approximation of π. One can readily obtain such a structure for a finite dimensional approximation of π by truncating the spectral expansion corresponding to the eigenbasis of the covariance operator C of the reference Gaussian measure appearing in the definition of π. Equivalently, this corresponds to an n-dimensional projection of the Karhunen-Loève expansion for Gaussian measures. Other methods, like finite differences or finite elements, can deliver a similar structure. In these cases note that appropriate orthogonal transformations can force a diagonal structure for the approximation of the covariance operator, thus granting the structure (24). In terms of results, Theorem 4 dictates that one should use a θ-method for the discretization of the SPDE (8) in the algorithmic time s-direction, with the particular choice θ = 1/2.
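A sketch of how the implicit θ-proposal can be generated in the diagonal setting of (24) is given below: with A and L_n diagonal the implicit equation is solved componentwise in closed form, and the resulting y can be fed to the Metropolis-Hastings routine of subsection 4.2. The scalings λ_i = i^{−κ}, the choice A = C_n and the step size are illustrative assumptions.

```python
import numpy as np

def theta_proposal(x, ds, theta, A_diag, L_diag, rng):
    """One draw from the implicit theta-discretization used in Theorem 4:
    (y - x)/ds = A(theta*L*y + (1 - theta)*L*x) + sqrt(2A/ds)*xi,
    solved for y componentwise (A and L_n diagonal, L_n with negative entries)."""
    xi = rng.standard_normal(x.size)
    al = A_diag * L_diag
    rhs = (1.0 + ds * (1.0 - theta) * al) * x + np.sqrt(2.0 * ds * A_diag) * xi
    return rhs / (1.0 - ds * theta * al)

# Illustrative diagonal setting: lambda_i = i^{-kappa}, A = C_n, theta = 1/2.
n, kappa, ds = 1000, 1.0, 0.5
lam2 = np.arange(1, n + 1, dtype=float) ** (-2.0 * kappa)   # lambda_i^2
L_diag = -1.0 / lam2                                        # entries of L_n
A_diag = lam2                                               # preconditioner A = C_n
rng = np.random.default_rng(5)
y = theta_proposal(np.zeros(n), ds, theta=0.5, A_diag=A_diag, L_diag=L_diag, rng=rng)
```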

6. Conclusions

In this article we have studied a class of problems that lie at the interface of applied mathematics and statistics. We have illustrated the following:

• Applications. Measures which have density with respect to a Gaussian arise naturally in many applications where the solution is a measure on functions.

• SPDEs. There is a natural notion of Langevin equations on function space for these measures. These Langevin equations are often stochastic partial differential equations (SPDEs).

• Algorithms. Using these SPDEs, and their finite dimensional analogues, natural MCMC methods can be constructed to sample function space.

• Numerical Analysis. Ideas such as steepest descents, preconditioning and implicitness have a crucial impact on the complexity of MCMC algorithms.


Many interesting issues remain open for further study:

• Mathematical Formulation. As indicated in subsection 2.5, providing a rigorous formulation of many problems which require a measure on function space, especially inverse problems, is an open and interesting area for analysis.

• Algorithms. In theory it is advantageous to incorporate information concerning ∇Φ_n(x) (as in (23)) in the proposal. In practice, calculation of this derivative may be very expensive: study of the data assimilation [20] and geophysical applications [22] will illustrate this. Thus, it is important to find cheaper surrogates for ∇Φ which result in improved acceptance probabilities.

• Applications. As we have shown, these are numerous in chemistry, physics, data assimilation, signal processing and econometrics. Realizing the potential of the methodology studied here remains a significant challenge.

• Stochastic Analysis. The existing theory of π-invariant SPDEs would benefit from extension, in the case of conditioned diffusions, to non-gradient vector fields, state-dependent noise, degenerate noise and non-Gaussian noise. More generally, in particular for inverse problems, making sense of the resulting SPDEs remains an open and interesting problem; see subsection 3.5.

• Numerical Analysis. It is important to develop an approximation theory for the S(P)DEs and MCMC methods on function space written down in this article. Challenging issues include nonlinear boundary conditions, nonlinear Dirac sources, and preserving symmetry of the inverse covariance matrix.

• Statistics. Incorporation of this function space sampling into the (Gibbs) sampler to estimate parameters as well as functions. Study of optimal scaling of proposals in various singular limits, such as small diffusion in the case of bridge diffusions or signal processing, or rapidly varying permeability in the case of geophysical applications.

Apart from the intrinsic interest in the class of problems studied here, and the specific conclusions listed, the work presented here is perhaps also of interest because it highlights an important general trend, namely that applied mathematics and statistics are increasingly required to work in tandem in order to tackle significant problems in science, engineering and beyond.

Acknowledgements. We are very grateful to Yalchin Efendiev, Frank Pinski, Gareth Roberts and Jochen Voss for comments on early drafts of this article, and to Efendiev (Figure 4), Pinski (Figure 1) and Voss (Figures 2 and 3) for providing the illustrations in the article.


References

[1] M. Bedard, Weak Convergence of Metropolis algorithms for non-i.i.d. target

distributions. Ann. Appl. Probab. 17(2007), 1222-1244.

[2] A. Beskos, G.O. Roberts, A.M. Stuart and J. Voss. An MCMC Method for

diffusion bridges. Submitted.

[3] A. Beskos, G. Roberts and A.M. Stuart. Scalings for local Metropolis-Hastings

chains on non-product targets. Submitted.

[4] P.G. Bolhuis, D. Chandler, C. Dellago and P.L. Geissler Transition path sam-

pling: Throwing ropes over rough mountain passes, in the dark. Ann. Rev.Phys. Chem. 53(2002), 291-318.

[5] G.O. Roberts and Stramer, O. On inference for partially observed nonlinear

diffusion models using the Metropolis-Hastings algorithm. Biometrika 88(2001),603–621.

[6] O. Elerian, S. Chib, N. Shephard. Likelihood inference for discretely observed

nonlinear diffusions. Econometrica 69(2001), 959–993.

[7] A.J. Chorin and O.H. Hald. Stochastic tools in mathematics and science, vol-ume 1 of Surveys and Tutorials in the Applied Mathematical Sciences. Springer,New York, 2006.

[8] G. Da Prato and J. Zabczyk. Stochastic equations in infinite dimensions. Cam-bridge University Press, 1992.

[9] M. Dashti and J. Robinson. Uniqueness of the particle trajectories of the weak

solutions of the two-dimensional Navier-Stokes equations. Arch. Rat. Mech.Anal. (2007). Submitted.

[10] P. Dostert, Y. Efendiev, T.Y. Hou and W. Luo, Coarse-grain Langevin algo-

rithms for dynamic data integration and uncertainty quantification. J.Comp.Phys, 217(2006), 123–142.

[11] W. E, W.Q. Ren and E. Vanden-Eijnden. String method for the study of rare

events. Phys. Rev. B 66, 2002, p. 052301.

[12] M.I. Freidlin and A.D. Wentzell. Random Perturbations of Dynamical Systems.

Springer-Verlag, New York, 1998.

[13] G. Garcia-Ojalvo and J.M. Sancho, Noise in Spatially Extended Systems

Springer(1999).

[14] A. Gelman, W.R. Gilks and G.O. Roberts, Weak convergence and optimal

scaling of random walk Metropolis algorithms. Ann. Appl. Prob. 7(1997), 110–120.

[15] M. Hairer, A.M.Stuart, J. Voss and P. Wiberg. Analysis of SPDEs Arising

in Path Sampling. Part 1: The Gaussian Case. Comm. Math. Sci. 3(2005),587–603

[16] M. Hairer, A.M.Stuart and J. Voss. Sampling the posterior: an approach to

non-Gaussian data assimilation. PhysicaD, 230(2007), 50–64.

[17] M. Hairer, A.M.Stuart and J. Voss. Analysis of SPDEs Arising in Path Sam-

pling. Part 2: The Nonlinear Case. Ann. Appl. Prob., 17(2007), 1657–1706.

Page 23: MCMC Methods for Sampling Function Space

MCMC Methods for Sampling Function Space 23

[18] M. Hairer, A.M.Stuart and J. Voss. Sampling conditioned diffusions. To ap-pear in “Trends in Stochastic Analysis”, Cambridge University Press, 20 pages(2008).

[19] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their

applications. Biometrika 57(1970), 97–109.

[20] F.-X. Le Dimet and O. Talagrand, Variational algorithms for analysis and

assimilation of meteorological observations: theoretical aspects. Tellus A,38(1986), 97–110.

[21] J. Liu, Monte Carlo Strategies in Scientific Computing. Springer Texts in Statis-tics, Springer-Verlag, New York, 2001.

[22] X. Ma, M. Al-Harbi, A. Datta-Gupta, and Y. Efendiev. Multistage sampling

approach to quantifying uncertainty during history matching geological models,Soc. Petr. Eng. Journal (2007), to appear.

[23] N. Metropolis, A.W. Rosenbluth, M.N. Teller and E. Teller, Equations of state

calculations by fast computing machines. J. Chem. Phys. 21(1953), 1087–1092.

[24] B. Oksendal, Stochastic differential equations, Springer, New York, 1998.

[25] M. Reznikoff and E. Vanden Eijnden. Invariant measures of SPDEs and con-ditioned diffusions. C.R. Acad. Sci. Paris, 340(2005), 305 - 308

[26] C.P. Robert and G.C. Casella, Monte Carlo Statistical Methods. Springer Textsin Statistics, Springer-Verlag, 1999.

[27] G.O. Roberts and J. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. JRSSB 60(1998), 255–268.
[28] G.O. Roberts and J. Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science 16(2001), 351–367.
[29] J.C. Robinson. Infinite-Dimensional Dynamical Systems. Cambridge University Press, Cambridge, 2001.
[30] B. Rozovskii. Stochastic evolution systems: linear theory and applications to non-linear filtering. Kluwer Academic Publishers, 1990.
[31] A.M. Stuart, J. Voss and P. Wiberg. Conditional Path Sampling of SDEs and the Langevin MCMC Method. Comm. Math. Sci. 2(2004), 685–697.

Appendix. Proofs of Theorems

The following generic result will allow us to obtain estimates for the Metropolis-Hastings acceptance probability.

Lemma 1. Let $T$ be a real-valued random variable.

i) For any $c > 0$:
\[ E[\,1 \wedge e^{T}\,] \;\ge\; e^{-c}\Bigl(1 - \frac{E|T|}{c}\Bigr). \]


ii) If $E[T] < 0$, then:
\[ E[\,1 \wedge e^{T}\,] \;\le\; e^{E[T]/2} + \frac{2\,E\,|T - E[T]|}{-E[T]}. \]

Proof. For the first result note that:
\[ E[\,1 \wedge e^{T}\,] \;\ge\; E\bigl[\,(1 \wedge e^{T})\, I\{|T| \le c\}\,\bigr] \;\ge\; e^{-c}\, P[\,|T| \le c\,]. \]
The Markov inequality now gives the required result. For the second result, we set $\mu := -E[T]$ and $T_0 := T - E[T]$. Then:
\[ E[\,1 \wedge e^{T}\,] \;=\; E\bigl[\,(1 \wedge e^{T})\, I\{|T_0| \le \tfrac{\mu}{2}\}\,\bigr] + E\bigl[\,(1 \wedge e^{T})\, I\{|T_0| > \tfrac{\mu}{2}\}\,\bigr] \;\le\; e^{-\mu/2} + P\bigl[\,|T_0| > \tfrac{\mu}{2}\,\bigr], \]
since $T \le -\mu/2$ on the event $\{|T_0| \le \mu/2\}$. The result now follows from the Markov inequality.
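Both bounds are elementary but easy to get wrong by a constant; the following short Python sketch (entirely ours, with a Gaussian $T$ chosen purely for illustration) estimates $E[\,1 \wedge e^{T}\,]$ by Monte Carlo and compares it with the two bounds of Lemma 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def check_lemma1(mean, std, c=2.0, n_samples=10**6):
        # Monte Carlo check of Lemma 1 for T ~ N(mean, std^2); here E[T] = mean.
        T = rng.normal(mean, std, size=n_samples)
        lhs = np.minimum(1.0, np.exp(T)).mean()              # E[1 ^ e^T]
        lower = np.exp(-c) * (1.0 - np.abs(T).mean() / c)    # bound (i), valid for any c > 0
        print(f"E[1 ^ e^T] ~ {lhs:.4f}, lower bound (i) ~ {lower:.4f}")
        if mean < 0:                                         # bound (ii) requires E[T] < 0
            upper = np.exp(mean / 2) + 2 * np.abs(T - mean).mean() / (-mean)
            print(f"upper bound (ii) ~ {upper:.4f}")

    check_lemma1(mean=-0.5, std=0.5)

Bound (ii) is only informative when $-E[T]$ is large; for the mildly negative mean used here it exceeds one and is therefore weaker than the trivial bound $E[\,1 \wedge e^{T}\,] \le 1$, which is consistent with the lemma.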

For simplicity, in the proofs that follow we set
\[ g(x) = \log f(x), \]
and we use $g^{(j)}$ to denote the $j$th derivative of $g$.

Proof of Theorem 1:

• $\beta = 0$

The acceptance probability $a(x,y)$ in (15) is now determined as follows:
\[ a(x,y) = 1 \wedge \frac{\pi(y)}{\pi(x)} = 1 \wedge e^{R_n}; \qquad R_n := \sum_{i=1}^{n} \bigl( g(y_i) - g(x_i) \bigr). \]
Recall that, since $\beta = 0$,
\[ y_i = x_i + \sqrt{2\Delta s}\; \xi_i. \]
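The proposal and acceptance rule above fully specify the algorithm, so its behaviour in the dimension $n$ is easy to probe empirically. The following Python sketch (ours; it takes $f$ to be the standard Gaussian density purely for concreteness, and all run lengths are arbitrary) estimates the average acceptance probability, started in stationarity, for step-sizes $\Delta s = n^{-\gamma}$.

    import numpy as np

    rng = np.random.default_rng(1)

    def mean_acceptance_rwm(n, gamma, n_steps=2000):
        # Random-walk proposal y_i = x_i + sqrt(2*ds)*xi_i with ds = n**(-gamma),
        # on the product target prod_i f(x_i); here f = N(0,1), so g(x) = -x^2/2 + const.
        ds = n ** (-gamma)
        x = rng.normal(size=n)                    # start in stationarity
        acc = []
        for _ in range(n_steps):
            xi = rng.normal(size=n)
            y = x + np.sqrt(2.0 * ds) * xi
            R = np.sum((x**2 - y**2) / 2.0)       # R_n = sum_i (g(y_i) - g(x_i))
            a = np.exp(min(R, 0.0))               # 1 ^ e^R
            acc.append(a)
            if rng.uniform() < a:
                x = y
        return np.mean(acc)

    for n in (50, 200, 800):
        print(n, round(mean_acceptance_rwm(n, 1.0), 3), round(mean_acceptance_rwm(n, 0.5), 3))

In line with the two cases treated below, the $\gamma = 1$ column stabilises away from zero as $n$ grows, while the $\gamma = 0.5$ column decays towards zero.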

Case A: $\Delta s = n^{-\gamma}$ with $\gamma \ge 1$.

We take a second-order Taylor expansion of $R_n = R_n(\sqrt{\Delta s})$ around $\sqrt{\Delta s} = 0$. So:
\[ R_n = A_{1,n} + A_{2,n} + U_n, \]
with individual components:
\[ A_{1,n} = \sqrt{\Delta s}\, \sum_{i=1}^{n} C_{1,i}, \qquad A_{2,n} = \Delta s\, \sum_{i=1}^{n} C_{2,i}, \qquad U_n = \Delta s^{3/2} \sum_{i=1}^{n} U_{i,n}; \]
\[ C_{1,i} = \sqrt{2}\, g'(x_i)\, \xi_i, \qquad C_{2,i} = g''(x_i)\, \xi_i^{2}, \qquad U_{i,n} = \frac{\sqrt{2}}{3}\, g^{(3)}\bigl(x_i + \sqrt{2}\,\Delta_i^{*}\, \xi_i\bigr)\, \xi_i^{3}, \]


for some $\Delta_i^{*} \in [0, \sqrt{\Delta s}]$, $i = 1, \ldots, n$. Notice that $\{C_{1,i}\}_i$ and $\{C_{2,i}\}_i$ are both sequences of iid random variables, so we will ignore reference to the index $i$ when considering expectations w.r.t. $C_{1,i}$ or $C_{2,i}$. Using Condition 1(ii), we find that:
\[ |U_{i,n}| \le M_1(x_i)\, M_2(\xi_i)\, M_3(\Delta_i^{*}), \qquad (25) \]
for some positive polynomials $M_1, M_2, M_3$. Using Condition 1(i), $E[M_1(x_i)] < \infty$ and $E[M_2(\xi_i)] < \infty$, both expectations not depending on $i$. Since $\Delta_i^{*}$ is bounded above uniformly in $i, n$, so is $M_3(\Delta_i^{*})$. Since the $x_i$ and $\xi_i$ are independent of one another, it is now clear that $E|U_{i,n}| \le K_0$ for some constant $K_0$ not depending on $i, n$; thus $E|U_n| \le K_0\, n\, \Delta s^{3/2} = K_0\, n^{1 - 3\gamma/2}$ and subsequently:
\[ \lim_{n \to \infty} E\,|U_n| = 0. \]

Note now that, since $E[C_{1,\cdot}] = 0$, Jensen's inequality gives:
\[ E[\,|A_{1,n}|\,] \le \sqrt{\Delta s}\, \sqrt{n}\; E[\,C_{1,\cdot}^{2}\,]^{1/2}. \]
Also,
\[ E[\,|A_{2,n}|\,] \le \Delta s\; n\; E[\,|C_{2,\cdot}|\,]. \]
Since $\Delta s = n^{-\gamma}$ with $\gamma \ge 1$, we deduce that $\limsup_n E|R_n| < \infty$. Lemma 1(i) now implies that:
\[ \liminf_{n \to \infty} E\, a(x,y) > 0. \]

Case B: $\Delta s = n^{-\gamma}$ with $\gamma \in (0,1)$.

We select an integer $m$ such that $(m+1)\gamma > 2$ and use the $m$th-order Taylor expansion:
\[ R_n = \sum_{j=1}^{m} A_{j,n} + U'_n, \]
with terms specified as follows:
\[ A_{j,n} := (\sqrt{\Delta s})^{j} \sum_{i=1}^{n} C_{j,i}, \qquad U'_n = (\sqrt{\Delta s})^{m+1} \sum_{i=1}^{n} U'_{i,n}; \]
\[ C_{j,i} = \frac{(\sqrt{2})^{j}}{j!}\, g^{(j)}(x_i)\, \xi_i^{j}, \qquad U'_{i,n} = \frac{(\sqrt{2})^{m+1}}{(m+1)!}\, g^{(m+1)}\bigl(x_i + \sqrt{2}\,\Delta_i^{*}\, \xi_i\bigr)\, \xi_i^{m+1}, \]
for some corresponding $\Delta_i^{*} \in [0, \sqrt{\Delta s}]$. The residual terms $U'_{i,n}$ can be bounded as in (25), so $E|U'_{i,n}|$ is bounded by a constant independent of $i, n$; the particular choice of $m$ then gives:
\[ \lim_{n \to \infty} E\,|U'_n| = 0. \]

Also, since $E[A_{1,n}] = 0$:
\[ E[R_n] = \sum_{j=2}^{m} E[A_{j,n}] + O(1); \qquad E[A_{j,n}] = (\sqrt{\Delta s})^{j}\, n\, E[C_{j,\cdot}]. \]


From the analytical expression for $C_{2,i}$ (using the independence of $x_i$ and $\xi_i$, $E[\xi_i^{2}] = 1$, and an integration by parts):
\[ E[C_{2,\cdot}] = -\int_{\mathbb{R}} g'(x)^{2}\, e^{g(x)}\, dx < 0. \]
All other $C_{j,\cdot}$ satisfy $E|C_{j,\cdot}| < \infty$. So $E[R_n] \to -\infty$ as fast as $-n^{1-\gamma}$. For the expectation $E|R_n - E[R_n]|$, we use Jensen's inequality to get the following upper bound:
\[ E|R_n - E[R_n]| \le \sum_{j=1}^{m} (\sqrt{\Delta s})^{j}\, \sqrt{n}\; \mathrm{Var}[C_{j,\cdot}]^{1/2} + O(1). \]
So $E|R_n - E[R_n]|$ does not grow faster than $\bigl(\,|E[R_n]|\,\bigr)^{1/2}$. From Lemma 1(ii):
\[ \lim_{n \to \infty} E\, a(x,y) = 0. \]

• $\beta = 1$

The proof follows the same lines as the case $\beta = 0$. The acceptance probability $a(x,y)$ can again be written as $1 \wedge e^{R_n}$ for some corresponding $R_n$. We will again consider Taylor expansions of $R_n = R_n(\sqrt{\Delta s})$ around $\sqrt{\Delta s} = 0$. So, considering an $m$th-order expansion, we obtain the following structure:
\[ R_n(\sqrt{\Delta s}) = \sum_{j=1}^{m} A_{j,n} + U_n; \qquad (26) \]
\[ A_{j,n} = (\sqrt{\Delta s})^{j} \sum_{i=1}^{n} C_{j,i}, \qquad U_n = (\sqrt{\Delta s})^{m+1} \sum_{i=1}^{n} G(x_i, \xi_i, \Delta_i^{*}), \qquad (27) \]
for some $C_{j,i}$, $G$ involving $g$ and its derivatives, and some $\Delta_i^{*} \in [0, \sqrt{\Delta s}]$, $1 \le i \le n$. For the explicit expressions for $C_{j,i}$ and $G$ see [27]. We will only exploit the following characteristics:
\[ C_{j,i} = C_{j,\cdot}(x_i, \xi_i), \qquad (28) \]
\[ C_{1,i} = C_{2,i} \equiv 0, \; i = 1, \ldots, n; \qquad E[C_{3,\cdot}] = E[C_{4,\cdot}] = E[C_{5,\cdot}] = 0; \qquad E[C_{6,\cdot}] < 0, \]
and $G$ has a polynomial growth bound.

Since the first two terms in the expansion vanish, a larger step-size $\sqrt{\Delta s}$ can now be used, compared with the case $\beta = 0$, while still controlling the remaining terms. Working as above, we can show that:
\[ E|A_{j,n}| \le (\sqrt{\Delta s})^{j}\, \sqrt{n}\; E[C_{j,\cdot}^{2}]^{1/2}, \quad j = 3, 4, 5, \]
\[ E[\,|A_{6,n}|\,] \le (\Delta s)^{3}\, n\; E|C_{6,\cdot}|. \]
So, when $\Delta s = n^{-\gamma}$ with $\gamma \ge 1/3$, all terms in a sixth-order Taylor expansion of $R_n(\sqrt{\Delta s})$ have $n$-bounded absolute expectation, and Lemma 1(i) again gives the bound $\liminf_n E\, a(x,y) > 0$. Using the same arguments as in the case $\beta = 0$, one can also prove that $\lim_{n \to \infty} E\, a(x,y) = 0$ if $\gamma \in (0, 1/3)$. We omit further details.
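The improvement from $\gamma \ge 1$ to $\gamma \ge 1/3$ is again easy to see numerically. The exact $\beta = 1$ proposal is specified earlier in the paper and is not reproduced here; the sketch below (ours) simply uses the textbook Langevin/MALA proposal $y_i = x_i + \Delta s\, g'(x_i) + \sqrt{2\Delta s}\,\xi_i$ in the spirit of [27] as a stand-in, with $f$ standard Gaussian as before, so the numbers are illustrative only.

    import numpy as np

    rng = np.random.default_rng(2)

    def mean_acceptance_langevin(n, gamma, n_steps=2000):
        # Langevin-type proposal y = x + ds*g'(x) + sqrt(2*ds)*xi, ds = n**(-gamma),
        # for the standard-Gaussian product target (g'(x) = -x), accepted with the
        # usual Metropolis-Hastings ratio.
        ds = n ** (-gamma)
        x = rng.normal(size=n)                    # start in stationarity

        def log_q(frm, to):                       # log proposal density, up to a constant
            mean = frm + ds * (-frm)
            return -np.sum((to - mean) ** 2) / (4.0 * ds)

        acc = []
        for _ in range(n_steps):
            xi = rng.normal(size=n)
            y = x + ds * (-x) + np.sqrt(2.0 * ds) * xi
            R = np.sum((x**2 - y**2) / 2.0) + log_q(y, x) - log_q(x, y)
            a = np.exp(min(R, 0.0))               # 1 ^ e^R
            acc.append(a)
            if rng.uniform() < a:
                x = y
        return np.mean(acc)

    for n in (50, 200, 800):
        print(n, round(mean_acceptance_langevin(n, 1.0 / 3.0), 3))

At $\gamma = 1/3$ the average acceptance probability remains bounded away from zero as $n$ grows, whereas the random-walk sketch given earlier collapses at the same $\gamma$.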


Proof of Theorem 2:

• $A = I$

The proof is a slight modification of the proof of Theorem 1. Again, we consider the exponent $R_n = R_n(\sqrt{\Delta s})$ from the expression $1 \wedge e^{R_n}$ for the acceptance probability $a(x,y)$, and consider Taylor expansions of it around $\sqrt{\Delta s} = 0$. The formulae are similar to the ones for the iid case given in (26) and (27). Analytically:
\[ R_n = \sum_{j=3}^{m} A_{j,n} + U_n; \]
\[ A_{j,n} = (\sqrt{\Delta s})^{j} \sum_{i=1}^{n} C_{j,i} / \lambda_i^{j}, \qquad U_n = (\sqrt{\Delta s})^{m+1} \sum_{i=1}^{n} G(x_i/\lambda_i,\, \xi_i,\, \Delta_i^{*}/\lambda_i) / \lambda_i^{m+1}, \]
for some $\Delta_i^{*} \in [0, \sqrt{\Delta s}]$, $i = 1, \ldots, n$. The functional $G$ is the same as in (27), whereas $C_{j,i} = C_{j,\cdot}(x_i/\lambda_i,\, \xi_i)$ for the functions $C_{j,\cdot}$ in (28); in particular $\{C_{j,i}\}_i$ are again iid for all $j \ge 1$.

We work as before. For $\gamma \ge 2\kappa + 1/3$, we consider the sixth-order expansion ($m = 6$), and find that:
\[ |G(x_i/\lambda_i,\, \xi_i,\, \Delta_i^{*}/\lambda_i)| \le M_1(x_i/\lambda_i)\, M_2(\xi_i)\, M_3(\Delta_i^{*}/\lambda_i), \]
for some positive polynomials $M_1, M_2, M_3$. One can now easily check that:
\[ \lim_{n \to \infty} E\,|U_n| = 0. \]

We then obtain the bounds:
\[ E|A_{j,n}| \le (\sqrt{\Delta s})^{j} \Bigl(\sum_{i=1}^{n} \lambda_i^{-2j}\Bigr)^{1/2} E[C_{j,\cdot}^{2}]^{1/2}, \quad j = 3, 4, 5, \]
\[ E|A_{6,n}| \le (\Delta s)^{3} \Bigl(\sum_{i=1}^{n} \lambda_i^{-6}\Bigr)\, E|C_{6,\cdot}|. \]
Recall that $\lambda_i = i^{-\kappa}$. So, when $\Delta s = n^{-\gamma}$ with $\gamma \ge 2\kappa + 1/3$, one can easily verify that $\limsup_n E|R_n| < \infty$; from Lemma 1(i), $\liminf_{n \to \infty} E\, a(x,y) > 0$.

When $\gamma \in (2\kappa, 2\kappa + 1/3)$, we consider an $m$th-order expansion with $(m+1)\gamma > 2$ and work as in Theorem 1, taking into consideration the scalings $\lambda_i$ as above. We omit further details.

• $A = C_n$

One can easily check that, on the transformed space $x \mapsto C_n^{-1/2} x$, the original algorithm with target distribution (20) and proposal (21) coincides with the algorithm of the iid case given in Section 5.1. So the result follows from Theorem 1, with $\beta = 1$.
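The change of variables is just a coordinate-wise rescaling. As a minimal numerical illustration (ours; it does not reproduce (20)–(21), only the whitening step), with $C_n = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_n^2)$ and $\lambda_i = i^{-\kappa}$, the map $x \mapsto C_n^{-1/2} x$ sends draws from $N(0, C_n)$ to coordinates that are iid standard Gaussian, which is what reduces the $A = C_n$ case to the iid setting.

    import numpy as np

    rng = np.random.default_rng(3)
    n, kappa = 500, 0.6                               # kappa is an arbitrary illustrative value
    lam = np.arange(1, n + 1, dtype=float) ** (-kappa)
    x = lam * rng.normal(size=(10000, n))             # rows are draws from N(0, C_n)
    z = x / lam                                       # z = C_n^{-1/2} x, coordinate-wise
    print(z.std(axis=0)[:5])                          # every coordinate now has unit variance
    print(np.corrcoef(z[:, 0], z[:, -1])[0, 1])       # distinct coordinates are uncorrelated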


Proof of Theorem 3:

The acceptance probability will now be:
\[ a(x,y) = 1 \wedge e^{R_n - \Phi_n(y) + \Phi_n(x)}, \]
for $R_n$ as in the product case. Note now that:
\[ \limsup_n E^{\pi}\,|R_n - \Phi_n(y) + \Phi_n(x)| \;\le\; K_1 + K_2 \limsup_n E^{\pi_0}|R_n|, \]
\[ E^{\pi}\bigl[\,1 \wedge e^{R_n - \Phi_n(y) + \Phi_n(x)}\,\bigr] \;\le\; K\, E^{\pi_0}\bigl[\,1 \wedge e^{R_n}\,\bigr], \]
for some constants $K, K_1, K_2 > 0$, where we have used the assumption of a uniform bound on $\Phi_n$. Consider the case $A = I$ with $\Delta s = n^{-\gamma}$. We have already shown in the proof of Theorem 2 above that if $\gamma \ge 2\kappa + 1/3$ then $\limsup_n E^{\pi_0}|R_n| < \infty$. The first inequality above then implies that also $\limsup_n E^{\pi}|R_n - \Phi_n(y) + \Phi_n(x)| < \infty$, and Lemma 1(i) gives a lower bound for the average acceptance probability in stationarity. When $\gamma \in (2\kappa, 2\kappa + 1/3)$, we showed that $E^{\pi_0}[\,1 \wedge e^{R_n}\,] \to 0$, so also $E^{\pi}[\,1 \wedge e^{R_n - \Phi_n(y) + \Phi_n(x)}\,] \to 0$.

A similar argument gives the required result for $A = C_n$.
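The constants $K, K_1, K_2$ are not written out in the text; under the uniform bound $\sup_n \|\Phi_n\|_\infty \le B$, one admissible (certainly not optimal) choice follows from the elementary inequality $1 \wedge e^{a+b} \le e^{|b|}\,(1 \wedge e^{a})$ together with $d\pi/d\pi_0 \propto e^{-\Phi_n}$, which gives $d\pi/d\pi_0 \le e^{2B}$:
\[
E^{\pi}\bigl[\,1 \wedge e^{R_n - \Phi_n(y) + \Phi_n(x)}\,\bigr]
\;\le\; e^{2B}\, E^{\pi_0}\bigl[\,1 \wedge e^{R_n - \Phi_n(y) + \Phi_n(x)}\,\bigr]
\;\le\; e^{4B}\, E^{\pi_0}\bigl[\,1 \wedge e^{R_n}\,\bigr],
\]
\[
E^{\pi}\,|R_n - \Phi_n(y) + \Phi_n(x)| \;\le\; e^{2B}\bigl( E^{\pi_0}|R_n| + 2B \bigr),
\]
so one may take $K = e^{4B}$, $K_1 = 2B e^{2B}$ and $K_2 = e^{2B}$.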

Proof of Theorem 4:

• θ = 0.

The required results for θ = 0 are special cases of Theorem 3.

• $\theta = \tfrac{1}{2}$, $A = I$.

After carrying out some calculations, the acceptance probability can be written as $1 \wedge e^{T_n}$ where:
\[ T_n = \Phi_n(x) - \Phi_n(y) + \frac{1}{2}\Bigl(\theta - \frac{1}{2}\Bigr) \Delta s \sum_{i=1}^{n} \lambda_i^{-2} \bigl( (y_i/\lambda_i)^{2} - (x_i/\lambda_i)^{2} \bigr). \]
So, when $\theta = 1/2$, the average acceptance probability in stationarity is lower-bounded even for constant $\Delta s = c$. A similar simplification of the acceptance probability expression arises also in the case $A = C_n$.
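To spell out the cancellation: assuming the same uniform bound $\sup_n \|\Phi_n\|_\infty \le B$ used for Theorem 3 remains in force,
\[
\theta = \tfrac{1}{2} \;\Longrightarrow\; T_n = \Phi_n(x) - \Phi_n(y),
\qquad 1 \wedge e^{T_n} \;\ge\; e^{-2B} \;>\; 0,
\]
so the acceptance probability is bounded below, uniformly in $n$ and pathwise, for any fixed step-size $\Delta s$; no shrinking of $\Delta s$ with dimension is required.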

Department of Statistics, University of Warwick, Coventry CV4 7AL, UK

E-mail: [email protected]

Mathematics Institute, University of Warwick, Coventry CV4 7AL, UK

E-mail: [email protected]