
Bayesian Modelling and Computation

S. A. Bacallado*

Statistical Laboratory, University of Cambridge

October 5th, 2018 – November 28th, 2018

Contents

1 Bayesian inference
2 Exponential families and conjugate priors
3 Graphical models
   3.1 Inference on graphical models
   3.2 Belief propagation
   3.3 Loopy belief propagation
4 The Monte Carlo method
   4.1 Quantiles and ratios
   4.2 Sampling univariate distributions
   4.3 Rejection sampling
   4.4 Variance reduction
5 Importance sampling
6 Fundamentals of Markov chain Monte Carlo
   6.1 Countable state spaces
   6.2 General state spaces and Harris recurrence
   6.3 Geometric ergodicity
   6.4 Reversibility
7 Basic MCMC Algorithms
   7.1 Metropolis–Hastings
   7.2 Gibbs sampling
8 Auxiliary variables and data augmentation
   8.1 Slice sampling
   8.2 Hit and run algorithm
   8.3 Data augmentation
9 Expectation Maximisation and Variational Inference
   9.1 Variational Inference
   9.2 Stochastic Variational Inference
10 Langevin Dynamics and Hamiltonian Monte Carlo
   10.1 Hamiltonian Monte Carlo
   10.2 No U-Turn Sampler
   10.3 Riemannian Manifold HMC
   10.4 Stochastic gradient HMC

*sb2116@cam.ac.uk

1 Bayesian inference

A fundamental concept of this course is that of a latent or nuisance variable. As the name latent suggests, these are random variables in a probability model which are not observed, and as the word nuisance implies, we are not necessarily interested in inferring them. Models with latent variables are ubiquitous. Perhaps the most basic example is a Generalised Linear Model with random effects.

1 example (Logistic regression with random effects). Suppose we are trying to model the outcome of N elections as a function of p opinion polls. Let Xi be the outcome variable, the proportion of the ni electors who voted for, for example, the Labour party. The predictors wi ∈ [0, 1]^p are proportions of support for Labour in p different opinion polls for election i. Perhaps the simplest model for binomial outcomes is logistic regression, which assumes

niXi ∼ Binomial( ni, 1/(1 + exp(−wiᵀβ)) )

for a vector of parameters β ∈ R^p.

The shortcomings of this model are not difficult to imagine. In a typical election, the number of electors ni is very large, and the variance of Xi conditional on wi and β is O(1/ni). Therefore, the model assumes Xi is practically determined by the parameters and close to 1/(1 + exp(−wiᵀβ)), which is highly implausible. This type of misspecification is known as overdispersion, because the mean-variance relationship of the Binomial family does not reflect reality, and in particular, responses with identical predictors have higher variance than the model would suggest.

One way to deal with this issue is by introducing random effects Zi ∼ N(0, σ²) i.i.d. for i = 1, . . . , N, and letting

niXi ∼ Binomial( ni, 1/(1 + exp(−wiᵀβ + Zi)) ).

In this example, the observables are X = (X1, . . . , XN), and the parameters of interest are θ = (β, σ²), which capture, respectively, the relationship between opinion polls and the outcome of the election and the level of overdispersion.


The random effects Z = (Z1, . . . , ZN) are latent or nuisance variables.
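To make the example concrete, here is a minimal Python sketch simulating data from this random-effects model; the parameter values, sample sizes, and variable names are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 50, 3                                   # hypothetical number of elections and polls
    beta = np.array([2.0, -1.0, 0.5])              # assumed regression coefficients
    sigma = 0.3                                    # assumed random-effect standard deviation
    w = rng.uniform(size=(N, p))                   # poll predictors w_i in [0, 1]^p
    n = rng.integers(10_000, 100_000, size=N)      # electorate sizes n_i
    z = rng.normal(0.0, sigma, size=N)             # latent random effects Z_i
    prob = 1.0 / (1.0 + np.exp(-(w @ beta) + z))   # success probability, as in the display above
    x = rng.binomial(n, prob) / n                  # observed vote shares X_i

Fitting θ = (β, σ²) by maximum likelihood from (x, w, n) alone requires integrating out z, which is precisely the computational difficulty discussed next.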

Introducing latent variables can render even the simplest procedures, such as maximum likelihood estimation of θ, nontrivial. To introduce some notation, let fX,Z|Θ(x, z | θ) be the joint density of the observable X and the latent variable Z for a fixed value of the parameter. Densities are defined with respect to a product measure µX(dx)µZ(dz). As we only observe X, the (marginal) likelihood of θ is proportional to

fX|Θ(x | θ) := ∫ fX,Z|Θ(x, z | θ) µZ(dz).

Even for the simple model above, there is no closed-form expression for this integral. Therefore, we must resort to numerical methods to find its maximum over θ. This course deals with methodologies which allow us to solve problems such as this, which are not necessarily Bayesian.

In a fully Bayesian analysis, the parameter is considered a random variable Θ with a prior density fΘ(θ). The random variables X, Z, and Θ are defined in the same probability space, and their joint distribution is fX,Z|Θ(x, z | θ) fΘ(θ) µX,Z(dx, dz) µΘ(dθ). For the moment, we won't specify the measure space in which each variable lives.

Inferences and decisions in the Bayesian paradigm are based on the posterior distribution. The posterior distribution is a regular conditional probability distribution, which for every point x defines a measure on the parameter space

µΘ|X(dθ | x) = [ ∫ fX,Z|Θ(x, z | θ) fΘ(θ) µZ(dz) / ∫∫ fX,Z|Θ(x, z | θ) fΘ(θ) µZ(dz) µΘ(dθ) ] µΘ(dθ)
             = [ fX|Θ(x | θ) fΘ(θ) / ∫ fX|Θ(x | θ) fΘ(θ) µΘ(dθ) ] µΘ(dθ).

2 remark. Regular conditional probability distributions may not exist when X is continuous. A sufficient condition for existence is that X lives in a Polish (separable, complete metric) space (X, X ) with the Borel σ-algebra.

Inferential summaries about the parameter Θ include the Bayes estimate under a loss function L,

θ_Bayes = arg min_t ∫ L(t − θ) µΘ|X(dθ | x),

which most commonly refers to the posterior mean of a real parameter when L(ℓ) = ℓ², or the posterior median when L(ℓ) = |ℓ|. Other quantiles of the posterior distribution are used to define credible intervals, which serve as the Bayesian counterpart of confidence intervals and, in some cases but not always, have similar frequentist properties.

3 notation. We will henceforth omit the subscript from prior and posterior densities and measures when it can be inferred from the argument. For example, we might write µ(dθ | x) instead of µΘ|X(dθ | x).

In many models the observation is a sequence of random variables X1, . . . , Xn conditionally independent given Θ, i.e. f(x_{1:n} | θ) = ∏_{i=1}^{n} f(xi | θ). The task of prediction is performed using the posterior distribution of future observations X_{n+1}, . . . , X_{n+m}, or predictive rule:

f(x_{n+1:n+m} | x_{1:n}) = ∫ ∏_{i=n+1}^{n+m} f(xi | θ) µ(dθ | x_{1:n}).

Beyond estimation of future observables, the predictive rule can be used to derive optimal decisions under uncertainty. The von Neumann-Morgenstern utility theorem motivates the goal of maximising the posterior expectation of a utility function on future outcomes.

Finally, model selection in the Bayesian paradigm is usually done through a Bayes factor between a model M1 and an alternative M2, or the ratio of their posterior probabilities

[ π(M1) ∫ f1(x | θ1) µ1(dθ1) ] / [ π(M2) ∫ f2(x | θ2) µ2(dθ2) ],

where π(Mi) is the prior probability of model i ∈ {1, 2} and fi, µi are the likelihood and prior under that model. If there is no a priori preference for one model over the other, so π(M1) = π(M2), the critical quantities are the integrals in the numerator and denominator, which are the marginal probabilities of the data under each model, sometimes called the evidence.

The inferential tasks discussed: parameter estimation, uncertainty quantification, prediction, decision making under uncertainty, and model comparison, all rely on taking integrals against the prior or posterior distributions, which is intimately related to the problem of sampling these distributions. These lecture notes discuss a range of numerical methods that address this goal.

2 Exponential families and conjugate priors

4 definition. We say a probability distribution on the space (X, B) is in an exponential family with sufficient statistic T : X → R^d and natural parameter θ ∈ R^d if it has density

f(x | θ) = exp( T(x)ᵀθ − K(θ) ),

where K(θ) = log ∫_X exp(T(x)ᵀθ) dx. Let S be the set {θ ∈ R^d : K(θ) < ∞}.

Due to the convexity of the exponential, the set S is convex. The derivatives of K(θ) can be related to the moments of the sufficient statistic. In fact,

∇K(θ) = [ (∂/∂θ) ∫_X exp(T(x)ᵀθ) dx ] / [ ∫_X exp(T(x)ᵀθ) dx ]
       = [ ∫_X T(x) exp(T(x)ᵀθ) dx ] / [ ∫_X exp(T(x)ᵀθ) dx ] = Eθ(T(X)),

when Eθ′|T(X)| < ∞ for all θ′ in a neighbourhood of θ. Under similar regularity conditions that allow us to differentiate under the integral sign, it is straightforward to show that the Hessian is

∂²K(θ)/∂θ∂θᵀ = Eθ[ (T(X) − EθT(X))(T(X) − EθT(X))ᵀ ] = Var(T(X)).

If X1, . . . , Xn are i.i.d. from f(· | θ), the likelihood is

f(x1, . . . , xn | θ) = ∏_{i=1}^{n} f(xi | θ) = exp( θᵀ ∑_{i=1}^{n} T(xi) − nK(θ) ).

A conjugate prior is one that has a similar form to the likelihood,

f(θ | λ) = exp( θᵀλ1 − K(θ)λ2 − K(λ) ).

Note this is also an exponential family, with sufficient statistic [θᵀ, K(θ)]ᵀ, natural parameter λ = [λ1ᵀ, λ2]ᵀ, and log partition function K. A conjugate prior is said to be closed under sampling, meaning that the posterior distribution is a member of the same exponential family as the prior. Indeed, by Bayes' rule,

f(θ | x1, . . . , xn, λ) = f(x1, . . . , xn | θ) f(θ | λ) / f(x1, . . . , xn | λ)
                       = exp( θᵀλ1′ − K(θ)λ2′ − K(λ′) ),

where λ1′ = λ1 + ∑_{i=1}^{n} T(xi) and λ2′ = λ2 + n.
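Closure under sampling makes posterior updating a purely mechanical operation on hyperparameters. The following Python sketch is illustrative only (the function and the Bernoulli instantiation are mine, not from the notes): for the Bernoulli family, T(x) = x and K(θ) = log(1 + e^θ), and the prior with hyperparameters (λ1, λ2) corresponds, after changing variables to the success probability, to a Beta(λ1, λ2 − λ1) distribution, so the update below reproduces the familiar Beta posterior counts.

    import numpy as np

    def conjugate_update(lam1, lam2, t_values):
        # lam1' = lam1 + sum_i T(x_i), lam2' = lam2 + n
        t_values = np.atleast_2d(t_values)
        return lam1 + t_values.sum(axis=0), lam2 + t_values.shape[0]

    # Bernoulli data x = (1, 0, 1) with T(x) = x; prior (lam1, lam2) = (1, 2),
    # i.e. a Beta(1, 1) prior on the success probability.
    lam1, lam2 = conjugate_update(np.array([1.0]), 2.0, np.array([[1.0], [0.0], [1.0]]))
    # lam1' = [3.], lam2' = 5.0, i.e. a Beta(3, 2) posterior.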

The diagram in Fig. 1 lists conjugate priors for a number of common exponential family likelihoods. In the diagram, every arrow points from a likelihood to its conjugate prior, with the parameter indicated on the arrow. A pattern to remember is that the conjugate prior for parameters which are probabilities tends to be the Beta distribution, or its multidimensional generalisation, the Dirichlet, while the conjugate prior for scale or precision parameters tends to be the Gamma distribution. A multivariate normal likelihood with mean Aµ + b, linear in some parameter µ, has a multivariate normal conjugate prior on µ.

It is worth noting that closure under sampling does not characterise the conjugate priors shown in the figure. Indeed, if we enlarge the vector of sufficient statistics of the prior, [θᵀ, K(θ)]ᵀ, with some function T(θ), the resulting family retains the property of closure under sampling. Also, the set of all distributions over the parameter θ is trivially closed under sampling.

[Figure 1: Diagram of conjugate likelihood-prior pairs. The likelihood nodes are Geometric, Bernoulli/Binomial, Negative Binomial, Discrete/Multinomial, Exponential, Poisson, Normal, Log-Normal, and Multivariate Normal; the prior nodes are Beta, Dirichlet, Gamma, Normal, and Wishart. Every arrow points from a likelihood to its conjugate prior and is labelled by the relevant parameter (p, λ, τ, µ, β, or Σ^{-1}).]

A way to characterise the conjugate priors in Fig. 1 was defined by Diaconis and Ylvisaker (Ann. Stat., 1979). They define a conjugate prior as one in which the predictive distribution—the posterior mean of T(X_{n+1})—is linear in the sufficient statistic:

E(T(X_{n+1}) | x1, . . . , xn) = E(∇K(θ) | x1, . . . , xn) = a ∑_{i=1}^{n} T(xi) + b,

for some constants a and b. Under regularity conditions, only six univariate exponential family priors, including the Normal, Binomial, Negative Binomial, Gamma, and Poisson distributions, satisfy this property. These families coincide with those described by Carl Morris (Ann. Stat. 1982) as having a quadratic mean-variance relationship.

Conjugate priors also make it easy to compute the evidence, or the marginal probability of the observable. Indeed,

f(x1, . . . , xn | λ) = ∫ f(x1, . . . , xn | θ) f(θ | λ) dθ = exp( K(λ′) − K(λ) ).

If the log partition function K is available in closed form, it is possible to compute the evidence analytically. This is useful in Empirical Bayes procedures, where we aim to maximise the evidence over λ.

5 example (Beta-binomial model). Suppose that we observe ni cases of a bacterial infection in hospital i, for i = 1, . . . , N. In each hospital i, we observe a number Xi of antibiotic-resistant infections. We will model the data as follows:

Xi | µi ∼ Binomial(ni, µi) independently for i = 1, . . . , N,
µ1, . . . , µN i.i.d. ∼ Beta(a, b).

The posterior of µi is

f(µi | xi, a, b) ∝ µi^{Xi} (1 − µi)^{ni−Xi} µi^{a−1} (1 − µi)^{b−1},


which is a Beta(a + Xi, b + ni − Xi) distribution. The evidence in this model is

f(x1, . . . , xN | a, b) = ∏_{i=1}^{N} ∫ f(xi | µi) f(µi | a, b) dµi
 = ∏_{i=1}^{N} ∫ (ni choose xi) µi^{Xi} (1 − µi)^{ni−Xi} µi^{a−1} (1 − µi)^{b−1} / B(a, b) dµi
 = ∏_{i=1}^{N} (ni choose xi) (1/B(a, b)) ∫ µi^{a+Xi−1} (1 − µi)^{b+ni−Xi−1} dµi
 = ∏_{i=1}^{N} (ni choose xi) B(a + Xi, b + ni − Xi) / B(a, b),

where B is the Beta function.

There are several ways to estimate the parameters a and b. The maximum likelihood estimate is obtained by maximising f(x1, . . . , xN | a, b). Neglecting the factors which do not depend on a, b, we can write the objective function

∏_{i=1}^{N} a↑Xi b↑(ni−Xi) / (a + b)↑ni,

where we use the Pochhammer notation a↑ℓ := a(a + 1) · · · (a + ℓ − 1). This is easy to differentiate, and numerical methods can be applied to obtain the MLE.

It is perhaps more common to use a moment matching estimator. In the simple case where ni = n, we have

EXi = na/(a + b),   EXi² = na(na + n + b) / ( (a + b)(a + b + 1) ).

We can define estimators in order to match the empirical moments of Xi, i.e. let a, b be the solution of the equations

(1/N) ∑_{i=1}^{N} Xi = na/(a + b),   (1/N) ∑_{i=1}^{N} Xi² = na(na + n + b) / ( (a + b)(a + b + 1) ).

Moment matching estimators have explicit formulas and enjoy good frequentist properties.
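Both estimation routes are simple to implement. Here is a hedged sketch of the Empirical Bayes route in Python, maximising the log evidence derived above; the function names, the log-scale parametrisation, and the optimiser choice are my own illustrative assumptions.

    import numpy as np
    from scipy.special import betaln, gammaln
    from scipy.optimize import minimize

    def log_evidence(a, b, x, n):
        # log f(x_1, ..., x_N | a, b) for the Beta-binomial model above
        log_binom = gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
        return np.sum(log_binom + betaln(a + x, b + n - x) - betaln(a, b))

    def empirical_bayes(x, n):
        # maximise the evidence over (a, b) > 0 via an unconstrained log-scale search
        obj = lambda t: -log_evidence(np.exp(t[0]), np.exp(t[1]), x, n)
        res = minimize(obj, x0=np.zeros(2))
        return np.exp(res.x)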

6 exercise. Consider the model

Xi | λi ∼ Poisson(λi) independently for i = 1, . . . , n,
λ1, . . . , λn i.i.d. ∼ Gamma(α, β).

Derive the posterior distribution f(λi | Xi). Prove that the marginal distribution of each Xi is a Negative Binomial distribution, i.e. the distribution of the number of successes in i.i.d. Bernoulli(p) trials before r failures. What are the parameters, p and r, of this distribution in terms of α and β? Derive the maximum likelihood estimator for p for a fixed r. Derive moment matching estimators for p and r by equating the theoretical mean and variance of Xi to the empirical mean and variance.

3 Graphical models

Graphical models are representations of the conditional independence in a multivariate distribution over a set of variables XV = {Xv ; v ∈ V} indexed by the vertices in a graph. As in a Bayesian setting data and parameters are random variables in the same space, we will not make distinctions between them in this section, and we include them all in XV. Vertices corresponding to data and parameters are sometimes called observed and unobserved nodes.

There are three types of graphical representation which are used widely inthe literature.

7 definition. A Bayes Network with respect to a directed acyclic graph (DAG) on the vertices V satisfies the following property: if p(v) is the set of parents of v in the graph, the distribution of XV can be written

f(xV) = ∏_{v∈V} f(xv | x_{p(v)}).

8 definition. A Markov Random Field with respect to an undirected graph G = (V, E) satisfies the Global Markov Property: for every partition of the vertices V = V1 ∪ V2 ∪ W with no edges between V1 and V2, the variables X_{V1} and X_{V2} are conditionally independent given XW.

9 definition. A Gibbs Random Field with respect to an undirected graph G = (V, E) is a distribution of the form

f(xV) = (1/Z_{G,ψ}) ∏_{C⊆V a clique} ψC(xC),    (1)

where ψC is a non-negative compatibility function for every clique C in the graph G¹. The number Z_{G,ψ} is called the partition function.

A very similar concept is that of a Factor Graph, which can be defined by augmenting the vertex set V with a set of factor nodes F, and defining a bipartite graph G between V and F. The distribution of XV is a product of potentials indexed by the factor nodes,

f(xV) = (1/Z_{G,ψ}) ∏_{a∈F} ψa(x_{δa}),    (2)

where δa is the set of neighbours of node a.

10 remark. Conditioning on the value x*O of the observable nodes O in a Gibbs Random Field (G, ψ) leads to a different Gibbs Random Field with potentials φ, defined by φC(xC) = ψC(xC) 1(x_{O∩C} = x*_{O∩C}) for every clique C.

¹We include 1-cliques, or single vertices.


11 example. A symptom-disease network is a factor graph with variable nodes d1, . . . , dm for diseases and s1, . . . , sℓ for symptoms, which take values in {0, 1} indicating the presence of a disease or symptom. Each factor node ak is connected to one disease node dk and to all the symptoms that occur with this disease. It is conventional to draw factor nodes as filled squares:

[Diagram: a bipartite factor graph with disease nodes d1, . . . , d5, factor nodes a1, . . . , a5 drawn as filled squares, and symptom nodes s1, . . . , s7.]

The compatibility functions are defined by

ψ_{ak}(x_{δak}) = 0 if x_{dk} = 1 and x_{δak} ≠ (1, . . . , 1), and ψ_{ak}(x_{δak}) = 1 otherwise.

In words, the distribution gives equal probability to any configuration which satisfies the condition that when a disease is present, all related symptoms occur. In practice, we might be interested in conditioning on the value of the symptom nodes, which are observable, to infer the posterior of the disease nodes. We might also be interested in learning the graph itself, which is known as structure learning.

12 exercise. Show that any distribution is a Gibbs Random Field, and a Markov Random Field, with respect to the complete graph. Construct a directed acyclic graph such that any random vector XV is a Bayes Network on the graph.

13 exercise. The Markov blanket b(v) of a node v in a Bayes network is the union of the parents of v, the set of children of v, and the parents of v's children. Prove that Xv is independent of every other variable given its Markov blanket: Xv ⊥ X_{V\{v}\b(v)} | X_{b(v)}.

A direct consequence of the fact shown in the previous exercise is that if XV is a Bayes network with respect to a DAG G, then it is a Markov Random Field with respect to the moralised graph of G, i.e. the graph obtained by making the edges of G undirected and connecting any pair of nodes with common children.

14 exercise. A Hidden Markov Model is a time series model where the observables Y_{1:n} and a latent Markov chain X_{1:n} have the following conditional independence relationships:

Yi ⊥ (Y_{−i}, X_{−i}) | Xi for all i ∈ {1, . . . , n},
Xi ⊥ X_{1:i−2} | X_{i−1} for all i ∈ {2, . . . , n}.

Draw graphical representations of this model as a Bayes Network, a Markov Random Field, and a Factor Graph.

It is fairly easy to see that a Gibbs Random Field satisfies the Global Markov Property on its graph. This will be left as an exercise. The converse is a fundamental result due to Hammersley and Clifford.

15 theorem (Hammersley-Clifford). Suppose the distribution of XV is positive, i.e. f(xV) > 0 for all xV ∈ XV, and satisfies the Global Markov Property on G. Then XV is also a Gibbs Random Field on G.

Proof (Grimmett, Bull. London Math. Soc. 1973). Define for any S ⊆ V a putative compatibility function

ψS(xS) := ∏_{U⊆S} f(xU, 0_{V\U})^{(−1)^{|S|−|U|}} > 0,

where 0V is some arbitrary configuration in XV, and f(xU, 0_{V\U}) is the density of a configuration which coincides with xV on U and with 0V everywhere else. We assume the convention that products over subsets include the empty set. We claim that, first, f(xV) = ∏_{S⊆V} ψS(xS), and furthermore ψS(xS) is a constant when S is not a clique in G. This implies that the distribution of XV factorises over cliques in G, which is the desired result.

To prove the first claim, consider any strict subset U ⊂ V and count the number of times the factor f(xU, 0_{V\U}) occurs, with power 1 or −1, in the product ∏_{S⊆V} ψS(xS). It appears once with power 1 in ψU(xU), once with power −1 in each ψS(xS) for S including U and one extra vertex, once with power 1 in each ψS(xS) for S including U and two extra vertices, and so on. So, the total power of f(xU, 0_{V\U}) in ∏_{S⊆V} ψS(xS) is

1 − (|V|−|U| choose 1) + (|V|−|U| choose 2) − · · · + (−1)^{|V|−|U|} (|V|−|U| choose |V|−|U|) = (1 − 1)^{|V|−|U|} = 0.

So every factor in ∏_{S⊆V} ψS(xS) cancels out, except for f(xV), which appears only once with power 1.

To prove the second claim, we must use the Global Markov Property. Consider any set of vertices S ⊆ V which is not a clique. Then, there is a pair a, b ∈ S which is not in the edge set, and we can write

ψS(xS) = ∏_{U⊆S\{a,b}} [ ( f(xU, 0_{V\U}) / f(x_{U∪{a}}, 0_{V\U\{a}}) ) · ( f(x_{U∪{a,b}}, 0_{V\U\{a,b}}) / f(x_{U∪{b}}, 0_{V\U\{b}}) ) ]^{(−1)^{|S|−|U|}}.

We shall see that the two fractions multiplied in each factor cancel each other, so the product is equal to 1. We have

f(xU, 0_{V\U}) / f(x_{U∪{a}}, 0_{V\U\{a}}) = f(0a | xU, 0_{V\U\{a}}) / f(xa | xU, 0_{V\U\{a}})
 = f(0a | x_{U∪{b}}, 0_{V\U\{a,b}}) / f(xa | x_{U∪{b}}, 0_{V\U\{a,b}})
 = f(x_{U∪{b}}, 0_{V\U\{b}}) / f(x_{U∪{a,b}}, 0_{V\U\{a,b}}),

where the second equality follows from the Global Markov Property—as they are not connected, a and b are independent given the rest of the graph. □

3.1 Inference on graphical models

Bayesian inference tasks in a Gibbs Random Field (G, ψ) can be phrased in terms of the following abstract problems:

1. Compute the partition function Z.

2. Sample the distribution of XV .

3. Compute a conditional distribution f (xA | xB) for A, B ⊆ V.

4. Compute a marginal distribution f (xA) for a subset A ⊆ V.

When the domain of the variables is discrete, for example Xv = {0, 1} for all v ∈ V, it can be shown that the computational hardness of these problems is roughly the same.

Reduction between marginals and conditionals. A marginal is a special case ofa conditional with B = ∅. The conditionals can be obtained from marginalsthrough Bayes theorem.

Reduction from marginals to sampling. The marginal of XA can be estimated through its empirical distribution in a set of independent samples X_V^{(1)}, . . . , X_V^{(n)}. By Hoeffding's inequality, the empirical estimate of f(xA) for any configuration xA attains a precision ε with probability at least 1 − δ with n = log(2/δ)/(2ε²) samples.

Reduction from sampling to marginals. Taking an arbitrary ordering of the variables v1, v2, . . . , vm, we can successively sample Xv1 ∼ f(xv1), Xv2 ∼ f(xv2 | Xv1), . . . , Xvm ∼ f(xvm | Xv1, . . . , Xv_{m−1}).

Reduction from marginals to partition function. By Bayes theorem, the marginal f(x*A) is the ratio of partition functions Z_{G,φ}/Z_{G,ψ}, where φ are the compatibility functions modified as in Remark 10 for the conditional distribution given XA = x*A.

Reduction from partition function to marginals. The problem is to compute a partition function Zψ for a Gibbs random field on G with potentials ψ, if we can call a routine which computes the marginal f(xV). Then, for any configuration xV,

Zψ = ∏_{C∈cliques(G)} ψC(xC) / f(xV).

[Figure 2: Partition of a factor tree at an edge (i, a). One enclosed subtree has variable nodes V(i, a) and factor nodes F(i, a); the other has variable nodes V(a, i) and factor nodes F(a, i).]

3.2 Belief propagation

Belief propagation (BP) is a family of message-passing algorithms used to marginalise and maximise distributions of the form

f(xV) = (1/Z_{G,ψ}) ∏_{a∈F} ψa(x_{δa})

for a factor graph G = (V, F, E), where δa denotes the neighbours of a ∈ F. In this section we restrict our attention to discrete random fields. For the sake of clarity, we will use the symbols i, j, k, . . . to denote variable nodes, and a, b, c, . . . for factor nodes.

Suppose that G is a tree. Any edge (i, a) divides the graph into two subtrees. We shall denote by F(i, a) and F(a, i) the sets of factor nodes in each subtree, and by V(i, a) and V(a, i) the variable nodes in each subtree, as shown in Fig. 2.

The messages µ_{a→i} and µ_{i→a} for all i ∈ V, a ∈ F are probability distributions on the space of the variable Xi, defined by

µ_{a→i}(xi) ≅ ∑_{x*_{V(a,i)∪{i}} : x*i = xi} ∏_{b∈F(a,i)} ψb(x*_{δb}),

µ_{i→a}(xi) ≅ ∑_{x*_{V(i,a)} : x*i = xi} ∏_{b∈F(i,a)} ψb(x*_{δb}),

where ≅ indicates that the two sides are equal up to a constant, chosen so that the left hand side is a probability distribution in each case.

First, observe that marginals can be recovered from the messages through

f(xi) ≅ ∏_{a∈δi} µ_{a→i}(xi),

f(x_{δa}) ≅ ψa(x_{δa}) ∏_{j∈δa} µ_{j→a}(xj);

this is simply a consequence of the distributive property of sum and product.


Second, note that the messages satisfy the following recursion:

µ_{i→a}(xi) ≅ ∏_{b∈δi\{a}} µ_{b→i}(xi),    (3)

µ_{a→i}(xi) ≅ ∑_{x*_{δa\{i}}} ψa(xi, x*_{δa\{i}}) ∏_{j∈δa\{i}} µ_{j→a}(x*j).    (4)

The idea of the sum-product algorithm is to iterate these recursions in order to find a fixed point.

16 definition. The sum-product algorithm iterates the following message update:

µ_{i→a}^{(t+1)}(xi) ≅ ∏_{b∈δi\{a}} µ_{b→i}^{(t)}(xi),    (5)

µ_{a→i}^{(t)}(xi) ≅ ∑_{x*_{δa\{i}}} ψa(xi, x*_{δa\{i}}) ∏_{j∈δa\{i}} µ_{j→a}^{(t)}(x*j).    (6)

The order in which messages are updated is called the schedule, and the case specified by Eqs. 5 and 6 is known as parallel updating.
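To make the recursion concrete, here is a minimal Python sketch of sum-product on the smallest interesting tree, a chain x1 - a - x2 - b - x3 with two pairwise factors; the random potentials and all names are illustrative. On a tree each message only needs to be computed once, and a brute-force summation over the joint confirms the marginal.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 3                                  # each variable takes values in {0, ..., K-1}
    psi_a = rng.uniform(0.5, 2.0, (K, K))  # psi_a[x1, x2]
    psi_b = rng.uniform(0.5, 2.0, (K, K))  # psi_b[x2, x3]

    # Leaf-variable messages are constant, so the factor-to-variable
    # messages into x2 are plain sums over the leaf variables.
    mu_a_to_2 = psi_a.sum(axis=0)          # sum over x1 of psi_a(x1, x2)
    mu_b_to_2 = psi_b.sum(axis=1)          # sum over x3 of psi_b(x2, x3)
    marg_x2 = mu_a_to_2 * mu_b_to_2        # f(x2) up to a constant
    marg_x2 /= marg_x2.sum()

    # Brute-force check against the joint f(x) proportional to psi_a psi_b.
    joint = psi_a[:, :, None] * psi_b[None, :, :]
    joint /= joint.sum()
    assert np.allclose(marg_x2, joint.sum(axis=(0, 2)))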

17 theorem (Convergence of sum-product on trees). If the factor graph G is a tree, and T is the length of the longest path of variable nodes, or the diameter, of G, then the set of messages converges after T iterations: µ^{(T)} = µ.

Proof. We first observe that for any edge (v, w), after any number of iterations t, the messages used to compute µ_{w→v}^{(t)} always lie in the subtree containing w obtained by cutting the edge (v, w), and point in the direction of v. Now, we can prove the theorem by induction on T. It is easy to check the base case; when there is just one variable node, the algorithm converges after just one iteration. Suppose the theorem holds for all trees of diameter T − 1. Consider any message µ_{i→a}^{(t)} or µ_{a→i}^{(t)}. In a graph with diameter T, the observation above and the induction hypothesis ensure that after T − 1 iterations, the messages for edges adjacent to (i, a) have converged to the right solution. The recursions in Eqs. 3 and 4 then imply the convergence of the messages under consideration. □

18 exercise. Consider a tree with diameter D, maximum degree d, and where the state space has cardinality |Xv| = n for each v ∈ V. Is it possible to upper bound the number of operations required for convergence of the sum-product algorithm? In particular, how does this depend on D, d, |V|, and n?

The max-product algorithm is used to find the mode of the distribution of XV. Define the max-marginals

Mi(xi) = max_{x*V : x*i = xi} f(x*V),

Ma(x_{δa}) = max_{x*V : x*_{δa} = x_{δa}} f(x*V).

If we have a way to obtain max-marginals, we can find the mode of the distribution. Taking an arbitrary ordering of the variables X1, . . . , Xm: (i) find the max-marginal M1, (ii) choose x1 = arg max_x M1(x), and (iii) reduce the model by conditioning on X1 = x1; repeating steps (i)-(iii) for X2, X3, and so on, we recover the mode. In each step of this procedure, we need only know the max-marginals up to a constant factor.

On a tree, we can define messages

M_{a→i}(xi) = max_{x*V : x*i = xi} ∏_{b∈F(a,i)} ψb(x*_{δb}),

M_{i→a}(xi) = max_{x*V : x*i = xi} ∏_{b∈F(i,a)} ψb(x*_{δb}),

and observe that

Mi(xi) = (1/Z_{G,ψ}) ∏_{a∈δi} M_{a→i}(xi),

where we use the fact that the maximum of a product of functions of independent variables is the product of the individual maxima. This fact also implies the recursion

M_{i→a}(xi) = ∏_{b∈δi\{a}} M_{b→i}(xi),    (7)

M_{a→i}(xi) = max_{x*_{δa\{i}}} ψa(xi, x*_{δa\{i}}) ∏_{j∈δa\{i}} M_{j→a}(x*j).    (8)

19 definition. The max-product algorithm iterates the following message update:

M_{i→a}^{(t+1)}(xi) = ∏_{b∈δi\{a}} M_{b→i}^{(t)}(xi),    (9)

M_{a→i}^{(t)}(xi) = max_{x*_{δa\{i}}} ψa(xi, x*_{δa\{i}}) ∏_{j∈δa\{i}} M_{j→a}^{(t)}(x*j).    (10)

20 theorem (Convergence of max-product on trees). If the factor graph G is a tree, and T is the length of the longest path of variable nodes, or the diameter, of G, then the set of messages converges after T iterations: M^{(T)} = M.

Proof. We apply the same argument used to prove convergence of the sum-product algorithm on trees.

21 example (Hidden Markov Models). The factor graph for this model is a tree. When the latent variables are discrete, the sum-product algorithm can be applied to sample the posterior of the Markov chain given the observables, f(x_{1:T} | y_{1:T}). This is known as the Forward-Backward algorithm, because the sum-product messages can be updated in a schedule which goes from t = 1 to t = T, followed by sampling the variables from XT back to X1. In fact, it is not hard to see that parallel updating would be wasteful in this model. The max-product algorithm applied to the posterior of the latent chain is known as the Viterbi algorithm.
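For reference, a hedged sketch of the Viterbi recursion for a discrete HMM, written directly in log space so that the max-product products become sums; the parametrisation by initial, transition, and emission log-probabilities is an assumption of this sketch.

    import numpy as np

    def viterbi(log_init, log_trans, log_emit, obs):
        # log_init[k] = log f(x_1 = k); log_trans[j, k] = log f(x_t = k | x_{t-1} = j)
        # log_emit[k, y] = log f(y_t = y | x_t = k); obs is a sequence of symbols
        T, K = len(obs), len(log_init)
        score = log_init + log_emit[:, obs[0]]       # best log-score of paths ending in each state
        back = np.zeros((T, K), dtype=int)           # argmax pointers for backtracking
        for t in range(1, T):
            cand = score[:, None] + log_trans        # cand[j, k]: extend the best path through j to k
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0) + log_emit[:, obs[t]]
        path = [int(score.argmax())]                 # mode of the posterior over paths
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]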

22 exercise. A Gaussian State Space Model has the same structure as a Hidden Markov Model. The latent variables X1, . . . , XT are random vectors in R^{p1} and the observables Y1, . . . , YT are random vectors in R^{p2}. The joint distribution is specified by:

X0 = x0,
Xi | X_{i−1} ∼ N(AX_{i−1} + c, Σ),
Yi | Xi ∼ N(BXi + d, Σo).

The Kalman filter is an algorithm for posterior inference, which may be interpreted as a continuous-space version of the sum-product algorithm. The posterior density factorises in the following way:

f(x_{1:T} | y_{1:T}) = (1/Z) ∏_{i=0}^{T−1} ψi(xi, x_{i+1}) ∏_{i=1}^{T} ψi(xi, yi).

As this is a pairwise model (every factor depends on at most two variables), we can define messages between variable nodes as a shorthand. For j ≥ 0,

µ_{j→j+1}(x_{j+1}) ≅ ∫ · · · ∫ ∏_{i=0}^{j} ψi(xi, x_{i+1}) ∏_{i=1}^{j+1} ψi(xi, yi) dx1 · · · dxj.

1. Write the posterior distribution of xT given y1:T in terms of messages.

2. Write a recursion for the messages.

3. Show that it is possible to compute the messages in the order µ_{1→2}, µ_{2→3}, . . . , µ_{T−1→T}. What is the computational complexity in terms of p1 and p2?

4. Explain how to sample Xt | Xt+1:T , Y1:T for t = T − 1, T − 2, . . . , 1.

3.3 Loopy belief propagation

The message updates defined in the previous section can be applied to factor graphs with cycles. Belief propagation in such graphs is only approximate, but it often works well. In this setting, the sum-product iteration may not have fixed points, and when it does, the algorithm may not necessarily converge. Furthermore, the quality of the approximation of marginal estimates from fixed-point messages is difficult to assess. This section discusses a set of correlation decay conditions which ensure the convergence and correctness of belief propagation.

23 definition. The computation tree T_i^t(G) is a factor graph with nodes V′, F′. For every non-reversing path (no consecutive crossings of any given edge) in G starting from i ∈ V, which goes through at most t variable nodes, there is a path from the root of T_i^t(G). Two paths which share a common prefix in G map to two paths with a common prefix of the same length in T_i^t(G). The paths define a natural mapping π : V′ ∪ F′ → V ∪ F between the nodes in the computation tree and those in G. This mapping may not be injective; i.e. there may be many nodes in V′ ∪ F′ mapping to the same node in V ∪ F. See Fig. 3 for an example.

[Figure 3: A factor graph G on the left, and the corresponding computation trees T_i^2(G) and T_i^3(G), of depth 2 and 3, on the right. Every non-reversing path in G rooted at i maps to a path in the computation tree. The vertices are related by the non-injective mapping π, e.g. π(b1) = π(b2) = b.]

We now define extended distributions on the computation tree using the potentials of our loopy factor graph (G, ψ):

f^{(i,t)}(x_{V′}) = (1/Z) ∏_{a∈F′} ψ_{π(a)}(x_{δa}),

f_η^{(i,t)}(x_{V′}) = (1/Z_η) [ ∏_{a∈F′} ψ_{π(a)}(x_{δa}) ] ∏_{j∈δT_i^t(G)} η_{j→a(j)}(xj),

where δT_i^t(G) is the set of variable nodes farthest away from the root, and the η_{j→a(j)} are boundary potentials indexed by one of these nodes and its adjacent factor node.

24 exercise. Set the boundary potentials to η_{j→a(j)} = µ_{π(j)→π(a(j))}^{(t0)}, the sum-product messages after t0 ≥ 0 iterations. Prove that for t1 ≥ 1 iterations,

f_η^{(i,t1)}(xi) = f^{(t1+t0)}(xi),    (11)

where the right hand side is the estimate of the marginal distribution of Xi obtained from belief propagation after t1 + t0 iterations, i.e.

f^{(t)}(xi) ≅ ∏_{a∈δi} µ_{a→i}^{(t)}(xi).

25 theorem. Writing L = δT_i^t(G), if

sup_{xL, x*L} | f^{(i,t)}(xi | xL) − f^{(i,t)}(xi | x*L) | ≤ δ(t),    (12)

then, for any t1, t2 ≥ t,

| f^{(t1)}(xi) − f^{(t2)}(xi) | ≤ δ(t).

In particular, the sum-product algorithm converges if δ(t) → 0.

Proof. By Eq. 11 with t0 = t, η = µ^{(t1−t)} and η′ = µ^{(t2−t)},

| f^{(t1)}(xi) − f^{(t2)}(xi) | = | f_η^{(i,t)}(xi) − f_{η′}^{(i,t)}(xi) |
 = | ∑_{xL} f^{(i,t)}(xi | xL) f_η^{(i,t)}(xL) − ∑_{x*L} f^{(i,t)}(xi | x*L) f_{η′}^{(i,t)}(x*L) |
 = | ∑_{xL, x*L} ( f^{(i,t)}(xi | xL) − f^{(i,t)}(xi | x*L) ) f_η^{(i,t)}(xL) f_{η′}^{(i,t)}(x*L) |
 ≤ ∑_{xL, x*L} | f^{(i,t)}(xi | xL) − f^{(i,t)}(xi | x*L) | f_η^{(i,t)}(xL) f_{η′}^{(i,t)}(x*L) ≤ δ(t). □

The following result ensures that the fixed point of belief propagation yields good approximations of marginals. The bound will be best when the graph does not have very short cycles.

26 theorem. If the subgraph Bi(t) of G induced by the variable nodes within distance t of i is a tree, and inequality 12 holds, then

| f(xi) − f^{(t)}(xi) | ≤ δ(t).

Proof. If Bi(t) is a tree, then f(xi | xL) = f^{(i,t)}(xi | xL). Observing that f(xi) = ∑_{xL} f(xi | xL) f(xL), we can apply the same argument as in the previous theorem. □

The correlation decay inequality 12 is not easy to verify in most cases. Dobrushin defined a criterion which only requires controlling the influence between pairs of variables and is often more practical.


27 theorem (Dobrushin, 1968). Define the influence of node j on node i as

Cij = max_{x_{V\{i}}, x*j} ‖ f(Xi = · | x_{V\{i,j}}, xj) − f(Xi = · | x_{V\{i,j}}, x*j) ‖_TV,

where ‖p − q‖_TV denotes the total variation distance between two distributions p and q. In words, this is the maximal effect on the conditional distribution of Xi of changing the value of Xj while keeping every other variable fixed. Let

γ = sup_{i∈V} ∑_{j≠i} Cij;

then, letting B̄ be the complement of Bi(t),

sup_{x_{B̄}, x*_{B̄}} ‖ f(Xi = · | x_{B̄}) − f(Xi = · | x*_{B̄}) ‖_TV ≤ γ^t / (1 − γ).    (13)

28 corollary. Let G be a pairwise factor graph, where every factor node has at most two neighbours. If i, j ∈ V′ are two variable nodes in the computation tree T_i^t(G), the influence in the extended distribution f^{(i,t)} satisfies Cij = C_{π(i)π(j)} if the nodes are adjacent to a common factor node, and Cij = 0 otherwise. Applying the theorem to the computation tree leads to inequality 12 with δ(t) = γ^t/(1 − γ). Therefore, the sum-product algorithm converges if γ < 1.

Proof sketch of Theorem 27. The total variation distance ‖g − f‖_TV is the infimum of Pr(X ≠ Y) over all couplings with X ∼ g, Y ∼ f. The proof proceeds by constructing a coupling of f(Xi = · | x_{B̄}) and f(Xi = · | x*_{B̄}), using the couplings that exist for the conditional distributions in the expression for the influence.

29 exercise. The ferromagnetic Ising model is a system of spins in {−1, 1}, with joint distribution

f(xV) = (1/Z) exp( ∑_{i∈V} B xi + ∑_{(i,j)∈E} β xi xj ),

where (V, E) is a graph with regular degree k. Prove that belief propagation converges in this model when β < (1/2) artanh(1/k).

4 The Monte Carlo method

The Monte Carlo method is used to estimate features of a distribution—most commonly, an expectation—by simulating random variates. The oldest example usually cited is due to the Comte de Buffon, an 18th century naturalist who is credited with introducing the calculus to probability.

30 example (Buffon's needle). Suppose we have a board split into N horizontal strips separated by a distance ℓ, and we drop a needle of length ℓ on the board at random.

An easy calculation shows that if the orientation of the needle and its height are uniformly distributed, θ ∼ Unif(0, 2π) and h ∼ Unif(0, ℓN), then

Pr(needle intersects a line) = 4 ∫_{θ=0}^{π/2} ∫_{h=0}^{ℓ sin θ} 1/(2πℓ) dh dθ = 2/π.

Therefore, we can estimate π by repeating the experiment many times and observing the proportion of times the needle crosses a line.
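A minimal simulation in Python, as a sketch; it parametrises the drop by the needle's centre and its distance to the nearest line, an equivalent formulation of the experiment (with ℓ = 1), and the sample size is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    theta = rng.uniform(0.0, 2 * np.pi, n)        # orientation of the needle
    d = rng.uniform(0.0, 0.5, n)                  # distance from the needle's centre to the nearest line
    crosses = d <= 0.5 * np.abs(np.sin(theta))    # the needle reaches the line iff its half-extent covers d
    print("pi estimate:", 2 / crosses.mean())     # Pr(cross) = 2/pi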

More generally, if we want to estimate any expectation EY = µ, the Monte Carlo estimate is defined as the empirical average

µn = (1/n) ∑_{i=1}^{n} Yi,

where Y1, . . . , Yn are i.i.d. copies of Y. The laws of large numbers tell us that under the mild condition E|Y| < ∞, µn converges to µ in probability and almost surely. If we assume further that Var(Y) = σ² < ∞, then the Central Limit Theorem states that

√n (µn − µ) →d N(0, σ²).

This allows us to perform inference. For example, using an empirical estimate of the variance,

σn² = (1/n) ∑_{i=1}^{n} (Yi − µn)²,

we can construct an approximate 95% confidence interval for µ:

( µn − 1.96 σn/√n , µn + 1.96 σn/√n ).

The power of Monte Carlo is based on its versatility; it guarantees an accuracy that improves as O(n^{−1/2}) with the number of samples n, with barely any assumptions on the distribution of Y. A more complex example, where Y is a function of many variables, lends support to this assertion.

31 example. A group of engineers is developing a communications system which routes messages between wifi routers in a city. The locations of the routers are modelled by a Poisson point process in a square area. Each router has a limited range and can only communicate with other routers a distance at most r away. To estimate the average delay, the engineers want to compute the expectation of the minimum number of hops a message must make to travel between two randomly chosen routers.

While it would be difficult to estimate this expectation analytically, as it depends in a non-linear way on potentially many locations, it is very simple to simulate a Poisson process on the square and find the minimum number of hops required to travel between two routers.
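A sketch of the simulation the engineers might run; the intensity of the process, the radius r, and the use of breadth-first search are illustrative assumptions (the sketch also assumes at least two routers are generated).

    import numpy as np
    from collections import deque

    rng = np.random.default_rng(0)
    intensity, r = 200, 0.1                              # assumed Poisson intensity and range
    n = rng.poisson(intensity)                           # number of routers in the unit square
    pts = rng.uniform(0.0, 1.0, (n, 2))
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    adj = (dist <= r) & ~np.eye(n, dtype=bool)           # which routers can talk to each other

    def hops(src, dst):
        # breadth-first search for the minimum number of hops (None if disconnected)
        seen, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            if u == dst:
                return seen[u]
            for v in np.flatnonzero(adj[u]):
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        return None

    i, j = rng.choice(n, size=2, replace=False)          # two randomly chosen routers
    print(hops(i, j))

Averaging hops(i, j) over many independent realisations gives the Monte Carlo estimate of the expected delay.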

This versatility has made Monte Carlo simulation central to the history of computing. Developed and named by Stanisław Ulam and John von Neumann, who used it to study neutron diffusion during the Manhattan Project, it quickly found applications in many fields as computers became widespread: there are examples in operations research and queueing systems in the 1960s, in chemical physics and structural biology, as well as finance in the 1970s. Monte Carlo became a mainstay of Statistics with the development of the Bootstrap by Efron (1979) and computer-intensive methods for Bayesian analysis in the 1980s (cf. Geman and Geman, 1984).

Numerical integration is a long-standing problem, and many techniques for one-dimensional quadrature are vastly superior to Monte Carlo estimation. For example, Simpson's rule defines an estimate for the integral ∫_a^b f(x) dx given evaluations of the function on a regular grid, fi = f(a + ih) with h = (b − a)/n:

∫_a^b f(x) dx ≈ (h/3) [ f0 + 2 ∑_{j=1}^{n/2−1} f_{2j} + 4 ∑_{j=1}^{n/2} f_{2j−1} + fn ].

This quadrature rule, based on quadratic interpolation, converges to the true integral at a rate O(n^{−4}) when the function f has a continuous fourth derivative. However, a straightforward extension of the rule to functions of d variables converges at a rate O(n^{−4/d}), which deteriorates quickly with the dimension—Monte Carlo estimates don't suffer from this problem.
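For reference, a direct implementation of the composite rule above, as a sketch (the interface is mine):

    import numpy as np

    def simpson(f, a, b, n):
        # composite Simpson rule on a regular grid with an even number of panels n
        assert n % 2 == 0
        x = np.linspace(a, b, n + 1)
        h = (b - a) / n
        fx = f(x)
        return h / 3 * (fx[0] + 2 * fx[2:-1:2].sum() + 4 * fx[1::2].sum() + fx[-1])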

Knowledge of an integrand's smoothness can lead to quadrature rules that are more efficient than simple Monte Carlo. This is the focus of much research in numerical analysis, including Quasi Monte Carlo methods, where the points are designed to cover the space rather than sampled at random, Bayesian or probabilistic numerics, and kernel quadrature. Many of these methods are adaptive, in the sense that they provide rates superior to O(n^{−1/2}) for functions with a range of smoothness.

4.1 Quantiles and ratios

Two problems that appear frequently cannot be interpreted as estimating expectations, so they deserve individual attention. The first is the problem of estimating quantiles of the distribution FY. A p-quantile, Qp, is defined by

Pr(Y ≤ Qp) = p.

Quantiles are necessary to construct posterior credible intervals in Bayesian analysis, as well as confidence intervals through the Bootstrap method. A Monte Carlo estimate is defined from the order statistics Y_(1) ≤ Y_(2) ≤ · · · ≤ Y_(n) of n i.i.d. copies of Y by

Q̂p = Y_(⌊np⌋).

Q̂p is identically distributed as F_Y^{−1}(B) for B ∼ Beta(⌊np⌋, n − ⌊np⌋ + 1), which is centred at Qp as n grows large. However, it is easier to build confidence intervals of the form (Y_(L), Y_(R)), noting that

Pr(Y_(L) ≤ Qp < Y_(R)) = Pr(L ≤ X < R)

for X ∼ Binomial(n, p). To build an approximate (1 − α) confidence interval, we choose L and R to be close to the α/2 and 1 − α/2 quantiles of the binomial distribution:

L = max{ 0 ≤ ℓ ≤ n : ∑_{i=0}^{ℓ} (n choose i) p^i (1 − p)^{n−i} ≤ α/2 },

R = min{ 0 ≤ r ≤ n : ∑_{i=r}^{n} (n choose i) p^i (1 − p)^{n−i} ≤ α/2 }.

If either of the sets on the right hand side is empty, the confidence interval isextended to the boundary of the range of Y.
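A sketch of this construction in Python, using the binomial distribution functions in scipy; the helper name and the boundary handling are my own assumptions.

    import numpy as np
    from scipy.stats import binom

    def quantile_ci(y, p, alpha=0.05):
        # order-statistic interval (Y_(L), Y_(R)) for the p-quantile, as defined above
        y = np.sort(np.asarray(y, dtype=float))
        n = len(y)
        ell = np.arange(n + 1)
        lower = np.flatnonzero(binom.cdf(ell, n, p) <= alpha / 2)      # candidate L values
        upper = np.flatnonzero(binom.sf(ell - 1, n, p) <= alpha / 2)   # P(X >= r) <= alpha/2
        L = lower.max() if lower.size else 0          # extend to the boundary if the set is empty
        R = upper.min() if upper.size else n
        return y[max(L - 1, 0)], y[min(R - 1, n - 1)]  # order statistics are 1-indexed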

The second non-standard Monte Carlo problem is that of estimating a ratio

θ = EY/EX = µY/µX

for a pair of random variables X, Y in the same probability space. This is a very common problem, as the next two examples show.

32 example. A conditional expectation E(Y | A) = E(Y 1_A)/Pr(A) is a ratio of expectations for any measurable set A.

33 example (Importance sampling). We often want to estimate an expectation ∫ f(y) µ(dy) by Monte Carlo. However, it is difficult to sample the measure µ. Often, we can find a measure ν which is easy to sample and such that µ is absolutely continuous with respect to ν. If we know the Radon-Nikodym derivative up to a constant, dµ/dν(y) = c h(y), we can write

∫ f(y) µ(dy) = c ∫ f(y) h(y) ν(dy) = ∫ f(y) h(y) ν(dy) / ∫ h(y) ν(dy),

a ratio of expectations with respect to ν.

Given (Xi, Yi), 1 ≤ i ≤ n, i.i.d. copies of (X, Y), the Monte Carlo ratio estimator is

θ̂ = [ (1/n) ∑_{i=1}^{n} Yi ] / [ (1/n) ∑_{i=1}^{n} Xi ] = Ȳ/X̄.

It is possible to construct approximate confidence intervals with this estimator. Its asymptotic distribution can be derived with the delta method, as is usually done with plug-in estimators. Write θ̂ = f(X̄, Ȳ), and define

Var(θ̂) = (1/n) E(Y − θX)² / µX²,

and its empirical counterpart

V̂ar(θ̂) = (1/(n²X̄²)) ∑_{i=1}^{n} (Yi − θ̂Xi)².
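In code, the estimator and its variance estimate are one line each; a minimal sketch:

    import numpy as np

    def ratio_estimate(x, y):
        # theta_hat = Ybar / Xbar with its delta-method variance estimate
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        n = len(x)
        theta = y.mean() / x.mean()
        var = np.sum((y - theta * x) ** 2) / (n ** 2 * x.mean() ** 2)
        return theta, var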

34 theorem. If X and Y are both square integrable variables, then as n → ∞,

(θ̂ − θ) / √(V̂ar(θ̂)) →d N(0, 1).

Proof. The first-order Taylor series approximation of f(X̄, Ȳ) around µ = (EX, EY), with the mean value formulation of the remainder, is

θ̂ = θ + ∂f_X(µ̃)(X̄ − µX) + ∂f_Y(µ̃)(Ȳ − µY)

for some µ̃ = tµ + (1 − t)(X̄, Ȳ) with 0 ≤ t ≤ 1. By the weak law of large numbers and the continuous mapping theorem (35),

∂f(µ̃) →d ∂f(µ),    (14)

n V̂ar(θ̂) →d n Var(θ̂).    (15)

The multidimensional CLT gives us √n (X̄ − µX, Ȳ − µY) →d N(0, Σ), where Σ is the covariance matrix of (X, Y). A straightforward computation shows n^{−1} ∂f(µ)ᵀ Σ ∂f(µ) = Var(θ̂). Therefore, by continuous mapping and (14),

√n [ ∂f_X(µ̃)(X̄ − µX) + ∂f_Y(µ̃)(Ȳ − µY) ] →d N(0, n Var(θ̂)).

The continuous mapping theorem and (15) yield the desired result. □

35 theorem. If Xn →d X and g is a continuous function on the range of Xn, then g(Xn) →d g(X).


4.2 Sampling univariate distributions

Algorithms to sample univariate distributions are the building blocks of Monte Carlo and Markov chain Monte Carlo methods. Most algorithms rely on pseudorandom numbers, which are streams u1, u2, . . . that appear random and uniformly distributed in [0, 1].

The topic of pseudorandom numbers is fascinating, because they are entirely deterministic sequences, so defining the sense in which they are random is a deep question. Random number sequences generated by physical phenomena merit a separate discussion, even though the existence of randomness in nature is far from obvious. To prove this point, Persi Diaconis built a coin-tossing machine which reliably makes coins land heads.

Pseudorandom numbers are typically generated from arithmetic recursions depending on a number called the seed. Cryptographically secure sequences are those which are provably difficult to predict by an adversary who does not know the seed. A less stringent criterion for randomness is that the sequence has statistical properties matching those of an i.i.d. Uniform(0, 1) sequence. This criterion is sufficient for Monte Carlo methods. Alas, we will not discuss the statistical tests of uniformity satisfied by various sequences, and take for granted that the pseudorandom numbers implemented in software packages are well-tested.

36 definition. An inversion algorithm is one which transforms a Uniform(0, 1) variable U into a variable with CDF FY by defining Y = FY^{−1}(U). More generally, a variable Z with a continuous CDF FZ may be used to generate Y = FY^{−1}(FZ(Z)) with distribution FY, by application of the quantile-quantile function FY^{−1}(FZ(·)).

In many cases, the CDF FY can be inverted analytically. For example, an Exponential(λ) random variable can be generated from a uniform variate U by taking −log(U)/λ. In other cases, FY is known, but it must be inverted numerically through bisection or Newton's algorithm. It is also possible that neither FY nor its inverse has an analytical expression, and FY^{−1} must be approximated numerically from the density function fY. This is the case for the normal distribution, and this is in fact how most software packages generate random normal variates, despite the fact that there exist tricks like the Box-Muller transform.
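As an illustration, a sketch of the inversion algorithm for the exponential distribution (λ is the rate; note that U and 1 − U have the same distribution):

    import numpy as np

    rng = np.random.default_rng(0)

    def exponential_by_inversion(lam, size):
        # Y = F^{-1}(U) = -log(U) / lam has CDF 1 - exp(-lam * y)
        u = rng.uniform(size=size)
        return -np.log(u) / lam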

37 example. The Box-Muller transform allows us to generate two independent standard normal variables, Z1, Z2, from two uniform variates U1, U2, through

Z1 = √(−2 log U1) sin(2πU2),
Z2 = √(−2 log U1) cos(2πU2).

To prove (Z1, Z2) ∼ N(0, I), write the vector in polar coordinates, Z1 + iZ2 = Re^{−iθ}, and observe that θ is uniform in [0, 2π] by the rotational symmetry of the bivariate normal distribution, while R² = Z1² + Z2² has a χ²2 distribution, which is an exponential distribution with mean 2. Surprisingly, the cost of evaluating the above expressions to high precision is comparable to that of estimating the inverse CDF required by the inversion algorithm.
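A direct transcription of the transform, as a sketch:

    import numpy as np

    rng = np.random.default_rng(0)

    def box_muller(size):
        # pairs of independent N(0, 1) variates from pairs of uniforms
        u1, u2 = rng.uniform(size=size), rng.uniform(size=size)
        r = np.sqrt(-2.0 * np.log(u1))          # radius: R^2 is exponential with mean 2
        return r * np.sin(2 * np.pi * u2), r * np.cos(2 * np.pi * u2)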

4.3 Rejection sampling

Rejection sampling can be used to sample a density of the form f(·) = c f̄(·) for which:

1. f̄ is known,

2. there is an integrable function g which dominates f̄, and

3. the density g(x)/(∫ g(x) dx) is easy to sample.

The idea is based on the fact that if h : R² → R+ is the uniform density on the graph of f̄ (the region in R × R+ under the function), then the marginal density of the first coordinate is f, and the conditional density of the second coordinate given the first is uniform. Indeed,

h(x) = ∫ h(x, y) dy = ∫_0^{f̄(x)} c dy = f(x),

h(y | x) = h(x, y)/h(x) = c/f(x), for 0 ≤ y ≤ f̄(x).

So, if we want to sample f, it is enough to obtain a uniform sample on the graph of f̄ and record the abscissa. Observe that the graph of g contains the graph of f̄. In order to sample a uniform point in the graph of f̄, we can sample points in the graph of g until we obtain a point which is in the graph of f̄. This is illustrated in the following figure: points are drawn uniformly on the graph of g, but only those under f̄ are kept.

[Figure: the unnormalised density f̄(x) under an envelope g(x); candidate points are drawn uniformly under g and accepted only if they also fall under f̄.]

The following pseudocode generates one sample from f.

    while True:
        draw X with density proportional to g
        draw U uniformly between 0 and g(X)
        if U ≤ f̄(X):
            return X
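Instantiated in Python, with a deliberately simple target as an illustration: the unnormalised Beta(2, 2) kernel f̄(x) = x(1 − x) on [0, 1], dominated by the constant envelope g = 1/4 (its maximum), whose normalised density is uniform. All names here are assumptions of the sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    def rejection_sample(f_bar, g_bar, sample_g):
        # one draw from the density proportional to f_bar, given g_bar >= f_bar
        while True:
            x = sample_g()                       # draw X with density proportional to g_bar
            u = rng.uniform(0.0, g_bar(x))       # uniform height under the envelope at X
            if u <= f_bar(x):                    # keep the point if it lies under f_bar
                return x

    draws = [rejection_sample(lambda x: x * (1 - x),   # f_bar, the Beta(2, 2) kernel
                              lambda x: 0.25,          # constant envelope g
                              rng.uniform)             # sampler for g's normalised density
             for _ in range(10_000)]

The acceptance probability here is ∫f̄/∫g = (1/6)/(1/4) = 2/3, so on average 1.5 proposals are needed per sample.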


The fact that we only need to know f̄ and g up to a normalisation constant is significant. In Monte Carlo algorithms for Bayesian analysis, we often need to sample conditional densities of the form f(x | y) = f(x, y)/f(y), where the joint density in the numerator is known analytically, but the denominator is not.

Note that at each iteration, the probability of keeping a point is

p = ∫ f̄(x) dx / ∫ g(x) dx,

and the probability that the algorithm has not produced a sample by the i-th iteration, (1 − p)^i, decreases exponentially with i. The key to efficiency is defining an envelope g which hugs f̄ closely.

38 exercise. The Gamma distribution is one of the few classical distributions that are not easy to sample by inversion. This is an exponential family of unimodal, right-skewed distributions on R+ which is used very frequently in Bayesian modelling. A Gamma distribution with shape parameter a and scale 1 has density

fa(x) = x^{a−1} e^{−x} / Γ(a).

For the case a > 1, construct an envelope for this density which is proportional to a normal density with matched moments in an interval [0, z], and to an exponential density on (z, ∞). Derive an upper bound on the expected number of iterations needed to obtain a sample.

Gamma variates with a ≤ 1 can be derived by noting that if G ∼ Gamma(a + 1) and U ∼ Uniform(0, 1) are independent, then GU^{1/a} ∼ Gamma(a). Furthermore, many other distributions can be sampled given a routine to sample gammas. For example, if G1 ∼ Gamma(a1), . . . , Gn ∼ Gamma(an) are independent, then

( G1/∑_{i=1}^{n} Gi , . . . , Gn/∑_{i=1}^{n} Gi ) ∼ Dirichlet(a1, . . . , an).

The Beta distribution is a special case of the Dirichlet with n = 2. A χ²m random variable has a Gamma(m/2) distribution with scale 2.

Very frequently, we may not be able to bound a univariate density f by an integrable function effectively using analytical arguments, as in the previous exercise. However, we may be able to verify that f is log-concave; that is, for any x and y in R,

log f(tx + (1 − t)y) ≥ t log f(x) + (1 − t) log f(y) for all 0 ≤ t ≤ 1.

This property can be checked using closure properties of concave functions; for example, the log-density is concave if it is a sum of concave functions.

Log-concavity allows us to construct a piecewise exponential envelope, by taking a series of lines ℓi(·) tangent to log f at points xi, for i = 1, . . . , n, and defining

gn(x) = exp( min_{i≤n} ℓi(x) ).

For a density supported on R, two points x1 and x2 on either side of the mode are sufficient to make g integrable.

In adaptive rejection sampling we update the envelope function at each step. Letting g2 be the initial envelope, the algorithm proceeds as follows.

    for i in 3, 4, 5, . . . :
        draw Xi with density proportional to g_{i−1}
        draw U uniformly between 0 and g_{i−1}(Xi)
        if U ≤ f(Xi):
            return Xi
        else:
            define gi by adding a tangent line at Xi

It is not difficult to see that the candidate samples tend to be located in areas where the difference between the envelope and f is large, so the envelope improves rapidly from iteration to iteration.

4.4 Variance reduction

This section deals with a set of related techniques used to reduce the variance of a Monte Carlo estimator for EX. The estimators defined in this section are a function of n random variates and will have a variance σ²_1/n, for some constant σ²_1 ideally smaller than Var(X).

Antithetic sampling

To estimate E f(X), take any coupling (X, X′) where X and X′ are identically distributed. If (X_i, X′_i)_{1≤i≤n} are independent copies of (X, X′), the estimator

µ_anti = (1/n) ∑_{i=1}^n [f(X_i) + f(X′_i)]/2

is clearly unbiased, Eµ_anti = E f(X), and has variance

Var(µ_anti) = [Var(f(X))/n] · (1 + ρ)/2,

where ρ = Corr(f(X), f(X′)). The reduction of variance will be large when f(X) and f(X′) are negatively correlated.
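A minimal sketch (my own illustrative choice of coupling: X = U uniform and X′ = 1 − U):

import numpy as np

rng = np.random.default_rng(0)
f = lambda u: np.exp(u)     # estimate E f(U) = e - 1 for U ~ Uniform(0, 1)

n = 100_000
u = rng.uniform(size=n)
plain = f(rng.uniform(size=2 * n)).mean()   # simple Monte Carlo with the same budget
anti = 0.5 * (f(u) + f(1.0 - u)).mean()     # antithetic pairs (U, 1 - U)
# f is monotone, so f(U) and f(1 - U) are negatively correlated and anti has lower variance.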

To compare this estimator to simple Monte Carlo, we need to take into account the reduction of variance, as well as the difference in the cost of computing each summand. Call the variance of the antithetic estimator σ²_1/n and that of the simple Monte Carlo estimator σ²_0/n. Call the cost of computing one summand in each method c_1 and c_0, respectively. In order to compare the two methods, we can use the ratio of the cost of achieving a given accuracy, for example a variance of v, using each method,

E = (c_1σ²_1/v) / (c_0σ²_0/v) = c_1σ²_1 / (c_0σ²_0).

For antithetic sampling c_1 can be as large as 2c_0, when the cost of evaluating f is large compared to the cost of sampling (X, X′), or when sampling (X, X′) is twice as expensive as sampling X. On the other hand, c_1 is not much larger than c_0 when the cost is dominated by the cost of sampling X, and deriving X′ from X is simple.

Stratification

Assume we are interested in EX, and suppose we can partition the probability space (Ω, F) into disjoint measurable sets A_1, . . . , A_J, called strata, such that

1. we know the probability of each set w_j := Pr(A_j), 1 ≤ j ≤ J, and

2. we can sample X restricted to A_j; i.e. changing the measure to Pr_j(B) = Pr(B ∩ A_j)/w_j for B ∈ F.

Sampling X_{j,1}, . . . , X_{j,n_j} i.i.d. from Pr_j for each stratum, we define the estimator

µ_strat = ∑_{j=1}^J (w_j/n_j) ∑_{i=1}^{n_j} X_{j,i},

which by the law of total expectation is unbiased, Eµ_strat = µ, and has variance

Var(µ_strat) = ∑_{j=1}^J (w²_j/n_j) σ²_j,

where σ²_j is the variance of X in (Ω, F, Pr_j). Remarkably, the estimator can have zero variance if the variance within each stratum is zero, σ²_j = 0, making it possible to estimate µ without error given a single sample within each stratum.

39 example. Stratification was developed in the field of survey sampling. For example, consider a poll with the aim of estimating the proportion of Brexit supporters in England. A pollster might have ways of sampling people in distinct subpopulations showing a smaller variability of opinion than the whole population, for instance, men in a specific London borough. If the size of the subpopulations is known from census data, the stratified estimator could have lower variance than a simple Monte Carlo estimator.

A natural way to allocate a number of samples to each stratum is proportional allocation, n_j = n w_j. This rule yields a variance

Var(µ_prop) = (1/n) ∑_{j=1}^J w_j σ²_j ≤ Var(X)/n,

where the inequality is due to the law of total variance. So, the stratified estimator is always better than the simple estimator, provided that the cost c_j of sampling within stratum j is the same as the cost of sampling X unconditionally. If these costs and the variance within each stratum were known, it would be possible to derive optimal sample sizes using the method of Lagrange multipliers, which yields the rule n_j ∝ w_j σ_j/√c_j. In the case when c_j is constant, this leads to the variance

Var(µ_opt) = (1/n) (∑_{j=1}^J w_j σ_j)² ≤ (1/n) ∑_{j=1}^J w_j σ²_j = Var(µ_prop)

by Jensen's inequality. However, it is in general not possible to compute the optimal allocation as the variances σ_j are not known a priori. One could only hope to estimate them empirically.
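A sketch of proportional allocation (my own illustrative target and strata: J equal subintervals of (0, 1)):

import numpy as np

rng = np.random.default_rng(0)
f = lambda u: np.sin(np.pi * u)   # estimate E f(U), U ~ Uniform(0, 1); true value 2/pi

n, J = 10_000, 50
w = np.full(J, 1.0 / J)           # strata A_j = ((j - 1)/J, j/J) with known probability 1/J
nj = (n * w).astype(int)          # proportional allocation n_j = n w_j

mu_strat = sum(w[j] * f(rng.uniform(j / J, (j + 1) / J, size=nj[j])).mean()
               for j in range(J))
mu_plain = f(rng.uniform(size=n)).mean()   # simple Monte Carlo with the same budget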

Inference on stratified estimators is possible by appealing to the CLT for µ_strat, using the empirical variance

V̂ar(µ_strat) = ∑_{j=1}^J (w²_j/n²_j) ∑_{i=1}^{n_j} (X_{j,i} − µ̂_j)², where µ̂_j = (1/n_j) ∑_{i=1}^{n_j} X_{j,i}.

In addition to the usual asymptotics which take the number of samples n_j → ∞, it is possible to prove a normal limit when the number of strata J → ∞, even as n_j ≥ 2 is fixed for each stratum.

Control Variates

We want to estimate µ = E f(X). A control variate is a variable h(X) whose expectation θ can be obtained analytically. Consider the Monte Carlo estimator

(1/n) ∑_{i=1}^n [f(X_i) − h(X_i) + θ],

which clearly has expectation µ. The variance will be small when f(X_i) and h(X_i) are positively correlated, but if this were not the case, one could construct a better control variate by multiplying the function h by a scalar.

More generally, suppose we have p variables h(X) = (h_1(X), . . . , h_p(X)), with known expectations θ ∈ R^p. Given any vector of coefficients β ∈ R^p we can define an unbiased estimator

µ_β = (1/n) ∑_{i=1}^n [f(X_i) − β^⊤h(X_i) + β^⊤θ].

The variance is

Var(µ_β) = (1/n) E[(f(X) − µ − β^⊤(h(X) − θ))²].   (16)

Minimising this variance with respect to the coefficients β is a least squares problem, which would require knowing the covariance matrix of (f(X), h_1(X), . . . , h_p(X)). We can instead minimise the empirical variance

V̂ar(µ_β) = (1/n) ∑_{i=1}^n (f(X_i) − µ_β − β^⊤(h(X_i) − θ))²

with respect to β. Conveniently, this is equivalent to minimising

(1/n) ∑_{i=1}^n (f(X_i) − µ − β^⊤(h(X_i) − θ))²

with respect to µ and β, an ordinary least squares regression. Solving this regression yields at once the estimate β̂ of the optimal coefficient, the Monte Carlo estimator µ_β̂ as the fitted intercept, and the empirical variance of µ_β̂ through the standard error.

While µ_β is unbiased for fixed β, the same is not true of the estimator µ_β̂, as the coefficients are estimated from the data. Indeed, letting β_opt be the minimiser of the variance in Eq. 16, we can write the error of our estimator

µ_β̂ − µ = µ_β̂ − µ_{β_opt} + µ_{β_opt} − µ = (β̂ − β_opt)^⊤(θ − h̄) + (µ_{β_opt} − µ),

where h̄ = n^{−1} ∑_{i=1}^n h(X_i). The second term in the last expression has mean 0, as β_opt is deterministic. On the other hand, the first term is the inner product of two zero-mean terms which may be correlated, leading to bias. However, this bias tends to be small because ‖β̂ − β_opt‖ = O_p(n^{−1/2}) and ‖h̄ − θ‖ = O_p(n^{−1/2}), so their inner product is O_p(n^{−1}) and is dominated by the second term, which is O_p(n^{−1/2}). One must be cautious when the number of control variates p is large, as the first term grows linearly with p. It is safe to neglect the bias when p ≪ √n.
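A sketch of the regression implementation (my own illustrative choices: X uniform, f(x) = eˣ, control h(x) = x with θ = 1/2):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)
fx = np.exp(x)       # f(X); the target is mu = e - 1
hx = x               # control variate h(X) with known expectation theta = 0.5

# Regress f(X_i) on the centred control; the fitted intercept is the estimator.
A = np.column_stack([np.ones(n), hx - 0.5])
coef, *_ = np.linalg.lstsq(A, fx, rcond=None)
mu_cv, beta_hat = coef[0], coef[1]

resid = fx - A @ coef
se = resid.std(ddof=2) / np.sqrt(n)   # approximate standard error of the intercept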

5 Importance sampling

Importance sampling is related to variance reduction methods and to rejection sampling, but is usually easier to apply. On the other hand, it can be difficult to determine when it works well.

Suppose we want to find the expectation I = E f(Y) for Y ∼ ν. Importance sampling uses samples of a different random variable X ∼ µ, whose distribution µ dominates ν (that is, ν is absolutely continuous with respect to µ), and for which the Radon–Nikodym derivative

ρ(x) = (dν/dµ)(x)

is known. Then E f(Y) = E(f(X)ρ(X)), which allows us to define the importance sampling estimator

I_n = (1/n) ∑_{i=1}^n f(X_i)ρ(X_i).

If ρ(x) = τ(x)c is only known up to a normalisation constant c, we can also define the ratio estimator

J_n = ∑_{i=1}^n f(X_i)τ(X_i) / ∑_{i=1}^n τ(X_i)

introduced in Example 33.

Importance sampling may be used in situations when ν is difficult to sample but ρ is easy to evaluate. The method is used more broadly in situations where f is only large in a set A which has very low probability under ν, such that a sample Y ∼ ν would rarely produce large values f(Y). This happens, for example, when estimating the probability of rare events by Monte Carlo. In this case, we can design µ to give more weight to the important set A, which is why ρ(X) is called an importance weight.
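A minimal rare-event sketch (my own example): estimate Pr(Y > 4) for Y ∼ N(0, 1), shifting the proposal into the important set.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000

# Target nu = N(0, 1); the proposal mu = N(4, 1) concentrates on the rare set.
x = rng.normal(loc=4.0, size=n)
rho = norm.pdf(x) / norm.pdf(x, loc=4.0)    # importance weights d(nu)/d(mu)
I_hat = np.mean((x > 4.0) * rho)            # true value 1 - Phi(4), about 3.2e-5

# Plain Monte Carlo with the same budget would typically see no exceedances at all.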

It is instructive to consider the variance of the estimator,

Var(I_n) = (1/n) E(f(X)ρ(X) − I)² = (1/n)[E(f(Y)²ρ(Y)) − I²].

For simplicity, consider the case in which ν and µ have a density with respect to the Lebesgue measure, and let us overload the symbols ν and µ to represent the respective densities. The variance then can be written

Var(I_n) = (1/n)[∫ f(y)²ν²(y)/µ(y) dy − I²].

In the case when f is positive and I > 0, the density µ(x) ∝ f(x)ν(x) yields Var(I_n) = 0. Indeed, with this choice of µ, evaluating the importance weight at any point would allow us to compute the desired expectation I. While this is clearly impossible in most circumstances, it provides a guiding principle to build good measures µ.

In practice Var(I_n) can be very large. Furthermore, when µ and ν are nearly singular, the empirical estimate of variance,

V̂ar(I_n) = (1/n) ∑_{i=1}^n (f(X_i)ρ(X_i) − I_n)²,

can be very bad, because the distribution of the summand is heavy-tailed, such that we only sample large values rarely. This makes approximate confidence intervals based on the CLT inaccurate for small values of n. The variance of the importance sampling estimator may even be infinite, as shown in the following example.

40 example. Let f(x) = x, and let ν and µ be exponential distributions with means 2 and 1, respectively, so that I = E f(Y) = 2. We have ρ(x) = e^{x/2}/2, and

Var(f(X)ρ(X)) = E(f²(X)ρ²(X)) − I² = (1/4)∫_0^∞ x² dx − 4 = ∞.

Fortunately, even in this case, it is possible for the estimator I_n to converge to I in terms of mean absolute deviation E|I_n − I|. The following theorem characterises the number of steps required for convergence as roughly exponential in the Kullback–Leibler divergence between µ and ν, assuming concentration of log ρ(Y) around its mean.

The Kullback–Leibler divergence

L := D(ν‖µ) = ∫ log(ρ(y)) ν(dy) = E log(ρ(Y))

is non-negative by Jensen's inequality, and it is zero for µ = ν a.e. It can be shown that it is a convex function of µ in the convex set of probability measures; therefore, L = 0 if and only if µ = ν a.e. In the following, we write I_n(f) for the importance sampling estimator of E f(Y).

41 theorem (Diaconis and Chatterjee, 2015). Letting ‖f‖_{L²(ν)} = [E f²(Y)]^{1/2}, if n = exp(L + t) for some t ≥ 0,

E|I_n(f) − I(f)| ≤ ‖f‖_{L²(ν)} (e^{−t/4} + 2√(Pr(log ρ(Y) > L + t/2))).   (17)

Conversely, if 1 is the function equal to 1 everywhere, and n = exp(L − t) for some t ≥ 0, for any δ ∈ (0, 1),

Pr(|I_n(1) − 1| < δ) ≤ e^{−t/2} + Pr(log ρ(Y) ≤ L − t/2)/(1 − δ).   (18)

Proof. To prove 17, define a = exp(L + t/2), and let h(x) = f(x) if ρ(x) ≤ a and h(x) = 0 otherwise. We write the error

|I_n(f) − I(f)| ≤ |I_n(f) − I_n(h)| + |I_n(h) − I(h)| + |I(f) − I(h)|.

Now, we bound the expectation of each term. By Cauchy–Schwarz,

|I(f) − I(h)| ≤ E[|f(Y)|1(ρ(Y) > a)] ≤ ‖f‖_{L²(ν)} √(Pr(ρ(Y) > a)).

Similarly,

E|I_n(f) − I_n(h)| ≤ E[|f(X) − h(X)|ρ(X)] = E[|f(Y)|1(ρ(Y) > a)] ≤ ‖f‖_{L²(ν)} √(Pr(ρ(Y) > a)).

Finally,

E|I_n(h) − I(h)| ≤ √(Var(I_n(h))) = √(Var(h(X)ρ(X))/n) ≤ √(E(h²(X)ρ²(X))/n) ≤ √(a E(f²(X)ρ(X))/n) = ‖f‖_{L²(ν)} √(a/n).

Combining these bounds, we obtain the inequality 17. Next, let a = exp(L − t/2). Markov's inequality gives

Pr(ρ(X) > a) ≤ Eρ(X)/a = 1/a,

and E[ρ(X)1(ρ(X) ≤ a)] = Pr(ρ(Y) ≤ a), thus

Pr(I_n > 1 − δ) ≤ Pr(max_{1≤i≤n} ρ(X_i) > a) + Pr((1/n) ∑_{i=1}^n ρ(X_i)1(ρ(X_i) ≤ a) > 1 − δ)
≤ ∑_{i=1}^n Pr(ρ(X_i) > a) + E[ρ(X)1(ρ(X) ≤ a)]/(1 − δ)
≤ n/a + Pr(ρ(Y) ≤ a)/(1 − δ),

which is equivalent to 18.

The value of this theorem is mostly theoretical, as the Kullback–Leibler divergence is not easy to compute in general. In practice, it can be difficult to assess the quality of an importance sampling estimator. The following example provides an illustration of this.

42 example. In the 1976 paper Coping with finiteness, Knuth aimed to estimate the number N_10 of self-avoiding paths on a 10 × 10 grid, starting from the point (0, 0) and ending at the point (10, 10). A self-avoiding path is one which does not visit any vertex more than once. An example of such a path is shown in the figure.


This number can be written as the expectation of a constant function f(x) = N_10, for a random path Y with probability mass function p,

N_10 = ∑_{y∈paths} N_10 p(y).

Suppose p(y) = 1/N_10 is the uniform distribution on the set of self-avoiding paths. Then, for any other distribution q, we can write the expectation

N_10 = ∑_{x∈paths} f(x)ρ(x)q(x) = ∑_{x∈paths} N_10 · (N_10^{−1}/q(x)) · q(x),

where ρ(x) = p(x)/q(x). So, the importance sampling estimator given samples X_1, . . . , X_n from q is

N̂_10 = (1/n) ∑_{i=1}^n 1/q(X_i).

Knuth constructs a distribution q by specifying a sequential procedure to sample from it. Namely, we grow the path from (0, 0) by sampling at each step one of the possible moves with equal probability. Any move which leads to a vertex which is not connected to (10, 10) through vertices in the complement of the path is excluded. If n_i(x) is the number of moves available to the sampler at step i in path x, we have q(x) = [n_1(x)n_2(x) · · · n_{|x|−1}(x)]^{−1}.

Using a few million samples, Knuth estimates there are (1.6 ± 0.3) × 10^24 self-avoiding paths, which is very close to the true number

1,568,758,030,464,750,013,214,106.

It is hard to assess or bound the variance of the estimator, because the summands 1/q(X_i) are usually between 10^7 and 10^11, but a few of the values are much greater and dominate the sum. The reason the weights are distributed so unevenly is that q and p are nearly singular; there are many paths which have small probability under q and are sampled rarely.
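A simplified version of this sequential scheme is easy to reproduce on a small grid (my own sketch, not Knuth's implementation): it omits the connectivity pruning and instead gives weight zero to paths that reach a dead end, which keeps the estimator unbiased. On a 3 × 3 grid, where the true count is 184, the estimate can be checked directly.

import numpy as np

rng = np.random.default_rng(0)

def saw_weight(m):
    # Grow a self-avoiding path on the (m+1) x (m+1) vertex grid from (0, 0)
    # towards (m, m); return 1/q(path), or 0 if the path gets stuck.
    pos, visited, weight = (0, 0), {(0, 0)}, 1.0
    while pos != (m, m):
        x, y = pos
        moves = [(x + dx, y + dy)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx <= m and 0 <= y + dy <= m
                 and (x + dx, y + dy) not in visited]
        if not moves:
            return 0.0        # dead end: weight 0 keeps the estimator unbiased
        weight *= len(moves)  # 1/q accumulates the number of available moves
        pos = moves[rng.integers(len(moves))]
        visited.add(pos)
    return weight

estimate = np.mean([saw_weight(3) for _ in range(200_000)])   # true count: 184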

There are several heuristics used to assess convergence based on the set of weights ρ(X_i), which aim to capture whether a few of the weights are much larger than the others. One is the effective sample size

n_e = n² / ∑_{i=1}^n ρ(X_i)²

or, for the ratio estimator,

n_e = [∑_{i=1}^n τ(X_i)]² / ∑_{i=1}^n τ(X_i)².

The motivation for these quantities is the following. If we considered ρ(X_i) or τ(X_i) as fixed weights, independent of f(X_i) (which is clearly false), then we would have

Var(I_n) = Var(f(X))/n_e,  Var(J_n) = Var(f(X))/n_e,

where the effective sample size replaces the number of samples in the usual variance formula for Monte Carlo. A related statistic, proposed by Diaconis and Chatterjee, is

q_n = max_{1≤i≤n} ρ(X_i) / ∑_{i=1}^n ρ(X_i).

They show that for certain distributions, it is possible to bound the error above and below by this quantity up to constant factors. In several problems, it is a better indicator of accuracy than empirical estimates of the variance.
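Both statistics are immediate to compute from the weights; a small sketch:

import numpy as np

def is_diagnostics(w):
    # Heuristics from importance weights w_i = rho(X_i) (or tau(X_i)):
    # effective sample size and the Diaconis-Chatterjee statistic q_n.
    w = np.asarray(w, dtype=float)
    n_e = w.sum() ** 2 / np.sum(w ** 2)   # equals n when all weights are equal
    q_n = w.max() / w.sum()               # close to 1 when one weight dominates
    return n_e, q_n

For weights that average close to one, the first expression agrees with n²/∑ρ(X_i)² above.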

6 Fundamentals of Markov chain Monte Carlo

The objective of Markov chain Monte Carlo is the approximation of expectations E f(X) for X ∼ µ. In Bayesian analysis, µ will be a posterior distribution. Letting (X_i)_{i≥0} be a discrete time Markov chain for which µ is an invariant measure, we can define the Monte Carlo estimator

(1/n) ∑_{i=1}^n f(X_i).   (19)

Under certain conditions, this estimator will satisfy a law of large numbers, converging to E f(X) in probability or almost surely as n → ∞. This section establishes conditions for convergence.

6.1 Countable state spaces

We will assume familiarity with the theory of Markov chains in discrete spaces, so we merely state the main results here. Suppose first that the Markov chain takes values in a finite space (X, B). Then, the process is parametrised by a transition matrix K, with K(x, y) = Pr(X_1 = y | X_0 = x). We use K also to denote an operator on functions in X,

K f(x) = ∑_y K(x, y) f(y) = E[f(X_1) | X_0 = x],

as well as an operator on measures in (X, B),

νK(x) = ∑_y ν(y)K(y, x);

in words, if X_0 ∼ ν, then X_1 ∼ νK. Note that these operators correspond to multiplication by K on the right and on the left, respectively, if we consider f and ν vectors. Finally, let K^n be the nth power of the matrix, or the n-step transition probability matrix.

We say the Markov chain is irreducible if for any x, y ∈ X there is some n > 0 with K^n(x, y) > 0, and aperiodic if for any x, y ∈ X, the set of powers {n > 0 : K^n(x, y) > 0} has greatest common divisor 1. Note that, as K is a stochastic matrix, a vector of ones is an eigenvector of K with eigenvalue 1, i.e. K1 = 1.

43 theorem (Perron–Frobenius). Let K be an irreducible, aperiodic Markov kernel on a finite state space. Then the eigenspace corresponding to eigenvalue 1 has dimension 1, and every other eigenvalue is strictly smaller than 1 in magnitude.

In consequence, if µ is the left eigenvector with eigenvalue 1 and ∑_x µ(x) = 1, we have lim_{n→∞} K^n(x, y) = µ(y) for all x, y ∈ X.

In a countably infinite space, irreducibility and recurrence imply the existence of a stationary measure µ, which is unique up to multiplication by a constant. However, µ need not be a probability measure. A necessary and sufficient condition for the existence of a stationary probability measure is that at least one state x ∈ X is positive recurrent, i.e. Eτ_x < ∞, where τ_x = min{t > 0 : X_t = x} is the first hitting time of x.

6.2 General state spaces and Harris recurrence

In a general state space, the kernel K(x, ·) is a probability measure on (X, B) for every x ∈ X. As before, we can overload K to denote an operator on functions,

K f(x) = ∫ f(y) K(x, dy),  x ∈ X,

and an operator on measures,

νK(A) = ∫ K(x, A) ν(dx),  A ∈ B.

Similarly, we let K^n be the n-step transition kernel and associated operators. Note that the set of operators {K^n : n ≥ 1} forms a semigroup, as the convolution of two elements satisfies K^n K^m = K^{n+m} for all n, m ≥ 1.

44 definition. We say a set A ∈ B is small if there exists a probability measure ν on (X, B) and constants λ > 0, m ≥ 1, such that

K^m(x, ·) ≥ λν(·) for all x ∈ A.

We say A satisfies a minorisation condition.

Note that as K^m(x, ·) is a probability measure, we must have λ ≤ 1, and therefore, if A is small we can write the m-step kernel as a mixture

K^m(x, ·) = λν(·) + (1 − λ)ν_x(·)

for all x ∈ A, where ν_x(·) = [K^m(x, ·) − λν(·)]/(1 − λ) is a probability measure which may depend on x. The key is that the first component does not depend on x. This allows us to apply a renewal argument; every time the Markov chain enters A, we have a fixed probability λ of generating the next state from a measure which does not depend on the current state. Every time that happens, we are allowed to “forget the past”.

Verifying the minorisation condition can be difficult in general. In some cases the measures {K^m(x, ·) : x ∈ A} admit a density p(x, y) with respect to a base measure ψ, with

K^m(x, B) = ∫_B p(x, y) ψ(dy)  for all x ∈ A.   (20)

If we choose

ν(B) = λ^{−1} ∫_B inf_{x∈A} p(x, y) ψ(dy),  and  λ = ∫_X inf_{x∈A} p(x, y) ψ(dy),

then minorisation holds when λ > 0. This will happen, for example, when x ↦ p(x, y) is continuous and positive, and A is a compact set.
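As a quick illustration (my own, not from the notes), consider the Gaussian random walk kernel on R, K(x, dy) = φ(y − x)dy with φ the standard normal density, and A = [−1, 1]. For fixed y, inf_{x∈A} φ(y − x) = φ(|y| + 1), so

λ = ∫ φ(|y| + 1) dy = 2(1 − Φ(1)) ≈ 0.317,

and A is small with m = 1 and ν(dy) ∝ φ(|y| + 1)dy.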

The existence of a small set alone is not enough to apply a renewal argument; we also need the Markov chain to return to the small set frequently enough. This is captured by the notion of Harris recurrence.

45 definition. A Markov chain is Harris recurrent if there exists a small set A ∈ B whose hitting time is almost surely finite. More precisely, if τ_A = min{t > 0 : X_t ∈ A}, then Pr_x(τ_A < ∞) = 1 for any x ∈ X. We say the chain is positive Harris recurrent if, in addition, sup_{x∈X} E_x(τ_A) < ∞.

46 remark. If the space X is countable, an irreducible Markov chain is Harris recurrent if there is at least one recurrent state, and positive Harris recurrent if there is at least one positive recurrent state. This is because any set containing a single state is small. Note that Harris recurrence alone does not guarantee the existence of a stationary probability measure, as in the countable case all states could be null recurrent.


47 theorem (Aperiodic ergodic theorem). Suppose a Markov chain (X_n)_{n≥0} is positive Harris recurrent and aperiodic. If µ is a stationary probability measure, then

‖K^n(x, ·) − µ‖_TV → 0   (21)

as n → ∞, for all x ∈ X.

48 remark. Convergence in total variation of K^n(x, ·) to µ implies weak convergence, or convergence in distribution.

Before we prove this theorem through a renewal argument, we need to introduce the concept of a splitting, due to Nummelin. The goal is to obtain a Markov chain in an augmented space, which projects down to the Markov chain with kernel K, and for which there is a non-null set from which the transition kernel is constant. To be precise, define the augmentation X* = X × {0, 1}, where we attach a variable equal to 0 or 1 to every state, and let B* be the σ-algebra generated by the sets {B × {0}, B × {1} : B ∈ B}. Using the notation B_i = B × {i} for i = 0, 1, define a mapping from any measure π on (X, B) to a measure π* on (X*, B*) by

π*(B_0) = π(B ∩ A^C) + π(B ∩ A)(1 − λ),
π*(B_1) = π(B ∩ A)λ.

We can define a new Markov kernel K̄ on this augmented space by

K̄(x, ·) = K(x, ·)*  for x ∈ X_0 \ A_0,
K̄(x, ·) = ([K(x, ·) − λν(·)]/(1 − λ))*  for x ∈ A_0,
K̄(x, ·) = ν(·)*  for x ∈ X_1.

We have used the minorisation condition to establish that the kernel in the second line above is a probability measure. Despite the cumbersome notation, it is not hard to see that if the initial state is in X_0, the Markov chain with transition kernel K̄ projects down to a Markov chain with transition kernel K when we ignore the variable 0 or 1 attached to each state. In addition, the kernel K̄(x, ·) from any x ∈ A_1 is constant and equal to ν*.

Proof of Theorem 47. Recall that the total variation distance satisfies

‖µ − ν‖_TV = inf_{X∼µ, Y∼ν} Pr(X ≠ Y).

Therefore, if X and Y are arbitrary random variables with distributions µ and ν, we can upper bound the total variation distance by Pr(X ≠ Y). We shall construct a coupling of two Markov chains, (X_i)_{i≥0} and (Y_i)_{i≥0}, with transition kernel K, letting X_0 = x and Y_0 ∼ µ. The variable X_n has distribution K^n(x, ·) and, as µ is stationary, Y_n has distribution µ; therefore,

‖K^n(x, ·) − µ‖_TV ≤ Pr(X_n ≠ Y_n).


Without loss of generality, assume that the constant m in the minorisation condition is 1. When m > 1, the same argument can be used to prove that K^{mn+i}(x, ·) → µ in total variation as n → ∞. This holds for i = 0, 1, . . . , m − 1, so the theorem follows.

We shall construct the coupling between two split Markov chains in the space (X*, B*). We initialise Y_0 ∼ µ*, and X_0 = (x, 0) if x ∈ A^C, or, if x ∈ A, we set X_0 = (x, Z) with Z ∼ Bernoulli(λ). The Markov chains evolve independently with kernel K̄ until the first time at which they arrive simultaneously at A_1, T_{A_1} = min{t > 0 : X_t ∈ A_1, Y_t ∈ A_1}. After that point, we can make the chains identical while preserving the distribution of each one, noting that the transition kernel at T_{A_1} is independent of X_{T_{A_1}} and Y_{T_{A_1}}. Therefore,

‖K^n(x, ·) − µ‖_TV ≤ Pr(X_n ≠ Y_n) ≤ Pr(T_{A_1} > n).

It remains to show that the right hand side converges to 0 as n → ∞. To do this, define the hitting times τ_{A_1}(t) = min{s > τ_{A_1}(t − 1) : X_s ∈ A_1} for t > 0, with τ_{A_1}(0) = 0. Note that the increments τ_{A_1}(t + 1) − τ_{A_1}(t) are identically distributed for t ≥ 1; call their distribution p. By definition, τ_{A_1}(t + 1) − τ_{A_1}(t) is the sum of a geometrically distributed number of hitting times for A_0 ∪ A_1, which we assume have finite expectation, so E(τ_{A_1}(t + 1) − τ_{A_1}(t)) < ∞ for all t.

Now, define a renewal process as the process taking values in {0, 1, 2, . . .}, equal at time t to the number of steps remaining until the next hitting time of A_1 in (X_t)_{t≥0}. Note that after the first step, this is simply a Markov chain with kernel Q,

Q(j, j − 1) = 1 for j ≥ 1,
Q(0, j) = p(j + 1) for j ≥ 0.

The hitting times of A_1 coincide with visits to 0 in the renewal process. Note that the renewal process has a stationary probability measure

π(j) = ∑_{i=j+1}^∞ p(i) / ∑_{i=1}^∞ i p(i),  j ≥ 0,

where we use the fact that E(τ_{A_1}(t + 1) − τ_{A_1}(t)) = ∑_{i=1}^∞ i p(i) < ∞.

We can define a bivariate renewal process where the first coordinate is driven by (X_t)_{t≥0} and the second is driven by (Y_t)_{t≥0}. The time T_{A_1} then coincides with the first visit to (0, 0) in the bivariate renewal process. Aperiodicity guarantees that the state (0, 0) can be reached from every state (i, j) in the bivariate renewal process. Finally, as (X_t)_{t≥0} and (Y_t)_{t≥0} are independent before T_{A_1}, the distribution π × π is stationary for the bivariate renewal process, so (0, 0) is positive recurrent and in particular Pr(T_{A_1} > n) → 0 as n → ∞.

49 remark. We did not explicitly define aperiodicity for general state spaces, but this condition (or the implication used in the proof) is easy to verify in all Markov chains commonly used for Markov chain Monte Carlo.


6.3 Geometric ergodicity

Quantifying how quickly the Markov chain approaches the stationary distribution is a more difficult problem. The standard approach involves a coupling argument like the one used in the proof of Theorem 47, with a more refined control on the return times to the small set. The following condition is a strong form of ergodicity satisfied by many Markov chain samplers,

‖K^n(x, ·) − µ‖_TV ≤ C(x)ρ(x)^n,   (22)

for µ-almost every x and functions C : X → R+, ρ : X → [0, 1). We say a Markov chain satisfying these conditions is geometrically ergodic.

Under ψ-irreducibility and aperiodicity, this is equivalent to the seemingly stronger inequality

‖K^n(x, ·) − µ(·)‖_V ≤ C V(x) ρ^n,   (23)

where C < ∞, V : X → [1, ∞), and ρ < 1 does not depend on x. The norm is defined by

‖ν‖_V = sup{ |∫ F(x)ν(dx)| : sup_{x∈X} |F(x)|/V(x) ≤ 1 }.

50 exercise. Prove that the inequality (23) implies the inequality (22).

In fact, the second formulation of geometric ergodicity is easier to verify. It is sufficient to construct a Lyapunov function V : X → [1, ∞) satisfying the following drift condition: there exist a small set A and constants 0 < γ < 1, 0 < b < ∞ such that

KV(x) ≤ γV(x) + b 1_A(x).

51 theorem. If A is a small set and V a Lyapunov function satisfying the drift condition, the Markov chain admits a unique stationary measure µ, and there exist constants C > 0 and ρ ∈ (0, 1) such that the inequality (23) holds.

The interested reader can refer to Hairer and Mattingly (2008) for a self-contained proof of this result, or to the relevant chapters in Markov Chains and Stochastic Stability by Meyn and Tweedie.

Finding a good Lyapunov function to establish the drift condition can be difficult. When X = R^d, good candidates are

• 1 + ‖x‖^p,
• 1 + e^{a‖x‖^p},
• 1 + (log(1 + ‖x‖))^p.
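As a worked illustration (my own, not from the notes), consider the AR(1) chain X_{t+1} = aX_t + ε_t with ε_t ∼ N(0, σ²) i.i.d. and |a| < 1, and take V(x) = 1 + x². Then

KV(x) = 1 + a²x² + σ² = a²V(x) + (1 − a² + σ²),

so for any γ ∈ (a², 1) we have KV(x) ≤ γV(x) whenever x² ≥ R² := (1 + σ² − γ)/(γ − a²). The drift condition therefore holds with A = {x : x² ≤ R²} and b = 1 + σ² + a²R²; A is small with m = 1, since the Gaussian transition density is continuous and positive.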


6.4 Reversibility

52 definition. We say a Markov chain is µ-reversible if for any pair of measurable sets A, B ∈ B,

Pr_µ(X_0 ∈ B, X_1 ∈ A) = ∫_B K(x, A) µ(dx) = ∫_A K(x, B) µ(dx) = Pr_µ(X_0 ∈ A, X_1 ∈ B).

If µ(·) and K(x, ·) have densities with respect to a base measure, for example the Lebesgue measure, µ(dx) = µ(x)dx, K(x, dy) = K(x, y)dy, then reversibility is equivalent to µ(x)K(x, y) = µ(y)K(y, x) for µ-almost all x, y ∈ X.

53 proposition. If a Markov chain is µ-reversible, then µ is a stationary distribution.

Proof. By reversibility,

∫_X K(x, A) µ(dx) = ∫_A K(x, X) µ(dx) = ∫_A µ(dx) = µ(A).

For a µ-reversible Markov chain, the linear operator K is self-adjoint in the Hilbert space L²(µ), with inner product ⟨f, g⟩_µ = ∫ f(x)g(x)µ(dx); indeed,

⟨f, Kg⟩_µ = ∫ f(x) ∫ g(y) K(x, dy) µ(dx) = ∫∫ f(x)g(y) K(y, dx) µ(dy) = ⟨Kf, g⟩_µ.

In addition, the operator is bounded, because by Jensen's inequality

‖Kf‖²_µ = ∫ [∫ f(y)K(x, dy)]² µ(dx) ≤ ∫∫ f(y)² K(x, dy) µ(dx) = ∫ f(y)² µ(dy) = ‖f‖²_µ,

with equality satisfied for the eigenfunction f(x) = 1. The spectral theorem for self-adjoint operators says that the eigenfunctions of K form an orthonormal basis for L²(µ); furthermore, the spectrum σ(K) is real.

Define the spectral radius

ρ = sup_{f∈L²(µ), f≠0, µ(f)=0} ‖Kf‖_µ/‖f‖_µ = sup{λ ∈ σ(K) : λ < 1}.

The condition ρ < 1 implies geometric ergodicity for a reversible Markov chain. Indeed, for any f ∈ L²(µ),

‖K^n f − µ(f)‖_µ = ‖K^n(f − µ(f))‖_µ ≤ ρ^n ‖f − µ(f)‖_µ,

and this can be shown to imply geometric ergodicity. Conversely, it can be shown that geometric ergodicity implies ρ < 1 for a reversible Markov chain.

54 theorem. Let (X_i)_{i≥0} be a reversible, geometrically ergodic Markov chain with stationary distribution µ and X_0 ∼ µ. The variance of an MCMC estimator satisfies, for any f ∈ L²(µ) and n ≥ 1,

(1/n) Var(∑_{i=1}^n f(X_i)) ≤ [1 + 2/(1 − ρ)] Var(f(X_0)).

55 remark. The constant in the bound measures the effectiveness of an MCMC estimator relative to a Monte Carlo estimator with X_i ∼ µ i.i.d.

Proof. We have

(1/n) Var(∑_{i=1}^n f(X_i)) = Var(f(X_0)) + (2/n) ∑_{i=1}^n ∑_{j=i+1}^n Cov(f(X_i), f(X_j)).

Denote by f̄(x) = f(x) − µ(f) the centred version of f, so that Cov(f(X_i), f(X_j)) = E[f̄(X_i) K^{j−i} f̄(X_i)]. By the Cauchy–Schwarz inequality, E[f̄(X_i) K^{j−i} f̄(X_i)] = ⟨f̄, K^{j−i} f̄⟩_µ ≤ ‖f̄‖_µ ‖K^{j−i} f̄‖_µ ≤ ρ^{j−i} ‖f̄‖²_µ = ρ^{j−i} Var(f(X_0)), so

(1/n) Var(∑_{i=1}^n f(X_i)) ≤ Var(f(X_0)) + (2/n) ∑_{i=1}^n ∑_{j=i+1}^n ρ^{j−i} Var(f(X_0)) ≤ Var(f(X_0)) + [2/(1 − ρ)] Var(f(X_0)).

7 Basic MCMC Algorithms

This section introduces two Markov chain samplers, the Metropolis–Hastings and Gibbs algorithms, which are the building blocks of most MCMC methods.

7.1 Metropolis–Hastings

We will construct a Markov chain with stationary distribution µ from a Markov kernel q. Given the state of the Markov chain at time t, X_t = x_t, we sample X_{t+1} as follows. Sample a proposal x′_{t+1} from q(x_t, ·). Then, with probability min{1, R(x_t, x′_{t+1})} set X_{t+1} = x′_{t+1}, and otherwise set X_{t+1} = x_t. Here,

R(x, y) = µ(dy)q(y, dx) / µ(dx)q(x, dy)

is known as the acceptance ratio. Formally, R(x, y) is the Radon–Nikodym derivative which satisfies

∫_A ∫_B h(x, y)R(x, y) µ(dx)q(x, dy) = ∫_A ∫_B h(x, y) µ(dy)q(y, dx)

for any two measurable sets A, B, and function h. In the case where µ and q(x, ·), x ∈ X, have densities with respect to a common measure Λ (say, the Lebesgue measure), R(x, y) takes on the simple form

R(x, y) = µ(y)q(y, x) / µ(x)q(x, y),

where with an abuse of notation we use the same symbols for measures and densities. An important property is that we need only know our target density µ up to a constant in order to implement the algorithm. In Bayesian applications, where µ is a posterior probability, this tends to be the case.
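A minimal random-walk Metropolis sketch (my own illustration; the target is an arbitrary unnormalised density and the proposal width is a tuning parameter):

import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_mu, x0, step, n):
    # Random-walk Metropolis: the symmetric proposal q(x, .) = N(x, step^2)
    # cancels in R, so the acceptance ratio reduces to mu(y)/mu(x).
    x, chain = x0, np.empty(n)
    for t in range(n):
        y = x + step * rng.normal()
        if np.log(rng.uniform()) < log_mu(y) - log_mu(x):   # accept w.p. min{1, R}
            x = y
        chain[t] = x
    return chain

# Target known only up to a constant: mu(x) proportional to exp(-x^4 / 4).
chain = metropolis_hastings(lambda x: -x**4 / 4, x0=0.0, step=1.0, n=50_000)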

56 proposition. The Metropolis–Hastings Markov chain is µ-reversible.

Proof. We need only check the detailed balance condition

∫_A µ(dx)K(x, B) = ∫_B µ(dx)K(x, A)   (24)

for A and B disjoint. Start from the left hand side,

∫_A µ(dx)K(x, B) = ∫_{x∈A} ∫_{y∈B} min{1, R(x, y)} q(x, dy)µ(dx)
= ∫_{x∈A} ∫_{y∈B} [R(x, y)1_{R(x,y)<1} + 1_{R(x,y)≥1}] q(x, dy)µ(dx)
= ∫_{x∈A} ∫_{y∈B} 1_{R(x,y)<1} q(y, dx)µ(dy) + ∫_{x∈A} ∫_{y∈B} 1_{R(x,y)≥1} q(x, dy)µ(dx).

As R(x, y) = R(y, x)^{−1}, we have 1_{R(x,y)<1} = 1_{R(y,x)≥1} a.e., and the right hand side is clearly symmetric in A and B. This proves Eq. 24.

Metropolis–Hastings maps any transition kernel q to a µ-reversible Markov kernel of the form

K_MH(x, dy) = min{1, R(x, y)} q(x, dy) + α_x δ_x(dy),

where

α_x = 1 − ∫_X min{1, R(x, y)} q(x, dy)

is the probability of staying at the current state x. The following theorem characterises the mapping from q to K_MH in geometric terms.

57 theorem (Billera and Diaconis). In a finite state space, the Metropolis–Hastings kernel satisfies

K_MH ∈ arg min_{K µ-reversible} ∑_{x≠y} µ(x)|K(x, y) − q(x, y)|,

and it is the unique minimiser which is coordinatewise smaller than q on the off-diagonal elements.

This result can be generalised to continuous state space Markov chains. It makes clear the sense in which Metropolis–Hastings aims to approximate the proposal kernel q. If q puts no weight on self-transitions, then Metropolis–Hastings will try to minimise the weights on the diagonal. This seems reasonable, as it would be wasteful to encourage the Markov chain to stay put. The following theorem motivates this principle.

If (X_n)_{n≥0} is an ergodic Markov chain with kernel K, the asymptotic variance of a function f is defined as

Var(f, K) = lim_{n→∞} (1/n) Var(∑_{i=1}^n f(X_i)).

If the samples were i.i.d., this would match the variance of f(X_1), so this quantity measures how quickly the Markov chain samples decorrelate. The smaller it is, the more effective the sampler.

58 theorem (Peskun). Suppose two µ-reversible Markov kernels satisfy the Peskun order K_1 ⪰ K_2, i.e. for any x ∈ X and A ∈ B with x ∉ A, K_1(x, A) ≥ K_2(x, A). Then, for any square integrable f,

Var(f, K_1) ≤ Var(f, K_2).

Hence, reducing the diagonal of the transition kernel cannot worsen the asymptotic variance. In practice, the convergence behaviour of a Metropolis–Hastings chain is determined by a careful choice of the proposal kernel q.

Setting q(x, dy) = µ(dy) would make the acceptance ratio 1, and the Markov chain an i.i.d. sequence. Of course, this is usually impossible, as our original goal was to sample µ. Instead, the distribution q(x, ·) aims to approximate µ, at least on a neighbourhood of x on which µ is relatively smooth. If the size of this neighbourhood is large, the algorithm proposes bigger steps, but we risk making proposals with low stationary probability and thus a small acceptance ratio. On the other hand, if the proposals tend to be too close to x, we accept them with high probability, but the Markov chain suffers from random-walk behaviour.

Usually, the proposal kernel has parameters that tune the average step size. These parameters can be adjusted to achieve an acceptance rate which yields the optimal asymptotic variance. This choice will be discussed later in the context of Hamiltonian Monte Carlo.

7.2 Gibbs sampling

Gibbs sampling is used to sample multivariate distributions. Say we want to sample a vector X = (X¹, . . . , X^p) with distribution µ. Given the state of the Markov chain at time t, X_t = x_t, a random scan Gibbs sampler generates the next state by

1. Sampling a random coordinate j ∈ {1, . . . , p} from some distribution π with π(i) > 0 for 1 ≤ i ≤ p.

2. Sampling x^j_{t+1} from the full conditional µ(dX^j | X^{−j} = x^{−j}_t), where X^{−j} represents all the coordinates of X except X^j.

3. Setting X^i_{t+1} = x^i_t for all i ≠ j, and X^j_{t+1} = x^j_{t+1}.
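A minimal sketch for a bivariate normal target with correlation r (my own illustration); both full conditionals are univariate normals:

import numpy as np

rng = np.random.default_rng(0)

def random_scan_gibbs(r, n):
    # Random scan Gibbs for (X^1, X^2) ~ N(0, [[1, r], [r, 1]]).
    # Full conditional: X^j | X^{-j} = z is N(r z, 1 - r^2).
    x, chain = np.zeros(2), np.empty((n, 2))
    for t in range(n):
        j = rng.integers(2)                                     # random coordinate
        x[j] = r * x[1 - j] + np.sqrt(1 - r**2) * rng.normal()  # full conditional
        chain[t] = x
    return chain

chain = random_scan_gibbs(r=0.95, n=50_000)   # high correlation: slow mixing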

59 proposition. The random scan Gibbs sampler is µ-reversible.

Proof. The random scan Gibbs sampler is a special case of Metropolis–Hastings with an acceptance ratio which is always 1, such that proposals are always accepted. Indeed, letting j be the coordinate updated at time t,

R(X_{t+1}, X_t) = π(j) µ(dX_{t+1}) µ(dX^j_t | X^{−j}_{t+1}) / [π(j) µ(dX_t) µ(dX^j_{t+1} | X^{−j}_t)]
= µ(dX^j_{t+1} | X^{−j}_{t+1}) µ(dX^{−j}_{t+1}) µ(dX^j_t | X^{−j}_{t+1}) / [µ(dX^j_t | X^{−j}_t) µ(dX^{−j}_t) µ(dX^j_{t+1} | X^{−j}_t)]
= µ(dX^j_{t+1} | X^{−j}_t) µ(dX^{−j}_t) µ(dX^j_t | X^{−j}_t) / [µ(dX^j_t | X^{−j}_t) µ(dX^{−j}_t) µ(dX^j_{t+1} | X^{−j}_t)]
= 1,

where in the second equality we use the definition of conditional probability, and in the third we use the fact that X^{−j}_t = X^{−j}_{t+1}.

A systematic scan Gibbs sampler, rather than choosing a random coordinate to modify at each step, iterates through all the coordinates in a pre-specified order. This forms an inhomogeneous Markov chain, which still has stationary distribution µ.


Gibbs sampling works well when the coordinates of X are nearly independent in µ. Indeed, when the coordinates are independent, a systematic scan Gibbs sampler only requires one iteration through all the coordinates in order to produce a sample from µ independent from the initial state of the Markov chain. In each iteration, the algorithm is sampling a coordinate from its marginal distribution. A random scan Gibbs sampler would have to wait until all the coordinates are resampled. The length of time this takes, T = inf{t > 0 : X^i_t ≠ X^i_0, 1 ≤ i ≤ p}, is studied by the coupon collector's problem. It is well known that

ET = O(p log p),  Var(T) ≤ (πp)²/6,

and Pr(T > βp log p) ≤ p^{1−β}.

On the other hand, when the coordinates are highly correlated, the Gibbs sampler will move across the space very slowly, because in each step the coordinate sampled is highly dependent on the others. For an extreme example of this, a Gibbs sampler for the uniform distribution on [0, 1/2]² ∪ [3/4, 1]² ⊂ R² is reducible, as we can never reach one square from the other.

Gibbs sampling is practical when the full conditionals of the target distribution µ are easy to sample. The following are three common scenarios:

1. The full conditional is in a standard exponential family. This occurs when conjugate priors are used.

2. The full conditional is log-concave. This can be verified using the fact that log-concave densities are closed under addition, marginalisation, and convolution, among other operations.

3. The full conditional can be approximated well locally by a distribution q. In this case, we can take a step of Metropolis–Hastings with proposal distribution q for the coordinate in question. If q is similar to the full conditional, this will approximate the Gibbs sampler well. This is known as Metropolis-within-Gibbs.

The first case merits a more detailed discussion. In hierarchical models, conjugate priors lead to an easy implementation of Gibbs sampling. Consider a representation of the model as a Bayes network. To sample the posterior of the parameters conditional on the observable nodes via Gibbs sampling, we only need to be able to sample the full conditional of each parameter. The Markov blanket of a parameter η often has the structure shown in Fig. 4. In this plate diagram, we understand that the variables enclosed in a rectangle are iterated over the range indicated on the corner, with no connections between the variables in different iterates. If the prior p(η | λ) is conjugate to the likelihood p(Y_i | η, τ) for the parameter η, then the full conditional of η will be in the same family as the prior.

The following examples illustrate this abstraction. In each case, a quick examination of the Bayes network reveals how to construct a Gibbs sampler.

[Figure 4: The Markov blanket of a node η in a Bayes network; η has parent λ and children Y_i, i ∈ [n], which also depend on τ.]

60 example (Mixed model). A mixed model is a regression method with responses Y ∈ R^n and two design matrices X ∈ R^{n×p} and Z ∈ R^{n×d}. We are interested in the relationship between the predictors X and the response, while the covariates in Z are typically categorical variables whose effect can be assumed to be drawn from a large population. For example, Y_i may be the weight of an animal subject exposed to a drug in a toxicity study, and x_i could be the dosage that the subject received or its food intake; on the other hand, z_i could contain variables which might affect the response but we don't need to estimate, for example, an indicator for the subject's litter or cage.

The model is specified by

Y = Xβ + Zu + ε,

where u ∼ N(0, Σ) and ε ∼ N(0, σ²I) are independent. The coefficients β are called fixed effects and the coefficients u are called random effects. As we only care about the fixed effects, these are usually estimated along with the covariance Σ, while u is treated as a nuisance parameter. In a Bayesian framework, everything is a random variable, so we can put priors on β and Σ and base our inferences on the posterior of β. Choosing conjugate priors,

β ∼ N(0, σ²_β I),
σ²_β ∼ inv-Gamma(τ_β, 1),
Σ ∼ inv-Wishart(I, s),
σ² ∼ inv-Gamma(τ, 1),

leads to a fast Gibbs sampler. A quick inspection of the Bayes network in Fig. 5 will reveal that the full conditionals of σ_β, β, u, and Σ are all in the same family as their prior.

[Figure 5: Bayes networks for three hierarchical models: a mixed model (top), a mixture model (middle), and a topic model (bottom).]

61 example (Mixture model). A mixture model is a form of density estimation, as well as a procedure for model-based clustering. We assume (Y_i)_{1≤i≤n} are i.i.d. observations from a mixture

∑_{t=1}^T p_t f(·; θ_t)

of densities f(·; θ_t) in an exponential family with parameters θ_t. Each component in the mixture is weighted by a probability p_t. It will be convenient to augment the space with indicator variables (Z_i)_{1≤i≤n} drawn i.i.d. from the discrete distribution p on the set {1, . . . , T}, such that Y_i | Z_i ∼ f(·; θ_{Z_i}). The Bayes network for this construction is shown in Fig. 5.

Put a prior on each θ_t which is conjugate to the likelihood f(·; θ), and let p ∼ Dirichlet(α/T, . . . , α/T). Inspecting the Bayes network reveals how to do Gibbs sampling. The full conditional of p is Dirichlet(α/T + n_1, . . . , α/T + n_T), where n_t = ∑_{i=1}^n 1(Z_i = t). The full conditional of each θ_t is in the same conjugate family as the prior, because conditioning on (Z_i)_{1≤i≤n} specifies which samples came from which component. Finally, the full conditional of Z_i is independent for i = 1, . . . , n and specified by

Pr(Z_i = t | θ, Y, p) = f(Y_i; θ_t)p_t / ∑_{ℓ=1}^T f(Y_i; θ_ℓ)p_ℓ.
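A compact sketch of this Gibbs sampler for a one-dimensional Gaussian mixture with known unit variances (my own illustration; the conjugate prior on each mean θ_t is N(0, 1)):

import numpy as np

rng = np.random.default_rng(0)

def mixture_gibbs(y, T, alpha=1.0, iters=2000):
    # Gibbs sampler for a T-component Gaussian mixture with unit variances,
    # N(0, 1) priors on the means and a Dirichlet(alpha/T, ...) prior on p.
    n = len(y)
    z = rng.integers(T, size=n)
    for _ in range(iters):
        counts = np.bincount(z, minlength=T)
        p = rng.dirichlet(alpha / T + counts)            # p | Z
        sums = np.bincount(z, weights=y, minlength=T)
        theta = (sums / (counts + 1)                     # theta_t | Z, Y is
                 + rng.normal(size=T) / np.sqrt(counts + 1))  # N(S_t/(n_t+1), 1/(n_t+1))
        # Z_i | theta, Y, p: discrete with probabilities proportional to p_t N(y_i; theta_t, 1)
        logw = np.log(p) - 0.5 * (y[:, None] - theta[None, :]) ** 2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        z = (w.cumsum(axis=1) > rng.uniform(size=(n, 1))).argmax(axis=1)
    return p, theta, z

y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
p, theta, z = mixture_gibbs(y, T=2)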

62 example (Topic model). A topic model is a kind of mixture model, with the following narrative. We have a collection of J documents, each of which can be considered a bag of words; the goal of the model is to infer underlying topics which explain the distribution of words across documents. Let Y_ij be the ith word in document j. A topic θ_t is a distribution over a word dictionary. The distribution of words in a document j will be a distinct mixture of the topics θ_1, . . . , θ_T with positive weights p_{j1}, . . . , p_{jT} adding up to 1.

We can represent this using variables (Z_ij)_{i≥1} sampled i.i.d. from the distribution p_j, for each document j, such that Y_ij | Z_ij has distribution θ_{Z_ij}. The Bayes network for this model is shown in Fig. 5. As in a regular mixture model, we assign a conjugate Dirichlet prior to each p_j for j = 1, . . . , J. Finally, we assign a Dirichlet prior for each θ_t, which is conjugate to the discrete likelihood of Y_ij | Z_ij = t. This model is known as Latent Dirichlet Allocation. It will be left as an exercise to check that the full conditional for every variable is easy to sample.

8 Auxiliary variables and data augmentation

In many examples, we want to sample a posterior µ(θ | Y) which is intractable, but it is possible to introduce additional variables Z in the probability space which leave the joint distribution of Y and θ invariant and make it easier to define efficient Gibbs samplers. In some cases, the latent variable Z has a natural interpretation, while in others it is merely a convenient construction. This section discusses several examples.

63 example (Swendsen–Wang algorithm). This is one of the earliest, and a very striking, application of auxiliary variables. The goal of this method is to simulate a ferromagnetic Ising model. This is a distribution over a spin system X = {X_i : i ∈ V}, with coordinates indexed by the vertices of a graph (V, E) taking values in {−1, 1}. The distribution is

µ(x) = (1/Z) exp(∑_{(i,j)∈E} J x_i x_j),

where J > 0 is a parameter known as the coupling constant, as it makes neighbouring spins likely to have the same sign. When J is large, the system has two stable regions: one in which most of the spins are +1, and one in which most are −1.

It is possible to define a Gibbs sampler which updates one entry of X at a time. However, this algorithm could mix very slowly when J is large, because the strong correlations between the coordinates make it hard to transition from one stable region to the other.

The Swendsen–Wang algorithm introduces a vector of auxiliary variables D = {D_e : e ∈ E}, indexed by the edges in the graph and taking values in {0, 1}. We define the joint distribution µ of X and D by letting X ∼ µ and defining

µ(d | x) = ∏_{e∈E} µ(d_e | x_{δe}),

where δe are the endpoints of e. In words, the auxiliary variables are independent given X, and each D_e depends only on the spins of the edge e's endpoints. We let

µ(d_{(i,j)} = 1 | x_i, x_j) = 0 if x_i ≠ x_j,  and  1 − e^{−2J} if x_i = x_j.

Clearly, the marginal distribution of X has µ(x) = µ(x). So, in order to sample X ∼ µ we can implement a Gibbs sampler which alternates sampling X | D and D | X. By construction, the second step is simple. What is surprising is that the distribution of X | D is also easy to sample.

64 lemma. Take any vector d, and define a graph (V, E′) with (i, j) ∈ E′ if d_{(i,j)} = 1. The distribution of X | D = d can be described as follows:

1. With probability 1, each connected component of (V, E′) has constant spin.

2. The spin of each component is equally likely to be +1 or −1, and the spins of the components are independent.

Proof. To prove 1, note that we can only have d_{(i,j)} = 1 if x_i = x_j. To prove 2, note that for any allowed configuration x, i.e. any configuration with constant spin on the connected components of (V, E′), the joint distribution can be written

µ(x | d) ∝ µ(x)µ(d | x) = (1/Z) ∏_{(i,j)∈E} exp(J x_i x_j + log(e^{−2J}) 1(d_{(i,j)} = 0, x_i = x_j) + log(1 − e^{−2J}) 1(d_{(i,j)} = 1, x_i = x_j)).   (25)

A factor for an edge (i, j) within a connected component is independent of the sign of the component, because an allowed configuration has x_i = x_j. A factor for an edge between two connected components always has d_{(i,j)} = 0, so when x_i = x_j it is equal to

e^{J×1 + (−2J)} = e^{−J},

and when x_i ≠ x_j, it equals

e^{J×(−1) + 0} = e^{−J}.

It is not hard to see that the Swendsen–Wang sampler is an irreducibleMarkov chain. Also, in the case when J is large, it can transition quickly be-tween the two stable regions. Suppose the initial state has all spins positive.Then, when we sample D | X, we will likely form a single large connectedcomponent. Sampling X | D would allow us to flip all the spins in one itera-tion with high probability.

8.1 Slice sampling

The goal is to sample a random variable X with density µ. This is usually a univariate or low-dimensional distribution, and it is often the full conditional required in a step of Gibbs sampling. The idea is to augment the space with a variable H, which has H | X ∼ Uniform(0, µ̄(X)), where µ̄ is equal to µ up to a constant.

As we saw earlier when discussing rejection sampling, this construction is equivalent to (X, H) being a uniform point under the graph of µ̄. Slice sampling is nothing but a Gibbs sampler in which we alternate sampling H | X and X | H. The first step is trivial, by construction. But the second step is also simple when X ∈ R^d: as (X, H) is uniform under the graph of µ̄, X | H will be uniform in the level set {x ∈ X : µ̄(x) > H}, known as the slice.

Sampling from the level set of µ̄ can often be done by rejection sampling. When X is one-dimensional and µ is unimodal, we can find an interval which contains the slice by starting with an arbitrary interval [a, b] and expanding it linearly until we go beyond the slice's boundaries. Then, we sample from the interval repeatedly until we obtain a sample that falls in the slice. Each time we have a rejection, we can shrink the interval. When µ is not unimodal, a similar algorithm still leaves the target distribution µ stationary.
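A minimal univariate sketch of the expand-and-shrink scheme just described (my own illustration, for a unimodal unnormalised density):

import numpy as np

rng = np.random.default_rng(0)

def slice_sampler(mu, x0, width, n):
    # Univariate slice sampling for an unnormalised unimodal density mu:
    # alternate H | X ~ Uniform(0, mu(X)) and X | H uniform on the slice.
    x, chain = x0, np.empty(n)
    for t in range(n):
        h = rng.uniform(0, mu(x))
        a = x - width * rng.uniform()   # random initial interval containing x
        b = a + width
        while mu(a) > h: a -= width     # expand left past the slice boundary
        while mu(b) > h: b += width     # expand right past the slice boundary
        while True:                     # sample the slice, shrinking on rejections
            y = rng.uniform(a, b)
            if mu(y) > h:
                x = y
                break
            if y < x: a = y
            else: b = y
        chain[t] = x
    return chain

chain = slice_sampler(lambda x: np.exp(-abs(x) ** 3), x0=0.0, width=1.0, n=20_000)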


Slice sampling can be a good replacement for a Metropolis–Hastings proposal in Metropolis-within-Gibbs, as it reduces random walk behaviour.

8.2 Hit and run algorithm

The Hit and Run algorithm aims to sample a random variable X with density µ with respect to the Lebesgue measure on R^d. We introduce an auxiliary variable H which is a random line passing through X. To sample H | X, we draw a uniform point x′ on the unit sphere centred at X, and let H be the line passing through X and x′, {X + (X − x′)t : t ∈ R}. The Hit and Run algorithm is just a Gibbs sampler for the pair (X, H).

The main insight is that the regular conditional probability distribution of X | H has a density with respect to the Lebesgue measure on the line H. Letting h = {vt + a : t ∈ R}, the conditional density is

µ(vt + a | h) = µ(vt + a) / ∫_{−∞}^∞ µ(vs + a) ds  for t ∈ R.

We call this the restriction of µ to the line. The Gibbs sampler will be practical when this univariate distribution is easy to sample, for example, when µ is log-concave.

It is possible to generalise this idea by letting H be a random subset of the space containing X, but not necessarily a line. Indeed, we can frame most of the algorithms discussed thus far and many others in this way (see Hit and Run as a Unifying Framework, Diaconis and Andersen, 2007).

In a remarkable paper, Lovász and Vempala characterise the mixing time of a Hit and Run sampler when the density µ in R^d is log-concave. The mixing time is defined as the number of iterations required for the chain to converge to a given distance, in total variation, to the stationary distribution:

τ_ε = min{t ≥ 0 : ‖L(X_t) − µ‖_TV < ε}.

They show that if the initial state is “warm”, meaning that it is not too concentrated in any given area, the Markov chain mixes in τ_ε = O(d^{3+δ}) steps for any δ > 0; i.e. the complexity scales roughly as the cube of the dimension. This is considered a good algorithm for unimodal distributions in moderate dimensions.
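A minimal sketch for the uniform density on the unit ball in R^d (my own illustration), where the restriction of µ to a line is uniform on a chord whose endpoints solve a quadratic:

import numpy as np

rng = np.random.default_rng(0)

def hit_and_run_ball(d, n):
    # Hit and Run targeting the uniform distribution on the unit ball in R^d.
    # X | H is uniform on the chord H ∩ ball, found from ||x + t v|| = 1.
    x, chain = np.zeros(d), np.empty((n, d))
    for i in range(n):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)                 # uniform direction on the sphere
        # Solve t^2 + 2 (x.v) t + (||x||^2 - 1) = 0 for the chord endpoints.
        b, c = x @ v, x @ x - 1.0
        disc = np.sqrt(b * b - c)
        t = rng.uniform(-b - disc, -b + disc)  # uniform point on the chord
        x = x + t * v
        chain[i] = x
    return chain

chain = hit_and_run_ball(d=5, n=10_000)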

8.3 Data augmentation

Including latent variables from a hierarchical model in a Gibbs sampler is sometimes called data augmentation. The usual structure is presented in the Bayes network below, where Y are the observables, θ is the parameter of interest, and Z are the latent variables.

[Bayes network: θ → Z → Y.]

In many cases, although not always, the latent variable has a scientific interpretation, so it can be thought of as “augmenting” the observables. In general, sampling the joint posterior µ(θ, Z | Y) will be easier than sampling µ(θ | Y) directly. When the conditional independence Y ⊥ θ | Z implied by the Bayes network holds, the latent variable is called a sufficient augmentation, or centred parametrisation. There may be many possible augmentations, and the choice can have a significant effect on the convergence of the Gibbs sampler.

65 definition. We say the random variable Z* is a reparametrisation of Z whenever Z = h(Z*, θ) for some deterministic function h. An ancillary augmentation or non-centred parametrisation is one which has Z* ⊥ θ a priori.

66 example. Consider the random effects model

Y_i ∼ N(Z_i^⊤ u_i, σ²),
Z_i ∼ N(θ, Σ),

i.i.d. for i = 1, 2, . . . , n. The parameter of interest is the population mean θ, and as Y_i is independent of θ given Z_i, the variables Z are a sufficient augmentation. We can define an ancillary augmentation through Z*_i = Z_i − θ. As Z*_i ∼ N(0, Σ), the distribution of Z*_i is free of θ; i.e. this is an ancillary augmentation for θ and, in Bayesian terms, θ and Z*_i are independent a priori.

67 remark. Defining ancillary augmentations is not always easy, but in location-scale families there is a standard technique.

• We say θ is a location parameter if Z_i is equal in distribution to X + θ for some random variable X whose distribution is independent of θ. Then the transformation Z*_i = Z_i − θ is an ancillary augmentation.

• We say θ is a scale parameter if Z_i is equal in distribution to θX + a for some random variable X whose distribution is free of θ and a constant a. Then Z*_i = (Z_i − Z̄)/θ, where Z̄ = n^{−1} ∑_{i=1}^n Z_i, defines an ancillary augmentation. If a is known, we can also use Z*_i = (Z_i − a)/θ.

For any augmentation Z*, we can define a Gibbs sampler alternating

1. Z* | θ, Y
2. θ | Y, Z*.

The posterior dependence between Z* and θ will determine the rate of convergence of the Markov chain. Clearly, if Z* and θ are nearly independent a posteriori, each step approximately samples the marginal posteriors Z* | Y and θ | Y.

Sufficient and ancillary augmentations have a nice complementarity: when the Gibbs sampler for the sufficient augmentation works well, it tends not to for the ancillary augmentation, and vice versa. This would suggest a strategy of alternating steps from the Gibbs sampler of each chain. To be precise, consider the augmented space (θ, Y, Z, Z∗), where Z is sufficient and Z∗ is ancillary; then we could iterate sampling

1. Z∗ | θ, Y


2. θ | Y, Z∗.

3. Z | θ, Y

4. θ | Y, Z.

Surprisingly, it is usually much better to take a different strategy. The ancillarity sufficiency interweaving strategy, or ASIS, iterates the steps

1. Z∗ | θ, Y

2. θ | Y, Z∗.

3. Z | θ, Y, Z∗

4. θ | Y, Z.

The only difference is in the third step, where we condition on Z∗. If Z∗ is a reparametrisation of Z, with Z = h(Z∗, θ), this step is a deterministic mapping.
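Continuing the sketch above (same toy model and assumed names), an ASIS step interweaves the two updates; step 3 is the deterministic map Z = h(Z∗, θ) = Z∗ + θ.

    def asis_step(theta):
        # Steps 1-2: sample Z* | theta, Y, then theta | Y, Z*.
        prec = 1 / sigma2 + 1 / tau2
        Zs = rng.normal(((Y - theta) / sigma2) / prec, np.sqrt(1 / prec))
        theta = rng.normal((Y - Zs).mean(), np.sqrt(sigma2 / n))
        # Step 3: Z | theta, Y, Z* is deterministic, Z = Z* + theta.
        Z = Zs + theta
        # Step 4: sample theta | Y, Z in the sufficient parametrisation.
        return rng.normal(Z.mean(), np.sqrt(tau2 / n))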

68 theorem (Yu and Meng). Consider two data augmentations Z1 and Z2. Suppose the Gibbs samplers using each augmentation are geometrically ergodic with rates γ1 and γ2, respectively; namely, the total variation distance of each chain to the stationary distribution after n steps is bounded by a sequence which is O(γ1^n) or O(γ2^n), respectively. Then, the ASIS chain interweaving the steps of the two samplers is geometrically ergodic with constant

γ_ASIS ≤ √(γ1 γ2) R_{1,2},

where

R_{1,2} = sup_{f, g ∈ L²(µ)} Corr(f(Z1), g(Z2) | Y)

is the maximal correlation between the two augmentations in the posterior.

This theorem tells us that ASIS cannot be worse than the worst of the two original Gibbs samplers: since R_{1,2} ≤ 1, we have γ_ASIS ≤ √(γ1 γ2) ≤ max{γ1, γ2}. On the other hand, remarkably, the convergence of the ASIS chain can be much faster if Z1 and Z2 are weakly correlated in the posterior. When Z1 and Z2 are sufficient and ancillary, this turns out to be the case.

To understand this last point, write the posterior density of an ancillary-sufficient pair (Z∗, Z):

µ(Z, Z∗ | Y) ∝ µ(Y | Z, Z∗) µ(Z, Z∗)
             = µ(Y | Z, θ) µ(Z∗, θ) J(Z, Z∗);

in the last equality we apply the change of variables from (θ, Z∗) to (Z, Z∗), assuming that they are related by a deterministic bijection M : (θ, Z∗) ↦ (Z, Z∗). The factor J(Z, Z∗) is the magnitude of the Jacobian determinant of M.


Then we can simplify the first two factors using the ancillarity and sufficiency assumptions:

µ(Z, Z∗ | Y) ∝ µ(Y | Z) µ(Z∗) µ(θ) J(Z, Z∗).  (26)

On the right hand side, θ is a function of Z and Z∗, so in order for these variables to be independent in the posterior, we need the last two factors, µ(θ) J(Z, Z∗), to split into factors depending on Z and factors depending on Z∗.

In Example 66, the mapping (Z, Z∗) ↦ (θ, Z∗) = (Z − Z∗, Z∗) has a Jacobian determinant of magnitude 1. Therefore, choosing the improper prior µ(θ) = 1 makes Z and Z∗ independent in the distribution (26). Theorem 68 then implies the ASIS sampler is geometrically ergodic with rate γ_ASIS = 0, as the maximal correlation is 0; in other words, the sampler yields an independent sample from the posterior in a finite number of iterations.

This is quite remarkable and only happens in a few examples. If θ is a positive scale parameter, then the change of variables factor for the transformation (Z, Z∗) ↦ (θ, Z∗) = ((Z − a)/Z∗, Z∗) has

J(Z, Z∗)⁻¹ = | det [ 1/Z∗   (a − Z)/Z∗² ;  0   1 ] | = | 1/Z∗ |;

therefore, choosing the improper prior µ(θ) = θ⁻¹ (the flat prior in the log scale) makes the variables Z and Z∗ independent in (26). In this instance, ASIS also produces independent samples from the posterior.

An astute reader might consider whether we can always choose a prior to be proportional to the inverse of J(Z, Z∗). This may not always lead to a proper posterior. Conversely, we might try choosing the transformation M such that J(Z, Z∗) is the inverse of the prior; however, this might not preserve the ancillarity of Z∗. In any case, it often happens that the last two factors in the distribution (26) are nearly constant, such that the augmentations are approximately independent in the posterior.

The following example gives a more realistic illustration of ASIS. For the full details and numerical results, see the paper To centre or not to centre: that is not the question by Yu and Meng.

69 example (Poisson time series). The data are a sequence of photon counts (Y_t)_{t∈T} measured in a telescope at a range of time points T. We will model the data by Poisson regression

Y_t ∼ Poisson(d_t exp(x_t^⊤ β + ξ_t)), independent for t ∈ T.

The independent variables are x_t = (1, t/T), and the coefficient β1 measures an exponential trend of the Poisson intensity in time. We assume d_t are fixed offsets, e.g. the length of the measurement window at time t. The term ξ_t is a random effect which is assumed to have correlations in time; this is modelled


using a stationary autoregressive process

ξ_t | ξ_{t−1} ∼ N(ρ ξ_{t−1}, δ²) for t > 1,
ξ_1 ∼ N(0, δ²/(1 − ρ²)).

The parameters of interest are β, ρ, and δ, and we choose a flat prior µ(β, ρ, δ) = 1. The standard structure of a Gibbs sampler iterates sampling

1. ξ | β, ρ, δ, Y

2. β | ξ, ρ, δ, Y

3. ρ, δ | ξ

These full conditionals are not simple, so we may require Metropolis–Hastings proposals. For example, in Step 2 we can use any routine to sample the coefficients in Poisson regression. Sampling the autoregressive process in Step 1 is done one time point at a time. However, we will develop the ASIS strategy with the overall structure above in mind, even if this is not strictly a Gibbs sampler.

The process ξ = (ξ_t)_{t∈T} is the augmentation used in the sampler above. This variable is ancillary for β, as the priors of β and ξ are independent. On the other hand, ξ is sufficient for (δ, ρ); clearly, Y_t is independent of (δ, ρ) given ξ.

In order to design an ASIS sampler, we then require a sufficient augmentation for β and an ancillary augmentation for ξ. The variables

η_t = ξ_t + x_t^⊤ β,  t ∈ T,

are sufficient for β. The following variables form an ancillary augmentation for ρ and δ:

κ_1 = √(1 − ρ²) ξ_1 / δ,   κ_t = (ξ_t − ρ ξ_{t−1}) / δ for t > 1.

The variables κ_t are N(0, 1), and are a priori independent of δ and ρ.

An ASIS strategy interweaves sampling ξ, η, and κ with the parameters β, δ, and ρ. Various strategies are compared in the cited paper; they are several orders of magnitude more efficient than the original Gibbs sampler. This is despite the fact that the sampler is not a strict Gibbs sampler and, especially, that the vectors ξ, η, and κ are sampled one element at a time.
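The interweaving steps rely on the deterministic maps between ξ and the two new augmentations. Below is a minimal sketch of these maps; the function names and array conventions are illustrative assumptions.

    import numpy as np

    def xi_to_eta(xi, x, beta):
        # Sufficient augmentation for beta: eta_t = xi_t + x_t' beta.
        return xi + x @ beta

    def xi_to_kappa(xi, rho, delta):
        # Ancillary augmentation for (rho, delta): the kappa_t are N(0, 1) a priori.
        kappa = np.empty_like(xi)
        kappa[0] = np.sqrt(1 - rho**2) * xi[0] / delta
        kappa[1:] = (xi[1:] - rho * xi[:-1]) / delta
        return kappa

    def kappa_to_xi(kappa, rho, delta):
        # Inverse map, used after updating (rho, delta) with kappa held fixed.
        xi = np.empty_like(kappa)
        xi[0] = delta * kappa[0] / np.sqrt(1 - rho**2)
        for t in range(1, len(kappa)):
            xi[t] = rho * xi[t - 1] + delta * kappa[t]
        return xi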

9 Expectation Maximisation and Variational Inference

Expectation maximisation (EM) has the goal of maximising the posterior µ(θ | Y), to find the maximum a posteriori or MAP estimate

θ_MAP = arg max_{θ∈Θ} µ(θ | Y).

Note that when the prior is constant, µ(θ) = 1, the posterior is proportional to the likelihood µ(Y | θ) and the MAP estimate is simply the MLE for θ.


EM is used in models with latent variables, where the likelihood µ(Y | θ) is intractable; generally, it can be written as an integral

µ(Y | θ) = ∫ µ(Y, Z | θ) dZ

over a latent variable Z which may be high-dimensional or combinatorial. The algorithm iterates the following two steps:

1. E-step: Compute Q(θ | θ(t)) = E_{Z|θ(t),Y} log µ(θ, Z, Y).

2. M-step: Set θ(t+1) = arg max_θ Q(θ | θ(t)).

In the E-step, we take θ to be fixed and compute the expectation with respect to the latent variable Z with distribution µ(Z | θ(t), Y). It can be shown that each iteration of this algorithm can only improve the log posterior; therefore, the algorithm will converge to a local maximum of the posterior.

70 lemma. log µ(θ(t+1) | Y) ≥ log µ(θ(t) | Y).

Proof. By Bayes rule

log µ(θ, Z, Y) = log µ(Y) + log µ(θ | Y) + log µ(Z | Y, θ).

Taking the expectation of both sides with respect to Z | θ(t), Y,

Q(θ | θ(t)) = log µ(Y) + log µ(θ | Y) + EZ|θ(t),Y log µ(Z | Y, θ).

As this is valid for any θ, take the difference of both sides evaluated at θ = θ(t+1) and θ = θ(t):

Q(θ(t+1) | θ(t)) − Q(θ(t) | θ(t)) = log µ(θ(t+1) | Y) − log µ(θ(t) | Y)
  + E_{Z|θ(t),Y} log µ(Z | Y, θ(t+1)) − E_{Z|θ(t),Y} log µ(Z | Y, θ(t)).

The left hand side is non-negative by definition, as θ(t+1) is chosen to maximise Q(· | θ(t)) in the M-step. On the other hand, the last two terms on the right hand side are

E_{Z|θ(t),Y} log [ µ(Z | Y, θ(t+1)) / µ(Z | Y, θ(t)) ] ≤ log E_{Z|θ(t),Y} [ µ(Z | Y, θ(t+1)) / µ(Z | Y, θ(t)) ] = log 1 = 0

by Jensen’s inequality. Therefore log µ(θ(t+1) | Y) − log µ(θ(t) | Y) ≥ 0.

The EM algorithm is easy to implement in hierarchical models with conjugate exponential family conditionals, as the E-step can be solved in closed form. For example, suppose the distribution of Y is in an exponential family

µ(Y | Z) = exp(T(Y)^⊤ Z − K(Z))


with natural parameter Z, and the distribution of Z is in an exponential family

µ(Z | θ) = exp(Z^⊤ θ1 − θ2 K(Z) − K(θ))

with natural parameter θ^⊤ = (θ1^⊤, θ2). Then, the E-step for maximum likelihood estimation (flat prior µ(θ) = 1) can be solved analytically:

Q(θ | θ(t)) = E_{Z|θ(t),Y}[log µ(Z | θ) + log µ(Y | Z)]
 = E_{Z|θ(t),Y}[Z^⊤ θ1 − θ2 K(Z) − K(θ) + T(Y)^⊤ Z − K(Z)]
 = (θ1 + T(Y))^⊤ E_{Z|θ(t),Y}[Z] − (θ2 + 1) E_{Z|θ(t),Y}[K(Z)] − K(θ).

Usually the moments in this expression will have a closed form in terms of θ(t), and it will be possible to maximise it with respect to θ.

71 remark. When the M-step is not easy to implement, it can be replaced by a gradient ascent step,

θ(t+1) = θ(t) + δ_t ∇_θ Q(θ | θ(t)),

for some step size schedule (δ_t)_{t≥0}. This is known as gradient EM.

72 example (Mixture of normals). Consider the mixture model defined in Example 61, and let

Y_i ∼ N(µ_{Z_i}, Σ_{Z_i}).

Here the parameters are θ = (p_j, µ_j, Σ_j)_{1≤j≤k}; for simplicity, we take the prior on the parameters to be constant. The E-step leads to the following objective:

Q(θ | θ(t)) = E_{Z|θ(t),Y} [ ∑_{i=1}^n ( log p_{Z_i} − log 2π − (1/2) log det Σ_{Z_i} − (1/2)(y_i − µ_{Z_i})^⊤ Σ_{Z_i}⁻¹ (y_i − µ_{Z_i}) ) ]
 = ∑_{i=1}^n E_{Z_i|θ(t),Y_i} [ log p_{Z_i} − log 2π − (1/2) log det Σ_{Z_i} − (1/2)(y_i − µ_{Z_i})^⊤ Σ_{Z_i}⁻¹ (y_i − µ_{Z_i}) ].

The distribution of Z_i conditional on θ(t) and Y_i is known:

Pr(Z_i = j | Y_i, θ(t)) = p_j^{(t)} f(Y_i; µ_j^{(t)}, Σ_j^{(t)}) / ∑_{ℓ=1}^k p_ℓ^{(t)} f(Y_i; µ_ℓ^{(t)}, Σ_ℓ^{(t)}) =: w_ij.


So we can write

Q(θ | θ(t)) = ∑_{i=1}^n ∑_{j=1}^k w_ij [ log p_j − log 2π − (1/2) log det Σ_j − (1/2)(y_i − µ_j)^⊤ Σ_j⁻¹ (y_i − µ_j) ].

Now, in the M-step, we maximise this expression with respect to the parameters θ = (p, µ, Σ); the weights w_ij only depend on θ(t), so they can be considered fixed. The maximisation with respect to p subject to the constraint ∑_{j=1}^k p_j = 1 can be done by defining the Lagrangian

L(p, λ) = ∑_{j=1}^k w_j log p_j + λ( ∑_j p_j − 1 ),

where w_j = ∑_{i=1}^n w_ij. Setting ∇_p L = (∂/∂λ)L = 0 leads to

p_j^{(t+1)} = w_j / n for j = 1, . . . , k.

Maximising Q(θ | θ(t)) with respect to µ_j can be done by setting the gradient to 0:

∇_{µ_j} Q(θ | θ(t)) = ∑_{i=1}^n w_ij Σ_j⁻¹ (y_i − µ_j) = 0,

which leads to the update µ_j^{(t+1)} = w_j⁻¹ ∑_{i=1}^n w_ij y_i. Finally, we maximise with respect to Σ_j by taking the gradient (brushing up on matrix calculus is useful here):

∇_{Σ_j} Q(θ | θ(t)) = −(w_j/2) Σ_j⁻¹ + (1/2) Σ_j⁻¹ [ ∑_{i=1}^n w_ij (y_i − µ_j)(y_i − µ_j)^⊤ ] Σ_j⁻¹,

which is equal to 0 at

Σ_j^{(t+1)} = w_j⁻¹ ∑_{i=1}^n w_ij (y_i − µ_j)(y_i − µ_j)^⊤.
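The updates above assemble into a complete EM iteration. The following is a minimal sketch for this mixture model; the function name, array shapes, and the use of scipy’s normal density for f are illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(y, p, mu, Sigma):
        # y: (n, d) data; p: (k,) weights; mu: (k, d) means; Sigma: (k, d, d).
        n, k = y.shape[0], p.shape[0]
        # E-step: responsibilities w_ij = Pr(Z_i = j | Y_i, theta^(t)).
        w = np.column_stack([p[j] * multivariate_normal.pdf(y, mu[j], Sigma[j])
                             for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: the closed-form maximisers derived above.
        wj = w.sum(axis=0)                      # w_j = sum_i w_ij
        p_new = wj / n
        mu_new = (w.T @ y) / wj[:, None]
        Sigma_new = np.empty_like(Sigma)
        for j in range(k):
            r = y - mu_new[j]
            Sigma_new[j] = (w[:, j, None] * r).T @ r / wj[j]
        return p_new, mu_new, Sigma_new

Iterating em_step until the parameters stabilise gives a local maximiser of the flat-prior posterior, i.e. the likelihood.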

9.1 Variational Inference

Variational inference frames posterior inference as an optimisation problem. The goal is to approximate the posterior distribution µ(θ | Y) inside a family of distributions Q, in terms of KL divergence:

q∗ = arg min_{q∈Q} KL(q ‖ µ(· | Y))
  = arg min_{q∈Q} ∫ q(θ) log [ q(θ) / µ(θ | Y) ] dθ
  = arg min_{q∈Q} ∫ q(θ) log [ q(θ) / µ(θ, Y) ] dθ + log µ(Y).

As a consequence of Jensen’s inequality, the KL divergence is always non-negative, and it is equal to 0 when q is equal to the posterior q-almost everywhere. As the posterior µ(θ | Y) is usually only known up to a constant, we will find it useful to write the problem as in the final line. Noting that the second term does not depend on q, it is easy to see that the variational problem is equivalent to maximising the quantity

L(q) = ∫ q(θ) log µ(θ, Y) dθ − ∫ q(θ) log q(θ) dθ
     = ∫ q(θ) log µ(θ, Y) dθ + H(q),

where we’ve defined the entropy H(q) of q. The function L(q) is known as the evidence lower bound or ELBO, because the non-negativity of the KL divergence implies

L(q) ≤ log µ(Y),

and the term on the right hand side is often called the weight of the evidence; this quantity is used, for example, in computing Bayes factors for model comparison.

If θ = (θ1, . . . , θp) is a vector of parameters, the mean field approximation constrains the optimisation to the set of distributions in which each coordinate of θ is independent,

Q = { ∏_{i=1}^p q_i(θ_i) : q_i ∈ Q_i for i = 1, . . . , p }.

Fix all the marginals q2, . . . , qp, and consider the optimisation of the ELBO with respect to the marginal distribution q1 of θ1. The ELBO can be written

L(q) = E_q[log µ(θ, Y)] + H(q)
     = E_{q1}[ E_q(log µ(θ, Y) | θ1) ] + H(q1) + ∑_{i=2}^p H(q_i).

Now, defining the distribution q1∗(θ1) = A⁻¹ exp(E_q(log µ(θ, Y) | θ1)), for some normalising constant A, lets us express the ELBO as

L(q) = −KL(q1 ‖ q1∗) + C(q2, . . . , qp),

where C(q2, . . . , qp) collects terms which do not depend on the distribution q1. As the KL divergence is minimised at q1 = q1∗, the maximum of the ELBO with respect to q1 is achieved at q1∗.


This observation leads to the Coordinate Ascent Variational Inference (CAVI) algorithm, in which we iteratively optimise the ELBO with respect to the marginal of each variable, keeping all the others fixed. In order for this algorithm to be practical, the conditional expectations

E_q[log µ(θ, Y) | θ_i]

must be simple. This is the case in hierarchical models with conjugate exponential family conditionals. When the terms in log µ(θ, Y) which depend on θ_i are of the form T(θ_i)^⊤ η(θ_{−i}, Y), then the variational approximation of the marginal q_i will be in an exponential family with sufficient statistic T(θ_i).

It is possible to establish a relationship between variational inference and EM. Consider an EM problem with latent variables Z. We can approximate the posterior of the pair (Z, θ) through mean field variational inference. In addition to the mean field assumption, restrict the marginal distribution q_Z to be in the family {µ(Z | Y, θ′) : θ′ ∈ Θ}. Then, fixing q_Z = µ(· | Y, θ′), the variational problem for the marginal q_θ becomes

maximise over q_θ:  E_{q_θ}[ E_{Z|θ′,Y} log µ(θ, Z, Y) ] + H(q_θ),

or equivalently,

maximise over q_θ:  E_{q_θ}[ Q(θ | θ′) ] + H(q_θ).

This is similar to the M-step in expectation maximisation, but rather than finding a point estimate for θ we are optimising over distributions. Optimising the first term alone would lead to a point mass at the maximiser of Q(· | θ′). The entropy term encourages the distribution q_θ to be spread out.

9.2 Stochastic Variational Inference

In many models it is not possible to perform coordinate ascent for mean field variational inference. One might also be interested in using more complex variational approximations with dependence between the coordinates of θ. We shall consider variational approximation in a parametric family {q_β : β ∈ R^d}, where q_β(θ) is differentiable with respect to β. Writing H(β) for the entropy of q_β, the ELBO gradient is

∇_β L(q_β) = ∇_β E_{q_β}[log µ(θ, Y) − log q_β(θ)]
           = ∇_β E_{q_β}[log µ(θ, Y)] + ∇_β H(β).

Computing the ELBO gradient can be difficult, so it might be necessary to resort to stochastic optimisation algorithms, which employ estimates of the gradient. The structure of the algorithm is iterative, and each iteration performs an update of the variational parameters β:

β(t+1) = β(t) + γ_t ∇̂_β L(q_{β(t)}),

where ∇̂_β L(q_{β(t)}) is a typically unbiased estimator of the gradient, and (γ_t)_{t≥0}


is a step-size schedule in which γ_t may or may not depend on previous iterates β(0), β(1), . . . , β(t). The following theorem defines one of the classical algorithms for stochastic optimisation and gives conditions for its convergence to a local maximum of the objective.

73 theorem (Robbins–Monro algorithm). Let F : R^d → R be a Lipschitz continuous, differentiable optimisation objective, and define a step size schedule (γ_t)_{t≥0} satisfying

∑_{t=1}^∞ γ_t = ∞,  ∑_{t=1}^∞ γ_t² < ∞,

for example, γ_t = 1/t. The Robbins–Monro update is defined by

X_{t+1} = X_t + γ_t [∇F(X_t) + D_t],  (27)

where we assume

1. D_t is a stochastic process adapted to the filtration F_t = σ(X_0, D_1, . . . , D_t), with mean E[D_t | F_{t−1}] = 0, and E[‖D_t‖² | F_{t−1}] ≤ K(1 + ‖X_{t−1}‖) a.s. for some K < ∞.

2. lim sup_t ‖X_t‖ < ∞ a.s.

Then, as t → ∞, X_t converges a.s. to a stationary point of F, if one exists.

74 remark. This formulation of the algorithm does not require the gradient estimates to be independent. Indeed, the errors D_t can be dependent, but they must be a martingale increment sequence.

75 remark. The choice of the step size schedule has an important effect on the speed of convergence. There are several variants of this algorithm which adapt the step size using a moving average of the magnitude of the stochastic gradient, thus taking larger steps when the gradient is small and smaller steps when the gradient is large. Adagrad, Adam, and RMSprop are among the most popular and enjoy theoretical guarantees.
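As a concrete illustration, here is a minimal sketch of the Robbins–Monro iteration (27) on an assumed one-dimensional objective F(x) = −x²/2, whose gradient −x is observed with additive mean-zero noise; the target and noise model are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    x = 5.0
    for t in range(1, 10_000):
        noisy_grad = -x + rng.normal()   # grad F(x) plus a martingale increment D_t
        x += (1.0 / t) * noisy_grad      # gamma_t = 1/t satisfies both conditions
    # x ends close to the stationary point 0 of F.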

The proof of Theorem 73 applies martingale theory to show that the process (X_t)_{t≥0} converges to a flow of the differential equation ẋ = ∇F(x). We will prove a one-dimensional version of this theorem which serves to illustrate the probabilistic ideas in the proof.

76 theorem. Let F : R → R be differentiable and K-Lipschitz, with a unique maximum at 0, and inf_{|x|>b} |∇F(x)| > 0 for all b > 0. Let (X_t)_{t≥1} be defined by the iteration in Eq. 27, where (D_t)_{t≥0} satisfies E(D_t | F_{t−1}) = 0 and E(D_t² | F_{t−1}) < c < ∞. Then X_t → 0 a.s.

Proof. We will prove that for any a > 0, (X_t)_{t≥0} can only visit (−∞, −a) ∪ (a, ∞) finitely often a.s. As we can cover R \ {0} with countably many such sets, this implies X_t → 0 a.s.


Define the martingale M_t = ∑_{i=1}^{t−1} γ_i D_i, and write X_t = X_1 + M_t + ∑_{i=1}^{t−1} γ_i ∇F(X_i). The process M_t has

E M_t² = E[ ∑_{i=1}^{t−1} γ_i² E(D_i² | F_{i−1}) ] < c ∑_{i=1}^{t−1} γ_i² < c ∑_{i=1}^∞ γ_i² < ∞.

So, Doob’s martingale convergence theorem implies M_t → M_∞ a.s. Note that whenever X_t > 0, ∇F(X_t) < 0, and whenever X_t < 0, ∇F(X_t) > 0. Furthermore, sup_x |∇F(x)| ≤ K. With probability 1, there is an N < ∞ such that |M_t − M_∞| < a/4 and |γ_t ∇F(X_t)| < a/4 for all t > N. So if X_t ∈ [−a/2, a/2] for some t > N, the process never exits the interval [−a, a] thereafter. Finally, as ∑_{i=1}^t γ_i diverges and |∇F(x)| is bounded below outside [−a/2, a/2], every time X_t is outside [−a, a], the process eventually returns to [−a/2, a/2] with probability 1.

Estimating the ELBO gradient

In variational inference, a number of unbiased gradient estimators are readily available. We discuss a few options.

1. Score Gradient. The ELBO gradient is

∇_β L(q_β) = ∇_β E_{q_β}[log µ(θ, Y)] + ∇_β H(β).

Often, the entropy can be differentiated analytically. The first term can be written

∇_β ∫ q_β(θ) log µ(θ, Y) dθ = ∫ [∇_β log q_β(θ)] log µ(θ, Y) q_β(θ) dθ
                            = E_{q_β}[∇_β log q_β(θ) log µ(θ, Y)],

assuming the regularity needed to differentiate under the integral. The right hand side can be estimated by Monte Carlo. Sample θ(1), . . . , θ(n) i.i.d. from q_β and define the estimator

∇̂_β L(q_β) = (1/n) ∑_{i=1}^n ∇_β log q_β(θ(i)) log µ(θ(i), Y) + ∇_β H(β).

While unbiased, this Monte Carlo estimator tends to have high variance, as it does not exploit information about the gradient of log µ(θ, Y) with respect to θ. It is possible to apply variance reduction techniques such as importance sampling or control variates, but their effectiveness is variable. When the entropy term cannot be differentiated analytically, we can apply the same trick, noting

∇_β ∫ q_β(θ) log q_β(θ) dθ = ∫ [∇_β q_β(θ)] log q_β(θ) + q_β(θ) ∇_β log q_β(θ) dθ
 = ∫ [∇_β q_β(θ)] log q_β(θ) dθ + ∫ ∇_β q_β(θ) dθ
 = ∫ [∇_β q_β(θ)] log q_β(θ) dθ + ∇_β ∫ q_β(θ) dθ
 = ∫ [∇_β log q_β(θ)] log q_β(θ) q_β(θ) dθ,

where in the last equality we used the fact that q_β is a density and integrates to 1. The last expression is an expectation which can be estimated by Monte Carlo.

2. Reparametrisation Trick. We express q_β as a transformation of a standard distribution π, like a multivariate normal or a uniform, such that if Z ∼ π, then f(Z; β) ∼ q_β, for some function f(·; β). For a univariate variable, we can let π be Uniform(0, 1) and f(·; β) be the inverse CDF of q_β. Then,

∇_β L(q_β) = ∇_β E_π[log µ(f(Z; β), Y) − log q_β(f(Z; β))]
           = E_π[ ∇_β [log µ(f(Z; β), Y) − log q_β(f(Z; β))] ].

Now, applying the chain rule,

∇_β L(q_β) = E_π[ ∇_θ log µ(f(Z; β), Y) ∇_β f(Z; β) − ∇_β log q_β(f(Z; β)) ].

And this can be estimated by Monte Carlo by drawing samples Z(1), . . . , Z(n) i.i.d. from π, and defining

∇̂_β L(q_β) = (1/n) ∑_{i=1}^n [ ∇_θ log µ(f(Z(i); β), Y) ∇_β f(Z(i); β) − ∇_β log q_β(f(Z(i); β)) ].

Like the score gradient, this estimator is unbiased, but it tends to have much better variance, because it uses the gradient of log µ(θ, Y). This idea has been applied to very complex variational approximations, for example, transformations of a normal distribution through a deep neural network, and in many cases a single sample n = 1 is enough to obtain an estimate with sufficient accuracy.
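Continuing the sketch above, the reparametrisation f(Z; β) = β + Z with Z ∼ N(0, 1) gives ∇_β f = 1, and log q_β(f(Z; β)) does not depend on β for this family, so the estimator only needs the gradient of the assumed log joint.

    def reparam_gradient(beta, n=1):
        Z = rng.normal(size=n)              # Z^(i) i.i.d. from the standard normal
        grad_log_joint = -(beta + Z - 3.0)  # d/dtheta log_joint at theta = beta + Z
        return np.mean(grad_log_joint)

In this toy example a single sample already gives a low-variance estimate, in line with the remark about n = 1 often sufficing.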

3. Minibatches. In many applications, the data Y_1, . . . , Y_n are conditionally independent given the parameter θ, so the ELBO gradient can be written as a sum

∇_β L(q_β) = ∇_β E_{q_β}[log µ(θ, Y)] + ∇_β H(β)
 = ∑_{i=1}^n ∇_β E_{q_β}[log µ(Y_i | θ)] + ∇_β E_{q_β}[log µ(θ)] + ∇_β H(β).

If the data is ‘big’, with n very large, computing the sum could be expensive, even if it were possible to do it analytically. Thus, we can approximate the gradient by taking a subsample B ⊂ {1, . . . , n} chosen uniformly without replacement and defining the estimator

∇̂_β L(q_β) = (n/|B|) ∑_{i∈B} ∇_β E_{q_β}[log µ(Y_i | θ)] + ∇_β E_{q_β}[log µ(θ)] + ∇_β H(β).

This estimator is unbiased. In many cases, the loss of accuracy in the estimator is compensated by the ability to perform more gradient computations and thus take more steps of stochastic gradient ascent. The terms in the sum in the expression above can be replaced by unbiased estimators of the corresponding partial gradients.
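A minimal sketch of the minibatch rescaling follows; grad_i is an assumed user-supplied function returning (an unbiased estimate of) the ith term of the sum, and the prior and entropy terms are omitted for brevity.

    import numpy as np

    def minibatch_gradient(grad_i, n, batch_size, rng=np.random.default_rng(3)):
        # Uniform subsample without replacement, rescaled by n / |B| so that the
        # expectation over B recovers the full sum: the estimator stays unbiased.
        B = rng.choice(n, size=batch_size, replace=False)
        return (n / batch_size) * sum(grad_i(i) for i in B)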

10 Langevin Dynamics and Hamiltonian Monte Carlo

This section deals with MCMC samplers derived from continuous-time stochastic processes which have the target posterior as their stationary distribution. The material of this section is largely agnostic to the statistical model used, so we denote the target posterior density µ(x) and the parameter of interest X. We will find it useful to express µ(x) ∝ exp(−H(x)), where the function H is known as the Hamiltonian.

We will state a very general theorem which gives necessary and sufficient conditions for an Ito diffusion process on R^d to have stationary distribution µ. Then, we will list a number of special cases and explain how to construct an MCMC sampler from a diffusion. An Ito diffusion is the solution of the stochastic differential equation

dX_t = b(X_t) dt + σ(X_t) dB_t,  (28)

where (B_t)_{t≥0} is a Wiener process, b is a Lipschitz continuous drift function, and σ is a Lipschitz continuous diffusion coefficient.

77 theorem. Suppose the diffusion coefficient σ(X_t) has

σ(x) σ(x)^⊤ / 2 = D(x)

for some positive semidefinite matrix D(x), and define the drift

b(x) = −[D(x) + Q(x)] ∇H(x) + Γ(x),  (29)
Γ_i(x) = ∑_{j=1}^d ∂_{x_j} (D_ij(x) + Q_ij(x)),

where Q is a skew-symmetric matrix. Then, the distribution µ(x) ∝ exp(−H(x)) is stationary for the Ito diffusion with parameters (b, σ). Conversely, if a diffusion with parameters (b, σ) has a unique stationary distribution µ, and

b_i(x) − ∑_{j=1}^d ∂_{x_j} D_ij(x)


is µ-integrable for each i, then there is a skew-symmetric matrix Q such that b can be written as in Eq. 29.

78 exercise. Prove the first claim in this theorem, using the Fokker–Planck equation describing the evolution of the density µ_t of X_t:

∂_t µ_t(x) = −∑_{i=1}^d ∂_{x_i} [b_i(x) µ_t(x)] + ∑_{i=1}^d ∑_{j=1}^d ∂²/(∂x_i ∂x_j) [D_ij(x) µ_t(x)].

79 example. The diffusion with constant coefficient, D(x) = D, and Q = 0 is known as (overdamped) Langevin dynamics. When D(x) is allowed to depend on x and is the metric tensor of a Riemannian manifold, we obtain Riemannian Langevin dynamics.

In the case when D is non-zero, we can define a Metropolis–Hastings algorithm with target µ by setting the proposal distribution to be a Euler–Maruyama discretisation of a diffusion with stationary distribution µ. Namely, given a step size δ > 0 and the state of the Markov chain X_t at time t, we propose a transition to

x′ = X_t + δ b(X_t) + √δ σ(X_t) Z,

where Z ∼ N(0, I), and accept or reject according to the Metropolis–Hastings criterion.

80 exercise. Derive the Metropolis–Hastings acceptance ratio for a sampler based on Langevin dynamics in terms of the Hamiltonian H and the tensor D.

If the step size δ is small, the proposal distribution will be close to the transition kernel K_δ(X_t, x′) of the corresponding diffusion process. Furthermore, if Q(x) = 0, the diffusion process is µ-reversible, and this implies that the Metropolis–Hastings acceptance rate is close to 1. In fact, if we were able to propose exactly from a kernel q which is µ-reversible, the Metropolis–Hastings acceptance probability would be

min{ 1, [µ(x′) q(x′, X_t)] / [µ(X_t) q(X_t, x′)] } = 1,

and every proposal would get accepted.

10.1 Hamiltonian Monte Carlo

When the tensor D(x) = 0, the stochastic differential equation 28 becomes a first-order ordinary differential equation. In this case, constructing an MCMC algorithm which leaves µ invariant requires more sophisticated techniques. Hamiltonian Monte Carlo augments the random variable X with a vector of momenta P, also in R^d, which is independent of X and has a N(0, M) distribution. Letting U(x) = −log µ(x), we will denote the joint density


µ(x, p) ∝ exp(−H(x, p)), with

H(x, p) = U(x) + p^⊤ M⁻¹ p / 2.

A process in the class defined by Theorem 77 with D(x) = 0 and

Q = [ 0   I ]
    [ −I  0 ]

defines a Hamiltonian dynamics on (X_t, P_t)_{t≥0}, which can be equivalently described by the more familiar equations

∂_t X_t = M⁻¹ P_t,
∂_t P_t = −∇U(X_t).

Hamiltonian dynamics satisfies three properties which are useful for the design of MCMC algorithms. The first is time reversibility. Define the operator T_t : R^{2d} → R^{2d} which maps (X_0, P_0) ↦ (X_t, P_t). In the context of Hamiltonian dynamics, reversibility means

T_t(x, p) = (x′, p′) =⇒ T_t(x′, −p′) = (x, −p).

The second useful property is energy conservation, which means that H(X_t, P_t) is invariant with respect to time t. This is a simple consequence of Hamilton’s equations; by the chain rule,

dH(X_t, P_t)/dt = (∂_t X_t)^⊤ ∇U(X_t) + (∂_t P_t)^⊤ M⁻¹ P_t
                = (M⁻¹P_t)^⊤ ∇U(X_t) − ∇U(X_t)^⊤ M⁻¹ P_t = 0.

The third property is volume conservation, or the fact that T_t is a symplectic map; this is known as Liouville’s theorem.

81 theorem (Liouville’s theorem). Suppose (X_0, P_0) is a random vector with density ν_0 with respect to the Lebesgue measure in R^{2d}. Let ν_t be the density of (X_t, P_t), i.e. the pushforward measure of ν_0 through T_t. Then, for any point (x, p) and any time t > 0, we have ν_0(x, p) = ν_t(T_t(x, p)).

Proof sketch. A first-order approximation of the dynamics is

T_dt(x, p) = (x + ∇_p H(x, p) dt + O(dt²), p − ∇_x H(x, p) dt + O(dt²)).

From this, we can check that the Jacobian determinant of the Hamiltonian map (x, p) ↦ T_dt(x, p) is

det J = 1 + ∑_{i=1}^d ( ∂²H(x, p)/∂x_i∂p_i − ∂²H(x, p)/∂p_i∂x_i ) dt + O(dt²) = 1 + O(dt²).

Thus d(det J)/dt = 0. This implies that the volume of an infinitesimal element is preserved by the dynamics, and the density ν_t(T_t(x, p)) is constant.

If we were able to compute the Hamiltonian map T_t, we could define an MCMC sampler as follows. Given a state (X_n, P_n) at iteration n,

1. Sample P_{n+1} ∼ N(0, M) and set X_{n+1} = X_n.

2. Define (x′, −p′) = T_t(X_{n+1}, P_{n+1}). Set (X_{n+2}, P_{n+2}) = (x′, p′) with probability min{1, exp(H(X_{n+1}, P_{n+1}) − H(x′, p′))}, and set (X_{n+2}, P_{n+2}) = (X_{n+1}, P_{n+1}) otherwise.

82 lemma. If H is continuous and differentiable, this algorithm has stationary distribution µ(x, p) ∝ exp(−H(x, p)).

Proof. We show that each transition is µ-stationary. The first step just resamples the momenta from their marginal distribution; as X and P are independent in µ, this leaves µ stationary.

Let K be the Markov kernel of step 2,

K((x, p), ·) = (1 − α) δ_{(x,p)}(·) + α δ_{(x∗,p∗)}(·),

where T_t(x, p) = (x∗, −p∗) and α = min{1, exp(H(x, p) − H(x∗, p∗))}. We shall prove detailed balance,

∫_B µ(dx, dp) K((x, p), A) = ∫_A µ(dx, dp) K((x, p), B),

for any pair of disjoint, compact, convex sets A and B, which implies that K leaves µ stationary. Let B∗ be the image of B through the mapping (x, p) ↦ (x∗, p∗), and let C = A ∩ B∗. By the reversibility of Hamiltonian dynamics, this mapping is an involution, and A∗ ∩ B = C∗. Thus, we just need to establish

∫_{C∗} µ(dx, dp) K((x, p), C) = ∫_C µ(dx, dp) K((x, p), C∗).  (30)

The right hand side is

∫_C µ(dx, dp) min{1, exp(H(x, p) − H(x∗, p∗))}
 = ∫_C exp(−H(x, p)) min{1, exp(H(x, p) − H(x∗, p∗))} dx dp
 = ∫_C min{exp(−H(x, p)), exp(−H(x∗, p∗))} dx dp.

The change of variables from (x, p) to (x∗, p∗) has Jacobian determinant 1 by Liouville’s theorem, so the integral in the final line is equal to

∫_{C∗} min{exp(−H(x∗, p∗)), exp(−H(x, p))} dx∗ dp∗,

which is identical to the left hand side of Eq. 30.

The property of energy conservation would make the acceptance probability equal to 1. This algorithm alternates resampling the momenta with surfing along curves of constant Hamiltonian. The second step can allow the positions X to get far from their state of origin in a single iteration, which would make the algorithm potentially much more efficient than diffusion-based samplers.

Unfortunately, it is rarely possible to evaluate the mapping T_t, so we must resort to discretisation. The Leapfrog algorithm with step size ε consists of the following iteration:

p(t + ε/2) = p(t) − (ε/2) ∇U(x(t)),
x(t + ε) = x(t) + ε M⁻¹ p(t + ε/2),
p(t + ε) = p(t + ε/2) − (ε/2) ∇U(x(t + ε)).

The name derives from the fact that in order to approximate the state at time t + ε, we perform an intermediate update of the momenta to p(t + ε/2), which is used to compute x(t + ε) and p(t + ε). The sequence (x(nε), p(nε))_{n=1,...,L} approximates the flow of the Hamiltonian dynamics sampled at intervals ε, (X_{nε}, P_{nε})_{n=1,...,L}.

83 exercise. Prove that the Leapfrog integrator is time reversible.

This is known as a symplectic integrator because it also defines a volume-preserving dynamical system. This is because it is the composition of three transformations, (x(t), p(t)) → (x(t), p(t + ε/2)), (x(t), p(t + ε/2)) → (x(t + ε), p(t + ε/2)), and (x(t + ε), p(t + ε/2)) → (x(t + ε), p(t + ε)), each of which is a shear transform. This is a special instance of a technique called operator splitting, for Hamiltonians which are separable, i.e. which are the sum of a term depending on x and one depending on p.

The Hamiltonian Monte Carlo algorithm is defined as above, but we replace the exact Hamiltonian mapping T_t with the result of applying the leapfrog iteration L times with step size ε (so t = Lε). The proof that µ is a stationary distribution of this Markov chain follows the same argument as Lemma 82. The only difference is that, as the integrator does not preserve energy, the acceptance probability is not always 1.
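A minimal sketch of the resulting algorithm with M = I follows; the standard normal target, step size, and trajectory length are illustrative assumptions, and x is a numpy array.

    import numpy as np

    rng = np.random.default_rng(5)
    U = lambda x: 0.5 * np.sum(x**2)        # assumed potential, N(0, I) target
    grad_U = lambda x: x

    def hmc_step(x, eps=0.1, L=20):
        p = rng.normal(size=x.shape)        # step 1: resample momenta, P ~ N(0, I)
        H0 = U(x) + 0.5 * np.sum(p**2)
        x_new, p_new = x.copy(), p - 0.5 * eps * grad_U(x)   # initial half step
        for _ in range(L):
            x_new = x_new + eps * p_new
            p_new = p_new - eps * grad_U(x_new)
        p_new = p_new + 0.5 * eps * grad_U(x_new)            # undo extra half step
        H1 = U(x_new) + 0.5 * np.sum(p_new**2)
        # Step 2: accept or reject on the energy error of the discretisation.
        return x_new if np.log(rng.uniform()) < H0 - H1 else x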

10.2 No-U-Turn Sampler

Choosing the step size ε and the number of steps L in Hamiltonian Monte Carlo can be a challenge. The leapfrog algorithm does not preserve the Hamiltonian. Indeed, the energy drift |H(x(0), p(0)) − H(x(Lε), p(Lε))|, which determines the acceptance ratio in Metropolis–Hastings, is proportional to ε³ for L = 1, or ε² when L is large.

If we want to keep Lε fixed, choosing ε too large will worsen the energy drift and produce more rejections. On the other hand, making ε too small would be wasteful, as we need more computation for a fixed simulation length Lε. It is also important to tune the simulation time Lε, because beyond a certain length, having longer simulations would be wasteful.

The No-U-Turn sampler combines a heuristic for choosing the number of leapfrog steps L with a stochastic optimisation method for the step size ε.

First, for a fixed ε, the choice of L is done by growing a trajectory from the nth iterate (X_n, P_n) forward and backwards in time, until the trajectory starts to “double back on itself”. This is defined as the time at which the distance from the origin X_n starts to decrease. Then, we set (X_{n+1}, P_{n+1}) to one of the time points of this trajectory chosen at random. The details of the algorithm to grow the trajectory and to select the state (X_{n+1}, P_{n+1}) ensure that detailed balance is respected. The interested reader can see the publication by Hoffman and Gelman (Journal of Machine Learning Research, 2014).

The step size ε can then be tuned in order to make the acceptance probability close to 0.6, at which point the algorithm strikes the right balance between having a high acceptance rate and reducing the cost of each proposal. The main idea is to reduce ε when the empirical acceptance rate is too low, and increase ε when it is too high, following a Robbins–Monro schedule.

10.3 Riemannian Manifold HMC

The mass matrix M can be chosen to depend on the state x without changing the structure of Hamiltonian dynamics. Now, in the stationary measure µ, the momenta P have a normal distribution whose covariance M(X) depends on X.

While the tensor M can be chosen in many ways, a particularly relevant choice in statistical models, where the target is a posterior µ(X | Y), is to set M(X) to the Fisher information

M(X) = −E_{Y|X}[ ∂² log µ(Y | X) / ∂X ∂X^⊤ ],

which is always positive semidefinite. When the likelihood dominates the prior in the posterior distribution, this tensor approximates the curvature of the posterior. The resulting Hamiltonian dynamics will have higher velocity in directions in which the curvature is small, which reduces oscillatory motion. Of course, this comes at a computational price, because in order to simulate the momenta, we need to compute a Cholesky decomposition of M(X) at each iteration.

10.4 Stochastic gradient HMC

When the gradient of U = −log µ is not available analytically, it is possible to approximate it by Monte Carlo, and combine the structure of the Robbins–Monro algorithm with a Langevin dynamics Metropolis sampler. This type of algorithm uses a decreasing schedule of step sizes (ε_n)_{n≥0} to ensure that, asymptotically, the process converges to a diffusion with the correct stationary distribution. It is also possible to simulate Hamiltonian Monte Carlo using a stochastic gradient. In big-data applications, stochastic gradients can save significant computational resources per iteration, which sometimes offsets their adverse effect on the mixing times of Hamiltonian Monte Carlo.
