
Marginal likelihood computation for model selection and hypothesis testing: an extensive review

F. Llorente*, L. Martino**, D. Delgado*, J. Lopez-Santiago*

* Universidad Carlos III de Madrid, Leganés (Spain).

** Universidad Rey Juan Carlos, Fuenlabrada (Spain).

Abstract

This is an up-to-date introduction to, and overview of, marginal likelihood computation for model selection and hypothesis testing. Computing normalizing constants of probability models (or ratios of constants) is a fundamental issue in many applications in statistics, applied mathematics, signal processing and machine learning. This article provides a comprehensive study of the state of the art of the topic. We highlight limitations, benefits, connections and differences among the different techniques. Problems and possible solutions with the use of improper priors are also described. Some of the most relevant methodologies are compared through theoretical comparisons and numerical experiments.

Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection, hypothesis testing, quadrature rules, double-intractable posteriors, partition functions

1 Introduction

Marginal likelihood (a.k.a., Bayesian evidence) and Bayes factors are the core of the Bayesian theory for testing hypotheses and model selection [45, 74]. More generally, the computation of normalizing constants or ratios of normalizing constants has played an important role in statistical physics and numerical analysis [85]. In the Bayesian setting, the approximation of normalizing constants is also required in the study of the so-called double intractable posteriors [44].

Several methods have been proposed for approximating the marginal likelihood and other normalizing constants in the last decades. Most of these techniques were originally introduced in the field of statistical mechanics. Indeed, the marginal likelihood is analogous to a central quantity in statistical physics known as the partition function, which is in turn closely related to another important quantity often called the free energy. The relationship between statistical physics and Bayesian inference has been noted in several works [2, 42].

The model selection problem has also been addressed from different points of view. Several criteria have been proposed to deal with the trade-off between the goodness-of-fit of the model and its simplicity. For instance, the Akaike information criterion (AIC) and the focused information criterion (FIC) are two examples of these approaches [76, 15]. The Bayesian-Schwarz information criterion (BIC) is related to the marginal likelihood approximation, as discussed in Section 2. The deviance information criterion (DIC) is a generalization of the AIC which is often used in Bayesian inference [82, 83]. It is particularly useful for hierarchical models and it can be approximately computed from the outputs of a Markov Chain Monte Carlo (MCMC) algorithm. However, DIC is not directly related to the Bayesian evidence [71]. A different approach, also based on information theory, is the so-called minimum description length principle (MDL) [36]. MDL was originally derived for data compression, and was then applied to model selection and hypothesis testing. Roughly speaking, MDL considers that the best explanation for a given set of data is provided by the shortest description of that data [36].

In the Bayesian framework, there are two main classes of sampling algorithms. The first one consists in approximating the marginal likelihood of each of the different models. The second sampling approach extends the posterior space by including a discrete indicator variable m, denoting the m-th model [8, 34]. Monte Carlo schemes working on this extended space (jumping between parameter spaces, possibly of different dimensions) have been designed. We focus on the first approach.

In this work, we provide an extensive review of computational techniques for marginal likelihood computation. The main contribution is to present jointly numerous computational schemes (introduced independently in the literature), with a detailed description under the same notation, highlighting their differences, relationships, limitations and strengths. It is also important to remark that parts of the presented material are novel, i.e., not contained in previous works. We have studied, analyzed and jointly described, with a unified notation and classification, the methodologies presented in a vast literature, from the 1990s to the most recently proposed algorithms (see Table 1). We also discuss issues and solutions arising when improper priors are employed. Therefore, this survey provides an ample coverage of the literature, where we highlight important details and comparisons in order to facilitate the understanding of interested readers and practitioners. The different techniques have been classified into the four families given below, and are then described in detail.

1.1 Problem statement and main notation

In many applications, the goal is to make inference about a variable of interest, x = x_{1:D_x} = [x_1, x_2, . . . , x_{D_x}] ∈ X ⊆ R^{D_x}, where x_d ∈ R for all d = 1, . . . , D_x, given a set of observed measurements, y = [y_1, . . . , y_{D_y}] ∈ R^{D_y}. In the Bayesian framework, one complete model M is formed by a likelihood function ℓ(y|x,M) and a prior probability density function (pdf) g(x|M). All the statistical information is summarized by the posterior pdf, i.e.,

π̄(x|M) = p(x|y,M) = ℓ(y|x,M) g(x|M) / p(y|M),    (1)

where

Z = p(y|M) = ∫_X ℓ(y|x,M) g(x|M) dx,    (2)

is the so-called marginal likelihood, a.k.a., Bayesian evidence. This quantity is important for model selection purposes, as we show below. However, usually Z = p(y|M) is unknown and difficult to approximate, so that in many cases we are only able to evaluate the unnormalized target function,

π(x|M) = ℓ(y|x,M) g(x|M).    (3)

Note that π(x|M) ∝ π̄(x|M) [45, 74]. For the sake of simplicity, hereafter we use the simplified notation π̄(x) and π(x). Thus, note that

Z = ∫_X π(x) dx.    (4)

Model selection and testing hypotheses. Let us consider now M possible models (or hypotheses), M_1, . . . , M_M, with prior probability mass p_m = P(M_m), m = 1, . . . , M. Note that we can have variables of interest x^{(m)} = [x_1^{(m)}, x_2^{(m)}, . . . , x_{D_m}^{(m)}] ∈ X_m ⊆ R^{D_m}, with possibly different dimensions in the different models. The posterior of the m-th model is given by

p(M_m|y) = p_m p(y|M_m) / p(y) ∝ p_m Z_m,    (5)

where Z_m = p(y|M_m) = ∫_{X_m} ℓ(y|x_m, M_m) g(x_m|M_m) dx_m, and p(y) = Σ_{m=1}^M p(M_m) p(y|M_m). Moreover, the ratio of two marginal likelihoods,

Z_m / Z_{m'} = p(y|M_m) / p(y|M_{m'}) = [p(M_m|y)/p_m] / [p(M_{m'}|y)/p_{m'}],    (6)

also known as the Bayes factor, represents the ratio of posterior odds to prior odds of models m and m'. If some quantity of interest is common to all models, the posterior of this quantity can be studied via model averaging [38], i.e., the complete posterior distribution is a mixture of M partial posteriors linearly combined with weights proportional to p(M_m|y) (see, e.g., [61, 87]). Therefore, in all these scenarios, we need the computation of Z_m for all m = 1, . . . , M. In this work, we describe different computational techniques for calculating Z_m, mostly based on Markov Chain Monte Carlo (MCMC) and Importance Sampling (IS) algorithms [74]. Hereafter, we assume proper priors g(x|M_m); regarding the use of improper priors, see Section 6. Moreover, we usually denote Z, X, M, omitting the subindex m, to simplify notation. It is also important to remark that, in some cases, it is also necessary to approximate normalizing constants (that are also functions of the parameters) at each iteration of an MCMC algorithm, in order to allow the study of the posterior density. For instance, this is the case of the so-called double intractable posteriors [44].

Reversible jump approach. Other sampling approaches include a discrete model indicator variable m, which denotes the m-th model, i.e., they consider an extended posterior space [8, 34]. An MCMC method working on the extended space is then designed and run. In the well-known reversible jump MCMC [34], a Markov chain is generated allowing jumps between models with parameter spaces of possibly different dimensions. However, generally, these methods are difficult to tune and the mixing of the chain can be poor [37]. For further details, see also the interesting works [18, 33, 16]. In this work, we focus on the direct marginal likelihood approximation.


1.2 A general overview

After an in-depth revision of the literature, we have recognized four main families of techniques, described below. We list them in order of complexity of the underlying main idea, from the simplest to the most complex. However, each class can contain both simple and very sophisticated algorithms.

Family 1: Deterministic approximations. These methods consider an analytical approximation of the function π̄(x). The Laplace method and the Bayesian Information Criterion (BIC) belong to this family.

Family 2: Methods based on density estimation. This class of algorithms uses the equality

Ẑ = π(x∗) / π̂(x∗),    (7)

where π̂(x∗) ≈ π̄(x∗) represents an estimation of the posterior density π̄(x) at some point x∗. Generally, the point x∗ is chosen in a high-probability region. The techniques in this family differ in the procedure employed for obtaining the estimation π̂(x∗). A famous example is Chib's method [11].

Family 3: Importance sampling (IS) schemes. The IS methods are based on rewriting Eq. (2) as an expected value with respect to a simpler normalized density q(x), i.e., Z = ∫_X π(x)dx = E_q[π(x)/q(x)]. This is the most widely considered class of methods in the literature, containing numerous variants, extensions and generalizations. We devote Sections 3-4 to this family of techniques.

Family 4: Methods based on a vertical representation. These schemes rely on changing the expression of Z = ∫_X ℓ(y|x)g(x)dx (which is a multidimensional integral) into an equivalent one-dimensional integral [70, 89, 80]. Then, a quadrature scheme is applied to approximate this one-dimensional integral. The most famous example is the nested sampling algorithm [80]. Section 5 is devoted to this class of methods.

1.3 Other reviews and software packages

The related literature is rather vast. In this section, we provide a brief summary that intends to be illustrative rather than exhaustive, by means of Table 1. The most relevant (in our opinion) related surveys are compared according to the topics, material and schemes described in this work. The proportion of coverage and overlap with this work is roughly classified as partial, complete, or remarkable/more exhaustive. From Table 1, we can also notice the completeness of this work. We take into account also the completeness and the depth of details provided in the different derivations. Christian Robert's blog deserves a special mention (https://xianblog.wordpress.com), since Professor C. Robert has devoted several entries of his blog to very interesting comments regarding marginal likelihood estimation and related topics.


Software packages. Currently, there is specific software aimed at performing Bayesian model choice, and more specifically, the computation of marginal likelihoods, for general models. Different R packages are available. We can find the bridgesampling package [35], which implements the bridge sampling estimator (see Sect. 3.2) for computing marginal likelihoods given only a posterior sample and the log posterior function. More generally, under CRAN Task View: Bayesian Inference, there are several packages for performing Bayesian inference that also include functions to estimate the marginal likelihood, for instance, MCMCpack [48] and LaplacesDemon [84]. MCMCpack allows fitting a number of different models and calculating the corresponding marginal likelihoods, mainly by Laplace approximation and Chib's method (see Section 2). The LaplacesDemon main function produces an estimate of the marginal likelihood using the harmonic mean, generalized harmonic mean, Laplace-Metropolis or importance sampling estimators (see Sections 2 and 3). Additionally, the review by [26] provides the R code used to implement the different methods discussed in that review.

Structure of the paper. Section 2 is devoted to describing methods belonging to Family 1 and Family 2. Section 3 and Section 4 introduce the methods in Family 3. More specifically, in Section 3 we consider IS approaches using one, two or more proposal densities; some of them also require the use of MCMC techniques. In Section 4, we describe more sophisticated methods that combine IS schemes with MCMC algorithms. The vertical approach corresponding to Family 4 above is described in Section 5. In Section 6, we present how to deal with hypothesis testing and model selection problems when the employed prior densities are improper. Section 7 contains some theoretical examples and numerical experiments. In Section 8, we conclude with a final summary and discussion.

2 Methods based on deterministic approximations and density estimation

In this section, we consider approximations of π̄(x), or of its unnormalized version π(x), in order to obtain an estimation Ẑ. In a first approach, the methods consider π̄(x) or π(x) as a function, and try to obtain a good approximation within another parametric or non-parametric family of functions. Another approach consists in approximating π̄(x) only at one specific point x∗, i.e., π̂(x∗) ≈ π̄(x∗) (x∗ is usually chosen in high posterior probability regions), and then using the identity

Ẑ = π(x∗) / π̂(x∗).    (8)

The latter scheme is often called the candidate's estimator.

Laplace's method. Let us define x̂_MAP ≈ x_MAP = arg max_x π(x), and consider a Gaussian approximation of π̄(x) around x̂_MAP, i.e.,

π̂(x) = N(x | x̂_MAP, Σ̂),    (9)


Table 1: Coverage of the topics considered in this work by other surveys or works (classified as partial, complete, or remarkable/more exhaustive), with respect to Sections 2, 3.1-3.3, 4.1-4.3, 5.1-5.3 and 6. We take into account also the completeness and the depth of details provided in the different derivations; in the case of Section 4.1, we have also considered the subsections. The compared works are [27, 39], [31, Ch. 10], [63], [19], [9], [10, Ch. 5], [28], [4], [88], [47], [75], [26], [1], [70], [77], [40], [46], [90], [49], [5, 6], [72] and [69, 3]. [The individual table cells are not reproduced here.]

with Σ̂ = (−D²_x log π(x̂_MAP))^{-1}, which is the negative inverse Hessian matrix of log π at x̂_MAP. Replacing in Eq. (8), with x∗ = x̂_MAP, we obtain the Laplace approximation

Ẑ = π(x̂_MAP) / N(x̂_MAP | x̂_MAP, Σ̂) = (2π)^{D_x/2} |Σ̂|^{1/2} π(x̂_MAP).    (10)

This is equivalent to the classical derivation of Laplace's estimator, which is based on expanding log π(x) = log(ℓ(y|x)g(x)) as a quadratic function around x̂_MAP and substituting it in Z = ∫ π(x)dx, that is,

Z = ∫ π(x) dx = ∫ exp{log π(x)} dx    (11)
  ≈ ∫ exp{ log π(x̂_MAP) − (1/2)(x − x̂_MAP)^T Σ̂^{-1} (x − x̂_MAP) } dx    (12)
  = (2π)^{D_x/2} |Σ̂|^{1/2} π(x̂_MAP).    (13)


In [43], it is suggested to use the posterior simulation output to estimate the quantities x̂_MAP and Σ̂, using samples generated by a Metropolis-Hastings algorithm. The resulting method is called the "Laplace-Metropolis" estimator. The authors in [19] present different variants of Laplace's estimator.
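For illustration, the following Python sketch applies the Laplace approximation of Eqs. (9)-(13) to a toy one-dimensional Gaussian-conjugate model, for which Z is available in closed form; the model, the data and all variable names are our own illustrative choices, not part of the works cited above.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0                     # likelihood and prior scales (toy choices)
y = rng.normal(1.5, sigma, size=10)       # synthetic data
Dy, Dx = len(y), 1

def log_target(x):                        # log pi(x) = log l(y|x) + log g(x), unnormalized
    return norm.logpdf(y, loc=x, scale=sigma).sum() + norm.logpdf(x, 0.0, tau)

# 1) MAP estimate of the (here one-dimensional) parameter
x_map = minimize_scalar(lambda x: -log_target(x)).x

# 2) Sigma = (-d^2/dx^2 log pi(x_map))^{-1}, via a finite-difference second derivative
h = 1e-4
d2 = (log_target(x_map + h) - 2*log_target(x_map) + log_target(x_map - h)) / h**2
Sigma = -1.0 / d2

# 3) Laplace estimate, Eq. (10): Z_hat = (2*pi)^{Dx/2} |Sigma|^{1/2} pi(x_map)
Z_laplace = (2*np.pi)**(Dx/2) * np.sqrt(Sigma) * np.exp(log_target(x_map))

# exact evidence of this conjugate model: y ~ N(0, sigma^2 I + tau^2 11^T)
Z_true = multivariate_normal.pdf(y, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))
print(f"Laplace: {Z_laplace:.3e}   exact: {Z_true:.3e}")
```

In this example the posterior is exactly Gaussian, so the Laplace estimate matches the closed-form evidence up to the finite-difference error; in general the approximation is only as good as the Gaussian fit around the mode.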

Bayesian-Schwarz information criterion (BIC). Let us define x̂_MLE ≈ x_MLE = arg max_x ℓ(y|x). The quantity

BIC = D_x log D_y − 2 log ℓ(y|x̂_MLE),    (14)

was introduced by Gideon E. Schwarz in [79], where D_x represents the number of parameters of the model (x ∈ R^{D_x}), D_y is the number of data points, and ℓ(y|x̂_MLE) is the estimated maximum value of the likelihood function. The value of x̂_MLE can also be obtained using an MCMC scheme. The BIC expression can be derived similarly to Laplace's method, but this time with a second-order Taylor expansion of the log-likelihood around its maximum x̂_MLE and considering uniform improper priors over the parameters, resulting in

Z ≈ Ẑ = exp( log ℓ(y|x̂_MLE) − (D_x/2) log D_y ) = exp( −(1/2) BIC ),    (15)

and BIC ≈ −2 log Z. Then, smaller BIC values are associated with better models. Note that BIC clearly takes into account the complexity of the model, since higher BIC values are given to models with a larger number of parameters D_x. Namely, the penalty D_x log D_y discourages overfitting, since increasing the number of parameters virtually always improves the goodness of fit. Other criteria can be found in the literature, such as the well-known Akaike information criterion (AIC),

AIC = 2 D_x − 2 log ℓ(y|x̂_MLE).

However, these criteria are not approximations of the marginal likelihood Z, and they are usually founded on information-theoretic derivations. Generally, they have the form c_p − 2 log ℓ(y|x̂), where x̂ is a point estimate and the penalty term c_p, accounting for the model complexity, changes in each criterion (e.g., c_p = D_x log D_y in BIC and c_p = 2 D_x in AIC). Another example, that uses MCMC samples, is the Deviance Information Criterion (DIC), i.e.,

DIC = −(4/N) Σ_{n=1}^N log ℓ(y|x_n) + 2 log ℓ(y|x̄),   where   x̄ = (1/N) Σ_{n=1}^N x_n,    (16)

and {x_n}_{n=1}^N are outputs of an MCMC algorithm [82]. In this case, the point estimate is x̄ and the penalty is c_p = 4 log ℓ(y|x̄) − (4/N) Σ_{n=1}^N log ℓ(y|x_n). DIC is considered more appropriate for hierarchical models than AIC and BIC [82] (but it is not directly related to the marginal likelihood [71]).
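As a quick numerical illustration (again a toy Gaussian model of our own choosing, not taken from the cited references), BIC, AIC and the rough evidence approximation of Eq. (15) can be computed directly from the maximized log-likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 1.0
y = rng.normal(1.5, sigma, size=10)       # synthetic data
Dx, Dy = 1, len(y)

x_mle = y.mean()                          # MLE of the mean (sigma assumed known)
loglik_max = norm.logpdf(y, loc=x_mle, scale=sigma).sum()

BIC = Dx*np.log(Dy) - 2*loglik_max        # Eq. (14)
AIC = 2*Dx - 2*loglik_max
Z_bic = np.exp(-0.5*BIC)                  # Eq. (15): rough evidence approximation
print(f"BIC = {BIC:.2f}   AIC = {AIC:.2f}   Z_bic = {Z_bic:.3e}")
```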

Kernel density estimation (KDE). KDE can be used to approximate the value of the posterior density at a given point x∗, and then consider Ẑ = π(x∗)/π̂(x∗). For instance, we can build a kernel density estimate (KDE) of π̄(x) based on M samples distributed according to the posterior (obtained, for instance, via an MCMC algorithm), by using M normalized kernel functions k(x|µ_m, h) (with ∫_X k(x|µ_m, h)dx = 1 for all m), where µ_m is a location parameter and h is a scale parameter,

π̂(x∗) = (1/M) Σ_{m=1}^M k(x∗|µ_m, h),    {µ_m}_{m=1}^M ∼ π̄(x) (e.g., via MCMC).    (17)

The estimator is Ẑ = π(x∗)/π̂(x∗), where the point x∗ could be x̂_MAP. If we consider N different points x_1, . . . , x_N (selected without any specific rule), we can also write a more general approximation,

Ẑ = (1/N) Σ_{n=1}^N π(x_n) / π̂(x_n).    (18)

Remark. It is very important to note that the estimator above is biased, with a bias depending on the choices of (a) the points x_1, . . . , x_N, (b) the scale parameter h, and (c) the number of samples M used for building π̂(x). An improved version of this approximation can be obtained by the importance sampling approach described below, where x_1, . . . , x_N are drawn from the KDE mixture π̂(x). In this case, the resulting estimator is unbiased.
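The following sketch illustrates Eq. (17) and the estimator Ẑ = π(x∗)/π̂(x∗) on a toy Gaussian-conjugate model; exact posterior samples are used in place of an MCMC run, and all names and settings are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0
y = rng.normal(1.5, sigma, size=10)
Dy = len(y)

def log_target(x):                       # log pi(x) = log l(y|x) + log g(x)
    return norm.logpdf(y, loc=x, scale=sigma).sum() + norm.logpdf(x, 0.0, tau)

# conjugate posterior N(mu_post, s_post^2); used here instead of an MCMC run
prec = 1/tau**2 + Dy/sigma**2
mu_post, s_post = (y.sum()/sigma**2)/prec, np.sqrt(1/prec)
samples = rng.normal(mu_post, s_post, size=5000)

kde = gaussian_kde(samples)              # pi_hat(x), built from M = 5000 posterior samples
x_star = mu_post                         # a high-posterior-probability point
Z_kde = np.exp(log_target(x_star)) / kde(np.array([x_star]))[0]

Z_true = multivariate_normal.pdf(y, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))
print(f"KDE estimate: {Z_kde:.3e}   exact: {Z_true:.3e}")
```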

Chib’s method. In [11, 12], the authors present more sophisticated methods to estimate π(x∗)using outputs from Gibbs sampling and the Metropolis-Hastings (MH) algorithm respectively [74].Here we only present the latter method, since it can be applied in more general settings.In [12], the authors propose to estimate the value of the posterior in one point x∗, i.e., π(x∗), usingthe output from a MH sampler. More specifically, let us denote the current state as x. To drawa possible candidate as future state z ∼ ϕ(z|x) (where ϕ(z|x) the proposal density used withinMH), then the probability of accepting the new state in a MH scheme is [51, 74]

α(x, z) = min

{1,π(z)ϕ(x|z)

π(x)ϕ(z|x)

}. (19)

Note that this α satisfies the detailed balance condition, i.e.,

α(x, z)ϕ(z|x)π(x) = α(z,x)ϕ(x|z)π(z). (20)

By integrating in x both sides, we obtain∫

Xα(x, z)ϕ(z|x)π(x)dx =

Xα(z,x)ϕ(x|z)π(z)dx (21)

Xα(x, z)ϕ(z|x)π(x)dx = π(z)

Xα(z,x)ϕ(x|z)dx, (22)

hence finally we can solve with respect to π(z) obtaining

π(z) =

∫X α(x, z)ϕ(z|x)π(x)dx∫X α(z,x)ϕ(x|z)dx

. (23)


This suggests the following estimate of π̄(x∗) at a specific point x∗ (note that x∗ plays the role of z in the equation above),

π̂(x∗) = [ (1/N_1) Σ_{i=1}^{N_1} α(x_i, x∗) ϕ(x∗|x_i) ] / [ (1/N_2) Σ_{j=1}^{N_2} α(x∗, v_j) ],    {x_i}_{i=1}^{N_1} ∼ π̄(x),  {v_j}_{j=1}^{N_2} ∼ ϕ(x|x∗).    (24)

The same outputs of the MH scheme can be used as {x_i}_{i=1}^{N_1}. The final estimator is again Ẑ = π(x∗)/π̂(x∗), i.e.,

Ẑ = π(x∗) [ (1/N_2) Σ_{j=1}^{N_2} α(x∗, v_j) ] / [ (1/N_1) Σ_{i=1}^{N_1} α(x_i, x∗) ϕ(x∗|x_i) ],    {x_i}_{i=1}^{N_1} ∼ π̄(x),  {v_j}_{j=1}^{N_2} ∼ ϕ(x|x∗).    (25)

The point x∗ is usually chosen in a high-probability region. Interesting discussions are contained in [64, 65], where the authors also show that this estimator is a special case of the bridge sampling idea described in Section 3.2.
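A minimal, self-contained sketch of the estimator in Eqs. (24)-(25) is given below for a toy Gaussian-conjugate target (our own example; all names are illustrative). With a symmetric random-walk proposal ϕ(z|x), the acceptance probability reduces to min{1, π(z)/π(x)}, and the factor ϕ(x∗|x_i) in the numerator is simply a Gaussian density evaluation.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
sigma, tau = 1.0, 3.0
y = rng.normal(1.5, sigma, size=10)
Dy = len(y)

def log_target(x):                                  # log pi(x), unnormalized posterior
    return norm.logpdf(y, loc=x, scale=sigma).sum() + norm.logpdf(x, 0.0, tau)

s = 0.5                                             # random-walk proposal std: phi(z|x) = N(z; x, s^2)
def alpha(x, z):                                    # MH acceptance probability (symmetric proposal)
    return min(1.0, np.exp(log_target(z) - log_target(x)))

# run a Metropolis-Hastings chain targeting the posterior
N1, N2 = 5000, 5000
chain, x = np.empty(N1), 0.0
for i in range(N1):
    z = x + s*rng.normal()
    if rng.uniform() < alpha(x, z):
        x = z
    chain[i] = x

x_star = chain.mean()                               # a high-posterior-probability point
v = x_star + s*rng.normal(size=N2)                  # draws from phi(.|x_star)

num = np.mean([alpha(xi, x_star)*norm.pdf(x_star, loc=xi, scale=s) for xi in chain])
den = np.mean([alpha(x_star, vj) for vj in v])
pi_hat = num / den                                  # Eq. (24): posterior density estimate at x_star
Z_chib = np.exp(log_target(x_star)) / pi_hat        # Eq. (25)

Z_true = multivariate_normal.pdf(y, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))
print(f"Chib estimate: {Z_chib:.3e}   exact: {Z_true:.3e}")
```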

Interpolative approaches. Another possibility is to approximate Z by substituting the true π(x) with an interpolative or regression function π̂(x) in the integral (4). For simplicity, we focus on the interpolation case, but all the considerations can be easily extended to a regression scenario. Given a set of nodes {x_1, . . . , x_N} ⊂ X and N nonlinear functions k(x, x′) : X × X → R chosen in advance by the user (generally, centered around x′), we can build the interpolant of the unnormalized posterior π(x) as follows,

π̂_u(x) = Σ_{i=1}^N β_i k(x, x_i),    (26)

where β_i ∈ R and the subindex u denotes that it is an approximation of the unnormalized function π(x). The coefficients β_i are chosen such that π̂_u(x) interpolates the points {x_n, π(x_n)}, that is, π̂_u(x_n) = π(x_n). Then, we require that

Σ_{i=1}^N β_i k(x_n, x_i) = π(x_n),

for all n = 1, . . . , N. Hence, we can write an N × N linear system where the β_i are the N unknowns, i.e.,

[ k(x_1,x_1)  k(x_1,x_2)  . . .  k(x_1,x_N) ] [ β_1 ]   [ π(x_1) ]
[ k(x_2,x_1)  k(x_2,x_2)  . . .  k(x_2,x_N) ] [ β_2 ]   [ π(x_2) ]
[     ...         ...     . . .      ...    ] [ ... ] = [  ...   ]
[ k(x_N,x_1)  k(x_N,x_2)  . . .  k(x_N,x_N) ] [ β_N ]   [ π(x_N) ]    (27)


In matrix form, we have

K β = y,    (28)

where (K)_{i,j} = k(x_i, x_j) and y = [π(x_1), . . . , π(x_N)]^T. Thus, the solution is β = K^{-1} y. Now the interpolant π̂_u(x) = Σ_{i=1}^N β_i k(x, x_i) can be used to approximate Z as follows,

Ẑ = ∫_X π̂_u(x) dx = Σ_{i=1}^N β_i ∫_X k(x, x_i) dx.    (29)

If we are able to compute ∫_X k(x, x_i)dx analytically, we obtain an approximation Ẑ. Some suitable choices of k(·, ·) are rectangular, triangular and Gaussian functions. More specifically, if all the nonlinearities k(x, x_i) are normalized, the approximation of Z is simply Ẑ = Σ_{i=1}^N β_i. This approach is related to the so-called Bayesian quadrature (using a Gaussian process approximation) [73] and to the sticky proposal constructions within MCMC or rejection sampling algorithms [32, 30, 50, 62]. Adaptive schemes that sequentially add more nodes could also be considered, improving the approximation Ẑ [30, 50].
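A rough sketch of this idea with normalized Gaussian kernels k(x, x_i) = N(x | x_i, h²) is given below (our own toy one-dimensional example); the accuracy depends on the placement of the nodes and on the scale h, and the linear system can become ill-conditioned if h is large compared with the node spacing.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0
y_data = rng.normal(1.5, sigma, size=10)
Dy = len(y_data)

def target(x):                                    # pi(x) = l(y|x) g(x), unnormalized
    return np.exp(norm.logpdf(y_data[:, None], loc=x, scale=sigma).sum(axis=0)
                  + norm.logpdf(x, 0.0, tau))

nodes = np.linspace(-1.0, 4.0, 30)                # interpolation nodes covering the posterior mass
h = 0.2                                           # kernel scale
K = norm.pdf(nodes[:, None], loc=nodes[None, :], scale=h)   # K_ij = k(x_i, x_j), normalized Gaussians
beta = np.linalg.solve(K, target(nodes))          # interpolation conditions K beta = pi(nodes), Eq. (28)
Z_interp = beta.sum()                             # each kernel integrates to 1, so Z_hat = sum_i beta_i

Z_true = multivariate_normal.pdf(y_data, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))
print(f"Interpolation estimate: {Z_interp:.3e}   exact: {Z_true:.3e}")
```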

3 Techniques based on IS

Most of the techniques for approximating the marginal likelihood are based on the importance sampling (IS) approach. Other methods are directly or indirectly related to the IS framework. In this sense, this section is the core of this survey. The standard IS scheme relies on the following equality,

Z = ∫_X π(x) dx = ∫_X [π(x)/q(x)] q(x) dx = E_q[ π(x)/q(x) ]    (30)
  = ∫_X [ℓ(y|x) g(x)/q(x)] q(x) dx,    (31)

where q(x) is a simpler normalized proposal density, ∫_X q(x)dx = 1. Drawing N independent samples from the proposal q(x), the unbiased IS estimator of Z is

Ẑ_IS1 = (1/N) Σ_{i=1}^N π(x_i)/q(x_i)    (32)
      = (1/N) Σ_{i=1}^N w_i    (33)
      = (1/N) Σ_{i=1}^N ℓ(y|x_i) g(x_i)/q(x_i) = (1/N) Σ_{i=1}^N ρ_i ℓ(y|x_i),    {x_i}_{i=1}^N ∼ q(x),    (34)

where w_i = π(x_i)/q(x_i) are the standard IS weights and ρ_i = g(x_i)/q(x_i). An alternative IS estimator (denoted as IS vers-2) is given by considering a possibly unnormalized proposal pdf q(x) ∝ q̄(x), where q̄(x) is the corresponding normalized pdf (the case q(x) = q̄(x) is also included),

Ẑ_IS2 = [ Σ_{n=1}^N g(x_n)/q(x_n) ]^{-1} Σ_{i=1}^N [g(x_i)/q(x_i)] ℓ(y|x_i)    (35)
      = [ Σ_{n=1}^N ρ_n ]^{-1} Σ_{i=1}^N ρ_i ℓ(y|x_i)    (36)
      = Σ_{i=1}^N ρ̄_i ℓ(y|x_i),    {x_i}_{i=1}^N ∼ q̄(x).    (37)

The estimator above is biased. However, it is a convex combination of the likelihood values ℓ(y|x_i), since Σ_{i=1}^N ρ̄_i = 1, where ρ̄_i = ρ_i / Σ_{n=1}^N ρ_n. Hence, in this case, min_i ℓ(y|x_i) ≤ Ẑ_IS2 ≤ max_i ℓ(y|x_i). Moreover, the estimator allows the use of an unnormalized proposal pdf q(x) ∝ q̄(x), with ρ_i = g(x_i)/q(x_i). For instance, one could consider q(x) = π(x), i.e., generate samples {x_i}_{i=1}^N ∼ π̄(x) by an MCMC algorithm and then evaluate ρ_i = g(x_i)/π(x_i). Table 2 summarizes the IS estimators and shows some important special cases that will be described in the next section.

Table 2: IS estimators, Eqs. (32)-(35), and relevant special cases.

Ẑ_IS1 = (1/N) Σ_{i=1}^N [g(x_i)/q(x_i)] ℓ(y|x_i) = (1/N) Σ_{i=1}^N ρ_i ℓ(y|x_i):

Name | Estimator | q̄(x) | q(x) | Need of MCMC | Unbiased
Naive Monte Carlo | (1/N) Σ_{i=1}^N ℓ(y|x_i) | g(x) | g(x) | no | yes

Ẑ_IS2 = [Σ_{n=1}^N g(x_n)/q(x_n)]^{-1} Σ_{i=1}^N [g(x_i)/q(x_i)] ℓ(y|x_i) = Σ_{i=1}^N ρ̄_i ℓ(y|x_i):

Name | Estimator | q̄(x) | q(x) | Need of MCMC | Unbiased
Naive Monte Carlo | (1/N) Σ_{i=1}^N ℓ(y|x_i) | g(x) | g(x) | no | yes
Harmonic mean | [(1/N) Σ_{i=1}^N 1/ℓ(y|x_i)]^{-1} | π̄(x) | π(x) | yes | no

Two different sub-families of IS schemes are commonly used for computing normalizing constants (see also [10, Chapter 5]). The first approach uses draws from a proposal density q(x) that is completely known (i.e., we can both sample from it directly and evaluate it). Sophisticated choices of q̄(x) frequently imply the use of MCMC algorithms to sample from q̄(x), and that we can only evaluate q(x) ∝ q̄(x). The second class is formed by methods which use more than one proposal density. We describe the methods in increasing order of complexity.


3.1 Techniques using draws from one proposal density

In this section, all the techniques are IS schemes which use a unique proposal pdf, and are based on the identity in Eq. (30). The techniques differ in the choice of q(x). Note that the optimal proposal choice for IS would be q̄(x) = π̄(x) = (1/Z)π(x). This choice is clearly difficult for two reasons: (a) we have to draw from π̄(x), and (b) we do not know Z, hence we cannot evaluate q̄(x) but only q(x) = π(x) (where q(x) ∝ q̄(x)). However, there are some methods based on this idea, as shown below.

Naive Monte Carlo (arithmetic mean estimator). It is straightforward to note that the integral above can be expressed as Z = E_g[ℓ(y|x)], so we can draw N samples {x_i}_{i=1}^N from the prior g(x) and compute the following estimator,

Ẑ = (1/N) Σ_{i=1}^N ℓ(y|x_i),    {x_i}_{i=1}^N ∼ g(x),    (38)

namely, a simple average of the likelihood values over a sample from the prior. Note that Ẑ will be very inefficient (large variance) if the posterior is much more concentrated than the prior (i.e., small overlap between likelihood and prior pdfs). Therefore, alternatives have been proposed, as shown below. It is a special case of the IS estimator with the choice q(x) = g(x) (i.e., the proposal pdf is the prior).
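The sketch below (our own toy Gaussian-conjugate example, with illustrative names and settings) contrasts the naive Monte Carlo estimator of Eq. (38) with the standard IS estimator of Eq. (32) using a proposal roughly matched to the posterior; since the posterior is much narrower than the prior, the prior-based estimator is typically noisier for the same N.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0
y = rng.normal(1.5, sigma, size=10)
Dy, N = len(y), 10000

def log_lik(x):                                   # log l(y|x) for an array of x values
    return norm.logpdf(y[:, None], loc=x[None, :], scale=sigma).sum(axis=0)

log_prior = lambda x: norm.logpdf(x, 0.0, tau)

# Naive Monte Carlo, Eq. (38): samples from the prior
x_prior = rng.normal(0.0, tau, size=N)
Z_naive = np.mean(np.exp(log_lik(x_prior)))

# Standard IS, Eq. (32): Gaussian proposal roughly centred on the posterior mode
mu_q, s_q = y.mean(), 1.0
x_q = rng.normal(mu_q, s_q, size=N)
log_w = log_lik(x_q) + log_prior(x_q) - norm.logpdf(x_q, mu_q, s_q)
Z_is = np.mean(np.exp(log_w))

Z_true = multivariate_normal.pdf(y, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))
print(f"naive MC: {Z_naive:.3e}   IS: {Z_is:.3e}   exact: {Z_true:.3e}")
```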

Self-normalized Importance Sampling (Self-IS). Let us consider that the proposal is not normalized, i.e., we can evaluate q(x) ∝ q̄(x) only up to a normalizing constant, and let us denote c = ∫_X q(x)dx. Note that this also occurs in the ideal case of using q̄(x) = π̄(x) = (1/Z)π(x), where c = Z and q(x) = π(x). In this case, we have

Z/c = E_q̄[ π(x)/q(x) ] ≈ (1/N) Σ_{i=1}^N π(x_i)/q(x_i),    {x_i}_{i=1}^N ∼ q̄(x).    (39)

Therefore, we need an additional estimation of c. We can also use IS for this goal, considering a new normalized reference function f(x), i.e., ∫_X f(x)dx = 1. Now,

1/c = E_q̄[ f(x)/q(x) ] = ∫_X [f(x)/q(x)] q̄(x) dx ≈ (1/N) Σ_{i=1}^N f(x_i)/q(x_i),    {x_i}_{i=1}^N ∼ q̄(x).    (40)

Thus, the self-normalized IS estimator is

Ẑ = [ Σ_{i=1}^N f(x_i)/q(x_i) ]^{-1} Σ_{i=1}^N π(x_i)/q(x_i),    {x_i}_{i=1}^N ∼ q̄(x).    (41)

We show later that the self-normalized IS estimator is strictly related to the umbrella sampling idea described below, and that it contains as special cases the rest of the estimators included in this section (see Table 5).
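As a short illustration of Eq. (41) (our own toy example), the sketch below uses a deliberately unnormalized Gaussian proposal q(x), whose constant c is never needed, and takes f(x) equal to the prior:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0
y = rng.normal(1.5, sigma, size=10)
Dy, N = len(y), 10000

def log_target(x):                                # log pi(x), vectorized over x
    return (norm.logpdf(y[:, None], loc=x[None, :], scale=sigma).sum(axis=0)
            + norm.logpdf(x, 0.0, tau))

mu_q, s_q = y.mean(), 1.0
x_q = rng.normal(mu_q, s_q, size=N)               # samples from the normalized proposal q_bar(x)
log_q_unnorm = -(x_q - mu_q)**2/(2*s_q**2)        # q(x): unnormalized proposal evaluation
log_f = norm.logpdf(x_q, 0.0, tau)                # f(x): a normalized reference (here, the prior)

num = np.exp(log_target(x_q) - log_q_unnorm)      # pi(x_i)/q(x_i)
den = np.exp(log_f - log_q_unnorm)                # f(x_i)/q(x_i)
Z_self = num.sum() / den.sum()                    # Eq. (41)

Z_true = multivariate_normal.pdf(y, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))
print(f"self-normalized IS: {Z_self:.3e}   exact: {Z_true:.3e}")
```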

Reverse Importance Sampling (RIS). The RIS scheme [27], also known as reciprocal IS, can be derived from the identity

1/Z = E_π̄[ f(x)/π(x) ] = ∫_X [f(x)/π(x)] π̄(x) dx,    (42)

where we consider an auxiliary normalized function f(x), i.e., ∫_X f(x)dx = 1. Then, one could consider the estimator

Ẑ = [ (1/N) Σ_{i=1}^N f(x_i)/π(x_i) ]^{-1} = [ (1/N) Σ_{i=1}^N f(x_i)/(ℓ(y|x_i) g(x_i)) ]^{-1},    x_i ∼ π̄(x) (via MCMC).    (43)

The estimator above is consistent but biased. Indeed, the expression (1/N) Σ_{i=1}^N f(x_i)/π(x_i) is an unbiased estimator of 1/Z, but Ẑ in Eq. (43) is not an unbiased estimator of Z. Note that π̄(x) plays the role of the importance density, from which we need to draw samples. Therefore, another sampling technique (such as an MCMC method) must be used in order to generate samples from π̄(x). In this case, we do not need samples from f(x), although its choice affects the precision of the approximation. Unlike in the standard IS approach, f(x) must have lighter tails than π(x) = ℓ(y|x)g(x). For further details, see the example in Section 7.1. The RIS estimator is a special case of the self-normalized IS estimator in Eq. (41) when q(x) = π(x) and q̄(x) = π̄(x).

Harmonic mean (HM) estimators. The HM estimator can be directly derived from the following expected value,

E_π̄[ 1/ℓ(y|x) ] = ∫_X [1/ℓ(y|x)] π̄(x) dx    (44)
               = (1/Z) ∫_X [1/ℓ(y|x)] ℓ(y|x) g(x) dx = (1/Z) ∫_X g(x) dx = 1/Z.    (45)

The main idea is again to use the posterior itself as the proposal. Since direct sampling from π̄(x) is generally impossible, this task requires the use of MCMC algorithms. Thus, the HM estimator is

Ẑ = [ (1/N) Σ_{i=1}^N 1/ℓ(y|x_i) ]^{-1},    {x_i}_{i=1}^N ∼ π̄(x) (via MCMC).    (46)

The estimator above is a special case of RIS with f(x) = g(x) in Eq. (43) (i.e., the prior pdf). Moreover, the HM estimator is also a special case of Self-IS in Eq. (41), setting again f(x) = g(x) and q(x) = π(x) (so that q̄(x) = π̄(x)). The HM estimator converges almost surely to the correct value, but the variance of Ẑ^{-1} is often infinite. This manifests itself through the occasional occurrence of a value x_i with small likelihood and hence a large effect, so that the estimator can be somewhat unstable.¹ Generally, the HM estimator tends to overestimate the marginal likelihood. For further details, see the example in Section 7.1, recalling that the HM estimator is a special case of RIS.

¹ See the comments in Radford Neal's blog, https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/, where R. Neal defines the HM estimator as "the worst estimator ever".
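For completeness, the following sketch applies the HM estimator of Eq. (46) to the same kind of toy Gaussian-conjugate model used above (our own example, with exact posterior draws standing in for MCMC output); repeating the computation illustrates the variability induced by the heavy-tailed weights 1/ℓ(y|x_i).

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0
y = rng.normal(1.5, sigma, size=10)
Dy, N = len(y), 10000

log_lik = lambda x: norm.logpdf(y[:, None], loc=x[None, :], scale=sigma).sum(axis=0)

# exact conjugate posterior used in place of MCMC output
prec = 1/tau**2 + Dy/sigma**2
mu_post, s_post = (y.sum()/sigma**2)/prec, np.sqrt(1/prec)

Z_true = multivariate_normal.pdf(y, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))

for rep in range(3):                                # repeat to observe the variability
    x_post = rng.normal(mu_post, s_post, size=N)
    Z_hm = 1.0 / np.mean(np.exp(-log_lik(x_post)))  # Eq. (46)
    print(f"rep {rep}: HM estimate {Z_hm:.3e}   exact {Z_true:.3e}")
```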

Summary 1. Table 3 summarizes the estimators described above. Note that in the standard IS estimator the option q̄(x) = π̄(x) is not feasible, whereas it is possible for its second version.

Table 3: One-proposal estimators of Z.

Name | Estimator | Proposal pdf | Need of MCMC | Unbiased
Standard IS | (1/N) Σ_{i=1}^N ρ_i ℓ(y|x_i) | generic, q(x) | no | yes
Standard IS vers-2 | Σ_{i=1}^N ρ̄_i ℓ(y|x_i) | generic, q̄(x) ∝ q(x) | no, if q̄(x) ≠ π̄(x) | no
Naive Monte Carlo | (1/N) Σ_{i=1}^N ℓ(y|x_i) | prior, g(x) | no | yes
Harmonic mean | [(1/N) Σ_{i=1}^N 1/ℓ(y|x_i)]^{-1} | posterior, π̄(x) | yes | no
RIS | [(1/N) Σ_{i=1}^N f(x_i)/π(x_i)]^{-1} | posterior, π̄(x) | yes | no
Self-IS | [Σ_{i=1}^N f(x_i)/q(x_i)]^{-1} Σ_{i=1}^N π(x_i)/q(x_i) | generic, q̄(x) ∝ q(x) | no, if q̄(x) ≠ π̄(x) | no

Some of the estimators in Table 3 can be unified within a common formulation. Let us consider the problem of estimating a ratio of two normalizing constants c_1/c_2, where c_i = ∫ q_i(x)dx and q̄_i(x) = q_i(x)/c_i, i = 1, 2. Assuming we can evaluate both q_1(x) and q_2(x), and draw samples from one of them, say q̄_2(x), the importance sampling estimator of c_1/c_2 is

c_1/c_2 = E_q̄_2[ q_1(x)/q_2(x) ] ≈ (1/N) Σ_{i=1}^N q_1(x_i)/q_2(x_i),    {x_i}_{i=1}^N ∼ q̄_2(x).    (47)

This framework includes the estimators discussed in this section, which are based on simulating from just one proposal density, as shown in Table 4.

Table 4: Summary of techniques considering the expression (47).

Name | q_1(x) | q_2(x) | c_1 | c_2 | Proposal pdf q̄_2(x) | Estimator of c_1/c_2
Standard IS | π(x) | q(x) | Z | 1 | q(x) | Z
Naive Monte Carlo | π(x) | g(x) | Z | 1 | g(x) | Z
Harmonic mean | g(x) | π(x) | 1 | Z | π̄(x) | 1/Z
RIS | f(x) | π(x) | 1 | Z | π̄(x) | 1/Z

Below we consider an extension of Eq. (47) where an additional density q̄_3(x) is employed for generating samples.

Umbrella Sampling (a.k.a. ratio importance sampling). The IS estimator of c_1/c_2 given in Eq. (47) may be inefficient when there is little overlap between q̄_1(x) and q̄_2(x), i.e., when ∫_X q̄_1(x) q̄_2(x)dx is small. Umbrella sampling (originally proposed in the computational physics literature [86], and also studied under the name "ratio importance sampling" in [9]) is based on the identity

c_1/c_2 = (c_1/c_3)/(c_2/c_3) = E_q̄_3[ q_1(x)/q_3(x) ] / E_q̄_3[ q_2(x)/q_3(x) ] ≈ [ Σ_{i=1}^N q_1(x_i)/q_3(x_i) ] / [ Σ_{i=1}^N q_2(x_i)/q_3(x_i) ],    {x_i}_{i=1}^N ∼ q̄_3(x),    (48)

where q̄_3(x) ∝ q_3(x) represents a "middle" density, which is constructed to have large overlap with both q̄_i(x), i = 1, 2. The performance of umbrella sampling clearly depends on the choice of q̄_3(x). The optimal umbrella sampling density q̄_3^opt(x), which minimizes the asymptotic relative mean-square error, is

q̄_3^opt(x) = |q̄_1(x) − q̄_2(x)| / ∫ |q̄_1(x′) − q̄_2(x′)| dx′ = |q_1(x) − (c_1/c_2) q_2(x)| / ∫ |q_1(x′) − (c_1/c_2) q_2(x′)| dx′.    (49)

Since q̄_3^opt(x) depends on the unknown ratio c_1/c_2, it is not available for direct use. The following two-stage procedure is often used in practice:

1. Stage 1: Draw N_1 samples from an arbitrary density q̄_3^(1)(x) and use them to obtain

   r^(1) = [ Σ_{i=1}^{N_1} q_1(x_i)/q_3^(1)(x_i) ] / [ Σ_{i=1}^{N_1} q_2(x_i)/q_3^(1)(x_i) ],    {x_i}_{i=1}^{N_1} ∼ q̄_3^(1)(x),    (50)

   and define

   q_3^(2)(x) ∝ |q_1(x) − r^(1) q_2(x)|.    (51)

2. Stage 2: Draw N_2 samples from q̄_3^(2)(x) via MCMC and define the umbrella sampling estimator r^(2) of c_1/c_2 as follows,

   r^(2) = [ Σ_{i=1}^{N_2} q_1(x_i)/q_3^(2)(x_i) ] / [ Σ_{i=1}^{N_2} q_2(x_i)/q_3^(2)(x_i) ],    {x_i}_{i=1}^{N_2} ∼ q̄_3^(2)(x).    (52)
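The following sketch (our own one-dimensional toy example) implements the two-stage procedure of Eqs. (50)-(52) with q_1(x) = π(x) and q_2(x) a fixed normalized density, so that c_1/c_2 = Z; for simplicity, sampling from q̄_3^(2)(x) is done by gridding and inverting the CDF instead of running an MCMC algorithm.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 3.0
y = rng.normal(1.5, sigma, size=10)
Dy, N1, N2 = len(y), 5000, 5000

def q1(x):                                        # unnormalized posterior pi(x): c1 = Z
    return np.exp(norm.logpdf(y[:, None], loc=x[None, :], scale=sigma).sum(axis=0)
                  + norm.logpdf(x, 0.0, tau))

q2 = lambda x: norm.pdf(x, 1.0, 1.0)              # a normalized reference density: c2 = 1

# Stage 1: arbitrary middle density (here, the prior), Eq. (50)
q3_1 = lambda x: norm.pdf(x, 0.0, tau)
x1 = rng.normal(0.0, tau, size=N1)
r1 = np.sum(q1(x1)/q3_1(x1)) / np.sum(q2(x1)/q3_1(x1))

# Stage 2: q3_2(x) proportional to |q1(x) - r1*q2(x)|, Eq. (51);
# grid-based inverse-CDF sampling stands in for MCMC
grid = np.linspace(-6, 8, 4000)
dens = np.abs(q1(grid) - r1*q2(grid))
cdf = np.cumsum(dens); cdf /= cdf[-1]
x2 = np.interp(rng.uniform(size=N2), cdf, grid)
q3_2 = lambda x: np.interp(x, grid, dens)         # unnormalized middle density (enough for the ratio)
r2 = np.sum(q1(x2)/q3_2(x2)) / np.sum(q2(x2)/q3_2(x2))   # Eq. (52): umbrella estimate of Z

Z_true = multivariate_normal.pdf(y, mean=np.zeros(Dy),
                                 cov=sigma**2*np.eye(Dy) + tau**2*np.ones((Dy, Dy)))
print(f"stage-1: {r1:.3e}   stage-2: {r2:.3e}   exact: {Z_true:.3e}")
```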

Summary 2. Considering q_1(x) = π(x), q_2(x) = q̄_2(x) = f(x), c_1 = Z, c_2 = 1 and a generic c_3 ∈ R in Eq. (48), we obtain

Ẑ = [ Σ_{i=1}^N f(x_i)/q_3(x_i) ]^{-1} Σ_{i=1}^N π(x_i)/q_3(x_i),    {x_i}_{i=1}^N ∼ q̄_3(x),    (53)

which is the self-normalized IS (Self-IS) estimator in Eq. (41). Table 5 summarizes all the techniques obtained from the identity (48) for c_1/c_2 = Z. Note that Self-IS has the most general form and includes the rest of the estimators as special cases. In the next section, we discuss a generalization of Eq. (47) for the case where we use samples from both q̄_1(x) and q̄_2(x).

Table 5: Summary of techniques considering the umbrella sampling identity (48) for computing c_1/c_2 = Z. Note that Self-IS has the most general form and includes the rest of the estimators as special cases.

Name | q_1(x) | q_2(x) | q_3(x) | c_1 | c_2 | c_3 | Sampling from q̄_3(x)
Self-IS | π(x) | f(x) | q(x) | Z | 1 | c_3 | q̄(x)
Naive Monte Carlo | π(x) | g(x) | g(x) | Z | 1 | 1 | g(x)
Harmonic mean | π(x) | g(x) | π(x) | Z | 1 | Z | π̄(x)
RIS | π(x) | f(x) | π(x) | Z | 1 | Z | π̄(x)
Standard IS; Eq. (32) | π(x) | q(x) | q(x) | Z | 1 | 1 | q(x)
Standard IS vers-2; Eq. (35) | π(x) | g(x) | q(x) | Z | 1 | 1 | q̄(x)

3.2 Techniques using draws from two proposal densities

In the previous section we considered estimators of Z that use samples drawn from a single proposal density. In this section we consider estimators of Z that generate samples from two proposal densities, denoted as q̄_i(x) = q_i(x)/c_i, i = 1, 2. All the techniques that we describe below are based on the following bridge sampling identity [63],

c_1/c_2 = E_q̄_2[ q_1(x) α(x) ] / E_q̄_1[ q_2(x) α(x) ].    (54)

Note that the expression above is an extension of Eq. (47): indeed, taking α(x) = 1/q_2(x), we recover Eq. (47). Moreover, if we set q_1(x) = π(x), c_1 = Z, q_2(x) = q(x) and c_2 = 1, the identity becomes

Z = E_q[ π(x) α(x) ] / E_π̄[ q(x) α(x) ].    (55)

Figure 1 summarizes the connections among Eqs. (47), (48), (54), (55) and the corresponding different methods. The standard IS and RIS schemes have been described in the previous sections, whereas the corresponding locally-restricted versions are introduced below.

Locally-restricted IS and RIS. In the literature, there exist variants of the estimators in Eqs. (38) and (46). These corrected estimators are attempts to improve the efficiency (e.g., to remove the infinite-variance cases, especially in the harmonic mean estimator) by restricting the integration to a smaller subset of X, generally denoted by B ⊂ X and usually chosen in regions of high posterior/likelihood values. As an example, B can be a rectangular or ellipsoidal region centered at the MAP estimate x̂_MAP.

Locally-restricted IS estimator. Consider the posterior mass of the subset B,

Π(B) = ∫_B π̄(x) dx = ∫_X I_B(x) [ℓ(y|x) g(x) / Z] dx,    (56)

Figure 1: Graphical representation of the relationships among Eqs. (47), (48), (54), (55) and the corresponding different methods: standard IS (α(x) = 1/q(x)), RIS (α(x) = 1/π(x)), locally-restricted IS (α(x) = I_B(x)/q(x)), locally-restricted RIS (α(x) = I_B(x)/π(x)) and standard bridge sampling (generic α(x)), obtained as particular choices of α(x) in Eq. (55) with q_1(x) = π(x), c_1 = Z, q_2(x) = q(x) and c_2 = 1.

where I_B(x) is an indicator function taking value 1 for x ∈ B and 0 otherwise. It leads to the following representation,

Z = [1/Π(B)] ∫_X I_B(x) ℓ(y|x) g(x) dx = [1/Π(B)] E_q[ I_B(x) ℓ(y|x) g(x) / q(x) ].    (57)

We can estimate Π(B) by considering N_2 samples from π̄(x) and taking the proportion of samples inside B. The resulting locally-restricted IS estimator of Z is

Ẑ = [ (1/N_1) Σ_{i=1}^{N_1} I_B(z_i) ℓ(y|z_i) g(z_i) / q(z_i) ] / [ (1/N_2) Σ_{i=1}^{N_2} I_B(x_i) ],    {z_i}_{i=1}^{N_1} ∼ q(x),  {x_i}_{i=1}^{N_2} ∼ π̄(x) (via MCMC).    (58)

Note that the above estimator requires samples from two densities, namely the proposal q(x) and the posterior density π̄(x) (via MCMC).

Locally-restricted RIS estimator. To derive the locally-restricted RIS estimator, consider the mass of B under q(x),

Q(B) = ∫_B q(x) dx = Z · E_π̄[ I_B(x) q(x) / (ℓ(y|x) g(x)) ],    (59)

which leads to the following representation,

Z = Q(B) / E_π̄[ I_B(x) q(x) / (ℓ(y|x) g(x)) ].    (60)

Q(B) can be estimated using a sample from q(x), by taking the proportion of sampled values inside B. The locally-restricted RIS estimator is

Ẑ = [ (1/N_1) Σ_{i=1}^{N_1} I_B(z_i) ] / [ (1/N_2) Σ_{i=1}^{N_2} I_B(x_i) q(x_i) / (ℓ(y|x_i) g(x_i)) ],    {z_i}_{i=1}^{N_1} ∼ q(x),  {x_i}_{i=1}^{N_2} ∼ π̄(x).    (61)

Other variants, where I_B(z) corresponds to high posterior density regions, can be found in [75].

Optimal construction of bridge sampling. Identities such as (54) are associated with the bridge sampling approach. However, considering α(x) = γ(x)/(q_1(x) q_2(x)) in Eq. (54), bridge sampling can also be motivated from the expression

c_1/c_2 = (c_3/c_2)/(c_3/c_1) = E_q̄_2[ γ(x)/q_2(x) ] / E_q̄_1[ γ(x)/q_1(x) ],    (62)

where the density γ̄(x) ∝ γ(x) is in some sense "in between" q̄_1(x) and q̄_2(x). That is, instead of applying (47) directly to c_1/c_2, we apply it to first estimate c_3/c_2 and c_3/c_1, and then take the ratio to cancel c_3. The bridge sampling estimator of c_1/c_2 is then

c_1/c_2 ≈ [ (1/N_2) Σ_{i=1}^{N_2} γ(z_i)/q_2(z_i) ] / [ (1/N_1) Σ_{i=1}^{N_1} γ(x_i)/q_1(x_i) ],    {x_i}_{i=1}^{N_1} ∼ q̄_1(x),  {z_i}_{i=1}^{N_2} ∼ q̄_2(x).    (63)

We do not need to draw samples from γ̄(x), but only to evaluate γ(x). It can be shown that the optimal bridge density γ̄(x) (optimal in the sense of providing the most efficient estimator of the ratio c_1/c_2) can be expressed as a weighted harmonic mean of q̄_1(x) and q̄_2(x), with weights given by the sampling rates,

γ̄^opt(x) = [ (N_2/(N_1+N_2)) [q̄_1(x)]^{-1} + (N_1/(N_1+N_2)) [q̄_2(x)]^{-1} ]^{-1}    (64)
         ∝ [ N_2 (c_1/c_2) [q_1(x)]^{-1} + N_1 [q_2(x)]^{-1} ]^{-1}    (65)
         ∝ γ^opt(x) = q_1(x) q_2(x) / ( N_1 q_1(x) + N_2 (c_1/c_2) q_2(x) ),    (66)

depending on the unknown ratio r = c_1/c_2. Therefore, we cannot even evaluate γ^opt(x). Hence, we need to resort to the following iterative procedure to approximate the optimal bridge sampling estimator. Noting that

γ^opt(x)/q_2(x) = q_1(x) / ( N_1 q_1(x) + r N_2 q_2(x) ),    γ^opt(x)/q_1(x) = q_2(x) / ( N_1 q_1(x) + r N_2 q_2(x) ),    (67)

the iterative procedure is formed by the following steps:

1. Start with an initial estimate r(1) ≈ c1c2

(using e.g. Laplace’s).

2. For t = 1, ..., T :

(a) Draw {xi}N1i=1 ∼ q1(x) and {zi}N2

i=1 ∼ q2(x) and iterate

r(t+1) =

1N2

∑N2

i=1

q1(zi)

N1q1(zi) +N2r(t)q2(zi)

1N1

∑N1

i=1

q2(xi)

N1q1(xi) +N2r(t)q2(xi)

. (68)

Optimal bridge sampling for Z. Given the considerations above, an iterative bridge samplingestimator of Z is obtained by setting q1(x) = π(x), c1 = Z, q2(x) = q(x), so that

Z(t+1) =

1N2

∑N2

i=1

π(zi)

N1π(zi) +N2Z(t)q(zi)

1N1

∑N1

i=1

q(xi)

N1π(xi) +N2Z(t)q(xi)

, {zi}N2i=1 ∼ q(x) and {xi}N1

i=1 ∼ π(x). (69)

for t = 1, ..., T . When N1 = 0, that is when all samples are drawn from q(x), the estimator abovereduces to (non-iterative) standard IS scheme with proposal q(x). When N2 = 0, that is when allsamples are drawn from π(x), the estimator becomes the (non-iterative) RIS estimator. See [9]for a comparison of optimal umbrella sampling, bridge sampling and path sampling (described inthe next section). An alternative derivation of the optimal bridge sampling estimator is given in[75], by simulating from a mixture of type ψ(x) ∝ π(x) + rq(x). However, the resulting estimatoremploys the same samples drawn from ψ(x) in the numerator and denominator, unlike in Eq.(69).

Several techniques described in the last two subsections, including both umbrella and bridgesampling, are encompassed by the generic formula

c1

c2

= Eξ[q1(x)α(x)]/Eχ[q2(x)α(x)] (70)

as shown in Table 6.

3.3 IS based on multiple proposal densities

In this section we consider estimators of Z using samples drawn from more than two proposaldensities. Typically these densities form a sequence of functions that are in some sense “in themiddle” between the posterior π(x) and an easier-to-work-with density (e.g. the prior g(x) orsome other proposal density). The number of such middle densities must be specified by theuser, and in some cases, it is equivalent to the selection of a temperature schedule for linkingg(x) and π(x). The resulting pdfs are usually called as tempered posteriors and correspond toflatter, spread distributions. The use of the tempered pdfs usually improve the mixing of the

19

Page 20: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Table 6: Summary of the IS schemes (with one or two proposal pdfs), using Eq. (70).

c1c2

= Eξ[q1(x)α(x)]/Eχ[q2(x)α(x)]

Name α(x) ξ(x) χ(x) q1(x) q2(x) c1 c2 sampling from

Bridge Identity - Eq. (54) α(x) q2(x) q1(x) q1(x) q2(x) c1 c2 q1(x), q2(x)

Bridge Identity - Eq. (62) γ(x)q2(x)q1(x) q2(x) q1(x) q1(x) q2(x) c1 c2 q1(x), q2(x)

Identity - Eq. (47) 1q2(x) q2(x) q1(x) q1(x) q2(x) c1 c2 q2(x)

Umbrella - Eq. (48) 1q3(x) q3(x) q3(x) q1(x) q2(x) c1 c2 q3(x)

Self-norm. IS - Eqs. (41) (53) 1q3(x) q3(x) q3(x) π(x) f(x) Z 1 q3(x)

Bridge Identity - Eq. (55) α(x) q(x) π(x) π(x) q(x) Z 1 π(x), q(x)Standard IS 1/q(x) q(x) π(x) π(x) q(x) Z 1 q(x)

RIS 1/π(x) q(x) π(x) π(x) q(x) Z 1 π(x)Locally-Restricted IS IB(x)/q(x) q(x) π(x) π(x) q(x) Z 1 π(x), q(x)

Locally-Restricted RIS IB(x)/π(x) q(x) π(x) π(x) q(x) Z 1 π(x), q(x)

algorithm and foster the exploration of the space X . This idea is shared by path sampling, thepower posterior methods and stepping-stone sampling described below. However, we start witha general IS scheme considering different proposals qn(x)’s. Some of them could be temperedposteriors and the generation would be performed by an MCMC method in this case.

Multiple Importance Sampling (MIS) estimators. Here, we consider to generate samplesfrom different proposal densities, i.e.,

xn ∼ qn(x), n = 1, ..., N. (71)

In this scenario, different proper importance weights can be used [24, 23, 22]. The most efficientMIS scheme considers the following weights

wn =π(xn)

1N

∑Ni=1 qi(xn)

=π(xn)

ψ(xn), (72)

where ψ(xn) = 1N

∑Ni=1 qi(xn). Indeed, considering the set of samples {xn}Nn=1 drawn in a

deterministic order, xn ∼ qn(x), and given a sample x∗ ∈ {x1, ...,xN} uniformly chosen in {xn}Nn=1,then we can write x∗ ∼ ψ(xn). The standard MIS estimator is

Z =1

N

N∑

n=1

wn =1

N

N∑

n=1

π(xn)

ψ(xn)(73)

=1

N

N∑

n=1

g(xn)`(y|xn)

ψ(xn), (74)

=1

N

N∑

n=1

ηn`(y|xn), xn ∼ qn(x), n = 1, ..., N. (75)

20

Page 21: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

where ηn = g(xn)ψ(xn)

. The estimator is unbiased [24]. As in the standard IS scheme, an alternativebiased estimator is

Z =1

N

N∑

n=1

ηn`(y|xn), xn ∼ qn(x), n = 1, ..., N, (76)

where ηn = ηn∑Ni=1 ηi

, so that∑N

i=1 ηi = 1 and we have a convex combination of likelihood values

`(y|xn)’s.

Path sampling. The method of path sampling for estimating c1c2

relies on the idea of buildingand drawing samples from a sequence of distributions linking q1(x) and q2(x) (a continuous path).For the purpose of estimating the marginal likelihood, we set q2(x) = g(x) and q1(x) = π(x) andwe “link them” by an univariate “path” with parameter β. Let

π(x|β), β ∈ [0, 1], (77)

denote a sequence of (probably unnormalized except for β = 0) densities such π(x|β = 0) = g(x)and π(x|β = 1) = π(x). The path sampling method for estimating the marginal likelihood isbased on expressing logZ as

logZ = Ep(x,β)

[U(x, β)

p(β)

], with U(x, β) =

∂βlog π(x|β), (78)

where the expectation is w.r.t. the joint p(x, β) = π(x|β)Z(β)

p(β), being Z(β) the normalizing constant

of π(x|β) and p(β) represents a density for β ∈ [0, 1]. Indeed, we have

Ep(x,β)

[U(x, β)

p(β)

]=

X

∫ 1

0

1

p(β)

∂βlog π(x|β)

π(x|β)

Z(β)p(β)dxdβ, (79)

=

X

∫ 1

0

1

π(x|β)

∂βπ(x|β)

π(x|β)

Z(β)dxdβ, (80)

=

X

∫ 1

0

1

Z(β)

∂βπ(x|β)dxdβ, (81)

=

∫ 1

0

1

Z(β)

∂β

(∫

Xπ(x|β)dx

)dβ, (82)

=

∫ 1

0

1

Z(β)

∂βZ(β)dβ, (83)

=

∫ 1

0

∂βlogZ(β)dβ, (84)

= logZ(1)− logZ(0) = logZ, (85)

where we substituted Z(β = 1) = Z(1) = Z and Z(β = 0) = Z(0) = 1. Thus, using a sample{xi, βi}Ni=1 ∼ p(x, β), we can write the path sampling estimator for logZ

logZ =1

N

N∑

i=1

U(xi, βi)

p(βi), {xi, βi}Ni=1 ∼ p(x, β). (86)

21

Page 22: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

The samples may be obtained by first drawing β′ from p(β) (see [28] for guidelines on optimal p(β)in one dimension) and then applying some MCMC steps to draw from π(x|β′) given β′. Often theso-called geometric path is employed (see also [28] for extensions),

π(x|β) = g(x)1−βπ(x)β (87)

= g(x)`(y|x)β, β ∈ [0, 1]. (88)

Note that π(x|β) is the posterior with a powered, “less informative” -“wider” likelihood (for thisreason, π(x|β) is often called a “power posterior”). In this case, we have

U(x, β) =∂

∂βlog π(x|β) (89)

= log `(y|x), (90)

so the path sampling identity becomes

logZ = Ep(x,β)

[log `(y|x)

p(β)

], (91)

which is also used in the power posterior method of [25], described next.

Method of Power Posteriors. The previous expression (91) can also be converted into anintegral in [0, 1] as follows

logZ = Ep(x,β)

[log `(y|x)

p(β)

], (92)

=

∫ 1

0

X

log `(y|x)

p(β)

π(x|β)

Z(β)p(β)dx, (93)

=

∫ 1

0

Xlog `(y|β)

π(x|β)

Z(β)dx, (94)

=

∫ 1

0

Eπ(x|β) [log `(y|x)] dβ. (95)

The power posterior method aims at estimating the integral above by applying a quadrature rule.For instance, using the trapezoidal rule, choosing a discretization 0 = β0 < β1 < · · · < βI−1 <βI = 1, leads to an approximation of type

logZ =I−1∑

i=1

(βi+1 − βi)Eπ(x|βi+1) [log `(y|x)] + Eπ(x|βi) [log `(y|x)]

2, (96)

where the expected values w.r.t. the power posteriors can be independently approximated viaMCMC

Eπ(x|βi) [log `(y|x)] ≈ 1

N

N∑

k=1

log `(y|xk,i), {xk,i}Nk=1 ∼ π(x|βi), i = 1, . . . , I. (97)

22

Page 23: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Power posteriors as proposal densities. Recall the general IS estimator of Z =∫X `(y|x)g(x)dx, which involves a weighted sum of likelihood evaluations at points {xi}Ni=1 ∼ q(x)

drawn from importance density q(x):

Z =N∑

i=1

ρi`(y|xi), ρi =

g(xi)q(xi)∑Nn=1

g(xn)q(xn)

∝ g(xi)

q(xi), (98)

where∑N

i=1 ρi = 1. Let q(x) = π(x|β) ∝ q(x) = π(x|β) = g(x)`(y|x)β, that is we use the power

posterior as importance density so ρi ∝ g(xi)g(xi)`(y|xi)β = 1

`(y|xi)β . The resulting IS estimator is

Z =

∑Ni=1

1`(y|xi)β `(y|xi)∑Ni=1

1`(y|xi)β

(99)

=

∑Ni=1 `(y|xi)1−β

∑Ni=1 `(y|xi)−β

{xi}Ni=1 ∼ π(x|β) (via MCMC). (100)

Table 7 shows that this technique includes different schemes for different values of β. Differentpossible MIS schemes can be also considered, i.e., using Eq. (76) for instance [24, 22].

Table 7: Different estimators of Z using q(x) ∝ g(x)`(y|x)β as importance density, with β ∈ [0, 1].

Name Coefficient β Weights ρi Estimator Z =∑Ni=1 ρi`(y|xi)

Naive Monte Carlo β = 0 1N

1N

∑Ni=1 `(y|xi)

Harmonic Mean Estimator β = 11

`(y|xi)∑j

1`(y|xj)

Z = 11N

∑Ni=1

1`(y|xi)

Power posterior

as proposal pdf 0 < β < 11

`(y|xi)β∑j

1

`(y|xj)βZ =

∑i `(y|xi)1−β∑i `(y|xi)−β

Stepping-stone (SS) sampling. Consider again π(x|β) ∝ g(x)`(y|x)β. The goal is to estimate

Z = Z(1)Z(0)

, which can be expressed as the following product

Z =Z(1)

Z(0)=

K∏

k=1

Z(βk)

Z(βk−1), (101)

where βk are often chosen as βk = kK

, k = 1, . . . , K, i.e., with a uniform grid in [0, 1]. The idea of

23

Page 24: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

SS sampling is to estimate each ratio rk = Z(βk)Z(βk−1)

by importance sampling as

rk =Z(βk)

Z(βk−1)= Eπ(x|βk−1)

[π(x|βk)π(x|βk−1)

](102)

= Eπ(x|βk−1)

[`(y|x)βk

`(y|x)βk−1

](103)

≈ 1

N

N∑

i=1

`(y|xi)βk−βk−1 , {xi}Ni=1 ∼ π(x|βk−1). (104)

We can improve the numerical stability by factoring the largest sampled likelihood term `max =maxi `(y|xi), so

rk =1

N(`max)βk−βk−1

N∑

i=1

(`(y|xi)`max

)βk−βk−1

, {xi}Ni=1 ∼ π(x|βk−1). (105)

Multiplying all ratio estimates yields the final estimator of Z

Z =K∏

k=1

rk. (106)

Some strategies for selecting the values of β’s are discussed in [25].

4 Advanced schemes combining MCMC and IS

In the previous sections, we have already introduced several methods which require the use ofMCMC algorithms in order to draw from complex proposal densities. The RIS estimator, pathsampling, power posteriors and the SS sampling schemes are some examples. In this section, wedescribe more sophisticated scheme which combines MCMC and IS techniques.

4.1 MCMC within IS schemes

In this section, we will see how to properly weight samples obtained by different MCMC iterations.We denote as K(z|x) the transition kernel which summarizes all the steps of the employed MCMCalgorithm. Note that generally K(z|x) cannot be evaluated. However, we can use MCMC kernelsK(z|x) in the same fashion as proposal densities, considering the concept of the so-called properweighting [45, 53].

4.1.1 Weighting a sample after an MCMC iteration

Let us consider the following procedure:

1. Draw x0 ∼ q(x) (where q(x) is normalized, for simplicity).

24

Page 25: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

2. Draw x1 ∼ K(x1|x0), where the kernel K leaves invariant density η(x) = 1cη(x), i.e.,

XK(x′|x)η(x)dx = η(x′). (107)

3. Assign to x1 the weight

ρ(x0,x1) =η(x0)

q(x0)

π(x1)

η(x1). (108)

This weight is proper in the sense that can be used for building unbiased estimator Z (or othermoments π(x)), as described in the Liu’s definition [74, Section 14.2], [45, Section 2.5.4]. Indeed,we can write

E[ρ(x0,x1)] =

X

Xρ(x0,x1)K(x1|x0)q(x0)dx0dx1,

=

X

X

η(x0)

q(x0)

π(x1)

η(x1)K(x1|x0)q(x0)dx0dx1,

=

X

π(x1)

η(x1)

[∫

Xη(x0)K(x1|x0)dx0

]dx1,

=

X

π(x1)

cη(x1)cη(x1)dx1 =

Xπ(x1)dx1 = Z. (109)

Note that if η(x) ≡ π(x) then ρ(x1) = π(x0)q(x0)

, i.e., the IS weights remain unchanged after an MCMC

iteration with invariant density π(x). Hence, if we repeat the procedure above N times generating

{x(n)0 ,x

(n)1 }Nn=1, we can build the following unbiased estimator of the Z,

Z =1

N

N∑

n=1

ρ(x(n)0 ,x

(n)1 ) =

1

N

N∑

n=1

η(x(n)0 )

q(x(n)0 )

π(x(n)1 )

η(x(n)1 )

(110)

In the next section, we extend this idea where different MCMC updates are applied, each oneaddressing a different invariant density.

4.1.2 Annealed Importance Sampling (An-IS)

In the previous section, we have considered the application of one MCMC kernel K(x1|x0) (thatcould be formed by different MCMC steps). Below, we consider the application of several MCMCkernels addressing different target pdfs, and show their consequence in the weighting strategy. Weconsider again a sequence of tempered versions of the posterior, π1(x), π2(x), . . ., πL(x) ≡ π(x),where the L-th version, πL(x), coincides with the target function π(x). On possibility is to definedifferent scaled version of the target,

πi(x) = [π(x)]βi = g(x)βi`(y|x)βi , where 0 < β1 ≤ β2 ≤ . . . ≤ βL = 1, (111)

or alternatively

πi(x) = g(x)`(y|x)βi where 0 < β1 ≤ β2 ≤ . . . ≤ βL = 1. (112)

25

Page 26: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

as in path sampling and power posteriors. In any case, smaller β values correspond to flatterdistributions.3 The use of the tempered sequence of target pdfs usually improve the mixing of thealgorithm and foster the exploration of the space X . Since only the last function is the true target,πL(x) = π(x), different schemes have been proposed for suitable weighting the final samples.Let us consider conditional L − 1 kernels Ki(z|x) (with L ≥ 2), representing the probabilityof different MCMC updates of jumping from the state x to the state z (note that each Ki cansummarize the application of several MCMC steps), each one leaving invariant a different temperedtarget, πi(x) ∝ πi(x). For one single sample, the Annealed Importance Sampling (An-IS) is givenin Table 8.

Table 8: Annealed Importance Sampling (An-IS)

1. Draw N samples x(n)0 ∼ π0(x) = q(x) for n = 1, ..., N .

2. For k = 1, . . . , L− 1 :

(a) Draw N samples x(n)k ∼ Kk(x|x(n)

k−1), leaving invariant πk(x) for n = 1, ..., N , i.e.,we generate N samples using an MCMC with invariant distribution πk(x).

(b) Compute the weight associated to the sample x(n)k ,

ρ(n)k =

k∏

i=0

πi+1(x(n)i )

πi(x(n)i )

= ρ(n)k−1

πk+1(x(n)k )

πk(x(n)k )

. (113)

3. Return the weighted sample {x(n)L−1, ρ

(n)L−1}Nn=1. An estimator of the marginal likelihood

is Z = 1N

∑Nn=1 ρ

(n)L−1. Combinations of An-IS with path sampling and power posterior

methods can be also considered, employing the information of the rest of intermediatedensities.

Note that, when L = 2, we have ρ(n)1 =

π1(x(n)0 )

q(x(n)0 )

π(x(n)1 )

π1(x(n)1 )

. If, π1 = π2 = . . . = πL−1 = η 6= π, then

the weight is ρL−1 =η(x

(n)0 )

q(x(n)0 )

π(x(n)L−1)

η(x(n)L−1)

.

The method above can be modified incorporating an additional MCMC transition xL ∼KL(x|xL−1), which leaves invariant πL(x) = π(x). However, since πL(x) is the true target pdf,as we have seen above the weight remains unchanged (see the case η(x) = π(x) in the previous

section). Hence, in this scenario, the output would be {x(n)L , ρ

(n)L } = {x(n)

L , ρ(n)L−1}, i.e., ρ

(n)L = ρ

(n)L−1.

This method has been proposed in [68] but similarly schemes can be found in [13, 29].

Remark. The stepping-stones (SS) sampling method described in Section 3.3 is strictly connectedto an Ann-IS scheme when the intermediate pdfs π`(x) are chosen as powered posteriors, i.e.,

3Another alternative is to use the so-called data tempering [13], for instance, setting πi(x) ∝ p(x|y1, . . . , yd+i),where d ≥ 1 and d+ L = Dy (recall that y = [y1, . . . , yDy ] ∈ RDy ).

26

Page 27: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

π`(x) ∝ `(y|x)`g(x). In the SS technique, all the samples, drawn from different powered posteriors,are used within an unique estimator of Z.

Interpretation as Standard IS. For the sake of simplicity, here we consider reversible kernels,i.e., each kernel satisfies the detailed balance condition

πi(x)Ki(z|x) = πi(z)Ki(x|z) so thatKi(z|x)

Ki(x|z)=πi(z)

πi(x). (114)

We show that the weighting strategy suggested by An-IS can be interpreted as a standard ISweighting considering the following an extended target density, defined in the extended space X L,

πg(x0,x1, . . . ,xL−1) = π(xL−1)L−1∏

k=1

Kk(xk−1|xk). (115)

Note that πg has the true target π as a marginal pdf. Let also consider an extended proposal pdfdefined as

qg(x0,x1, . . . ,xL−1) = q(x0)L−1∏

k=1

Kk(xk|xk−1). (116)

The standard IS weight of an extended sample [x0,x1, . . . ,xL−1] in the extended space X L is

w(x0,x1, . . . ,xL−1) =πg(x0,x1, . . . ,xL−1)

qg(x0,x1, . . . ,xL−1)=π(xL−1)

∏L−1k=1 Kk(xk−1|xk)

q(x0)∏L−1

k=1 Kk(xk|xk−1). (117)

Replacing the expression Ki(z|x)Ki(x|z)

= πi(z)πi(x)

in (117), we obtain the Ann-IS weights

w(x0,x1, . . . ,xL−1) =π(xL−1)

q(x0)

L−1∏

k=1

πk(xk−1)

πk(xk), (118)

=π1(x0)

q(x0)

L−1∏

k=1

πk+1(xk)

πk(xk)=

L−1∏

k=0

πk+1(xk)

πk(xk)= ρL−1, (119)

where we have used πL(x) = π(x) and just rearranged the numerator.

4.1.3 Generic Sequential Monte Carlo

In this section, we describe a sequential IS scheme which encompasses the previous Ann-ISalgorithm as a special case. The method described here uses jointly MCMC transitions and,additionally, resampling steps as well. It is called Sequential Monte Carlo (SMC), since we have asequence of target pdfs πk(x), k = 1, . . . , L [66]. This sequence of target densities can be defined bya state-space model as in a classical particle filtering framework (truly sequential scenario, wherethe goal is to track dynamic parameters). Alternatively, we can also consider a static scenario as

27

Page 28: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

in the previous sections, i.e., the resulting algorithm is an iterative importance sampler where weconsider a sequence of tempered densities as in Eqs. (111)-(112), where πL(x) = π(x) [66]. Let usagain define an extended proposal density in the domain X k,

qk(x1, . . . ,xk) = q1(x1)k∏

i=2

Fi(xi|xi−1) : X k → R, (120)

where q1(x1) is a marginal proposal and Fi(xi|xi−1) are generic forward transition pdfs, that willbe used as partial proposal pdfs. Extending the space from X k to X k+1 (increasing its dimension),note that we can write the recursive equation

qk+1(x1, . . . ,xk,xk+1) = Fk+1(xk+1|xk)qk(x1, . . . ,xk) : X k+1 → R.

The marginal proposal pdfs are

qk(xk) =

Xk−1

qk(x1, . . . ,xk)dx1:k−1

=

Xk−1

q1(x1)k∏

i=2

Fi(xi|xi−1)dx1:k−1, (121)

=

X

[∫

Xk−2

q1(x1)k∏

i=2

Fi(xi|xi−1)dx1:k−2

]Fk(xk|xk−1)dxk−1,

=

Xqk−1(xk−1)Fk(xk|xk−1)dxk−1, (122)

Therefore, we would be interested in computing the marginal IS weights, wk = πk(xk)qk(xk)

, for each

k. However note that, in general, the marginal proposal pdfs qk(xk) cannot be computed andthen cannot be evaluated. A suitable alternative approach is described next. Let us consider theextended target pdf defined as

πk(x1, . . . ,xk) = πk(xk)k∏

i=2

Bi−1(xi−1|xi) : X k → R, (123)

Bi−1(xi−1|xi) are arbitrary backward transition pdfs. Note that the space of {πk} increases as kgrows, and πk is always a marginal pdf of πk. Moreover, writing the previous equation for k + 1

πk+1(x1, . . . ,xk,xk+1) = πk+1(xk+1)k+1∏

i=2

Bi−1(xi−1|xi),

and writing the ratio of both, we get

πk+1(x1, . . . ,xk,xk+1)

πk(x1, . . . ,xk)=πk+1(xk+1)

πk(xk)Bk(xk|xk+1). (124)

28

Page 29: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Therefore, the IS weights in the extended space X k are

wk =πk(x1, . . . ,xk)

qk(x1, . . . ,xk)(125)

=πk−1(x1, . . . ,xk−1)

qk−1(x1, . . . ,xk−1)

πk(xk)πk−1(xk−1)

Bk−1(xk−1|xk)Fk(xk|xk−1)

, (126)

= wk−1πk(xk)Bk−1(xk−1|xk)πk−1(xk−1)Fk(xk|xk−1)

. (127)

where we have replaced wk−1 = πk−1(x1,...,xk−1)

qk−1(x1,...,xk−1). The recursive formula in Eq. (127) is the key

expression for several sequential IS techniques. The SMC scheme summarized in Table 9 is ageneral framework which contains different algorithms as a special cases [66].

Choice of the forward functions. One possible choice is to use independent proposal pdfs,i.e., Fk(xk|xk−1) = Fk(xk) or random walk proposal Fk(xk|xk−1), where Fk represents standarddistributions (e.g., Gaussian or t-Student). An alternative is to choose Fk(xk|xk−1) = Kk(xk|xk−1),i.e., an MCMC kernel with invariant pdf πk.

Choice of backward functions. It is possible to show that the optimal backward transitions{Bk}Lk=1 are [66]

Bk−1(xk−1|xk) =qk−1(xk−1)

qk(xk)Fk(xk|xk−1). (130)

This choice reduces the variance of the weights [66]. However, generally, the marginal proposal qkin Eq. (121) cannot be computed (are not available), other possible {Bk} should be considered.For instance, with the choice

Bk−1(xk−1|xk) =πk(xk−1)

πk(xk)Fk(xk|xk−1), (131)

we obtain

wk = wk−1

πk(xk)πk(xk−1)

πk(xk)Fk(xk|xk−1)

πk−1(xk−1)Fk(xk|xk−1)(132)

= wk−1πk(xk−1)

πk−1(xk−1)(133)

Moreover, if also Fk(xk|xk−1) = Kk(xk|xk−1) is an MCMC kernel with invariant πk, then we comeback to An-IS algorithm [68, 13, 29], described in Table 8. Several other methods are contained asspecial cases of algorithm in Table 9, with specific choice of {Bk}, {Kk} and {πk} (e.g., PopulationMonte Carlo schemes [7]).

29

Page 30: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Table 9: General Sequential Monte Carlo (SMC)

1. Draw x(n)1 ∼ q1(x), n = 1, . . . , N .

2. For k = 2, . . . , L :

(a) Draw N samples x(n)k ∼ Fk(x|x(n)

k−1).

(b) Compute the weights

w(n)k = w

(n)k−1

πk(x(n)k )Bk−1(x

(n)k−1|x

(n)k )

πk−1(x(n)k−1)Fk(xk|x(n)

k−1), (128)

= w(n)k−1γ

(n)k , , k = 1, . . . , L, (129)

where we set γ(n)k =

πk(x(n)k )Bk−1(x

(n)k−1|x

(n)k )

πk−1(x(n)k−1)Fk(xk|x

(n)k−1)

.

(c) Normalize the weights w(n)k =

w(n)k∑N

j=1 w(j)k

, for n = 1, ..., N .

(d) If ESS ≤ εN :(with 0 < ε < 1 and ESS is a effective sample size measure [55], see section 4.1.4)

i. Resample N times {x(1)k , . . . ,x

(N)k } according to {w(n)

k }Nn=1, obtaining

{x(1)k , . . . , x

(N)k }.

ii. Set x(n)k = x

(n)k , Zk = 1

N

∑Nn=1 w

(n)k and w

(n)k = Zk for all n = 1, . . . , N

[54, 53, 67, 61].

3. Return the cloud of weighted particles and Z = ZL = 1N

∑Nn=1 w

(n)L , if a proper weighting

of the resampled particles is used (as suggested in the step 2d-ii above). Otherwise, use

other estimator ZL as in Eq. (136) (for further details, see Section 4.1.4).

4.1.4 Evidence computation in a sequential framework

The generic algorithm in Table 9 employs also resampling steps. Resampling consists in drawingparticles from the current cloud according to the normalized importance weights w

(n)k , for

n = 1, ...., N . The resampling steps are applied only in certain iterations taking into account

an ESS approximation, such as ESS = 1∑Nn=1(w

(n)k )2

, or ESS = 1

maxn w(n)k

[41, 55]. Generally,If

1NESS is smaller than a pre-established threshold ε ∈ [0, 1], all the particles are resampled. Thus,

the condition for the adaptive resampling can be expressed as ESS < εN . When ε = 1, theresampling is applied at each iteration [20, 21]. If ε = 0, no resampling steps are applied, and wehave the SIS method described above. Here, we discuss separately different scenarios ε = 0 and0 < ε ≤ 1 how it is possible to compute sequentially the estimator of the marginal likelihood.

30

Page 31: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Sequential Importance Sampling (SIS), ε = 0. This a specific case of the generic SMCin Table 9, where no resampling steps are applied. Consider the weight recursion in Eq. (128)

w(n)k = w

(n)k−1γ

(n)k ,

at the k-th iteration, we have two possible equivalent marginal likelihood estimators [54, 53, 61],

Z(1)k =

1

N

N∑

n=1

w(n)k (134)

=1

N

N∑

n=1

w(n)k−1γ

(n)k =

1

N

N∑

n=1

k∏

j=1

γ(n)j , (135)

or

Z(2)k =

k∏

j=1

[N∑

n=1

w(n)j−1γ

(n)j

]. (136)

In SIS, they are completely equivalent, i.e., Z(1)k = Z

(2)k , indeed

Z(2)k =

k∏

j=1

[N∑

n=1

w(n)j−1

NZ(1)j−1

γ(n)j

], (137)

=k∏

j=1

1

NZ(1)j−1

[N∑

n=1

w(n)j−1γ

(n)j

], (138)

=k∏

j=1

1

NZ(1)j−1

[N∑

n=1

w(n)j

], (139)

=k∏

j=1

Z(1)j

Z(1)j−1

= Z(1)k . (140)

Therefore, in SIS, we can use both indifferently.Generic SMC ( 0 < ε ≤ 1) with proper weighting after resampling. Let consider that aproper weighting for resampled particles is used [54, 53] as applied in Table 9, i.e., the unnormalized

weights after resampling are set equal to Z(1)k = 1

N

N∑n=1

w(n)k ,

w(1)d = w

(2)d = . . . = w

(N)k = Z

(1)k . (141)

Then, it is possible to show that Z(1)k and Z

(2)k are again equivalent, Z

(1)k = Z

(2)k . It can be easily

seen for (139),

Z(2)k =

k∏

j=1

1

NZ(1)j−1

[N∑

n=1

w(n)j

],

Z(2)k =

k∏

j=1

1

NZ(1)j−1

[NZ

(1)j

]=

k∏

j=1

Z(1)j

Z(1)j−1

= Z(1)k . (142)

31

Page 32: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Note that, we alway have w(1)k = w

(2)k = . . . = w

(N)k = 1

N.

Generic SMC ( 0 < ε ≤ 1) without proper weighting after resampling. In many worksregarding particle filtering it is noted that the unnormalized weights of the resampled particles,

w(1)k = w

(2)k = . . . = w

(N)k ,

but a specific value is not assigned (or usually people set to 1). However, also in this case, wealways have

w(1)k = w

(2)k = . . . = w

(N)k =

1

N, (143)

then Z(2)k can still be applied. However, Z

(1)k , which involves the unnormalized weights, is not

more a valid estimator (it loses its statistical meaning) [54, 53].

4.2 Estimation based on Multiple Try MCMC schemes

The Multiple Try Metropolis (MTM) methods are advanced MCMC algorithms which considerdifferent candidates as possible new state of the chain [49, 60, 59]. More specifically, at eachiteration different samples are generated and compared by using some proper weights. Then oneof them is selected and tested as possible future state. The main advantage of these algorithms isthat they foster the exploration of a larger portion of the sample space, decreasing the correlationamong the states of the generated chain. Here, we consider the use of importance weightsfor comparing the different candidates, in order to provide also an estimation of the marginallikelihood [60]. More specifically, we consider the Independent Multiple Try Metropolis type 2(IMTM-2) scheme [49] with an adaptive proposal pdf. The algorithm is given in Table 10. themean vector and covariance matrix are adapted using the empirical estimators yielding by all theweighted candidates drawn so far, i.e., {zn,τ , wn,τ} for all n = 1, ..., N and τ = 1, ..., T . Twopossible estimators of the marginal likelihood can be constructed, one based on standard adaptiveimportance sampling argument Z(2) [5, 6] and other based on group importance sampling ideaprovided in [53].

4.3 Layered Adaptive Importance Sampling (LAIS)

The LAIS algorithm consider the use of N parallel (independent or interacting) MCMC chainswith invariant pdf π(x) or a tempered version π(x|β) [57, 5]. Each MCMC chain can addressed adifferent tempered version π(x|β). After T iterations of the N MCMC schemes (upper layer), theresulting NT samples, {µn,t}, for n = 1, ..., N and t = 1, ..., T are used are location parameters ofNT proposal densities q(x|µn,t,C). Then, these proposal pdfs are employed within a MIS scheme

(lower layer), weighting the generated samples xn,t’s with the generic weight wn,t = π(xn,t)

Φ(xn,t)[24, 22].

The denominator Φ(xn,t) is a mixture of (all or a subset of) proposal densities which specifies thetype of MIS scheme applied [24, 22]. The algorithm, with different possible choices of Φ(xn,t), isshown in Table 11. The first choice in (148) is the most costly since we have to evaluate all theproposal pdfs in all the generated samples xn,t’s, but provides the best performance in terms of

32

Page 33: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Table 10: Adaptive Independent Multiple Try Metropolis type 2 (AIMTM-2)

1. Choose the initial parameters µt, Ct of the proposal q, and initial state x0 and a firstestimation of the marginal likelihood Z0.

2. For t = 1, ..., T :

(a) Draw z1,t, ...., zN,t ∼ q(z|µt,Ct).

(b) Compute the importance weights wn,t = π(zn,t)

q(zn,t|µt,Ct) , for n = 1, ..., N .

(c) Normalize them wn,t = wn,t

NZ′where

Z ′ =1

N

N∑

i=1

wi,t, and set Rt = Z ′. (144)

(d) Resample x′ ∈ {z1,t, ...., zN,t} according to wn, with n = 1, ..., N .

(e) Set xt = x′ and Zt = Z ′ with probability

α = min

[1,

Z ′

Zt−1

](145)

otherwise set xt = xt−1 and Zt = Zt−1.

(f) Update µt,Ct computing the corresponding empirical estimators using {zn,τ , wn,τ}for all n = 1, ..., N and τ = 1, ..., T .

3. Return the chain {xt}Tt=1, {Zt}Tt=1 and {Rt}Tt=1. Two possible estimators of Z can beconstructed:

Z(1) =1

T

T∑

t=1

Zt, Z(2) =1

T

T∑

t=1

Rt. (146)

efficiency of the final estimator. The second and third choices are temporal and spatial mixtures,respectively. The last choice corresponds to standard importance weights given in Section 3.Let assume πn(x) = π(x) for all n in the upper layer. Considering also standard parallelMetropolis-Hastings chains in the upper layer, the number of posterior evaluations in LAIS is2NT . Thus, if only one chain N = 1 is employed in the upper layer, the number of posteriorevaluations is 2T .Special case with recycling samples. The method in [78] can be considered as a special caseof LAIS when N = 1, and {µt = xt} i.e., all the samples {xt}Tt=1 are generated by the uniqueMCMC chain with random walk proposal ϕ(x|xt−1) = q(x|xt−1) with invariant density π(x). Inthis scenario, the two layers of LAIS are collapsed in a unique layer, so that {µt = xt}. Namely,no additional generation of samples are needed in the lower layer, and the samples generated in

33

Page 34: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Table 11: Layered Adaptive Importance Sampling (LAIS)

1. Generate NT samples, {µn,t}, using N parallel MCMC chains of length T , each MCMCmethod using a proposal pdf ϕn(µ|µt−1), with invariant distributions a power posteriorπn(x) = π(x|βn) (with βn > 0) or a posterior pdf with a smaller number of data.

2. Draw NT samples xn,t ∼ qn,t(x|µn,t,C) where µn,t plays the role of the mean, and Cis a covariance matrix.

3. Assign to xn,t the weights

wn,t =π(xn,t)

Φ(xn,t). (147)

There are different possible choices for Φ(xn,t), for instance:

Φ(xn,t) =1

NT

T∑

k=1

N∑

i=1

qn,t(xn,t|µi,k,C), (148)

Φ(xn,t) =1

T

T∑

k=1

qn,t(xn,t|µn,k,C), (149)

Φ(xn,t) =1

N

N∑

i=1

qn,t(xn,t|µi,t,C), (150)

Φ(xn,t) = qn,t(xn,t|µi,t,C), (151)

4. Return all the pairs {xn,t, wn,t}, and Z = 1NT

∑Tt=1

∑Nn=1wn,t.

the upper layer (via MCMC) are recycled. Hence, the number of posterior evaluations is only T .The denominator for weights used in [78] is in Eq. (149), i.e., a temporal mixture as in [17]. Theresulting estimator is

Z =1

T

T∑

t=1

π(xt)1T

∑Tk=1 ϕ(xt|xt−1)

, {xt}Tt=1 ∼ π(x) (via MCMC with a proposal ϕ(·|·)).

Relationship with KDE method. LAIS can be interpreted as an extension of the KDE methodin Section 2, where the KDE function is also employed as a proposal density in the MIS scheme.Namely, the points used in Eq. (18), in LAIS they are drawn from the KDE function using thedeterministic mixture procedure [24, 23, 22].Compressed LAIS (CLAIS). Let us consider the T or N is large (i.e., either large chainsor several parallel chains; or both). Since NT is large, the computation of the denominatorsEqs. (148)- (149)- (150) can be expensive. A possible solution is to use a partitioning orclustering procedure [52] with K << NT clusters considering the NT samples, and then employ

34

Page 35: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

as denominator the function

Φ(x) =1

K

K∑

i=1

N (x|µk,Ck), (152)

where µk represents the centroid of the k-th cluster, and Ck = Σk + hI with Σk the empiricalcovariance matrix of k-th cluster and h > 0.Relationship with power-posteriors methods. In the upper layer of LAIS, we can use non-tempered versions of the posterior, i.e., πn(x) = π(x) for all n, or tempered versions of the posteriorπn(x) = π(x|βn) = `(y|x)βng(x). However, unlike in power posterior methods these samples areemployed only as location parameters µn,t of the proposal pdfs qn,t(x|µn,t,C), and they are notincluded in the final estimators. Combining the power-posteriors idea and the approach in [78], wecould recycle xn,t = µn,t and use qn,t(x|µn,t) = ϕn,t(x|µn,t) where we denote as ϕn,t the proposalpdfs employed in the MCMC chains. Another difference is that in LAIS the use of an “anti-tempered” power posterior with βn > 1 is allowed and can be shown that is beneficial for theperformance of the estimators (after that the chains reach a good mixing) [56]. More generally,one can consider a time-varying βn,t (where t is the iteration of the n-th chain). In the firstiterations, one could use βn,t < 1 for fostering the exploration of the state space and helping themixing of the chain. Then, in the last iterations, one could use βn,t > 1 which increases theefficiency of the resulting IS estimators [56].

5 Vertical likelihood representations

5.1 Lebesgue representations of the marginal likelihood

5.1.1 First one-dimensional representation

The D-dimensional integral Z =∫X `(y|x)g(x)dx can be turned into a one-dimensional integral

using an extended space representation. Namely, we can write

Z =

X`(y|x)g(x)dx (153)

=

Xg(x)dx

∫ `(y|x)

0

dλ (extended space representation) (154)

=

Xg(x)dx

∫ ∞

0

I{0 < λ < `(y|x)}dλ (155)

35

Page 36: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

where I{0 < λ < `(y|x)} is an indicator function valuing 1 if λ ∈ [0, `(y|x)] and 0 otherwise.Switching the integration order, we obtain

Z =

∫ ∞

0

Xg(x)I{0 < λ < `(y|x)}dx (156)

=

∫ ∞

0

`(y|x)>λ

g(x)dx (157)

=

∫ ∞

0

Z(λ)dλ =

∫ sup `(y|x)

0

Z(λ)dλ, (158)

where we have set

Z(λ) =

`(y|x)>λ

g(x)dx. (159)

In Eq. (158), we have also assumed that `(y|x) is bounded so the limit of integration is sup `(y|x).

5.1.2 The survival function Z(λ) and related sampling procedures

The function above Z(λ) : R+ → [0, 1] is the mass of the prior restricted to the set {x : `(y|x) > λ}.Note also that

Z(λ) = P (λ < `(y|X)) , where X ∼ g(x). (160)

Moreover, we have that Z(λ) ∈ [0, 1] with Z(0) = 1 and Z(λ′) = 0 for all λ′ ≥ sup `(y|x), and itis also non-increasing. Therefore, Z(λ) is a survival function, i.e.,

F (λ) = 1− Z(λ) = P (`(y|X) < λ) = P (Λ < λ) , (161)

is the cumulative distribution of the random variable Λ = `(y|X) with X ∼ g(x) [58, 74].

Sampling according to F (λ) = 1− Z(λ). Since Λ = `(y|X) with X ∼ g(x), the following

procedure generates samples λn from dF (λ)dλ

:

1. Draw xn ∼ g(x), for n = 1, ..., N .

2. Set λn = `(y|xn), , for all n = 1, ..., N .

Recalling the inversion method [58, Chapter 2], note also that the corresponding values

bn = F (λn) ∼ U([0, 1]), (162)

i.e., they are uniformly distributed in [0, 1]. Since Z(λ) = 1− F (λ), and since V = 1− U is alsouniformly distributed U([0, 1]) if U ∼ U([0, 1]), then

an = Z(λn) ∼ U([0, 1]). (163)

36

Page 37: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

In summary, finally we have that

if xn ∼ g(x), and λn = `(y|xn) ∼ F (λ) then an = Z(λn) ∼ U([0, 1]). (164)

The truncated prior pdf. Note that Z(λ) is also the normalizing constant of the followingtruncated prior pdf

g(x|λ) =1

Z(λ)I{`(y|x) > λ}g(x), (165)

where g(x|0) = g(x) and g(x|λ) for λ > 0. Two graphical examples of g(x|λ) and Z(λ) are givenin Figure 2.

PriorLikelihoodZ( )

(a)

PriorLikelihoodZ( )

(b)

Figure 2: Two examples of the area below the truncated prior g(x|λ), i.e., the function Z(λ).Note that in figure (b) the value of λ is greater than in figure (a), so that the area Z(λ) decreases.If λ is bigger than the maximum of the likelihood function then Z(λ) = 0.

Sampling from g(x|λ) and F (λ|λ0). Given a fixed value λ0 ≥ 0, in order to generate samplesfrom g(x|λ0) one alternative is to use an MCMC procedure. However, in this case, the followingacceptance-rejection procedure can be also employed [58]:

1. For n = 1, ..., N :

(a) Draw x′ ∼ g(x).

(b) if `(y|x′) > λ0 then set xn = x′ and λn = `(y|x′).(c) if `(y|x′) ≤ λ0, then reject x′ and repeat from step 1(a).

2. Return {xn}Nn=1 and {λn}Nn=1.

Note that xn ∼ g(x|λ0), for all n = 1, ..., N and the probability of accepting a generated samplex′ is exactly Z(λ). The values λn = `(y|xn) where xn ∼ g(x|λ0), have the following truncatedcumulative distribution

F (λ|λ0) =F (λ)− F (λ0)

1− F (λ0), with λ ≥ λ0, (166)

37

Page 38: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

i.e., we can write λn ∼ F (λ|λ0).

Distribution of an = Z(λn) if λn ∼ F (λ|λ0). Considering the values λn = `(y|xn) wherexn ∼ g(x|λ0), then λn ∼ F (λ|λ0). Therefore, considering the values a0 = Z(λ0) ≤ 1 andan = Z(λn), with a similar argument used above in Eqs. (163)-(164) we can write

an ∼ U([0, a0]), (167)

an =ana0

∼ U([0, 1]), ∀n = 1, ..., N. (168)

In summary, with a0 = Z(λ0), we have that

if xn ∼ g(x|λ0) and λn = `(y|xn) ∼ F (λ|λ0), then Z(λn) ∼ U([0, a0]). (169)

and the ratio an = ana0∼ U([0, 1]).

Distributions amax. Let us consider λ1, ...., λn ∼ F (λ|λ0) and the minimum and maximumvalues

λmin = minnλn, amax = Z(λmin), and amax =

amaxa0

=Z(λmin)

Z(λ0). (170)

Let us recall an = ana0∼ U([0, 1]). Then, note that amax is maximum of N uniform random variables

a1, ..., aN ∼ U([0, 1]).

Then it is well-known that the cumulative distribution of the maximum value

amax = maxn

an ∼ B(N, 1),

is distributed according to a Beta distribution B(N, 1), i.e., Fmax(a) = aN and density fmax(a) =dFmax(a)

da= NaN−1 [58, Section 2.3.6]. In summary, we have

amax =Z(λmin)

Z(λ0)∼ B(N, 1), where λmin = min

nλn, and λn ∼ F (λ|λ0). (171)

This result is important for deriving the standard version of the nested sampling method, describedin the next section.

5.1.3 Second one-dimensional representation

Now let consider a specific area value a = Z(λ). The inverse function

Ψ(a) = Z−1(a) = sup{λ : Z(λ) > a}, (172)

38

Page 39: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

is also non-increasing. Note that Z(λ) > a if and only if λ < Ψ(a). Then, we can write

Z =

∫ ∞

0

Z(λ)dλ (173)

=

∫ ∞

0

∫ 1

0

I{a < Z(λ)}da (again the extended space “trick”) (174)

=

∫ 1

0

da

∫ ∞

0

I{u < Z(λ)}dλ (swicthing the integration order) (175)

=

∫ 1

0

da

∫ ∞

0

I{λ < Ψ(a)}dλ (using Z(λ) > a ⇐⇒ λ < Ψ(a)) (176)

=

∫ 1

0

Ψ(a)da. (177)

5.1.4 Summary of the one-dimensional representations

Thus, finally we have obtained two one-dimensional integrals for expressing the Bayesian evidenceZ,

Z =

∫ sup `(y|x)

0

Z(λ)dλ =

∫ 1

0

Ψ(a)da. (178)

Now that we have expressed the quantity Z as an integral of a function over R, we could thinkof applying simple quadrature: choose a grid of points in [0, sup `(y|x)] (λi > λi−1) or in [0, 1](ai > ai−1), evaluate Z(λ) or Ψ(a) and use the quadrature formulas

Z =I∑

i=1

(λi − λi−1)Z(λi), or (179)

Z =I∑

i=1

(ai − ai−1)Ψ(ai). (180)

However, this simple approach is not desirable since (i) the functions Z(λ) and Ψ(a) areintractable in most cases and (ii) they change much more rapidly over their domains than doesπ(x) = `(y|x)g(x), hence the quadrature approximation can have very bad performance, unlessthe grid of points is chosen with extreme care. Table 12 summarizes the one-dimensional expressionfor logZ and Z contained in this work. Clearly, in all of them, the integrand function depends,explicitly or implicitly, on the variable x.

5.2 Nested Sampling

Nested sampling is a technique for estimating the marginal likelihood that exploits the secondidentity in (178) [80, 14, 70]. Nested Sampling estimates Z by a quadrature using nodes (indecreasing order),

0 < a(I)max < · · · < a(1)

max < 1

39

Page 40: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Table 12: One-dimensional expression for logZ and Z. Note that, in all cases, the integrandfunction contains the dependence on x.

Method Expression Equations

power-posteriors logZ =∫ 1

0Eπ(x|β) [log `(y|x)] dβ (95)

vertical representation-1 Z =∫ sup `(y|x)

0Z(λ)dλ (158)-(159)

vertical representation-2 Z =∫ 1

0Ψ(a)da (177)

and the quadrature formula

Z =I−1∑

i=1

(a(i−1)max − a(i)

max)Ψ(a(i)max) =

I−1∑

i=1

(a(i−1)max − a(i)

max)λ(i)min, (181)

with a(0)max = 1. We have to specify the grid points a

(i)max’s (possibly well-located, with a suitable

strategy) and the corresponding values λ(i)min = Ψ(a

(i)max). Recall that the function Ψ(a), and its

inverse a = Ψ−1(λ) = Z(λ), are generally intractable, so that it is not even possible to evaluate

Ψ(a) at a grid of chosen a(i)max’s. The nested sampling algorithm works in the other way around: it

suitably selects the ordinates λ(i)min’s and find some approximations ai’s of the corresponding values

a(i)max = Z(λ

(i)min). This is possible since the distribution of a

(i)max is known.

5.2.1 Choice of λ(i)min and a

(i)max in nested sampling

Nested sampling employs an iterative procedure in order to generate an increasing sequence oflikelihood ordinates λ

(i)min, i = 1, ..., I, such that

λ(1)min < λ

(2)min < λ

(3)min.... < λ

(I)min. (182)

The details of the algorithm is given in Table 13 and it is based on the sampling of the truncatedprior pdf g(x|λ(i−1)

min ) (see Section 5.1.2), where i denotes the iteration index. The nested samplingprocedure is explained below:

• At the first iteration (i = 1), we set λ(0)min = 0 and a

(0)max = Z(λ

(0)min) = 1. Then, N samples

are drawn from the prior xn ∼ g(x|λ(0)min) = g(x) obtaining a cloud P = {xn}Nn=1 and then

set λn = `(y|xn), i.e., {λn}Nn=1 ∼ F (λ) as shown in Section 5.1.2. Thus, the first ordinate ischosen as

λ(1)min = min

nλn = min

n`(y|xn) = min

x∈P`(y|P).

Since {λn}Nn=1 ∼ F (λ), using the result in Eq. (171), we have that

a(1)max =

a(1)max

a(0)max

=Z(λ

(1)min)

Z(λ(0)min)∼ B(N, 1).

40

Page 41: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Since a(0)max = Z(λ

(0)min) = 1, then a

(1)max = a

(1)max ∼ B(N, 1). The corresponding x∗ =

arg minx∈P

`(y|P) is also removed from P , i.e., P = P\{x∗} (now |P| = N − 1).

• At a generic i-th iteration (i ≥ 2), a unique additional sample x′ is drawn from the truncated

prior g(x|λ(i−1)min ) and add to the current cloud of samples, i.e., P = P ∪ x′ (now again

|P| = N). First of all, note that the value λ′ = λn = `(y|x′) is distributed as F (λ|λ(i−1)min )

(see Section 5.1.2). More precisely, note that all the N ordinate values

{λn}Nn=1 = `(y|P) = {λn = `(y|xn) for all xn ∈ P}

are distributed as F (λ|λ(i−1)min ), i.e., {λn}Nn=1 ∼ F (λ|λ(i−1)

min ). This is due to how the populationP has been built in the previous iterations. Then, we choose the new ordinate value as

λ(i)min = min

nλn = min

x∈P`(y|P).

Moreover, since λ(i)min is the minimum value of {λ1, ..., λ} ∼ F (λ|λ(i−1)

min ), in Section 5.1.2 wehave seen that

a(i)max =

a(i)max

a(i−1)max

=Z(λ

(i)min)

Z(λ(i−1)min )

∼ B(N, 1), (183)

where we have used Eq. (171). We remove again the corresponding sample x∗ =arg min

x∈P`(y|P), i.e., we set P = P ∪ x′ and the procedure is repeated. Note that we have

found the recursiona(i)max = a(i)

maxa(i−1)max , (184)

for i = 1, ..., I and a(0)max = 1.

• A possible idea for approximating the random value a(i)max is to replace it with the expected

value of Beta distribution B(N, 1), i.e.,

a(i)max ≈ a1 =

N

N + 1≈ exp

(− 1

N

). (185)

where E[B(N, 1)] = NN+1

, and exp(− 1N

)becomes a very good approximation as N grows.

In that case, the recursion above becomes

a(i)max ≈ exp

(− 1

N

)a(i−1)max = exp

(− i

N

). (186)

Then we can use ai = exp(− iN

)as an approximation of a

(i)max.

The intuition behind the iterative approach above is to accumulate more ordinates λi close to thesup `(y|x). They are also more dense around sup `(y|x). Moreover, using this scheme, we can

provide an approximation ai of a(i)max since we know the distribution of a

(i)max.

41

Page 42: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

Table 13: The standard Nested Sampling procedure.

1. Choose N and set a0 = 1.

2. Draw {xn}Nn=1 ∼ g(x) and define the set P = {xn}Nn=1. Let us also define the notation

`(y|P) = {λn = `(y|xn) for all xn ∈ P}, (187)

3. Set λ(1)min = min

x∈P`(y|P) and x∗ = arg min

x∈P`(y|P).

4. Set P = P\{x∗}, i.e., eliminate x∗ from P .

5. Find an approximation a1 of a(1)max = Z(λ

(1)min). One usual choice is a1 = exp

(− 1N

).

6. For i = 2, .., I :

(a) Draw x′ ∼ g(x|λ(i−1)min ) and add to the current cloud of samples, i.e., P = P ∪ x′.

(b) Set λ(i)min = min

x∈P`(y|P) and x∗ = arg min

x∈P`(y|P).

(c) Set P = P\{x∗}.

(d) Find an approximation ai of a(i)max = Z(λ

(i)min). One usual choice is

ai = exp

(− i

N

), (188)

The rationale behind this choice is explained in the sections above.

7. Return

Z =I∑

i=1

(ai−1 − ai)λ(i)min =

I∑

i=1

(e−i−1N − e−

iN )λ

(i)min. (189)

5.2.2 Further considerations

Perhaps, the most critical task of the nested sampling implementation consists in drawing fromthe truncated priors. For this purpose, one can use a rejection sampling or an MCMC scheme. Inthe first case, we drawn from the prior and then accept only the samples x′ such that `(y|x′) > λ.However, as λ grows, its performance deteriorates since the acceptance probability gets smaller andsmaller. The MCMC algorithms could also have poor performance due to the sample correlation,specially when the support of the constrained prior is formed by disjoint regions or distant modes.Moreover, in the derivation of the standard nested sampling method we have considered differentapproximations. First of all, for each likelihood value λi, its corresponding ai = Ψ−1(λi) isapproximated by taking the expected value of a Beta random variable. Then this expected valueis again approximated with an exponential function in Eq. (185). This step could be avoided,

42

Page 43: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

keeping directly NN+1

. The simplicity of the final formula ai = exp(− iN

)is perhaps the reason of

using the approximation NN+1≈ exp

(− 1N

). A further approximation E[a

(i)max] ≈ E[a

(i)max]E[a

(i−1)max ] is

also applied. Additionally, if an MCMC method is run for sampling from the constrained prior,also the likelihood values λi are in some sense approximated due to the possible burn-in period ofthe chain.

5.3 Generalized Importance Sampling based on verticalrepresentations

Let us recall the two possible IS estimators with proposal density q(x),

Z =1

N

N∑

n=1

ρn`(y|xn), and Z =N∑

n=1

ρn`(y|xn), {xn}Nn=1 ∼ q(x), (190)

where ρn = g(xn)q(xn)

and ρn = ρn∑Nn=1 ρn

. In [70], the authors consider the use of the following proposal

pdf

q(x) = πw(x) =g(x)W (`(y|x))

Zw, (191)

where the function W (λ) : R+ → R+ is defined by the user. Using q(x) = πw(x) leads to theweights of the form

ρn =g(xn)

πw(xn)=

1

W (`(y|xn)), xn ∼ πw(x). (192)

Note that choosing W (λ) = λ we have W (`(y|x)) = `(y|x), and q(x) = π(x), recovering the

harmonic mean estimator. With W (λ) = λβ, we have W (`(y|x)) = `(y|x)β and q(x) = g(x)`(y|x)β

Z(β),

recovering the method in Section 3.3 that uses a power posterior as a proposal pdf. Notethat also nested sampling can be included in Eqs. (190), i.e., Z = 1

N

∑Ii=1 ρn`(y|xn), where

ρi = N(e−n−1N − e− n

N ), that is, a weighted sum of likelihood values λn = `(y|xn).

6 Bayes factors with improper priors

So far we have considered proper priors, i.e.,∫X g(x)dx = 1 < ∞. The use of improper priors

is common in Bayesian inference to represent weak prior information. Consider g(x) ∝ h(x)where h(x) is a function whose integral over the state space does not converge,

∫X g(x)dx =∫

X h(x)dx = ∞. In that case, g(x) is not completely specified. Indeed, we can have differentdefinitions g(x) = ch(x) where c > 0 is the “normalizing” constant, not uniquely determinatesince c formally does not exist. Regarding the parameter inference and posterior definition, the

43

Page 44: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

use of improper priors poses no problems as long as∫X `(y|x)h(x)dx <∞, indeed

π(x) =1

Zπ(x) =

`(y|x)ch(x)∫X `(y|x)ch(x)dx

=`(y|x)h(x)∫

X `(y|x)h(x)dx, (193)

=1

Z`(y|x)h(x) (194)

where Z =∫X `(y|x)g(x)dx, Z =

∫X `(y|x)h(x)dx and Z = cZ. Note that the unspecified

constant c > 0 is canceled out. However, the issue is not solved when we compare differentmodels. For instance, the Bayes factors depend on the undetermined constants c1, c2 > 0 [81],

BF(y) =c1

c2

∫X1`1(y|x1)h1(x1)dx1∫

X2`2(y|x2)h2(x2)dx2

=Z1

Z2

=c1Z1

c2Z2

, (195)

so that different choices of c1, c2 provide different preferable models. There exists variousapproaches for dealing with this issue. Below we describe some relevant ones.

Partial Bayes Factors. The idea behind the partial Bayes factors consists of using a subset ofdata to build proper priors and, jointly with the remaining data, they are used to calculate theBayes factors. The method starts by dividing the data in two subsets, y = (ytrain,ytest). The firstpart ytrain will be used to obtain partial posterior distributions

gm(xm|ytrain) =cm

Z(m)train

`m(ytrain|xm)hm(xm), (196)

using the improper priors. Note that

Z(m)train = cm

Xm`m(ytrain|xm)hm(xm)dxm.

Then, these posteriors can be employed as prior distributions. The complete posterior of m-thmodel is

πm(x|y) = πm(x|ytest,ytrain) =1

Zm`m(y|xm)hm(xm).

Considering the conditional likelihood `m(ytest|xm,ytrain) of the remaining data ytest, so that wecan express the complete posterior as

πm(x|y) =1

Z(m)test|train

`m(ytest|xm,ytrain)gm(xm|ytrain), (197)

where gm(xm|ytrain) plays the role of a prior pdf. Replacing the expression of gm(xm|ytrain), wefinally have

πm(x|y) =cm

Z(m)test|trainZ

(m)train

`m(ytest|xm,ytrain)`m(ytrain|xm)hm(xm) (198)

44

Page 45: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

where Zm = Z(m)test|trainZ

(m)train, hence

Z(m)test|train =

Zm

Z(m)train

=

Xm`m(ytest|xm,ytrain)gm(xm|ytrain)dxm. (199)

Therefore, considering the partial posteriors gm(xm|ytrain) as proper priors, we can define thefollowing partial Bayes factor

BF(ytest|ytrain) =Z

(1)test|train

Z(2)test|train

=

Z1

Z(1)train

Z2

Z(2)train

(200)

=Z1

Z2

Z(1)train

Z(2)train

=BF(y)

BF(ytrain). (“Bayes law for Bayes Factors”). (201)

The last expression does not depend on c1, c2, since

BF(ytest|ytrain) =BF(y)

BF(ytrain)=

c1Z1

c2Z2

c1Z(1)train

c2Z(2)train

=Z1

Z2

Z(2)train

Z(1)train

. (202)

Therefore, one can approximate firstly BF(ytrain), secondly BF(y) and then compare the modelusing the partial Bayes factor BF(ytest|ytrain).

Remark. The trick here consists in computing two normalizing constants for each model, insteadof only one. The first normalizing constant is used for building an auxiliary normalized prior,depending on ytrain.

A training dataset ytrain is proper if∫Xm `m(ytrain|xm)hm(xi)dxm < ∞ for all models, and it

is called minimal if is proper and no subset of ytrain is proper. If we use actually proper priordensities, the minimal training dataset is the empty set and the fractional Bayes factor reducesto the classical Bayes factor. However, the main drawback of the partial Bayes factor approach isthe dependence on the choice of ytrain (which could affect the selection of the model). The authorssuggest to find the minimal suitable training set ytrain, but this task is not straightforward. Twoalternatives in the literature have been proposed, the fractional Bayes factors and the intrinsicBayes factors.

Fractional Bayes Factors [69]. Instead of using a training data, it is possible to use powerposteriors, i.e.,

FBF(y) =BF(y)

BF(y|β), (203)

where the denominator is

BF(y|β) =

∫X1`1(y|x1)βg1(x1)dx1∫

X2`2(y|x2)βg2(x2)dx2

=c1

∫X1`1(y|x1)βh1(x1)dx1

c2

∫X2`2(y|x2)βh2(x2)dx2

. (204)

45

Page 46: Marginal likelihood computation for model selection and ...vixra.org/pdf/2001.0052v1.pdf · Keywords: Marginal likelihood, Bayesian evidence, numerical integration, model selection,

with 0 < β ≤ 1, and BF(y|1) = BF(y). Note that the value β = 0 is not admissible since∫Xm hm(xm)dxm =∞ for m = 1, 2. Again, since both BF(y) and BF(y|β) depend on the ratio c1

c2,

the fractional Bayes factor FBF(y) is independent on c1 and c2 by definition.

Intrinsic Bayes factors [3]. The partial Bayes factor (200) will depend on the choice of(minimal) training set ytrain. These authors solve the problem of choosing the training sample byaveraging the partial Bayes factor over all possible minimal training sets. They suggest to use thearithmetic mean, leading to the arithmetic intrinsic Bayes factor, or the geometric mean, leadingto the geometric intrinsic Bayes factor.

7 Theoretical and empirical comparisons

In this section, we illustrate the performance of different marginal likelihood estimators in differentexperiments. In Section 7.1, first we compare theoretically the variance of IS and RIS estimatorsin a one-dimensional example, which allows to discuss some important features required by theproposal and auxiliary densities (e.g., the conditions regarding the tails of these pdfs). In a secondpart of Section 7.1, we compare the Mean Square Error (MSE) and bias of IS an RIS estimators vianumerical simulations. In Section 7.2 we test several estimators in a nonlinear regression problemwith real data, where the likelihood function has highly non-elliptical contours.

7.1 Theoretical and empirical comparison of IS and RIS

In this example, our goal is to compare, theoretically and by numerical simulations, the standard IS and RIS schemes for estimating the normalizing constant of the Gaussian target π(x) = exp(−x²/2). We know the ground-truth Z = \int_{-\infty}^{\infty} π(x) dx = \sqrt{2\pi}, so that \bar{π}(x) = π(x)/Z = N(x|0, 1). The standard IS estimator of Z with importance density q(x) and the RIS estimator with auxiliary density f(x) are

\widehat{Z}_{IS} = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(x_i)}{q(x_i)}, \quad x_i \sim q(x), \qquad \widehat{Z}_{RIS} = \frac{1}{\frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{\pi(x_i)}}, \quad x_i \sim \bar{\pi}(x).    (205)

For a fair theoretical and empirical comparison, we consider

q(x) = f(x) = N(x|0, h^2) = \frac{1}{\sqrt{2\pi h^2}} \exp\left(-\frac{x^2}{2h^2}\right),    (206)

where h > 0 is the standard deviation. We wish to study the performance of the two estimators as h varies.
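
A minimal sketch of the two estimators in Eq. (205), with the Gaussian choice of Eq. (206), could read as follows (the values of N, h, and the random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, h = 500, 1.3

def pi_tilde(x):
    """Unnormalized target pi(x) = exp(-x^2/2); the true constant is Z = sqrt(2*pi)."""
    return np.exp(-0.5 * x**2)

def q_pdf(x):
    """Proposal / auxiliary density q(x) = f(x) = N(x | 0, h^2), Eq. (206)."""
    return np.exp(-0.5 * (x / h)**2) / np.sqrt(2 * np.pi * h**2)

# Standard IS estimator: samples from q(x).
x_q = rng.normal(0.0, h, size=N)
Z_IS = np.mean(pi_tilde(x_q) / q_pdf(x_q))

# RIS estimator: samples from the normalized target (here drawn exactly).
x_pi = rng.normal(0.0, 1.0, size=N)
Z_RIS = 1.0 / np.mean(q_pdf(x_pi) / pi_tilde(x_pi))

print(Z_IS, Z_RIS, np.sqrt(2 * np.pi))
```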


7.1.1 Theoretical analysis

We compute the variances of the estimators \widehat{Z}_{IS} and \widehat{Z}_{RIS} as functions of h, starting with \widehat{Z}_{IS}. Note that, by the i.i.d. assumption, we can write

var[\widehat{Z}_{IS}] = \frac{1}{N} var_q\left[\frac{\pi(x)}{q(x)}\right] = \frac{1}{N} \left\{ E_q\left[\left(\frac{\pi(x)}{q(x)}\right)^2\right] - Z^2 \right\},    (207)

Substituting π(x) = exp(−x²/2) and q(x) = \frac{1}{\sqrt{2\pi h^2}} \exp\left(-\frac{x^2}{2h^2}\right), we obtain

E_q\left[\left(\frac{\pi(x)}{q(x)}\right)^2\right] = \int_{-\infty}^{\infty} \left(\frac{\pi(x)}{q(x)}\right)^2 q(x) \, dx    (208)

= \int_{-\infty}^{\infty} \frac{\pi(x)^2}{q(x)} \, dx    (209)

= \sqrt{2\pi h^2} \int_{-\infty}^{\infty} \exp\left\{ -\left(1 - \frac{1}{2h^2}\right) x^2 \right\} dx    (210)

= \frac{2\pi h}{\sqrt{2 - \frac{1}{h^2}}}.    (211)

Replacing the last expression above in Eq. (207), we obtain that the variance of ZIS is given by

var_q\left[\widehat{Z}_{IS}\right] = \frac{2\pi}{N} \left[ \frac{h}{\sqrt{2 - \frac{1}{h^2}}} - 1 \right].    (212)

This variance reaches its minimum, var[\widehat{Z}_{IS}] = 0, at h = 1, i.e., when the proposal is exactly the posterior, q(x) = N(x|0, 1) = \bar{\pi}(x), as expected. For 1/\sqrt{2} < h < 1, var[\widehat{Z}_{IS}] grows rapidly as h decreases, becoming infinite at h = 1/\sqrt{2}; for 0 < h < 1/\sqrt{2}, var[\widehat{Z}_{IS}] is not finite. Finally, var[\widehat{Z}_{IS}] grows (asymptotically linearly) from h = 1 onwards, i.e., it diverges to infinity as h → ∞. Figure 3-(a) shows this behavior, with N = 500. Note that changing the value of N simply scales the whole curve. Clearly, this is perfectly in line with the well-known theoretical requirement that the proposal pdf must have fatter tails than the posterior density in an IS scheme. Moreover, this confirms that the use of proposals with variance bigger than that of the target is generally not catastrophic, whereas the opposite choice can yield catastrophic results. Recall also that E_q[\widehat{Z}_{IS}] = Z, i.e., the bias of \widehat{Z}_{IS} is zero.
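
The closed form in Eq. (212) can be cross-checked numerically; the short sketch below (our own check, not part of the original experiments) compares it with a direct quadrature of the second moment in Eq. (209) for an arbitrary value h = 1.2 > 1/\sqrt{2}.

```python
import numpy as np
from scipy.integrate import quad

N, h = 500, 1.2
Z = np.sqrt(2 * np.pi)

def integrand(x):
    """pi(x)^2 / q(x), the integrand of Eq. (209)."""
    pi_t = np.exp(-0.5 * x**2)
    q = np.exp(-0.5 * (x / h)**2) / np.sqrt(2 * np.pi * h**2)
    return pi_t**2 / q

second_moment = quad(integrand, -np.inf, np.inf)[0]                # Eqs. (209)-(211)
var_numeric = (second_moment - Z**2) / N                           # Eq. (207)
var_closed = (2 * np.pi / N) * (h / np.sqrt(2 - 1 / h**2) - 1)     # Eq. (212)
print(var_numeric, var_closed)                                     # the two values agree
```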

Regarding RIS, it is easier to compute the variance of \widehat{r} = 1/\widehat{Z}_{RIS}, rather than that of \widehat{Z}_{RIS} itself. Namely, we consider the estimator \widehat{r} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{\pi(x_i)}, with x_i \sim \bar{\pi}(x), which is an unbiased estimator of 1/Z. Since the x_i's are i.i.d. samples from \bar{\pi}(x), we have

var_{\bar{\pi}}[\widehat{r}] = var_{\bar{\pi}}\left[\frac{1}{\widehat{Z}_{RIS}}\right] = \frac{1}{N} var_{\bar{\pi}}\left[\frac{f(x)}{\pi(x)}\right]    (213)

= \frac{1}{N} \left\{ E_{\bar{\pi}}\left[\left(\frac{f(x)}{\pi(x)}\right)^2\right] - \frac{1}{Z^2} \right\},    (214)


Substituting π(x) = exp(−x²/2) and f(x) = \frac{1}{\sqrt{2\pi h^2}} \exp\left(-\frac{x^2}{2h^2}\right), we obtain

E_{\bar{\pi}}\left[\left(\frac{f(x)}{\pi(x)}\right)^2\right] = \int_{-\infty}^{\infty} \left(\frac{f(x)}{\pi(x)}\right)^2 \frac{\pi(x)}{Z} \, dx    (215)

= \frac{1}{Z} \int_{-\infty}^{\infty} \frac{f(x)^2}{\pi(x)} \, dx    (216)

= \frac{1}{\sqrt{2\pi}} \, \frac{1}{2\pi h^2} \int_{-\infty}^{\infty} \exp\left\{ -\left(\frac{1}{h^2} - \frac{1}{2}\right) x^2 \right\} dx    (217)

= \frac{1}{2\pi h^2 \sqrt{\frac{2}{h^2} - 1}}.    (218)

Hence the variance of r is given by

var_{\bar{\pi}}[\widehat{r}] = var_{\bar{\pi}}\left[\frac{1}{\widehat{Z}_{RIS}}\right] = \frac{1}{2\pi N} \left[ \frac{1}{h^2 \sqrt{\frac{2}{h^2} - 1}} - 1 \right],    (219)

which reaches its minimum, var_{\bar{\pi}}[\widehat{r}] = 0, at h = 1, as expected. Note that var_{\bar{\pi}}[\widehat{r}] is finite only for 0 < h < \sqrt{2} (there are two vertical asymptotes, at h = 0 and h = \sqrt{2}). Moreover, var_{\bar{\pi}}[\widehat{r}] grows more quickly in 1 < h < \sqrt{2} than in 0 < h < 1. In Figure 3-(b), we show var_{\bar{\pi}}[1/\widehat{Z}_{RIS}] for N = 500. Again, the effect of the number of samples N is simply a scaling of the curve. Indeed, \widehat{r} = 1/\widehat{Z}_{RIS} behaves like an IS estimator in which \bar{\pi}(x) plays the role of the proposal (in the denominator) and f(x) that of the target (in the numerator): the variance blows up when the denominator has much lighter tails than the numerator. Then, we see that f(x) should have non-zero variance and tails not much heavier than those of \bar{\pi}(x), in order to avoid an infinite variance of the resulting estimator \widehat{r} = 1/\widehat{Z}_{RIS}. We can observe two vertical asymptotes when we analyze var_{\bar{\pi}}[1/\widehat{Z}_{RIS}]. However, note that var[\widehat{Z}_{RIS}] has just one vertical asymptote, at h = 0, as shown below.
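
The following sketch evaluates the two closed-form expressions, Eqs. (212) and (219), over a grid of h values, reproducing the qualitative behavior shown in Figure 3 (the grid limits and N = 500 are arbitrary choices):

```python
import numpy as np

N = 500
h = np.linspace(0.2, 1.9, 500)

def var_IS(h):
    """Closed form of Eq. (212); finite only for h > 1/sqrt(2)."""
    out = np.full_like(h, np.inf)
    ok = h > 1 / np.sqrt(2)
    out[ok] = (2 * np.pi / N) * (h[ok] / np.sqrt(2 - 1 / h[ok]**2) - 1)
    return out

def var_r(h):
    """Closed form of Eq. (219); finite only for 0 < h < sqrt(2)."""
    out = np.full_like(h, np.inf)
    ok = (h > 0) & (h < np.sqrt(2))
    out[ok] = (1 / (2 * np.pi * N)) * (1 / (h[ok]**2 * np.sqrt(2 / h[ok]**2 - 1)) - 1)
    return out

# Both curves attain their minimum (zero variance) at h = 1, as in Figure 3.
print(h[np.argmin(var_IS(h))], h[np.argmin(var_r(h))])
```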


Figure 3: The variances var[\widehat{Z}_{IS}] and var[1/\widehat{Z}_{RIS}] in Eqs. (212) and (219), respectively, as functions of h (N = 500).


7.1.2 Numerical analysis

Setting N = 500, we compute numerically the mean square error (MSE) of both \widehat{Z}_{IS} and \widehat{Z}_{RIS}, and the variance and bias of \widehat{Z}_{RIS}, averaging the results over 5000 independent runs. We show the results in Figure 4. In Figure 4-(a), we show the bias and variance of the estimator \widehat{Z}_{RIS}. Note that its variance has only one asymptote, at h = 0, instead of two, unlike the variance of \widehat{r} = 1/\widehat{Z}_{RIS}. The bias also diverges at h = 0. Observe also that the bias is negligible for 0.1 < h < 1.6, with respect to the value of the variance. In Figure 4-(b), we can see that the MSE of \widehat{Z}_{IS} matches its theoretical variance shown in Figure 3, as expected since \widehat{Z}_{IS} has zero bias, hence MSE(\widehat{Z}_{IS}) = var(\widehat{Z}_{IS}). Although \widehat{Z}_{RIS} is not unbiased, we see that its MSE, also shown in Figure 4-(b), is virtually identical to its variance shown in Figure 4-(a), since the bias is negligible for most values of h.
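
A sketch of the type of Monte Carlo study behind Figure 4, for a single value of h, is given below (our own illustrative code; the values of h, the seed, and the number of runs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, runs, h = 500, 5000, 1.2
Z_true = np.sqrt(2 * np.pi)

pi_tilde = lambda x: np.exp(-0.5 * x**2)                                  # unnormalized target
f_pdf = lambda x: np.exp(-0.5 * (x / h)**2) / np.sqrt(2 * np.pi * h**2)   # q = f, Eq. (206)

Z_IS, Z_RIS = np.empty(runs), np.empty(runs)
for r in range(runs):
    x_q = rng.normal(0.0, h, size=N)                  # samples from q
    Z_IS[r] = np.mean(pi_tilde(x_q) / f_pdf(x_q))
    x_pi = rng.normal(0.0, 1.0, size=N)               # samples from the posterior
    Z_RIS[r] = 1.0 / np.mean(f_pdf(x_pi) / pi_tilde(x_pi))

print("MSE IS  :", np.mean((Z_IS - Z_true)**2))
print("MSE RIS :", np.mean((Z_RIS - Z_true)**2))
print("bias RIS:", np.mean(Z_RIS) - Z_true)
print("var RIS :", np.var(Z_RIS))
```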


Figure 4: (a) Bias (dashed line) and variance (solid line) of \widehat{Z}_{RIS} as a function of h (N = 500). (b) MSE of \widehat{Z}_{IS} (solid line) and \widehat{Z}_{RIS} (dashed line) as a function of h (N = 500).

7.2 Experiment with biochemical oxygen demand data

We consider a numerical experiment also studied in [19]: a nonlinear regression problem modeling biochemical oxygen demand (BOD) data as a function of time. The outcome variable Yi = BOD (mg/L) is modeled in terms of ti = time (days) as

Y_i = x_1(1 − e^{−x_2 t_i}) + ε_i, \quad i = 1, \ldots, 6,    (220)

where the ε_i's are independent N(0, σ²) errors, hence Y_i ~ N(x_1(1 − e^{−x_2 t_i}), σ²). The data y ≡ {y_i}_{i=1}^{6}, measured at time instants {t_i}_{i=1}^{6}, are shown in Table 14 below.

The goal is to compute the normalizing constant of the posterior of x = (x_1, x_2) given the data y. Following [19], we consider uniform priors, x_1 ~ U([0, 60]) and x_2 ~ U([0, 6]), i.e., g_1(x_1) = 1/60 for x_1 ∈ [0, 60] and g_2(x_2) = 1/6 for x_2 ∈ [0, 6]. Moreover, we consider an improper prior for


Table 14: Data of the numerical experiment in Section 7.2.

ti (days)    yi (mg/L)
1            8.3
2            10.3
3            19.0
4            16.0
5            15.6
7            19.8

σ, g_3(σ) ∝ 1/σ. However, we will integrate out the variable σ. Indeed, the two-dimensional target π(x) = π(x_1, x_2) results from marginalizing

π(x_1, x_2, σ) = \ell(y|x_1, x_2, σ) \, g_1(x_1) \, g_2(x_2) \, g_3(σ)

w.r.t. σ, namely we obtain

π(x) = \int π(x_1, x_2, σ) \, dσ = \ell(y|x_1, x_2) \, g_1(x_1) \, g_2(x_2)    (221)

= \frac{1}{60} \, \frac{1}{6} \, \frac{1}{\pi^3} \, \frac{8}{\left\{ \sum_{i=1}^{6} \left[ y_i - x_1(1 - \exp(-x_2 t_i)) \right]^2 \right\}^{3}}, \quad (x_1, x_2) \in [0, 60] \times [0, 6],    (222)

for which we want to compute the normalizing constant Z = \int π(x) dx. The derivation is given in the Supplementary Material. The true value (ground-truth) is log Z = −16.208, considering the data in Table 14. As in [19], we compare the relative mean absolute error

\frac{E\left[ |\widehat{Z} - Z| \right]}{Z},

obtained by different methods:

• the naive Monte Carlo estimator,

• a modified, more sophisticated version of the Laplace method given in [19],

• a “crude” Laplace scheme (using the sample mean and sample covariance of MCMC samples from \bar{π}),

• the HM estimator of Eq. 46,

• the RIS estimator with f(x) = N(x|µ, Σ), where µ and Σ are the mean and covariance of the MCMC samples from \bar{π} (denoted as RIS in Table 15),

• another RIS scheme where f(x) is obtained by a KDE with K = 4 clusters and h = 0 (in a similar fashion to Eq. (152)),


• an IS estimator with a Gaussian proposal pdf q(x) = N(x|µ, Σ), where again µ and Σ are the mean and covariance of the MCMC samples from the posterior,

• and, finally, a CLAIS scheme with K = 2 and h = 0, i.e., as in Eq. (152).

Table 15: Relative error, and its corresponding standard error, in estimating the marginal likelihood with eight methods.

              Naive    Laplace (soph)   Laplace   HM      RIS     RIS-kde   IS      CLAIS
RE            0.057    0.181            0.553     0.823   0.265   0.140     0.084   0.082
std err       0.001    0.013            0.003     0.018   0.006   0.004     0.015   0.014
comments      —        see [19]         —         —       —       K = 4     —       K = 2

To obtain the samples from the posterior, we run T = 10000 iterations of a Metropolis-Hastings algorithm, using the prior as an independent proposal pdf. Since CLAIS draws additional samples from q(x) in the lower layer, in order to provide a fair comparison, in CLAIS we consider N = 1, T' = T/2 = 5000, and 5000 additional samples in the lower layer. We averaged the relative error over 1000 independent runs. Our results are shown in Table 15.
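
The main ingredients of this experiment can be sketched as follows (our own illustrative code, not the authors' implementation; the constant factor follows Eq. (222) as written, and the seed, starting point, and number of prior samples are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])        # time instants (Table 14)
y = np.array([8.3, 10.3, 19.0, 16.0, 15.6, 19.8])   # BOD measurements (Table 14)

def log_target(x1, x2):
    """log of the unnormalized target pi(x1, x2) in Eq. (222), defined on [0,60] x [0,6]."""
    if not (0.0 <= x1 <= 60.0 and 0.0 <= x2 <= 6.0):
        return -np.inf
    S = np.sum((y - x1 * (1.0 - np.exp(-x2 * t)))**2)
    return np.log(1 / 60) + np.log(1 / 6) + np.log(8.0) - 3 * np.log(np.pi) - 3 * np.log(S)

# Naive Monte Carlo: Z = E_prior[ l(y|x1,x2) ], with the sigma-marginalized likelihood of Eq. (222).
M = 200_000
x1s, x2s = rng.uniform(0, 60, M), rng.uniform(0, 6, M)
S = np.sum((y[None, :] - x1s[:, None] * (1 - np.exp(-x2s[:, None] * t[None, :])))**2, axis=1)
like = 8.0 / (np.pi**3 * S**3)
print("naive MC estimate of log Z:", np.log(np.mean(like)))

# Metropolis-Hastings with the (uniform) prior as an independent proposal.
T = 10_000
x, lp = np.array([30.0, 3.0]), log_target(30.0, 3.0)
chain = np.empty((T, 2))
for i in range(T):
    prop = np.array([rng.uniform(0, 60), rng.uniform(0, 6)])
    lp_prop = log_target(*prop)
    # The proposal density is constant on the support, so it cancels in the acceptance ratio.
    if np.log(rng.uniform()) < lp_prop - lp:
        x, lp = prop, lp_prop
    chain[i] = x

# Posterior mean and covariance of the chain: the ingredients of the Laplace, RIS and IS schemes.
mu, Sigma = chain.mean(axis=0), np.cov(chain.T)
print("posterior mean:", mu)
```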

In this example, and with these priors, the results show that the best performing estimator is the naive Monte Carlo scheme, since the prior and the likelihood have an ample overlapping region of probability mass. However, the naive Monte Carlo scheme is generally inefficient when there is a small overlap between likelihood and prior. Note also that IS and CLAIS provide good performance. The worst performance is provided by the HM estimator.

8 Final discussion

In this work, we have provided an exhaustive review of the techniques for marginal likelihood computation with the purpose of model selection and hypothesis testing. Methods for approximating ratios of normalizing constants have also been described. The careful use of improper priors in the Bayesian setting has been discussed. Most of the presented techniques are based on the importance sampling (IS) approach, but many also require the use of MCMC algorithms. Table 16 summarizes some methods for estimating Z that can be employed when N samples from the posterior are available. This table is intended for readers who wish to obtain samples {x_n}_{n=1}^{N} by an MCMC method with invariant pdf \bar{π}(x) (without any tempering or sequence of densities) and, at the same time, to approximate Z using {x_n}_{n=1}^{N}. Clearly, this table provides only a subset of all the possible techniques: they can be considered the simplest schemes, in the sense that they do not use any tempering strategy or sequence of densities. We also recall that AIC and DIC are commonly used for model comparison, although they do not directly target the actual marginal likelihood.


Table 16: Schemes for estimating Z, after N samples have been generated by an MCMC algorithm with invariant distribution \bar{π}(x).

Method           Section   Need of drawing additional samples     Comments
Laplace          2         —                                      use MCMC for estimating x_MAP
BIC              2         —                                      use MCMC for estimating x_MLE
KDE              2         —                                      use MCMC for generating samples
Chib's method    2         additional samples are required
                           if the proposal is not independent
RIS              3         —                                      the HM estimator is a special case
MTM              4.2       —                                      provides two estimators of Z
[78]             4.3       —                                      related to LAIS
LAIS             4.3       with \bar{π} in the upper layer
Below: for model selection, but do not approximate the marginal likelihood
AIC              2         —                                      use MCMC for estimating x_MLE
DIC              2         —                                      use MCMC for estimating c_p and x

References

[1] D. Ardia, N. Basturk, L. Hoogerheide, and H. K. Van Dijk. A comparative study of Monte Carlo methods for efficient evaluation of marginal likelihood. Computational Statistics & Data Analysis, 56(11):3398–3414, 2012.

[2] V. Balasubramanian. Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9(2):349–368, 1997.

[3] J. O. Berger and L. R. Pericchi. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433):109–122, 1996.

[4] C. S. Bos. A comparison of marginal likelihood computation methods. In Compstat, pages 111–116. Springer, 2002.

[5] M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez, and P. M. Djuric. Adaptive importance sampling: The past, the present, and the future. IEEE Signal Processing Magazine, 34(4):60–79, 2017.

[6] M. F. Bugallo, L. Martino, and J. Corander. Adaptive importance sampling in signal processing. Digital Signal Processing, 47:36–49, 2015.

[7] O. Cappe, A. Guillin, J. M. Marin, and C. P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.


[8] B. P. Carlin and S. Chib. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Methodological), 57(3):473–484, 1995.

[9] M.-H. Chen, Q.-M. Shao, et al. On Monte Carlo methods for estimating ratios of normalizing constants. The Annals of Statistics, 25(4):1563–1594, 1997.

[10] M. H. Chen, Q. M. Shao, and J. G. Ibrahim. Monte Carlo methods in Bayesian computation. Springer, 2012.

[11] S. Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, 1995.

[12] S. Chib and I. Jeliazkov. Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association, 96(453):270–281, 2001.

[13] N. Chopin. A sequential particle filter for static models. Biometrika, 89:539–552, 2002.

[14] N. Chopin and C. P. Robert. Properties of nested sampling. Biometrika, 97(3):741–755, 2010.

[15] G. Claeskens and N. L. Hjort. The focused information criterion. Journal of the American Statistical Association, 98(464):900–916, 2003.

[16] P. Congdon. Bayesian model choice based on Monte Carlo estimates of posterior model probabilities. Computational Statistics & Data Analysis, 50(2):346–357, 2006.

[17] J. M. Cornuet, J. M. Marin, A. Mira, and C. P. Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, December 2012.

[18] P. Dellaportas, J. J. Forster, and I. Ntzoufras. On Bayesian model and variable selection using MCMC. Statistics and Computing, 12(1):27–36, 2002.

[19] T. J. DiCiccio, R. E. Kass, A. Raftery, and L. Wasserman. Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association, 92(439):903–915, 1997.

[20] P. M. Djuric, J. H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. F. Bugallo, and J. Miguez. Particle filtering. IEEE Signal Processing Magazine, 20(5):19–38, September 2003.

[21] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. Technical report, 2008.

[22] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22(10):1757–1761, 2015.

[23] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Heretical multiple importance sampling. IEEE Signal Processing Letters, 23(10):1474–1478, 2016.

[24] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Generalized Multiple Importance Sampling. Statistical Science, 34(1):129–155, 2019.


[25] N. Friel and A. N. Pettitt. Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):589–607, 2008.

[26] N. Friel and J. Wyse. Estimating the evidence - a review. Statistica Neerlandica, 66(3):288–308, 2012.

[27] A. E. Gelfand and D. K. Dey. Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society: Series B (Methodological), 56(3):501–514, 1994.

[28] A. Gelman and X. L. Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pages 163–185, 1998.

[29] W. R. Gilks and C. Berzuini. Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1):127–146, 2001.

[30] W. R. Gilks, N. G. Best, and K. K. C. Tan. Adaptive Rejection Metropolis Sampling within Gibbs Sampling. Applied Statistics, 44(4):455–472, 1995.

[31] W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall/CRC, 1995.

[32] W. R. Gilks and P. Wild. Adaptive Rejection Sampling for Gibbs Sampling. Applied Statistics, 41(2):337–348, 1992.

[33] S. J. Godsill. On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics, 10(2):230–248, 2001.

[34] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.

[35] Q. F. Gronau, H. Singmann, and E. J. Wagenmakers. Bridgesampling: Bridge sampling for marginal likelihoods and Bayes factors. R package version 0.4-0, URL https://CRAN.R-project.org/package=bridgesampling, 2017.

[36] P. Grunwald and T. Roos. Minimum Description Length Revisited. arXiv:1908.08484, pages 1–38, 2019.

[37] D. I. Hastie and P. J. Green. Model choice using reversible jump Markov chain Monte Carlo. Statistica Neerlandica, 66(3):309–338, 2012.

[38] J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, 14(4):382–417, 1999.

[39] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.


[40] K. H. Knuth, M. Habeck, N. K. Malakar, A. M. Mubeen, and B. Placek. Bayesian evidence and model selection. Digital Signal Processing, 47:50–67, 2015.

[41] A. Kong. A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago, 1992.

[42] C. H. LaMont and P. A. Wiggins. Correspondence between thermodynamics and inference. Physical Review E, 99(5):052140, 2019.

[43] S. M. Lewis and A. E. Raftery. Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association, 92(438):648–655, 1997.

[44] F. Liang, C. Liu, and R. Carroll. Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples. Wiley Series in Computational Statistics, England, 2010.

[45] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2004.

[46] P. Liu, A. S. Elshall, M. Ye, P. Beerli, X. Zeng, D. Lu, and Y. Tao. Evaluating marginal likelihood with thermodynamic integration method and comparison with several other numerical methods. Water Resources Research, 52(2):734–758, 2016.

[47] J. M. Marin and C. P. Robert. Importance sampling methods for Bayesian discrimination between embedded models. arXiv preprint arXiv:0910.2325, 2009.

[48] A. D. Martin, K. M. Quinn, and J. H. Park. MCMCpack: Markov chain Monte Carlo in R. Foundation for Open Access Statistics, 2011.

[49] L. Martino. A review of multiple try MCMC algorithms for signal processing. Digital Signal Processing, 75:134–152, 2018.

[50] L. Martino, R. Casarin, F. Leisen, and D. Luengo. Adaptive independent sticky MCMC algorithms. EURASIP Journal on Advances in Signal Processing (to appear), 2017.

[51] L. Martino and V. Elvira. Metropolis sampling. Wiley StatsRef: Statistics Reference Online, pages 1–18, 2017.

[52] L. Martino and V. Elvira. Compressed Monte Carlo for distributed Bayesian inference. viXra:1811.0505, 2018.

[53] L. Martino, V. Elvira, and G. Camps-Valls. Group importance sampling for particle filtering and MCMC. Digital Signal Processing, 82:133–151, 2018.

[54] L. Martino, V. Elvira, and F. Louzada. Weighting a resampled particle in Sequential Monte Carlo. IEEE Statistical Signal Processing Workshop (SSP), 122:1–5, 2016.

[55] L. Martino, V. Elvira, and M. F. Louzada. Effective Sample Size for importance sampling based on the discrepancy measures. Signal Processing, 131:386–401, 2017.


[56] L. Martino, V. Elvira, and D. Luengo. Anti-tempered layered adaptive importance sampling. International Conference on Digital Signal Processing (DSP), 2017.

[57] L. Martino, V. Elvira, D. Luengo, and J. Corander. Layered adaptive importance sampling. Statistics and Computing, 27(3):599–623, 2017.

[58] L. Martino, D. Luengo, and J. Miguez. Independent random sampling methods. Springer, 2018.

[59] L. Martino, V. P. Del Olmo, and J. Read. A multi-point Metropolis scheme with generic weight functions. Statistics & Probability Letters, 82(7):1445–1453, 2012.

[60] L. Martino and J. Read. On the flexibility of the design of multiple try Metropolis schemes. Computational Statistics, 28(6):2797–2823, 2013.

[61] L. Martino, J. Read, V. Elvira, and F. Louzada. Cooperative parallel particle filters for on-line model selection and applications to urban mobility. Digital Signal Processing, 60:172–185, 2017.

[62] L. Martino, J. Read, and D. Luengo. Independent doubly adaptive rejection Metropolis sampling within Gibbs sampling. IEEE Transactions on Signal Processing, 63(12):3123–3138, June 2015.

[63] X.-L. Meng and W. H. Wong. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, pages 831–860, 1996.

[64] A. Mira and G. Nicholls. Bridge estimation of the probability density at a point. Technical report, Department of Mathematics, The University of Auckland, New Zealand, 2003.

[65] A. Mira and G. Nicholls. Bridge estimation of the probability density at a point. Technical report, Department of Mathematics, The University of Auckland, New Zealand, 2003.

[66] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

[67] C. A. Naesseth, F. Lindsten, and T. B. Schon. Nested Sequential Monte Carlo methods. Proceedings of the International Conference on Machine Learning, 37:1–10, 2015.

[68] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[69] A. O'Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):99–118, 1995.

[70] N. G. Polson and J. G. Scott. Vertical-likelihood Monte Carlo. arXiv preprint arXiv:1409.3601, 2014.

[71] C. M. Pooley and G. Marion. Bayesian model evidence as a practical alternative to deviance information criterion. Royal Society Open Science, 5(3):1–16, 2018.


[72] J. R. Oaks, K. A. Cobb, V. N. Minin, and A. D. Leache. Marginal likelihoods in phylogenetics: a review of methods and applications. Systematic Biology, 68(5):681–697, 2019.

[73] C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. Advances in Neural Information Processing Systems, pages 505–512, 2003.

[74] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2004.

[75] C. P. Robert and D. Wraith. Computational methods for Bayesian model choice. AIP Conference Proceedings, 1193(1):251–262, 2009.

[76] Y. Sakamoto, M. Ishiguro, and G. Kitagawa. Akaike information criterion statistics. Dordrecht, The Netherlands: D. Reidel, 81, 1986.

[77] A. Schoniger, T. Wohling, L. Samaniego, and W. Nowak. Model selection on solid ground: Rigorous comparison of nine ways to evaluate Bayesian model evidence. Water Resources Research, 50(12):9484–9513, 2014.

[78] I. Schuster and I. Klebanov. Markov Chain Importance Sampling - a highly efficient estimator for MCMC. arXiv preprint arXiv:1805.07179, 2018.

[79] G. Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

[80] J. Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006.

[81] D. J. Spiegelhalter and A. F. M. Smith. Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society: Series B (Methodological), 44(3):377–387, 1982.

[82] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. Van der Linde. Bayesian measures of model complexity and fit. J. R. Stat. Soc. B, 64:583–616, 2002.

[83] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. Van der Linde. The deviance information criterion: 12 years on. J. R. Stat. Soc. B, 76:485–493, 2014.

[84] Statisticat, LLC. LaplacesDemon: Complete Environment for Bayesian Inference, 2018. R package version 16.1.1.

[85] G. Stoltz and M. Rousset. Free energy computations: A mathematical perspective. World Scientific, 2010.

[86] G. M. Torrie and J. P. Valleau. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199, 1977.


[87] I. Urteaga, M. F. Bugallo, and P. M. Djuric. Sequential Monte Carlo methods under model uncertainty. In 2016 IEEE Statistical Signal Processing Workshop (SSP), pages 1–5, 2016.

[88] V. Vyshemirsky and M. A. Girolami. Bayesian ranking of biochemical system models. Bioinformatics, 24(6):833–839, 2007.

[89] M. D. Weinberg et al. Computing the Bayes factor from a Markov chain Monte Carlo simulation of the posterior distribution. Bayesian Analysis, 7(3):737–770, 2012.

[90] Z. Zhao and T. A. Severini. Integrated likelihood computation methods. Computational Statistics, 32(1):281–313, 2017.
