Chapter 9: Monte Carlo, resampling, and Bayesian analysis

• Monte Carlo simulation

• The bootstrap method

• Bayesian data analysis

• Markov Chain Monte Carlo (MCMC) sampling


Exploring posterior distributions by computer power

Actual data analysis problems are usually so complex that it is difficult, and often impossible, to derive the distribution of a statistic directly from the distributions of the data (even if these were known). Examples where we need to know the distribution of a statistic:

• In hypothesis testing, we need the distribution of the test statistic t under H0 to compute the p value for a given observed statistic.

• When fitting a model to data, the estimated parameters are statistics, and we need to know their joint distribution to assess the uncertainties and possible dependencies among the parameters.

The classical way to handle this is to assume distributions that are so simple that they can be treated analytically, or semi-analytically. Examples: the t-test to decide if two normal samples have the same mean value; linear error propagation of Gaussian errors.

The availability of fast computers and random number generators has changed this. In the general technique known as Monte Carlo simulation, distributions are studied empirically by means of (many) random experiments performed on the computer. This development has favoured the use of Bayesian techniques, because the simulations can readily be adapted to sample (and hence estimate) the posterior distributions.
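To make the technique concrete, here is a minimal MATLAB sketch in the spirit of the Numerical Recipes figures on the next two pages (the straight-line model, the noise level 0.3, and all variable names are assumptions for illustration, not from the slides): fit the observed data once, then draw many synthetic datasets from the fitted model and refit each one, so that the scatter of the refitted parameters maps out their joint distribution.

    t = (1:20)';                               % observation epochs (assumed)
    yObs = 2 + 0.5*t + 0.3*randn(20,1);        % stand-in for the observed data
    A = [ones(20,1), t];                       % design matrix of the linear model
    thetaHat = A \ yObs;                       % fitted parameters (intercept, slope)
    nSim = 1000;                               % number of synthetic datasets
    thetaSim = zeros(nSim, 2);
    for k = 1:nSim
        ySynt = A*thetaHat + 0.3*randn(20,1);  % synthetic dataset from the fitted model
        thetaSim(k,:) = (A \ ySynt)';          % re-estimated parameters
    end
    cov(thetaSim)                              % empirical covariance of the estimates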


Monte Carlo simulation of synthetic data sets (fictitious)


[Figure from Press et al. (1992), Numerical Recipes (2nd ed.), p. 685]


Monte Carlo simulation of synthetic data sets (actual)


[Figure from Press et al. (1992), Numerical Recipes (2nd ed.), p. 686]


Resampling (bootstrap method)

The “bootstrap method” may be useful if the observed dataset D consists of a reasonable number of independent and interchangeable observations xi, i = 1 ... N:

D(obs) = { x1, x2, ..., xN }

Synthetic datasets are generated by drawing N observations randomly with replacement from D(obs):

D(synt) = { xr1, xr2, ..., xrN }

where r1, r2, ..., rN are independent random integers from 1 to N, e.g., in MATLAB: r = randi(N,N,1) [returns an N×1 array of random integer values from 1 to N].

Then treat the synthetic datasets as in the Monte Carlo method.

Recommended number of synthetic datasets: n > N (ln N)²
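A minimal MATLAB sketch of the recipe above (the stand-in dataset and the choice of the median as the statistic are assumptions for illustration):

    x = randn(100,1);                 % stand-in for the observed dataset D(obs)
    N = numel(x);
    n = ceil(N*log(N)^2);             % recommended number of synthetic datasets
    thetaBoot = zeros(n,1);
    for k = 1:n
        r = randi(N,N,1);             % N random integers from 1 to N (with replacement)
        thetaBoot(k) = median(x(r));  % the statistic, evaluated on the synthetic dataset
    end
    std(thetaBoot)                    % bootstrap estimate of the statistic's standard error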


Bayesian data analysis


“[Laplace] stated that: ‘... it is a bet of 11,000 to 1 that the error of this result is not 1/100th of its value.’ He would have won the bet, as another 150 years’ accumulation of data has changed the estimate by only 0.63%!”

(D.S. Sivia & J. Skilling, Data Analysis, A Bayesian Tutorial, Oxford University Press 2006)


The Bayesian approach

In the Bayesian approach:

• it is possible to assign probability density functions to the parameters of a model, and

• it is possible to assign probabilities to different hypotheses.

This contrasts with the “traditional” (frequentist) approach, where:

• model parameters are not random variables, so it is meaningless to assign probabilities to them;

• a hypothesis is either true or false.

Actually, the Bayesian point of view is much older than the frequentist one, as illustrated by the quotation from Laplace (1812). To him it made sense to make a probability statement about the error of an estimate (such as the 11,000-to-1 bet quoted above), which is nonsense to a frequentist (the error either is or is not > 1%).

Clearly this was a statement about his degree of belief, not about the objective world.


Marginal and conditional probability (reminder from Ch. 2)

[Figure: joint probability density h(x, y) with its marginal densities f(x) and g(y), and the conditional density f(x|y).]

Marginal densities: f(x) = ∫ h(x, y) dy,  g(y) = ∫ h(x, y) dx

Conditional density: f(x|y) = h(x, y) / g(y)


Bayes’ rule (reminder from Ch. 2)

The joint probability density of x and y can be written

h(x, y) = f(x|y) g(y) = g(y|x) f(x)

from which we obtain Bayes’ rule

g(y|x) = f(x|y) g(y) / f(x)

Let x = the data and y = the model parameters θ; using that f(x|θ) = L(θ|x), we find

g(θ|x) = L(θ|x) g(θ) / f(x),  where f(x) = ∫ L(θ|x) g(θ) dθ

posterior = likelihood × prior / evidence

The “evidence” is a normalization factor, which is independent of the parameters, but not of the model (different models give different evidence for the same data).


Bayesian estimation

The posterior probability density encapsulates the available information (prior + data), given the model.

It can be used to compute point estimates (i.e., a single “best” estimate of θ), uncertainties, correlations, confidence regions, etc., based on the given model.

For comparing different models (hypothesis testing), the evidence is needed.

Examples of point estimates:

• The posterior mean (minimum mean square error, MMSE): θ̂ = E[θ | x] = ∫ θ g(θ|x) dθ

• The posterior mode (maximum a posteriori probability, MAP): θ̂ = arg maxθ g(θ|x)

posterior probability density (with data) ∝ likelihood function × prior probability density (without data)
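For a one-dimensional parameter, both point estimates are easy to obtain numerically by evaluating the posterior on a grid. A minimal MATLAB sketch (the Gaussian likelihood, flat prior, and all numbers are assumptions for illustration):

    theta = linspace(-2, 5, 10001)';          % parameter grid
    xObs = 1.3;  sigma = 0.5;                 % one observed datum (assumed numbers)
    L = exp(-0.5*((xObs - theta)/sigma).^2);  % likelihood function L(theta|x)
    g = ones(size(theta));                    % flat prior g(theta)
    post = L .* g;                            % unnormalized posterior
    post = post / trapz(theta, post);         % normalize (divide by the evidence)
    [~, i] = max(post);
    thetaMAP  = theta(i)                      % posterior mode (MAP)
    thetaMMSE = trapz(theta, theta .* post)   % posterior mean (MMSE)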


Bayesian estimation: Role of prior density

The choice of the prior density influences the Bayesian estimate:

If the data contain more information (narrower likelihood), the prior becomes less important:

[Figure: prior, likelihood, and the resulting posterior, for a broad and for a narrow likelihood.]
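The effect can be verified numerically; here is a MATLAB sketch with an assumed N(0, 1) prior and a Gaussian likelihood of varying width (all numbers are illustrative assumptions):

    theta = linspace(-5, 5, 2001)';
    prior = exp(-0.5*theta.^2);                    % informative prior, N(0,1)
    for sigma = [2 0.2]                            % broad vs. narrow likelihood
        L = exp(-0.5*((theta - 1.5)/sigma).^2);    % likelihood centred on 1.5
        post = prior .* L;
        post = post / trapz(theta, post);
        fprintf('sigma = %.1f: posterior mean = %.2f\n', sigma, trapz(theta, theta.*post))
    end
    % sigma = 2.0 gives ~0.30 (pulled strongly towards the prior mean 0);
    % sigma = 0.2 gives ~1.44 (close to the likelihood peak 1.5).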


Informative and uninformative priors

An informative prior g(θ) expresses some specific information about θ, i.e., you would be reluctant to accept a value of θ with a low prior probability density, or not at all if g = 0. The informative prior is always to some extent subjective.

If you measure similar quantities (e.g. stellar proper motions) many times, you acquire a feeling for what range of values to expect. If the next measurement deviates very much from your prior expectations, you will be reluctant to accept it, and may want to check the data an extra time. It may be possible to quantify this prior expectation, e.g. using past statistics.

An uninformative prior g(θ) attempts to provide only the minimum or most general information about the variable, based on some “objective” principle. For example, if θ has only a finite number (n) of discrete outcomes, one can apply the principle of indifference and assign g = 1/n to each of them. If θ is continuous, or discrete with infinite n, there is no generally accepted principle, although a number of them have been proposed, depending on the circumstances.

• If θ is continuous and only makes physical sense in a certain interval [θmin, θmax], it may be reasonable to use a flat prior in that interval: g(θ) = 1/(θmax − θmin); g(θ) = 0 elsewhere. The (improper) prior g(θ) = 1 (everywhere) is often used for a location parameter.

• If θ is positive and unbounded, 0 < θ < ∞, it may be possible to use the (improper) prior g(θ) = 1/θ (sometimes called Jeffreys’ prior). This is often used for a scale parameter. [Why?]
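One numerical hint at the [Why?], sketched in MATLAB (the helper f is an illustration, not from the slides): the prior mass that g(θ) = 1/θ assigns to an interval depends only on the ratio of its endpoints, so the prior is unchanged by a rescaling θ → cθ.

    f = @(a,b) integral(@(t) 1./t, a, b);  % prior mass of g(theta) = 1/theta in [a,b]
    [f(1,10), f(100,1000)]                 % equal (both = ln 10): equal mass per decade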


Summary of useful formulae

This page summarizes some formulae for computing the Maximum Likelihood (ML), Maximum A Posteriori Probability (MAP), and Minimum Mean Square Error (MMSE) estimates, given the likelihood function L(θ|{data}) and the prior density g(θ):

θ̂_ML = arg maxθ L(θ|x)

θ̂_MAP = arg maxθ L(θ|x) g(θ)

θ̂_MMSE = ∫ θ L(θ|x) g(θ) dθ / ∫ L(θ|x) g(θ) dθ

Note that there is no guarantee that all the integrals exist.

Integrals are often calculated using Monte Carlo methods; see below (MCMC).


A trivial (?) example of Bayesian estimation

Suppose we want to measure the intensity λ of a source by counting the number of photons, n, detected in a certain time interval. We assume that n ~ Pois(λ).

Given n, what is the estimate of λ? The answer depends on the choice of method and, for Bayesian estimation, on the choice of prior.

Exercise:

Compute the ML, MAP and MMSE estimates of λ. For the Bayesian estimates, assume an (improper) prior of the form g(λ) = λ^α, where α is some constant (= 0 and −1 for the two improper priors discussed above). Discuss the result in the case when n = 0.

Hints:

1. The likelihood function is L(λ | n) = λ^n exp(−λ) / n! for λ ≥ 0

2. ∫₀^∞ λ^a exp(−λ) dλ = Γ(a + 1), where Γ is the gamma function*

* Γ(x + 1) = x! when x is a non-negative integer, and generally Γ(x + 1) = x Γ(x)
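The analytic results can be checked numerically. A MATLAB sketch that evaluates the posterior on a grid for assumed values n = 3 and α = 0 (the numbers are illustrative; compare the output with your analytic expressions):

    n = 3;  alpha = 0;                         % assumed values for the check
    lam = linspace(1e-6, 60, 600001)';
    post = lam.^(n + alpha) .* exp(-lam);      % posterior prop. to L(lambda|n)*lambda^alpha
    post = post / trapz(lam, post);
    [~, i] = max(post);
    lamMAP  = lam(i)                           % compare with your analytic MAP estimate
    lamMMSE = trapz(lam, lam .* post)          % compare with your analytic MMSE estimate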


Markov Chains

A Markov chain (MC) is a random process (e.g., a time series) where the conditional probability of the next value depends only on the current value: {Xn} is a MC if

P(Xn+1 ≤ x | Xn, Xn−1, ..., X1) = P(Xn+1 ≤ x | Xn)

The MC is stationary if the distribution function Fn of Xn is the same for all n.

Example: The graph shows a realization of the stationary Markov chain

Xn+1 = C·Xn + gn,  gn ~ N(0, σ²)

with C = 0.99, σ = 1, X1 = 50.

[Figure: the first 5000 samples of the chain, showing the burn-in phase before the samples settle around zero; a histogram of the samples, compared with the equilibrium distribution N(0, σ²/(1−C²)).]

Note: samples are not independent!
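A minimal MATLAB sketch that reproduces the chain of this page (assuming, as reconstructed above, the recurrence Xn+1 = C·Xn + gn):

    C = 0.99;  sigma = 1;  nSteps = 5000;
    X = zeros(nSteps,1);  X(1) = 50;            % start far from equilibrium
    for n = 1:nSteps-1
        X(n+1) = C*X(n) + sigma*randn;          % one step of the Markov chain
    end
    std(X(1001:end))                            % compare with sigma/sqrt(1-C^2) = 7.09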


Markov Chain Monte Carlo (MCMC)


MCMC is a method to sample an arbitrary N-dimensional probability density function P(x) by generating a Markov chain {Xn} whose equilibrium distribution is P.

A useful feature of MCMC is that P(x) need not be normalized (unit volume); only relative densities are needed.

Such samples are particularly useful for computing integrals over the N-dimensional space. For an arbitrary function f(x) we have

∫ f(x) P(x) dx / ∫ P(x) dx ≈ (1/M) Σn f(Xn)   (sum over M successive samples Xn of the chain)

The average should exclude the burn-in phase (the first 10³ or so samples), when the MC has not yet reached equilibrium. Note that the integrals are taken over the N-dimensional space, which makes them difficult to compute numerically (especially when N is more than a few) without resorting to Monte Carlo techniques such as MCMC (cf. NR2, Ch. 7.6 or NR3, Ch. 7.7).

A common application is in Bayesian estimation, to calculate the integrals of the posterior density needed for the MMSE estimate and its covariance (see the summary of useful formulae above) or for confidence regions.
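Continuing the chain example of the previous page, a MATLAB sketch of the sample-average estimate (the chain and the choices f(x) = x and f(x) = x² are illustrative assumptions):

    C = 0.99;  sigma = 1;  nSteps = 100000;
    X = zeros(nSteps,1);  X(1) = 50;
    for n = 1:nSteps-1
        X(n+1) = C*X(n) + sigma*randn;
    end
    x = X(1001:end);                  % exclude the burn-in phase
    mean(x)                           % estimates <x>, here close to 0
    mean(x.^2)                        % estimates <x^2>, here close to sigma^2/(1-C^2) = 50.25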


MCMC: the Metropolis-Hastings algorithm

The most common algorithm for MCMC is due to N.C. Metropolis and W.K. Hastings. For details, see NR3 or Wikipedia under “Metropolis-Hastings algorithm”. The following is a simplified recipe for a Gaussian proposal density.

1. Take an arbitrary starting point X1 (for n = 1).
2. Given Xn, generate a “proposal state” Y = Xn + g, where g ~ N(0, V) [see below about V].
3. Calculate the ratio a = P(Y) / P(Xn).
4. If a ≥ 1, set Xn+1 = Y;
   else generate r ~ U(0, 1): if r ≤ a, set Xn+1 = Y, else set Xn+1 = Xn.
5. Increment n and go to 2.

V, the variance of g, needs to be tuned so that the steps in 2 are not too small (so that too many steps are needed to explore P), nor too large (giving too small acceptance ratios a). As a rule of thumb, the average acceptance ratio ⟨min(a,1)⟩ should be in the range 0.1 to 0.4.
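The recipe translates directly into MATLAB. A sketch for a one-dimensional target (the bimodal density P, the value of V, and the chain length are assumptions for illustration; in N dimensions V becomes a covariance matrix):

    P = @(x) exp(-0.5*(x-1).^2) + 0.5*exp(-0.5*((x+2)/0.5).^2);  % unnormalized target
    V = 1.0;                            % proposal variance (to be tuned)
    nSteps = 30000;
    X = zeros(nSteps,1);  X(1) = 0;     % arbitrary starting point (step 1)
    nAcc = 0;
    for n = 1:nSteps-1
        Y = X(n) + sqrt(V)*randn;       % proposal state, g ~ N(0,V) (step 2)
        a = P(Y) / P(X(n));             % density ratio (step 3)
        if a >= 1 || rand <= a          % accept with probability min(a,1) (step 4)
            X(n+1) = Y;  nAcc = nAcc + 1;
        else
            X(n+1) = X(n);
        end
    end
    nAcc/(nSteps-1)                     % average acceptance fraction, cf. the rule of thumb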


MCMC - an illustration

[Figure, two panels: the first 300 steps with non-optimized step size (mean a = 0.85); the first 30,000 steps with optimized step size (mean a = 0.34).]