Developments in Bayesian Priors
Roger Barlow
Manchester IoP Meeting, 16 November 2005
Manchester IoP Half Day Meeting
Roger Barlow: Developments in Bayesian Priors
Slide 2
Plan
• Probability
  – Frequentist
  – Bayesian
• Bayes' Theorem
  – Priors
• Prior pitfalls (1): Le Diberder
• Prior pitfalls (2): Heinrich
• Jeffreys' Prior
  – Fisher Information
• Reference Priors: Demortier
Slide 3
Probability
Probability as the limit of a frequency:
P(A) = lim (N_A / N_total) as N_total → ∞
The usual definition taught to students. Makes sense, and works well most of the time.
But not always.
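The frequency definition above can be illustrated with a short simulation (my own sketch, not from the talk): the observed ratio N_A/N_total wanders for small N and settles towards the underlying probability as N grows.

```python
# Sketch of P(A) = lim N_A/N_total, using simulated coin tosses.
import random

random.seed(1)
p_true = 0.5  # assumed probability of heads
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < p_true for _ in range(n))
    print(f"N = {n:>9}: N_A/N = {heads / n:.4f}")

# The ratio fluctuates at small N and converges towards 0.5 as N grows.
```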
Slide 4
Frequentist probability
“It will probably rain tomorrow.”
“Mt = 174.3 ± 5.1 GeV means the top quark mass lies between 169.2 and 179.4, with 68% probability.”
Neither is a statement about frequencies. The frequentist versions:
“The statement ‘It will rain tomorrow’ is probably true.”
“Mt = 174.3 ± 5.1 GeV means: the top quark mass lies between 169.2 and 179.4, at 68% confidence.”
Slide 5
Bayesian Probability
P(A) expresses my belief that A is true
Limits: 0 (impossible) and 1 (certain)
Calibrated off clear-cut instances (coins, dice, urns)
Slide 6
Frequentist versus Bayesian?
Two sorts of probability – totally different. (Bayesian probability is also known as Inverse Probability.)
Rivals? Religious differences? Particle physicists tend to be frequentists; cosmologists tend to be Bayesians.
No: they are two different tools for practitioners. Important to:
• Be aware of the limits and pitfalls of both
• Always be aware which you're using
Slide 7
Bayes Theorem (1763)
P(A|B) P(B) = P(A and B) = P(B|A) P(A)
⇒ P(A|B) = P(B|A) P(A) / P(B)
Frequentist use, e.g. a Čerenkov counter:
P(π | signal) = P(signal | π) P(π) / P(signal)
Bayesian use:
P(theory | data) = P(data | theory) P(theory) / P(data)
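A numeric sketch of the Čerenkov-counter use of Bayes' theorem (all numbers invented for illustration, not from the talk):

```python
# Hedged sketch of Bayes' theorem for particle ID (toy numbers, assumed):
p_pi, p_K = 0.9, 0.1            # priors: assumed beam composition
p_sig_pi, p_sig_K = 0.95, 0.05  # assumed P(Cherenkov signal | species)

p_sig = p_sig_pi * p_pi + p_sig_K * p_K   # P(signal), by total probability
p_pi_given_sig = p_sig_pi * p_pi / p_sig  # Bayes' theorem
print(f"P(pi | signal) = {p_pi_given_sig:.4f}")  # -> 0.9942
```

Even a modest signal efficiency difference sharpens the already-large prior: the posterior probability of a pion rises from 0.9 to about 0.994.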
Slide 8
Bayesian Prior
P(theory) is the Prior. It expresses prior belief that the theory is true, and can be a function of a parameter:
P(Mtop), P(MH), P(α,β,γ)
Bayes' Theorem describes the way prior belief is modified by experimental data.
But what do you take as the initial prior?
Slide 9
Uniform Prior
General usage: choose P(a) uniform in a (the principle of insufficient reason).
Often ‘improper’: ∫P(a) da = ∞. Though the posterior P(a|x) comes out sensible.
BUT! If P(a) is uniform, P(a²), P(ln a), P(√a)… are not. Insufficient reason is not valid (unless a is ‘most fundamental’ – whatever that means).
Statisticians handle this: check results for ‘robustness’ under different priors.
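The non-invariance point can be checked numerically (my sketch, assuming a ~ Uniform(0,1)): uniform prior mass in a does not translate into uniform mass in a².

```python
# A prior uniform in a is not uniform in a^2:
# P(a^2 < 0.25) = P(a < 0.5) = 0.5 under a uniform prior in a,
# whereas a prior uniform in a^2 would give that region only 0.25.
import random

random.seed(2)
samples = [random.random() for _ in range(200_000)]
frac = sum(a * a < 0.25 for a in samples) / len(samples)
print(f"P(a^2 < 0.25) under a prior uniform in a: {frac:.3f}")  # ~0.5, not 0.25
```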
Slide 10
Example – Le Diberder
A sad story: fitting the CKM angle α from B decays. 6 observables; 3 amplitudes, giving 6 unknown parameters (magnitudes and phases). α is the fundamentally interesting one.
Slide 11
Results
Frequentist
Bayesian: set one phase to zero, with uniform priors in the other two phases and the 3 magnitudes.
Slide 12
More Results
Bayesian: parametrise in terms of Tree and Penguin amplitudes:
A+- = T e^{iα} + P e^{iδ}
√2 A+0 = e^{iα} (T + TC e^{iδC})
√2 A00 = e^{iα} TC e^{iδC} − P e^{iδ}
Bayesian: 3 amplitudes, i.e. 3 real parts and 3 imaginary parts.
Slide 13
Interpretation
• B shows same (mis)behaviour
• Removing all experimental info gives similar P(α)
• The curse of high dimensions is at work
Uniformity in x, y, z makes P(r) peak at large r.
This result is not robust under changes of prior.
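The x, y, z example on this slide can be checked with a quick Monte Carlo (my sketch): uniform priors on the components push the implied prior on the radius away from zero.

```python
# Sketch: x, y, z uniform in [-1,1]^3 gives P(r) rising like r^2 near 0,
# so prior mass concentrates at large r (the curse of high dimensions).
import random, math

random.seed(3)
rs = []
for _ in range(100_000):
    x, y, z = (random.uniform(-1, 1) for _ in range(3))
    rs.append(math.sqrt(x*x + y*y + z*z))

# Volume of the r < 0.5 ball over the cube: (4/3)*pi*0.5^3 / 8 ~ 0.065
small = sum(r < 0.5 for r in rs) / len(rs)
print(f"fraction with r < 0.5: {small:.3f}")
```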
Slide 14
Example - Heinrich
CDF statistics group looking at problem of estimating signal cross section S in presence of background and efficiency.
N = εS + b
Efficiency and background come from separate calibration experiments (sidebands or MC). Scaling factors κ, ω are known.
Everything is done using Bayesian methods with uniform priors and the Poisson statistics formula. The calibration experiments use uniform priors for ε and for b, yielding posteriors used for S:
P(N|S) = (1/N!) ∫∫ e^{−(εS+b)} (εS+b)^N P(ε) P(b) dε db
Check coverage – all fine.
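A minimal numeric sketch of this marginalisation (toy numbers throughout; Gaussian-shaped weights stand in for the actual calibration posteriors, which is an assumption, not the analysis the slide describes):

```python
# Toy sketch: marginalise P(N|S) over epsilon and b on a grid,
# with assumed Gaussian-shaped weights for the calibration posteriors.
import math

def poisson_pmf(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

def gauss(x, mu, sigma):  # unnormalised Gaussian weight
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

N_obs = 5
eps_grid = [0.05 + 0.01 * i for i in range(60)]  # epsilon in (0, 0.65)
b_grid = [0.05 + 0.05 * i for i in range(40)]    # b in (0, 2)

def p_N_given_S(S):
    total, norm = 0.0, 0.0
    for eps in eps_grid:
        for b in b_grid:
            w = gauss(eps, 0.25, 0.10) * gauss(b, 0.75, 0.25)  # assumed
            total += w * poisson_pmf(N_obs, eps * S + b)
            norm += w
    return total / norm

# With a uniform prior in S, the posterior is proportional to p(N|S):
post = {S: p_N_given_S(S) for S in range(0, 61, 5)}
best = max(post, key=post.get)
print("posterior peaks near S =", best)
```

With N = 5, ε ≈ 0.25 and b ≈ 0.75, the posterior peaks near S ≈ (N − b)/ε, i.e. in the mid-teens.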
Slide 15
But it all goes pear-shaped…
If the particle decays in several channels (H→γγ, H→τ+τ−, H→bb), each channel has its own b and ε: in total 2N+1 parameters and 2N+1 experiments.
Heavy undercoverage! E.g. with 4 channels, all ε = 25 ± 10% and b = 0.75 ± 0.25, for S ≈ 10 the ‘90% upper limit’ lies above S in only 80% of cases.
[Plot: coverage of the quoted ‘90%’ upper limit versus S (axis ticks 90%, 100%; S = 10, 20)]
Slide 16
The curse strikes again
A uniform prior in a single ε: fine.
But uniform priors in ε1, ε2, … εN amount to an ε^{N−1} prior in the total ε: a prejudice in favour of high efficiency, so the signal size is downgraded.
Slide 17
Happy ending
The effect is avoided by using Jeffreys' priors instead of uniform priors for ε and b.
These are not uniform, but behave like 1/ε and 1/b.
Not entirely realistic, but interesting. A uniform prior in S is not a problem – but maybe one should consider 1/√S?
Coverage (a very frequentist concept) is a useful tool for Bayesians.
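Such a coverage check can be sketched for the simplest case, N ~ Poisson(S) with no background or efficiency (my simplification, not Heinrich's full problem): with a uniform prior on S the posterior is Gamma(N+1, 1), so the Bayesian 90% upper limit is its 0.9 quantile, and toy experiments measure how often it covers the true S.

```python
# Coverage check sketch: Bayesian 90% upper limits for N ~ Poisson(S)
# with a uniform prior on S (posterior Gamma(N+1, 1)), stdlib only.
import random, math

random.seed(4)

def poisson_sample(mu):
    # Knuth's algorithm; fine for small mu
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

def gamma_ppf(q, shape):
    # Quantile of Gamma(shape, 1) for integer shape, by bisection:
    # P(Gamma(shape,1) <= x) = 1 - sum_{k<shape} e^-x x^k / k!
    def cdf(x):
        return 1.0 - sum(math.exp(-x) * x**k / math.factorial(k)
                         for k in range(shape))
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if cdf(mid) < q else (lo, mid)
    return 0.5 * (lo + hi)

S_true, n_toys = 3.0, 2000
covered = sum(gamma_ppf(0.9, poisson_sample(S_true) + 1) >= S_true
              for _ in range(n_toys))
print(f"coverage of the 90% upper limit: {covered / n_toys:.3f}")
```

For S_true = 3 the limit fails to cover only when N = 0, so the coverage is 1 − e^−3 ≈ 0.95: the uniform-prior Poisson upper limit overcovers, consistent with the "all fine" one-channel result two slides back.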
Slide 18
Fisher Information
An informative experiment is one for which a measurement of x will give precise information about the parameter a.
Quantify: I(a) = −⟨∂² ln L / ∂a²⟩
(Second derivative – curvature.)
P(x,a): contains everything.
At fixed a, P(x|a) is the pdf of x.
At fixed x, P(x|a) is the likelihood L(a).
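For a Poisson mean the expectation can be evaluated directly (my sketch): ln L = −a + n ln a − ln n!, so ∂² ln L/∂a² = −n/a², and I(a) = ⟨n⟩/a² = 1/a.

```python
# Numerical check that the Poisson mean has Fisher information I(a) = 1/a,
# which is what gives the Jeffreys prior sqrt(I(a)) ~ 1/sqrt(a) later on.
import math

def fisher_info_poisson(a, n_max=200):
    # I(a) = -< d^2 ln L / da^2 > = sum_n P(n|a) * n/a^2
    term = math.exp(-a)  # P(n=0|a); recurrence avoids factorial overflow
    total = 0.0
    for n in range(n_max):
        total += term * (n / a**2)
        term *= a / (n + 1)  # P(n+1|a) = P(n|a) * a/(n+1)
    return total

for a in (0.5, 2.0, 10.0):
    print(f"a = {a:5}: I(a) = {fisher_info_poisson(a):.4f}, 1/a = {1/a:.4f}")
```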
Slide 19
Jeffreys' Prior
A prior may be uniform in a – but if I(a) depends on a it's still not ‘flat’: special values of a give better measurements.
Transform a → a′ such that I(a′) is constant, then choose a prior uniform in a′:
• location parameter – uniform prior OK
• scale parameter – a′ = ln a, prior 1/a
• Poisson mean – prior 1/√a
Slide 20
Objective Prior?
Jeffreys called this an ‘objective’ prior, as opposed to ‘subjective’ straight guesswork, but not everyone was convinced.
For statisticians, a ‘flat prior’ means the Jeffreys prior; for physicists it means a uniform prior.
The prior depends on the likelihood: your ‘prior belief’ P(MH) (or whatever) depends on the analysis.
Equivalent to a prior proportional to √I.
Slide 21
Reference Priors (Demortier)
4 steps.
1) Intrinsic Discrepancy between two PDFs:
δ{P1(z), P2(z)} = min{ ∫P1(z) ln(P1(z)/P2(z)) dz, ∫P2(z) ln(P2(z)/P1(z)) dz }
A sensible measure of difference: δ = 0 iff P1(z) and P2(z) are the same, otherwise positive. Invariant under all transformations of z.
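On a discrete grid the intrinsic discrepancy takes a few lines (my sketch; the two distributions are invented for illustration):

```python
# Intrinsic discrepancy on a discrete grid: the smaller of the two
# KL divergences between P1 and P2; zero iff they coincide.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def intrinsic_discrepancy(p, q):
    return min(kl(p, q), kl(q, p))

p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]
print(intrinsic_discrepancy(p1, p1))                 # identical: 0.0
print(f"{intrinsic_discrepancy(p1, p2):.4f}")        # different: positive
```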
Slide 22
Reference Priors (2)
2) Expected Intrinsic Information.
Measurement M: x is sampled from p(x|a). Parameter a has a prior p(a).
Joint distribution: p(x,a) = p(x|a) p(a). Marginal distribution: p(x) = ∫p(x|a) p(a) da.
I(p(a), M) = δ{p(x,a), p(x) p(a)}
This depends on (i) the x–a relationship and (ii) the breadth of p(a). It is the expected intrinsic (Shannon) information from measurement M about parameter a.
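A toy discrete version (2×2 distributions invented for illustration) shows how I(p(a), M) rewards measurements where x tracks a:

```python
# Sketch of I(p(a), M) = delta{p(x,a), p(x)p(a)} on a small discrete grid.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_info(p_x_given_a, p_a):
    na, nx = len(p_a), len(p_x_given_a[0])
    joint = [p_x_given_a[i][j] * p_a[i] for i in range(na) for j in range(nx)]
    p_x = [sum(p_x_given_a[i][j] * p_a[i] for i in range(na)) for j in range(nx)]
    prod = [p_a[i] * p_x[j] for i in range(na) for j in range(nx)]
    return min(kl(joint, prod), kl(prod, joint))  # intrinsic discrepancy

p_a = [0.5, 0.5]
sharp = [[0.9, 0.1], [0.1, 0.9]]   # x follows a closely: informative
vague = [[0.5, 0.5], [0.5, 0.5]]   # x independent of a: no information
print(f"informative M:   {expected_info(sharp, p_a):.4f}")
print(f"uninformative M: {expected_info(vague, p_a):.4f}")
```

When x is independent of a the joint equals the product and the information is zero; the sharper the x–a link, the larger I(p(a), M).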
Slide 23
Reference Priors (3)
3) Missing Information.
Measurement Mk: k samples of x. Enough measurements fix a completely.
The limit as k→∞ of I(p(a), Mk) is the difference between the knowledge encapsulated in the prior p(a) and complete knowledge of a – hence the Missing Information given p(a).
Slide 24
Reference Priors (4)
4) Family of priors P (e.g. Fourier series, polynomials, histograms), with p(a) ∈ P.
Ignorance principle: choose the least informative (dumbest) prior in the family – the one for which the missing information lim k→∞ I(p(a), Mk) is largest.
There are technical difficulties in taking the k→∞ limit and in integrating over an infinite range of a.
Slide 25
Family of Priors (Google)
Slide 26
Reference Priors
Reference priors do not represent subjective belief – in fact the opposite (like a jury selection): they allow the most input to come from the data. A formal consensus that practitioners can use to arrive at a sensible posterior.
They depend on the measurement p(x|a) – cf. Jeffreys. They also require a family P of possible priors.
They may be improper, but this doesn't matter (they do not represent belief).
For 1 parameter (if the measurement is asymptotically Gaussian, which the CLT usually secures) they give the Jeffreys prior.
But they can also (unlike Jeffreys) work for several parameters.