TRANSCRIPT
Bayesian Updating, Evolutionary Dynamics and Relative Entropy
Cosma Shalizi
Statistics Department, Carnegie Mellon University
Santa Fe Institute
29 September 2010, Allerton
What's the Problem?
Question: Why does Bayes (often) work, even when parameters aren't randomly generated?
Answer: For the same reason that evolution often works. Bayes is natural selection, without mutation, sex, or any of the other good parts.
Question: If Bayes is evolution, what is the fitness function?
Answer: The relative entropy rate. Bayes is evolutionary search with an information-theoretic objective function.
Question: What is this observation good for?
Answer: Understanding what happens with mis-specification and dependence.
Setting/Notation
Data $X_1, X_2, \ldots$ from a stochastic process $P$
Family $\Theta$ of stochastic processes/hypotheses $F_\theta$, indexed by $\theta$
Assume: $h(\theta) \equiv \lim_{t\to\infty} \frac{1}{t} D_{KL}\!\left(P(X_1^t) \,\middle\|\, F_\theta(X_1^t)\right)$ exists $\forall \theta$
Prior $\Pi_0$ over $\Theta$; posterior after $X_1, X_2, \ldots, X_t \equiv X_1^t$ is $\Pi_t$
Will not assume $P \in \operatorname{supp}\Pi_0$, or even $P \in \Theta$
Update via Bayes's rule:
$$\Pi_t(A) = \frac{\int_A d\Pi_0(\theta)\, f_\theta(x_1^t)}{\int_\Theta d\Pi_0(\theta)\, f_\theta(x_1^t)}
= \frac{\int_A d\Pi_0(\theta)\, \frac{f_\theta(x_1^t)}{p(x_1^t)}}{\int_\Theta d\Pi_0(\theta)\, \frac{f_\theta(x_1^t)}{p(x_1^t)}}
= \frac{\Pi_0(R_t A)}{\Pi_0(R_t)}$$
or
$$\pi_t(\theta) = \frac{\pi_0(\theta)\, R_t(\theta)}{\Pi_0(R_t)}$$
where $R_t(\theta) \equiv f_\theta(x_1^t)/p(x_1^t)$ is the likelihood ratio and $\Pi_0(R_t A) \equiv \int_A R_t(\theta)\, d\Pi_0(\theta)$
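To make the notation concrete, here is a minimal numerical sketch of the update (my toy example, not the talk's): IID Bernoulli data standing in for $P$, a finite grid of Bernoulli hypotheses standing in for $\Theta$, and a uniform prior; the constants (true_p = 0.7, the grid, the sample size) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the notation above (my example, not the talk's): the data
# are IID Bernoulli(0.7), the hypotheses F_theta are Bernoulli(theta) for
# theta on a grid standing in for Theta, and Pi_0 is uniform on that grid.
true_p = 0.7                                   # "reality" P (illustrative)
thetas = np.linspace(0.05, 0.95, 19)           # the hypothesis grid
log_post = np.full(len(thetas), -np.log(len(thetas)))   # log pi_0(theta)

x = rng.binomial(1, true_p, size=500)          # X_1, ..., X_t

# Sequential Bayes: multiply by the conditional likelihood and renormalize.
# The common factor p(x_1^t) in R_t(theta) cancels in the normalization,
# so it can be dropped from the computation.
for xt in x:
    log_post += np.where(xt == 1, np.log(thetas), np.log1p(-thetas))
    log_post -= np.logaddexp.reduce(log_post)  # divide by Pi_0(R_t)

posterior = np.exp(log_post)
print("posterior mode:", thetas[posterior.argmax()])   # should be near 0.7
```

Renormalizing the log weights at each step is the same division by $\Pi_0(R_t)$ as in the displayed formula, just done incrementally.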
The Central Analogy: Bayes ≺ Darwin-Wallace
$$\pi_{t+1}(\theta) = \pi_t(\theta)\,\frac{R_{t+1}(\theta)/R_t(\theta)}{\Pi_t\!\left(R_{t+1}/R_t\right)}$$
above-average conditional likelihood ⇒ increased posterior weight
This is the replicator dynamic from evolutionary theory:
simple hypotheses $\theta$ ⇔ genotypes
posterior $\Pi_t$ ⇔ population distribution
conditional likelihood $R_{t+1}(\theta)/R_t(\theta)$ ⇔ time-varying fitness
divergence rate $h(\theta)$ ⇔ long-run average fitness
First three are immediate; the last is not
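For completeness, the incremental form above is just the batch form from the notation slide, unpacked (a routine check, spelled out here rather than quoted from the talk):
$$
\frac{\pi_{t+1}(\theta)}{\pi_t(\theta)}
= \frac{\pi_0(\theta)\,R_{t+1}(\theta)/\Pi_0(R_{t+1})}{\pi_0(\theta)\,R_t(\theta)/\Pi_0(R_t)}
= \frac{R_{t+1}(\theta)}{R_t(\theta)}\cdot\frac{\Pi_0(R_t)}{\Pi_0(R_{t+1})},
\qquad
\frac{\Pi_0(R_{t+1})}{\Pi_0(R_t)}
= \int_\Theta \frac{R_{t+1}(\theta)}{R_t(\theta)}\,\frac{R_t(\theta)\,d\Pi_0(\theta)}{\Pi_0(R_t)}
= \Pi_t\!\left(\frac{R_{t+1}}{R_t}\right).
$$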
Past performance is indicative of future results
Assumption (Asymptotic equipartition property (AEP), Shannon-McMillan-Breiman)
$$\forall \theta \in \Theta, \quad \lim_{t\to\infty} \frac{1}{t}\log R_t(\theta) = -h(\theta) \quad \text{a.s.}$$

Lemma (Almost-universal-quantifier exchange)
With $P$-probability 1, the set of $\theta$ on which $t^{-1}\log R_t(\theta)$ does not converge has $\Pi_0$ measure 0

. . . so $h(\theta)$ really is long-run fitness
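A quick numerical illustration of the AEP statement in the IID Bernoulli toy model from before (again my example, with arbitrary constants): here $h(\theta)$ reduces to the per-symbol KL divergence, and the running average of the log likelihood ratio should settle near $-h(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative check of the AEP in the IID Bernoulli toy model (my example,
# not the talk's): h(theta) is the per-symbol KL divergence
# D(Bern(p) || Bern(theta)), and t^{-1} log R_t(theta) should approach -h(theta).
p, theta = 0.7, 0.5
T = 200_000
x = rng.binomial(1, p, size=T)

def kl_bern(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# log R_t(theta) is a sum of per-symbol log likelihood ratios log f_theta / p
per_symbol = np.where(x == 1, np.log(theta / p), np.log((1 - theta) / (1 - p)))
avg = np.cumsum(per_symbol) / np.arange(1, T + 1)

print("t^-1 log R_t at t=T:", avg[-1])
print("-h(theta):          ", -kl_bern(p, theta))
```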
Frequentist Licenses for Bayesian Procedures
Questions statisticians care about:
1 What are the asymptotics of Πt and πt?
2 When does Πt converge P-a.s., and how fast?
3 When is Bayesian learning consistent, i.e., $\forall$ neighborhoods $N$ of $P$, $\Pi_t(N) \to 1$ (a.s. $P$)?
4 What does Bayesian learning do when it's not consistent?
What biologists know:
Theorem (Fisher's Fundamental Theorem of Natural Selection)
With a fixed fitness function, the population mean fitness grows at a rate equal to the variance of the fitness in the population

Assumes fixed fitness function, finite space of types, etc., so no direct application. . .
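For reference, the discrete-time replicator update from the analogy slide gives a closely related identity; this is a standard calculation, included only to connect the theorem to the update rule. Writing $w(\theta)$ for a fixed fitness and $\bar w_t \equiv \mathbf{E}_t[w]$ for its mean under the time-$t$ population distribution $p_t$:
$$
p_{t+1}(\theta) = p_t(\theta)\,\frac{w(\theta)}{\bar w_t}
\;\Longrightarrow\;
\bar w_{t+1} = \frac{\mathbf{E}_t[w^2]}{\bar w_t}
\;\Longrightarrow\;
\bar w_{t+1} - \bar w_t = \frac{\mathbf{E}_t[w^2] - \bar w_t^2}{\bar w_t} = \frac{\operatorname{Var}_t(w)}{\bar w_t};
$$
the continuous-time version quoted in the theorem is the same statement without the $\bar w_t$ in the denominator.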
A Dead End
Theorem (Doob)
If any consistent estimator exists, the set of sample paths along
which Πt does not concentrate on P has Π0-measure 0
. . . even if $P \notin \operatorname{supp}\Pi_0$
Diaconis and Freedman: inconsistency for IID data from $\mathbb{R}$, with bad but natural priors
Many results since on Bayesian consistency, with strong assumptions about dynamics, models, and prior (basically, capacity control)
Very undesirable behavior possible for mis-specified models, e.g., oscillating forever between concentrating on wrong answers
In the Evolutionary Analogy. . .
Integrated likelihood corresponds to population-average fitness
Genotypes with highest fitness in the initial population take over
Population fitness → max fitness with positive weight in the original population
Define
$$-h(A) \equiv -\operatorname*{ess\,inf}_{\theta \in A} h(\theta)$$
Positive weight for fitnesses $\varepsilon$-close to $-h(\Theta)$, because it's an essential infimum
Suggests: $\Pi_t(A) \to 0$ unless $h(A) = h(\Theta)$
Lower Bound on the Integrated Likelihood
$\Pi_0(R_t)$ is hardly ever much below $\exp\{-t\,h(\Theta)\}$

Lemma (The integrated likelihood tracks the best)
$$\forall \varepsilon > 0, \quad P\left\{x_1^\infty : \Pi_0(R_t) \le \exp\{-t(h(\Theta)+\varepsilon)\} \text{ i.o.}\right\} = 0$$
So
$$\liminf_{t\to\infty} \frac{1}{t}\log\Pi_0(R_t) \ge -h(\Theta) \quad \text{a.s.}$$

Theorem (Upper bound for posterior density)
$\forall \theta$ where $\pi_0(\theta) > 0$,
$$\limsup_{t\to\infty} \frac{1}{t}\log\pi_t(\theta) \le h(\Theta) - h(\theta) \quad \text{a.s.}$$
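Heuristically, the theorem follows from the lemma plus the AEP by decomposing the log posterior density (the decomposition is implicit in the slides and spelled out here):
$$
\frac{1}{t}\log\pi_t(\theta)
= \underbrace{\frac{1}{t}\log\pi_0(\theta)}_{\to\,0}
\;+\;\underbrace{\frac{1}{t}\log R_t(\theta)}_{\to\,-h(\theta)\ \text{(AEP)}}
\;-\;\underbrace{\frac{1}{t}\log\Pi_0(R_t)}_{\liminf\,\ge\,-h(\Theta)\ \text{(lemma)}},
$$
so the lim sup of the left-hand side is at most $h(\Theta) - h(\theta)$.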
Why isn't it easy to upper bound the integrated likelihood?
Consider the hypothesis "$x_1^t$ repeats forever with probability 1"
"Explains" the data perfectly, but generically has $h = \infty$
Lots of hypotheses like this! "$x_1^t x_t^1$ repeats forever" also explains $x_1^t$ perfectly
plus even more which just give probabilities close to 1
Individually fragile, collectively high capacity
Very slow convergence, e.g., $x_1^{10^{100}}$ looks great for a long time
$\Pi_0$ can't give them too much weight or they dominate everything
Need to rule out priors full of fail
In the Evolutionary Analogy. . .
The bad $\theta$ are like genotypes which are super-specialized to very particular environments
So they do better than generalists. . .
. . . until that really narrow niche vanishes, never to return
. . . but that could take a really long time
Need to keep them from dominating the population in the first place
Lemma (Life is easier with uniform convergence)
For any $G$ where $t^{-1}\log R_t(\theta)$ converges uniformly and $\Pi_0(G) > 0$,
$$\limsup_{t\to\infty} \frac{1}{t}\log\Pi_0(G R_t) \le -h(G) \quad \text{a.s.}$$

Uniform convergence keeps the bad hypotheses from taking over
The posterior would be well-behaved if the AEP converged uniformly over all $\Theta$
But it does not, in any interesting problem
Try approximating $\Theta$ by uniform-convergence sets
Assumption (The Good Sets of the Sieve)
There exists a sequence of sets Gt → Θ such that
1 $\Pi_0(G_t) \ge 1 - \alpha\exp\{-t\beta\}$, for some $\alpha > 0$, $\beta > 2h(\Theta)$;
2 $t^{-1}\log R_t(\theta)$ converges uniformly in $\theta$ over each $G_t$;
3 $h(G_t) \to h(\Theta)$.

Method of sieves (Grenander): constrain estimates to lie in a good set $G_t$, to avoid over-fitting, but relax the constraint as $t$ grows to keep consistency (mustn't relax too fast)
$\{G_t\}$ is like a sieve, but will not have to impose a hard constraint
More on the Good Sets
The first 2 parts of the Good Sets assumption are freebies from classic measure theory:

Theorem (Egorov)
If a sequence of finite measurable functions $f_t(\theta)$ converges pointwise to a finite, measurable function $f(\theta)$ for $\Pi_0$-almost-all $\theta \in \Theta$, then for each $\varepsilon > 0$, there is a $G_\varepsilon \subset \Theta$ such that $\Pi_0(G_\varepsilon) \ge \Pi_0(\Theta) - \varepsilon$, and the convergence is uniform on $G_\varepsilon$

The third part rules out low-divergence models always being the slowest to converge, and doesn't seem to be a freebie
Lemma
Almost surely,
$$\Pi_0\!\left(R_t G_t^c\right) \le \exp\{-t\beta/2\}$$
for all but finitely many $t$.

Actually works for any sequence of sets with exponentially-shrinking prior weight

Lemma (Why $\beta > 2h(\Theta)$)
$$\Pi_t(G_t) = \frac{\Pi_0(R_t G_t)}{\Pi_0(R_t)} \to 1$$
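A sketch of the arithmetic behind the second lemma, assembled from the two bounds already stated (not a verbatim proof from the talk): eventually almost surely,
$$
\Pi_t(G_t^c) = \frac{\Pi_0(R_t G_t^c)}{\Pi_0(R_t)}
\le \frac{\exp\{-t\beta/2\}}{\exp\{-t(h(\Theta)+\varepsilon)\}}
= \exp\{-t(\beta/2 - h(\Theta) - \varepsilon)\} \to 0
$$
for any $\varepsilon < \beta/2 - h(\Theta)$, which is where the requirement $\beta > 2h(\Theta)$ earns its keep.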
Sufficiently Rapid Convergence

Definition (One-sided convergence times)
$\forall A \subseteq \Theta$, $\forall \delta > 0$, $\exists$ random time $\tau(A, \delta)$ such that
$$t^{-1}\log\Pi_0(A R_t) \le \delta + \limsup_{t\to\infty} t^{-1}\log\Pi_0(A R_t)$$
for all $t > \tau(A, \delta)$, provided the lim sup is finite.

The good sets need to be sufficiently converged by the time they're called for

Assumption (Sufficiently Rapid Convergence on Good Sets)
The good sets $G_t$ can be chosen so that, for every $\delta$, $t \ge \tau(G_t, \delta)$ holds a.s. for all sufficiently large $t$.
Lemma (The integrated likelihood doesn't do better than the best)
Assume sufficiently rapid convergence. Then, $P$-a.s.,
$$\limsup_{t\to\infty} \frac{1}{t}\log\Pi_0(R_t) \le -h(\Theta)$$

Theorem (Decay rate of the posterior density)
Assuming the AEP and sufficiently rapid convergence, for all $\theta \in \Theta$ where $\pi_0(\theta) > 0$,
$$\lim_{t\to\infty} \frac{1}{t}\log\pi_t(\theta) = h(\Theta) - h(\theta) \quad \text{a.s.}$$

The form of the rapid-convergence assumption is not natural
Would like to replace it with a more primitive condition, even if it's a stronger one (e.g., metric entropy conditions)
Convergence and Consistency
Assumption (Good Sets Are Good Even for Subsets)
The good sets Gt can be chosen so that, for any set A with
Π0(A) > 0, h(Gt ∩ A)→ h(A).
Does not seem to follow automatically, but also don't have a counter-example
Theorem (Posterior convergence)
If $h(A) > h(\Theta)$, then $\Pi_t(A) \to 0$ a.s.
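The heuristic behind the theorem, using only quantities already defined (the exact version is the large-deviations-style result on the next slide):
$$
\Pi_t(A) = \frac{\Pi_0(R_t A)}{\Pi_0(R_t)}
\approx \frac{\exp\{-t\,h(A)\}}{\exp\{-t\,h(\Theta)\}}
= \exp\{-t(h(A) - h(\Theta))\} \to 0
\quad\text{whenever } h(A) > h(\Theta).
$$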
Not Quite a Real Large Deviations Principle
Theorem
If $A$ is such that
$$-\limsup_{t\to\infty} t^{-1}\log\Pi_0\!\left(A \cap G_t^c\right) \ge 2h(A)$$
then
$$\lim_{t\to\infty} \frac{1}{t}\log\Pi_t(A) = h(\Theta) - h(A)$$
This holds whenever $2h(A) < \beta$ or $A \subset \bigcap_{k=n}^{\infty} G_k$ for some $n$.

≈ a region of sub-optimal fitness ends up shrinking exponentially fast; the remnant is dominated by its highest-fitness sub-regions
Extends to rate of posterior convergence
Prediction
One-step-ahead predictive distributions for models and reality:
$$F_\theta\!\left(X_t \mid \sigma(X_1^{t-1})\right), \qquad P\!\left(X_t \mid \sigma(X_1^{t-1})\right)$$
Posterior predictive distribution is a mixture, $F_\Pi^t \equiv \int_\Theta F_\theta^t \, d\Pi_t(\theta)$

Theorem (Predictive performance)
$$\limsup_{t\to\infty} \rho^2_{\mathrm{Hellinger}}\!\left(P^t, F_\Pi^t\right) \le h(\Theta) \quad \text{a.s.}$$
$$\limsup_{t\to\infty} \rho^2_{\mathrm{TV}}\!\left(P^t, F_\Pi^t\right) \le 4h(\Theta) \quad \text{a.s.}$$

Haven't managed to show $\limsup_{t\to\infty} D_{KL}\!\left(P^t \,\|\, F_\Pi^t\right) \le h(\Theta)$
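Continuing the illustrative Bernoulli grid from earlier (my toy, not the talk's example), the posterior predictive really is just the posterior-weighted mixture of the per-hypothesis predictions; all constants below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior predictive as a mixture, in the illustrative Bernoulli grid model
# (my toy, not the talk's example): F^t_Pi(X_{t+1}=1) is the posterior-weighted
# average of each hypothesis's prediction theta.
true_p = 0.7
thetas = np.linspace(0.05, 0.95, 19)
log_post = np.full(len(thetas), -np.log(len(thetas)))   # start at the prior

for _ in range(2000):
    xt = rng.binomial(1, true_p)
    log_post += np.where(xt == 1, np.log(thetas), np.log1p(-thetas))
    log_post -= np.logaddexp.reduce(log_post)

post = np.exp(log_post)
pred_one = np.sum(post * thetas)    # posterior predictive P(X_{t+1} = 1)
print("posterior predictive P(X=1):", pred_one, " (true value 0.7)")
```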
A Tractable Example
[State-transition diagram: a two-state hidden chain on states A and B, with edges labeled "p = 1.0 : '1'", "p = 0.5 : '1'", and "p = 0.5 : '0'".]
X = the 0/1 process, not the A/B process
Θ = all finite-order Markov chains with non-zero transition probabilities
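As a rough illustration only: the snippet below simulates one possible reading of the A/B diagram (which state gets which edges is my assumption, as are all the constants) and scores finite-order Markov predictors by per-symbol log-loss, to suggest why no finite order ever suffices.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)

# One possible reading of the A/B diagram (an assumption, not the talk's
# stated chain): in state A, emit '0' or '1' with probability 1/2 each,
# moving to B on a '1'; in state B, emit '1' with probability 1, returning to A.
def simulate(T):
    out, state = [], "A"
    for _ in range(T):
        if state == "A":
            if rng.random() < 0.5:
                out.append(0)              # stay in A
            else:
                out.append(1); state = "B"
        else:
            out.append(1); state = "A"     # B always emits '1'
    return out

# Per-symbol log-loss of an order-k Markov predictor with add-one smoothing,
# fit on the first half of the data and scored on the second half.
def markov_log_loss(x, k):
    half = len(x) // 2
    counts = defaultdict(lambda: np.ones(2))    # Laplace smoothing
    for i in range(k, half):
        counts[tuple(x[i - k:i])][x[i]] += 1
    loss, n = 0.0, 0
    for i in range(half + k, len(x)):
        c = counts[tuple(x[i - k:i])]
        loss -= np.log(c[x[i]] / c.sum()); n += 1
    return loss / n

x = simulate(100_000)
for k in range(1, 8):
    print(f"order {k}: log-loss per symbol = {markov_log_loss(x, k):.4f}")
# No finite order captures the 0/1 process exactly; in the talk's example the
# posterior accordingly keeps shifting toward longer memories (order O(log t)).
```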
P is stationary and ergodic
The A/B process is a Markov chain
The 0/1 process is not Markov at any finite order
Consistent non-Bayesian reconstruction algorithms exist
$G_t$ = chains of order $\le c_1 \log t - 1$, with smallest log transition probability no smaller than $-c_2 t^{c_3}$, all constants explicit
Prior: arbitrary, so long as $\Pi_0(G_t^c) \le \alpha\exp\{-\beta t\}$ for some $\alpha, \beta > 0$
What Happens?
Posterior never concentrates on any one hypothesis
Every compact $K \subset \Theta$ has $h(K) > h(\Theta) = 0$, so $\Pi_t(K) \to 0$
Always shifts to higher-order chains with more extreme transitions
Most posterior weight on chains of order $O(\log t)$ but no higher
Horrible, data-memorizing hypotheses have exponentially higher likelihood but exponentially small posterior weight
Predictions get better and better
Never gets questions like "Is P(010) > 0?" right (though e.g. Shalizi & Klinkner 2004 does)
What Makes Convergence Work?
The crucial assumptions:
1 Asymptotic equipartition: hard to see how Bayes could possibly work if past likelihood doesn't indicate future behavior
2 Building a sieve of sets where convergence is increasingly ill-behaved, but never really out of control; very mis-specified models need tighter sieves, $\beta > 2h(\Theta)$
3 Keeping the prior weight of the remaining bad sets exponentially small
Conclusion
1 Bayes is evolutionary search for good predictors among fixed candidates
2 With enough capacity control built into the prior, the posterior concentrates on the best available approximations to P
3 For this to fail, the AEP has to fail for positive-measure subsets of Θ: bad luck bordering on conspiracy
4 Predictions become averages over hypotheses which have managed to predict well
5 Model averaging and the prior are regularization