TRANSCRIPT
Bayesian Updating, Evolutionary Dynamics and Relative Entropy
Cosma Shalizi
Statistics Department, Carnegie Mellon University
Santa Fe Institute
29 September 2010, Allerton
What's the Problem?
Question: Why does Bayes (often) work, even when parameters aren't randomly generated?
Answer: For the same reason that evolution often works. Bayes is natural selection, without mutation, sex, or any of the other good parts.
Question: If Bayes is evolution, what is the fitness function?
Answer: The relative entropy rate. Bayes is evolutionary search with an information-theoretic objective function.
Question: What is this observation good for?
Answer: Understanding what happens with mis-specification and dependence.
Setting/Notation
Data $X_1, X_2, \ldots$ from a stochastic process $P$
Family $\Theta$ of stochastic processes/hypotheses $F_\theta$, indexed by $\theta$
Assume: $h(\theta) \equiv \lim_{t\to\infty} \frac{1}{t} D_{KL}\!\left(P(X_1^t) \,\middle\|\, F_\theta(X_1^t)\right)$ exists $\forall \theta$
Prior $\Pi_0$ over $\Theta$; posterior after $X_1, X_2, \ldots, X_t \equiv X_1^t$ is $\Pi_t$
Will not assume $P \in \operatorname{supp}\Pi_0$, or even $P \in \Theta$
Update via Bayes's rule:
$$\Pi_t(A) = \frac{\int_A d\Pi_0(\theta)\, f_\theta(x_1^t)}{\int_\Theta d\Pi_0(\theta)\, f_\theta(x_1^t)}
= \frac{\int_A d\Pi_0(\theta)\, \frac{f_\theta(x_1^t)}{p(x_1^t)}}{\int_\Theta d\Pi_0(\theta)\, \frac{f_\theta(x_1^t)}{p(x_1^t)}}
= \frac{\Pi_0(R_t A)}{\Pi_0(R_t)}$$
or
$$\pi_t(\theta) = \frac{\pi_0(\theta)\, R_t(\theta)}{\Pi_0(R_t)}$$
where $R_t(\theta) \equiv f_\theta(x_1^t)/p(x_1^t)$ is the likelihood ratio and $\Pi_0(R_t A) \equiv \int_A R_t(\theta)\, d\Pi_0(\theta)$
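To make the notation concrete, here is a minimal numerical sketch of the update (my toy example, not the talk's): IID Bernoulli data standing in for $P$, a finite grid of Bernoulli hypotheses standing in for $\Theta$, and a uniform prior; the constants (true_p = 0.7, the grid, the sample size) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the notation above (my example, not the talk's): the data
# are IID Bernoulli(0.7), the hypotheses F_theta are Bernoulli(theta) for
# theta on a grid standing in for Theta, and Pi_0 is uniform on that grid.
true_p = 0.7                                   # "reality" P (illustrative)
thetas = np.linspace(0.05, 0.95, 19)           # the hypothesis grid
log_post = np.full(len(thetas), -np.log(len(thetas)))   # log pi_0(theta)

x = rng.binomial(1, true_p, size=500)          # X_1, ..., X_t

# Sequential Bayes: multiply by the conditional likelihood and renormalize.
# The common factor p(x_1^t) in R_t(theta) cancels in the normalization,
# so it can be dropped from the computation.
for xt in x:
    log_post += np.where(xt == 1, np.log(thetas), np.log1p(-thetas))
    log_post -= np.logaddexp.reduce(log_post)  # divide by Pi_0(R_t)

posterior = np.exp(log_post)
print("posterior mode:", thetas[posterior.argmax()])   # should be near 0.7
```

Renormalizing the log weights at each step is the same division by $\Pi_0(R_t)$ as in the displayed formula, just done incrementally.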
The Central Analogy: Bayes ≺ Darwin-Wallace
$$\pi_{t+1}(\theta) = \pi_t(\theta)\,\frac{R_{t+1}(\theta)/R_t(\theta)}{\Pi_t\!\left(R_{t+1}/R_t\right)}$$
above-average conditional likelihood ⇒ increased posterior weight
This is the replicator dynamic from evolutionary theory:
simple hypotheses $\theta$ ⇔ genotypes
posterior $\Pi_t$ ⇔ population distribution
conditional likelihood $R_{t+1}(\theta)/R_t(\theta)$ ⇔ time-varying fitness
divergence rate $h(\theta)$ ⇔ long-run average fitness
First three are immediate; the last is not
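For completeness, the incremental form above is just the batch form from the notation slide, unpacked (a routine check, spelled out here rather than quoted from the talk):
$$
\frac{\pi_{t+1}(\theta)}{\pi_t(\theta)}
= \frac{\pi_0(\theta)\,R_{t+1}(\theta)/\Pi_0(R_{t+1})}{\pi_0(\theta)\,R_t(\theta)/\Pi_0(R_t)}
= \frac{R_{t+1}(\theta)}{R_t(\theta)}\cdot\frac{\Pi_0(R_t)}{\Pi_0(R_{t+1})},
\qquad
\frac{\Pi_0(R_{t+1})}{\Pi_0(R_t)}
= \int_\Theta \frac{R_{t+1}(\theta)}{R_t(\theta)}\,\frac{R_t(\theta)\,d\Pi_0(\theta)}{\Pi_0(R_t)}
= \Pi_t\!\left(\frac{R_{t+1}}{R_t}\right).
$$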
Past performance is indicative of future results
Assumption (Asymptotic equipartition property (AEP), Shannon-McMillan-Breiman)
$$\forall \theta \in \Theta, \quad \lim_{t\to\infty} \frac{1}{t}\log R_t(\theta) = -h(\theta) \quad \text{a.s.}$$

Lemma (Almost-universal-quantifier exchange)
With $P$-probability 1, the set of $\theta$ on which $t^{-1}\log R_t(\theta)$ does not converge has $\Pi_0$ measure 0

. . . so $h(\theta)$ really is long-run fitness
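A quick numerical illustration of the AEP statement in the IID Bernoulli toy model from before (again my example, with arbitrary constants): here $h(\theta)$ reduces to the per-symbol KL divergence, and the running average of the log likelihood ratio should settle near $-h(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative check of the AEP in the IID Bernoulli toy model (my example,
# not the talk's): h(theta) is the per-symbol KL divergence
# D(Bern(p) || Bern(theta)), and t^{-1} log R_t(theta) should approach -h(theta).
p, theta = 0.7, 0.5
T = 200_000
x = rng.binomial(1, p, size=T)

def kl_bern(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# log R_t(theta) is a sum of per-symbol log likelihood ratios log f_theta / p
per_symbol = np.where(x == 1, np.log(theta / p), np.log((1 - theta) / (1 - p)))
avg = np.cumsum(per_symbol) / np.arange(1, T + 1)

print("t^-1 log R_t at t=T:", avg[-1])
print("-h(theta):          ", -kl_bern(p, theta))
```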
Frequentist Licenses for Bayesian Procedures
Questions statisticians care about:
1 What are the asymptotics of Πt and πt?
2 When does Πt converge P-a.s., and how fast?
3 When is Bayesian learning consistent, i.e., $\forall$ neighborhoods $N$ of $P$, $\Pi_t(N) \to 1$ (a.s. $P$)?
4 What does Bayesian learning do when it's not consistent?
What biologists know:
Theorem (Fisher's Fundamental Theorem of Natural Selection)
With a fixed fitness function, the population mean fitness grows at a rate equal to the variance of the fitness in the population

Assumes fixed fitness function, finite space of types, etc., so no direct application. . .
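For reference, the discrete-time replicator update from the analogy slide gives a closely related identity; this is a standard calculation, included only to connect the theorem to the update rule. Writing $w(\theta)$ for a fixed fitness and $\bar w_t \equiv \mathbf{E}_t[w]$ for its mean under the time-$t$ population distribution $p_t$:
$$
p_{t+1}(\theta) = p_t(\theta)\,\frac{w(\theta)}{\bar w_t}
\;\Longrightarrow\;
\bar w_{t+1} = \frac{\mathbf{E}_t[w^2]}{\bar w_t}
\;\Longrightarrow\;
\bar w_{t+1} - \bar w_t = \frac{\mathbf{E}_t[w^2] - \bar w_t^2}{\bar w_t} = \frac{\operatorname{Var}_t(w)}{\bar w_t};
$$
the continuous-time version quoted in the theorem is the same statement without the $\bar w_t$ in the denominator.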
A Dead End
Theorem (Doob)
If any consistent estimator exists, the set of sample paths along
which Πt does not concentrate on P has Π0-measure 0
. . . even if $P \notin \operatorname{supp}\Pi_0$
Diaconis and Freedman: inconsistency for IID data from $\mathbb{R}$, with bad but natural priors
Many results since on Bayesian consistency, with strong assumptions about dynamics, models, and prior (basically, capacity control)
Very undesirable behavior possible for mis-specified models, e.g., oscillating forever between concentrating on wrong answers
In the Evolutionary Analogy. . .
Integrated likelihood corresponds to population-average fitness
Genotypes with highest fitness in the initial population take over
Population fitness → max fitness with positive weight in the original population
Define
$$-h(A) \equiv -\operatorname*{ess\,inf}_{\theta \in A} h(\theta)$$
Positive weight for fitnesses $\varepsilon$-close to $-h(\Theta)$, because it's an essential infimum
Suggests: $\Pi_t(A) \to 0$ unless $h(A) = h(\Theta)$
Lower Bound on the Integrated Likelihood
$\Pi_0(R_t)$ is hardly ever much below $\exp\{-t\,h(\Theta)\}$

Lemma (The integrated likelihood tracks the best)
$$\forall \varepsilon > 0, \quad P\left\{x_1^\infty : \Pi_0(R_t) \le \exp\{-t(h(\Theta)+\varepsilon)\} \text{ i.o.}\right\} = 0$$
So
$$\liminf_{t\to\infty} \frac{1}{t}\log\Pi_0(R_t) \ge -h(\Theta) \quad \text{a.s.}$$

Theorem (Upper bound for posterior density)
$\forall \theta$ where $\pi_0(\theta) > 0$,
$$\limsup_{t\to\infty} \frac{1}{t}\log\pi_t(\theta) \le h(\Theta) - h(\theta) \quad \text{a.s.}$$
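Heuristically, the theorem follows from the lemma plus the AEP by decomposing the log posterior density (the decomposition is implicit in the slides and spelled out here):
$$
\frac{1}{t}\log\pi_t(\theta)
= \underbrace{\frac{1}{t}\log\pi_0(\theta)}_{\to\,0}
\;+\;\underbrace{\frac{1}{t}\log R_t(\theta)}_{\to\,-h(\theta)\ \text{(AEP)}}
\;-\;\underbrace{\frac{1}{t}\log\Pi_0(R_t)}_{\liminf\,\ge\,-h(\Theta)\ \text{(lemma)}},
$$
so the lim sup of the left-hand side is at most $h(\Theta) - h(\theta)$.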
Why isn't it easy to upper bound the integrated likelihood?
Consider the hypothesis "$x_1^t$ repeats forever with probability 1"
"Explains" the data perfectly, but generically has $h = \infty$
Lots of hypotheses like this! "$x_1^t x_t^1$ repeats forever" also explains $x_1^t$ perfectly
plus even more which just give probabilities close to 1
Individually fragile, collectively high capacity
Very slow convergence, e.g., $x_1^{10^{100}}$ looks great for a long time
$\Pi_0$ can't give them too much weight or they dominate everything
Need to rule out priors full of fail
In the Evolutionary Analogy. . .
The bad $\theta$ are like genotypes which are super-specialized to very particular environments
So they do better than generalists. . .
. . . until that really narrow niche vanishes, never to return
. . . but that could take a really long time
Need to keep them from dominating the population in the first place
Lemma (Life is easier with uniform convergence)
For any $G$ where $t^{-1}\log R_t(\theta)$ converges uniformly and $\Pi_0(G) > 0$,
$$\limsup_{t\to\infty} \frac{1}{t}\log\Pi_0(G R_t) \le -h(G) \quad \text{a.s.}$$

Uniform convergence keeps the bad hypotheses from taking over
The posterior would be well-behaved if the AEP converged uniformly over all $\Theta$
But it does not, in any interesting problem
Try approximating $\Theta$ by uniform-convergence sets
Assumption (The Good Sets of the Sieve)
There exists a sequence of sets Gt → Θ such that
1 $\Pi_0(G_t) \ge 1 - \alpha\exp\{-t\beta\}$, for some $\alpha > 0$, $\beta > 2h(\Theta)$;
2 $t^{-1}\log R_t(\theta)$ converges uniformly in $\theta$ over each $G_t$;
3 $h(G_t) \to h(\Theta)$.

Method of sieves (Grenander): constrain estimates to lie in a good set $G_t$, to avoid over-fitting, but relax the constraint as $t$ grows to keep consistency (mustn't relax too fast)
$\{G_t\}$ is like a sieve, but will not have to impose a hard constraint
More on the Good Sets
The first 2 parts of the Good Sets assumption are freebies from classic measure theory:

Theorem (Egorov)
If a sequence of finite measurable functions $f_t(\theta)$ converges pointwise to a finite, measurable function $f(\theta)$ for $\Pi_0$-almost-all $\theta \in \Theta$, then for each $\varepsilon > 0$, there is a $G_\varepsilon \subset \Theta$ such that $\Pi_0(G_\varepsilon) \ge \Pi_0(\Theta) - \varepsilon$, and the convergence is uniform on $G_\varepsilon$

The third part rules out low-divergence models always being the slowest to converge, and doesn't seem to be a freebie
Lemma
Almost surely,
$$\Pi_0\!\left(R_t G_t^c\right) \le \exp\{-t\beta/2\}$$
for all but finitely many $t$.

Actually works for any sequence of sets with exponentially-shrinking prior weight

Lemma (Why $\beta > 2h(\Theta)$)
$$\Pi_t(G_t) = \frac{\Pi_0(R_t G_t)}{\Pi_0(R_t)} \to 1$$
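A sketch of the arithmetic behind the second lemma, assembled from the two bounds already stated (not a verbatim proof from the talk): eventually almost surely,
$$
\Pi_t(G_t^c) = \frac{\Pi_0(R_t G_t^c)}{\Pi_0(R_t)}
\le \frac{\exp\{-t\beta/2\}}{\exp\{-t(h(\Theta)+\varepsilon)\}}
= \exp\{-t(\beta/2 - h(\Theta) - \varepsilon)\} \to 0
$$
for any $\varepsilon < \beta/2 - h(\Theta)$, which is where the requirement $\beta > 2h(\Theta)$ earns its keep.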
Sufficiently Rapid Convergence

Definition (One-sided convergence times)
$\forall A \subseteq \Theta$, $\forall \delta > 0$, $\exists$ random time $\tau(A, \delta)$ such that
$$t^{-1}\log\Pi_0(A R_t) \le \delta + \limsup_{t\to\infty} t^{-1}\log\Pi_0(A R_t)$$
for all $t > \tau(A, \delta)$, provided the lim sup is finite.

The good sets need to be sufficiently converged by the time they're called for

Assumption (Sufficiently Rapid Convergence on Good Sets)
The good sets $G_t$ can be chosen so that, for every $\delta$, $t \ge \tau(G_t, \delta)$ holds a.s. for all sufficiently large $t$.
Lemma (The integrated likelihood doesn't do better than the best)
Assume sufficiently rapid convergence. Then, $P$-a.s.,
$$\limsup_{t\to\infty} \frac{1}{t}\log\Pi_0(R_t) \le -h(\Theta)$$

Theorem (Decay rate of the posterior density)
Assuming the AEP and sufficiently rapid convergence, for all $\theta \in \Theta$ where $\pi_0(\theta) > 0$,
$$\lim_{t\to\infty} \frac{1}{t}\log\pi_t(\theta) = h(\Theta) - h(\theta) \quad \text{a.s.}$$

The form of the rapid-convergence assumption is not natural
Would like to replace it with a more primitive condition, even if it's a stronger one (e.g., metric entropy conditions)
Convergence and Consistency
Assumption (Good Sets Are Good Even for Subsets)
The good sets Gt can be chosen so that, for any set A with
Π0(A) > 0, h(Gt ∩ A)→ h(A).
Does not seem to follow automatically, but also don't have a counter-example
Theorem (Posterior convergence)
If $h(A) > h(\Theta)$, then $\Pi_t(A) \to 0$ a.s.
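The heuristic behind the theorem, using only quantities already defined (the exact version is the large-deviations-style result on the next slide):
$$
\Pi_t(A) = \frac{\Pi_0(R_t A)}{\Pi_0(R_t)}
\approx \frac{\exp\{-t\,h(A)\}}{\exp\{-t\,h(\Theta)\}}
= \exp\{-t(h(A) - h(\Theta))\} \to 0
\quad\text{whenever } h(A) > h(\Theta).
$$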
Not Quite a Real Large Deviations Principle
Theorem
If $A$ is such that
$$-\limsup_{t\to\infty} t^{-1}\log\Pi_0\!\left(A \cap G_t^c\right) \ge 2h(A)$$
then
$$\lim_{t\to\infty} \frac{1}{t}\log\Pi_t(A) = h(\Theta) - h(A)$$
This holds whenever $2h(A) < \beta$ or $A \subset \bigcap_{k=n}^{\infty} G_k$ for some $n$.

≈ a region of sub-optimal fitness ends up shrinking exponentially fast; the remnant is dominated by its highest-fitness sub-regions
Extends to rate of posterior convergence
Prediction
One-step-ahead predictive distributions for models and reality:
$$F_\theta\!\left(X_t \mid \sigma(X_1^{t-1})\right), \qquad P\!\left(X_t \mid \sigma(X_1^{t-1})\right)$$
Posterior predictive distribution is a mixture, $F_\Pi^t \equiv \int_\Theta F_\theta^t \, d\Pi_t(\theta)$

Theorem (Predictive performance)
$$\limsup_{t\to\infty} \rho^2_{\mathrm{Hellinger}}\!\left(P^t, F_\Pi^t\right) \le h(\Theta) \quad \text{a.s.}$$
$$\limsup_{t\to\infty} \rho^2_{\mathrm{TV}}\!\left(P^t, F_\Pi^t\right) \le 4h(\Theta) \quad \text{a.s.}$$

Haven't managed to show $\limsup_{t\to\infty} D_{KL}\!\left(P^t \,\|\, F_\Pi^t\right) \le h(\Theta)$
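Continuing the illustrative Bernoulli grid from earlier (my toy, not the talk's example), the posterior predictive really is just the posterior-weighted mixture of the per-hypothesis predictions; all constants below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior predictive as a mixture, in the illustrative Bernoulli grid model
# (my toy, not the talk's example): F^t_Pi(X_{t+1}=1) is the posterior-weighted
# average of each hypothesis's prediction theta.
true_p = 0.7
thetas = np.linspace(0.05, 0.95, 19)
log_post = np.full(len(thetas), -np.log(len(thetas)))   # start at the prior

for _ in range(2000):
    xt = rng.binomial(1, true_p)
    log_post += np.where(xt == 1, np.log(thetas), np.log1p(-thetas))
    log_post -= np.logaddexp.reduce(log_post)

post = np.exp(log_post)
pred_one = np.sum(post * thetas)    # posterior predictive P(X_{t+1} = 1)
print("posterior predictive P(X=1):", pred_one, " (true value 0.7)")
```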
A Tractable Example
[State-transition diagram: a two-state hidden chain on states A and B, with edges labeled "p = 1.0 : '1'", "p = 0.5 : '1'", and "p = 0.5 : '0'".]
X = the 0/1 process, not the A/B process
Θ = all finite-order Markov chains with non-zero transition probabilities
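As a rough illustration only: the snippet below simulates one possible reading of the A/B diagram (which state gets which edges is my assumption, as are all the constants) and scores finite-order Markov predictors by per-symbol log-loss, to suggest why no finite order ever suffices.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)

# One possible reading of the A/B diagram (an assumption, not the talk's
# stated chain): in state A, emit '0' or '1' with probability 1/2 each,
# moving to B on a '1'; in state B, emit '1' with probability 1, returning to A.
def simulate(T):
    out, state = [], "A"
    for _ in range(T):
        if state == "A":
            if rng.random() < 0.5:
                out.append(0)              # stay in A
            else:
                out.append(1); state = "B"
        else:
            out.append(1); state = "A"     # B always emits '1'
    return out

# Per-symbol log-loss of an order-k Markov predictor with add-one smoothing,
# fit on the first half of the data and scored on the second half.
def markov_log_loss(x, k):
    half = len(x) // 2
    counts = defaultdict(lambda: np.ones(2))    # Laplace smoothing
    for i in range(k, half):
        counts[tuple(x[i - k:i])][x[i]] += 1
    loss, n = 0.0, 0
    for i in range(half + k, len(x)):
        c = counts[tuple(x[i - k:i])]
        loss -= np.log(c[x[i]] / c.sum()); n += 1
    return loss / n

x = simulate(100_000)
for k in range(1, 8):
    print(f"order {k}: log-loss per symbol = {markov_log_loss(x, k):.4f}")
# No finite order captures the 0/1 process exactly; in the talk's example the
# posterior accordingly keeps shifting toward longer memories (order O(log t)).
```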
P is stationary and ergodic
The A/B process is a Markov chain
The 0/1 process is not Markov at any finite order
Consistent non-Bayesian reconstruction algorithms exist
$G_t$ = chains of order $\le c_1 \log t - 1$, with smallest log transition probability no smaller than $-c_2 t^{c_3}$, all constants explicit
Prior: arbitrary, so long as $\Pi_0(G_t^c) \le \alpha\exp\{-\beta t\}$ for some $\alpha, \beta > 0$
What Happens?
Posterior never concentrates on any one hypothesis
Every compact $K \subset \Theta$ has $h(K) > h(\Theta) = 0$, so $\Pi_t(K) \to 0$
Always shifts to higher-order chains with more extreme transitions
Most posterior weight on chains of order $O(\log t)$ but no higher
Horrible, data-memorizing hypotheses have exponentially higher likelihood but exponentially small posterior weight
Predictions get better and better
Never gets questions like "Is P(010) > 0?" right (though e.g. Shalizi & Klinkner 2004 does)
What Makes Convergence Work?
The crucial assumptions:
1 Asymptotic equipartition: hard to see how Bayes could possibly work if past likelihood doesn't indicate future behavior
2 Building a sieve of sets where convergence is increasingly ill-behaved, but never really out of control; very mis-specified models need tighter sieves, $\beta > 2h(\Theta)$
3 Keeping the prior weight of the remaining bad sets exponentially small
Conclusion
1 Bayes is evolutionary search for good predictors among fixed candidates
2 With enough capacity control built into the prior, the posterior concentrates on the best available approximations to P
3 For this to fail, the AEP has to fail for positive-measure subsets of Θ: bad luck bordering on conspiracy
4 Predictions become averages over hypotheses which have managed to predict well
5 Model averaging and the prior are regularization