TRANSCRIPT
Uncertainty Reasoning in Knowledge Systems
—The Bayesian approach
Herman Bruyninckx
Department of Mechanical Engineering
K.U.Leuven, Belgium
Autumn 2007
H. Bruyninckx, The Bayesian approach 1
Further reading
++ Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006. (Permanently available in CBA library: 1 681.3*I5 2006.)
+ Jensen, Finn V. and Nielsen, Thomas D., Bayesian Networks and Decision Graphs, 2001. (Available in CBA library: 1 519.2 2001.)
+ David MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003. (Available in CBA library: 1 519.72 2003. Available electronically at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html.)
H. Bruyninckx, The Bayesian approach 2
Further reading (2)
I Sebastian Thrun, Wolfram Burgard and Dieter Fox, Probabilistic Robotics, MIT Press, 2005. (Available in CBA library: 1 681.3* I29 2005.)
I Wikipedia (http://en.wikipedia.org) contains many good articles on Bayesian concepts and algorithms, and a (still rather poor) overview: http://en.wikipedia.org/wiki/Bayesian_probability.
I Kevin P. Murphy, A Brief Introduction to Graphical Models and Bayesian Networks, 1998. Available electronically at http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html.
H. Bruyninckx, The Bayesian approach 3
Further reading (3)
I On-line facsimile of Bayes’ paper: http://www.stat.ucla.edu/history/essay.pdf.
I Stephen E. Fienberg, When Did Bayesian Inference Become “Bayesian”? Bayesian Analysis, 1(1):1–40, 2006. http://lib.stat.cmu.edu/~fienberg/fienberg-BA-06-Bayesian.pdf.
H. Bruyninckx, The Bayesian approach 4
What is URKS?
Knowledge System (intelligent system; reasoning system):
I Computer program that can answer questions about the information in a real-world system (engineered, social).
I It uses relationships between variables that represent the real-world system in a mathematical way.
I Learning of the variables as well as of the relationships is an important (but not necessary) part of a knowledge system.
I Examples: weather forecasting; traffic speed detector; medical diagnosis system; legal court reasoning system; robot moving around in an unknown environment; GPS navigation system; . . .
I Counter-examples: lecture room automation system.
H. Bruyninckx, The Bayesian approach 5
Uncertainty (information, probability, possibility, plausibility, chance, . . . ):
I There can be uncertainty about: the values of the variables; the structure of the relationships; the most appropriate relationship in a set of possible relationships; etc.
I Examples: predicted weather; estimated speed of a car; identification of the car that drove too fast; probability of being guilty; correct physical theory of the origin of the universe; position of a car on the map; probability that a patient will die from cancer within six months; . . .
I Counter-examples: value of the 1000th digit of π; number of millimeters in one meter; my name is “Herman”; . . .
H. Bruyninckx, The Bayesian approach 5
Reasoning:
I System executes the “most appropriate” algorithm to find the answer to a question I pose.
I The questions can be of a very variable nature: What happens when I give this input? How can the system incorporate this new information? What would be the optimal structure of the system for these information needs? . . .
I Learning is one important example of “reasoning”.
I Decision making is another important example.
I How to reason about concepts like: opinion; act of God; personal bias; prior knowledge; . . .
H. Bruyninckx, The Bayesian approach 5
Overview of the Bayesian approach
1. Modelling — How is the information represented?
I Structure/Relationships.
I Values.
2. Reasoning/Information processing — What questions can be asked?
I Learning (both values and structure), data association, pattern recognition, classification, clustering, model building (= learning structure), . . .
3. Decision making — What do you do with the answer?
(Beyond the scope of this course.)
H. Bruyninckx, The Bayesian approach 6
1. Modelling
1.1 Mathematical representation of problem domain = parameters (“random variables”) + relationships (“graphical model,” “(Bayesian) network”).
H. Bruyninckx, The Bayesian approach 7
1.2 Information
I Probability Density Function (PDF) over parameter space: a function instead of a single number.
I Subjective representation of the human’s knowledge about—or interpretation of—the world.
[Figure: two-dimensional PDF over position, illustrating the question “Where am I?”]
H. Bruyninckx, The Bayesian approach 8
Joint PDF:
I captures uncertain/stochastic dependence between parameters.
I Graphical model = “bookkeeping” structure for computations of information processing = factorization of joint PDF.
p(a, b, c, x, . . . , α, β, η, θ) = p(x, y, z) p(c|x, y, z) p(a, b) · · ·
H. Bruyninckx, The Bayesian approach 9
2. Information processing
2.1 Inference (deduction, abduction).
2.2 Learning:
I parameters in a given model.
I structure of a new model.
2.3 Estimation (data reduction).
Major computational algorithms:
I Bayes’ rule: incorporate new information.
I Marginalization: reduce existing information.
I Belief propagation in the network: transport information.
H. Bruyninckx, The Bayesian approach 10
3. Decision making
3.1 Utility functions:
I Give “value” to information in variables. . .
I . . . to translate information into action.
I Do not influence information content in the network. (At least not directly.)
I Arbitrary!
3.2 Attached “somewhere” to the graphical model.
3.3 Computation of decision making need not be synchronised with computation of information processing.
This course: only estimation and model selection.
H. Bruyninckx, The Bayesian approach 11
Graphical models
H. Bruyninckx, The Bayesian approach 12
Graphical Models
I Best reference: Bishop 2006.
I Graphical representation of the structure of relationships:
I a node contains variables.
I an arc (edge, link, arrow, . . . ) contains a mathematical representation of a relationship.
I The model describes (a simplification of) the world in a coordinate-free way. The same model can have various mathematical representations as well as various algorithmic implementations.
I Not limited to the Bayesian paradigm.
H. Bruyninckx, The Bayesian approach 13
Joint PDF
I Bayesian paradigm: each node = joint PDF.
I The graphical model encodes conditional independence: X and Y are conditionally independent given a third variable Z if, once Z is known, knowing Y doesn’t give any additional information about X. This gives (computational) structure to a graphical model: it’s useless to do computations that involve X if one wants to know things about Y, so one can “cut off” parts of the model.
I Equivalently: if X and Y are conditionally independent given Z, then Y gives us no new information about X once we know Z.
H. Bruyninckx, The Bayesian approach 14
Directed arc — “Arrow”
I Explicit relationship z = f(x, y).
I Arrow:
I indicates which direction is “easy” to calculate.
I is not necessarily physical causality.
I Factorization of the joint PDF:
p(x, y, z) = p(z|x, y) p(x, y).
H. Bruyninckx, The Bayesian approach 15
Undirected arc
I Implicit relationship f(x, y, z) = 0.
I Factorization: not possible to write p(x, y, z) as a factorized function of p(x, y) and p(z).
I Computationally more costly than an explicit relationship.
H. Bruyninckx, The Bayesian approach 16
Hidden (“latent”) variable
Sometimes one uses a graphical indication of which parameters are “measured” and which parameters are “hidden”:
H. Bruyninckx, The Bayesian approach 17
Network with cycle
H. Bruyninckx, The Bayesian approach 18
Graphical models & Bayes paradigm
I Undirected arcs: “Markov Random Field.”
I Directed arcs: “Bayesian/Belief Network.”
I Both: “Chain Graph.”
I Wider applicability than only the Bayesian paradigm!
I Difference with neural networks: interpretation of nodes and edges in the context of the model.
I Most efficient: Directed Acyclic Graph (DAG). (Serialization of computations!)
I Can be extended with decision-making nodes.
H. Bruyninckx, The Bayesian approach 19
Directed Acyclic Graph (DAG)
p(A, B, . . . , G) = p(A) p(B) p(C|A,B) p(D|C) p(E|C) p(F|E) p(G|D,E,F).
H. Bruyninckx, The Bayesian approach 20
Example: burglar alarm
Bayesian knowledge system = graphical model + probability tables/functions over the random variables in the nodes.
H. Bruyninckx, The Bayesian approach 21
Markov Random Field
Undirected arcs.
Two (sets of) nodes A and B are made conditionally independent by a third (set) C if all paths between nodes in A and B are separated by a node in C.
In some cases, extra hidden variables can transform an MRF into a Bayesian network.
H. Bruyninckx, The Bayesian approach 22
Bayes Net: Serial connection
—Causal chain—
A and C are independent, given the information in B: p(C|B,A) = p(C|B).
Information in B “blocks” transfer of information between A and C.
A and C are d-separated by B. (“d” stands for “directed”.)
Example: knowing that there is an alarm, one knows the chance that Mary calls, independently of whether there has been a burglary or not.
H. Bruyninckx, The Bayesian approach 23
Bayes Net: Diverging connections
—Common cause—
B, C, . . . , X are independent, given the information in A: p(B|A,C, . . . ,X) = p(B|A).
Example: knowing that there is an alarm, one knows the probabilities that John and Mary call, without them having to know about each other.
Terminology: fan-out = # of A’s outgoing arcs.
H. Bruyninckx, The Bayesian approach 24
Bayes Net: Converging connections
—Common effect – Explaining away—
B, C, . . . , X are dependent, but only via the information in A: p(A,B,C) = p(B) p(C) p(A|B,C).
The information in B, C, . . . , X becomes correlated by information in their common child parameter A.
Example: if one observes an alarm, then information about whether a burglary has occurred gives information about whether an earthquake has occurred.
Terminology: fan-in = # of A’s incoming arcs.
H. Bruyninckx, The Bayesian approach 25
Bayesian Probability Theory
Nice properties:
I Unique: single answer.
I Consistent: independent of route to answer.
I Plausible: extension of 0/1 logic.
What’s in a name?
I Information = belief = evidence = possibility = probability = . . .
I Inference = information processing = belief propagation = . . .
H. Bruyninckx, The Bayesian approach 26
Arbitrary choices in modelling
1. Variables. (You must choose them!)
2. Relationships. (You must choose them!)
3. Information = PDF. (You must choose its form!)
H. Bruyninckx, The Bayesian approach 27
What is a PDF?
= a function p(x) over the space of the variable x
+ a measure “dx” that describes the “density” of the domain around the value x.
I Average/expected value, weighted by the density:
∫_D p(x) dx
I Integral over the whole variable space = 1:
I the value “1” is an arbitrary choice!
I only the relative value is important.
I Example of non-uniform density: longitude + latitude.
H. Bruyninckx, The Bayesian approach 28
(Figure source: Stefan Kuhn, Wikipedia.org.)
Mercator projection: the surface “density” is not constant over the whole map.
H. Bruyninckx, The Bayesian approach 29
Example PDF: Gaussian
Gaussian (or Normal) PDF, with mean µ = 0 and standard deviation σ = 5, 10, 20, 30.
H. Bruyninckx, The Bayesian approach 30
Example PDF
Multidimensional Gaussian
1/( √((2π)^n) ||P||^(1/2) ) exp{ −(1/2) (x − µ)^T P^(−1) (x − µ) }
[Figure: two-dimensional Gaussian PDF.]
H. Bruyninckx, The Bayesian approach 31
Mean µ—Covariance P
µ = ∫ x p(x) dx,   P = ∫ (x − µ)(x − µ)^T p(x) dx
[Figure: iso-probability ellipse of a two-dimensional Gaussian, with principal axes σ1 and σ2.]
H. Bruyninckx, The Bayesian approach 32
Gaussian PDFs (2)
Advantages:
I Only two parameters needed (per dimension of the domain).
I Information processing: (often) analytically possible.
Disadvantages:
I mono-modal (uni-modal): only one “peak”.
I extends to infinity.
Efficient extensions:
I Sum of Gaussians.
I Exponential PDFs: α h(x) exp{β g(x)}.
H. Bruyninckx, The Bayesian approach 33
Joint PDF
The multi-dimensional, single-valued information function p(X,Y,Z, . . . ) that describes the “statistical” dependencies between the variables X, Y, Z, . . . in a model M.
⇒ p(X,Y,Z, . . . |M)
Model M = representation of the relationships between the variables X, Y, Z, . . .
H. Bruyninckx, The Bayesian approach 34
Relationships = dependent variables
Two variables X and Y are (statistically) dependent if a change in the value of X is correlated with a change in the value of Y.
Correlation: one can derive (part of) the change in information about X from a change in information about Y.
Not necessarily a physical causal connection between X and Y.
H. Bruyninckx, The Bayesian approach 35
Dynamic relationships: “state-space” description
dx/dt = f(x, θ, u)
y = g(x, θ, u)
I x: domain values.
I θ: model parameters (PDF, relationships).
I u: inputs.
I y: output values.
Time is an explicit parameter, and the “horizon” of the network moves with time.
H. Bruyninckx, The Bayesian approach 36
Dynamic inference example
Mobile robot in corridor
H. Bruyninckx, The Bayesian approach 37
Bayesian mathematics
1. Sum rule
p(x + y|H) = p(x|H) + p(y|H) − p(xy|H).
Notation: “+,” “or,” “∨”.
Full version: p(X = x ∨ Y = y|H).
In words: the value of the information one has about the random variable X having the value x or the random variable Y having the value y, given the context H, is the sum of the information about the individual random variables, minus the information about both holding together. (The plain sum applies only when the two events are mutually exclusive in the given context.)
H. Bruyninckx, The Bayesian approach 38
2. Product rule
p(xy|H) = p(x|yH) p(y|H).
Notation: “multiplication,” “and,” “∧”.
Full version: p(X = x ∧ Y = y|H).
In words: the value of the information one has about the random variable X having the value x and the random variable Y having the value y is the product of the information one has about the random variable X given the value of the random variable Y, multiplied by the information one has about the random variable Y, all in the context H.
H. Bruyninckx, The Bayesian approach 39
2b. Chain rule
Apply product rule multiple times:
p(xyz|H) = p(x|yzH) p(y|zH) p(z|H).
H. Bruyninckx, The Bayesian approach 40
3. Marginalization
p(x|H) = ∫ p(x, y, z, . . . |H) dy dz . . .
(∫ → Σ for discrete PDFs.)
In words: find the information on x, taking into account the information about all other variables y, z, . . . with which x is related.
In other words: “average out” the influence of all other related variables.
H. Bruyninckx, The Bayesian approach 41
4. Bayes’ rule
(Inference—“Inverse” probability)
p(xy|H) = p(yx|H)
⇓ (product rule on both sides)
p(x|y,H) p(y|H) = p(y|x,H) p(x|H)
⇓
p(x|y,H) = p(y|x,H) p(x|H) / p(y|H)
H. Bruyninckx, The Bayesian approach 42
Bayes’ rule interpretation
p(Model params|Data,H) = p(Data|Model params,H) p(Model params|H) / p(Data|H).
“Posterior = Likelihood × Prior,” where the “Likelihood” is the conditional data likelihood p(Data|Model params,H) divided by the data likelihood p(Data|H).
Data: given; all factors: f(model parameters). p(Data|H) = (often) a “normalization factor.”
H. Bruyninckx, The Bayesian approach 43
Bayes’ rule with Gaussians
H. Bruyninckx, The Bayesian approach 44
Important properties
I PDF on Model parameters “in” ⇒ PDF on Model parameters “out.”
I Integration of information is multiplicative.
I Computationally intensive for general PDFs.
I Easy for discrete PDFs and Gaussians. (And some other families of continuous PDFs.)
I p(Data|Model params): requires a known table or mathematical function Data = f(Model params) to predict the Data from the Model.
I The likelihood is not a PDF.
H. Bruyninckx, The Bayesian approach 45
Example with discrete PDF
(Inverse probability example)
Data:
I In MAI, 31% of students are non-European.
I At URKS, 50% of students are non-European.
I At MAI, 48% of students study URKS.
Question:
I Probability that a non-European student at MAI studies URKS?
H. Bruyninckx, The Bayesian approach 46
Bayes rule with discrete PDF (2)
Model:
I A corresponds to “studying URKS in MAI.”
I B corresponds to “being non-European in MAI.”
Inference:
p(A) = .48, p(B) = .31, p(B|A) = .50
⇒ p(A|B) = p(B|A) p(A) / p(B) = .77.
H. Bruyninckx, The Bayesian approach 47
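The arithmetic above can be checked with a few lines of Python (a minimal sketch; the variable names are illustrative):

```python
# Bayes' rule for the MAI/URKS example: p(A|B) = p(B|A) * p(A) / p(B)
p_A = 0.48           # p(A): studying URKS in MAI
p_B = 0.31           # p(B): being non-European in MAI
p_B_given_A = 0.50   # p(B|A): non-European, given studying URKS

p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 2))  # → 0.77
```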
Intuitive interpretation of PDFs—The prosecutor’s fallacy—
The following links contain examples of court reasoning that lead to very wrong conclusions, through misunderstanding of the differences between p(x|y) and p(y|x), and between likelihood and PDF:
I http://en.wikipedia.org/wiki/Prosecutor’s_fallacy
I http://en.wikipedia.org/wiki/Conditional_probability#The_conditional_probability_fallacy
H. Bruyninckx, The Bayesian approach 48
Inference in graphical models
“Infer hidden X given observed Y!”
Terminology:
I Prediction, deduction, forward reasoning: reason from cause to symptom.
I Diagnosis, abduction, backward reasoning: reason from symptom to cause.
Bayes’ rule p(X|Y) = p(Y|X) p(X) / p(Y) “inverts” causality.
H. Bruyninckx, The Bayesian approach 49
Example: burglar alarm (1)
“What is the probability p(J,M,A,¬B,¬E) that there is no burglary, nor an earthquake, but that the alarm went off and both John and Mary called?”
p(J,M,A,B,E)
= p(J,M|A,B,E) p(A,B,E)   (product rule)
= p(J|A) p(M|A) p(A|B,E) p(B,E)   (product rule; independence via A)
= p(J|A) p(M|A) p(A|B,E) p(B|E) p(E)   (product rule)
= p(J|A) p(M|A) p(A|B,E) p(B) p(E).   (independence of B and E)
H. Bruyninckx, The Bayesian approach 50
⇒ p(J = T, M = T, A = T, B = F, E = F)
= p(J = T|A = T) p(M = T|A = T) p(A = T|B = F, E = F) p(B = F) p(E = F)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
= 0.00062811
No Bayes’ rule required, because everything is deduction (“forward reasoning”).
H. Bruyninckx, The Bayesian approach 51
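The forward (“deductive”) evaluation of the factorized joint is a single product, sketched here with the CPT values used on the slide:

```python
# p(J=T, M=T, A=T, B=F, E=F) via the factorization of the burglar-alarm net:
#     p(J|A)  p(M|A)  p(A|~B,~E)  p(~B)   p(~E)
p = 0.90 * 0.70 * 0.001 * 0.999 * 0.998
print(round(p, 8))  # → 0.00062811
```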
Example: burglar alarm (2)
“What is the probability that there is a burglary,given that John calls?”
Bayes’ rule: p(B|J) = p(J|B) p(B)/p(J).
1. p(B) = 0.001
2. p(J) = Σ_{A,B,E} p(J,A,B,E)   (marginalisation)
= Σ_{A,B,E} p(J|A,B,E) p(A,B,E)   (product)
= Σ_{A,B,E} p(J|A) p(A|B,E) p(B) p(E).   (id.)
H. Bruyninckx, The Bayesian approach 52
This sum
p(J) = Σ_{A,B,E} p(J|A) p(A|B,E) p(B) p(E)
has only non-negligible terms when B and E are False:
p(J) ≈ (0.90 × 0.001 + 0.05 × 0.999) × 0.999 [B = F] × 0.998 [E = F] = 0.0507.
H. Bruyninckx, The Bayesian approach 53
3. p(J|B) = Σ_{A,E} p(J,A,E|B)   (marginalisation)
= Σ_{A,E} p(J|A,B,E) p(A|B,E) p(E|B)   (product)
= Σ_{A,E} p(J|A) p(A|B,E) p(E)   (indep.)
= 0.90 (0.95 × 0.002 + 0.94 × 0.998) + 0.05 (0.05 × 0.002 + 0.06 × 0.998)
= 0.849.
⇒ p(B|J) = 0.849 × 0.001/0.0507 ≈ 0.0167.
H. Bruyninckx, The Bayesian approach 54
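The same answer can be obtained by brute-force enumeration over the joint. One CPT entry, p(A=T|B=F,E=T) = 0.29, is not shown on the slides; it is the standard textbook value for this network and is an assumption here. With it, p(J) comes out slightly higher than the slide’s approximation (which neglects the E = True terms):

```python
# CPTs of the burglar-alarm network, as on the slides; the entry
# p(A=T|B=F,E=T) = 0.29 is an assumption (standard textbook value).
p_B = {True: 0.001, False: 0.999}
p_E = {True: 0.002, False: 0.998}
p_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (B, E)
p_J = {True: 0.90, False: 0.05}                      # p(John calls | A)

def joint(j, a, b, e):
    """p(J=j, A=a, B=b, E=e) via the factorization of the slides."""
    pa = p_A[(b, e)] if a else 1.0 - p_A[(b, e)]
    pj = p_J[a] if j else 1.0 - p_J[a]
    return pj * pa * p_B[b] * p_E[e]

bools = [True, False]
p_J_T = sum(joint(True, a, b, e) for a in bools for b in bools for e in bools)
p_JB = sum(joint(True, a, True, e) for a in bools for e in bools)
print(round(p_J_T, 4), round(p_JB / p_J_T, 4))  # p(J), p(B|J)
```

With the full CPT this gives p(J) ≈ 0.0521 and p(B|J) ≈ 0.0163, close to the slide’s approximate 0.0507 and 0.017.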
Example: DAG-with-“loop”
p(W, J, B, M) = p(W) p(J|W) p(B|W) p(M|J,B).
H. Bruyninckx, The Bayesian approach 55
Inference example
“What are the probabilities that Mary missed her appointment because there was a traffic jam (p(M = T|J = T))? Or because she had been out too late (p(M = T|B = T))?”
1. p(M = T|J = T) = p(M = T, J = T)/p(J = T) = p(Mt, Jt)/p(Jt)   (notation)
= t1/t2   (notation)
H. Bruyninckx, The Bayesian approach 56
2. t1 = Σ_{W,B} p(W, Jt, B, Mt)
= Σ_{W,B} p(Mt|Jt,B) p(Jt|W) p(B|W) p(W)
= p(Mt|Jt,Bt) p(Jt|Wt) p(Bt|Wt) p(Wt)   (0.8 × 0.2 × 0.8 × 0.28)
+ p(Mt|Jt,Bt) p(Jt|Wf) p(Bt|Wf) p(Wf)   (0.8 × 0.9 × 0.1 × 0.72)
+ p(Mt|Jt,Bf) p(Jt|Wt) p(Bf|Wt) p(Wt)   (0.6 × 0.2 × 0.2 × 0.28)
+ p(Mt|Jt,Bf) p(Jt|Wf) p(Bf|Wf) p(Wf)   (0.6 × 0.9 × 0.9 × 0.72)
= 0.444
H. Bruyninckx, The Bayesian approach 57
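The four-term sum for t1 is easy to verify numerically (a minimal sketch using the CPT values from the slide):

```python
# t1 = p(M=T, J=T): the four terms (p(Mt|Jt,B), p(Jt|W), p(B|W), p(W))
terms = [
    (0.8, 0.2, 0.8, 0.28),  # B=T, W=T
    (0.8, 0.9, 0.1, 0.72),  # B=T, W=F
    (0.6, 0.2, 0.2, 0.28),  # B=F, W=T
    (0.6, 0.9, 0.9, 0.72),  # B=F, W=F
]
t1 = sum(m * j * b * w for (m, j, b, w) in terms)
print(round(t1, 3))  # → 0.444
```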
3. t2 = Σ_{W,B,M} p(W, Jt, B, M)
= Σ_{W,B,M} p(M|Jt,B) p(Jt|W) p(B|W) p(W)
= . . .
H. Bruyninckx, The Bayesian approach 58
Bayesian mathematics
Jaynes’ axiomatic foundation
Axioms for plausible Bayesian inference:
I. Degrees of plausibility are represented by real numbers.
II. Qualitative correspondence with common sense.
III. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
IV. Always take into account all of the evidence one has.
V. Always represent equivalent states of knowledge by equivalent plausibility assignments.
H. Bruyninckx, The Bayesian approach 59
Critique on Bayesian axioms
I Assigning equivalent probabilities for equivalent states seems to assume that the modeller has “absolute knowledge,” since one often doesn’t know that states are equivalent.
I Many people state that representing information by one single real number is not always possible or desirable.
I Intuitive interpretation(?).
H. Bruyninckx, The Bayesian approach 60
Bayesian mathematics
Cox’ derivation
Plausible functional relationships:
p(A,B) = f{p(A), p(B|A)}
⇒ f{p(A), p(B|A)} = f{p(A)}^m f{p(B|A)}^m
(Arbitrary choices: m = 1, f(u) = u.)
Negation: there exists a function g such that
g(g(p(A))) = p(A),
g(p(A OR B)) = g(p(A)) AND g(p(B)).
H. Bruyninckx, The Bayesian approach 61
Information measures
I “How much information do I have?”
I Absolute measure = reduce the PDF to one scalar.
I Relative measure = what is the information difference between two PDFs? (one scalar)
I Global measure = measures the whole PDF.
I Choice of information measure is arbitrary, but:
I Entropy is a natural measure. . .
I . . . and can be derived from first principles.
H. Bruyninckx, The Bayesian approach 62
Entropy example
[Figure: four discrete PDFs with entropies H = 1.6957, H = 1.8866, H = 1.6094, and H = 1.7480.]
H. Bruyninckx, The Bayesian approach 63
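The entropies in the example can be reproduced with the discrete entropy formula introduced below (a minimal sketch; the two example PDFs here are illustrative, not the ones from the figure):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -K * sum(p_i * ln(p_i)), with K = 1."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# The flatter the PDF, the higher the entropy:
print(round(entropy([0.2] * 5), 4))                  # uniform: ln 5 ≈ 1.6094
print(round(entropy([0.6, 0.1, 0.1, 0.1, 0.1]), 4))  # peaked: lower
```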
Logarithm as information measure
Good’s axioms (1966)
I. I(M:E AND F|C) = f{I(M:E|C), I(M:F|E AND C)}
II. I(M:E AND M|C) = I(M|C)
III. I(M:E|C) is a strictly increasing function of its arguments
IV. I(M1 AND M2:M1|C) = I(M1:M1|C) if M1 and M2 are mutually irrelevant pieces of information.
V. I(M1 AND M2|M1 AND C) = I(M2|C)
⇓
. . . straightforward but tedious. . .
⇓
The logarithm of the PDF is an additive measure of the information contained in the PDF.
H. Bruyninckx, The Bayesian approach 64
Example of logarithmic measure:
Shannon entropy
Interpretation:
I Physical: measures “uncertainty,” “disorder,” “chaos.”
I Bayesian: measures our (lack of) information, not the physical system’s lack of order.
I Composition of information = multiplicative. Composition of measures = additive (via log).
First case: discrete PDF
I {x1, . . . , xn}, with PDF p(x) = {p1, . . . , pn}.
I Problem: define a measure H(p) of the PDF p(x).
H. Bruyninckx, The Bayesian approach 65
Entropy information measure.
Axiomatic approach
Discrete distributions:
I. H is a continuous function of p.
II. If all n probabilities pi are equal (and hence equal to 1/n, if we choose them to sum to 1), the entropy H(1/n, . . . , 1/n) is a monotonically increasing function of n.
III. H is an invariant, i.e., the uncertainty should not depend on how one orders or groups the elements xi.
H. Bruyninckx, The Bayesian approach 66
Invariance of entropy
Mathematical discussion
Invariance w.r.t. grouping of variables, expressed mathematically as:
H(p1, . . . , pn) = H(w1, w2, . . . )
+ w1 H(p1/w1, . . . , pk/w1)
+ w2 H(pk+1/w2, . . . , pk+m/w2)
+ . . . ,
where w1 is the probability of the set {x1, . . . , xk}, w2 is the probability of the set {xk+1, . . . , xk+m}, . . .
H. Bruyninckx, The Bayesian approach 67
[Figure: tree diagram grouping the probabilities p1, . . . , pn into sets with probabilities w1, w2, w3.]
H. Bruyninckx, The Bayesian approach 68
Invariance of entropy
Example
p1 = 1/2, p2 = 1/3, p3 = 1/6
Choose a grouping into sets:
w1 = p1 = 1/2, w2 = p2 + p3 = 1/3 + 1/6 = 1/2
Then:
H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)
H. Bruyninckx, The Bayesian approach 69
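The grouping identity above can be checked numerically (a minimal sketch using the natural-log entropy with K = 1):

```python
import math

def entropy(p):
    """Shannon entropy with K = 1: H(p) = -sum(p_i * ln(p_i))."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

lhs = entropy([1/2, 1/3, 1/6])
rhs = entropy([1/2, 1/2]) + (1/2) * entropy([2/3, 1/3])
print(abs(lhs - rhs) < 1e-12)  # → True
```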
Continuity & monotonicity of entropy
Mathematical discussion
I Continuous → rational numbers: pi = ni / Σ_{j=1}^n nj
I Uniform P = (1/N, . . . , 1/N) → H(p) = H(P) (group into n1, n2, etc.)
I ⇒ H(n) + H(m) = H(mn)
I ⇒ H(n) = K ln(n)
H. Bruyninckx, The Bayesian approach 70
Entropy formulas
H(p1, . . . , pn) = −K Σ pi ln(pi).
I ln(pi) < 0, because pi = ni/Σ ni < 1.
I lim_{pi→0} pi ln(pi) = 0.
I H increases with uncertainty. (Uncertainty ≈ lack of a clear peak.)
I There is no absolute zero for entropy.
I Be careful when comparing the entropies of two PDFs: they must be defined over the same domain.
H. Bruyninckx, The Bayesian approach 71
Entropy for continuous PDF
I H(1/N, . . . , 1/N) is not well defined for N → ∞.
I PDF1 = p(x) dx, PDF2 = p(y) dy: x and y span the same parameter space!
I Take ni/Σ ni in an interval: density.
I Ratios of densities make sense (locally):
(ni/Σ ni) / (mj/Σ mj) → dx/dy
H. Bruyninckx, The Bayesian approach 72
Mutual information
Relative entropy
Kullback-Leibler divergence:
H(p, q) = ∫ p(x) ln( p(x)/q(x) ) dx.
I Asymmetry between p(x) and q(x).
I Asymmetry ⇒ no “distance” function! (A distance is independent of direction.)
I Corresponds to intuition when going between two states of information.
H. Bruyninckx, The Bayesian approach 73
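The asymmetry is easy to see in the discrete case (a minimal sketch; the two example PDFs are illustrative):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) = sum(p_i * ln(p_i / q_i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(round(kl(p, q), 4), round(kl(q, p), 4))  # → 0.5108 0.3681 (asymmetric)
```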
Local information measure
—Fisher information—
(No mathematical details!)
I Logarithm-based measures are global!
I Local measure = rate of change of information in a given direction v of local change in the PDF:
H(PDF : PDF + εv) = (1/2) Σ_{i,j} g_ij(PDF) v^i v^j + O(ε²).
Sensitivity of the variance to small changes in the mean.
I Fisher information H is a matrix, not a scalar ⇒ scalar measures = f(H).
H. Bruyninckx, The Bayesian approach 74
Information measures
Example of Gaussian PDFs
N(µ,P) = 1/( √((2π)^n) ||P||^(1/2) ) exp{ −(1/2) (x − µ)^T P^(−1) (x − µ) }.
I Quadratic measure for the “error” x − µ:
(1/2) (x − µ)^T P^(−1) (x − µ)
(= square of the “Mahalanobis distance”)
I The inverse covariance matrix P^(−1) is the Fisher Information.
I Is additive after taking the logarithm of the PDF!
H. Bruyninckx, The Bayesian approach 75
Measures inspired by P−1
P^(−1) is a matrix measure ⇒ derived scalar measures:
I Trace.
I Determinant.
I Ratio of singular values.
I . . .
−− No unique choice.
++ Computationally efficient.
Note: the inverse P^(−1) of the Covariance Matrix is called the Information Matrix.
H. Bruyninckx, The Bayesian approach 76
Ignorance—No information
I Bayes approach: a state of “No information” does not exist.
Jaynes: “Merely knowing the physical meaning of our parameters in a model already constitutes highly relevant prior information which our intuition is able to use at once.”
I Approaches:
I “Uniform” prior. (Note: impossible on infinite domains.)
I Requirement of invariance under transformations ⇒ Maximum Entropy (MaxEnt) PDF.
I Explicit “I don’t know” hypothesis.
H. Bruyninckx, The Bayesian approach 77
Bayes rule is information-preserving
I Same Data explained by models M1 and M2:
p(M1|Data) = p(Data|M1) p(M1) / p(Data),
p(M2|Data) = p(Data|M2) p(M2) / p(Data),
⇒ log [p(M1|Data)/p(M2|Data)] = log [p(Data|M1)/p(Data|M2)] + log [p(M1)/p(M2)].
⇒ Bayes’ rule does not add or remove (logarithmic) information.
H. Bruyninckx, The Bayesian approach 78
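The log-ratio identity can be verified numerically (the likelihoods and priors below are assumed, for illustration only):

```python
import math

# Posterior log-ratio = likelihood log-ratio + prior log-ratio:
p_D_M1, p_D_M2 = 0.30, 0.10           # likelihoods p(Data|Mi), assumed
p_M1, p_M2 = 0.40, 0.60               # priors p(Mi), assumed
p_D = p_D_M1 * p_M1 + p_D_M2 * p_M2   # evidence p(Data)

post1 = p_D_M1 * p_M1 / p_D           # Bayes' rule for each model
post2 = p_D_M2 * p_M2 / p_D
lhs = math.log(post1 / post2)
rhs = math.log(p_D_M1 / p_D_M2) + math.log(p_M1 / p_M2)
print(abs(lhs - rhs) < 1e-9)  # → True
```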
Goodness-of-fit vs Model complexity
Goodness-of-fit/Likelihood: how well can a given Model explain the Data?
Complexity of a model is higher if:
I it has a higher-dimensional state space.
I it has more parameters in its PDF representation.
⇒ the same Data is less likely in a more complex model.
How to trade off complexity vs explanatory power of two models?
⇒ Occam’s razor, Bayes’ factor, Occam’s factor, AIC, BIC, MDL, . . .
H. Bruyninckx, The Bayesian approach 79
Occam’s razor
Pluralitas non est ponenda sine necessitate.
(Given two equally predictive models, choose the simpler one.)
Example: series −1, 3, 7, 11; what’s next?
Model 1: 15, 19, . . . ; f(x) = x + 4.
Model 2: −19.9, 1043.8, . . . ;
f(x) = −x³/11 + 9x²/11 + 23/11
f(−1) = 3, f(3) = 7, f(7) = 11, f(11) = −19.9, . . .
Model 1: 1 integer; Model 2: 3 rational numbers.
H. Bruyninckx, The Bayesian approach 80
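Model 2 generates each term by applying f to the previous one; a quick check of the slide’s numbers:

```python
# Model 2 from the slide: next term = f(previous term)
def f(x):
    return -x**3 / 11 + 9 * x**2 / 11 + 23 / 11

seq = [-1.0]
for _ in range(5):
    seq.append(f(seq[-1]))
print([round(v, 1) for v in seq])  # → [-1.0, 3.0, 7.0, 11.0, -19.9, 1043.8]
```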
Bayesian model selection
(“Hypothesis test”)
Explicative capacity of two models M1 and M2 for the same Data:
p(M1|Data)/p(M2|Data) = [p(Data|M1)/p(Data|M2)] × [p(M1)/p(M2)],
with the first factor the “Bayes’ factor” and the second the prior ratio.
(A single scalar when the Data is given!)
Bayes’ Factor: penalizes complex models automatically in the trade-off between predictive power and complexity (see later: Occam’s factor).
H. Bruyninckx, The Bayesian approach 81
Bayes’ Factor
“Average out” the influence of all parameters θi in model Mi:
p(Data|M1)/p(Data|M2) = [∫ p(Data|θ1,M1) p(θ1|M1) dθ1] / [∫ p(Data|θ2,M2) p(θ2|M2) dθ2],
where each integrand is a “data fit” factor p(Data|θi,Mi) times an “Occam’s factor” p(θi|Mi) dθi.
The Bayes’ factor is a ratio of two scalars, so comparison is possible.
Occam’s Factor: the complexity part of the Bayes’ Factor.
p(θ1|M1): after inference (“training”) with the Data.
H. Bruyninckx, The Bayesian approach 82
Occam’s factor
= complexity measure in the Bayes’ factor:
p(Data|M) = ∫ p(Data|θ,M) p(θ|M) dθ,
with “data fit” factor p(Data|θ,M) and “Occam’s factor” p(θ|M) dθ.
The larger the θ parameter space (i.e., the more complex the model), the smaller the region p(θ|M) dθ where the largest probability mass is found. (In case the model has “converged.”)
Occam’s factor includes the density dθ!
H. Bruyninckx, The Bayesian approach 83
Bayes’ factor implementations
Akaike Information Criterion (AIC):
−2 ln p(Data|θML,M) + 2k,   (k = # parameters).
Bayesian Information Criterion (BIC):
−2 ln p(Data|θML,M) + k ln(N),   (N = # data samples).
Minimum Description Length (MDL):
−ln p(Data|θML,M) + (k/2) ln(N).
(Beware: these measures do not take the density into account!)
H. Bruyninckx, The Bayesian approach 84
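The three criteria are one-liners once the maximum log-likelihood is known; the numbers below are assumed, for illustration only (a slightly better-fitting but more complex model can lose after the penalty):

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: -2 ln L + 2k."""
    return -2.0 * loglik + 2 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion: -2 ln L + k ln N."""
    return -2.0 * loglik + k * math.log(n)

def mdl(loglik, k, n):
    """Minimum Description Length: -ln L + (k/2) ln N."""
    return -loglik + (k / 2.0) * math.log(n)

# Assumed numbers: the complex model fits slightly better (higher loglik),
# but pays a larger complexity penalty on N = 100 samples.
print(aic(-120.0, 2), aic(-118.5, 6))          # → 244.0 249.0
print(round(bic(-120.0, 2, 100), 2), round(bic(-118.5, 6, 100), 2))
```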
Examples up to now: simple!
I Static network: all PDFs & nodes are fixed.
I PDFs are over binary variables: true, false.
I Fully observed: all PDF tables are given.
I Only estimation, no learning.
But the real world is more complex:
I Dynamic networks: PDFs change over time, because new “data” comes in from the world.
I PDFs: continuous functions.
I Partially observed: some PDFs are unknown.
I Learning and estimation.
H. Bruyninckx, The Bayesian approach 85
What makes uncertainty reasoning “Bayesian”?
I Use explicit models.
I Uncertainty = PDF.
I Allow a priori knowledge.
I Inference = calculate marginals of unobserved variables, conditional on observed variables.
I Use Bayes’ rule instead of only the likelihood.
I Estimate the PDF instead of “point estimates.” (E.g., Maximum Likelihood, Maximum A Posteriori.)
H. Bruyninckx, The Bayesian approach 86
Maximum Likelihood Estimator
Maximum A Posteriori
I Likelihood (from Bayes’ rule):
L(Model params) = p(Data|Model params).
I Maximum Likelihood Estimator (MLE):
arg max_{Model params} L(Model params).
I Maximum A Posteriori (MAP):
arg max_{Model params} p(Model params|Data).
H. Bruyninckx, The Bayesian approach 87
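A standard illustration of the MLE/MAP difference (an assumed example, not from the slides): estimating a coin’s p(heads) with a conjugate Beta prior. The prior pulls the MAP estimate toward a fair coin:

```python
# MLE vs MAP for a coin's p(heads), with a conjugate Beta prior (assumed):
h, n = 7, 10      # observed: 7 heads in 10 flips
a, b = 2.0, 2.0   # Beta(2, 2) prior: mild preference for a fair coin

mle = h / n                            # arg max of theta^h * (1-theta)^(n-h)
map_ = (h + a - 1) / (n + a + b - 2)   # mode of the posterior Beta(h+a, n-h+b)
print(mle, round(map_, 3))  # → 0.7 0.667
```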
MLE, MAP (2)
+ Often easy to calculate.
− Only point estimate.
− Doesn’t take density dx into account.
− Not very robust against small changes in PDF.
H. Bruyninckx, The Bayesian approach 88
Bayesian Learning
Things to learn:
I Structure of the graphical model.
I Parameters of the PDFs in the model.
Four categories, from simple to complex:
1. known model + full observations. (Done already! Only queries.)
2. known model + partial observations. (Inference of unobserved parameters.)
3. unknown model + full observations.
4. unknown model + partial observations.
H. Bruyninckx, The Bayesian approach 89
Bayesian Estimation
Treat the parameters as additional unobserved variables, compute a full posterior distribution over all nodes conditional upon the observed data, then integrate out the parameters.
Point estimate x = X from the PDF p(x, y, z, . . . ):
0. Learn p(x, y, z, . . . ).
1. Marginalize out all other variables:
p(x) = ∫ p(x, y, z, . . . ) dy dz . . .
2. (If a point estimate is needed:) weighted average over x (posterior mean):
X = ∫ x p(x) dx.
H. Bruyninckx, The Bayesian approach 90
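The two steps can be sketched on a toy discrete joint PDF (the grid and probability values are assumed, for illustration only):

```python
# Bayesian point estimate from a toy discrete joint PDF p(x, y):
# 1. marginalize out y, 2. take the posterior mean over x.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0]
p_xy = {(0.0, 0.0): 0.1, (0.0, 1.0): 0.1,
        (1.0, 0.0): 0.3, (1.0, 1.0): 0.2,
        (2.0, 0.0): 0.1, (2.0, 1.0): 0.2}

p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}   # marginal p(x)
X = sum(x * p for x, p in p_x.items())                 # posterior mean
print(p_x, round(X, 2))  # X → 1.1
```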
Bayesian model learning
(Revisited)
Model learning = estimation in a (much) larger space:
I New model = combination of primitive models.
I Combinations are represented by parameters.
I The combination parameters are learned.
Model learning is an order of magnitude more complex than parameter estimation!
H. Bruyninckx, The Bayesian approach 91
Approximations
I Modelling:
I Choice of model parameters.
I Choice of graphical model structure (= stochastic dependencies).
I Choice of PDF family.
I Assumptions:
I Independence in joint PDFs.
I Independence over time steps: Markov property. (x(k + 1) = f(x(k)).)
I Choice of priors.
Always requires application-dependent insights!
H. Bruyninckx, The Bayesian approach 92
Computational simplifications
I Integral needed during marginalization:
I Replace the integral by a finite sum.
I (Sample-based) Monte Carlo integration.
I Use ML instead of the full PDF.
I Use explicit input/output relationships (y = f(x)) instead of implicit ones (h(x, y) = 0).
I Linearize non-linear relationships.
I Adapt the scheduling of inference computations.
I Replace some steps by analytic approximations.
I . . .
Always requires application-dependent insights!
H. Bruyninckx, The Bayesian approach 93
Overview of “Bayesian” algorithms
0. Answering queries in a Bayes net.
  I Network is given.
  I All PDFs are known.
  I Already explained earlier in the course.

1. General inference in graphical models
= updating PDFs when new data comes in:
  I Bayesian network: message passing.
  I Markov Random Field: junction tree.
H. Bruyninckx, The Bayesian approach 94
Overview “Bayesian” algorithms (2)
2. Recursive estimation
= inference + estimation in a given model when new data comes in at regular intervals:
  I Kalman Filter.
  I Particle Filter.
  I Hidden Markov Model (Baum-Welch, Viterbi).
3. Model approximation
= constructing the model that fits the data “best”:
  I Variational Bayes.
  I Expectation-Maximization (EM).
H. Bruyninckx, The Bayesian approach 95
Inference on Bayes Network
Message passing—Belief propagation

I Propagation of new information from one node to all others via message passing in the forward and backward direction of the directed graph.
I Messages are stored locally (at the nodes) in order not to repeat some calculations.
I More efficient algorithms for graphs with more structure (tree, DAG, . . . ) are limit cases of this message passing.
I No details given in this course. . .
I Further reading: Bishop 2006 (Chapter 8).
H. Bruyninckx, The Bayesian approach 96
Markov blankets
= all nodes influencing a given node

MRF:
all neighbours.

Bayes network:
parents, children, and children’s parents.
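The Bayes-network rule above (parents, children, and the children's other parents) can be sketched in a few lines. The `markov_blanket` helper and the dict-of-parent-lists graph representation are our own illustrative choices, not from the slides.

```python
# Sketch: Markov blanket of a node in a Bayes network, where the
# network is given as a dict mapping each node to its parent list.
def markov_blanket(parents, node):
    # Children = all nodes that list `node` among their parents.
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for c in children:
        blanket |= set(parents[c])      # children's other parents
    blanket.discard(node)
    return blanket

# Example DAG: a -> c <- b, c -> d
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
print(markov_blanket(parents, "a"))
print(markov_blanket(parents, "c"))
```

For an MRF the same function degenerates to "all neighbours", since there the blanket is just the adjacency set.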
H. Bruyninckx, The Bayesian approach 97
H. Bruyninckx, The Bayesian approach 98
Inference on general graph—Junction Tree—
I Transform the MRF graphical model into a tree by taking nodes together.
I The transformed model is not unique! (Different arbitrary choices are possible.)
I Too complex for this introductory course. . .
I Approximation via Loopy Belief Propagation = apply Belief Propagation even though the graph contains loops.
H. Bruyninckx, The Bayesian approach 99
Dynamic/Recursive network
1st-order Markov
“Arrows”:
Xk+1 = f (Xk ,Uk+1)
Yk = g(Xk ,Uk)
“1st-order Markov” = the influence of time extends only one step.
H. Bruyninckx, The Bayesian approach 100
Dynamic/Recursive network
2nd-order Markov
H. Bruyninckx, The Bayesian approach 101
Simplest dynamic network:Kalman Filter
Required inference = given Y and U, update X.
Assumptions made:
I 1st order Markov.
I PDF representation: Gaussian distributions!
I Functional relationships f (·), g(·): linear!
⇒ analytical solution possible for Bayes’ rule (“learning”) and marginalization (“estimation”)!
H. Bruyninckx, The Bayesian approach 102
Kalman Filters (cont’d)
Typical applications: tracking (= adapting to small deviations from previous values).
Simplifications to reduce computations:
I State-space model: xk+1 = F xk + wk.
I Gaussian “uncertainty” on xk: covariance Pk.
I Gaussian “process noise” wk: covariance Qk.
I Measurements: zk = H xk + vk.
I Gaussian “measurement noise” vk: covariance Rk.
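The predict/update cycle implied by this model can be sketched for a scalar state. All numeric choices below (F, H, Q, R, the measurement sequence) are illustrative assumptions, not values from the slides.

```python
# Minimal scalar Kalman filter sketch:
# state model  x_{k+1} = F x_k + w_k,  w_k ~ N(0, Q),
# measurements z_k     = H x_k + v_k,  v_k ~ N(0, R).

def kalman_step(x, P, z, F=1.0, H=1.0, Q=0.01, R=0.25):
    # Predict: propagate mean and covariance through the linear model.
    x_pred = F * x
    P_pred = F * P * F + Q
    # Update: fuse prediction and measurement z via the Kalman gain.
    K = P_pred * H / (H * P_pred * H + R)
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1 - K * H) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0                      # vague prior on the state
for z in [0.9, 1.1, 1.0, 0.95]:      # made-up measurements near 1
    x, P = kalman_step(x, P, z)
print(x, P)
```

Because everything stays Gaussian and linear, both Bayes' rule and marginalization reduce to these closed-form mean/covariance updates; note how P shrinks as measurements arrive.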
H. Bruyninckx, The Bayesian approach 103
H. Bruyninckx, The Bayesian approach 104
Kalman Filters—Further reading
To derive the Kalman Filter from Bayes’ rule:
Ho, Yu-Chi and Lee, R. C. K., A Bayesian Approach toProblems in Stochastic Estimation and Control, IEEETransactions on Automatic Control, 1964.
To derive Kalman Filter as optimal information processor:
Zellner, Arnold. Optimal Information Processing andBayes’s Theorem, The American Statistician,42:278–284, 1988.
To derive Kalman Filter as recursive least-squares filter:
Kalman, R. E. and Bucy, R. S. New results in linearfiltering and prediction theory, Transactions of theASME, Journal of Basic Engineering, 83:95–108, 1961.
H. Bruyninckx, The Bayesian approach 105
Sample-based inference
(Monte Carlo methods)

Terminology: Sequential Monte Carlo, bootstrap filter, condensation algorithm, particle filter, . . .
Approximated PDF = samples with a weight:

p(x_k | z_k) → { x_k^i , w_k^i }, i = 1 . . . N_s, with Σ_i w_k^i = 1.

Operations on PDFs (marginalization, Bayes’ rule, . . . ) → operations on samples. For example:

φ̄ = ∫ φ(x) p(x) dx ≈ (1/N) Σ_{i=1}^{N} φ(x^i) = Σ_{i=1}^{N} w^i φ(x^i) (with w^i = 1/N).
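The weighted-sum approximation of the integral can be checked numerically. The Gaussian test distribution and the test function φ(x) = x² below are our own illustrative choices.

```python
# Sketch: approximating E[phi(x)] from a weighted sample set {x_i, w_i}.
# Here the samples are drawn from p(x) itself, so all weights are 1/N.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=5000)   # draws from p(x)
weights = np.full(samples.size, 1.0 / samples.size)   # w_i = 1/N

phi = lambda x: x**2                                  # any test function
estimate = float(np.sum(weights * phi(samples)))      # sum_i w_i phi(x_i)
print(estimate)   # should be near E[x^2] = mu^2 + sigma^2 = 5
```

The same weighted sum works unchanged when the weights come from importance sampling instead of being uniform.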
H. Bruyninckx, The Bayesian approach 106
Importance of sample-based methods
I Can approximate any PDF. . .
I . . . including PDFs with mixed discrete-continuous parameters.
I Sequential method: the Markov property leads to reduced computational complexity.
I Used where Kalman Filters are too “unimodal.”
I Accuracy doesn’t depend on the dimension of the space!
H. Bruyninckx, The Bayesian approach 107
How to generate samples?
I Inversion sampling.
I Proposal-based sampling:
  I Importance sampling.
  I Rejection sampling.
I Markov Chain Monte Carlo (MCMC) sampling:
  I Metropolis-Hastings sampling.
  I Gibbs sampling.
All methods use Random Number Generators.
All methods assume evaluating p(x) is cheap.
H. Bruyninckx, The Bayesian approach 108
Inversion sampling via CDF
I (Only) sampling from a uniform PDF is simple.
I Searching in the Cumulative Distribution Function can be costly.
I Only some families of PDFs allow simple inversion sampling.
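One family that does allow simple inversion sampling is the exponential distribution; its CDF F(x) = 1 − exp(−λx) inverts in closed form. The rate λ = 2 below is an arbitrary illustrative choice.

```python
# Inversion sampling sketch: push uniform samples through the inverse
# CDF of the exponential distribution, F^{-1}(u) = -log(1 - u) / lam.
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
u = rng.uniform(size=10000)       # uniform draws: the only "easy" step
x = -np.log(1.0 - u) / lam        # inverse-CDF transform
print(x.mean())                   # sample mean, close to 1/lam = 0.5
```

For PDFs whose CDF has no closed-form inverse, this search becomes a numerical root-finding problem per sample, which is exactly the cost the slide warns about.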
H. Bruyninckx, The Bayesian approach 109
Importance sampling
[Figure: importance sampling of a Beta target with a Gaussian proposal; sample points marked along [0, 1]. Legend: Beta, Gaussian, Beta samples, Normal samples.]
I Sample from simpler proposal: q(x).
I Adjust the weight of each sample according to p(x)/q(x).
I q(x) should be “similar” to p(x).
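These three steps can be sketched with the figure's pairing of a Beta target and a Gaussian proposal; the specific parameters (Beta(2, 5), N(0.3, 0.25)) are our own illustrative choices.

```python
# Importance sampling sketch: target p = Beta(2, 5), proposal q = Gaussian.
import numpy as np

rng = np.random.default_rng(2)

def p(x):                                   # Beta(2, 5) density on (0, 1)
    return np.where((x > 0) & (x < 1), 30.0 * x * (1 - x)**4, 0.0)

def q_pdf(x, mu=0.3, sig=0.25):             # Gaussian proposal density
    return np.exp(-0.5 * ((x - mu) / sig)**2) / (sig * np.sqrt(2 * np.pi))

xs = rng.normal(0.3, 0.25, size=20000)      # step 1: sample from q
w = p(xs) / q_pdf(xs)                       # step 2: weights p/q
w /= w.sum()                                # normalize the weights
mean_est = float(np.sum(w * xs))            # weighted estimate of E[x]
print(mean_est)                             # true Beta(2, 5) mean: 2/7
```

Step 3 shows up in the weights: the closer q is to p, the more uniform the weights and the lower the variance of the estimate.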
H. Bruyninckx, The Bayesian approach 110
Rejection sampling
[Figure: rejection sampling. A Student-t density under a scaled Gaussian envelope (factor * dnorm(x, mu, sigma)).]
I Sample from a simpler proposal q(x).
I q(x) should be “similar” to p(x).
I Sample from c q(x), c > 1.
I Accept if the uniformly drawn height under c q(x) is smaller than p(x).
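A runnable sketch of these steps follows. Unlike the figure, we take a Gaussian *target* under a Student-t envelope, because the envelope condition c q(x) ≥ p(x) must hold everywhere and that requires the proposal to have the heavier tails; the densities and c = 1.3 are our own illustrative choices.

```python
# Rejection sampling sketch: Student-t(3) proposal q, Gaussian target p.
import numpy as np

rng = np.random.default_rng(3)

def p(x):                                   # target: standard Gaussian
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q_pdf(x):                               # proposal: Student-t, 3 dof
    return (2.0 / (np.sqrt(3.0) * np.pi)) * (1 + x**2 / 3.0)**(-2)

c = 1.3                                     # envelope: c*q(x) >= p(x)
xs = rng.standard_t(3, size=50000)          # draws from the proposal
u = rng.uniform(size=xs.size)
accept = u < p(xs) / (c * q_pdf(xs))        # accept below p / (c*q)
samples = xs[accept]
print(samples.size, float(samples.std()))
```

The expected acceptance rate is 1/c, so a loose envelope directly wastes samples.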
H. Bruyninckx, The Bayesian approach 111
MCMC sampling(Markov Chain Monte Carlo)
I Finding an appropriate “proposal density” q(x) is difficult! Especially in higher dimensions.
I The PDF is found as the limit of a Markov Chain.
I It can take a long time before the chain has “converged.”
I Two major approaches:
  I Metropolis-Hastings sampling.
  I Gibbs sampling.
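A minimal Metropolis-Hastings sketch, with a made-up unnormalized target p(x) ∝ exp(−x⁴/4); the random-walk step size and chain length are illustrative tuning choices.

```python
# Random-walk Metropolis-Hastings: the chain's stationary distribution
# is the target, which only needs to be known up to normalization.
import numpy as np

def log_p(x):
    return -x**4 / 4.0                      # unnormalized log-target

rng = np.random.default_rng(4)
x, chain = 0.0, []
for _ in range(20000):
    prop = x + rng.normal(scale=1.0)        # symmetric random-walk proposal
    # Accept with probability min(1, p(prop)/p(x)); else keep x.
    if np.log(rng.uniform()) < log_p(prop) - log_p(x):
        x = prop
    chain.append(x)
chain = np.array(chain[2000:])              # discard burn-in
print(float(chain.mean()), float(chain.std()))
```

The burn-in discard is the practical face of the "long time before the chain has converged" warning above.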
H. Bruyninckx, The Bayesian approach 112
Example: Visual feature tracking
Movie
I Low-level visual features are tracked.
I “Statistical weight” is built up over time.
I Some features become permanent.
http://www.doc.ic.ac.uk/~ajd/
H. Bruyninckx, The Bayesian approach 113
Example: robot localization with laser distance scanner
Movie
I Uniform initial sample set.
I Some rooms are almost identical → multiple “peaks.”
http://www.cs.washington.edu/ai/Mobile_Robotics/mcl/
H. Bruyninckx, The Bayesian approach 114
Hidden Markov model
System behaves as a State Machine:
(Source: Wikipedia)
I x: states.
I y: possibleobservations.
I a: state transitionprobabilities.
I b: observationprobabilities.
This is not a graphical model!
Example: speech recognition.
H. Bruyninckx, The Bayesian approach 115
Graphical model of HMM
“Easy” calculation direction:
p(Y) = Σ_X p(Y | X) p(X).
Even this explodes for real systems. . .
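The brute-force sum over all state sequences X explodes exponentially in the sequence length; the forward algorithm computes the same p(Y) in O(T·K²) by pushing the sum inside the product. The transition matrix A, observation matrix B, and prior pi below are illustrative, not from the slides.

```python
# Forward algorithm sketch: p(Y) for a 2-state HMM, without enumerating
# all state sequences.
import numpy as np

A = np.array([[0.7, 0.3],        # A[i, j] = p(x_{t+1} = j | x_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # B[i, y] = p(y | x = i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])        # initial state distribution

def likelihood(obs):
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(y_1)
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]    # recursion: sum over previous state
    return float(alpha.sum())            # p(Y) = sum_i alpha_T(i)

print(likelihood([0, 0, 1]))
```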
H. Bruyninckx, The Bayesian approach 116
Fastest Bayesian Matching—Viterbi Algorithm—
(A close relative of the forward algorithm, with max replacing sum.)
I Given the HMM model parameters (transition and output probabilities), find the most likely sequence of (hidden) states (e.g., spoken words) which could have generated a given output sequence (e.g., a measured acoustic signal).
I Maximum Likelihood, not full-PDF:
argmax p(hidden state sequence|observed sequence).
H. Bruyninckx, The Bayesian approach 117
HMM properties that avoid exhaustive search:
I Observed and hidden events come in aligned sequences.
(= a form of graphical model)
I Each observed event corresponds to exactly one hidden event.
(= state dynamics is a Finite State Machine)
I Computing the most likely hidden sequence up to time t only depends on (i) the observed event at t, and (ii) the most likely sequence up to time t − 1.
(= dynamic programming)
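The dynamic-programming property above is exactly what the Viterbi recursion exploits. A sketch for a 2-state HMM follows; the model matrices are illustrative assumptions.

```python
# Viterbi sketch: most likely hidden state sequence via dynamic
# programming in log-space (max replaces the forward algorithm's sum).
import numpy as np

A = np.array([[0.7, 0.3],        # transition probabilities a[i, j]
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # observation probabilities b[i, y]
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def viterbi(obs):
    K, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])    # best log-prob per state
    back = np.zeros((T, K), dtype=int)           # best-predecessor table
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)      # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]                 # backtrack from best end
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1]))
```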
H. Bruyninckx, The Bayesian approach 118
Fastest Bayesian Learning—Baum-Welch—
(Also known as: forward-backward algorithm.)
Learning of the probabilities aij and bij in an HMM:
I Maximum Likelihood, not full PDF.
I Makes forward steps starting from assumed ML parameters.
I Compares the result to the measured outputs.
I Makes backward steps to adapt the ML parameters.
I Until convergence, to local optimum!
H. Bruyninckx, The Bayesian approach 119
Approximate learning algorithm—Variational Bayes—
I The ever-returning problem: estimate p(z | x), where x are the observed variables (“data”) and z are the hidden variables.
I Approximate the “true” p(z | x) by a member q(z | θ) from a θ-parameterized family of PDFs. For example: a sum of Gaussians, with the weights as parameters.
I The approximation is guided by minimizing the Kullback-Leibler divergence between the true p(z | x) and the approximation q(z | θ).
H. Bruyninckx, The Bayesian approach 120
Approximate learning algorithm—Expectation-Maximization—
I EM is an “abstract algorithm”, not really executable in itself.
I Gives a Maximum Likelihood estimate of θ in q(z | θ).
I Works in two steps:I Expectation step.I Maximization step.
I Can be proven to converge. . .
I . . . but not necessarily to global optimum.
H. Bruyninckx, The Bayesian approach 121
Outline of EM
Data x are given; θ and z are alternately updated.

Expectation step k:
I Assume θ has the given value θ(k).
I Then p(z | x, θ(k)) results from inference.
I The “total” log-likelihood function log p(x, z | θ) is a function of z and θ.
I Marginalizing z away gives a function of θ:

Q(θ) = ∫ p(z | x, θ(k)) log p(x, z | θ) dz.

Maximization step k: find the maximum of Q(θ):

θ(k + 1) = argmax_θ Q(θ).
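A concrete instance of these alternating steps is EM for a 1-D mixture of two Gaussians with unknown means. To keep the sketch short we fix the mixing weights at 0.5 and the variances at 1; all data and initial values are made up.

```python
# EM sketch for a two-component 1-D Gaussian mixture, unknown means only.
import numpy as np

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 1, 500),   # synthetic cluster 1
                       rng.normal(3, 1, 500)])   # synthetic cluster 2

mu = np.array([-1.0, 1.0])                       # initial guess theta(0)
for _ in range(50):
    # E-step: responsibilities p(z | x, theta(k)) for each component
    # (equal weights and unit variances, so only the exponent matters).
    d = np.exp(-0.5 * (data[:, None] - mu[None, :])**2)
    r = d / d.sum(axis=1, keepdims=True)
    # M-step: maximize Q(theta) -> responsibility-weighted means.
    mu = (r * data[:, None]).sum(axis=0) / r.sum(axis=0)
print(mu)
```

Here z is the unknown component label of each data point; the E-step infers it softly, and the M-step's weighted mean is the closed-form argmax of Q(θ) for this model.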
H. Bruyninckx, The Bayesian approach 122
Alternative explanation of EM
System spaces:
I Total space S: all possible PDFs over the state space, modelled by parameters φ.
I Model space M: all PDFs that are in the θ family, i.e., a subspace of S, represented by parameters φ(θ).
I Data space D: all PDFs p(x, z) that are possible given the observed + hidden parameters.

D need not be part of M!
H. Bruyninckx, The Bayesian approach 123
EM = Iterative Maximum Likelihood steps:
I Inference with the last P → Q ∈ D.
I Find the “closest” representative in M
= “project” the PDF Q onto M
= P ∈ M : min_P KL(Q, P).
(KL = Kullback-Leibler divergence)
H. Bruyninckx, The Bayesian approach 124
Graphical model learning
Unknown model + full or partial observations
Much harder than parameter learning!
Generic approach to model learning: find the “best” parameters in combinations of more primitive sub-models.

Major problem: creating the sub-models + data association of (hidden) data to the sub-models.

Major algorithm: EM, with more complex θ’s:

1. discrete selection of sub-models;

2. continuous information in each selection.
H. Bruyninckx, The Bayesian approach 125
H. Bruyninckx, The Bayesian approach 126