1
22c:145 Artificial Intelligence
Bayesian Networks
• Reading: Ch 14. Russell & Norvig
2
Review of Probability Theory
• Random variables
• The probability that a random variable X has value val is written as P(X = val)
• P: domain → [0, 1]
  – Sums to 1 over the domain:
    » P(Raining = true) = P(Raining) = 0.2
    » P(Raining = false) = P(¬Raining) = 0.8
• Joint distribution: P(X1, X2, …, Xn)
• Assigns a probability to every combination of values of the random variables, and so provides complete information about the probabilities of those variables.
• A JPD table for n random variables, each ranging over k distinct values, has k^n entries!
            Toothache    ¬Toothache
Cavity      0.04         0.06
¬Cavity     0.01         0.89
3
Review of Probability Theory
• Conditioning
  • P(A) = P(A | B) P(B) + P(A | ¬B) P(¬B)
         = P(A ∧ B) + P(A ∧ ¬B)
• A and B are independent iff
  • P(A ∧ B) = P(A) · P(B)
  • P(A | B) = P(A)
  • P(B | A) = P(B)
• A and B are conditionally independent given C iff
  • P(A | B, C) = P(A | C)
  • P(B | A, C) = P(B | C)
  • P(A ∧ B | C) = P(A | C) · P(B | C)
• Bayes' Rule
  • P(A | B) = P(B | A) P(A) / P(B)
  • P(A | B, C) = P(B | A, C) P(A | C) / P(B | C)
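The review rules can be checked numerically on the Toothache/Cavity table from the previous slide. The following is a minimal sketch; the dictionary layout and variable names are illustrative choices, not part of the lecture.

```python
# Joint distribution from the Toothache/Cavity table.
# Keys are (cavity, toothache) truth values.
joint = {
    (True, True): 0.04,
    (True, False): 0.06,
    (False, True): 0.01,
    (False, False): 0.89,
}

# Marginals: sum the joint over the variable we are not asking about.
p_cavity = sum(p for (c, t), p in joint.items() if c)
p_toothache = sum(p for (c, t), p in joint.items() if t)

# Conditioning: P(Cavity | Toothache) = P(Cavity and Toothache) / P(Toothache).
p_cavity_given_toothache = joint[(True, True)] / p_toothache

# Bayes' rule reaches the same number from the reverse conditional.
p_toothache_given_cavity = joint[(True, True)] / p_cavity
via_bayes = p_toothache_given_cavity * p_cavity / p_toothache

# Independence test: does P(A and B) equal P(A) * P(B)?
independent = abs(joint[(True, True)] - p_cavity * p_toothache) < 1e-9

print(round(p_cavity, 2), round(p_toothache, 2))
print(round(p_cavity_given_toothache, 2), round(via_bayes, 2), independent)
```

In this table Cavity and Toothache are clearly dependent: the joint entry 0.04 is far from the product of the marginals.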
4
Bayesian Networks
• To do probabilistic reasoning, you need to know the joint probability distribution
• But, in a domain with N propositional variables, one needs 2^N numbers to specify the joint probability distribution
• We want to exploit independences in the domain
• Two components: structure and numerical parameters
Lecture 14 • 5
Bayesian networks
• A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions
• Syntax:
  • a set of nodes, one per variable
  • a directed, acyclic graph (link ≈ "directly influences")
  • a conditional distribution for each node given its parents: P(Xi | Parents(Xi))
• In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values
6
Bayesian (Belief) Networks
• Set of random variables, each has a finite set of values
• Set of directed arcs between them forming an acyclic graph, representing causal relations
• Every node A, with parents B1, …, Bn, has P(A | B1, …, Bn) specified
7
Key Advantage
• The conditional independencies (missing arrows) mean that we can store and compute the joint probability distribution more efficiently
How to design a Belief Network?
• Explore the causal relations
8
Icy Roads
“Causal” Component
[Figure: network with Icy as the parent of Holmes Crash and Watson Crash]
Inspector Smith is waiting for Holmes and Watson, who are driving (separately) to meet him. It is winter. His secretary tells him that Watson has had an accident. He says, “It must be that the roads are icy. I bet that Holmes will have an accident too. I should go to lunch.” But, his secretary says, “No, the roads are not icy, look at the window.” So, he says, “I guess I better wait for Holmes.”
H and W are dependent, but conditionally independent given I
13
Holmes and Watson in IA
Holmes and Watson have moved to IA. He wakes up to find his lawn wet. He wonders if it has rained or if he left his sprinkler on. He looks at his neighbor Watson’s lawn and he sees it is wet too. So, he concludes it must have rained.
[Figure: network over Rain, Sprinkler, Holmes Lawn Wet, and Watson Lawn Wet]
Given W, P(R) goes up and P(S) goes down – “explaining away”
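The "explaining away" effect can be reproduced with a few lines of enumeration. This is a sketch under loud assumptions: the slides give no numbers for this example, so every probability below is hypothetical, and the structure (Rain influences both lawns, Sprinkler influences only Holmes's lawn) is the one assumed here from the story.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration (not from the slides).
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_W_GIVEN_R = {True: 1.0, False: 0.2}  # P(Watson lawn wet | Rain)
P_H_GIVEN_RS = {                       # P(Holmes lawn wet | Rain, Sprinkler)
    (True, True): 1.0, (True, False): 1.0,
    (False, True): 0.9, (False, False): 0.0,
}


def bern(p, value):
    """Probability that a Boolean variable with P(true) = p takes `value`."""
    return p if value else 1.0 - p


def joint(r, s, h, w):
    """P(R=r, S=s, H=h, W=w) as a product of the local distributions."""
    return (bern(P_RAIN, r) * bern(P_SPRINKLER, s)
            * bern(P_H_GIVEN_RS[(r, s)], h) * bern(P_W_GIVEN_R[r], w))


def query(target, evidence):
    """P(target = true | evidence) by summing the joint over all worlds."""
    num = den = 0.0
    for r, s, h, w in product([True, False], repeat=4):
        world = {"R": r, "S": s, "H": h, "W": w}
        if all(world[name] == value for name, value in evidence.items()):
            p = joint(r, s, h, w)
            den += p
            if world[target]:
                num += p
    return num / den


# Holmes sees his own lawn wet: both causes become more likely.
p_r_h = query("R", {"H": True})
p_s_h = query("S", {"H": True})
# He then sees Watson's lawn wet too: rain goes up, sprinkler goes down.
p_r_hw = query("R", {"H": True, "W": True})
p_s_hw = query("S", {"H": True, "W": True})
print(round(p_r_h, 3), round(p_r_hw, 3))
print(round(p_s_h, 3), round(p_s_hw, 3))
```

With these made-up numbers the direction of the change matches the slide; the exact magnitudes depend entirely on the assumed tables.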
19
Inference in Bayesian Networks
Query Types
Given a Bayesian network, what questions might we want to ask?
• Conditional probability query: P(x | e)
• Maximum a posteriori probability: What value of x maximizes P(x | e)?
General question: What’s the whole probability distribution over variable X given evidence e, P(X | e)?
20
Using the joint distribution
To answer any query involving a conjunction of variables, sum over the variables not involved in the query.
$$\Pr(d) = \sum_{A,B,C} \Pr(a, b, c, d) = \sum_{a \in \mathrm{dom}(A)} \sum_{b \in \mathrm{dom}(B)} \sum_{c \in \mathrm{dom}(C)} \Pr(A = a \wedge B = b \wedge C = c \wedge D = d)$$

$$\Pr(d \mid b) = \frac{\Pr(b, d)}{\Pr(b)} = \frac{\sum_{A,C} \Pr(a, b, c, d)}{\sum_{A,C,D} \Pr(a, b, c, d)}$$
22
Chain Rule
• Variables: V1, …, Vn
• Values: v1, …, vn
• $P(V_1 = v_1, V_2 = v_2, \ldots, V_n = v_n) = \prod_i P(V_i = v_i \mid parents(V_i))$
[Figure: network A → C ← B, C → D, with distributions P(A), P(B), P(C | A, B), P(D | C)]
P(ABCD) = P(A=true, B=true, C=true, D=true)
P(ABCD) = P(D | ABC) P(ABC)
        = P(D | C) P(ABC)                 (A and B are independent of D given C)
        = P(D | C) P(C | AB) P(AB)
        = P(D | C) P(C | AB) P(A) P(B)    (A is independent of B)
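The factorization P(ABCD) = P(D | C) P(C | AB) P(A) P(B) can be turned into code directly, and the earlier "sum over the variables not involved in the query" recipe then answers Pr(d) and Pr(d | b). This is a sketch only: the slides give no numbers for this network, so all CPT values below are hypothetical.

```python
from itertools import product

# Hypothetical CPT values for the network A -> C <- B, C -> D.
P_A = 0.3
P_B = 0.6
P_C = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.1}  # P(C=true | A, B)
P_D = {True: 0.7, False: 0.2}                     # P(D=true | C)


def bern(p, value):
    """Probability that a Boolean variable with P(true) = p takes `value`."""
    return p if value else 1.0 - p


def joint(a, b, c, d):
    """Chain rule for this network: P(A) P(B) P(C | A, B) P(D | C)."""
    return (bern(P_A, a) * bern(P_B, b)
            * bern(P_C[(a, b)], c) * bern(P_D[c], d))


worlds = list(product([True, False], repeat=4))

# A valid joint distribution sums to 1 over all 2^4 assignments.
total = sum(joint(*w) for w in worlds)

# P(ABCD) with every variable true.
p_abcd = joint(True, True, True, True)

# Pr(d): sum out A, B, and C.
p_d = sum(joint(a, b, c, d) for a, b, c, d in worlds if d)

# Pr(d | b) = Pr(b, d) / Pr(b): sum out A, C on top and A, C, D below.
p_bd = sum(joint(a, b, c, d) for a, b, c, d in worlds if b and d)
p_b = sum(joint(a, b, c, d) for a, b, c, d in worlds if b)
p_d_given_b = p_bd / p_b

print(round(total, 6), round(p_abcd, 4), round(p_b, 2))
print(round(p_d, 4), round(p_d_given_b, 4))
```

Enumeration like this touches every row of the joint, so it is only practical for small networks.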
29
Icy Roads with Numbers
[Figure: network with Icy as the parent of Holmes Crash and Watson Crash]

        P(I=t)       P(I=f)
        0.7          0.3

        P(W=t | I)   P(W=f | I)
I=t     0.8          0.2
I=f     0.1          0.9

        P(H=t | I)   P(H=f | I)
I=t     0.8          0.2
I=f     0.1          0.9

t = true
f = false
The right-hand column in these tables is redundant, since we know the entries in each row must add to 1.
Note: the columns need NOT add to 1.
32
Probability that Watson Crashes
[Figure: network with Icy as the parent of Holmes Crash and Watson Crash]
P(I) = 0.7

        P(H | I)
I       0.8
-I      0.1

        P(W | I)
I       0.8
-I      0.1
P(W) = P(W| I) P(I) + P(W|-I) P(-I)
= 0.8*0.7 + 0.1*0.3
= 0.56 + 0.03
= 0.59
34
Probability of Icy given Watson
P(I | W) = P(W | I) P(I) / P(W)
= 0.8*0.7 / 0.59
= 0.95
We started with P(I) = 0.7; knowing that Watson crashed raised the probability to 0.95
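The two calculations above (total probability for P(W), then Bayes' rule for P(I | W)) are small enough to check in a few lines. A minimal sketch using the numbers from the slides; the variable names are illustrative.

```python
# Numbers from the Icy Roads tables.
P_I = 0.7
P_W_GIVEN_I = {True: 0.8, False: 0.1}  # P(W | I) and P(W | -I)

# Total probability: P(W) = P(W | I) P(I) + P(W | -I) P(-I)
p_w = P_W_GIVEN_I[True] * P_I + P_W_GIVEN_I[False] * (1 - P_I)

# Bayes' rule: P(I | W) = P(W | I) P(I) / P(W)
p_i_given_w = P_W_GIVEN_I[True] * P_I / p_w

print(round(p_w, 2), round(p_i_given_w, 2))
```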
36
Probability of Holmes given Watson
P(H | W) = P(H, I | W) + P(H, -I | W)
         = P(H | W, I) P(I | W) + P(H | W, -I) P(-I | W)
         = P(H | I) P(I | W) + P(H | -I) P(-I | W)
         = 0.8*0.95 + 0.1*0.05
         = 0.765
We started with P(H) = 0.59; knowing that Watson crashed raised the probability to 0.765
38
Prob of Holmes given Icy and Watson
P(H | W, ~I) = P(H | ~I) = 0.1
Once I is known, W tells us nothing more about H: H and W are conditionally independent given I.
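The Holmes queries can be cross-checked by brute-force enumeration of the three-variable joint P(I) P(H | I) P(W | I). This is a sketch using the numbers from the slides; the helper names are hypothetical. Without intermediate rounding, P(H | W) comes out near 0.764; the 0.765 on the slide results from first rounding P(I | W) to 0.95.

```python
from itertools import product

# Numbers from the Icy Roads tables (same table for Holmes and Watson).
P_I = 0.7
P_CRASH_GIVEN_I = {True: 0.8, False: 0.1}


def bern(p, value):
    """Probability that a Boolean variable with P(true) = p takes `value`."""
    return p if value else 1.0 - p


def joint(i, h, w):
    """P(I=i, H=h, W=w) = P(I) P(H | I) P(W | I)."""
    return (bern(P_I, i) * bern(P_CRASH_GIVEN_I[i], h)
            * bern(P_CRASH_GIVEN_I[i], w))


def prob_h(evidence):
    """P(H = true | evidence); evidence maps "I" and/or "W" to truth values."""
    num = den = 0.0
    for i, h, w in product([True, False], repeat=3):
        world = {"I": i, "H": h, "W": w}
        if all(world[name] == value for name, value in evidence.items()):
            p = joint(i, h, w)
            den += p
            if h:
                num += p
    return num / den


p_h = prob_h({})                                   # prior P(H)
p_h_given_w = prob_h({"W": True})                  # H and W are dependent
p_h_given_not_i = prob_h({"I": False})
p_h_given_w_not_i = prob_h({"W": True, "I": False})

print(round(p_h, 3), round(p_h_given_w, 3))
print(round(p_h_given_not_i, 3), round(p_h_given_w_not_i, 3))
```

The last two values agree, which is the conditional independence of H and W given I expressed numerically.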
Lecture 14 • 39
Example
• Topology of network encodes conditional independence assertions:
  • Weather is independent of the other variables
  • Toothache and Catch are conditionally independent given Cavity
Lecture 14 • 40
Example
• I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
• Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
• Network topology reflects "causal" knowledge:
  • A burglar can set the alarm off
  • An earthquake can set the alarm off
  • The alarm can cause Mary to call
  • The alarm can cause John to call
Lecture 14 • 41
Example contd.
Lecture 14 • 42
Compactness
• A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values
• Each row requires one number p for Xi = true (the number for Xi = false is just 1 - p)
• If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
• I.e., grows linearly with n, vs. O(2^n) for the full joint distribution
• For burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 - 1 = 31)
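The counting argument is easy to reproduce. The sketch below assumes all variables are Boolean, as on the slide; the function names and the larger 30-variable case are hypothetical additions for illustration.

```python
def cpt_numbers(parent_counts):
    """Independent numbers for Boolean nodes: one per CPT row, 2^k rows each."""
    return sum(2 ** k for k in parent_counts)


def full_joint_numbers(n):
    """Independent numbers in a full joint table over n Boolean variables."""
    return 2 ** n - 1


# Number of parents of each node in the burglary network.
burglary_parents = {
    "Burglary": 0, "Earthquake": 0, "Alarm": 2, "JohnCalls": 1, "MaryCalls": 1,
}

net = cpt_numbers(burglary_parents.values())
full = full_joint_numbers(len(burglary_parents))
print(net, full)

# Hypothetical larger case: 30 Boolean variables with 5 parents each.
big_net = cpt_numbers([5] * 30)
big_full = full_joint_numbers(30)
print(big_net, big_full)
```

The gap between the two counts widens quickly as n grows while k stays bounded.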
Lecture 14 • 43
Semantics
The full joint distribution is defined as the product of the local conditional distributions:
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Parents(X_i))$$
e.g., P(j ∧ m ∧ a ∧ b ∧ e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e)
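To see the product of local distributions evaluated, here is a minimal sketch for the burglary network. The CPT values are the ones usually quoted for this textbook example; they do not appear in this transcript, so treat them as assumptions rather than as the lecture's numbers.

```python
# CPT values usually quoted for the textbook burglary example.
# They are NOT given in this transcript: treat them as assumptions.
P_B = 0.001                                   # P(Burglary)
P_E = 0.002                                   # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}               # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}               # P(MaryCalls | Alarm)


def bern(p, value):
    """Probability that a Boolean variable with P(true) = p takes `value`."""
    return p if value else 1.0 - p


def joint(j, m, a, b, e):
    """P(j, m, a, b, e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e)."""
    return (bern(P_J[a], j) * bern(P_M[a], m) * bern(P_A[(b, e)], a)
            * bern(P_B, b) * bern(P_E, e))


# Every variable true, as in the e.g. line.
p_all_true = joint(True, True, True, True, True)

# Both neighbors call and the alarm rings, with no burglary or earthquake.
p_false_alarm = joint(True, True, True, False, False)

print(p_all_true, p_false_alarm)
```

Any other full assignment is evaluated the same way, by passing different truth values to `joint`.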
Lecture 14 • 44
Constructing Bayesian networks
• 1. Choose an ordering of variables X1, … ,Xn
• 2. For i = 1 to n
  • add Xi to the network
  • select parents from X1, …, Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, …, Xi-1)
This choice of parents guarantees:
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}) \quad \text{(chain rule)}$$
$$= \prod_{i=1}^{n} P(X_i \mid Parents(X_i)) \quad \text{(by construction)}$$
Lecture 14 • 45
Example
• Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Lecture 14 • 50
Example contd.
• Deciding conditional independence is hard in noncausal directions
• (Causal models and conditional independence seem hardwired for humans!)
• Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed
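The 13-versus-10 comparison follows from the parent sets implied by the Yes/No answers for the ordering M, J, A, B, E. A small sketch, assuming Boolean variables; the dictionaries and helper names are illustrative.

```python
def cpt_numbers(parents):
    """Independent numbers for Boolean nodes: 2^(number of parents) per node."""
    return sum(2 ** len(ps) for ps in parents.values())


def respects_ordering(parents, ordering):
    """True if every node's parents appear before it in the ordering."""
    position = {name: k for k, name in enumerate(ordering)}
    return all(position[p] < position[node]
               for node, ps in parents.items() for p in ps)


# Causal ordering B, E, A, J, M (the original burglary network).
causal = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

# Ordering M, J, A, B, E: parents read off the Yes/No answers.
noncausal = {"M": [], "J": ["M"], "A": ["J", "M"], "B": ["A"], "E": ["A", "B"]}

print(cpt_numbers(causal), cpt_numbers(noncausal))
print(respects_ordering(causal, ["B", "E", "A", "J", "M"]),
      respects_ordering(noncausal, ["M", "J", "A", "B", "E"]))
```

Both networks represent the same joint distribution; only the number of parameters needed to specify it differs.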
51
Exercises
P(J, M, A, B, E) = ?
P(M, A, B) = ?
P(-M, A, B) = ?
P(A, B) = ?
P(M, B) = ?
P(A | J) = ?
Lecture 14 • 52
Summary
• Bayesian networks provide a natural representation for (causally induced) conditional independence
• Topology + CPTs = compact representation of joint distribution
• Generally easy for domain experts to construct