CS B553: Algorithms for Optimization and Learning

DESCRIPTION

CS B553: Algorithms for Optimization and Learning. Bayesian Networks. Agenda: Bayesian networks; chain rule for Bayes nets; naïve Bayes models; independence declarations; D-separation; probabilistic inference queries. Purposes of Bayesian networks.

TRANSCRIPT
CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING

Bayesian Networks
AGENDA

- Bayesian networks
- Chain rule for Bayes nets
- Naïve Bayes models
- Independence declarations
- D-separation
- Probabilistic inference queries
PURPOSES OF BAYESIAN NETWORKS

- Efficient and intuitive modeling of complex causal interactions
- Compact representation of joint distributions: O(n) rather than O(2^n)
- Algorithms for efficient inference with given evidence (more on this next time)
INDEPENDENCE OF RANDOM VARIABLES

Two random variables A and B are independent if P(A,B) = P(A) P(B), hence P(A|B) = P(A). Knowing B doesn't give you any information about A.

[This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent.]
SIGNIFICANCE OF INDEPENDENCE

If A and B are independent, then P(A,B) = P(A) P(B):

- The joint distribution over A and B can be defined as the product of the distribution of A and the distribution of B.
- Store two much smaller probability tables rather than one large probability table over all combinations of A and B.
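A minimal sketch of this check in Python (not from the slides; the joint table below is hypothetical, chosen so that it factors exactly):

```python
# P(A=a, B=b) for binary A, B; the 4-entry joint factors into two 2-entry tables.
joint = {(0, 0): 0.42, (0, 1): 0.18,
         (1, 0): 0.28, (1, 1): 0.12}

# Marginals P(A) and P(B)
pA = {a: sum(p for (a2, b), p in joint.items() if a2 == a) for a in (0, 1)}
pB = {b: sum(p for (a, b2), p in joint.items() if b2 == b) for b in (0, 1)}

# Independence must hold for every combination of values
independent = all(abs(joint[a, b] - pA[a] * pB[b]) < 1e-12
                  for a in (0, 1) for b in (0, 1))
print(independent)  # True: store pA and pB instead of the full joint
```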
CONDITIONAL INDEPENDENCE

Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), hence P(A|B,C) = P(A|C). Once you know C, learning B doesn't give you any information about A.

[Again, this has to hold for all combinations of values that A, B, and C can take on.]
SIGNIFICANCE OF CONDITIONAL INDEPENDENCE

Consider Grade(CS101), Intelligence, and SAT. Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores, but good students are more likely to get good SAT scores, so they are not independent. It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence.
BAYESIAN NETWORK

Explicitly represent independence among propositions. Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly.

[Network: Intelligence → Grade; Intelligence → SAT]

P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)

| I    | P(I=x) |
|------|--------|
| high | 0.3    |
| low  | 0.7    |

| P(G=x \| I) | I=low | I=high |
|-------------|-------|--------|
| 'a'         | 0.2   | 0.74   |
| 'b'         | 0.34  | 0.17   |
| 'c'         | 0.46  | 0.09   |

| P(S=x \| I) | I=low | I=high |
|-------------|-------|--------|
| low         | 0.95  | 0.2    |
| high        | 0.05  | 0.8    |

7 independent probabilities, instead of 11 for the full joint
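A minimal sketch of this factorization in Python, with the CPT values from the tables above (the dictionary encoding is mine); by construction it also realizes the conditional independence of Grade and SAT given Intelligence from the previous slide:

```python
# Student network: P(I,G,S) = P(G|I) P(S|I) P(I), CPT values from the slide.
P_I = {'high': 0.3, 'low': 0.7}
P_G_given_I = {'low':  {'a': 0.20, 'b': 0.34, 'c': 0.46},
               'high': {'a': 0.74, 'b': 0.17, 'c': 0.09}}
P_S_given_I = {'low':  {'low': 0.95, 'high': 0.05},
               'high': {'low': 0.20, 'high': 0.80}}

def joint(i, g, s):
    """P(I=i, G=g, S=s) via the factorization above."""
    return P_G_given_I[i][g] * P_S_given_I[i][s] * P_I[i]

# Sanity check: the 12 joint entries sum to 1, yet only 7 numbers are free.
print(sum(joint(i, g, s)
          for i in ('low', 'high') for g in 'abc' for s in ('low', 'high')))
```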
DEFINITION: BAYESIAN NETWORK

- A set of random variables X = {X1,…,Xn} with domains Val(X1),…,Val(Xn)
- Each node has a set of parents PaX; the graph must be a DAG
- Each node also maintains a conditional probability distribution (often, a table) P(X | PaX)
  - 2^k entries for a binary-valued variable with k binary parents
  - Overall: O(n·2^k) storage for binary variables
- Encodes the joint probability over X1,…,Xn
CALCULATION OF JOINT PROBABILITY

[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls and Alarm → MaryCalls]

P(b) = 0.001    P(e) = 0.002

| B | E | P(a \| B,E) |
|---|---|-------------|
| T | T | 0.95        |
| T | F | 0.94        |
| F | T | 0.29        |
| F | F | 0.001       |

| A | P(j \| A) |
|---|-----------|
| T | 0.90      |
| F | 0.05      |

| A | P(m \| A) |
|---|-----------|
| T | 0.70      |
| F | 0.01      |

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ?
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j ∧ m | a, ¬b, ¬e) P(a ∧ ¬b ∧ ¬e)
= P(j | a, ¬b, ¬e) P(m | a, ¬b, ¬e) P(a ∧ ¬b ∧ ¬e)    (J and M are independent given A)

P(j | a, ¬b, ¬e) = P(j | a) and P(m | a, ¬b, ¬e) = P(m | a)    (J and B, and J and E, are independent given A)

P(a ∧ ¬b ∧ ¬e) = P(a | ¬b, ¬e) P(¬b | ¬e) P(¬e) = P(a | ¬b, ¬e) P(¬b) P(¬e)    (B and E are independent)

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
CALCULATION OF JOINT PROBABILITY

(Same alarm network and CPTs as above.)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063
CALCULATION OF JOINT PROBABILITY

(Same network, CPTs, and computation as above.) In general:
P(x1, x2, …, xn) = ∏i=1..n P(xi | paXi)    — the full joint distribution
CHAIN RULE FOR BAYES NETS

The joint distribution is the product of all CPTs:

P(X1, X2, …, Xn) = ∏i=1..n P(Xi | PaXi)
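A short sketch of this chain rule on the alarm network from the previous slides (CPT values copied from those tables; the helper names are mine):

```python
# Alarm network CPTs from the slides above.
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j | A)
P_m = {True: 0.70, False: 0.01}                      # P(m | A)

def f(p, x):                     # P(X=x) given P(X=True) = p
    return p if x else 1.0 - p

def joint(j, m, a, b, e):
    """Chain rule: P(J,M,A,B,E) = P(J|A) P(M|A) P(A|B,E) P(B) P(E)."""
    return (f(P_j[a], j) * f(P_m[a], m) * f(P_a[b, e], a)
            * f(P_b, b) * f(P_e, e))

# The worked example: P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
print(joint(True, True, True, False, False))  # ≈ 0.000628
```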
EXAMPLE: NAÏVE BAYES MODELS

P(Cause, Effect1, …, Effectn) = P(Cause) ∏i P(Effecti | Cause)

[Network: Cause → Effect1, Effect2, …, Effectn]
ADVANTAGES OF BAYES NETS (AND OTHER GRAPHICAL MODELS)

- More manageable number of parameters to set and store
- Incremental modeling
- Explicit encoding of independence assumptions
- Efficient inference techniques
ARCS DO NOT NECESSARILY ENCODE CAUSALITY

[Figure: three Bayes nets over the nodes A, B, C]

Two BNs with the same expressive power, and a third with greater power (exercise).
READING OFF INDEPENDENCE RELATIONSHIPS

[Figure: chain network A → B → C]

Given B, does the value of A affect the probability of C? Is P(C|B,A) = P(C|B)?

No, it doesn't: C's parent (B) is given, and so C is independent of its non-descendants (A).

Independence is symmetric: C ⊥ A | B  ⇒  A ⊥ C | B
BASIC RULE

A node is independent of its non-descendants given its parents (and given nothing else).
WHAT DOES THE BN ENCODE?

[Figure: the alarm network]

- Burglary ⊥ Earthquake
- JohnCalls ⊥ MaryCalls | Alarm
- JohnCalls ⊥ Burglary | Alarm
- JohnCalls ⊥ Earthquake | Alarm
- MaryCalls ⊥ Burglary | Alarm
- MaryCalls ⊥ Earthquake | Alarm

A node is independent of its non-descendants, given its parents.
READING OFF INDEPENDENCE RELATIONSHIPS

[Figure: the alarm network]

How about Burglary ⊥ Earthquake | Alarm? No! Why?
READING OFF INDEPENDENCE RELATIONSHIPS

[Figure: the alarm network]

How about Burglary ⊥ Earthquake | Alarm? No! Why?

P(b ∧ e | a) = P(a | b,e) P(b ∧ e) / P(a) ≈ 0.00075  ≠  P(b | a) P(e | a) ≈ 0.086
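This can be verified by enumeration. A sketch using the CPTs from the joint-probability slides (helper names are mine); JohnCalls and MaryCalls sum out to 1, so only A, B, E matter:

```python
from itertools import product

P_b, P_e = 0.001, 0.002
P_a_cpt = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}

def f(p, x):                      # P(X=x) given P(X=True) = p
    return p if x else 1.0 - p

def p_abe(a, b, e):               # P(A=a, B=b, E=e)
    return f(P_a_cpt[b, e], a) * f(P_b, b) * f(P_e, e)

P_a = sum(p_abe(True, b, e) for b, e in product([True, False], repeat=2))
P_be_a = p_abe(True, True, True) / P_a                            # P(b,e | a)
P_b_a = sum(p_abe(True, True, e) for e in (True, False)) / P_a    # P(b | a)
P_e_a = sum(p_abe(True, b, True) for b in (True, False)) / P_a    # P(e | a)

print(P_be_a, P_b_a * P_e_a)      # ≈ 0.00075 vs ≈ 0.086: not equal, so dependent
```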
READING OFF INDEPENDENCE RELATIONSHIPS

[Figure: the alarm network]

How about Burglary ⊥ Earthquake | JohnCalls? No! Why? Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent.
INDEPENDENCE RELATIONSHIPS

For polytrees, there exists a unique undirected path between A and B. For each node E on the path:

- Evidence at E on a chain X → E → Y or X ← E ← Y makes X and Y independent
- Evidence at E on a common cause X ← E → Y makes X and Y independent
- At a "V" node (collider) X → E ← Y, or X → W ← Y with a directed path W → … → E: evidence on E, or below the V, makes X and Y dependent (otherwise they are independent)
GENERAL CASE

Formal property in the general case: D-separation — the above properties must hold for all (acyclic) paths between A and B.

D-separation ⇒ independence. That is, we can't read off any more independence relationships from the graph than those that are encoded in D-separation; the CPTs may indeed encode additional independences.
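These blocking rules can be turned into a general D-separation test. Below is a sketch of the standard reachability procedure (in the style of Koller & Friedman's "reachable" algorithm; this code is not from the slides):

```python
from collections import deque

def d_separated(graph, x, y, z):
    """True if x and y are d-separated given evidence set z.
    graph maps each node to its list of children."""
    parents = {n: [] for n in graph}
    for n, children in graph.items():
        for c in children:
            parents[c].append(n)

    # Nodes in z or with a descendant in z: colliders at these nodes are active.
    anc, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])

    # Traverse (node, direction): 'up' = entered from a child, 'down' = from a parent.
    visited, frontier = set(), deque([(x, 'up')])
    while frontier:
        n, d = frontier.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n == y:
            return False                 # found an active path to y
        if d == 'up' and n not in z:     # chains and forks through n stay open
            frontier.extend((p, 'up') for p in parents[n])
            frontier.extend((c, 'down') for c in graph[n])
        elif d == 'down':
            if n not in z:               # chain: keep going toward children
                frontier.extend((c, 'down') for c in graph[n])
            if n in anc:                 # activated collider: bounce to parents
                frontier.extend((p, 'up') for p in parents[n])
    return True

alarm = {'Burglary': ['Alarm'], 'Earthquake': ['Alarm'],
         'Alarm': ['JohnCalls', 'MaryCalls'], 'JohnCalls': [], 'MaryCalls': []}
print(d_separated(alarm, 'Burglary', 'Earthquake', set()))      # True
print(d_separated(alarm, 'Burglary', 'Earthquake', {'Alarm'}))  # False (explaining away)
print(d_separated(alarm, 'JohnCalls', 'MaryCalls', {'Alarm'}))  # True
```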
PROBABILITY QUERIES

Given: some probabilistic model over variables X.

Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y:

P(Y | E = e)

This is the inference problem.
ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION

Easiest case: Y = X \ E.
- P(Y | E=e) = P(Y, e) / P(e); the denominator makes the probabilities sum to 1
- Determine P(e) by marginalizing: P(e) = Σy P(Y=y, e)

Otherwise, let Z = X \ (E ∪ Y).
- P(Y | E=e) = Σz P(Y, Z=z, e) / P(e)
- P(e) = Σy Σz P(Y=y, Z=z, e)

Inference with the joint distribution: O(2^|X \ E|) for binary variables.
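A sketch of these formulas on the alarm network from the earlier slides: compute P(Burglary | JohnCalls = true) by summing out Z = {MaryCalls, Alarm, Earthquake} and normalizing by P(e). The example query and helper names are mine:

```python
from itertools import product

P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}
P_m = {True: 0.70, False: 0.01}

def f(p, x):                      # P(X=x) given P(X=True) = p
    return p if x else 1.0 - p

def joint(j, m, a, b, e):         # chain rule over the network's CPTs
    return (f(P_j[a], j) * f(P_m[a], m) * f(P_a[b, e], a)
            * f(P_b, b) * f(P_e, e))

# Unnormalized P(B, j): sum the joint over all assignments to Z
unnorm = {b: sum(joint(True, m, a, b, e)
                 for m, a, e in product([True, False], repeat=3))
          for b in (True, False)}
Z = sum(unnorm.values())          # = P(JohnCalls = true)
print({b: p / Z for b, p in unnorm.items()})   # P(B=true | j) ≈ 0.016
```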
NAÏVE BAYES CLASSIFIER

P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class)

[Network: Class → Feature1, Feature2, …, Featuren]

Given the features, what is the class?

P(C | F1, …, Fn) = P(C, F1, …, Fn) / P(F1, …, Fn) = 1/Z · P(C) ∏i P(Fi | C)

Example classes: Spam / Not Spam, or English / French / Latin, …; example features: word occurrences.
NAÏVE BAYES CLASSIFIER

P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class)

Given some of the features F1, …, Fk (k ≤ n), what is the distribution over the class?

P(C | F1, …, Fk) = 1/Z · P(C, F1, …, Fk)
= 1/Z · Σfk+1 … Σfn P(C, F1, …, Fk, fk+1, …, fn)
= 1/Z · P(C) Σfk+1 … Σfn ∏i=1..k P(Fi | C) ∏j=k+1..n P(fj | C)
= 1/Z · P(C) ∏i=1..k P(Fi | C) ∏j=k+1..n Σfj P(fj | C)
= 1/Z · P(C) ∏i=1..k P(Fi | C)

(Each Σfj P(fj | C) = 1, so the unobserved features drop out.)
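A toy sketch of this result (all class names and probabilities below are hypothetical): unobserved features contribute a factor of 1 and can simply be skipped when computing the posterior:

```python
P_C = {'spam': 0.4, 'ham': 0.6}
P_F = {  # P(word appears | class), one independent feature per word
    'offer':   {'spam': 0.30, 'ham': 0.02},
    'meeting': {'spam': 0.01, 'ham': 0.20},
}

def posterior(observed):
    """P(C | observed features), ignoring features we did not observe."""
    scores = {}
    for c, pc in P_C.items():
        s = pc
        for word, present in observed.items():
            p = P_F[word][c]
            s *= p if present else 1.0 - p
        scores[c] = s
    Z = sum(scores.values())          # the 1/Z normalization
    return {c: s / Z for c, s in scores.items()}

# 'meeting' is unobserved: it is summed out implicitly by being skipped.
print(posterior({'offer': True}))
```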
FOR GENERAL QUERIES

For BNs and queries in general, it's not that simple… more in later lectures.

Next class: skim 5.1-3, begin reading 9.1-4.