Bayes Nets and Probabilities


TRANSCRIPT

Slide 1
Oliver Schulte
Machine Learning 726
Bayes Nets and Probabilities

Bayes Nets: General Points
- Represent domain knowledge.
- Allow for uncertainty.
- Complete representation of probabilistic knowledge.
- Represent causal relations.
- Fast answers to types of queries:
  - Probabilistic: What is the probability that a patient has strep throat given that they have a fever?
  - Relevance: Is fever relevant to having strep throat?

Bayes Net Links
- Judea Pearl's Turing Award
- See UBC's AIspace

Probability Reasoning (With Bayes Nets)

Random Variables
- A random variable has a probability associated with each of its values.
- A basic statement assigns a value to a random variable.

Variable | Value  | Probability
Weather  | Sunny  | 0.7
Weather  | Rainy  | 0.2
Weather  | Cloudy | 0.08
Weather  | Snow   | 0.02
Cavity   | True   | 0.2
Cavity   | False  | 0.8

Notes: Fill in the Cavity value later.

Probability for Sentences
- A sentence or query is formed by applying and, or, not recursively to basic statements.
- Sentences also have probabilities assigned to them.

Sentence                                 | Probability
P(Cavity = false AND Toothache = false)  | 0.72
P(Cavity = true OR Toothache = true)     | 0.28

Notes: Use sentences that are used later; the 2nd sentence should have higher probability. Exercise: Prove that if A entails B, then P(A) <= P(B).

Probability Notation
- Probability theorists often write A,B instead of A ∧ B (as in Prolog).
- If the intended random variables are known, they are often not mentioned.

Shorthand                             | Full Notation
P(Cavity = false, Toothache = false)  | P(Cavity = false ∧ Toothache = false)
P(false, false)                       | P(Cavity = false ∧ Toothache = false)

Axioms of Probability
For any sentences A, B:
1. 0 ≤ P(A) ≤ 1
2. P(true) = 1 and P(false) = 0
3. P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
4. P(A) = P(B) if A and B are logically equivalent.
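A quick numeric sketch of axiom 3 (inclusion-exclusion), using the cavity/toothache numbers that appear on the surrounding slides:

```python
# Check P(A or B) = P(A) + P(B) - P(A and B) with the slides' numbers.
p_cavity = 0.2                 # P(Cavity = true), from the Random Variables table
p_toothache = 0.2              # P(Toothache = true), from the marginalization slide
p_cavity_and_toothache = 0.12  # P(Cavity, Toothache), from the marginalization slide

p_cavity_or_toothache = p_cavity + p_toothache - p_cavity_and_toothache
print(p_cavity_or_toothache)   # ~0.28, matching P(Cavity = true OR Toothache = true)
```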

Notes: Sentences can be considered as sets of complete assignments. Logical equivalence connects probability and logic. "True" is a constant sentence that is true in all possible worlds.

Rule 1: Logical Equivalence

P(NOT (NOT Cavity)) = 0.2            | P(Cavity) = 0.2
P(NOT (Cavity OR Toothache)) = 0.72  | P(Cavity = F AND Toothache = F) = 0.72
P(NOT (Cavity AND Toothache)) = 0.88 | P(Cavity = F OR Toothache = F) = 0.88

Notes: Spot the pattern. Cavity is equivalent to Cavity = T.

The Logical Equivalence Pattern
- P(NOT (NOT Cavity)) = P(Cavity) = 0.2
- P(NOT (Cavity OR Toothache)) = P(Cavity = F AND Toothache = F) = 0.72
- P(NOT (Cavity AND Toothache)) = P(Cavity = F OR Toothache = F) = 0.88
Rule 1: Logically equivalent expressions have the same probability.

Notes: This shows how logical reasoning is an important part of probabilistic reasoning. It is often easier to determine a probability after transforming the expression into a logically equivalent form.

Rule 2: Marginalization

P(Cavity, Toothache) = 0.12     | P(Cavity, Toothache = F) = 0.08     | P(Cavity) = 0.2
P(Cavity, Toothache) = 0.12     | P(Cavity = F, Toothache) = 0.08     | P(Toothache) = 0.2
P(Cavity = F, Toothache) = 0.08 | P(Cavity = F, Toothache = F) = 0.72 | P(Cavity = F) = 0.8

Notes: Assignment: fill in the marginal over 2 variables.

The Marginalization Pattern
- P(Cavity, Toothache) + P(Cavity, Toothache = F) = P(Cavity)
- P(Cavity, Toothache) + P(Cavity = F, Toothache) = P(Toothache)
- P(Cavity = F, Toothache) + P(Cavity = F, Toothache = F) = P(Cavity = F)

Prove the Pattern: Marginalization
Theorem. P(A) = P(A, B) + P(A, not B)
Proof.
1. A is logically equivalent to (A and B) OR (A and not B).
2. So P(A) = P((A and B) OR (A and not B)) = P(A and B) + P(A and not B) - P((A and B) AND (A and not B)), by the disjunction rule.
3. (A and B) AND (A and not B) is logically equivalent to false, so P((A and B) AND (A and not B)) = 0.
4. So step 2 implies P(A) = P(A and B) + P(A and not B).
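A minimal Python sketch of Rule 2, summing one variable out of the joint table above:

```python
# Marginalize over the joint P(Cavity, Toothache) from the slide.
joint = {
    (True, True): 0.12,    # P(Cavity, Toothache)
    (True, False): 0.08,   # P(Cavity, Toothache = F)
    (False, True): 0.08,   # P(Cavity = F, Toothache)
    (False, False): 0.72,  # P(Cavity = F, Toothache = F)
}

p_cavity = sum(p for (cavity, _), p in joint.items() if cavity)
p_toothache = sum(p for (_, toothache), p in joint.items() if toothache)
print(p_cavity, p_toothache)  # ~0.2 ~0.2, matching the marginalization pattern
```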

Notes: Draw a Venn diagram for step 1 (split the space into B, not B).

Completeness of Bayes Nets
- A probabilistic query system is complete if it can compute a probability for every sentence.
- Proposition: A Bayes net is complete.
- The proof has two steps:
  1. Any system that encodes the joint distribution is complete.
  2. A Bayes net encodes the joint distribution.

The Joint Distribution

Assigning Probabilities to Sentences
- A complete assignment is a conjunctive sentence that assigns a value to each random variable.
- The joint probability distribution specifies a probability for each complete assignment.

- A joint distribution determines a probability for every sentence. How? Spot the pattern.

Notes: Give examples: P(toothache), P(not cavity), P(toothache or cavity), as needed before.

Probabilities for Sentences: Spot the Pattern

Sentence                                 | Probability
P(Cavity = false AND Toothache = false)  | 0.72
P(Cavity OR Toothache)                   | 0.28
P(Toothache = false)                     | 0.8

Inference by enumeration

Inference by enumeration
- Marginalization: for any sentence A, sum the joint probabilities for the complete assignments where A is true.
- P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2.
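A minimal sketch of inference by enumeration. The four toothache entries come from the slide; the remaining joint entries are assumed here (standard textbook values) just to make the example runnable:

```python
# Inference by enumeration over the full joint P(Toothache, Catch, Cavity).
joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108,
    (True,  False, True):  0.012,
    (True,  True,  False): 0.016,
    (True,  False, False): 0.064,
    (False, True,  True):  0.072,   # assumed
    (False, False, True):  0.008,   # assumed
    (False, True,  False): 0.144,   # assumed
    (False, False, False): 0.576,   # assumed
}

def prob(sentence):
    """Sum the joint probabilities of the complete assignments where the sentence is true."""
    return sum(p for assignment, p in joint.items() if sentence(*assignment))

print(prob(lambda toothache, catch, cavity: toothache))                      # ~0.2
print(prob(lambda toothache, catch, cavity: not cavity and not toothache))   # ~0.72
print(prob(lambda toothache, catch, cavity: cavity or toothache))            # ~0.28
```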

Completeness Proof for Joint Distribution
- Theorem [from propositional logic]: Every sentence is logically equivalent to a disjunction of the form
  A1 or A2 or ... or Ak
  where the Ai are complete assignments.
- All of the Ai are mutually exclusive (joint probability 0). Why?
- So if S is equivalent to A1 or A2 or ... or Ak, then
  P(S) = Σ_i P(Ai)
  where each P(Ai) is given by the joint distribution.

Bayes Nets and The Joint Distribution

Example: Complete Bayesian Network

Notes: Example Horn clause: Alarm -> JohnCalls with p: 0.9.

The Story
- You have a new burglar alarm installed at home.
- It's reliable at detecting burglary, but it also responds to earthquakes.
- You have two neighbors who promise to call you at work when they hear the alarm.
- John always calls when he hears the alarm, but sometimes confuses the alarm with the telephone ringing.
- Mary listens to loud music and sometimes misses the alarm.

Computing The Joint Distribution
- A Bayes net provides a compact, factored representation of a joint distribution.
- In words, the joint probability is computed as follows.

- For each node Xi:
  - Find the assigned value xi.
  - Find the values y1, ..., yk assigned to the parents of Xi.
  - Look up the conditional probability P(xi | y1, ..., yk) in the Bayes net.
- Multiply together these conditional probabilities (see the sketch below).

Product Formula Example: Burglary
Query: What is the joint probability that all variables are true?
P(M, J, A, E, B) = P(M|A) P(J|A) P(A|E,B) P(E) P(B)
                 = 0.7 x 0.9 x 0.95 x 0.002 x 0.001
                 = 1.197 x 10^-6
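A minimal Python sketch of this procedure, using only the burglary-network CPT entries quoted on the slide (so it covers just the all-true query; a full network would list every parent configuration for each node):

```python
# Product formula for the burglary network, restricted to the all-true assignment.
parents = {"B": [], "E": [], "A": ["E", "B"], "J": ["A"], "M": ["A"]}
cpt = {
    ("B", True, ()): 0.001,            # P(B)
    ("E", True, ()): 0.002,            # P(E)
    ("A", True, (True, True)): 0.95,   # P(A | E, B)
    ("J", True, (True,)): 0.9,         # P(J | A)
    ("M", True, (True,)): 0.7,         # P(M | A)
}

def joint(assignment):
    """Multiply P(x_i | parents(X_i)) over all nodes, per the product formula."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[par] for par in parents[node])
        p *= cpt[(node, value, parent_values)]
    return p

print(joint({"B": True, "E": True, "A": True, "J": True, "M": True}))  # ~1.197e-06
```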

Compactness of Bayesian Networks
- Consider n binary variables.
- An unconstrained joint distribution requires O(2^n) probabilities.
- If we have a Bayesian network with a maximum of k parents for any node, then we need only O(n 2^k) probabilities.
Example
- Full unconstrained joint distribution: n = 30 requires 2^30 probabilities.
- Bayesian network: n = 30, k = 4 requires 30 x 2^4 = 480 probabilities.
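The parameter counts from the example, spelled out as a quick check:

```python
# Parameter counts for n binary variables with at most k parents per node.
n, k = 30, 4
full_joint = 2 ** n          # unconstrained joint distribution
bayes_net = n * 2 ** k       # one CPT entry per parent configuration, per node
print(full_joint, bayes_net)  # 1073741824 480
```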

Notes: Recall the toothache example, where we directly represented the joint distribution.

Summary: Why are Bayes nets useful?
- The graph structure supports:
  - a modular representation of knowledge
  - local, distributed algorithms for inference and learning
  - an intuitive (possibly causal) interpretation
- The factored representation may have exponentially fewer parameters than the full joint P(X1, ..., Xn):
  - lower sample complexity (less data for learning)
  - lower time complexity (less time for inference)

Is it Magic?
- How can the Bayes net reduce parameters? By exploiting conditional independencies.
- Why does the product formula work?
  - The Bayes net topological or graphical semantics: the graph by itself entails conditional independencies.
  - The chain rule.

Conditional Probabilities and Independence

Conditional Probabilities: Intro
- Given (A) that a die comes up with an odd number, what is the probability that (B) the number is
  - a 2?
  - a 3?
- Answer: the number of cases that satisfy both A and B, out of the number of cases that satisfy A.
- Examples:
  - #faces with (odd and 2) / #faces with odd = 0 / 3 = 0.
  - #faces with (odd and 3) / #faces with odd = 1 / 3.
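The same counting argument as a minimal sketch:

```python
# Conditional probability by counting die faces: P(B | A) = #(A and B) / #A.
faces = [1, 2, 3, 4, 5, 6]
odd = [f for f in faces if f % 2 == 1]

p_two_given_odd = sum(1 for f in odd if f == 2) / len(odd)
p_three_given_odd = sum(1 for f in odd if f == 3) / len(odd)
print(p_two_given_odd, p_three_given_odd)  # 0.0 and ~0.333
```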

Conditional Probs ctd.
- Suppose that 50 students are taking 310 and 30 of them are women. Given (A) that a student is taking 310, what is the probability that (B) they are a woman?
- Answer: #students who take 310 and are women / #students in 310 = 30/50 = 3/5.
- Notation: P(B|A).

Conditional Ratios: Spot the Pattern
- P(Student takes 310) = 50/15,000
- P(Student takes 310 and is a woman) = 30/15,000
- P(Student is a woman | Student takes 310) = 3/5
- P(die comes up with an odd number) = 1/2
- P(die comes up odd and is a 3) = 1/6
- P(3 | odd number) = 1/3

Conditional Probs: The Ratio Pattern
- P(Student takes 310 and is a woman) / P(Student takes 310) = P(Student is a woman | Student takes 310)
- P(die comes up odd and is a 3) / P(die comes up with an odd number) = P(3 | odd number)
- P(A|B) = P(A and B) / P(B). Important!

Notes: Exercise: prove that conditioning leads to a well-defined, normalized probability measure.

Conditional Probabilities: Motivation
- Much knowledge can be represented as implications B1, ..., Bk => A.
- Conditional probabilities are a probabilistic version of reasoning about what follows from conditions.
- Cognitive science: our minds store implicational knowledge.

The Product Rule: Spot the Pattern
- P(Cavity) = 0.2, P(Toothache|Cavity) = 0.6, P(Cavity, Toothache) = 0.12
- P(Cavity = F) = 0.8, P(Toothache|Cavity = F) = 0.1, P(Toothache, Cavity = F) = 0.08
- P(Toothache) = 0.2, P(Cavity|Toothache) = 0.6, P(Cavity, Toothache) = 0.12

Notes: Product Rule: P(A,B) = P(A|B) x P(B). The rows above are checked in the sketch below.
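A quick check of the product-rule rows, using the numbers from the table above:

```python
# Product rule P(A, B) = P(A | B) * P(B), with the slide's cavity/toothache numbers.
rows = [
    (0.6, 0.2, 0.12),   # P(Toothache|Cavity),     P(Cavity),     P(Cavity, Toothache)
    (0.1, 0.8, 0.08),   # P(Toothache|Cavity = F), P(Cavity = F), P(Toothache, Cavity = F)
    (0.6, 0.2, 0.12),   # P(Cavity|Toothache),     P(Toothache),  P(Cavity, Toothache)
]
for conditional, prior, joint in rows:
    print(conditional * prior, joint)  # the two values agree on every row
```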

Independence
- A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A, B) = P(A) P(B).
- Suppose that Weather is independent of the cavity scenario. Then the joint distribution decomposes:
  P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
- Absolute independence is powerful but rare.

Exercise
- Prove that the three definitions of independence are equivalent (assuming all probabilities are positive):
  A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A, B) = P(A) P(B).
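A proof sketch for the exercise, assuming P(A) > 0 and P(B) > 0:

```latex
% Sketch: the three definitions coincide when P(A), P(B) > 0.
\begin{align*}
P(A \mid B) = P(A)
  &\iff \frac{P(A, B)}{P(B)} = P(A)   && \text{ratio definition of conditioning} \\
  &\iff P(A, B) = P(A)\,P(B)          && \text{multiply both sides by } P(B) > 0 \\
  &\iff \frac{P(A, B)}{P(A)} = P(B)   && \text{divide both sides by } P(A) > 0 \\
  &\iff P(B \mid A) = P(B).
\end{align*}
```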

Conditional independence
- If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
  (1) P(catch | toothache, cavity) = P(catch | cavity)
- The same independence holds if I haven't got a cavity:
  (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
- Catch is conditionally independent of Toothache given Cavity:
  P(Catch | Toothache, Cavity) = P(Catch | Cavity)
- The equivalences for independence also hold for conditional independence, e.g.:
  P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
  P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Notes: Conditioning yields just another distribution.

Bayes Nets Graphical Semantics

Common Causes: Spot the Pattern
- Graph: Cavity -> Catch, Cavity -> Toothache.
- Catch is independent of Toothache given Cavity.

Notes: In the UBC Simple Diagnostic Example: Fever is independent of Coughing given Influenza; Coughing is independent of Fever given Bronchitis.
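A minimal numeric check of this common-cause independence. The joint table is assumed here (standard textbook values; only some of its entries appear in the slides above):

```python
# Verify that Catch is independent of Toothache given Cavity,
# using an assumed full joint P(Toothache, Catch, Cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(pred):
    return sum(p for a, p in joint.items() if pred(*a))

# Compare P(catch | toothache, cavity) with P(catch | cavity).
lhs = prob(lambda t, c, cav: c and t and cav) / prob(lambda t, c, cav: t and cav)
rhs = prob(lambda t, c, cav: c and cav) / prob(lambda t, c, cav: cav)
print(lhs, rhs)  # both ~0.9: knowing Toothache adds nothing once Cavity is known
```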

Burglary Example
- JohnCalls and MaryCalls are conditionally independent given Alarm.

Spot the Pattern: Chain Scenario
- MaryCalls is independent of Burglary given Alarm.
- JohnCalls is independent of Earthquake given Alarm.

Notes: This is typical for sequential data. In the UBC Simple Diagnostic Example: Wheezing is independent of Influenza given Bronchitis; Coughing is independent of Smoke given Bronchitis; Influenza is independent of Smokes. [check assignment]

The Markov Condition
- A Bayes net is constructed so that each variable is conditionally independent of its nondescendants given its parents.
- The graph alone (without specified probabilities) entails these conditional independencies.
- Causal interpretation: each parent is a direct cause.

Notes: 1. This can always be achieved by letting a node have enough parents; see the text for details on how to construct a Bayesian network. 2. It helps with the "retrieve relevant" problem: from the graph we can tell that certain information is ignorable.

Notes: Can work on this using "spot the pattern".

Derivation of the Product Formula

The Chain Rule
- We can always write
  P(a, b, c, ..., z) = P(a | b, c, ..., z) P(b, c, ..., z)   (Product Rule)
- Repeatedly applying this idea, we obtain
  P(a, b, c, ..., z) = P(a | b, c, ..., z) P(b | c, ..., z) P(c | ..., z) ... P(z)
- Order the variables such that children come before parents. Then, given its parents, each node is independent of its other ancestors by the topological independence, so
  P(a, b, c, ..., z) = ∏_x P(x | parents(X))

Example in Burglary Network
P(M, J, A, E, B)
  = P(M | J, A, E, B) P(J, A, E, B)
  = P(M | A) P(J, A, E, B)
  = P(M | A) P(J | A, E, B) P(A, E, B)
  = P(M | A) P(J | A) P(A, E, B)
  = P(M | A) P(J | A) P(A | E, B) P(E, B)
  = P(M | A) P(J | A) P(A | E, B) P(E) P(B)

Notes: Colours (on the slide) show the applications of the Bayes net topological independence.

Explaining Away

Common Effects: Spot the Pattern
- Graph: Influenza -> Bronchitis <- Smokes.
  Influenza and Smokes are independent. Given Bronchitis, they become dependent.
- Graph: Battery Age -> Battery Voltage <- Charging System OK.
  Battery Age and Charging System are independent. Given Battery Voltage, they become dependent.

Notes: From the UBC Simple Diagnostic problem. Does this make sense?
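A small numeric illustration of explaining away. The structure is the common-effects pattern above; the priors and CPT values are hypothetical, chosen only to make the effect visible:

```python
# Explaining away with two independent binary causes A, B and a common effect C.
# All numbers below are hypothetical illustration values, not from the slides.
from itertools import product

p_a, p_b = 0.1, 0.1                         # independent priors for the two causes
p_c_given = {(True, True): 0.95, (True, False): 0.8,
             (False, True): 0.8, (False, False): 0.05}

def joint(a, b, c):
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c_given[(a, b)] if c else 1 - p_c_given[(a, b)]
    return pa * pb * pc

def prob(pred):
    return sum(joint(a, b, c)
               for a, b, c in product([True, False], repeat=3) if pred(a, b, c))

p_b_given_c = prob(lambda a, b, c: b and c) / prob(lambda a, b, c: c)
p_b_given_c_a = prob(lambda a, b, c: a and b and c) / prob(lambda a, b, c: a and c)
print(p_b_given_c, p_b_given_c_a)  # ~0.42 vs ~0.12: observing A lowers P(B | C)
```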

Conditioning on Children
- Graph: A -> C <- B.
- Independent causes: A and B are independent.
- Explaining-away effect: given C, observing A makes B less likely. E.g., Bronchitis in the UBC Simple Diagnostic Problem.
- A and B are (marginally) independent, but become dependent once C is known.

Notes: Another example: a Wumpus on one square explains the stench. Characterizes Bayes nets.

D-separation
- A, B, and C are non-intersecting subsets of nodes in a directed graph.
- A path from A to B is blocked if it contains a node such that either
  - the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
  - the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C.
- If all paths from A to B are blocked, A is said to be d-separated from B by C.
- If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⊥ B | C.

D-separation: Example

Notes: More examples in AIspace.

Mathematical Analysis
- Theorem: If A and B have no common ancestors and neither is a descendant of the other, then they are independent of each other.
- Proof for our example (graph A -> C <- B):
  P(a, b) = Σ_c P(a, b, c) = Σ_c P(a) P(b) P(c | a, b) = P(a) P(b) Σ_c P(c | a, b) = P(a) P(b)

Notes: Not quite a proof because A, B may have parents, but it illustrates the general idea. The first step follows from marginalization; the second from the product formula; the third since P(a), P(b) do not depend on c; the last because P(c | a, b) adds up to 1 over the possible values of c.

Bayes Theorem

Abductive Reasoning
- Implications are often causal, from cause to effect.
- Many important queries are diagnostic, from effect to cause.
- Graphs: Burglary -> Alarm; Cavity -> Toothache.

Notes: Diagnostic reasoning is not impossible in logic: see nonmonotonic reasoning. Black: causal direction. Red: direction of inference. I mean that the wumpus and the stench are on adjacent squares.

Bayes Theorem: Another Example
- A doctor knows the following.
  - The disease meningitis causes the patient to have a stiff neck 50% of the time.
  - The prior probability that someone has meningitis is 1/50,000.
  - The prior that someone has a stiff neck is 1/20.
- Question: knowing that a person has a stiff neck, what is the probability that they have meningitis?

Spot the Pattern: Diagnosis
- P(Cavity) = 0.2, P(Toothache|Cavity) = 0.6, P(Toothache) = 0.2, P(Cavity|Toothache) = 0.6
- P(Meningitis) = 1/50,000, P(Stiff Neck|Meningitis) = 1/2, P(Stiff Neck) = 1/20, P(Meningitis|Stiff Neck) = ?

Notes: How does the last number depend on the first three?

Spot the Pattern: Diagnosis
- P(Cavity) x P(Toothache|Cavity) / P(Toothache) = P(Cavity|Toothache): 0.2 x 0.6 / 0.2 = 0.6
- P(Meningitis) x P(Stiff Neck|Meningitis) / P(Stiff Neck) = P(Meningitis|Stiff Neck): (1/50,000) x (1/2) / (1/20) = 1/5,000

Explain the Pattern: Bayes Theorem
Exercise: Prove Bayes' Theorem:

P(A | B) = P(B | A) P(A) / P(B).
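A quick check that this formula reproduces the meningitis number from the diagnosis table above:

```python
# Bayes' theorem applied to the meningitis example from the slides.
p_meningitis = 1 / 50_000      # prior P(Meningitis)
p_stiff_given_m = 1 / 2        # P(Stiff Neck | Meningitis)
p_stiff = 1 / 20               # prior P(Stiff Neck)

p_m_given_stiff = p_stiff_given_m * p_meningitis / p_stiff
print(p_m_given_stiff)  # 0.0002, i.e. 1/5,000
```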

On Bayes Theorem
P(a | b) = P(b | a) P(a) / P(b).

Useful for assessing diagnostic probability from causal probability:

P(Cause|Effect) = posterior probability = P(Effect|Cause) x P(Cause) / P(Effect).

- Likelihood: how well does the cause explain the effect?
- Prior: how plausible is the explanation before any evidence?
- Evidence term / normalization constant: how surprising is the evidence?

Notes: The better the cause explains the effect, the more likely the cause is. The more plausible the cause is beforehand, the more likely it is. The more surprising the evidence (the lower its prior probability), the greater its impact.