The mind is a neural computer, fitted by natural selection


“The mind is a neural computer, fitted by natural selection with combinatorial algorithms for causal and probabilistic reasoning about plants, animals, objects, and people.”

. . . “In a universe with any regularities at all, decisions informed about the past are better than decisions made at random. That has always been true, and we would expect organisms, especially informavores such as humans, to have evolved acute intuitions about probability. The founders of probability, like the founders of logic, assumed they were just formalizing common sense.”

Steven Pinker, How the Mind Works, 1997, pp. 524, 343.

© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.1

Learning Objectives

At the end of the class you should be able to:

justify the use and semantics of probability

know how to compute marginals and apply Bayes’ theorem

build a belief network for a domain

predict the inferences for a belief network

explain the predictions of a causal model


Using Uncertain Knowledge

Agents don’t have complete knowledge about the world.

Agents need to make (informed) decisions given their uncertainty.

It isn’t enough to assume what the world is like. Example: wearing a seat belt.

An agent needs to reason about its uncertainty.

When an agent acts under uncertainty, it is gambling ⇒ probability.


Probability

Probability is an agent’s measure of belief in some proposition — subjective probability.

An agent’s belief depends on its prior belief and what it observes.

Example: an agent’s probability of a particular bird flying
I Other agents may have different probabilities.
I An agent’s belief in a bird’s flying ability is affected by what the agent knows about that bird.


Random Variables

A random variable starts with upper case.

The domain of a variable X, written domain(X), is the set of values X can take. (Sometimes called “range”, “frame”, or “possible values”.)

A tuple of random variables ⟨X1, . . . , Xn⟩ is a complex random variable with domain domain(X1) × · · · × domain(Xn). Often the tuple is written as X1, . . . , Xn.

Assignment X = x means variable X has value x.

A proposition is a Boolean formula made from assignments of values to variables, or inequalities (e.g., <, ≤, . . . ) between variables and values.


Possible World Semantics

A possible world specifies an assignment of one value to each random variable.

A random variable is a function from possible worlds into the domain of the random variable.

ω ⊨ X = x means variable X is assigned value x in world ω.

Logical connectives have their standard meaning:

ω ⊨ α ∧ β if ω ⊨ α and ω ⊨ β

ω ⊨ α ∨ β if ω ⊨ α or ω ⊨ β

ω ⊨ ¬α if ω ⊭ α

Let Ω be the set of all possible worlds.


Semantics of Probability

Probability defines a measure on sets of possible worlds. A probability measure is a function µ from sets of worlds into the non-negative real numbers such that:

µ(Ω) = 1

µ(S1 ∪ S2) = µ(S1) + µ(S2) if S1 ∩ S2 = ∅.

Then P(α) = µ({ω | ω ⊨ α}).
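In code this semantics is tiny. The sketch below is illustrative (the variables and weights are assumptions, not from the slides): each world is a dict of variable assignments, and P(α) is the measure of the set of worlds satisfying α.

```python
from itertools import product

# Possible worlds: every assignment of values to the random variables.
variables = {"Fire": [True, False], "Smoke": [True, False]}
worlds = [dict(zip(variables, vals)) for vals in product(*variables.values())]

# An assumed measure mu: non-negative weights over worlds that sum to 1.
mu = [0.01, 0.04, 0.05, 0.90]  # one weight per world, in order

def P(alpha):
    """P(alpha) = mu({omega : omega satisfies alpha}); alpha is a predicate on worlds."""
    return sum(m for m, w in zip(mu, worlds) if alpha(w))

p_fire = P(lambda w: w["Fire"])                  # sums the worlds where Fire holds
p_either = P(lambda w: w["Fire"] or w["Smoke"])  # a disjunction is a union of worlds
p_omega = P(lambda w: True)                      # mu(Omega), which must be 1
```

Any proposition, however complex, is just a predicate that picks out a subset of Ω.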


Semantics

Possible Worlds:

Suppose the measure of each singleton world is 0.1.

What is the probability of circle?

What is the probability of star?

What is the probability of triangle?

What is the probability of orange?

What is the probability of orange and star?

What are the random variables?


Axioms of Probability (finite case)

Three axioms define what follows from a set of probabilities:

Axiom 1 0 ≤ P(a) for any proposition a.

Axiom 2 P(true) = 1

Axiom 3 P(a ∨ b) = P(a) + P(b) if a and b cannot both be true.

These axioms are sound and complete with respect to the semantics.
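Two consequences that follow from these axioms alone are P(¬a) = 1 − P(a) and P(a ∨ b) = P(a) + P(b) − P(a ∧ b). A quick numeric sanity check, on an arbitrary assumed distribution over two Boolean propositions:

```python
# Assumed weights over the four worlds (a, b); they sum to 1.
mu = {(True, True): 0.2, (True, False): 0.3,
      (False, True): 0.1, (False, False): 0.4}

def P(prop):
    """Probability of a proposition, given as a predicate on (a, b)."""
    return sum(w for world, w in mu.items() if prop(*world))

p_a = P(lambda a, b: a)
p_b = P(lambda a, b: b)

# Complement rule: a and ~a partition the worlds (Axioms 2 and 3).
assert abs(P(lambda a, b: not a) - (1 - p_a)) < 1e-9
# Inclusion-exclusion, also derivable from Axiom 3.
assert abs(P(lambda a, b: a or b) - (p_a + p_b - P(lambda a, b: a and b))) < 1e-9
```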


Conditioning

Probabilistic conditioning specifies how to revise beliefs based on new information.

An agent builds a probabilistic model taking all background information into account. This gives the prior probability.

All other information must be conditioned on.

If evidence e is all of the information obtained subsequently, the conditional probability P(h | e) of h given e is the posterior probability of h.


Semantics of Conditional Probability

Evidence e rules out possible worlds incompatible with e.

Evidence e induces a new measure, µe, over possible worlds:

µe(S) = c × µ(S) if ω ⊨ e for all ω ∈ S
µe(S) = 0 if ω ⊭ e for all ω ∈ S

We can show c = 1/P(e).

The conditional probability of formula h given evidence e is

P(h | e) = µe({ω : ω ⊨ h}) = P(h ∧ e) / P(e)
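Conditioning can be sketched directly from this definition: zero out the worlds incompatible with e and rescale the rest by c = 1/P(e). The joint numbers below are assumed for illustration.

```python
# Assumed measure over worlds (Fire, Alarm).
mu = {("fire", "alarm"): 0.08, ("fire", "quiet"): 0.02,
      ("nofire", "alarm"): 0.09, ("nofire", "quiet"): 0.81}

def P(prop, measure=None):
    measure = mu if measure is None else measure
    return sum(p for w, p in measure.items() if prop(w))

def condition(e):
    """mu_e: c * mu(S) on worlds satisfying e, 0 on the rest, with c = 1/P(e)."""
    c = 1.0 / P(e)
    return {w: (c * p if e(w) else 0.0) for w, p in mu.items()}

alarm = lambda w: w[1] == "alarm"
fire = lambda w: w[0] == "fire"

mu_e = condition(alarm)
posterior = P(fire, mu_e)  # P(fire | alarm) = P(fire and alarm) / P(alarm)
assert abs(posterior - P(lambda w: fire(w) and alarm(w)) / P(alarm)) < 1e-9
```

Note that µe is itself a probability measure: it is non-negative and sums to 1, which is exactly what the normalizing constant c buys us.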


Conditioning

Possible Worlds:

Observe Color = orange:


Exercise

Flu    Sneeze  Snore   µ
true   true    true    0.064
true   true    false   0.096
true   false   true    0.016
true   false   false   0.024
false  true    true    0.096
false  true    false   0.144
false  false   true    0.224
false  false   false   0.336

What is:

(a) P(flu ∧ sneeze)

(b) P(flu ∧ ¬sneeze)

(c) P(flu)

(d) P(sneeze | flu)

(e) P(¬flu ∧ sneeze)

(f) P(flu | sneeze)

(g) P(sneeze | flu ∧ snore)

(h) P(flu | sneeze ∧ snore)
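These queries can be checked mechanically. The sketch below encodes the table as a joint distribution (the helper names are mine) and computes a few of the answers:

```python
# The joint distribution from the table: (flu, sneeze, snore) -> mu.
joint = {
    (True, True, True): 0.064,   (True, True, False): 0.096,
    (True, False, True): 0.016,  (True, False, False): 0.024,
    (False, True, True): 0.096,  (False, True, False): 0.144,
    (False, False, True): 0.224, (False, False, False): 0.336,
}

def P(prop):
    return sum(p for w, p in joint.items() if prop(*w))

def P_given(h, e):
    """Conditional probability P(h | e) = P(h and e) / P(e)."""
    return P(lambda *w: h(*w) and e(*w)) / P(e)

flu = lambda f, sn, sr: f
sneeze = lambda f, sn, sr: sn

ans_a = P(lambda f, sn, sr: f and sn)   # (a) P(flu and sneeze)
ans_c = P(flu)                          # (c) P(flu)
ans_d = P_given(sneeze, flu)            # (d) P(sneeze | flu)
ans_f = P_given(flu, sneeze)            # (f) P(flu | sneeze)
```

The remaining parts are the same two helpers with different predicates.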


Chain Rule

P(f1 ∧ f2 ∧ · · · ∧ fn)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(f1 ∧ · · · ∧ fn−1)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × P(f1 ∧ · · · ∧ fn−2)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × · · · × P(f3 | f1 ∧ f2) × P(f2 | f1) × P(f1)
= ∏_{i=1}^{n} P(fi | f1 ∧ · · · ∧ fi−1)
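Each equality above just applies the definition of conditional probability and cancels. A numeric spot-check of the n = 3 case, using an arbitrary assumed joint over propositions f1, f2, f3:

```python
from itertools import product
import random

random.seed(0)
# An arbitrary normalized joint over the 8 worlds of (f1, f2, f3).
weights = [random.random() for _ in range(8)]
joint = {w: wt / sum(weights)
         for w, wt in zip(product([True, False], repeat=3), weights)}

def P(prop):
    return sum(p for w, p in joint.items() if prop(w))

lhs = P(lambda w: w[0] and w[1] and w[2])          # P(f1 ^ f2 ^ f3)

p1 = P(lambda w: w[0])                             # P(f1)
p2_given_1 = P(lambda w: w[0] and w[1]) / p1       # P(f2 | f1)
p3_given_12 = lhs / P(lambda w: w[0] and w[1])     # P(f3 | f1 ^ f2)

# Chain rule: the product of the conditionals recovers the joint.
assert abs(lhs - p3_given_12 * p2_given_1 * p1) < 1e-9
```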


Bayes’ theorem

The chain rule and commutativity of conjunction (h ∧ e is equivalent to e ∧ h) give us:

P(h ∧ e) = P(h | e) × P(e)
         = P(e | h) × P(h).

If P(e) ≠ 0, divide the right-hand sides by P(e):

P(h | e) = P(e | h) × P(h) / P(e).

This is Bayes’ theorem.


Why is Bayes’ theorem interesting?

Often you have causal knowledge:
P(symptom | disease)
P(light is off | status of switches and switch positions)
P(alarm | fire)
P(image looks like this | a tree is in front of a car)

and want to do evidential reasoning:
P(disease | symptom)
P(status of switches | light is off and switch positions)
P(fire | alarm)
P(a tree is in front of a car | image looks like this)


Exercise

A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:

85% of the cabs in the city are Green and 15% are Blue.

A witness identified the cab as Blue. The court tested the reliability of the witness in the circumstances that existed on the night of the accident and concluded that the witness correctly identifies each of the two colours 80% of the time and fails 20% of the time.

What is the probability that the cab involved in the accident was Blue?

[From D. Kahneman, Thinking Fast and Slow, 2011, p. 166.]
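The exercise is a direct application of Bayes’ theorem; a short sketch (variable names are mine):

```python
p_blue = 0.15                 # prior: 15% of cabs are Blue
p_green = 0.85
p_say_blue_if_blue = 0.80     # witness reliability
p_say_blue_if_green = 0.20    # witness error rate

# P(Blue | says Blue) = P(says Blue | Blue) * P(Blue) / P(says Blue),
# where P(says Blue) sums over both hypotheses.
p_say_blue = p_say_blue_if_blue * p_blue + p_say_blue_if_green * p_green
posterior = p_say_blue_if_blue * p_blue / p_say_blue
print(round(posterior, 3))    # → 0.414
```

The posterior is well below the witness's 80% reliability, because the low prior on Blue cabs dominates.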


Conditional independence

Random variable X is independent of random variable Y given random variable(s) Z if

P(X | Y, Z) = P(X | Z)

i.e., for all x ∈ dom(X), y, y′ ∈ dom(Y), and z ∈ dom(Z),

P(X = x | Y = y ∧ Z = z) = P(X = x | Y = y′ ∧ Z = z) = P(X = x | Z = z).

That is, knowledge of Y’s value doesn’t affect the belief in the value of X, given a value of Z.
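As a concrete check, the flu/sneeze/snore joint from the Lecture 8.1 exercise satisfies this definition with X = Sneeze, Y = Snore, Z = Flu. The loop below verifies the definition directly (helper names are mine):

```python
# Joint from the earlier exercise: (flu, sneeze, snore) -> mu.
joint = {
    (True, True, True): 0.064,   (True, True, False): 0.096,
    (True, False, True): 0.016,  (True, False, False): 0.024,
    (False, True, True): 0.096,  (False, True, False): 0.144,
    (False, False, True): 0.224, (False, False, False): 0.336,
}

def P(prop):
    return sum(p for w, p in joint.items() if prop(*w))

def P_given(h, e):
    return P(lambda *w: h(*w) and e(*w)) / P(e)

sneeze = lambda f, sn, sr: sn

# P(Sneeze | Snore = s, Flu = z) equals P(Sneeze | Flu = z) for every s and z,
# so Sneeze is independent of Snore given Flu.
for z in (True, False):
    reference = P_given(sneeze, lambda f, sn, sr: f == z)
    for s in (True, False):
        conditioned = P_given(sneeze, lambda f, sn, sr: f == z and sr == s)
        assert abs(conditioned - reference) < 1e-9
```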

© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.2


Example

Consider a student writing an exam. What are reasonable independences among the following?

Whether the student works hard

Whether the student is intelligent

The student’s answers on the exam

The student’s mark on an exam


Example domain (diagnostic assistant)

[Figure: house wiring domain — lights l1, l2; switches s1, s2, s3 (including a two-way switch); switch positions on/off; circuit breakers cb1, cb2; power outlets p1, p2; wires w0–w6; outside power.]


Examples of conditional independence?

Suppose you know whether there was power in w1 and whether there was power in w2. What information is relevant to whether light l1 is lit? What is independent?

Whether light l1 is lit is independent of the position of light switch s2 given what?

Every other variable may be independent of whether light l1 is lit given whether there is power in wire w0 and the status of light l1 (if it’s ok, or if not, how it’s broken).


Idea of belief networks

Whether l1 is lit (L1_lit) depends only on the status of the light (L1_st) and whether there is power in wire w0 (W0).

In a belief network, W0 and L1_st are parents of L1_lit.

[Figure: network fragment — W1, W2, S2_pos, and S2_st are parents of W0; W0 and L1_st are parents of L1_lit.]

W0 depends only on whether there is power in w1, whether there is power in w2, the position of switch s2 (S2_pos), and the status of switch s2 (S2_st).


Belief networks

Totally order the variables of interest: X1, . . . , Xn.

Theorem of probability theory (chain rule):
P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | X1, . . . , Xi−1)

The parents parents(Xi) of Xi are those predecessors of Xi that render Xi independent of the other predecessors. That is, parents(Xi) ⊆ {X1, . . . , Xi−1} and
P(Xi | parents(Xi)) = P(Xi | X1, . . . , Xi−1)

So P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))

A belief network is a graph: the nodes are random variables; there is an arc from the parents of each node into that node.


Example: fire alarm belief network

Variables:

Fire: there is a fire in the building

Tampering: someone has been tampering with the fire alarm

Smoke: what appears to be smoke is coming from an upstairs window

Alarm: the fire alarm goes off

Leaving: people are leaving the building en masse.

Report: a colleague says that people are leaving the building en masse. (A noisy sensor for leaving.)
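A sketch of this network in code, with Fire and Tampering as roots, Alarm depending on both, Smoke on Fire, Leaving on Alarm, and Report on Leaving. The CPT numbers below are made up for illustration (the slides give none here); the joint is the product of each variable given its parents, and posteriors are read off by enumeration.

```python
from itertools import product

# Assumed, illustrative CPTs: P(variable = true) given its parents.
p_fire, p_tampering = 0.01, 0.02
p_alarm = {(True, True): 0.50, (True, False): 0.99,    # given (Fire, Tampering)
           (False, True): 0.85, (False, False): 0.0001}
p_smoke = {True: 0.90, False: 0.01}                    # given Fire
p_leaving = {True: 0.88, False: 0.001}                 # given Alarm
p_report = {True: 0.75, False: 0.01}                   # given Leaving

def bern(p, v):
    """P(X = v) for a Boolean X with P(X = true) = p."""
    return p if v else 1 - p

def joint(fi, ta, al, sm, le, re):
    # The belief-network factorization: product of each variable given its parents.
    return (bern(p_fire, fi) * bern(p_tampering, ta) *
            bern(p_alarm[(fi, ta)], al) * bern(p_smoke[fi], sm) *
            bern(p_leaving[al], le) * bern(p_report[le], re))

def P(prop):
    return sum(joint(*w) for w in product([True, False], repeat=6) if prop(*w))

# Evidential reasoning by enumeration: P(Fire | Report = true).
posterior = (P(lambda fi, ta, al, sm, le, re: fi and re) /
             P(lambda fi, ta, al, sm, le, re: re))
```

Six Booleans mean only 64 worlds, so brute-force enumeration is fine here; real inference algorithms exist precisely because this blows up exponentially.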


Components of a belief network

A belief network consists of:

a directed acyclic graph with nodes labeled with random variables

a domain for each random variable

a conditional probability table for each variable given its parents (including prior probabilities for nodes with no parents).


Example belief network

[Figure: belief network for the wiring domain — nodes Outside_power, Cb1_st, Cb2_st, W0–W6, P1, P2, S1_pos, S1_st, S2_pos, S2_st, S3_pos, S3_st, L1_st, L1_lit, L2_st, L2_lit.]


Example belief network (continued)

The belief network also specifies:

The domain of the variables:
W0, . . . , W6 have domain {live, dead}.
S1_pos, S2_pos, and S3_pos have domain {up, down}.
S1_st has domain {ok, upside_down, short, intermittent, broken}.

Conditional probabilities, including:
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = live)
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = dead)
P(S1_pos = up)
P(S1_st = upside_down)


Belief network summary

A belief network is a directed acyclic graph (DAG) where nodes are random variables.

The parents of a node n are those variables on which n directly depends.

A belief network is automatically acyclic by construction.

A belief network is a graphical representation of dependence and independence:
I A variable is independent of its non-descendants given its parents.


Constructing belief networks

To represent a domain in a belief network, you need to consider:

What are the relevant variables?
I What will you observe?
I What would you like to find out (query)?
I What other features make the model simpler?

What values should these variables take?

What is the relationship between them? This should be expressed in terms of a directed graph, representing how each variable is generated from its predecessors.

How does the value of each variable depend on its parents? This is expressed in terms of the conditional probabilities.


Understanding Independence: Common ancestors

[Figure: fire is a parent of both alarm and smoke.]

alarm and smoke are dependent.

alarm and smoke are independent given fire.

Intuitively, fire can explain alarm and smoke; learning one can affect the other by changing your belief in fire.
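This pattern can be checked numerically. With any CPTs of this shape (the numbers below are assumed for illustration), alarm and smoke are marginally dependent but become independent once fire is observed:

```python
from itertools import product

# Assumed CPTs for fire -> alarm and fire -> smoke.
p_fire = 0.01
p_alarm = {True: 0.95, False: 0.02}   # given Fire
p_smoke = {True: 0.90, False: 0.01}   # given Fire

def bern(p, v):
    return p if v else 1 - p

def joint(f, a, s):
    return bern(p_fire, f) * bern(p_alarm[f], a) * bern(p_smoke[f], s)

def P(prop):
    return sum(joint(*w) for w in product([True, False], repeat=3) if prop(*w))

# Marginally dependent: observing alarm raises belief in smoke (via fire).
p_smoke_marginal = P(lambda f, a, s: s)
p_smoke_given_alarm = P(lambda f, a, s: s and a) / P(lambda f, a, s: a)
assert p_smoke_given_alarm > p_smoke_marginal

# Independent given fire: once Fire is known, alarm tells us nothing about smoke.
for f in (True, False):
    with_alarm = (P(lambda f_, a, s: s and a and f_ == f) /
                  P(lambda f_, a, s: a and f_ == f))
    without = P(lambda f_, a, s: s and f_ == f) / P(lambda f_, a, s: f_ == f)
    assert abs(with_alarm - without) < 1e-9
```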

© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.3


Understanding Independence: Chain

(Network: alarm → leaving → report.)

alarm and report are dependent.

alarm and report are independent given leaving.

Intuitively, the only way that the alarm affects report is by affecting leaving.

Understanding Independence: Common descendants

(Network: tampering → alarm ← fire.)

tampering and fire are independent.

tampering and fire are dependent given alarm.

Intuitively, tampering can explain away fire.

Understanding independence: example

(Figure: a belief network over the variables A through S; the arc structure of the graph did not survive the text extraction.)

Understanding independence: questions

1. On which given probabilities does P(N) depend?

2. If you were to observe a value for B, which variables' probabilities will change?

3. If you were to observe a value for N, which variables' probabilities will change?

4. Suppose you had observed a value for M; if you were to then observe a value for N, which variables' probabilities will change?

5. Suppose you had observed B and Q; which variables' probabilities will change when you observe N?

What variables are affected by observing?

If you observe variable(s) Y, the variables whose posterior probability is different from their prior are:

the ancestors of Y, and

their descendants.

Intuitively (if you have a causal belief network): you do abduction to possible causes, and prediction from the causes.

d-separation

A connection is a meeting of arcs in a belief network. Whether a connection is open, given a set of variables Z, is defined as follows:

If there are arcs A → B and B → C such that B ∉ Z, then the connection at B between A and C is open.

If there are arcs B → A and B → C such that B ∉ Z, then the connection at B between A and C is open.

If there are arcs A → B and C → B such that B (or a descendant of B) is in Z, then the connection at B between A and C is open.

X is d-connected to Y given Z if there is a path from X to Y along open connections.

X is d-separated from Y given Z if it is not d-connected to Y given Z.

X is independent of Y given Z for all conditional probabilities iff X is d-separated from Y given Z.
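The open/closed-connection test can be checked mechanically. A minimal sketch in Python on the tampering/fire/alarm network used in these slides; the `parents` dict and function names are my own, not from the textbook's companion code:

```python
# d-separation by enumerating undirected paths and testing each
# connection, on the tampering/fire/alarm network of the slides.
parents = {
    "tampering": [], "fire": [],
    "alarm": ["tampering", "fire"],
    "smoke": ["fire"],
    "leaving": ["alarm"],
    "report": ["leaving"],
}

def descendants(v):
    """All descendants of v in the DAG."""
    kids = {c for c, ps in parents.items() if v in ps}
    for k in set(kids):
        kids |= descendants(k)
    return kids

def d_separated(x, y, z):
    """True iff x is d-separated from y given the set of variables z."""
    def neighbours(v):
        return set(parents[v]) | {c for c, ps in parents.items() if v in ps}

    def is_open(path):
        # Check every connection a - b - c along the path.
        for a, b, c in zip(path, path[1:], path[2:]):
            collider = a in parents[b] and c in parents[b]   # a -> b <- c
            if collider:
                if b not in z and not (descendants(b) & z):
                    return False   # closed: b and its descendants unobserved
            elif b in z:
                return False       # chain or fork blocked by observing b
        return True

    def paths(cur, goal, sofar):
        if cur == goal:
            yield sofar + [cur]
        else:
            for n in neighbours(cur):
                if n not in sofar:
                    yield from paths(n, goal, sofar + [cur])

    return not any(is_open(p) for p in paths(x, y, []))
```

For example, tampering and fire are d-separated given nothing, but become d-connected once alarm (or its descendant report) is observed.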

Markov Random Field

A Markov random field is composed of:

a set of discrete-valued random variables, X = {X1, X2, …, Xn}, and

a set of factors {f1, …, fm}, where a factor is a non-negative function of a subset of the variables,

and defines a joint probability distribution:

P(X = x) = (1/Z) ∏k fk(Xk = xk)

Z = ∑x ∏k fk(Xk = xk)

where fk(Xk) is a factor on Xk ⊆ X, xk is x projected onto Xk, and Z is a normalization constant known as the partition function.
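A numeric sketch of this definition on three binary variables with two factors; the factor values below are illustrative assumptions, not taken from the slides:

```python
from itertools import product

# A tiny Markov random field: variables A, B, C and two factors,
# one on (A, B) and one on (B, C).
variables = ["A", "B", "C"]
factors = [
    (("A", "B"), {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}),
    (("B", "C"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),
]

def product_of_factors(assign):
    """The unnormalized value: the product of all factors on `assign`."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(assign[v] for v in scope)]
    return p

# The partition function Z sums the factor product over all assignments.
Z = sum(product_of_factors(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))

def prob(assign):
    """P(X = x) = (1/Z) * product of factors."""
    return product_of_factors(assign) / Z
```

With these factors Z = 12, and the probabilities of all eight assignments sum to 1.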

Markov Networks and Factor graphs

A Markov network is a graphical representation of a Markov random field where the nodes are the random variables and there is an arc between any two variables that are in a factor together.

A factor graph is a bipartite graph, which contains a variable node for each random variable and a factor node for each factor. There is an edge between a variable node and a factor node if the variable appears in the factor.

A belief network is a type of Markov random field where the factors represent conditional probabilities, there is a factor for each variable, and the directed graph is acyclic.

Independence in a Markov Network

The Markov blanket of a variable X is the set of variables that are in factors with X.

A variable is independent of the other variables given its Markov blanket.

X is connected to Y given Z if there is a path from X to Y in the Markov network that does not contain an element of Z.

X is separated from Y given Z if it is not connected to Y given Z.

A positive distribution is one that does not contain zero probabilities.

X is independent of Y given Z for all positive distributions iff X is separated from Y given Z.

Canonical Representations

The parameters of a graphical model are the numbers that define the model.

A belief network is a canonical representation: given the structure and the distribution, the parameters are uniquely determined.

A Markov random field is not a canonical representation. Many different parameterizations result in the same distribution.

Representations of Conditional Probabilities

There are many representations of conditional probabilities and factors:

Tables

Decision Trees

Rules

Weighted Logical Formulae

Contextual Tables

Logistic Functions

Tabular Representation

P(D | A, B, C):

A B C D Prob
true true true true 0.9
true true true false 0.1
true true false true 0.9
true true false false 0.1
true false true true 0.2
true false true false 0.8
true false false true 0.2
true false false false 0.8
false true true true 0.3
false true true false 0.7
false true false true 0.4
false true false false 0.6
false false true true 0.3
false false true false 0.7
false false false true 0.4
false false false false 0.6
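The table can be held as a simple dict; doing so also makes its hidden context-specific structure visible (when A is true, D depends only on B; when A is false, only on C), which the rule and decision-tree representations below exploit. A minimal sketch:

```python
from itertools import product

# The tabular CPT P(D | A, B, C), stored as a dict keyed by (A, B, C, D).
cpt = {}
for a, b, c in product([True, False], repeat=3):
    # Context-specific structure: B matters only when A is true,
    # C matters only when A is false.
    p_true = (0.9 if b else 0.2) if a else (0.3 if c else 0.4)
    cpt[(a, b, c, True)] = p_true
    cpt[(a, b, c, False)] = 1 - p_true

# Every parent context must define a distribution over D.
rows_normalized = all(
    abs(cpt[(a, b, c, True)] + cpt[(a, b, c, False)] - 1) < 1e-9
    for a, b, c in product([True, False], repeat=3))
```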

Decision Tree Representation

(Figure: a decision tree representation of P(D | A, B, C); the tree image did not survive the text extraction.)

Rule Representation

0.9 : d ← a ∧ b

0.2 : d ← a ∧ ¬b

0.3 : d ← ¬a ∧ c

0.4 : d ← ¬a ∧ ¬c


Weighted Logical Formulae

d ↔ ((a ∧ b ∧ n0)
∨ (a ∧ ¬b ∧ n1)
∨ (¬a ∧ c ∧ n2)
∨ (¬a ∧ ¬c ∧ n3))

The ni are independent:

P(n0) = 0.9
P(n1) = 0.2
P(n2) = 0.3
P(n3) = 0.4

Contextual-Table Representation

(Figure: a contextual table for P(D | A, B, C); the image did not survive the text extraction.)

Logistic Functions

P(h | e) = P(h ∧ e) / P(e)
= P(h ∧ e) / (P(h ∧ e) + P(¬h ∧ e))
= 1 / (1 + P(¬h ∧ e)/P(h ∧ e))
= 1 / (1 + e^(−log(P(h∧e)/P(¬h∧e))))
= sigmoid(log odds(h | e))

where

sigmoid(x) = 1 / (1 + e^(−x))

odds(h | e) = P(h ∧ e) / P(¬h ∧ e)
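The identity P(h | e) = sigmoid(log odds(h | e)) is easy to check numerically; the joint probabilities below are illustrative assumptions:

```python
import math

# Check that computing P(h | e) directly and via the sigmoid of the
# log-odds give the same answer.
p_h_and_e = 0.3      # P(h ∧ e), assumed
p_noth_and_e = 0.1   # P(¬h ∧ e), assumed

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

direct = p_h_and_e / (p_h_and_e + p_noth_and_e)    # P(h | e)
log_odds = math.log(p_h_and_e / p_noth_and_e)      # log odds(h | e)
via_sigmoid = sigmoid(log_odds)
```

Both routes give 0.75 here.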

Logistic Functions

A conditional probability is the sigmoid of the log-odds.

(Figure: plot of sigmoid(x) = 1 / (1 + e^(−x)) for x from −10 to 10, rising from 0 toward 1.)

A logistic function is the sigmoid of a linear function.

Logistic Representation of Conditional Probability

P(d | A, B, C) = sigmoid(0.9† ∗ A ∗ B
+ 0.2† ∗ A ∗ (1 − B)
+ 0.3† ∗ (1 − A) ∗ C
+ 0.4† ∗ (1 − A) ∗ (1 − C))

where 0.9† is sigmoid⁻¹(0.9).

The same conditional probability can be rewritten in expanded form:

P(d | A, B, C) = sigmoid(0.4†
+ (0.2† − 0.4†) ∗ A
+ (0.9† − 0.2†) ∗ A ∗ B
+ ...
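For each parent assignment exactly one product of 0/1 indicators is 1, so the outer sigmoid undoes the sigmoid⁻¹ and returns the original table entry. A minimal check (the `dagger` name is my own for the † of the slide):

```python
import math

# Verify that the sigmoid encoding reproduces the tabular
# CPT P(d | A, B, C), treating A, B, C as 0/1 indicators.
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dagger(p):
    """sigmoid^(-1)(p), written with a dagger on the slide."""
    return math.log(p / (1 - p))

def p_d(a, b, c):
    return sigmoid(dagger(0.9) * a * b
                   + dagger(0.2) * a * (1 - b)
                   + dagger(0.3) * (1 - a) * c
                   + dagger(0.4) * (1 - a) * (1 - c))
```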

Belief network inference

Main approaches to computing posterior distributions in graphical models:

Variable elimination, recursive conditioning: exploit the structure of the network to eliminate (sum out) the non-observed, non-query variables one at a time.

Stochastic simulation: random cases are generated according to the probability distributions.

Variational methods: find the closest tractable distribution to the (posterior) distribution we are interested in.

Bounding approaches: bound the conditional probabilities above and below and iteratively reduce the bounds.

...

Factors

A factor is a representation of a function from a tuple of random variables into a number.

We write factor f on variables X1, …, Xj as f(X1, …, Xj).

We can assign some or all of the variables of a factor:

f(X1 = v1, X2, …, Xj), where v1 ∈ dom(X1), is a factor on X2, …, Xj.

f(X1 = v1, X2 = v2, …, Xj = vj) is a number that is the value of f when each Xi has value vi.

The former is also written as f(X1, X2, …, Xj)X1=v1, etc.

Example factors

r(X, Y, Z):

X Y Z val
t t t 0.1
t t f 0.9
t f t 0.2
t f f 0.8
f t t 0.4
f t f 0.6
f f t 0.3
f f f 0.7

r(X=t, Y, Z):

Y Z val
t t 0.1
t f 0.9
f t 0.2
f f 0.8

r(X=t, Y, Z=f):

Y val
t 0.9
f 0.8

r(X=t, Y=f, Z=f) = 0.8
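Assigning variables of a factor is a filter-and-project over its table. A minimal sketch with the factor r(X, Y, Z) from these slides (t as True, f as False); the `condition` helper is my own name:

```python
# The factor r(X, Y, Z) as a dict from value tuples to numbers.
r = {
    (True, True, True): 0.1,   (True, True, False): 0.9,
    (True, False, True): 0.2,  (True, False, False): 0.8,
    (False, True, True): 0.4,  (False, True, False): 0.6,
    (False, False, True): 0.3, (False, False, False): 0.7,
}

def condition(factor, order, assignment):
    """Assign the variables in `assignment`, returning a factor on
    the remaining variables of `order`."""
    keep = [i for i, v in enumerate(order) if v not in assignment]
    out = {}
    for vals, num in factor.items():
        # Keep only rows consistent with the assignment, projected
        # onto the unassigned variables.
        if all(vals[order.index(v)] == a for v, a in assignment.items()):
            out[tuple(vals[i] for i in keep)] = num
    return out

r_xt = condition(r, ["X", "Y", "Z"], {"X": True})                # r(X=t, Y, Z)
r_xt_zf = condition(r, ["X", "Y", "Z"], {"X": True, "Z": False}) # r(X=t, Y, Z=f)
```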

Multiplying factors

The product of factor f1(X, Y) and f2(Y, Z), where Y are the variables in common, is the factor (f1 ∗ f2)(X, Y, Z) defined by:

(f1 ∗ f2)(X, Y, Z) = f1(X, Y) f2(Y, Z).

Multiplying factors example

f1:

A B val
t t 0.1
t f 0.9
f t 0.2
f f 0.8

f2:

B C val
t t 0.3
t f 0.7
f t 0.6
f f 0.4

f1 ∗ f2:

A B C val
t t t 0.03
t t f 0.07
t f t 0.54
t f f 0.36
f t t 0.06
f t f 0.14
f f t 0.48
f f f 0.32
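The product table can be built pointwise: for each assignment to the union of the variables, multiply the matching entries of each factor. A minimal sketch reproducing f1 ∗ f2 from these slides:

```python
from itertools import product

# Factors f1(A, B) and f2(B, C) from the slides (t = True, f = False).
f1 = {(True, True): 0.1, (True, False): 0.9,
      (False, True): 0.2, (False, False): 0.8}
f2 = {(True, True): 0.3, (True, False): 0.7,
      (False, True): 0.6, (False, False): 0.4}

def multiply(fa, va, fb, vb):
    """Pointwise product of factor fa on variables va and fb on vb."""
    out_vars = va + [v for v in vb if v not in va]   # union, va first
    out = {}
    for vals in product([True, False], repeat=len(out_vars)):
        assign = dict(zip(out_vars, vals))
        out[vals] = (fa[tuple(assign[v] for v in va)]
                     * fb[tuple(assign[v] for v in vb)])
    return out

f1_times_f2 = multiply(f1, ["A", "B"], f2, ["B", "C"])  # factor on (A, B, C)
```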

Summing out variables

We can sum out a variable, say X1 with range {v1, …, vk}, from factor f(X1, …, Xj), resulting in a factor on X2, …, Xj defined by:

(∑X1 f)(X2, …, Xj) = f(X1 = v1, …, Xj) + · · · + f(X1 = vk, …, Xj)

Summing out a variable example

f3:

A B C val
t t t 0.03
t t f 0.07
t f t 0.54
t f f 0.36
f t t 0.06
f t f 0.14
f f t 0.48
f f f 0.32

∑B f3:

A C val
t t 0.57
t f 0.43
f t 0.54
f f 0.46
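Summing out is a group-by over the remaining variables: drop the eliminated variable's position from each key and add up the values that collide. A minimal sketch reproducing ∑B f3:

```python
# The factor f3(A, B, C) from the slides (t = True, f = False).
f3 = {(True, True, True): 0.03,   (True, True, False): 0.07,
      (True, False, True): 0.54,  (True, False, False): 0.36,
      (False, True, True): 0.06,  (False, True, False): 0.14,
      (False, False, True): 0.48, (False, False, False): 0.32}

def sum_out(factor, order, var):
    """Sum `var` out of `factor`, whose variables are listed in `order`."""
    i = order.index(var)
    out = {}
    for vals, num in factor.items():
        key = vals[:i] + vals[i + 1:]        # drop the eliminated position
        out[key] = out.get(key, 0.0) + num   # accumulate colliding rows
    return out

sum_b_f3 = sum_out(f3, ["A", "B", "C"], "B")   # a factor on (A, C)
```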

Exercise

Given factors:

s:

A val
t 0.75
f 0.25

t:

A B val
t t 0.6
t f 0.4
f t 0.2
f f 0.8

o:

A val
t 0.3
f 0.1

What are the following factors over?

(a) s ∗ t
(b) ∑A s ∗ t
(c) ∑B s ∗ t
(d) ∑A ∑B s ∗ t
(e) s ∗ t ∗ o
(f) ∑B s ∗ t ∗ o

Evidence

If we want to compute the posterior probability of Z given evidence Y1 = v1 ∧ … ∧ Yj = vj:

P(Z | Y1 = v1, …, Yj = vj)
= P(Z, Y1 = v1, …, Yj = vj) / P(Y1 = v1, …, Yj = vj)
= P(Z, Y1 = v1, …, Yj = vj) / ∑Z P(Z, Y1 = v1, …, Yj = vj).

So the computation reduces to computing P(Z, Y1 = v1, …, Yj = vj).

Normalize at the end, by summing out Z and dividing.

Probability of a conjunction

Suppose the variables of the belief network are X1, . . . , Xn.
To compute P(Z, Y1 = v1, . . . , Yj = vj), we sum out the other
variables, {Z1, . . . , Zk} = {X1, . . . , Xn} − {Z} − {Y1, . . . , Yj}.
We order the Zi into an elimination ordering.

P(Z, Y1 = v1, . . . , Yj = vj)

= ∑Zk · · · ∑Z1 P(X1, . . . , Xn)[Y1 = v1, . . . , Yj = vj]

= ∑Zk · · · ∑Z1 ∏i=1..n P(Xi | parents(Xi))[Y1 = v1, . . . , Yj = vj]

(where [Y1 = v1, . . . , Yj = vj] means the observed variables are
set to their observed values).

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 37 10 / 17

Computing sums of products

Computation in belief networks reduces to computing sums of
products.

How can we compute ab + ac efficiently?

Distribute out the a, giving a(b + c).

How can we compute ∑Z1 ∏i=1..n P(Xi | parents(Xi)) efficiently?

Distribute out those factors that don’t involve Z1.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 41 11 / 17

Variable elimination algorithm

To compute P(Z |Y1 = v1 ∧ . . . ∧ Yj = vj):

Construct a factor for each conditional probability.

Set the observed variables to their observed values.

Sum out each of the other variables (the Z1, . . . , Zk)
according to some elimination ordering.

Multiply the remaining factors. Normalize by dividing the
resulting factor f(Z) by ∑Z f(Z).

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 42 12 / 17
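The steps above can be sketched for Boolean variables as follows. The (scope, table) factor representation and the tiny two-node example network at the end are illustrative assumptions, not from the text:

```python
# Variable elimination over factors represented as (scope, table) pairs,
# where table maps assignment tuples to numbers.
from itertools import product

def restrict(factor, var, value):
    """Set an observed variable to its observed value."""
    scope, table = factor
    if var not in scope:
        return factor
    i = scope.index(var)
    return (scope[:i] + scope[i + 1:],
            {k[:i] + k[i + 1:]: v for k, v in table.items() if k[i] == value})

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their scopes."""
    (s1, t1), (s2, t2) = f1, f2
    scope = s1 + [v for v in s2 if v not in s1]
    table = {}
    for a in product([True, False], repeat=len(scope)):
        env = dict(zip(scope, a))
        table[a] = t1[tuple(env[v] for v in s1)] * t2[tuple(env[v] for v in s2)]
    return scope, table

def sum_out(factor, var):
    """Sum a variable out of a factor."""
    scope, table = factor
    i = scope.index(var)
    out = {}
    for k, v in table.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + v
    return scope[:i] + scope[i + 1:], out

def variable_elimination(factors, query, evidence, ordering):
    # Set observed variables to their observed values.
    for var, val in evidence.items():
        factors = [restrict(f, var, val) for f in factors]
    # Sum out the other variables in the elimination ordering.
    for var in ordering:
        with_var = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = with_var[0]
        for f in with_var[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    # Multiply the remaining factors and normalize.
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    scope, table = result
    z = sum(table.values())
    return scope, {k: v / z for k, v in table.items()}

# Tiny network A -> B with made-up numbers: query P(A | B = true).
fA = (["A"], {(True,): 0.4, (False,): 0.6})
fB = (["A", "B"], {(True, True): 0.9, (True, False): 0.1,
                   (False, True): 0.2, (False, False): 0.8})
scope, post = variable_elimination([fA, fB], "A", {"B": True}, [])
```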

Summing out a variable

To sum out a variable Zj from a product f1, . . . , fk of factors:

Partition the factors into
- those that don’t contain Zj, say f1, . . . , fi,
- those that contain Zj, say fi+1, . . . , fk.

We know:

∑Zj f1 ∗ · · · ∗ fk = f1 ∗ · · · ∗ fi ∗ (∑Zj fi+1 ∗ · · · ∗ fk).

Explicitly construct a representation of the rightmost factor.
Replace the factors fi+1, . . . , fk by the new factor.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 43 13 / 17

Example

[Belief network diagram: A and C are parents of B; C is a parent
of D; B is a parent of E; E is a parent of F; E and D are parents
of G.]

P(E | g) = P(E ∧ g) / ∑E P(E ∧ g)

P(E ∧ g)
= ∑F ∑B ∑C ∑A ∑D P(A) P(B | AC) P(C) P(D | C) P(E | B) P(F | E) P(g | ED)

= (∑F P(F | E)) ∑B P(E | B) ∑C (P(C) (∑A P(A) P(B | AC)) (∑D P(D | C) P(g | ED)))

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 52 14 / 17

Variable elimination example

[Belief network diagram: A is a parent of B; B and C are parents
of D; C is a parent of E; D is a parent of F; F and E are parents
of G; G is a parent of H and of I.]

P(A)P(B|A)           --elim A-->  f1(B)
P(C)P(D|BC)P(E|C)    --elim C-->  f2(BDE)
P(F|D)P(G|FE)
P(H|G)               --obs H-->   f3(G)
P(I|G)               --elim I-->  f4(G)

P(D, h) = . . . (∑A P(A)P(B|A)) (∑I P(I|G))

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 60 15 / 17

Variable Elimination example

[Belief network diagram: a chain A → B → C → D → E → F, with
G a child of C and H a child of E.]

Query: P(G | f); elimination ordering: A, H, E, D, B, C

P(G | f)
∝ ∑C ∑B ∑D ∑E ∑H ∑A P(A)P(B|A)P(C|B)P(D|C)P(E|D)P(f|E)P(G|C)P(H|E)

= ∑C (∑B (∑A P(A)P(B|A)) P(C|B)) P(G|C) (∑D P(D|C) (∑E P(E|D)P(f|E) ∑H P(H|E)))

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 63 16 / 17

Pruning Irrelevant Variables (Belief networks)

Suppose you want to compute P(X | e1 . . . ek):

Prune any variables that have no observed or queried descendants.

Connect the parents of any observed variable.

Remove arc directions.

Remove observed variables.

Remove any variables not connected to X in the resulting(undirected) graph.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 64 17 / 17
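The pruning steps above can be sketched on a network given as a dict mapping each node to the set of its parents. The representation and helper name are illustrative assumptions:

```python
# Prune variables irrelevant to computing P(query | observed).
def relevant_variables(parents, query, observed):
    observed = set(observed)
    nodes = set(parents)
    keep = {query} | observed
    # 1. Repeatedly prune leaves that are neither observed nor queried
    #    (i.e. variables with no observed or queried descendants).
    changed = True
    while changed:
        changed = False
        for n in list(nodes):
            has_child = any(n in parents[c] for c in nodes if c != n)
            if n not in keep and not has_child:
                nodes.discard(n)
                changed = True
    # 2. Connect the parents of each observed variable, and drop arc
    #    directions (build an undirected adjacency map).
    adj = {n: set() for n in nodes}
    for c in nodes:
        ps = [p for p in parents[c] if p in nodes]
        for p in ps:
            adj[c].add(p)
            adj[p].add(c)
        if c in observed:
            for i in range(len(ps)):
                for j in range(i + 1, len(ps)):
                    adj[ps[i]].add(ps[j])
                    adj[ps[j]].add(ps[i])
    # 3. Remove observed variables.
    for o in observed & set(adj):
        for n in adj.pop(o):
            adj[n].discard(o)
    # 4. Keep only variables connected to the query.
    seen, stack = set(), [query]
    while stack:
        n = stack.pop()
        if n in seen or n not in adj:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return seen
```

In a v-structure A → C ← B with C observed, the moralizing step (connecting C's parents) correctly keeps B relevant to a query on A.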

Markov chain

A Markov chain is a special sort of belief network:

S0 → S1 → S2 → S3 → S4

What probabilities need to be specified?

P(S0) specifies initial conditions

P(Si+1 | Si) specifies the dynamics

What independence assumptions are made?

P(Si+1 | S0, . . . , Si) = P(Si+1 | Si).

Often St represents the state at time t. Intuitively St conveys
all of the information about the history that can affect the
future states.

“The future is independent of the past given the present.”

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 6 1 / 23

Stationary Markov chain

A stationary Markov chain is one where, for all i > 0 and i′ > 0,
P(Si+1 | Si) = P(Si′+1 | Si′).

We specify P(S0) and P(Si+1 | Si).
- Simple model, easy to specify
- Often the natural model
- The network can extend indefinitely

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 7 2 / 23

Stationary Distribution

A distribution over states, P, is a stationary distribution if for
each state s, P(Si+1 = s) = P(Si = s).

Every Markov chain has a stationary distribution.

A Markov chain is ergodic if, for any two states s1 and s2,
there is a non-zero probability of eventually reaching s2 from s1.

A Markov chain is periodic if there is a strict temporal
regularity in visiting states: a state is visited only at times t
with t mod n = m, for some fixed n and m.

An ergodic and aperiodic Markov chain has a unique stationary
distribution P, and P(s) = limi→∞ Pi(s) — the equilibrium
distribution.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 12 3 / 23
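For a finite chain, the stationary distribution can be approximated by repeatedly applying the transition matrix (power iteration). A sketch; the two-state transition matrix is illustrative:

```python
# Approximate the stationary distribution of a finite Markov chain.
# T[i][j] = P(S_{t+1} = j | S_t = i).
def stationary(T, iters=1000):
    n = len(T)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return p

T = [[0.9, 0.1],   # an ergodic, aperiodic two-state chain
     [0.2, 0.8]]
p = stationary(T)  # converges to the equilibrium distribution [2/3, 1/3]
```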

Pagerank

Consider the Markov chain:

Domain of Si is the set of all web pages

P(S0) is uniform; P(S0 = pj) = 1/N

P(Si+1 = pj | Si = pk) = (1 − d)/N + d ∗
    1/nk  if pk links to pj
    1/N   if pk has no links
    0     otherwise

where there are N web pages and nk is the number of links from
page pk

d ≈ 0.85 is the probability someone keeps surfing the web

This Markov chain converges to a distribution over web pages
(original computation: P(Si) for i = 52, over 322 million links):
Pagerank - the basis for Google’s initial search engine

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 14 4 / 23
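The update above can be iterated directly on a small link graph. A sketch; the three-page graph and helper name are illustrative:

```python
# Pagerank by iterating the Markov chain update on a link graph.
# `links` maps each page to the set of pages it links to.
def pagerank(links, d=0.85, iters=100):
    pages = list(links)
    N = len(pages)
    p = {pg: 1.0 / N for pg in pages}  # P(S0) is uniform
    for _ in range(iters):
        new = {}
        for pj in pages:
            total = 0.0
            for pk in pages:
                out = links[pk]
                if not out:            # pk has no links: 1/N to everyone
                    follow = 1.0 / N
                elif pj in out:        # pk links to pj
                    follow = 1.0 / len(out)
                else:
                    follow = 0.0
                total += p[pk] * ((1 - d) / N + d * follow)
            new[pj] = total
        p = new
    return p

ranks = pagerank({"a": {"b"}, "b": {"a", "c"}, "c": {"a"}})
```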

Hidden Markov Model

A Hidden Markov Model (HMM) is a belief network:

S0 → S1 → S2 → S3 → S4
(each state Si has an observation child Oi: O0, . . . , O4)

The probabilities that need to be specified:

P(S0) specifies initial conditions

P(Si+1 | Si) specifies the dynamics

P(Oi | Si) specifies the sensor model

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 16 5 / 23

Filtering

Filtering: P(Si | o1, . . . , oi)

What is the current belief state based on the observation
history?

P(Si | o1, . . . , oi) ∝ P(oi | Si, o1, . . . , oi−1) P(Si | o1, . . . , oi−1)
= ???

P(Si | o1, . . . , oi−1) = ∑Si−1 P(Si, Si−1 | o1, . . . , oi−1)
= ???

Observe O0, query S0.
then observe O1, query S1.
then observe O2, query S2.
. . .

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 21 6 / 23
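One filtering step (predict with the dynamics, then update with the sensor model and normalize) can be sketched as follows. The two-state weather/umbrella numbers are illustrative, not from the text:

```python
# One recursive filtering step for a discrete-state HMM.
# trans[s][s2] = P(S_{i+1} = s2 | S_i = s); sensor[s][o] = P(O_i = o | S_i = s).
def filter_step(belief, obs, trans, sensor):
    states = list(belief)
    # Predict: P(S_i | o_1..o_{i-1}), summing over the previous state.
    predicted = {s2: sum(belief[s] * trans[s][s2] for s in states)
                 for s2 in states}
    # Update: weight by P(o_i | S_i), then normalize.
    unnorm = {s: sensor[s][obs] * predicted[s] for s in states}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

trans = {"rain": {"rain": 0.7, "dry": 0.3},
         "dry":  {"rain": 0.3, "dry": 0.7}}
sensor = {"rain": {"umbrella": 0.9, "none": 0.1},
          "dry":  {"umbrella": 0.2, "none": 0.8}}
belief = {"rain": 0.5, "dry": 0.5}
belief = filter_step(belief, "umbrella", trans, sensor)
```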

Example: localization

Suppose a robot wants to determine its location based on its
actions and its sensor readings: Localization

This can be represented by the augmented HMM:

[Diagram: location nodes Loc0 → Loc1 → Loc2 → Loc3 → Loc4,
each Loct with an observation child Obst, and action nodes
Act0, . . . , Act3 influencing the transitions.]

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 22 7 / 23

Example localization domain

Circular corridor, with 16 locations:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Doors at positions: 2, 4, 7, 11.

Noisy Sensors

Stochastic Dynamics

Robot starts at an unknown location and must determine
where it is.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 23 8 / 23

Example Sensor Model

P(Observe Door | At Door) = 0.8

P(Observe Door | Not At Door) = 0.1

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 24 9 / 23

Example Dynamics Model

P(loct+1 = L|actiont = goRight ∧ loct = L) = 0.1

P(loct+1 = L + 1|actiont = goRight ∧ loct = L) = 0.8

P(loct+1 = L + 2|actiont = goRight ∧ loct = L) = 0.074

P(loct+1 = L′ | actiont = goRight ∧ loct = L) = 0.002 for
any other location L′.

- All location arithmetic is modulo 16.
- The action goLeft works the same but to the left.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 25 10 / 23
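With the sensor and dynamics models above, one move-then-observe belief update for the circular corridor can be sketched as follows (the function names are illustrative):

```python
# Belief update for the 16-location circular corridor:
# goRight dynamics, then the door-sensor observation.
DOORS = {2, 4, 7, 11}
N = 16

def move_right(belief):
    new = [0.0] * N
    for L, p in enumerate(belief):
        new[L] += 0.1 * p                 # stay put
        new[(L + 1) % N] += 0.8 * p       # one step right
        new[(L + 2) % N] += 0.074 * p     # two steps right
        for Lp in range(N):               # any other location
            if Lp not in (L, (L + 1) % N, (L + 2) % N):
                new[Lp] += 0.002 * p
    return new

def observe_door(belief):
    # P(Observe Door | At Door) = 0.8; P(Observe Door | Not At Door) = 0.1
    unnorm = [(0.8 if L in DOORS else 0.1) * p for L, p in enumerate(belief)]
    z = sum(unnorm)
    return [p / z for p in unnorm]

belief = [1.0 / N] * N                    # start at an unknown location
belief = observe_door(move_right(belief))
```

After one goRight from a uniform belief and a door observation, the four door locations become equally most likely.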

Combining sensor information

Example: we can combine information from a light sensor
and the door sensor: Sensor Fusion

[Diagram: location nodes Loc0 . . . Loc4; door-sensor nodes
D0 . . . D4; light-sensor nodes L0 . . . L4; action nodes
Act0 . . . Act3.]

St: robot location at time t
Dt: door sensor value at time t
Lt: light sensor value at time t

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 26 11 / 23

Naive Bayes Classifier: User’s request for help

[Diagram: class variable H with word children "able", "absent",
"add", . . . , "zoom".]

H is the help page the user is interested in.
What probabilities are required?

P(hi) for each help page hi. The user is interested in one best
web page, so ∑i P(hi) = 1.

P(wj | hi) for each word wj given page hi. There can be
multiple words used in a query.

Given a help query: condition on the words in the query and
display the most likely help page.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 32 12 / 23
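The classifier above can be sketched with a small, made-up help system. The pages, vocabulary, and all probabilities are illustrative assumptions:

```python
# Naive Bayes help classifier: condition on which vocabulary words
# appear in the query, then normalize to get P(h | query).
def posterior_pages(query_words, prior, word_given_page, vocabulary):
    scores = {}
    for h, p in prior.items():
        score = p
        for w in vocabulary:
            pw = word_given_page[h][w]
            # a word in the query contributes P(w | h);
            # an absent word contributes 1 - P(w | h)
            score *= pw if w in query_words else (1.0 - pw)
        scores[h] = score
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

vocabulary = {"print", "zoom", "add"}
prior = {"printing_help": 0.5, "zoom_help": 0.5}
word_given_page = {"printing_help": {"print": 0.8, "zoom": 0.05, "add": 0.1},
                   "zoom_help":     {"print": 0.05, "zoom": 0.8, "add": 0.1}}
post = posterior_pages({"zoom"}, prior, word_given_page, vocabulary)
```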

Simple Language Models: set-of-words

Sentence: w1, w2, w3, . . . .
Set-of-words model:

[Diagram: independent Boolean variables, one per word:
“a”, “aardvark”, . . . , “zzz”.]

Each variable is Boolean: true when the word is in the
sentence and false otherwise.

What probabilities are provided?
- P(”a”), P(”aardvark”), . . . , P(”zzz”)

How do we condition on the question “how can I phone my
phone”?

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 36 13 / 23

Simple Language Models: bag-of-words

Sentence: w1, w2, w3, . . . , wn.
Bag-of-words or unigram:

[Diagram: independent variables W1, W2, W3, . . . , Wn.]

Domain of each variable is the set of all words.

What probabilities are provided?
- P(wi) is a distribution over words for each position

How do we condition on the question “how can I phone my
phone”?

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 40 14 / 23

Simple Language Models: bigram

Sentence: w1, w2, w3, . . . , wn.
Bigram:

[Diagram: a chain W1 → W2 → W3 → · · · → Wn.]

Domain of each variable is the set of all words.

What probabilities are provided?
- P(wi | wi−1) is a distribution over words for each position,
  given the previous word

How do we condition on the question “how can I phone my
phone”?

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 44 15 / 23

Simple Language Models: trigram

Sentence: w1, w2, w3, ..., wn.
Trigram:

    W1 → W2 → W3 → W4 → ... → Wn   (each Wi also has Wi−2 as a parent)

Domain of each variable is the set of all words.

What probabilities are provided?
- P(wi | wi−1, wi−2)

N-gram:
- P(wi | wi−1, ..., wi−n+1) is a distribution over words given the previous n − 1 words.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 45 16 / 23
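The unigram and bigram distributions above can be estimated by counting. A minimal sketch, using an assumed toy corpus (the corpus and the counting code are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

corpus = "how can i phone my phone".split()

# Unigram: P(w) = count(w) / total
unigram = Counter(corpus)
total = sum(unigram.values())

def p_unigram(w):
    return unigram[w] / total

# Bigram: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
bigram = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigram[prev][cur] += 1

def p_bigram(cur, prev):
    return bigram[prev][cur] / sum(bigram[prev].values())

print(p_unigram("phone"))       # "phone" occurs 2 of 6 times
print(p_bigram("phone", "my"))  # "my" is always followed by "phone"
```

A real model would smooth these counts so that unseen words and pairs get nonzero probability.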


Logic, Probability, Statistics, Ontology over time

[Figure: Google Books Ngram Viewer plot of the relative frequencies of "logic", "probability", "statistics", and "ontology" in books over time.]

From: Google Books Ngram Viewer (https://books.google.com/ngrams)

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 48 17 / 23

Predictive Typing and Error Correction

[Diagram: a chain of word variables W1 → W2 → W3 → ..., where each word Wi has observed letter variables L1i, L2i, ..., Lki as children.]

domain(Wi) = {"a", "aardvark", ..., "zzz", "⊥", "?"}
domain(Lji) = {"a", "b", "c", ..., "z", "1", "2", ...}

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 49 18 / 23

Beyond N-grams

A man with a big hairy cat drank the cold milk.

Who or what drank the milk?

Simple syntax diagram:

    [s [np [np a man]
           [pp with [np a big hairy cat]]]
       [vp drank [np the cold milk]]]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 50 19 / 23


Topic Model

[Diagram: topic variables (tools, food, fish, finance, ...) linked to word variables ("nut", "bolt", "tuna", "shark", "salmon", ...).]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 52 20 / 23


Google’s rephil

[Diagram: 900,000 topics connected by 350,000,000 links to 12,000,000 words ("Aaron's beard", "aardvark", ..., "zzz").]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 54 22 / 23

Deep Belief Networks

[Diagram: a network with several layers of variables, each layer connected to the next.]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 55 23 / 23

Stochastic Simulation

Idea: probabilities ↔ samples

Get probabilities from samples:

    X      count          X      probability
    x1     n1             x1     n1/m
    ...    ...            ...    ...
    xk     nk             xk     nk/m
    total  m

If we could sample from a variable's (posterior) probability, we could estimate its (posterior) probability.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 1 1 / 15

Generating samples from a distribution

For a variable X with a discrete domain or a (one-dimensional) real domain:

Totally order the values of the domain of X.

Generate the cumulative probability distribution: f(x) = P(X ≤ x).

Select a value y uniformly in the range [0, 1].

Select the smallest x such that f(x) ≥ y.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 2 2 / 15
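The four steps above can be sketched for a discrete variable; the domain and probabilities here are assumed toy values:

```python
import itertools
import random

values = ["v1", "v2", "v3", "v4"]   # totally ordered domain of X
probs  = [0.3, 0.2, 0.4, 0.1]       # assumed P(X = vi)

# Cumulative distribution f(x) = P(X <= x)
cumulative = list(itertools.accumulate(probs))

def sample():
    y = random.random()             # select y uniformly in [0, 1]
    for v, f in zip(values, cumulative):
        if y <= f:                  # smallest x with f(x) >= y
            return v
    return values[-1]

random.seed(0)
n = 100_000
counts = {v: 0 for v in values}
for _ in range(n):
    counts[sample()] += 1
freqs = {v: counts[v] / n for v in values}
print(freqs)  # each frequency should be close to the corresponding P(X = vi)
```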

Cumulative Distribution

[Figure: bar chart of P(X) and the corresponding cumulative distribution f(X), both over the values v1, v2, v3, v4, with vertical axes running from 0 to 1.]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 3 3 / 15

Hoeffding’s inequality

Theorem (Hoeffding): Suppose p is the true probability, and s is the sample average from n independent samples; then

    P(|s − p| > ε) ≤ 2e^(−2nε²)

Guarantees a probably approximately correct estimate of probability.

If you are willing to have an error greater than ε in δ of the cases, solve 2e^(−2nε²) < δ for n, which gives

    n > (−ln(δ/2)) / (2ε²)

    ε      δ      n
    0.1    0.05   185
    0.01   0.05   18,445
    0.1    0.01   265

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 4 4 / 15
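The sample-size bound can be checked directly; this sketch reproduces the table above:

```python
import math

def hoeffding_n(eps, delta):
    # Solve 2*exp(-2*n*eps^2) < delta  =>  n > -ln(delta/2) / (2*eps^2),
    # and round up to the smallest integer n satisfying the bound.
    return math.ceil(-math.log(delta / 2) / (2 * eps ** 2))

print(hoeffding_n(0.1, 0.05))   # 185
print(hoeffding_n(0.01, 0.05))  # 18445
print(hoeffding_n(0.1, 0.01))   # 265
```

Note that halving ε quadruples the required n, while shrinking δ only costs logarithmically.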


Forward sampling in a belief network

Sample the variables one at a time; sample parents of X before sampling X.

Given values for the parents of X, sample from the probability of X given its parents.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 7 5 / 15
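A sketch of forward sampling for the fire-alarm network used in the later examples; the conditional probabilities are the ones listed on the P(le | sm, ta, ¬re) slide:

```python
import random

def sample_network():
    # Sample each variable after its parents, from P(X | parents(X)).
    ta = random.random() < 0.02
    fi = random.random() < 0.01
    p_al = {(True, True): 0.5, (True, False): 0.99,
            (False, True): 0.85, (False, False): 0.0001}[(fi, ta)]
    al = random.random() < p_al
    sm = random.random() < (0.9 if fi else 0.01)
    le = random.random() < (0.88 if al else 0.001)
    re = random.random() < (0.75 if le else 0.01)
    return {"Ta": ta, "Fi": fi, "Al": al, "Sm": sm, "Le": le, "Re": re}

random.seed(1)
samples = [sample_network() for _ in range(100_000)]
est = sum(s["Ta"] for s in samples) / len(samples)
print(est)  # should be close to P(ta) = 0.02
```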

Rejection Sampling

To estimate a posterior probability given evidence Y1 = v1 ∧ ... ∧ Yj = vj:

Reject any sample that assigns Yi to a value other than vi.

The non-rejected samples are distributed according to the posterior probability:

    P(α | evidence) ≈ (Σ_{sample ⊨ α} 1) / (Σ_{sample} 1)

where we consider only samples consistent with evidence.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 8 6 / 15

Rejection Sampling Example: P(ta | sm, re)

[Belief network: Ta → Al ← Fi, Fi → Sm, Al → Le, Le → Re]

Observe Sm = true, Re = true

           Ta     Fi     Al     Sm     Le     Re
    s1     false  true   false  true   false  false   ✗
    s2     false  true   true   true   true   true    ✓
    s3     true   false  true   false  —      —       ✗
    s4     true   true   true   true   true   true    ✓
    ...
    s1000  false  false  false  false  —      —       ✗

P(sm) = 0.02
P(re | sm) = 0.32
How many samples are rejected?
How many samples are used?

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 9 7 / 15
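A sketch of rejection sampling for this query, using the conditional probabilities listed on the later example slide. Since P(sm) = 0.02 and P(re | sm) = 0.32, only about 1000 × 0.02 × 0.32 ≈ 6 of 1000 samples are expected to survive:

```python
import random

def forward_sample():
    # Forward-sample the fire-alarm network.
    ta = random.random() < 0.02
    fi = random.random() < 0.01
    p_al = {(True, True): 0.5, (True, False): 0.99,
            (False, True): 0.85, (False, False): 0.0001}[(fi, ta)]
    al = random.random() < p_al
    sm = random.random() < (0.9 if fi else 0.01)
    le = random.random() < (0.88 if al else 0.001)
    re = random.random() < (0.75 if le else 0.01)
    return ta, sm, re

random.seed(0)
# Keep only samples consistent with the evidence Sm = true, Re = true.
kept = [ta for ta, sm, re in (forward_sample() for _ in range(1000)) if sm and re]
print(len(kept), "of 1000 samples kept")
```

With so few surviving samples, the estimate of P(ta | sm, re) is very noisy, which motivates importance sampling below.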


Importance Sampling

Samples have weights: a real number associated with each sample that takes the evidence into account.

Probability of a proposition is a weighted average of samples:

    P(α | evidence) ≈ (Σ_{sample ⊨ α} weight(sample)) / (Σ_{sample} weight(sample))

Mix exact inference with sampling: don't sample all of the variables, but weight each sample according to P(evidence | sample).

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 15 8 / 15

Importance Sampling (Likelihood Weighting)

procedure likelihood_weighting(Bn, e, Q, n):
    ans[1 : k] := 0 where k is size of dom(Q)
    repeat n times:
        weight := 1
        for each variable Xi in order:
            if Xi = oi is observed:
                weight := weight × P(Xi = oi | parents(Xi))
            else:
                assign Xi a random sample of P(Xi | parents(Xi))
        if Q has value v:
            ans[v] := ans[v] + weight
    return ans / Σv ans[v]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 16 9 / 15
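A Python rendering of the procedure above for the fire-alarm network (conditional probabilities as on the later example slide), estimating P(Ta | Sm = true, Re = true):

```python
import random
from collections import defaultdict

def likelihood_weighting(n, evidence):
    ans = defaultdict(float)
    for _ in range(n):
        weight = 1.0
        sample = {}

        def assign(name, p_true):
            # Observed variables contribute to the weight; the rest are sampled.
            nonlocal weight
            if name in evidence:
                sample[name] = evidence[name]
                weight *= p_true if evidence[name] else (1 - p_true)
            else:
                sample[name] = random.random() < p_true

        # Variables in topological order.
        assign("Ta", 0.02)
        assign("Fi", 0.01)
        p_al = {(True, True): 0.5, (True, False): 0.99,
                (False, True): 0.85, (False, False): 0.0001}[
                    (sample["Fi"], sample["Ta"])]
        assign("Al", p_al)
        assign("Sm", 0.9 if sample["Fi"] else 0.01)
        assign("Le", 0.88 if sample["Al"] else 0.001)
        assign("Re", 0.75 if sample["Le"] else 0.01)
        ans[sample["Ta"]] += weight
    total = sum(ans.values())
    return {v: w / total for v, w in ans.items()}

random.seed(0)
result = likelihood_weighting(100_000, {"Sm": True, "Re": True})
print(result)  # result[True] estimates P(ta | sm, re)
```

Unlike rejection sampling, every sample is used; no samples are thrown away.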

Importance Sampling Example: P(ta | sm, re)

[Belief network: Ta → Al ← Fi, Fi → Sm, Al → Le, Le → Re]

           Ta     Fi     Al     Le     Weight
    s1     true   false  true   false  0.01 × 0.01
    s2     false  true   false  false  0.9 × 0.01
    s3     false  true   true   true   0.9 × 0.75
    s4     true   true   true   true   0.9 × 0.75
    ...
    s1000  false  false  true   true   0.01 × 0.75

P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(re | le) = 0.75
P(re | ¬le) = 0.01

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 17 10 / 15


Importance Sampling Example: P(le | sm, ta,¬re)

[Belief network: Ta → Al ← Fi, Fi → Sm, Al → Le, Le → Re]

P(ta) = 0.02
P(fi) = 0.01
P(al | fi ∧ ta) = 0.5
P(al | fi ∧ ¬ta) = 0.99
P(al | ¬fi ∧ ta) = 0.85
P(al | ¬fi ∧ ¬ta) = 0.0001
P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(le | al) = 0.88
P(le | ¬al) = 0.001
P(re | le) = 0.75
P(re | ¬le) = 0.01

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 19 11 / 15

Computing Expectations & Proposal Distributions

Expected value of f with respect to distribution P:

    E_P(f) = Σ_w f(w) · P(w)
           ≈ (1/n) Σ_s f(s)

where s is sampled with probability P, and there are n samples.

    E_P(f) = Σ_w f(w) · (P(w)/Q(w)) · Q(w)
           ≈ (1/n) Σ_s f(s) · P(s)/Q(s)

where s is selected according to the distribution Q.
The distribution Q is called a proposal distribution.
Requirement: if P(c) > 0 then Q(c) > 0.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 20 12 / 15
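A sketch of the proposal-distribution identity above, with assumed toy distributions P and Q (not from the slides):

```python
import random

P = {0: 0.1, 1: 0.2, 2: 0.7}      # target distribution
Q = {0: 1/3, 1: 1/3, 2: 1/3}      # proposal: Q(c) > 0 wherever P(c) > 0

def f(w):
    return w * w                   # any function of the world

random.seed(0)
n = 200_000
total = 0.0
for _ in range(n):
    s = random.choice([0, 1, 2])   # sample s according to Q (uniform here)
    total += f(s) * P[s] / Q[s]    # weight by P(s)/Q(s)

est = total / n
print(est)  # close to E_P(f) = 0.1*0 + 0.2*1 + 0.7*4 = 3.0
```

The estimate is unbiased for any valid Q, but its variance depends on how closely Q matches f · P.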


Particle Filtering

Importance sampling can be seen as:

for each particle:
    for each variable:
        sample / absorb evidence / update query

where a particle is one of the samples.

Instead we could do:

for each variable:
    for each particle:
        sample / absorb evidence / update query

Why?

- We can have a new operation of resampling.
- It works with infinitely many variables (e.g., HMMs).

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 24 13 / 15


Particle Filtering for HMMs

Start with randomly chosen particles (say 1000).

Each particle represents a history.

Initially, sample states in proportion to their probability.

Repeat:
- Absorb evidence: weight each particle by the probability of the evidence given the state of the particle.
- Resample: select each particle at random, in proportion to the weight of the particle. Some particles may be duplicated, some may be removed. All new particles have the same weight.
- Transition: sample the next state for each particle according to the transition probabilities.

To answer a query about the current state, use the set of particles as data.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 27 14 / 15
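A sketch of the loop above for an assumed two-state HMM; the prior, transition, and observation numbers are illustrative, not from the slides:

```python
import random

STATES = ["rain", "dry"]
PRIOR  = {"rain": 0.5, "dry": 0.5}
TRANS  = {"rain": {"rain": 0.7, "dry": 0.3},
          "dry":  {"rain": 0.3, "dry": 0.7}}
OBS    = {"rain": 0.9, "dry": 0.2}   # P(umbrella observed | state)

def categorical(dist):
    # Sample a value from a discrete distribution {value: probability}.
    r, acc = random.random(), 0.0
    for v, p in dist.items():
        acc += p
        if r <= acc:
            return v
    return v

def particle_filter(observations, n=10_000):
    particles = [categorical(PRIOR) for _ in range(n)]
    for obs in observations:
        # Absorb evidence: weight each particle by P(obs | state).
        weights = [OBS[s] if obs else 1 - OBS[s] for s in particles]
        # Resample in proportion to weight; all new particles weigh the same.
        particles = random.choices(particles, weights=weights, k=n)
        # Transition: sample the next state for each particle.
        particles = [categorical(TRANS[s]) for s in particles]
    return particles

random.seed(0)
particles = particle_filter([True, True, True])
frac = sum(s == "rain" for s in particles) / len(particles)
print(frac)  # fraction of particles in state "rain"
```

Resampling concentrates particles in likely regions of the state space, which is what lets the filter run forever without its weights degenerating.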

Markov Chain Monte Carlo

To sample from a distribution P:

- Create an (ergodic and aperiodic) Markov chain with P as its equilibrium distribution.
- Let T(S_{i+1} | S_i) be the transition probability.
- Given state s, sample state s′ from T(S | s).
- After a while, the states sampled will be distributed according to P.
- Ignore the first samples ("burn-in"); use the remaining samples.
- Samples are not independent of each other ("autocorrelation").
- Sometimes only a subset (e.g., 1/1000) of them is used ("thinning").
- Gibbs sampler: sample each non-observed variable from the distribution of the variable given the current (or observed) values of the variables in its Markov blanket.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 28 15 / 15
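A minimal Gibbs-sampler sketch for an assumed joint P(X, Y) over two binary variables (the joint is illustrative, not from the slides); each step resamples one variable conditioned on the other, which here is its whole Markov blanket:

```python
import random

# Assumed joint distribution P(X, Y) over binary X, Y.
JOINT = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

def conditional(var, other_value):
    # P(var = 1 | other variable = other_value)
    if var == "X":
        p1, p0 = JOINT[(1, other_value)], JOINT[(0, other_value)]
    else:
        p1, p0 = JOINT[(other_value, 1)], JOINT[(other_value, 0)]
    return p1 / (p0 + p1)

random.seed(0)
x, y = 0, 0
samples = []
for i in range(110_000):
    x = int(random.random() < conditional("X", y))
    y = int(random.random() < conditional("Y", x))
    if i >= 10_000:            # discard burn-in
        samples.append((x, y))

# Empirical marginal P(X = 1) should approach 0.15 + 0.45 = 0.60.
est = sum(s[0] for s in samples) / len(samples)
print(est)
```

Consecutive samples are autocorrelated, so the effective sample size is smaller than len(samples); thinning would reduce this at the cost of discarding draws.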

