The mind is a neural computer, fitted by natural selection


“The mind is a neural computer, fitted by natural selection with combinatorial algorithms for causal and probabilistic reasoning about plants, animals, objects, and people.”

. . . “In a universe with any regularities at all, decisions informed about the past are better than decisions made at random. That has always been true, and we would expect organisms, especially informavores such as humans, to have evolved acute intuitions about probability. The founders of probability, like the founders of logic, assumed they were just formalizing common sense.”

Steven Pinker, How the Mind Works, 1997, pp. 524, 343.

© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.1

Learning Objectives

At the end of the class you should be able to:

justify the use and semantics of probability

know how to compute marginals and apply Bayes’ theorem

build a belief network for a domain

predict the inferences for a belief network

explain the predictions of a causal model


Using Uncertain Knowledge

Agents don’t have complete knowledge about the world.

Agents need to make (informed) decisions given their uncertainty.

It isn’t enough to assume what the world is like. Example: wearing a seat belt.

An agent needs to reason about its uncertainty.

When an agent acts under uncertainty, it is gambling ⇒ probability.


Probability

Probability is an agent’s measure of belief in some proposition — subjective probability.

An agent’s belief depends on its prior belief and what it observes.

Example: an agent’s probability of a particular bird flying
I Other agents may have different probabilities.
I An agent’s belief in a bird’s flying ability is affected by what the agent knows about that bird.


Random Variables

A random variable starts with upper case.

The domain of a variable X, written domain(X), is the set of values X can take. (Sometimes called “range”, “frame”, or “possible values”.)

A tuple of random variables ⟨X1, . . . , Xn⟩ is a complex random variable with domain domain(X1) × · · · × domain(Xn). Often the tuple is written as X1, . . . , Xn.

Assignment X = x means variable X has value x.

A proposition is a Boolean formula made from assignments of values to variables, or inequalities (e.g., <, ≤, . . . ) between variables and values.


Possible World Semantics

A possible world specifies an assignment of one value to each random variable.

A random variable is a function from possible worlds into the domain of the random variable.

ω ⊨ X = x means variable X is assigned value x in world ω.

Logical connectives have their standard meaning:

ω ⊨ α ∧ β if ω ⊨ α and ω ⊨ β

ω ⊨ α ∨ β if ω ⊨ α or ω ⊨ β

ω ⊨ ¬α if ω ⊭ α

Let Ω be the set of all possible worlds.


Semantics of Probability

Probability defines a measure on sets of possible worlds. A probability measure is a function µ from sets of worlds into the non-negative real numbers such that:

µ(Ω) = 1

µ(S1 ∪ S2) = µ(S1) + µ(S2) if S1 ∩ S2 = ∅.

Then P(α) = µ({ω | ω ⊨ α}).
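In code this semantics is tiny. The sketch below is illustrative (the variables and weights are assumptions, not from the slides): each world is a dict of variable assignments, and P(α) is the measure of the set of worlds satisfying α.

```python
from itertools import product

# Possible worlds: every assignment of values to the random variables.
variables = {"Fire": [True, False], "Smoke": [True, False]}
worlds = [dict(zip(variables, vals)) for vals in product(*variables.values())]

# An assumed measure mu: non-negative weights over worlds that sum to 1.
mu = [0.01, 0.04, 0.05, 0.90]  # one weight per world, in order

def P(alpha):
    """P(alpha) = mu({omega : omega satisfies alpha}); alpha is a predicate on worlds."""
    return sum(m for m, w in zip(mu, worlds) if alpha(w))

p_fire = P(lambda w: w["Fire"])                  # sums the worlds where Fire holds
p_either = P(lambda w: w["Fire"] or w["Smoke"])  # a disjunction is a union of worlds
p_omega = P(lambda w: True)                      # mu(Omega), which must be 1
```

Any proposition, however complex, is just a predicate that picks out a subset of Ω.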


Semantics

Possible Worlds:

Suppose the measure of each singleton world is 0.1.

What is the probability of circle?

What is the probability of star?

What is the probability of triangle?

What is the probability of orange?

What is the probability of orange and star?

What are the random variables?


Axioms of Probability (finite case)

Three axioms define what follows from a set of probabilities:

Axiom 1 0 ≤ P(a) for any proposition a.

Axiom 2 P(true) = 1

Axiom 3 P(a ∨ b) = P(a) + P(b) if a and b cannot both be true.

These axioms are sound and complete with respect to the semantics.
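Two consequences that follow from these axioms alone are P(¬a) = 1 − P(a) and P(a ∨ b) = P(a) + P(b) − P(a ∧ b). A quick numeric sanity check, on an arbitrary assumed distribution over two Boolean propositions:

```python
# Assumed weights over the four worlds (a, b); they sum to 1.
mu = {(True, True): 0.2, (True, False): 0.3,
      (False, True): 0.1, (False, False): 0.4}

def P(prop):
    """Probability of a proposition, given as a predicate on (a, b)."""
    return sum(w for world, w in mu.items() if prop(*world))

p_a = P(lambda a, b: a)
p_b = P(lambda a, b: b)

# Complement rule: a and ~a partition the worlds (Axioms 2 and 3).
assert abs(P(lambda a, b: not a) - (1 - p_a)) < 1e-9
# Inclusion-exclusion, also derivable from Axiom 3.
assert abs(P(lambda a, b: a or b) - (p_a + p_b - P(lambda a, b: a and b))) < 1e-9
```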


Conditioning

Probabilistic conditioning specifies how to revise beliefs based on new information.

An agent builds a probabilistic model taking all background information into account. This gives the prior probability.

All other information must be conditioned on.

If evidence e is all of the information obtained subsequently, the conditional probability P(h | e) of h given e is the posterior probability of h.


Semantics of Conditional Probability

Evidence e rules out possible worlds incompatible with e.

Evidence e induces a new measure, µe, over possible worlds:

µe(S) = c × µ(S) if ω ⊨ e for all ω ∈ S
µe(S) = 0 if ω ⊭ e for all ω ∈ S

We can show c = 1/P(e).

The conditional probability of formula h given evidence e is

P(h | e) = µe({ω : ω ⊨ h}) = P(h ∧ e) / P(e)
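Conditioning can be sketched directly from this definition: zero out the worlds incompatible with e and rescale the rest by c = 1/P(e). The joint numbers below are assumed for illustration.

```python
# Assumed measure over worlds (Fire, Alarm).
mu = {("fire", "alarm"): 0.08, ("fire", "quiet"): 0.02,
      ("nofire", "alarm"): 0.09, ("nofire", "quiet"): 0.81}

def P(prop, measure=None):
    measure = mu if measure is None else measure
    return sum(p for w, p in measure.items() if prop(w))

def condition(e):
    """mu_e: c * mu(S) on worlds satisfying e, 0 on the rest, with c = 1/P(e)."""
    c = 1.0 / P(e)
    return {w: (c * p if e(w) else 0.0) for w, p in mu.items()}

alarm = lambda w: w[1] == "alarm"
fire = lambda w: w[0] == "fire"

mu_e = condition(alarm)
posterior = P(fire, mu_e)  # P(fire | alarm) = P(fire and alarm) / P(alarm)
assert abs(posterior - P(lambda w: fire(w) and alarm(w)) / P(alarm)) < 1e-9
```

Note that µe is itself a probability measure: it is non-negative and sums to 1, which is exactly what the normalizing constant c buys us.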


Conditioning

Possible Worlds:

Observe Color = orange:


Exercise

Flu    Sneeze  Snore   µ
true   true    true    0.064
true   true    false   0.096
true   false   true    0.016
true   false   false   0.024
false  true    true    0.096
false  true    false   0.144
false  false   true    0.224
false  false   false   0.336

What is:

(a) P(flu ∧ sneeze)

(b) P(flu ∧ ¬sneeze)

(c) P(flu)

(d) P(sneeze | flu)

(e) P(¬flu ∧ sneeze)

(f) P(flu | sneeze)

(g) P(sneeze | flu ∧ snore)

(h) P(flu | sneeze ∧ snore)
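These queries can be checked mechanically. The sketch below encodes the table as a joint distribution (the helper names are mine) and computes a few of the answers:

```python
# The joint distribution from the table: (flu, sneeze, snore) -> mu.
joint = {
    (True, True, True): 0.064,   (True, True, False): 0.096,
    (True, False, True): 0.016,  (True, False, False): 0.024,
    (False, True, True): 0.096,  (False, True, False): 0.144,
    (False, False, True): 0.224, (False, False, False): 0.336,
}

def P(prop):
    return sum(p for w, p in joint.items() if prop(*w))

def P_given(h, e):
    """Conditional probability P(h | e) = P(h and e) / P(e)."""
    return P(lambda *w: h(*w) and e(*w)) / P(e)

flu = lambda f, sn, sr: f
sneeze = lambda f, sn, sr: sn

ans_a = P(lambda f, sn, sr: f and sn)   # (a) P(flu and sneeze)
ans_c = P(flu)                          # (c) P(flu)
ans_d = P_given(sneeze, flu)            # (d) P(sneeze | flu)
ans_f = P_given(flu, sneeze)            # (f) P(flu | sneeze)
```

The remaining parts are the same two helpers with different predicates.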


Chain Rule

P(f1 ∧ f2 ∧ · · · ∧ fn)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(f1 ∧ · · · ∧ fn−1)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × P(f1 ∧ · · · ∧ fn−2)
= P(fn | f1 ∧ · · · ∧ fn−1) × P(fn−1 | f1 ∧ · · · ∧ fn−2) × · · · × P(f3 | f1 ∧ f2) × P(f2 | f1) × P(f1)
= ∏_{i=1}^{n} P(fi | f1 ∧ · · · ∧ fi−1)
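Each equality above just applies the definition of conditional probability and cancels. A numeric spot-check of the n = 3 case, using an arbitrary assumed joint over propositions f1, f2, f3:

```python
from itertools import product
import random

random.seed(0)
# An arbitrary normalized joint over the 8 worlds of (f1, f2, f3).
weights = [random.random() for _ in range(8)]
joint = {w: wt / sum(weights)
         for w, wt in zip(product([True, False], repeat=3), weights)}

def P(prop):
    return sum(p for w, p in joint.items() if prop(w))

lhs = P(lambda w: w[0] and w[1] and w[2])          # P(f1 ^ f2 ^ f3)

p1 = P(lambda w: w[0])                             # P(f1)
p2_given_1 = P(lambda w: w[0] and w[1]) / p1       # P(f2 | f1)
p3_given_12 = lhs / P(lambda w: w[0] and w[1])     # P(f3 | f1 ^ f2)

# Chain rule: the product of the conditionals recovers the joint.
assert abs(lhs - p3_given_12 * p2_given_1 * p1) < 1e-9
```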


Bayes’ theorem

The chain rule and commutativity of conjunction (h ∧ e is equivalent to e ∧ h) give us:

P(h ∧ e) = P(h | e) × P(e)
         = P(e | h) × P(h).

If P(e) ≠ 0, divide the right-hand sides by P(e):

P(h | e) = P(e | h) × P(h) / P(e).

This is Bayes’ theorem.


Why is Bayes’ theorem interesting?

Often you have causal knowledge:
P(symptom | disease)
P(light is off | status of switches and switch positions)
P(alarm | fire)
P(image looks like this | a tree is in front of a car)

and want to do evidential reasoning:
P(disease | symptom)
P(status of switches | light is off and switch positions)
P(fire | alarm)
P(a tree is in front of a car | image looks like this)


Exercise

A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:

85% of the cabs in the city are Green and 15% are Blue.

A witness identified the cab as Blue. The court tested the reliability of the witness in the circumstances that existed on the night of the accident and concluded that the witness correctly identifies each of the two colours 80% of the time and fails 20% of the time.

What is the probability that the cab involved in the accident was Blue?

[From D. Kahneman, Thinking Fast and Slow, 2011, p. 166.]
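The exercise is a direct application of Bayes’ theorem; a short sketch (variable names are mine):

```python
p_blue = 0.15                 # prior: 15% of cabs are Blue
p_green = 0.85
p_say_blue_if_blue = 0.80     # witness reliability
p_say_blue_if_green = 0.20    # witness error rate

# P(Blue | says Blue) = P(says Blue | Blue) * P(Blue) / P(says Blue),
# where P(says Blue) sums over both hypotheses.
p_say_blue = p_say_blue_if_blue * p_blue + p_say_blue_if_green * p_green
posterior = p_say_blue_if_blue * p_blue / p_say_blue
print(round(posterior, 3))    # → 0.414
```

The posterior is well below the witness's 80% reliability, because the low prior on Blue cabs dominates.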


Conditional independence

Random variable X is independent of random variable Y given random variable(s) Z if

P(X | Y, Z) = P(X | Z)

i.e., for all x ∈ dom(X), y, y′ ∈ dom(Y), and z ∈ dom(Z),

P(X = x | Y = y ∧ Z = z) = P(X = x | Y = y′ ∧ Z = z) = P(X = x | Z = z).

That is, knowledge of Y’s value doesn’t affect the belief in the value of X, given a value of Z.
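As a concrete check, the flu/sneeze/snore joint from the Lecture 8.1 exercise satisfies this definition with X = Sneeze, Y = Snore, Z = Flu. The loop below verifies the definition directly (helper names are mine):

```python
# Joint from the earlier exercise: (flu, sneeze, snore) -> mu.
joint = {
    (True, True, True): 0.064,   (True, True, False): 0.096,
    (True, False, True): 0.016,  (True, False, False): 0.024,
    (False, True, True): 0.096,  (False, True, False): 0.144,
    (False, False, True): 0.224, (False, False, False): 0.336,
}

def P(prop):
    return sum(p for w, p in joint.items() if prop(*w))

def P_given(h, e):
    return P(lambda *w: h(*w) and e(*w)) / P(e)

sneeze = lambda f, sn, sr: sn

# P(Sneeze | Snore = s, Flu = z) equals P(Sneeze | Flu = z) for every s and z,
# so Sneeze is independent of Snore given Flu.
for z in (True, False):
    reference = P_given(sneeze, lambda f, sn, sr: f == z)
    for s in (True, False):
        conditioned = P_given(sneeze, lambda f, sn, sr: f == z and sr == s)
        assert abs(conditioned - reference) < 1e-9
```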

© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.2


Example

Consider a student writing an exam. What are reasonable independences among the following?

Whether the student works hard

Whether the student is intelligent

The student’s answers on the exam

The student’s mark on an exam


Example domain (diagnostic assistant)

[Figure: house wiring domain — lights l1, l2; switches s1, s2, s3 (including a two-way switch); switch positions on/off; circuit breakers cb1, cb2; power outlets p1, p2; wires w0–w6; outside power.]


Examples of conditional independence?

Suppose you know whether there was power in w1 and whether there was power in w2. What information is relevant to whether light l1 is lit? What is independent?

Whether light l1 is lit is independent of the position of light switch s2 given what?

Every other variable may be independent of whether light l1 is lit given whether there is power in wire w0 and the status of light l1 (if it’s ok, or if not, how it’s broken).


Idea of belief networks

Whether l1 is lit (L1_lit) depends only on the status of the light (L1_st) and whether there is power in wire w0 (W0).

In a belief network, W0 and L1_st are parents of L1_lit.

[Figure: network fragment — W1, W2, S2_pos, and S2_st are parents of W0; W0 and L1_st are parents of L1_lit.]

W0 depends only on whether there is power in w1, whether there is power in w2, the position of switch s2 (S2_pos), and the status of switch s2 (S2_st).


Belief networks

Totally order the variables of interest: X1, . . . , Xn.

Theorem of probability theory (chain rule):
P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | X1, . . . , Xi−1)

The parents parents(Xi) of Xi are those predecessors of Xi that render Xi independent of the other predecessors. That is, parents(Xi) ⊆ {X1, . . . , Xi−1} and
P(Xi | parents(Xi)) = P(Xi | X1, . . . , Xi−1)

So P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))

A belief network is a graph: the nodes are random variables; there is an arc from the parents of each node into that node.


Example: fire alarm belief network

Variables:

Fire: there is a fire in the building

Tampering: someone has been tampering with the fire alarm

Smoke: what appears to be smoke is coming from an upstairs window

Alarm: the fire alarm goes off

Leaving: people are leaving the building en masse.

Report: a colleague says that people are leaving the building en masse. (A noisy sensor for leaving.)
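A sketch of this network in code, with Fire and Tampering as roots, Alarm depending on both, Smoke on Fire, Leaving on Alarm, and Report on Leaving. The CPT numbers below are made up for illustration (the slides give none here); the joint is the product of each variable given its parents, and posteriors are read off by enumeration.

```python
from itertools import product

# Assumed, illustrative CPTs: P(variable = true) given its parents.
p_fire, p_tampering = 0.01, 0.02
p_alarm = {(True, True): 0.50, (True, False): 0.99,    # given (Fire, Tampering)
           (False, True): 0.85, (False, False): 0.0001}
p_smoke = {True: 0.90, False: 0.01}                    # given Fire
p_leaving = {True: 0.88, False: 0.001}                 # given Alarm
p_report = {True: 0.75, False: 0.01}                   # given Leaving

def bern(p, v):
    """P(X = v) for a Boolean X with P(X = true) = p."""
    return p if v else 1 - p

def joint(fi, ta, al, sm, le, re):
    # The belief-network factorization: product of each variable given its parents.
    return (bern(p_fire, fi) * bern(p_tampering, ta) *
            bern(p_alarm[(fi, ta)], al) * bern(p_smoke[fi], sm) *
            bern(p_leaving[al], le) * bern(p_report[le], re))

def P(prop):
    return sum(joint(*w) for w in product([True, False], repeat=6) if prop(*w))

# Evidential reasoning by enumeration: P(Fire | Report = true).
posterior = (P(lambda fi, ta, al, sm, le, re: fi and re) /
             P(lambda fi, ta, al, sm, le, re: re))
```

Six Booleans mean only 64 worlds, so brute-force enumeration is fine here; real inference algorithms exist precisely because this blows up exponentially.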


Components of a belief network

A belief network consists of:

a directed acyclic graph with nodes labeled with random variables

a domain for each random variable

a conditional probability table for each variable given its parents (including prior probabilities for nodes with no parents).


Example belief network

[Figure: belief network for the wiring domain — nodes Outside_power, Cb1_st, Cb2_st, W0–W6, P1, P2, S1_pos, S1_st, S2_pos, S2_st, S3_pos, S3_st, L1_st, L1_lit, L2_st, L2_lit.]


Example belief network (continued)

The belief network also specifies:

The domain of the variables:
W0, . . . , W6 have domain {live, dead}.
S1_pos, S2_pos, and S3_pos have domain {up, down}.
S1_st has domain {ok, upside_down, short, intermittent, broken}.

Conditional probabilities, including:
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = live)
P(W1 = live | S1_pos = up ∧ S1_st = ok ∧ W3 = dead)
P(S1_pos = up)
P(S1_st = upside_down)


Belief network summary

A belief network is a directed acyclic graph (DAG) where nodes are random variables.

The parents of a node n are those variables on which n directly depends.

A belief network is automatically acyclic by construction.

A belief network is a graphical representation of dependence and independence:
I A variable is independent of its non-descendants given its parents.


Constructing belief networks

To represent a domain in a belief network, you need to consider:

What are the relevant variables?
I What will you observe?
I What would you like to find out (query)?
I What other features make the model simpler?

What values should these variables take?

What is the relationship between them? This should be expressed in terms of a directed graph, representing how each variable is generated from its predecessors.

How does the value of each variable depend on its parents? This is expressed in terms of the conditional probabilities.


Understanding Independence: Common ancestors

[Figure: fire is a parent of both alarm and smoke.]

alarm and smoke are dependent.

alarm and smoke are independent given fire.

Intuitively, fire can explain alarm and smoke; learning one can affect the other by changing your belief in fire.
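This pattern can be checked numerically. With any CPTs of this shape (the numbers below are assumed for illustration), alarm and smoke are marginally dependent but become independent once fire is observed:

```python
from itertools import product

# Assumed CPTs for fire -> alarm and fire -> smoke.
p_fire = 0.01
p_alarm = {True: 0.95, False: 0.02}   # given Fire
p_smoke = {True: 0.90, False: 0.01}   # given Fire

def bern(p, v):
    return p if v else 1 - p

def joint(f, a, s):
    return bern(p_fire, f) * bern(p_alarm[f], a) * bern(p_smoke[f], s)

def P(prop):
    return sum(joint(*w) for w in product([True, False], repeat=3) if prop(*w))

# Marginally dependent: observing alarm raises belief in smoke (via fire).
p_smoke_marginal = P(lambda f, a, s: s)
p_smoke_given_alarm = P(lambda f, a, s: s and a) / P(lambda f, a, s: a)
assert p_smoke_given_alarm > p_smoke_marginal

# Independent given fire: once Fire is known, alarm tells us nothing about smoke.
for f in (True, False):
    with_alarm = (P(lambda f_, a, s: s and a and f_ == f) /
                  P(lambda f_, a, s: a and f_ == f))
    without = P(lambda f_, a, s: s and f_ == f) / P(lambda f_, a, s: f_ == f)
    assert abs(with_alarm - without) < 1e-9
```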

© D. Poole and A. Mackworth 2017, Artificial Intelligence, Lecture 8.3


Understanding Independence: Chain

(Network: alarm → leaving → report.)

alarm and report are dependent.

alarm and report are independent given leaving.

Intuitively, the only way that the alarm affects report is by affecting leaving.

Understanding Independence: Common descendants

(Network: tampering → alarm ← fire.)

tampering and fire are independent.

tampering and fire are dependent given alarm.

Intuitively, tampering can explain away fire.

Understanding independence: example

(Figure: a belief network over the variables A through S; the arc structure of the graph did not survive the text extraction.)

Understanding independence: questions

1. On which given probabilities does P(N) depend?

2. If you were to observe a value for B, which variables' probabilities will change?

3. If you were to observe a value for N, which variables' probabilities will change?

4. Suppose you had observed a value for M; if you were to then observe a value for N, which variables' probabilities will change?

5. Suppose you had observed B and Q; which variables' probabilities will change when you observe N?

What variables are affected by observing?

If you observe variable(s) Y, the variables whose posterior probability is different from their prior are:

the ancestors of Y, and

their descendants.

Intuitively (if you have a causal belief network): you do abduction to possible causes, and prediction from the causes.

d-separation

A connection is a meeting of arcs in a belief network. Whether a connection is open, given a set of variables Z, is defined as follows:

If there are arcs A → B and B → C such that B ∉ Z, then the connection at B between A and C is open.

If there are arcs B → A and B → C such that B ∉ Z, then the connection at B between A and C is open.

If there are arcs A → B and C → B such that B (or a descendant of B) is in Z, then the connection at B between A and C is open.

X is d-connected to Y given Z if there is a path from X to Y along open connections.

X is d-separated from Y given Z if it is not d-connected to Y given Z.

X is independent of Y given Z for all conditional probabilities iff X is d-separated from Y given Z.
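The open/closed-connection test can be checked mechanically. A minimal sketch in Python on the tampering/fire/alarm network used in these slides; the `parents` dict and function names are my own, not from the textbook's companion code:

```python
# d-separation by enumerating undirected paths and testing each
# connection, on the tampering/fire/alarm network of the slides.
parents = {
    "tampering": [], "fire": [],
    "alarm": ["tampering", "fire"],
    "smoke": ["fire"],
    "leaving": ["alarm"],
    "report": ["leaving"],
}

def descendants(v):
    """All descendants of v in the DAG."""
    kids = {c for c, ps in parents.items() if v in ps}
    for k in set(kids):
        kids |= descendants(k)
    return kids

def d_separated(x, y, z):
    """True iff x is d-separated from y given the set of variables z."""
    def neighbours(v):
        return set(parents[v]) | {c for c, ps in parents.items() if v in ps}

    def is_open(path):
        # Check every connection a - b - c along the path.
        for a, b, c in zip(path, path[1:], path[2:]):
            collider = a in parents[b] and c in parents[b]   # a -> b <- c
            if collider:
                if b not in z and not (descendants(b) & z):
                    return False   # closed: b and its descendants unobserved
            elif b in z:
                return False       # chain or fork blocked by observing b
        return True

    def paths(cur, goal, sofar):
        if cur == goal:
            yield sofar + [cur]
        else:
            for n in neighbours(cur):
                if n not in sofar:
                    yield from paths(n, goal, sofar + [cur])

    return not any(is_open(p) for p in paths(x, y, []))
```

For example, tampering and fire are d-separated given nothing, but become d-connected once alarm (or its descendant report) is observed.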

Markov Random Field

A Markov random field is composed of:

a set of discrete-valued random variables, X = {X1, X2, …, Xn}, and

a set of factors {f1, …, fm}, where a factor is a non-negative function of a subset of the variables,

and defines a joint probability distribution:

P(X = x) = (1/Z) ∏k fk(Xk = xk)

Z = ∑x ∏k fk(Xk = xk)

where fk(Xk) is a factor on Xk ⊆ X, xk is x projected onto Xk, and Z is a normalization constant known as the partition function.
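A numeric sketch of this definition on three binary variables with two factors; the factor values below are illustrative assumptions, not taken from the slides:

```python
from itertools import product

# A tiny Markov random field: variables A, B, C and two factors,
# one on (A, B) and one on (B, C).
variables = ["A", "B", "C"]
factors = [
    (("A", "B"), {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}),
    (("B", "C"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),
]

def product_of_factors(assign):
    """The unnormalized value: the product of all factors on `assign`."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(assign[v] for v in scope)]
    return p

# The partition function Z sums the factor product over all assignments.
Z = sum(product_of_factors(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))

def prob(assign):
    """P(X = x) = (1/Z) * product of factors."""
    return product_of_factors(assign) / Z
```

With these factors Z = 12, and the probabilities of all eight assignments sum to 1.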

Markov Networks and Factor graphs

A Markov network is a graphical representation of a Markov random field where the nodes are the random variables and there is an arc between any two variables that are in a factor together.

A factor graph is a bipartite graph, which contains a variable node for each random variable and a factor node for each factor. There is an edge between a variable node and a factor node if the variable appears in the factor.

A belief network is a type of Markov random field where the factors represent conditional probabilities, there is a factor for each variable, and the directed graph is acyclic.

Independence in a Markov Network

The Markov blanket of a variable X is the set of variables that are in factors with X.

A variable is independent of the other variables given its Markov blanket.

X is connected to Y given Z if there is a path from X to Y in the Markov network that does not contain an element of Z.

X is separated from Y given Z if it is not connected to Y given Z.

A positive distribution is one that does not contain zero probabilities.

X is independent of Y given Z for all positive distributions iff X is separated from Y given Z.

Canonical Representations

The parameters of a graphical model are the numbers that define the model.

A belief network is a canonical representation: given the structure and the distribution, the parameters are uniquely determined.

A Markov random field is not a canonical representation. Many different parameterizations result in the same distribution.

Representations of Conditional Probabilities

There are many representations of conditional probabilities and factors:

Tables

Decision Trees

Rules

Weighted Logical Formulae

Contextual Tables

Logistic Functions

Tabular Representation

P(D | A, B, C):

A B C D Prob
true true true true 0.9
true true true false 0.1
true true false true 0.9
true true false false 0.1
true false true true 0.2
true false true false 0.8
true false false true 0.2
true false false false 0.8
false true true true 0.3
false true true false 0.7
false true false true 0.4
false true false false 0.6
false false true true 0.3
false false true false 0.7
false false false true 0.4
false false false false 0.6
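The table can be held as a simple dict; doing so also makes its hidden context-specific structure visible (when A is true, D depends only on B; when A is false, only on C), which the rule and decision-tree representations below exploit. A minimal sketch:

```python
from itertools import product

# The tabular CPT P(D | A, B, C), stored as a dict keyed by (A, B, C, D).
cpt = {}
for a, b, c in product([True, False], repeat=3):
    # Context-specific structure: B matters only when A is true,
    # C matters only when A is false.
    p_true = (0.9 if b else 0.2) if a else (0.3 if c else 0.4)
    cpt[(a, b, c, True)] = p_true
    cpt[(a, b, c, False)] = 1 - p_true

# Every parent context must define a distribution over D.
rows_normalized = all(
    abs(cpt[(a, b, c, True)] + cpt[(a, b, c, False)] - 1) < 1e-9
    for a, b, c in product([True, False], repeat=3))
```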

Decision Tree Representation

(Figure: a decision tree representation of P(D | A, B, C); the tree image did not survive the text extraction.)

Rule Representation

0.9 : d ← a ∧ b

0.2 : d ← a ∧ ¬b

0.3 : d ← ¬a ∧ c

0.4 : d ← ¬a ∧ ¬c


Weighted Logical Formulae

d ↔ ((a ∧ b ∧ n0)
∨ (a ∧ ¬b ∧ n1)
∨ (¬a ∧ c ∧ n2)
∨ (¬a ∧ ¬c ∧ n3))

The ni are independent:

P(n0) = 0.9
P(n1) = 0.2
P(n2) = 0.3
P(n3) = 0.4

Contextual-Table Representation

(Figure: a contextual table for P(D | A, B, C); the image did not survive the text extraction.)

Logistic Functions

P(h | e) = P(h ∧ e) / P(e)
= P(h ∧ e) / (P(h ∧ e) + P(¬h ∧ e))
= 1 / (1 + P(¬h ∧ e)/P(h ∧ e))
= 1 / (1 + e^(−log(P(h∧e)/P(¬h∧e))))
= sigmoid(log odds(h | e))

where

sigmoid(x) = 1 / (1 + e^(−x))

odds(h | e) = P(h ∧ e) / P(¬h ∧ e)
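The identity P(h | e) = sigmoid(log odds(h | e)) is easy to check numerically; the joint probabilities below are illustrative assumptions:

```python
import math

# Check that computing P(h | e) directly and via the sigmoid of the
# log-odds give the same answer.
p_h_and_e = 0.3      # P(h ∧ e), assumed
p_noth_and_e = 0.1   # P(¬h ∧ e), assumed

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

direct = p_h_and_e / (p_h_and_e + p_noth_and_e)    # P(h | e)
log_odds = math.log(p_h_and_e / p_noth_and_e)      # log odds(h | e)
via_sigmoid = sigmoid(log_odds)
```

Both routes give 0.75 here.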

Logistic Functions

A conditional probability is the sigmoid of the log-odds.

(Figure: plot of sigmoid(x) = 1 / (1 + e^(−x)) for x from −10 to 10, rising from 0 toward 1.)

A logistic function is the sigmoid of a linear function.

Logistic Representation of Conditional Probability

P(d | A, B, C) = sigmoid(0.9† ∗ A ∗ B
+ 0.2† ∗ A ∗ (1 − B)
+ 0.3† ∗ (1 − A) ∗ C
+ 0.4† ∗ (1 − A) ∗ (1 − C))

where 0.9† is sigmoid⁻¹(0.9).

The same conditional probability can be rewritten in expanded form:

P(d | A, B, C) = sigmoid(0.4†
+ (0.2† − 0.4†) ∗ A
+ (0.9† − 0.2†) ∗ A ∗ B
+ ...
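For each parent assignment exactly one product of 0/1 indicators is 1, so the outer sigmoid undoes the sigmoid⁻¹ and returns the original table entry. A minimal check (the `dagger` name is my own for the † of the slide):

```python
import math

# Verify that the sigmoid encoding reproduces the tabular
# CPT P(d | A, B, C), treating A, B, C as 0/1 indicators.
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dagger(p):
    """sigmoid^(-1)(p), written with a dagger on the slide."""
    return math.log(p / (1 - p))

def p_d(a, b, c):
    return sigmoid(dagger(0.9) * a * b
                   + dagger(0.2) * a * (1 - b)
                   + dagger(0.3) * (1 - a) * c
                   + dagger(0.4) * (1 - a) * (1 - c))
```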

Belief network inference

Main approaches to computing posterior distributions in graphical models:

Variable elimination, recursive conditioning: exploit the structure of the network to eliminate (sum out) the non-observed, non-query variables one at a time.

Stochastic simulation: random cases are generated according to the probability distributions.

Variational methods: find the closest tractable distribution to the (posterior) distribution we are interested in.

Bounding approaches: bound the conditional probabilities above and below and iteratively reduce the bounds.

...

Factors

A factor is a representation of a function from a tuple of random variables into a number.

We write factor f on variables X1, …, Xj as f(X1, …, Xj).

We can assign some or all of the variables of a factor:

f(X1 = v1, X2, …, Xj), where v1 ∈ dom(X1), is a factor on X2, …, Xj.

f(X1 = v1, X2 = v2, …, Xj = vj) is a number that is the value of f when each Xi has value vi.

The former is also written as f(X1, X2, …, Xj)X1=v1, etc.

Example factors

r(X, Y, Z):

X Y Z val
t t t 0.1
t t f 0.9
t f t 0.2
t f f 0.8
f t t 0.4
f t f 0.6
f f t 0.3
f f f 0.7

r(X=t, Y, Z):

Y Z val
t t 0.1
t f 0.9
f t 0.2
f f 0.8

r(X=t, Y, Z=f):

Y val
t 0.9
f 0.8

r(X=t, Y=f, Z=f) = 0.8
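Assigning variables of a factor is a filter-and-project over its table. A minimal sketch with the factor r(X, Y, Z) from these slides (t as True, f as False); the `condition` helper is my own name:

```python
# The factor r(X, Y, Z) as a dict from value tuples to numbers.
r = {
    (True, True, True): 0.1,   (True, True, False): 0.9,
    (True, False, True): 0.2,  (True, False, False): 0.8,
    (False, True, True): 0.4,  (False, True, False): 0.6,
    (False, False, True): 0.3, (False, False, False): 0.7,
}

def condition(factor, order, assignment):
    """Assign the variables in `assignment`, returning a factor on
    the remaining variables of `order`."""
    keep = [i for i, v in enumerate(order) if v not in assignment]
    out = {}
    for vals, num in factor.items():
        # Keep only rows consistent with the assignment, projected
        # onto the unassigned variables.
        if all(vals[order.index(v)] == a for v, a in assignment.items()):
            out[tuple(vals[i] for i in keep)] = num
    return out

r_xt = condition(r, ["X", "Y", "Z"], {"X": True})                # r(X=t, Y, Z)
r_xt_zf = condition(r, ["X", "Y", "Z"], {"X": True, "Z": False}) # r(X=t, Y, Z=f)
```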

Multiplying factors

The product of factor f1(X, Y) and f2(Y, Z), where Y are the variables in common, is the factor (f1 ∗ f2)(X, Y, Z) defined by:

(f1 ∗ f2)(X, Y, Z) = f1(X, Y) f2(Y, Z).

Multiplying factors example

f1:

A B val
t t 0.1
t f 0.9
f t 0.2
f f 0.8

f2:

B C val
t t 0.3
t f 0.7
f t 0.6
f f 0.4

f1 ∗ f2:

A B C val
t t t 0.03
t t f 0.07
t f t 0.54
t f f 0.36
f t t 0.06
f t f 0.14
f f t 0.48
f f f 0.32
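The product table can be built pointwise: for each assignment to the union of the variables, multiply the matching entries of each factor. A minimal sketch reproducing f1 ∗ f2 from these slides:

```python
from itertools import product

# Factors f1(A, B) and f2(B, C) from the slides (t = True, f = False).
f1 = {(True, True): 0.1, (True, False): 0.9,
      (False, True): 0.2, (False, False): 0.8}
f2 = {(True, True): 0.3, (True, False): 0.7,
      (False, True): 0.6, (False, False): 0.4}

def multiply(fa, va, fb, vb):
    """Pointwise product of factor fa on variables va and fb on vb."""
    out_vars = va + [v for v in vb if v not in va]   # union, va first
    out = {}
    for vals in product([True, False], repeat=len(out_vars)):
        assign = dict(zip(out_vars, vals))
        out[vals] = (fa[tuple(assign[v] for v in va)]
                     * fb[tuple(assign[v] for v in vb)])
    return out

f1_times_f2 = multiply(f1, ["A", "B"], f2, ["B", "C"])  # factor on (A, B, C)
```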

Summing out variables

We can sum out a variable, say X1 with range {v1, …, vk}, from factor f(X1, …, Xj), resulting in a factor on X2, …, Xj defined by:

(∑X1 f)(X2, …, Xj) = f(X1 = v1, …, Xj) + · · · + f(X1 = vk, …, Xj)

Summing out a variable example

f3:

A B C val
t t t 0.03
t t f 0.07
t f t 0.54
t f f 0.36
f t t 0.06
f t f 0.14
f f t 0.48
f f f 0.32

∑B f3:

A C val
t t 0.57
t f 0.43
f t 0.54
f f 0.46
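Summing out is a group-by over the remaining variables: drop the eliminated variable's position from each key and add up the values that collide. A minimal sketch reproducing ∑B f3:

```python
# The factor f3(A, B, C) from the slides (t = True, f = False).
f3 = {(True, True, True): 0.03,   (True, True, False): 0.07,
      (True, False, True): 0.54,  (True, False, False): 0.36,
      (False, True, True): 0.06,  (False, True, False): 0.14,
      (False, False, True): 0.48, (False, False, False): 0.32}

def sum_out(factor, order, var):
    """Sum `var` out of `factor`, whose variables are listed in `order`."""
    i = order.index(var)
    out = {}
    for vals, num in factor.items():
        key = vals[:i] + vals[i + 1:]        # drop the eliminated position
        out[key] = out.get(key, 0.0) + num   # accumulate colliding rows
    return out

sum_b_f3 = sum_out(f3, ["A", "B", "C"], "B")   # a factor on (A, C)
```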

Exercise

Given factors:

s:

A val
t 0.75
f 0.25

t:

A B val
t t 0.6
t f 0.4
f t 0.2
f f 0.8

o:

A val
t 0.3
f 0.1

What are the following factors over?

(a) s ∗ t
(b) ∑A s ∗ t
(c) ∑B s ∗ t
(d) ∑A ∑B s ∗ t
(e) s ∗ t ∗ o
(f) ∑B s ∗ t ∗ o

Evidence

If we want to compute the posterior probability of Z given evidence Y1 = v1 ∧ … ∧ Yj = vj:

P(Z | Y1 = v1, …, Yj = vj)
= P(Z, Y1 = v1, …, Yj = vj) / P(Y1 = v1, …, Yj = vj)
= P(Z, Y1 = v1, …, Yj = vj) / ∑Z P(Z, Y1 = v1, …, Yj = vj).

So the computation reduces to computing P(Z, Y1 = v1, …, Yj = vj).

Normalize at the end, by summing out Z and dividing.

Probability of a conjunction

Suppose the variables of the belief network are X1, . . . , Xn.
To compute P(Z, Y1 = v1, . . . , Yj = vj), we sum out the other
variables, {Z1, . . . , Zk} = {X1, . . . , Xn} − {Z} − {Y1, . . . , Yj}.
We order the Zi into an elimination ordering.

P(Z, Y1 = v1, . . . , Yj = vj)

= ∑Zk · · · ∑Z1 P(X1, . . . , Xn)[Y1 = v1, . . . , Yj = vj]

= ∑Zk · · · ∑Z1 ∏i=1..n P(Xi | parents(Xi))[Y1 = v1, . . . , Yj = vj]

(where [Y1 = v1, . . . , Yj = vj] means the observed variables are
set to their observed values).

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 37 10 / 17

Computing sums of products

Computation in belief networks reduces to computing sums of
products.

How can we compute ab + ac efficiently?

Distribute out the a, giving a(b + c).

How can we compute ∑Z1 ∏i=1..n P(Xi | parents(Xi)) efficiently?

Distribute out those factors that don’t involve Z1.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 41 11 / 17

Variable elimination algorithm

To compute P(Z |Y1 = v1 ∧ . . . ∧ Yj = vj):

Construct a factor for each conditional probability.

Set the observed variables to their observed values.

Sum out each of the other variables (the Z1, . . . , Zk)
according to some elimination ordering.

Multiply the remaining factors. Normalize by dividing the
resulting factor f(Z) by ∑Z f(Z).

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 42 12 / 17
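The steps above can be sketched for Boolean variables as follows. The (scope, table) factor representation and the tiny two-node example network at the end are illustrative assumptions, not from the text:

```python
# Variable elimination over factors represented as (scope, table) pairs,
# where table maps assignment tuples to numbers.
from itertools import product

def restrict(factor, var, value):
    """Set an observed variable to its observed value."""
    scope, table = factor
    if var not in scope:
        return factor
    i = scope.index(var)
    return (scope[:i] + scope[i + 1:],
            {k[:i] + k[i + 1:]: v for k, v in table.items() if k[i] == value})

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their scopes."""
    (s1, t1), (s2, t2) = f1, f2
    scope = s1 + [v for v in s2 if v not in s1]
    table = {}
    for a in product([True, False], repeat=len(scope)):
        env = dict(zip(scope, a))
        table[a] = t1[tuple(env[v] for v in s1)] * t2[tuple(env[v] for v in s2)]
    return scope, table

def sum_out(factor, var):
    """Sum a variable out of a factor."""
    scope, table = factor
    i = scope.index(var)
    out = {}
    for k, v in table.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + v
    return scope[:i] + scope[i + 1:], out

def variable_elimination(factors, query, evidence, ordering):
    # Set observed variables to their observed values.
    for var, val in evidence.items():
        factors = [restrict(f, var, val) for f in factors]
    # Sum out the other variables in the elimination ordering.
    for var in ordering:
        with_var = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = with_var[0]
        for f in with_var[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    # Multiply the remaining factors and normalize.
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    scope, table = result
    z = sum(table.values())
    return scope, {k: v / z for k, v in table.items()}

# Tiny network A -> B with made-up numbers: query P(A | B = true).
fA = (["A"], {(True,): 0.4, (False,): 0.6})
fB = (["A", "B"], {(True, True): 0.9, (True, False): 0.1,
                   (False, True): 0.2, (False, False): 0.8})
scope, post = variable_elimination([fA, fB], "A", {"B": True}, [])
```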

Summing out a variable

To sum out a variable Zj from a product f1, . . . , fk of factors:

Partition the factors into
- those that don’t contain Zj, say f1, . . . , fi,
- those that contain Zj, say fi+1, . . . , fk.

We know:

∑Zj f1 ∗ · · · ∗ fk = f1 ∗ · · · ∗ fi ∗ (∑Zj fi+1 ∗ · · · ∗ fk).

Explicitly construct a representation of the rightmost factor.
Replace the factors fi+1, . . . , fk by the new factor.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 43 13 / 17

Example

[Belief network diagram: A and C are parents of B; C is a parent
of D; B is a parent of E; E is a parent of F; E and D are parents
of G.]

P(E | g) = P(E ∧ g) / ∑E P(E ∧ g)

P(E ∧ g)
= ∑F ∑B ∑C ∑A ∑D P(A) P(B | AC) P(C) P(D | C) P(E | B) P(F | E) P(g | ED)

= (∑F P(F | E)) ∑B P(E | B) ∑C (P(C) (∑A P(A) P(B | AC)) (∑D P(D | C) P(g | ED)))

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 52 14 / 17

Variable elimination example

[Belief network diagram: A is a parent of B; B and C are parents
of D; C is a parent of E; D is a parent of F; F and E are parents
of G; G is a parent of H and of I.]

P(A)P(B|A)           --elim A-->  f1(B)
P(C)P(D|BC)P(E|C)    --elim C-->  f2(BDE)
P(F|D)P(G|FE)
P(H|G)               --obs H-->   f3(G)
P(I|G)               --elim I-->  f4(G)

P(D, h) = . . . (∑A P(A)P(B|A)) (∑I P(I|G))

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 60 15 / 17

Variable Elimination example

[Belief network diagram: a chain A → B → C → D → E → F, with
G a child of C and H a child of E.]

Query: P(G | f); elimination ordering: A, H, E, D, B, C

P(G | f)
∝ ∑C ∑B ∑D ∑E ∑H ∑A P(A)P(B|A)P(C|B)P(D|C)P(E|D)P(f|E)P(G|C)P(H|E)

= ∑C (∑B (∑A P(A)P(B|A)) P(C|B)) P(G|C) (∑D P(D|C) (∑E P(E|D)P(f|E) ∑H P(H|E)))

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 63 16 / 17

Pruning Irrelevant Variables (Belief networks)

Suppose you want to compute P(X | e1 . . . ek):

Prune any variables that have no observed or queried descendants.

Connect the parents of any observed variable.

Remove arc directions.

Remove observed variables.

Remove any variables not connected to X in the resulting(undirected) graph.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.4, Page 64 17 / 17
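The pruning steps above can be sketched on a network given as a dict mapping each node to the set of its parents. The representation and helper name are illustrative assumptions:

```python
# Prune variables irrelevant to computing P(query | observed).
def relevant_variables(parents, query, observed):
    observed = set(observed)
    nodes = set(parents)
    keep = {query} | observed
    # 1. Repeatedly prune leaves that are neither observed nor queried
    #    (i.e. variables with no observed or queried descendants).
    changed = True
    while changed:
        changed = False
        for n in list(nodes):
            has_child = any(n in parents[c] for c in nodes if c != n)
            if n not in keep and not has_child:
                nodes.discard(n)
                changed = True
    # 2. Connect the parents of each observed variable, and drop arc
    #    directions (build an undirected adjacency map).
    adj = {n: set() for n in nodes}
    for c in nodes:
        ps = [p for p in parents[c] if p in nodes]
        for p in ps:
            adj[c].add(p)
            adj[p].add(c)
        if c in observed:
            for i in range(len(ps)):
                for j in range(i + 1, len(ps)):
                    adj[ps[i]].add(ps[j])
                    adj[ps[j]].add(ps[i])
    # 3. Remove observed variables.
    for o in observed & set(adj):
        for n in adj.pop(o):
            adj[n].discard(o)
    # 4. Keep only variables connected to the query.
    seen, stack = set(), [query]
    while stack:
        n = stack.pop()
        if n in seen or n not in adj:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return seen
```

In a v-structure A → C ← B with C observed, the moralizing step (connecting C's parents) correctly keeps B relevant to a query on A.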

Markov chain

A Markov chain is a special sort of belief network:

S0 → S1 → S2 → S3 → S4

What probabilities need to be specified?

P(S0) specifies initial conditions

P(Si+1 | Si) specifies the dynamics

What independence assumptions are made?

P(Si+1 | S0, . . . , Si) = P(Si+1 | Si).

Often St represents the state at time t. Intuitively St conveys
all of the information about the history that can affect the
future states.

“The future is independent of the past given the present.”

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 6 1 / 23

Stationary Markov chain

A stationary Markov chain is one where, for all i > 0 and i′ > 0,
P(Si+1 | Si) = P(Si′+1 | Si′).

We specify P(S0) and P(Si+1 | Si).
- Simple model, easy to specify
- Often the natural model
- The network can extend indefinitely

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 7 2 / 23

Stationary Distribution

A distribution over states, P, is a stationary distribution if for
each state s, P(Si+1 = s) = P(Si = s).

Every Markov chain has a stationary distribution.

A Markov chain is ergodic if, for any two states s1 and s2,
there is a non-zero probability of eventually reaching s2 from s1.

A Markov chain is periodic if there is a strict temporal
regularity in visiting states: a state is visited only at times t
with t mod n = m, for some fixed n and m.

An ergodic and aperiodic Markov chain has a unique stationary
distribution P, and P(s) = limi→∞ Pi(s) — the equilibrium
distribution.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 12 3 / 23
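For a finite chain, the stationary distribution can be approximated by repeatedly applying the transition matrix (power iteration). A sketch; the two-state transition matrix is illustrative:

```python
# Approximate the stationary distribution of a finite Markov chain.
# T[i][j] = P(S_{t+1} = j | S_t = i).
def stationary(T, iters=1000):
    n = len(T)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return p

T = [[0.9, 0.1],   # an ergodic, aperiodic two-state chain
     [0.2, 0.8]]
p = stationary(T)  # converges to the equilibrium distribution [2/3, 1/3]
```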

Pagerank

Consider the Markov chain:

Domain of Si is the set of all web pages

P(S0) is uniform; P(S0 = pj) = 1/N

P(Si+1 = pj | Si = pk) = (1 − d)/N + d ∗
    1/nk  if pk links to pj
    1/N   if pk has no links
    0     otherwise

where there are N web pages and nk is the number of links from
page pk

d ≈ 0.85 is the probability someone keeps surfing the web

This Markov chain converges to a distribution over web pages
(original computation: P(Si) for i = 52, over 322 million links):
Pagerank - the basis for Google’s initial search engine

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 14 4 / 23
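The update above can be iterated directly on a small link graph. A sketch; the three-page graph and helper name are illustrative:

```python
# Pagerank by iterating the Markov chain update on a link graph.
# `links` maps each page to the set of pages it links to.
def pagerank(links, d=0.85, iters=100):
    pages = list(links)
    N = len(pages)
    p = {pg: 1.0 / N for pg in pages}  # P(S0) is uniform
    for _ in range(iters):
        new = {}
        for pj in pages:
            total = 0.0
            for pk in pages:
                out = links[pk]
                if not out:            # pk has no links: 1/N to everyone
                    follow = 1.0 / N
                elif pj in out:        # pk links to pj
                    follow = 1.0 / len(out)
                else:
                    follow = 0.0
                total += p[pk] * ((1 - d) / N + d * follow)
            new[pj] = total
        p = new
    return p

ranks = pagerank({"a": {"b"}, "b": {"a", "c"}, "c": {"a"}})
```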

Hidden Markov Model

A Hidden Markov Model (HMM) is a belief network:

S0 → S1 → S2 → S3 → S4
(each state Si has an observation child Oi: O0, . . . , O4)

The probabilities that need to be specified:

P(S0) specifies initial conditions

P(Si+1 | Si) specifies the dynamics

P(Oi | Si) specifies the sensor model

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 16 5 / 23

Filtering

Filtering: P(Si | o1, . . . , oi)

What is the current belief state based on the observation
history?

P(Si | o1, . . . , oi) ∝ P(oi | Si, o1, . . . , oi−1) P(Si | o1, . . . , oi−1)
= ???

P(Si | o1, . . . , oi−1) = ∑Si−1 P(Si, Si−1 | o1, . . . , oi−1)
= ???

Observe O0, query S0.
then observe O1, query S1.
then observe O2, query S2.
. . .

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 21 6 / 23
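One filtering step (predict with the dynamics, then update with the sensor model and normalize) can be sketched as follows. The two-state weather/umbrella numbers are illustrative, not from the text:

```python
# One recursive filtering step for a discrete-state HMM.
# trans[s][s2] = P(S_{i+1} = s2 | S_i = s); sensor[s][o] = P(O_i = o | S_i = s).
def filter_step(belief, obs, trans, sensor):
    states = list(belief)
    # Predict: P(S_i | o_1..o_{i-1}), summing over the previous state.
    predicted = {s2: sum(belief[s] * trans[s][s2] for s in states)
                 for s2 in states}
    # Update: weight by P(o_i | S_i), then normalize.
    unnorm = {s: sensor[s][obs] * predicted[s] for s in states}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

trans = {"rain": {"rain": 0.7, "dry": 0.3},
         "dry":  {"rain": 0.3, "dry": 0.7}}
sensor = {"rain": {"umbrella": 0.9, "none": 0.1},
          "dry":  {"umbrella": 0.2, "none": 0.8}}
belief = {"rain": 0.5, "dry": 0.5}
belief = filter_step(belief, "umbrella", trans, sensor)
```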

Example: localization

Suppose a robot wants to determine its location based on its
actions and its sensor readings: Localization

This can be represented by the augmented HMM:

[Diagram: location nodes Loc0 → Loc1 → Loc2 → Loc3 → Loc4,
each Loct with an observation child Obst, and action nodes
Act0, . . . , Act3 influencing the transitions.]

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 22 7 / 23

Example localization domain

Circular corridor, with 16 locations:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Doors at positions: 2, 4, 7, 11.

Noisy Sensors

Stochastic Dynamics

Robot starts at an unknown location and must determine
where it is.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 23 8 / 23

Example Sensor Model

P(Observe Door | At Door) = 0.8

P(Observe Door | Not At Door) = 0.1

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 24 9 / 23

Example Dynamics Model

P(loct+1 = L|actiont = goRight ∧ loct = L) = 0.1

P(loct+1 = L + 1|actiont = goRight ∧ loct = L) = 0.8

P(loct+1 = L + 2|actiont = goRight ∧ loct = L) = 0.074

P(loct+1 = L′ | actiont = goRight ∧ loct = L) = 0.002 for
any other location L′.

- All location arithmetic is modulo 16.
- The action goLeft works the same but to the left.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 25 10 / 23
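With the sensor and dynamics models above, one move-then-observe belief update for the circular corridor can be sketched as follows (the function names are illustrative):

```python
# Belief update for the 16-location circular corridor:
# goRight dynamics, then the door-sensor observation.
DOORS = {2, 4, 7, 11}
N = 16

def move_right(belief):
    new = [0.0] * N
    for L, p in enumerate(belief):
        new[L] += 0.1 * p                 # stay put
        new[(L + 1) % N] += 0.8 * p       # one step right
        new[(L + 2) % N] += 0.074 * p     # two steps right
        for Lp in range(N):               # any other location
            if Lp not in (L, (L + 1) % N, (L + 2) % N):
                new[Lp] += 0.002 * p
    return new

def observe_door(belief):
    # P(Observe Door | At Door) = 0.8; P(Observe Door | Not At Door) = 0.1
    unnorm = [(0.8 if L in DOORS else 0.1) * p for L, p in enumerate(belief)]
    z = sum(unnorm)
    return [p / z for p in unnorm]

belief = [1.0 / N] * N                    # start at an unknown location
belief = observe_door(move_right(belief))
```

After one goRight from a uniform belief and a door observation, the four door locations become equally most likely.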

Combining sensor information

Example: we can combine information from a light sensor
and the door sensor: Sensor Fusion

[Diagram: location nodes Loc0 . . . Loc4; door-sensor nodes
D0 . . . D4; light-sensor nodes L0 . . . L4; action nodes
Act0 . . . Act3.]

St: robot location at time t
Dt: door sensor value at time t
Lt: light sensor value at time t

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 26 11 / 23

Naive Bayes Classifier: User’s request for help

[Diagram: class variable H with word children "able", "absent",
"add", . . . , "zoom".]

H is the help page the user is interested in.
What probabilities are required?

P(hi) for each help page hi. The user is interested in one best
web page, so ∑i P(hi) = 1.

P(wj | hi) for each word wj given page hi. There can be
multiple words used in a query.

Given a help query: condition on the words in the query and
display the most likely help page.

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 32 12 / 23
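The classifier above can be sketched with a small, made-up help system. The pages, vocabulary, and all probabilities are illustrative assumptions:

```python
# Naive Bayes help classifier: condition on which vocabulary words
# appear in the query, then normalize to get P(h | query).
def posterior_pages(query_words, prior, word_given_page, vocabulary):
    scores = {}
    for h, p in prior.items():
        score = p
        for w in vocabulary:
            pw = word_given_page[h][w]
            # a word in the query contributes P(w | h);
            # an absent word contributes 1 - P(w | h)
            score *= pw if w in query_words else (1.0 - pw)
        scores[h] = score
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

vocabulary = {"print", "zoom", "add"}
prior = {"printing_help": 0.5, "zoom_help": 0.5}
word_given_page = {"printing_help": {"print": 0.8, "zoom": 0.05, "add": 0.1},
                   "zoom_help":     {"print": 0.05, "zoom": 0.8, "add": 0.1}}
post = posterior_pages({"zoom"}, prior, word_given_page, vocabulary)
```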

Simple Language Models: set-of-words

Sentence: w1, w2, w3, . . . .
Set-of-words model:

[Diagram: independent Boolean variables, one per word:
“a”, “aardvark”, . . . , “zzz”.]

Each variable is Boolean: true when the word is in the
sentence and false otherwise.

What probabilities are provided?
- P(”a”), P(”aardvark”), . . . , P(”zzz”)

How do we condition on the question “how can I phone my
phone”?

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 36 13 / 23

Simple Language Models: bag-of-words

Sentence: w1, w2, w3, . . . , wn.
Bag-of-words or unigram:

[Diagram: independent variables W1, W2, W3, . . . , Wn.]

Domain of each variable is the set of all words.

What probabilities are provided?
- P(wi) is a distribution over words for each position

How do we condition on the question “how can I phone my
phone”?

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 40 14 / 23

Simple Language Models: bigram

Sentence: w1, w2, w3, . . . , wn.
Bigram:

[Diagram: a chain W1 → W2 → W3 → · · · → Wn.]

Domain of each variable is the set of all words.

What probabilities are provided?
- P(wi | wi−1) is a distribution over words for each position,
  given the previous word

How do we condition on the question “how can I phone my
phone”?

©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 44 15 / 23

Simple Language Models: trigram

Sentence: w1, w2, w3, ..., wn.
Trigram:

    W1 → W2 → W3 → W4 → ... → Wn   (each Wi also has Wi−2 as a parent)

Domain of each variable is the set of all words.

What probabilities are provided?
- P(wi | wi−1, wi−2)

N-gram:
- P(wi | wi−1, ..., wi−n+1) is a distribution over words given the previous n − 1 words.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 45 16 / 23
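The unigram and bigram distributions above can be estimated by counting. A minimal sketch, using an assumed toy corpus (the corpus and the counting code are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

corpus = "how can i phone my phone".split()

# Unigram: P(w) = count(w) / total
unigram = Counter(corpus)
total = sum(unigram.values())

def p_unigram(w):
    return unigram[w] / total

# Bigram: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
bigram = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigram[prev][cur] += 1

def p_bigram(cur, prev):
    return bigram[prev][cur] / sum(bigram[prev].values())

print(p_unigram("phone"))       # "phone" occurs 2 of 6 times
print(p_bigram("phone", "my"))  # "my" is always followed by "phone"
```

A real model would smooth these counts so that unseen words and pairs get nonzero probability.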


Logic, Probability, Statistics, Ontology over time

[Figure: Google Books Ngram Viewer plot of the relative frequencies of "logic", "probability", "statistics", and "ontology" in books over time.]

From: Google Books Ngram Viewer (https://books.google.com/ngrams)

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 48 17 / 23

Predictive Typing and Error Correction

[Diagram: a chain of word variables W1 → W2 → W3 → ..., where each word Wi has observed letter variables L1i, L2i, ..., Lki as children.]

domain(Wi) = {"a", "aardvark", ..., "zzz", "⊥", "?"}
domain(Lji) = {"a", "b", "c", ..., "z", "1", "2", ...}

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 49 18 / 23

Beyond N-grams

A man with a big hairy cat drank the cold milk.

Who or what drank the milk?

Simple syntax diagram:

    [s [np [np a man]
           [pp with [np a big hairy cat]]]
       [vp drank [np the cold milk]]]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 50 19 / 23


Topic Model

[Diagram: topic variables (tools, food, fish, finance, ...) linked to word variables ("nut", "bolt", "tuna", "shark", "salmon", ...).]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 52 20 / 23


Google’s rephil

[Diagram: 900,000 topics connected by 350,000,000 links to 12,000,000 words ("Aaron's beard", "aardvark", ..., "zzz").]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 54 22 / 23

Deep Belief Networks

[Diagram: a network with several layers of variables, each layer connected to the next.]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.5, Page 55 23 / 23

Stochastic Simulation

Idea: probabilities ↔ samples

Get probabilities from samples:

    X      count          X      probability
    x1     n1             x1     n1/m
    ...    ...            ...    ...
    xk     nk             xk     nk/m
    total  m

If we could sample from a variable's (posterior) probability, we could estimate its (posterior) probability.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 1 1 / 15

Generating samples from a distribution

For a variable X with a discrete domain or a (one-dimensional) real domain:

Totally order the values of the domain of X.

Generate the cumulative probability distribution: f(x) = P(X ≤ x).

Select a value y uniformly in the range [0, 1].

Select the smallest x such that f(x) ≥ y.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 2 2 / 15
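The four steps above can be sketched for a discrete variable; the domain and probabilities here are assumed toy values:

```python
import itertools
import random

values = ["v1", "v2", "v3", "v4"]   # totally ordered domain of X
probs  = [0.3, 0.2, 0.4, 0.1]       # assumed P(X = vi)

# Cumulative distribution f(x) = P(X <= x)
cumulative = list(itertools.accumulate(probs))

def sample():
    y = random.random()             # select y uniformly in [0, 1]
    for v, f in zip(values, cumulative):
        if y <= f:                  # smallest x with f(x) >= y
            return v
    return values[-1]

random.seed(0)
n = 100_000
counts = {v: 0 for v in values}
for _ in range(n):
    counts[sample()] += 1
freqs = {v: counts[v] / n for v in values}
print(freqs)  # each frequency should be close to the corresponding P(X = vi)
```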

Cumulative Distribution

[Figure: bar chart of P(X) and the corresponding cumulative distribution f(X), both over the values v1, v2, v3, v4, with vertical axes running from 0 to 1.]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 3 3 / 15

Hoeffding’s inequality

Theorem (Hoeffding): Suppose p is the true probability, and s is the sample average from n independent samples; then

    P(|s − p| > ε) ≤ 2e^(−2nε²)

Guarantees a probably approximately correct estimate of probability.

If you are willing to have an error greater than ε in δ of the cases, solve 2e^(−2nε²) < δ for n, which gives

    n > (−ln(δ/2)) / (2ε²)

    ε      δ      n
    0.1    0.05   185
    0.01   0.05   18,445
    0.1    0.01   265

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 4 4 / 15
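The sample-size bound can be checked directly; this sketch reproduces the table above:

```python
import math

def hoeffding_n(eps, delta):
    # Solve 2*exp(-2*n*eps^2) < delta  =>  n > -ln(delta/2) / (2*eps^2),
    # and round up to the smallest integer n satisfying the bound.
    return math.ceil(-math.log(delta / 2) / (2 * eps ** 2))

print(hoeffding_n(0.1, 0.05))   # 185
print(hoeffding_n(0.01, 0.05))  # 18445
print(hoeffding_n(0.1, 0.01))   # 265
```

Note that halving ε quadruples the required n, while shrinking δ only costs logarithmically.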


Forward sampling in a belief network

Sample the variables one at a time; sample parents of X before sampling X.

Given values for the parents of X, sample from the probability of X given its parents.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 7 5 / 15
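A sketch of forward sampling for the fire-alarm network used in the later examples; the conditional probabilities are the ones listed on the P(le | sm, ta, ¬re) slide:

```python
import random

def sample_network():
    # Sample each variable after its parents, from P(X | parents(X)).
    ta = random.random() < 0.02
    fi = random.random() < 0.01
    p_al = {(True, True): 0.5, (True, False): 0.99,
            (False, True): 0.85, (False, False): 0.0001}[(fi, ta)]
    al = random.random() < p_al
    sm = random.random() < (0.9 if fi else 0.01)
    le = random.random() < (0.88 if al else 0.001)
    re = random.random() < (0.75 if le else 0.01)
    return {"Ta": ta, "Fi": fi, "Al": al, "Sm": sm, "Le": le, "Re": re}

random.seed(1)
samples = [sample_network() for _ in range(100_000)]
est = sum(s["Ta"] for s in samples) / len(samples)
print(est)  # should be close to P(ta) = 0.02
```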

Rejection Sampling

To estimate a posterior probability given evidence Y1 = v1 ∧ ... ∧ Yj = vj:

Reject any sample that assigns Yi to a value other than vi.

The non-rejected samples are distributed according to the posterior probability:

    P(α | evidence) ≈ (Σ_{sample ⊨ α} 1) / (Σ_{sample} 1)

where we consider only samples consistent with evidence.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 8 6 / 15

Rejection Sampling Example: P(ta | sm, re)

[Belief network: Ta → Al ← Fi, Fi → Sm, Al → Le, Le → Re]

Observe Sm = true, Re = true

           Ta     Fi     Al     Sm     Le     Re
    s1     false  true   false  true   false  false   ✗
    s2     false  true   true   true   true   true    ✓
    s3     true   false  true   false  —      —       ✗
    s4     true   true   true   true   true   true    ✓
    ...
    s1000  false  false  false  false  —      —       ✗

P(sm) = 0.02
P(re | sm) = 0.32
How many samples are rejected?
How many samples are used?

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 9 7 / 15
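A sketch of rejection sampling for this query, using the conditional probabilities listed on the later example slide. Since P(sm) = 0.02 and P(re | sm) = 0.32, only about 1000 × 0.02 × 0.32 ≈ 6 of 1000 samples are expected to survive:

```python
import random

def forward_sample():
    # Forward-sample the fire-alarm network.
    ta = random.random() < 0.02
    fi = random.random() < 0.01
    p_al = {(True, True): 0.5, (True, False): 0.99,
            (False, True): 0.85, (False, False): 0.0001}[(fi, ta)]
    al = random.random() < p_al
    sm = random.random() < (0.9 if fi else 0.01)
    le = random.random() < (0.88 if al else 0.001)
    re = random.random() < (0.75 if le else 0.01)
    return ta, sm, re

random.seed(0)
# Keep only samples consistent with the evidence Sm = true, Re = true.
kept = [ta for ta, sm, re in (forward_sample() for _ in range(1000)) if sm and re]
print(len(kept), "of 1000 samples kept")
```

With so few surviving samples, the estimate of P(ta | sm, re) is very noisy, which motivates importance sampling below.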


Importance Sampling

Samples have weights: a real number associated with each sample that takes the evidence into account.

Probability of a proposition is a weighted average of samples:

    P(α | evidence) ≈ (Σ_{sample ⊨ α} weight(sample)) / (Σ_{sample} weight(sample))

Mix exact inference with sampling: don't sample all of the variables, but weight each sample according to P(evidence | sample).

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 15 8 / 15

Importance Sampling (Likelihood Weighting)

procedure likelihood_weighting(Bn, e, Q, n):
    ans[1 : k] := 0 where k is size of dom(Q)
    repeat n times:
        weight := 1
        for each variable Xi in order:
            if Xi = oi is observed:
                weight := weight × P(Xi = oi | parents(Xi))
            else:
                assign Xi a random sample of P(Xi | parents(Xi))
        if Q has value v:
            ans[v] := ans[v] + weight
    return ans / Σv ans[v]

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 16 9 / 15
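A Python rendering of the procedure above for the fire-alarm network (conditional probabilities as on the later example slide), estimating P(Ta | Sm = true, Re = true):

```python
import random
from collections import defaultdict

def likelihood_weighting(n, evidence):
    ans = defaultdict(float)
    for _ in range(n):
        weight = 1.0
        sample = {}

        def assign(name, p_true):
            # Observed variables contribute to the weight; the rest are sampled.
            nonlocal weight
            if name in evidence:
                sample[name] = evidence[name]
                weight *= p_true if evidence[name] else (1 - p_true)
            else:
                sample[name] = random.random() < p_true

        # Variables in topological order.
        assign("Ta", 0.02)
        assign("Fi", 0.01)
        p_al = {(True, True): 0.5, (True, False): 0.99,
                (False, True): 0.85, (False, False): 0.0001}[
                    (sample["Fi"], sample["Ta"])]
        assign("Al", p_al)
        assign("Sm", 0.9 if sample["Fi"] else 0.01)
        assign("Le", 0.88 if sample["Al"] else 0.001)
        assign("Re", 0.75 if sample["Le"] else 0.01)
        ans[sample["Ta"]] += weight
    total = sum(ans.values())
    return {v: w / total for v, w in ans.items()}

random.seed(0)
result = likelihood_weighting(100_000, {"Sm": True, "Re": True})
print(result)  # result[True] estimates P(ta | sm, re)
```

Unlike rejection sampling, every sample is used; no samples are thrown away.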

Importance Sampling Example: P(ta | sm, re)

[Belief network: Ta → Al ← Fi, Fi → Sm, Al → Le, Le → Re]

           Ta     Fi     Al     Le     Weight
    s1     true   false  true   false  0.01 × 0.01
    s2     false  true   false  false  0.9 × 0.01
    s3     false  true   true   true   0.9 × 0.75
    s4     true   true   true   true   0.9 × 0.75
    ...
    s1000  false  false  true   true   0.01 × 0.75

P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(re | le) = 0.75
P(re | ¬le) = 0.01

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 17 10 / 15


Importance Sampling Example: P(le | sm, ta,¬re)

[Belief network: Ta → Al ← Fi, Fi → Sm, Al → Le, Le → Re]

P(ta) = 0.02
P(fi) = 0.01
P(al | fi ∧ ta) = 0.5
P(al | fi ∧ ¬ta) = 0.99
P(al | ¬fi ∧ ta) = 0.85
P(al | ¬fi ∧ ¬ta) = 0.0001
P(sm | fi) = 0.9
P(sm | ¬fi) = 0.01
P(le | al) = 0.88
P(le | ¬al) = 0.001
P(re | le) = 0.75
P(re | ¬le) = 0.01

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 19 11 / 15

Computing Expectations & Proposal Distributions

Expected value of f with respect to distribution P:

    E_P(f) = Σ_w f(w) · P(w)
           ≈ (1/n) Σ_s f(s)

where s is sampled with probability P, and there are n samples.

    E_P(f) = Σ_w f(w) · (P(w)/Q(w)) · Q(w)
           ≈ (1/n) Σ_s f(s) · P(s)/Q(s)

where s is selected according to the distribution Q.
The distribution Q is called a proposal distribution.
Requirement: if P(c) > 0 then Q(c) > 0.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 20 12 / 15
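A sketch of the proposal-distribution identity above, with assumed toy distributions P and Q (not from the slides):

```python
import random

P = {0: 0.1, 1: 0.2, 2: 0.7}      # target distribution
Q = {0: 1/3, 1: 1/3, 2: 1/3}      # proposal: Q(c) > 0 wherever P(c) > 0

def f(w):
    return w * w                   # any function of the world

random.seed(0)
n = 200_000
total = 0.0
for _ in range(n):
    s = random.choice([0, 1, 2])   # sample s according to Q (uniform here)
    total += f(s) * P[s] / Q[s]    # weight by P(s)/Q(s)

est = total / n
print(est)  # close to E_P(f) = 0.1*0 + 0.2*1 + 0.7*4 = 3.0
```

The estimate is unbiased for any valid Q, but its variance depends on how closely Q matches f · P.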


Particle Filtering

Importance sampling can be seen as:

for each particle:
    for each variable:
        sample / absorb evidence / update query

where a particle is one of the samples.

Instead we could do:

for each variable:
    for each particle:
        sample / absorb evidence / update query

Why?

- We can have a new operation of resampling.
- It works with infinitely many variables (e.g., HMMs).

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 24 13 / 15


Particle Filtering for HMMs

Start with randomly chosen particles (say 1000).

Each particle represents a history.

Initially, sample states in proportion to their probability.

Repeat:
- Absorb evidence: weight each particle by the probability of the evidence given the state of the particle.
- Resample: select each particle at random, in proportion to the weight of the particle. Some particles may be duplicated, some may be removed. All new particles have the same weight.
- Transition: sample the next state for each particle according to the transition probabilities.

To answer a query about the current state, use the set of particles as data.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 27 14 / 15
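A sketch of the loop above for an assumed two-state HMM; the prior, transition, and observation numbers are illustrative, not from the slides:

```python
import random

STATES = ["rain", "dry"]
PRIOR  = {"rain": 0.5, "dry": 0.5}
TRANS  = {"rain": {"rain": 0.7, "dry": 0.3},
          "dry":  {"rain": 0.3, "dry": 0.7}}
OBS    = {"rain": 0.9, "dry": 0.2}   # P(umbrella observed | state)

def categorical(dist):
    # Sample a value from a discrete distribution {value: probability}.
    r, acc = random.random(), 0.0
    for v, p in dist.items():
        acc += p
        if r <= acc:
            return v
    return v

def particle_filter(observations, n=10_000):
    particles = [categorical(PRIOR) for _ in range(n)]
    for obs in observations:
        # Absorb evidence: weight each particle by P(obs | state).
        weights = [OBS[s] if obs else 1 - OBS[s] for s in particles]
        # Resample in proportion to weight; all new particles weigh the same.
        particles = random.choices(particles, weights=weights, k=n)
        # Transition: sample the next state for each particle.
        particles = [categorical(TRANS[s]) for s in particles]
    return particles

random.seed(0)
particles = particle_filter([True, True, True])
frac = sum(s == "rain" for s in particles) / len(particles)
print(frac)  # fraction of particles in state "rain"
```

Resampling concentrates particles in likely regions of the state space, which is what lets the filter run forever without its weights degenerating.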

Markov Chain Monte Carlo

To sample from a distribution P:

- Create an (ergodic and aperiodic) Markov chain with P as its equilibrium distribution.
- Let T(S_{i+1} | S_i) be the transition probability.
- Given state s, sample state s′ from T(S | s).
- After a while, the states sampled will be distributed according to P.
- Ignore the first samples ("burn-in"); use the remaining samples.
- Samples are not independent of each other ("autocorrelation").
- Sometimes only a subset (e.g., 1/1000) of them is used ("thinning").
- Gibbs sampler: sample each non-observed variable from the distribution of the variable given the current (or observed) values of the variables in its Markov blanket.

c©D. Poole and A. Mackworth 2017 Artificial Intelligence, Lecture 8.6, Page 28 15 / 15
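A minimal Gibbs-sampler sketch for an assumed joint P(X, Y) over two binary variables (the joint is illustrative, not from the slides); each step resamples one variable conditioned on the other, which here is its whole Markov blanket:

```python
import random

# Assumed joint distribution P(X, Y) over binary X, Y.
JOINT = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

def conditional(var, other_value):
    # P(var = 1 | other variable = other_value)
    if var == "X":
        p1, p0 = JOINT[(1, other_value)], JOINT[(0, other_value)]
    else:
        p1, p0 = JOINT[(other_value, 1)], JOINT[(other_value, 0)]
    return p1 / (p0 + p1)

random.seed(0)
x, y = 0, 0
samples = []
for i in range(110_000):
    x = int(random.random() < conditional("X", y))
    y = int(random.random() < conditional("Y", x))
    if i >= 10_000:            # discard burn-in
        samples.append((x, y))

# Empirical marginal P(X = 1) should approach 0.15 + 0.45 = 0.60.
est = sum(s[0] for s in samples) / len(samples)
print(est)
```

Consecutive samples are autocorrelated, so the effective sample size is smaller than len(samples); thinning would reduce this at the cost of discarding draws.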

