Bayesian networks: Inference and learning

CS194-10 Fall 2011 Lecture 22


Outline

♦ Exact inference (briefly)

♦ Approximate inference (rejection sampling, MCMC)

♦ Parameter learning


Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation

Simple query on the burglary network (Burglary, Earthquake → Alarm → JohnCalls, MaryCalls):

P(B | j, m) = P(B, j, m)/P(j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)

Rewrite full joint entries using products of CPT entries:

P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)
            = α P(B) Σe P(e) Σa P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(D) space, O(K^D) time
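As a concrete illustration, here is a minimal Python sketch of the recursive depth-first enumeration (not the course code), run on the standard burglary/alarm network; the CPT values below are the usual textbook ones and the variable names are ours.

VARS = ["B", "E", "A", "J", "M"]                     # topological order
PARENTS = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
CPT = {                                              # P(variable = True | parent values)
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}

def p(var, value, assignment):
    """P(var = value | values of parents(var) in assignment)."""
    p_true = CPT[var][tuple(assignment[q] for q in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, assignment):
    """Depth-first sum over the remaining (hidden) variables."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in assignment:
        return p(first, assignment[first], assignment) * enumerate_all(rest, assignment)
    return sum(p(first, v, {**assignment, first: v}) * enumerate_all(rest, {**assignment, first: v})
               for v in (True, False))

def enumeration_ask(query_var, evidence):
    """Normalized distribution P(query_var | evidence)."""
    dist = {v: enumerate_all(VARS, {**evidence, query_var: v}) for v in (True, False)}
    total = sum(dist.values())
    return {v: q / total for v, q in dist.items()}

print(enumeration_ask("B", {"J": True, "M": True}))  # ≈ {True: 0.284, False: 0.716}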


Evaluation tree

[Figure: evaluation tree for α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a), with P(b) = .001 at the root, branches on e (.002 / .998) and a (.95/.05 under e, .94/.06 under ¬e), and leaf factors P(j | a) = .90, P(j | ¬a) = .05, P(m | a) = .70, P(m | ¬a) = .01]

Enumeration is inefficient because of repeated computation: e.g., it computes P(j | a) P(m | a) for each value of e


Efficient exact inference

Junction tree and variable elimination algorithms avoid repeated computation (a generalized form of dynamic programming)

♦ Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of exact inference are O(K^L D) (L = maximum number of parents)

♦ Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete

[Figure: reduction from 3SAT to Bayes net inference: root variables A, B, C, D, each with prior 0.5, one node per 3-literal clause (1, 2, 3), all feeding an AND node]


Inference by stochastic simulation

Idea: replace the sum over hidden-variable assignments with a random sample.

1) Draw N samples from a sampling distribution S

2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior


Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from prior specified by bn

inputs: bn, a Bayesian network specifying joint distribution P(X1, . . . , XD)

x← an event with D elements

for j = 1, . . . , D do
      x[j] ← a random sample from P(Xj | values of parents(Xj) in x)

return x
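A minimal Python sketch of Prior-Sample, specialized to the cloudy/sprinkler/rain network shown on the next slide (the CPT numbers are taken from that slide; the dictionary and function names are ours):

import random

PARENTS = {"Cloudy": [], "Sprinkler": ["Cloudy"], "Rain": ["Cloudy"],
           "WetGrass": ["Sprinkler", "Rain"]}
CPT = {  # P(variable = True | parent values)
    "Cloudy":    {(): 0.50},
    "Sprinkler": {(True,): 0.10, (False,): 0.50},
    "Rain":      {(True,): 0.80, (False,): 0.20},
    "WetGrass":  {(True, True): 0.99, (True, False): 0.90,
                  (False, True): 0.90, (False, False): 0.01},
}
ORDER = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]  # a topological order

def prior_sample():
    """Sample each variable in topological order, given its already-sampled parents."""
    x = {}
    for var in ORDER:
        p_true = CPT[var][tuple(x[q] for q in PARENTS[var])]
        x[var] = random.random() < p_true
    return x

print(prior_sample())  # e.g. {'Cloudy': True, 'Sprinkler': False, 'Rain': True, 'WetGrass': True}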


Example

The cloudy/sprinkler/rain network used for sampling: Cloudy is the parent of Sprinkler and Rain, which in turn are the parents of WetGrass, with CPTs

P(C) = .50
P(S | C): C = true → .10, C = false → .50
P(R | C): C = true → .80, C = false → .20
P(W | S, R): (true, true) → .99, (true, false) → .90, (false, true) → .90, (false, false) → .01

Running Prior-Sample on this network might generate, e.g., the event Cloudy = true, Sprinkler = false, Rain = true, WetGrass = true, the event used on the next slide.


Sampling from an empty network contd.

Probability that PriorSample generates a particular event

SPS(x1 . . . xD) = Π_{j=1}^{D} P(xj | parents(Xj)) = P(x1 . . . xD)

i.e., the true prior probability

E.g., SPS(t, f, t, t) = 0.5× 0.9× 0.8× 0.9 = 0.324 = P (t, f, t, t)

Let NPS(x1 . . . xD) be the number of samples generated for event x1, . . . , xD

Then we have

lim_{N→∞} P̂(x1, . . . , xD) = lim_{N→∞} NPS(x1, . . . , xD)/N
                            = SPS(x1, . . . , xD)
                            = P(x1 . . . xD)

That is, estimates derived from PriorSample are consistent

Shorthand: P̂(x1, . . . , xD) ≈ P(x1 . . . xD)


Rejection sampling

P̂(X | e) estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts for each value of X, initially zero

  for i = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then
          N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N)

E.g., estimate P(Rain | Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false.

P̂(Rain | Sprinkler = true) = Normalize(〈8, 19〉) = 〈0.296, 0.704〉

Similar to a basic real-world empirical estimation procedure
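A minimal Python sketch of rejection sampling for this query, assuming prior_sample and the sprinkler-network tables from the earlier sketch are in scope:

def rejection_sampling(query_var, evidence, n_samples=10_000):
    """Estimate P(query_var | evidence) from prior samples that agree with the evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n_samples):
        x = prior_sample()
        if all(x[var] == val for var, val in evidence.items()):   # keep only consistent samples
            counts[x[query_var]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else counts

print(rejection_sampling("Rain", {"Sprinkler": True}))  # roughly {True: 0.3, False: 0.7}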


Analysis of rejection sampling

P̂(X | e) = α NPS(X, e)           (algorithm defn.)
         = NPS(X, e)/NPS(e)      (normalized by NPS(e))
         ≈ P(X, e)/P(e)          (property of PriorSample)
         = P(X | e)              (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P (e) is small

P (e) drops off exponentially with number of evidence variables!


Approximate inference using MCMC

General idea of Markov chain Monte Carlo

♦ Sample space Ω, probability π(ω) (e.g., posterior given e)

♦ Would like to sample directly from π(ω), but it’s hard

♦ Instead, wander around Ω randomly, collecting samples

♦ Random wandering is controlled by a transition kernel φ(ω → ω′) specifying the probability of moving to ω′ from ω (so the random state sequence ω0, ω1, . . . , ωt is a Markov chain)

♦ If φ is defined appropriately, the stationary distribution is π(ω), so that after a while (the mixing time) the collected samples are drawn from π


Gibbs sampling in Bayes nets

Markov chain state ωt = current assignment xt to all variables

Transition kernel: pick a variable Xj, sample it conditioned on all others

Markov blanket property: P(Xj | all other variables) = P(Xj | mb(Xj)), so generate the next state by sampling a variable given its Markov blanket

function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X | e)

local variables: N, a vector of counts for each value of X , initially zero

Z, the nonevidence variables in bn

z, the current state of variables Z, initially random

for i = 1 to N do
      choose Zj in Z uniformly at random

set the value of Zj in z by sampling from P(Zj|mb(Zj))

N[x ]←N[x ] + 1 where x is the value of X in z

return Normalize(N)
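A minimal Python sketch of this procedure for the sprinkler network, again assuming PARENTS, CPT, and ORDER from the Prior-Sample sketch are in scope; the helper names are ours, and mb_distribution implements sampling a variable given its Markov blanket (the formula is given on a later slide).

import random

CHILDREN = {v: [c for c in PARENTS if v in PARENTS[c]] for v in PARENTS}

def prob(var, value, state):
    """P(var = value | values of parents(var) in state)."""
    p_true = CPT[var][tuple(state[q] for q in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def mb_distribution(var, state):
    """P(var = True | Markov blanket): own CPT entry times the children's CPT entries."""
    weights = {}
    for v in (True, False):
        s = {**state, var: v}
        w = prob(var, v, s)
        for child in CHILDREN[var]:
            w *= prob(child, s[child], s)
        weights[v] = w
    return weights[True] / (weights[True] + weights[False])

def gibbs_ask(query_var, evidence, n_steps=50_000):
    nonevidence = [v for v in ORDER if v not in evidence]
    state = {**evidence, **{v: random.random() < 0.5 for v in nonevidence}}  # random initial state
    counts = {True: 0, False: 0}
    for _ in range(n_steps):
        zj = random.choice(nonevidence)                       # pick a nonevidence variable
        state[zj] = random.random() < mb_distribution(zj, state)
        counts[state[query_var]] += 1
    return {v: c / n_steps for v, c in counts.items()}

print(gibbs_ask("Rain", {"Sprinkler": True, "WetGrass": True}))  # ≈ {True: 0.32, False: 0.68}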


The Markov chain

With Sprinkler = true, WetGrass = true, there are four states:

[Figure: the four states (Cloudy, Rain) ∈ {true, false}², each drawn as the sprinkler network with Sprinkler = true and WetGrass = true fixed, with the Gibbs transitions between neighbouring states]


MCMC example contd.

Estimate P(Rain|Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat.
Count the number of times Rain is true and false in the samples.

E.g., visit 100 states
31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(〈31, 69〉) = 〈0.31, 0.69〉


Markov blanket sampling

Markov blanket of Cloudy is Sprinkler and Rain;
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass.

[Figure: the sprinkler network with each node's Markov blanket highlighted]

Probability given the Markov blanket is calculated as follows:

P(x′j | mb(Xj)) = α P(x′j | parents(Xj)) Π_{Zℓ ∈ Children(Xj)} P(zℓ | parents(Zℓ))

E.g., φ((¬cloudy, rain) → (cloudy, rain))
    = 0.5 × α P(cloudy) P(sprinkler | cloudy) P(rain | cloudy)
    = 0.5 × α × 0.5 × 0.1 × 0.8
    = 0.5 × 0.040/(0.040 + 0.050) = 0.2222

(Easy for discrete variables; the continuous case requires mathematical analysis for each combination of distribution types.)

Easily implemented in message-passing parallel systems, brains

Can converge slowly, especially for near-deterministic models
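The 0.2222 above can be checked numerically with the mb_distribution helper from the Gibbs sketch (assumed in scope): from the state (¬cloudy, rain) the chance of moving to (cloudy, rain) is 0.5 times the Markov-blanket probability of Cloudy = true.

state = {"Cloudy": False, "Rain": True, "Sprinkler": True, "WetGrass": True}
p_cloudy = mb_distribution("Cloudy", state)   # 0.040 / (0.040 + 0.050) ≈ 0.444
print(0.5 * p_cloudy)                         # ≈ 0.222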


Theory for Gibbs sampling

Theorem: the stationary distribution for the Gibbs transition kernel is P(z | e); i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability

Proof sketch:
– The Gibbs transition kernel satisfies detailed balance for P(z | e), i.e., for all z, z′ the "flow" from z to z′ is the same as from z′ to z
– π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π


Detailed balance

Let πt(z) be the probability the chain is in state z at time t

Detailed balance condition: “outflow” = “inflow” for each pair of states:

πt(z)φ(z→ z′) = πt(z′)φ(z′ → z) for all z, z′

Detailed balance ⇒ stationarity:

πt+1(z) = Σz′ πt(z′) φ(z′ → z) = Σz′ πt(z) φ(z → z′)   (detailed balance)
        = πt(z) Σz′ φ(z → z′)
        = πt(z)

MCMC algorithms are typically constructed by designing a transition kernel φ that is in detailed balance with the desired π


Gibbs sampling transition kernel

Probability of choosing variable Zj to sample is 1/(D − |E|)

Let Z̄j be all other nonevidence variables, i.e., Z − Zj; current values are zj and z̄j; e is fixed. The transition probability is

φ(z → z′) = φ(zj, z̄j → z′j, z̄j) = P(z′j | z̄j, e)/(D − |E|)

This gives detailed balance with P(z | e):

π(z) φ(z → z′) = (1/(D − |E|)) P(z | e) P(z′j | z̄j, e)
               = (1/(D − |E|)) P(zj, z̄j | e) P(z′j | z̄j, e)
               = (1/(D − |E|)) P(zj | z̄j, e) P(z̄j | e) P(z′j | z̄j, e)   (chain rule)
               = (1/(D − |E|)) P(zj | z̄j, e) P(z′j, z̄j | e)             (chain rule backwards)
               = φ(z′ → z) π(z′) = π(z′) φ(z′ → z)
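A numeric spot-check of this detailed-balance identity on the four-state sprinkler chain, reusing prob and mb_distribution from the Gibbs sketch (assumed in scope); pi below is the unnormalized posterior, so the normalizing constant cancels on both sides.

from itertools import product

EVID = {"Sprinkler": True, "WetGrass": True}

def pi(c, r):
    """Unnormalized posterior of the state (Cloudy = c, Rain = r) given the evidence."""
    s = {"Cloudy": c, "Rain": r, **EVID}
    return (prob("Cloudy", c, s) * prob("Sprinkler", True, s) *
            prob("Rain", r, s) * prob("WetGrass", True, s))

def phi(z, zp):
    """Gibbs kernel: 1/2 * P(z'_j | mb) for states differing in exactly one variable."""
    (c, r), (cp, rp) = z, zp
    var, new = ("Cloudy", cp) if c != cp else ("Rain", rp)
    state = {"Cloudy": c, "Rain": r, **EVID}
    p_true = mb_distribution(var, state)
    return 0.5 * (p_true if new else 1 - p_true)

for z, zp in product(product((True, False), repeat=2), repeat=2):
    if sum(a != b for a, b in zip(z, zp)) == 1:                # neighbouring states only
        assert abs(pi(*z) * phi(z, zp) - pi(*zp) * phi(zp, z)) < 1e-12
print("detailed balance holds for every neighbouring pair of states")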


Summary (inference)

Exact inference:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by MCMC:
– Generally insensitive to topology
– Convergence can be very slow with probabilities close to 1 or 0
– Can handle arbitrary combinations of discrete and continuous variables


Parameter learning: Complete data

θjkℓ = P(Xj = k | Parents(Xj) = ℓ)

Let xj^(i) = value of Xj in example i; assume Boolean for simplicity

Log likelihood

L(θ) = Σ_{i=1..N} Σ_{j=1..D} log P(xj^(i) | parents(Xj)^(i))
     = Σ_{i=1..N} Σ_{j=1..D} log [ θj1ℓ(i)^(xj^(i)) (1 − θj1ℓ(i))^(1 − xj^(i)) ]

(where ℓ(i) is the parent configuration of Xj in example i)

∂L/∂θj1ℓ = Nj1ℓ/θj1ℓ − Nj0ℓ/(1 − θj1ℓ) = 0 gives

θj1ℓ = Nj1ℓ/(Nj0ℓ + Nj1ℓ) = Nj1ℓ/Njℓ

I.e., learning is completely decomposed; the MLE is the observed conditional frequency
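A small Python sketch of this counting view of the MLE, fitting P(Rain | Cloudy) from synthetic complete data generated with prior_sample from the earlier sketch (assumed in scope):

from collections import Counter

data = [prior_sample() for _ in range(20_000)]

counts = Counter()                        # counts[(cloudy value, rain value)]
for x in data:
    counts[(x["Cloudy"], x["Rain"])] += 1

for c in (True, False):
    n1 = counts[(c, True)]                # N_{j1l}: Rain = true under this parent value
    n  = n1 + counts[(c, False)]          # N_{jl}
    print(f"theta_hat for P(Rain = true | Cloudy = {c}) = {n1}/{n} = {n1 / n:.3f}")
# Expect roughly 0.80 for Cloudy = True and 0.20 for Cloudy = False.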


Example

Red/green wrapper depends probabilistically on flavor:

[Figure: network Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) = θ1 for cherry, θ2 for lime]

Likelihood for, e.g., cherry candy in green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
    = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
    = θ · (1 − θ1)

N candies, rc red-wrapped cherry candies, etc.:

P(X | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]


Example contd.

Derivatives of L contain only the relevant parameter:

∂L/∂θ = c/θ − ℓ/(1 − θ) = 0   ⇒  θ = c/(c + ℓ)

∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒  θ1 = rc/(rc + gc)

∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒  θ2 = rℓ/(rℓ + gℓ)
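For instance, with hypothetical counts (invented purely for illustration) the maximum-likelihood estimates are just these ratios:

c, l   = 600, 400          # cherry / lime counts out of N = 1000 candies (hypothetical)
rc, gc = 450, 150          # red / green wrappers among the cherry candies
rl, gl = 100, 300          # red / green wrappers among the lime candies

theta  = c / (c + l)       # estimate of P(F = cherry)        -> 0.6
theta1 = rc / (rc + gc)    # estimate of P(W = red | cherry)  -> 0.75
theta2 = rl / (rl + gl)    # estimate of P(W = red | lime)    -> 0.25
print(theta, theta1, theta2)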


Hidden variables

Why learn models with hidden variables?

[Figure: (a) a network in which Smoking, Diet, and Exercise are parents of the hidden variable HeartDisease, which in turn is the parent of Symptom1, Symptom2, Symptom3 (node parameter counts 2, 2, 2, 54, 6, 6, 6 — 78 in total); (b) the same domain with HeartDisease removed, each Symptom depending directly on Smoking, Diet, and Exercise (node parameter counts 2, 2, 2, 54, 162, 486 — 708 in total)]

Hidden variables ⇒ simplified structure, fewer parameters ⇒ faster learning


Learning with and without hidden variables

[Figure: average negative log likelihood per case (roughly 3.0 to 3.5) vs. number of training cases (0 to 5000), comparing the 3-3 network and 3-1-3 network learned with the APN algorithm, neural networks with 1, 3, and 5 hidden nodes, and the target network]


EM for Bayes nets

For t = 0 to ∞ (until convergence) do

E step: Compute all pijkℓ = P(Xj = k, Parents(Xj) = ℓ | e^(i), θ^(t))

M step: θjkℓ^(t+1) = Njkℓ / Σ_{k′} Njk′ℓ = Σi pijkℓ / (Σi Σ_{k′} pijk′ℓ)

E step can be any exact or approximate inference algorithm

With MCMC, can treat each sample as a complete-data example
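A self-contained Python sketch of this EM loop on a candy-style model (hidden Bag, observed Flavor, Wrapper, Hole, all Boolean); the generating parameters and data are invented for illustration, the helper names are ours, and the exact E step plays the role of the inference call:

import math, random
random.seed(1)

TRUE_THETA = 0.6                                  # "true" P(Bag = 1), assumed for data generation
TRUE = {"F": (0.8, 0.3), "W": (0.7, 0.2), "H": (0.9, 0.4)}

def make_example():
    b = 0 if random.random() < TRUE_THETA else 1  # 0 = bag 1, 1 = bag 2 (hidden, then discarded)
    return tuple(random.random() < TRUE[v][b] for v in "FWH")

data = [make_example() for _ in range(5000)]

theta = 0.5                                       # initial guesses
params = {"F": [0.6, 0.4], "W": [0.6, 0.4], "H": [0.6, 0.4]}

def component_weights(x):
    """Unnormalized P(Bag = b, x) for b = 1, 2 under the current parameters."""
    ws = []
    for b, p_bag in ((0, theta), (1, 1 - theta)):
        w = p_bag
        for v, obs in zip("FWH", x):
            w *= params[v][b] if obs else 1 - params[v][b]
        ws.append(w)
    return ws

for it in range(50):
    # E step: responsibilities r_i = P(Bag = 1 | x_i, current parameters)
    resp = [w1 / (w1 + w2) for w1, w2 in map(component_weights, data)]
    # M step: re-estimate every parameter from the expected counts
    n1 = sum(resp)
    theta = n1 / len(data)
    for k, v in enumerate("FWH"):
        params[v][0] = sum(r for r, x in zip(resp, data) if x[k]) / n1
        params[v][1] = sum(1 - r for r, x in zip(resp, data) if x[k]) / (len(data) - n1)
    ll = sum(math.log(sum(component_weights(x))) for x in data)
    if it in (0, 4, 49):
        print(it, round(ll, 1))                   # the log likelihood never decreases across iterations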


Candy example

[Figure: candy network with hidden variable Bag (P(Bag = 1) = θ) as parent of Flavor, Wrapper, and Hole (e.g., P(F = cherry | B) = θF1, θF2), together with a plot of log-likelihood L (about −2020 to −1980) against EM iteration number (0 to 120)]


Example: Car insurance

[Figure: car insurance network with nodes SocioEcon, Age, GoodStudent, ExtraCar, Mileage, VehicleYear, RiskAversion, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, PropertyCost, LiabilityCost, MedicalCost, Cushioning, Ruggedness, Accident, OtherCost, OwnCost]


Example: Car insurance contd.

[Figure: average negative log likelihood per case (roughly 1.8 to 3.4) vs. number of training cases (0 to 5000), comparing the 12--3 network and the full insurance network learned with the APN algorithm, neural networks with 1, 5, and 10 hidden nodes, and the target network]


Bayesian learning in Bayes nets

Parameters become variables (parents of their previous owners):

P(Wrapper = red | Flavor = cherry, Θ1 = θ1, Θ2 = θ2) = θ1

Network replicated for each example, parameters shared across all examples:

[Figure: left, the original network Flavor → Wrapper with P(F = cherry) = θ and P(W = red | F) = θ1, θ2; right, the Bayesian version with parameter nodes Θ, Θ1, Θ2 as parents of the replicated Flavor_i → Wrapper_i pairs for examples 1, 2, 3, . . .]


Bayesian learning contd.

Priors for parameter variables: Beta, Dirichlet, Gamma, Gaussian, etc.

With independent Beta or Dirichlet priors, MAP EM learning = pseudocounts + expected counts:

M step: θjkℓ^(t+1) = (ajkℓ + Njkℓ) / Σ_{k′} (ajk′ℓ + Njk′ℓ)

Implemented in EM training mode for Hugin and other Bayes net packages
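A minimal sketch of this M step with Dirichlet/Beta pseudocounts ajkℓ, applied to hypothetical counts:

def map_estimate(counts, pseudocounts):
    """Normalize (expected counts + pseudocounts) over the values k of Xj."""
    total = sum(n + a for n, a in zip(counts, pseudocounts))
    return [(n + a) / total for n, a in zip(counts, pseudocounts)]

N = [3, 0]    # expected counts for Xj = 1, 0 under one parent configuration (hypothetical)
a = [1, 1]    # pseudocounts, corresponding to a Beta(2, 2)-style prior (assumed)
print(map_estimate(N, [0, 0]))   # plain ML estimate: [1.0, 0.0] -- overconfident after 3 cases
print(map_estimate(N, a))        # MAP estimate: [0.8, 0.2] -- smoothed toward the prior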


Summary (parameter learning)

Complete data: the likelihood factorizes; each parameter θjkℓ is learned separately from the observed counts as the conditional frequency Njkℓ/Njℓ

Incomplete data: the likelihood is a summation over all values of the hidden variables; can apply EM by computing "expected counts"

Bayesian learning: parameters become variables in a replicated model with their own prior distributions defined by hyperparameters; then Bayesian learning is just ordinary inference in the model
