Bayesian networks: Inference and learning

CS194-10 Fall 2011 Lecture 22


Outline

♦ Exact inference (briefly)

♦ Approximate inference (rejection sampling, MCMC)

♦ Parameter learning


Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation

Simple query on the burglary network (Burglary, Earthquake → Alarm → JohnCalls, MaryCalls):

P(B | j, m) = P(B, j, m)/P(j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)

Rewrite full joint entries using products of CPT entries:

P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)
            = α P(B) Σe P(e) Σa P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(D) space, O(K^D) time
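As a concrete illustration, here is a minimal Python sketch of the recursive depth-first enumeration (not the course code), run on the standard burglary/alarm network; the CPT values below are the usual textbook ones and the variable names are ours.

VARS = ["B", "E", "A", "J", "M"]                     # topological order
PARENTS = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
CPT = {                                              # P(variable = True | parent values)
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}

def p(var, value, assignment):
    """P(var = value | values of parents(var) in assignment)."""
    p_true = CPT[var][tuple(assignment[q] for q in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, assignment):
    """Depth-first sum over the remaining (hidden) variables."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in assignment:
        return p(first, assignment[first], assignment) * enumerate_all(rest, assignment)
    return sum(p(first, v, {**assignment, first: v}) * enumerate_all(rest, {**assignment, first: v})
               for v in (True, False))

def enumeration_ask(query_var, evidence):
    """Normalized distribution P(query_var | evidence)."""
    dist = {v: enumerate_all(VARS, {**evidence, query_var: v}) for v in (True, False)}
    total = sum(dist.values())
    return {v: q / total for v, q in dist.items()}

print(enumeration_ask("B", {"J": True, "M": True}))  # ≈ {True: 0.284, False: 0.716}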


Evaluation tree

[Figure: evaluation tree for α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a), with P(b) = .001 at the root, branches on e (.002 / .998) and a (.95/.05 under e, .94/.06 under ¬e), and leaf factors P(j | a) = .90, P(j | ¬a) = .05, P(m | a) = .70, P(m | ¬a) = .01]

Enumeration is inefficient because of repeated computation: e.g., it computes P(j | a) P(m | a) for each value of e


Efficient exact inference

Junction tree and variable elimination algorithms avoid repeated computation (a generalized form of dynamic programming)

♦ Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of exact inference are O(K^L D) (L = maximum number of parents)

♦ Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete

[Figure: reduction from 3SAT to Bayes net inference: root variables A, B, C, D, each with prior 0.5, one node per 3-literal clause (1, 2, 3), all feeding an AND node]


Inference by stochastic simulation

Idea: replace the sum over hidden-variable assignments with a random sample.

1) Draw N samples from a sampling distribution S

2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior


Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from prior specified by bn

inputs: bn, a Bayesian network specifying joint distribution P(X1, . . . , XD)

x← an event with D elements

for j = 1, . . . , D do
      x[j] ← a random sample from P(Xj | values of parents(Xj) in x)

return x
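A minimal Python sketch of Prior-Sample, specialized to the cloudy/sprinkler/rain network shown on the next slide (the CPT numbers are taken from that slide; the dictionary and function names are ours):

import random

PARENTS = {"Cloudy": [], "Sprinkler": ["Cloudy"], "Rain": ["Cloudy"],
           "WetGrass": ["Sprinkler", "Rain"]}
CPT = {  # P(variable = True | parent values)
    "Cloudy":    {(): 0.50},
    "Sprinkler": {(True,): 0.10, (False,): 0.50},
    "Rain":      {(True,): 0.80, (False,): 0.20},
    "WetGrass":  {(True, True): 0.99, (True, False): 0.90,
                  (False, True): 0.90, (False, False): 0.01},
}
ORDER = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]  # a topological order

def prior_sample():
    """Sample each variable in topological order, given its already-sampled parents."""
    x = {}
    for var in ORDER:
        p_true = CPT[var][tuple(x[q] for q in PARENTS[var])]
        x[var] = random.random() < p_true
    return x

print(prior_sample())  # e.g. {'Cloudy': True, 'Sprinkler': False, 'Rain': True, 'WetGrass': True}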


Example

The cloudy/sprinkler/rain network used for sampling: Cloudy is the parent of Sprinkler and Rain, which in turn are the parents of WetGrass, with CPTs

P(C) = .50
P(S | C): C = true → .10, C = false → .50
P(R | C): C = true → .80, C = false → .20
P(W | S, R): (true, true) → .99, (true, false) → .90, (false, true) → .90, (false, false) → .01

Running Prior-Sample on this network might generate, e.g., the event Cloudy = true, Sprinkler = false, Rain = true, WetGrass = true, the event used on the next slide.


Sampling from an empty network contd.

Probability that PriorSample generates a particular event

SPS(x1 . . . xD) = Π_{j=1}^{D} P(xj | parents(Xj)) = P(x1 . . . xD)

i.e., the true prior probability

E.g., SPS(t, f, t, t) = 0.5× 0.9× 0.8× 0.9 = 0.324 = P (t, f, t, t)

Let NPS(x1 . . . xD) be the number of samples generated for event x1, . . . , xD

Then we have

lim_{N→∞} P̂(x1, . . . , xD) = lim_{N→∞} NPS(x1, . . . , xD)/N
                            = SPS(x1, . . . , xD)
                            = P(x1 . . . xD)

That is, estimates derived from PriorSample are consistent

Shorthand: P̂(x1, . . . , xD) ≈ P(x1 . . . xD)


Rejection sampling

P̂(X | e) estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts for each value of X, initially zero

  for i = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then
          N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N)

E.g., estimate P(Rain | Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false.

P̂(Rain | Sprinkler = true) = Normalize(〈8, 19〉) = 〈0.296, 0.704〉

Similar to a basic real-world empirical estimation procedure
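A minimal Python sketch of rejection sampling for this query, assuming prior_sample and the sprinkler-network tables from the earlier sketch are in scope:

def rejection_sampling(query_var, evidence, n_samples=10_000):
    """Estimate P(query_var | evidence) from prior samples that agree with the evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n_samples):
        x = prior_sample()
        if all(x[var] == val for var, val in evidence.items()):   # keep only consistent samples
            counts[x[query_var]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else counts

print(rejection_sampling("Rain", {"Sprinkler": True}))  # roughly {True: 0.3, False: 0.7}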


Analysis of rejection sampling

P̂(X | e) = α NPS(X, e)           (algorithm defn.)
         = NPS(X, e)/NPS(e)      (normalized by NPS(e))
         ≈ P(X, e)/P(e)          (property of PriorSample)
         = P(X | e)              (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P (e) is small

P (e) drops off exponentially with number of evidence variables!


Approximate inference using MCMC

General idea of Markov chain Monte Carlo

♦ Sample space Ω, probability π(ω) (e.g., posterior given e)

♦ Would like to sample directly from π(ω), but it’s hard

♦ Instead, wander around Ω randomly, collecting samples

♦ Random wandering is controlled by a transition kernel φ(ω → ω′) specifying the probability of moving to ω′ from ω (so the random state sequence ω0, ω1, . . . , ωt is a Markov chain)

♦ If φ is defined appropriately, the stationary distribution is π(ω), so that after a while (the mixing time) the collected samples are drawn from π


Gibbs sampling in Bayes nets

Markov chain state ωt = current assignment xt to all variables

Transition kernel: pick a variable Xj, sample it conditioned on all others

Markov blanket property: P(Xj | all other variables) = P(Xj | mb(Xj)), so generate the next state by sampling a variable given its Markov blanket

function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X | e)

local variables: N, a vector of counts for each value of X , initially zero

Z, the nonevidence variables in bn

z, the current state of variables Z, initially random

for i = 1 to N do
      choose Zj in Z uniformly at random

set the value of Zj in z by sampling from P(Zj|mb(Zj))

N[x ]←N[x ] + 1 where x is the value of X in z

return Normalize(N)
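A minimal Python sketch of this procedure for the sprinkler network, again assuming PARENTS, CPT, and ORDER from the Prior-Sample sketch are in scope; the helper names are ours, and mb_distribution implements sampling a variable given its Markov blanket (the formula is given on a later slide).

import random

CHILDREN = {v: [c for c in PARENTS if v in PARENTS[c]] for v in PARENTS}

def prob(var, value, state):
    """P(var = value | values of parents(var) in state)."""
    p_true = CPT[var][tuple(state[q] for q in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def mb_distribution(var, state):
    """P(var = True | Markov blanket): own CPT entry times the children's CPT entries."""
    weights = {}
    for v in (True, False):
        s = {**state, var: v}
        w = prob(var, v, s)
        for child in CHILDREN[var]:
            w *= prob(child, s[child], s)
        weights[v] = w
    return weights[True] / (weights[True] + weights[False])

def gibbs_ask(query_var, evidence, n_steps=50_000):
    nonevidence = [v for v in ORDER if v not in evidence]
    state = {**evidence, **{v: random.random() < 0.5 for v in nonevidence}}  # random initial state
    counts = {True: 0, False: 0}
    for _ in range(n_steps):
        zj = random.choice(nonevidence)                       # pick a nonevidence variable
        state[zj] = random.random() < mb_distribution(zj, state)
        counts[state[query_var]] += 1
    return {v: c / n_steps for v, c in counts.items()}

print(gibbs_ask("Rain", {"Sprinkler": True, "WetGrass": True}))  # ≈ {True: 0.32, False: 0.68}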


The Markov chain

With Sprinkler = true, WetGrass = true, there are four states:

[Figure: the four states (Cloudy, Rain) ∈ {true, false}², each drawn as the sprinkler network with Sprinkler = true and WetGrass = true fixed, with the Gibbs transitions between neighbouring states]


MCMC example contd.

Estimate P(Rain|Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat.
Count the number of times Rain is true and false in the samples.

E.g., visit 100 states
31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(〈31, 69〉) = 〈0.31, 0.69〉


Markov blanket sampling

Markov blanket of Cloudy is Sprinkler and Rain;
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass.

[Figure: the sprinkler network with each node's Markov blanket highlighted]

Probability given the Markov blanket is calculated as follows:

P(x′j | mb(Xj)) = α P(x′j | parents(Xj)) Π_{Zℓ ∈ Children(Xj)} P(zℓ | parents(Zℓ))

E.g., φ((¬cloudy, rain) → (cloudy, rain))
    = 0.5 × α P(cloudy) P(sprinkler | cloudy) P(rain | cloudy)
    = 0.5 × α × 0.5 × 0.1 × 0.8
    = 0.5 × 0.040/(0.040 + 0.050) = 0.2222

(Easy for discrete variables; the continuous case requires mathematical analysis for each combination of distribution types.)

Easily implemented in message-passing parallel systems, brains

Can converge slowly, especially for near-deterministic models
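The 0.2222 above can be checked numerically with the mb_distribution helper from the Gibbs sketch (assumed in scope): from the state (¬cloudy, rain) the chance of moving to (cloudy, rain) is 0.5 times the Markov-blanket probability of Cloudy = true.

state = {"Cloudy": False, "Rain": True, "Sprinkler": True, "WetGrass": True}
p_cloudy = mb_distribution("Cloudy", state)   # 0.040 / (0.040 + 0.050) ≈ 0.444
print(0.5 * p_cloudy)                         # ≈ 0.222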


Theory for Gibbs sampling

Theorem: the stationary distribution for the Gibbs transition kernel is P(z | e); i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability

Proof sketch:
– The Gibbs transition kernel satisfies detailed balance for P(z | e), i.e., for all z, z′ the "flow" from z to z′ is the same as from z′ to z
– π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π


Detailed balance

Let πt(z) be the probability the chain is in state z at time t

Detailed balance condition: “outflow” = “inflow” for each pair of states:

πt(z)φ(z→ z′) = πt(z′)φ(z′ → z) for all z, z′

Detailed balance ⇒ stationarity:

πt+1(z) = Σz′ πt(z′) φ(z′ → z) = Σz′ πt(z) φ(z → z′)   (detailed balance)
        = πt(z) Σz′ φ(z → z′)
        = πt(z)

MCMC algorithms are typically constructed by designing a transition kernel φ that is in detailed balance with the desired π


Gibbs sampling transition kernel

Probability of choosing variable Zj to sample is 1/(D − |E|)

Let Z̄j be all other nonevidence variables, i.e., Z − Zj; current values are zj and z̄j; e is fixed. The transition probability is

φ(z → z′) = φ(zj, z̄j → z′j, z̄j) = P(z′j | z̄j, e)/(D − |E|)

This gives detailed balance with P(z | e):

π(z) φ(z → z′) = (1/(D − |E|)) P(z | e) P(z′j | z̄j, e)
               = (1/(D − |E|)) P(zj, z̄j | e) P(z′j | z̄j, e)
               = (1/(D − |E|)) P(zj | z̄j, e) P(z̄j | e) P(z′j | z̄j, e)   (chain rule)
               = (1/(D − |E|)) P(zj | z̄j, e) P(z′j, z̄j | e)             (chain rule backwards)
               = φ(z′ → z) π(z′) = π(z′) φ(z′ → z)
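A numeric spot-check of this detailed-balance identity on the four-state sprinkler chain, reusing prob and mb_distribution from the Gibbs sketch (assumed in scope); pi below is the unnormalized posterior, so the normalizing constant cancels on both sides.

from itertools import product

EVID = {"Sprinkler": True, "WetGrass": True}

def pi(c, r):
    """Unnormalized posterior of the state (Cloudy = c, Rain = r) given the evidence."""
    s = {"Cloudy": c, "Rain": r, **EVID}
    return (prob("Cloudy", c, s) * prob("Sprinkler", True, s) *
            prob("Rain", r, s) * prob("WetGrass", True, s))

def phi(z, zp):
    """Gibbs kernel: 1/2 * P(z'_j | mb) for states differing in exactly one variable."""
    (c, r), (cp, rp) = z, zp
    var, new = ("Cloudy", cp) if c != cp else ("Rain", rp)
    state = {"Cloudy": c, "Rain": r, **EVID}
    p_true = mb_distribution(var, state)
    return 0.5 * (p_true if new else 1 - p_true)

for z, zp in product(product((True, False), repeat=2), repeat=2):
    if sum(a != b for a, b in zip(z, zp)) == 1:                # neighbouring states only
        assert abs(pi(*z) * phi(z, zp) - pi(*zp) * phi(zp, z)) < 1e-12
print("detailed balance holds for every neighbouring pair of states")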


Summary (inference)

Exact inference:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by MCMC:
– Generally insensitive to topology
– Convergence can be very slow with probabilities close to 1 or 0
– Can handle arbitrary combinations of discrete and continuous variables


Parameter learning: Complete data

θjkℓ = P(Xj = k | Parents(Xj) = ℓ)

Let xj^(i) = value of Xj in example i; assume Boolean for simplicity

Log likelihood

L(θ) = Σ_{i=1..N} Σ_{j=1..D} log P(xj^(i) | parents(Xj)^(i))
     = Σ_{i=1..N} Σ_{j=1..D} log [ θj1ℓ(i)^(xj^(i)) (1 − θj1ℓ(i))^(1 − xj^(i)) ]

(where ℓ(i) is the parent configuration of Xj in example i)

∂L/∂θj1ℓ = Nj1ℓ/θj1ℓ − Nj0ℓ/(1 − θj1ℓ) = 0 gives

θj1ℓ = Nj1ℓ/(Nj0ℓ + Nj1ℓ) = Nj1ℓ/Njℓ

I.e., learning is completely decomposed; the MLE is the observed conditional frequency
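A small Python sketch of this counting view of the MLE, fitting P(Rain | Cloudy) from synthetic complete data generated with prior_sample from the earlier sketch (assumed in scope):

from collections import Counter

data = [prior_sample() for _ in range(20_000)]

counts = Counter()                        # counts[(cloudy value, rain value)]
for x in data:
    counts[(x["Cloudy"], x["Rain"])] += 1

for c in (True, False):
    n1 = counts[(c, True)]                # N_{j1l}: Rain = true under this parent value
    n  = n1 + counts[(c, False)]          # N_{jl}
    print(f"theta_hat for P(Rain = true | Cloudy = {c}) = {n1}/{n} = {n1 / n:.3f}")
# Expect roughly 0.80 for Cloudy = True and 0.20 for Cloudy = False.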


Example

Red/green wrapper depends probabilistically on flavor:

[Figure: network Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) = θ1 for cherry, θ2 for lime]

Likelihood for, e.g., cherry candy in green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
    = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
    = θ · (1 − θ1)

N candies, rc red-wrapped cherry candies, etc.:

P(X | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]


Example contd.

Derivatives of L contain only the relevant parameter:

∂L/∂θ = c/θ − ℓ/(1 − θ) = 0   ⇒  θ = c/(c + ℓ)

∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒  θ1 = rc/(rc + gc)

∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒  θ2 = rℓ/(rℓ + gℓ)
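For instance, with hypothetical counts (invented purely for illustration) the maximum-likelihood estimates are just these ratios:

c, l   = 600, 400          # cherry / lime counts out of N = 1000 candies (hypothetical)
rc, gc = 450, 150          # red / green wrappers among the cherry candies
rl, gl = 100, 300          # red / green wrappers among the lime candies

theta  = c / (c + l)       # estimate of P(F = cherry)        -> 0.6
theta1 = rc / (rc + gc)    # estimate of P(W = red | cherry)  -> 0.75
theta2 = rl / (rl + gl)    # estimate of P(W = red | lime)    -> 0.25
print(theta, theta1, theta2)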


Hidden variables

Why learn models with hidden variables?

[Figure: (a) a network in which Smoking, Diet, and Exercise are parents of the hidden variable HeartDisease, which in turn is the parent of Symptom1, Symptom2, Symptom3 (node parameter counts 2, 2, 2, 54, 6, 6, 6 — 78 in total); (b) the same domain with HeartDisease removed, each Symptom depending directly on Smoking, Diet, and Exercise (node parameter counts 2, 2, 2, 54, 162, 486 — 708 in total)]

Hidden variables ⇒ simplified structure, fewer parameters ⇒ faster learning


Learning with and without hidden variables

[Figure: average negative log likelihood per case (roughly 3.0 to 3.5) vs. number of training cases (0 to 5000), comparing the 3-3 network and 3-1-3 network learned with the APN algorithm, neural networks with 1, 3, and 5 hidden nodes, and the target network]


EM for Bayes nets

For t = 0 to ∞ (until convergence) do

E step: Compute all pijkℓ = P(Xj = k, Parents(Xj) = ℓ | e^(i), θ^(t))

M step: θjkℓ^(t+1) = Njkℓ / Σ_{k′} Njk′ℓ = Σi pijkℓ / (Σi Σ_{k′} pijk′ℓ)

E step can be any exact or approximate inference algorithm

With MCMC, can treat each sample as a complete-data example
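A self-contained Python sketch of this EM loop on a candy-style model (hidden Bag, observed Flavor, Wrapper, Hole, all Boolean); the generating parameters and data are invented for illustration, the helper names are ours, and the exact E step plays the role of the inference call:

import math, random
random.seed(1)

TRUE_THETA = 0.6                                  # "true" P(Bag = 1), assumed for data generation
TRUE = {"F": (0.8, 0.3), "W": (0.7, 0.2), "H": (0.9, 0.4)}

def make_example():
    b = 0 if random.random() < TRUE_THETA else 1  # 0 = bag 1, 1 = bag 2 (hidden, then discarded)
    return tuple(random.random() < TRUE[v][b] for v in "FWH")

data = [make_example() for _ in range(5000)]

theta = 0.5                                       # initial guesses
params = {"F": [0.6, 0.4], "W": [0.6, 0.4], "H": [0.6, 0.4]}

def component_weights(x):
    """Unnormalized P(Bag = b, x) for b = 1, 2 under the current parameters."""
    ws = []
    for b, p_bag in ((0, theta), (1, 1 - theta)):
        w = p_bag
        for v, obs in zip("FWH", x):
            w *= params[v][b] if obs else 1 - params[v][b]
        ws.append(w)
    return ws

for it in range(50):
    # E step: responsibilities r_i = P(Bag = 1 | x_i, current parameters)
    resp = [w1 / (w1 + w2) for w1, w2 in map(component_weights, data)]
    # M step: re-estimate every parameter from the expected counts
    n1 = sum(resp)
    theta = n1 / len(data)
    for k, v in enumerate("FWH"):
        params[v][0] = sum(r for r, x in zip(resp, data) if x[k]) / n1
        params[v][1] = sum(1 - r for r, x in zip(resp, data) if x[k]) / (len(data) - n1)
    ll = sum(math.log(sum(component_weights(x))) for x in data)
    if it in (0, 4, 49):
        print(it, round(ll, 1))                   # the log likelihood never decreases across iterations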


Candy example

[Figure: candy network with hidden variable Bag (P(Bag = 1) = θ) as parent of Flavor, Wrapper, and Hole (e.g., P(F = cherry | B) = θF1, θF2), together with a plot of log-likelihood L (about −2020 to −1980) against EM iteration number (0 to 120)]


Example: Car insurance

[Figure: car insurance network with nodes SocioEcon, Age, GoodStudent, ExtraCar, Mileage, VehicleYear, RiskAversion, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, PropertyCost, LiabilityCost, MedicalCost, Cushioning, Ruggedness, Accident, OtherCost, OwnCost]


Example: Car insurance contd.

[Figure: average negative log likelihood per case (roughly 1.8 to 3.4) vs. number of training cases (0 to 5000), comparing the 12--3 network and the full insurance network learned with the APN algorithm, neural networks with 1, 5, and 10 hidden nodes, and the target network]


Bayesian learning in Bayes nets

Parameters become variables (parents of their previous owners):

P(Wrapper = red | Flavor = cherry, Θ1 = θ1, Θ2 = θ2) = θ1

Network replicated for each example, parameters shared across all examples:

[Figure: left, the original network Flavor → Wrapper with P(F = cherry) = θ and P(W = red | F) = θ1, θ2; right, the Bayesian version with parameter nodes Θ, Θ1, Θ2 as parents of the replicated Flavor_i → Wrapper_i pairs for examples 1, 2, 3, . . .]


Bayesian learning contd.

Priors for parameter variables: Beta, Dirichlet, Gamma, Gaussian, etc.

With independent Beta or Dirichlet priors, MAP EM learning = pseudocounts + expected counts:

M step: θjkℓ^(t+1) = (ajkℓ + Njkℓ) / Σ_{k′} (ajk′ℓ + Njk′ℓ)

Implemented in EM training mode for Hugin and other Bayes net packages
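A minimal sketch of this M step with Dirichlet/Beta pseudocounts ajkℓ, applied to hypothetical counts:

def map_estimate(counts, pseudocounts):
    """Normalize (expected counts + pseudocounts) over the values k of Xj."""
    total = sum(n + a for n, a in zip(counts, pseudocounts))
    return [(n + a) / total for n, a in zip(counts, pseudocounts)]

N = [3, 0]    # expected counts for Xj = 1, 0 under one parent configuration (hypothetical)
a = [1, 1]    # pseudocounts, corresponding to a Beta(2, 2)-style prior (assumed)
print(map_estimate(N, [0, 0]))   # plain ML estimate: [1.0, 0.0] -- overconfident after 3 cases
print(map_estimate(N, a))        # MAP estimate: [0.8, 0.2] -- smoothed toward the prior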


Summary (parameter learning)

Complete data: the likelihood factorizes; each parameter θjkℓ is learned separately from the observed counts as the conditional frequency Njkℓ/Njℓ

Incomplete data: the likelihood is a summation over all values of the hidden variables; can apply EM by computing "expected counts"

Bayesian learning: parameters become variables in a replicated model with their own prior distributions defined by hyperparameters; then Bayesian learning is just ordinary inference in the model
