Directed Graphical Models

Directed Probabilistic Graphical Models CMSC 678 UMBC


Page 1: Directed Graphical Models

Directed Probabilistic Graphical Models

CMSC 678

UMBC

Page 2: Directed Graphical Models

Announcement 1: Assignment 3

Due Wednesday April 11th, 11:59 AM

Any questions?

Page 3: Directed Graphical Models

Announcement 2: Progress Report on Project

Due Monday April 16th, 11:59 AM

Build on the proposal:

Update to address comments

Discuss the progress you’ve made

Discuss what remains to be done

Discuss any new blocks you’ve experienced (or anticipate experiencing)

Any questions?

Page 4: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 5: Directed Graphical Models

Recap from last time…

Page 6: Directed Graphical Models

Expectation Maximization (EM): E-step

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

E-step: compute estimated counts $\mathrm{count}(z_i, w_i)$ using the current model $p^{(t)}(z)$; M-step: re-estimate $p^{(t+1)}(z)$ from those estimated counts

[Figure: the E-step and M-step alternate in a loop between parameters and estimated counts.]

Page 7: Directed Graphical Models

EM Math

$\max_{\theta} \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)} \left[ \log p_{\theta}(z, w) \right]$

E-step: count under uncertainty (the expectation is taken under the posterior distribution given the old parameters $\theta^{(t)}$)

M-step: maximize log-likelihood (the maximization is over the new parameters $\theta$)

$\mathcal{C}(\theta)$ = log-likelihood of complete data $(X, Y)$

$\mathcal{P}(\theta)$ = posterior log-likelihood of incomplete data $Y$

$\mathcal{M}(\theta)$ = marginal log-likelihood of observed data $X$

$\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{C}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{P}(\theta) \mid X]$

EM does not decrease the marginal log-likelihood

Page 8: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 9: Directed Graphical Models

Assume an original optimization problem

Lagrange multipliers

Page 10: Directed Graphical Models

Assume an original optimization problem

We convert it to a new optimization problem:

Lagrange multipliers
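The equations on these Lagrange-multiplier slides are images that did not survive the transcript; as a reminder, here is a sketch of the standard construction for an equality constraint (my notation, chosen to match the die-rolling example that follows):

$$\max_{\theta} f(\theta) \quad \text{s.t.} \quad g(\theta) = 0
\qquad\Longrightarrow\qquad
\mathcal{F}(\theta, \lambda) = f(\theta) - \lambda\, g(\theta)$$

We then look for stationary points where $\frac{\partial \mathcal{F}}{\partial \theta} = 0$ (so $\nabla_\theta f = \lambda \nabla_\theta g$) and $\frac{\partial \mathcal{F}}{\partial \lambda} = 0$ (which simply recovers the constraint $g(\theta) = 0$).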

Page 11: Directed Graphical Models

Lagrange multipliers: an equivalent problem?


Page 14: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 15: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls, e.g., $w_1 = 1$, $w_2 = 5$, $w_3 = 4$, ...

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

$\theta$ is a probability distribution over the 6 sides of the die: $\sum_{k=1}^{6} \theta_k = 1$, $\; 0 \le \theta_k \le 1 \;\; \forall k$

Page 16: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls, e.g., $w_1 = 1$, $w_2 = 5$, $w_3 = 4$, ...

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood: $\mathcal{L}(\theta) = \sum_i \log p_\theta(w_i) = \sum_i \log \theta_{w_i}$

Page 17: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood: $\mathcal{L}(\theta) = \sum_i \log \theta_{w_i}$

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

Page 18: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood: $\mathcal{L}(\theta) = \sum_i \log \theta_{w_i}$

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

A: Just keep increasing $\theta_k$ (we know $\theta$ must be a distribution, but that constraint isn’t specified in the objective as written)

Page 19: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):

$\mathcal{L}(\theta) = \sum_i \log \theta_{w_i} \quad \text{s.t.} \quad \sum_{k=1}^{6} \theta_k = 1$

(we can include the inequality constraints $0 \le \theta_k$, but it complicates the problem and, right now, is not needed)

solve using Lagrange multipliers

Page 20: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = \prod_i p(w_i)$, with N different (independent) rolls

Maximize Log-likelihood (with distribution constraints); we can include the inequality constraints $0 \le \theta_k$, but it complicates the problem and, right now, is not needed.

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

$\dfrac{\partial \mathcal{F}(\theta)}{\partial \theta_k} = \sum_{i: w_i = k} \dfrac{1}{\theta_{w_i}} - \lambda
\qquad
\dfrac{\partial \mathcal{F}(\theta)}{\partial \lambda} = -\sum_{k=1}^{6} \theta_k + 1$

Page 21: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = \prod_i p(w_i)$, with N different (independent) rolls

Maximize Log-likelihood (with distribution constraints); the inequality constraints $0 \le \theta_k$ are omitted, as before.

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

$\theta_k = \dfrac{\sum_{i: w_i = k} 1}{\lambda}$, where the optimal $\lambda$ is the one for which $\sum_{k=1}^{6} \theta_k = 1$

Page 22: Directed Graphical Models

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = \prod_i p(w_i)$, with N different (independent) rolls

Maximize Log-likelihood (with distribution constraints); the inequality constraints $0 \le \theta_k$ are omitted, as before.

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

$\theta_k = \dfrac{\sum_{i: w_i = k} 1}{\sum_{k'} \sum_{i: w_i = k'} 1} = \dfrac{N_k}{N}$, where the optimal $\lambda$ is the one for which $\sum_{k=1}^{6} \theta_k = 1$

Page 23: Directed Graphical Models

Example: Conditionally Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

add complexity to better explain what we see

e.g., $w_1 = 1$, $w_2 = 5$, ..., with hidden outcomes $z_1 = H$, $z_2 = T$, ...

$p(\text{heads}) = \lambda$, $\; p(\text{tails}) = 1 - \lambda$

$p(\text{heads}) = \gamma$, $\; p(\text{tails}) = 1 - \gamma$

$p(\text{heads}) = \psi$, $\; p(\text{tails}) = 1 - \psi$

Page 24: Directed Graphical Models

Example: Conditionally Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

add complexity to better explain what we see

$p(\text{heads}) = \lambda$, $\; p(\text{tails}) = 1 - \lambda$; $\quad p(\text{heads}) = \gamma$, $\; p(\text{tails}) = 1 - \gamma$; $\quad p(\text{heads}) = \psi$, $\; p(\text{tails}) = 1 - \psi$

$\lambda$ = distribution over the penny, $\gamma$ = distribution for the dollar coin, $\psi$ = distribution over the dime

Generative Story

for item $i = 1$ to $N$:
  $z_i \sim \mathrm{Bernoulli}(\lambda)$
  if $z_i = H$: $w_i \sim \mathrm{Bernoulli}(\gamma)$
  else: $w_i \sim \mathrm{Bernoulli}(\psi)$
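A minimal sampling sketch of this generative story (the parameter values below are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
lam, gamma, psi = 0.5, 0.9, 0.2     # penny, dollar coin, dime (illustrative values)

def sample(N):
    # Flip the penny to decide which coin to flip next, then flip that coin.
    z = rng.random(N) < lam               # True = H: use the dollar coin; False = T: use the dime
    p_heads = np.where(z, gamma, psi)     # emission parameter selected by z_i
    w = rng.random(N) < p_heads           # observed flips
    return z, w

z, w = sample(10)
print(z.astype(int), w.astype(int))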

Page 25: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 26: Directed Graphical Models

Classify with Bayes Rule

$\operatorname{argmax}_Y \; p(Y \mid X) = \operatorname{argmax}_Y \; \log p(X \mid Y) + \log p(Y)$

(likelihood term + prior term)
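A minimal sketch of this decision rule, assuming log-prior and log-likelihood functions are already available (the function and argument names here are hypothetical):

def classify(x, labels, log_prior, log_lik):
    # argmax_Y log p(X | Y) + log p(Y), equivalently argmax_Y p(Y | X)
    return max(labels, key=lambda y: log_lik(x, y) + log_prior(y))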

Page 27: Directed Graphical Models

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

Page 28: Directed Graphical Models

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

Page 29: Directed Graphical Models

The Bag of Words Representation

29Adapted from Jurafsky & Martin (draft)

Page 30: Directed Graphical Models

Bag of Words Representation

[Figure: a review document is mapped by the classifier γ to a class c using only its bag-of-words counts, e.g., seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, ...]

Adapted from Jurafsky & Martin (draft)

Page 31: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story (global parameters):

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

Page 32: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$: $y_i \sim \mathrm{Cat}(\phi)$

Page 33: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$ (local variables): $y_i \sim \mathrm{Cat}(\phi)$; for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

[Figure: label node $y$ with child feature nodes $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$.]

Page 34: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$: $y_i \sim \mathrm{Cat}(\phi)$; for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

each $x_{ij}$ is conditionally independent of one another (given the label)

[Figure: label node $y$ with child feature nodes $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$.]

Page 35: Directed Graphical Models

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = generate parameters

for item $i = 1$ to $N$: $y_i \sim \mathrm{Cat}(\phi)$; for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

[Figure: label node $y$ with child feature nodes $x_{i1}, \ldots, x_{i5}$.]

Maximize Log-likelihood:

$\mathcal{L}(\theta) = \sum_i \sum_j \log F_{y_i}(x_{ij}; \theta_{y_i}) + \sum_i \log \phi_{y_i}
\quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \theta_k \text{ is valid for } F_k$

Page 36: Directed Graphical Models

Multinomial Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values

for item $i = 1$ to $N$: $y_i \sim \mathrm{Cat}(\phi)$; for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i})$

[Figure: label node $y$ with child feature nodes $x_{i1}, \ldots, x_{i5}$.]

Maximize Log-likelihood:

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i}
\quad \text{s.t.} \quad \sum_k \phi_k = 1, \;\; \sum_j \theta_{kj} = 1 \;\; \forall k$

Page 37: Directed Graphical Models

Multinomial Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels

for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values

for item $i = 1$ to $N$: $y_i \sim \mathrm{Cat}(\phi)$; for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i})$

[Figure: label node $y$ with child feature nodes $x_{i1}, \ldots, x_{i5}$.]

Maximize Log-likelihood via Lagrange Multipliers:

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i}
- \mu \left( \sum_k \phi_k - 1 \right)
- \sum_k \lambda_k \left( \sum_j \theta_{kj} - 1 \right)$

Page 38: Directed Graphical Models

Multinomial Naïve Bayes: Learning

Calculate class priors:

For each $k$: $\text{items}_k$ = all items with class $= k$; $\quad p(k) = \dfrac{|\text{items}_k|}{\#\text{items}}$

Calculate feature generation terms:

For each $k$: $\text{obs}_k$ = single object containing all items labeled as $k$

For each feature $j$: $n_{kj}$ = # of occurrences of $j$ in $\text{obs}_k$; $\quad p(j \mid k) = \dfrac{n_{kj}}{\sum_{j'} n_{kj'}}$

Page 39: Directed Graphical Models

Brill and Banko (2001)

With enough data, the classifier may not matter

Adapted from Jurafsky & Martin (draft)

Page 40: Directed Graphical Models

Summary: Naïve Bayes is Not So Naïve, but not without issue

Pro:

Very fast, low storage requirements

Robust to irrelevant features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

Con:

Model the posterior in one go? (e.g., use conditional maxent)

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data? (automated, more principled)

Adapted from Jurafsky & Martin (draft)

Page 41: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 42: Directed Graphical Models

Hidden Markov Models

p(British Left Waffles on Falkland Islands)

Two possible class (tag) sequences:

(i): Adjective Noun Verb Prep Noun Noun

(ii): Noun Verb Noun Prep Noun Noun

Model all class sequences $z_1, \ldots, z_N$ in $p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N)$, using a class-based model $p(w_i \mid z_i)$ and a bigram model of the classes $p(z_i \mid z_{i-1})$

Page 43: Directed Graphical Models

Hidden Markov Model

Goal: maximize (log-)likelihood

In practice: we don’t actually observe these z values; we just see the words w

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Page 44: Directed Graphical Models

Hidden Markov Model

Goal: maximize (log-)likelihood

In practice: we don’t actually observe these z values; we just see the words w

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

Page 45: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Page 46: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

transition probabilities/parameters: the $p(z_i \mid z_{i-1})$ terms

Page 47: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: the $p(w_i \mid z_i)$ terms

transition probabilities/parameters: the $p(z_i \mid z_{i-1})$ terms

Page 48: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: the $p(w_i \mid z_i)$ terms

transition probabilities/parameters: the $p(z_i \mid z_{i-1})$ terms

Page 49: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: the $p(w_i \mid z_i)$ terms

transition probabilities/parameters: the $p(z_i \mid z_{i-1})$ terms

Q: How many different probability values are there with K states and V vocab items?

Page 50: Directed Graphical Models

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: the $p(w_i \mid z_i)$ terms

transition probabilities/parameters: the $p(z_i \mid z_{i-1})$ terms

Q: How many different probability values are there with K states and V vocab items?

A: $VK$ emission values and $K^2$ transition values

Page 51: Directed Graphical Models

Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: the $p(w_i \mid z_i)$ terms; transition probabilities/parameters: the $p(z_i \mid z_{i-1})$ terms

represent the probabilities and independence assumptions in a graph

[Figure: chain-structured directed graphical model $z_1 \to z_2 \to z_3 \to z_4 \to \cdots$, with each $z_i$ emitting $w_i$.]

Page 52: Directed Graphical Models

Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

emission probabilities/parameters: the $p(w_i \mid z_i)$ terms; transition probabilities/parameters: the $p(z_i \mid z_{i-1})$ terms

[Figure: the same chain, with edges labeled by their probabilities: $p(z_1 \mid z_0)$, $p(z_2 \mid z_1)$, $p(z_3 \mid z_2)$, $p(z_4 \mid z_3)$ on the transitions and $p(w_1 \mid z_1), \ldots, p(w_4 \mid z_4)$ on the emissions.]

$p(z_1 \mid z_0)$ uses the initial starting distribution (“__SEQ_SYM__”)

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

Page 53: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[Figure: the 2-state HMM unrolled as a lattice: at each position $i = 1..4$, $z_i$ is either N or V, emitting $w_i$.]

Transition probabilities (row = current state, column = next state):

            N     V     end
  start    .7    .2    .1
  N        .15   .8    .05
  V        .6    .35   .05

Emission probabilities (row = state, column = word):

         w1    w2    w3    w4
  N      .7    .2    .05   .05
  V      .2    .6    .1    .1

Page 54: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[Figure: the same lattice, with the emission arcs labeled $p(w_1 \mid N), \ldots, p(w_4 \mid N)$ and $p(w_1 \mid V), \ldots, p(w_4 \mid V)$.]

(transition and emission tables as on the previous slide)

Page 55: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[Figure: the same lattice, adding the within-row transition arcs: $p(N \mid \text{start}), p(N \mid N), \ldots$ and $p(V \mid \text{start}), p(V \mid V), \ldots$]

(transition and emission tables as above)

Page 56: Directed Graphical Models

Example: 2-state Hidden Markov Model as a Lattice

[Figure: the same lattice, now with all transition arcs, including the cross arcs $p(V \mid N)$ and $p(N \mid V)$ between the two rows.]

(transition and emission tables as above)

Page 57: Directed Graphical Models

A Latent Sequence is a Path through the Graph

[Figure: one latent sequence highlighted as a path through the lattice: N at position 1, V at positions 2 and 3, N at position 4.]

Q: What’s the probability of (N, w1), (V, w2), (V, w3), (N, w4)?

A: (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05) ≈ 0.000247

(transition and emission tables as above)
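A minimal sketch that recomputes this path probability from the tables above (the dictionary encoding and function name are my own):

# Transition and emission tables from the slides
trans = {("start", "N"): .7, ("start", "V"): .2,
         ("N", "N"): .15, ("N", "V"): .8,
         ("V", "N"): .6,  ("V", "V"): .35}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1,  "w4": .1}}

def path_prob(states, words):
    # prod_i p(z_i | z_{i-1}) * p(w_i | z_i), starting from z_0 = "start"
    p, prev = 1.0, "start"
    for z, w in zip(states, words):
        p *= trans[(prev, z)] * emit[z][w]
        prev = z
    return p

print(path_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))   # ≈ 0.000247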

Page 58: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 59: Directed Graphical Models

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number β€˜one’ to the soldier behind you.

If you are the rearmost soldier in the line, say the number β€˜one’ to the soldier in front of you.

If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

Page 63: Directed Graphical Models

What’s the Maximum Weighted Path?

[Figure: a small graph with weighted edges; the task is to find the maximum-weight path through it.]

Page 65: Directed Graphical Models

What’s the Maximum Weighted Path?

[Figure: the same graph, with partial maximum-weight sums accumulated along the first edges.]

Page 67: Directed Graphical Models

What’s the Maximum Weighted Path?

[Figure: the same graph, one step further: each next node adds the best incoming running total to obtain its own running maximum.]

Page 69: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 70: Directed Graphical Models

What’s the Maximum Value?

consider “any shared path ending with B” (via AB, BB, or CB)

maximize across the previous hidden state values:

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

$v(i, B)$ is the maximum probability of any path to that state B from the beginning (and emitting the observation)

[Figure: lattice columns for $z_{i-1} \in \{A, B, C\}$ and $z_i \in \{A, B, C\}$.]

Page 71: Directed Graphical Models

What’s the Maximum Value?

consider “any shared path ending with B” (via AB, BB, or CB)

maximize across the previous hidden state values:

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

$v(i, B)$ is the maximum probability of any path to that state B from the beginning (and emitting the observation)

[Figure: lattice columns for $z_{i-2}, z_{i-1}, z_i \in \{A, B, C\}$.]

Page 72: Directed Graphical Models

What’s the Maximum Value?

consider “any shared path ending with B” (via AB, BB, or CB)

maximize across the previous hidden state values:

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

$v(i, B)$ is the maximum probability of any path to that state B from the beginning (and emitting the observation)

computing $v$ at time $i-1$ will correctly incorporate (maximize over) paths through time $i-2$: we correctly obey the Markov property

[Figure: lattice columns for $z_{i-2}, z_{i-1}, z_i \in \{A, B, C\}$.]

Viterbi algorithm

Page 73: Directed Graphical Models

Viterbi Algorithm

v = double[N+2][K*]            // best score of any path ending in each state at each position
b = int[N+2][K*]               // backpointers/book-keeping: best previous state
v[*][*] = 0
v[0][START] = 1
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = pemission(obsi | state)
    for(old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      if(v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old
      }
    }
  }
}
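A runnable sketch of the same recurrence in Python, using the 2-state N/V example tables from earlier (the table encoding and function name are my own; the END column is omitted here for brevity):

states = ["N", "V"]
# p(next | prev)
trans = {"start": {"N": .7, "V": .2},
         "N": {"N": .15, "V": .8},
         "V": {"N": .6, "V": .35}}
# p(word | state)
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def viterbi(words):
    v = [{}]   # v[i][s]: best score of any state path ending in s after emitting words[0..i]
    b = [{}]   # backpointers: best previous state
    for s in states:
        v[0][s] = trans["start"][s] * emit[s][words[0]]
        b[0][s] = "start"
    for i in range(1, len(words)):
        v.append({})
        b.append({})
        for s in states:
            best = max(states, key=lambda sp: v[i - 1][sp] * trans[sp][s])
            v[i][s] = v[i - 1][best] * trans[best][s] * emit[s][words[i]]
            b[i][s] = best
    # follow the backpointers back from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(b[i][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi(["w1", "w2", "w3", "w4"]))   # most likely state sequence and its score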

Page 74: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 75: Directed Graphical Models

Forward Probability

α(i, B) is the total probability of all paths to that state B from the beginning

[Figure: lattice columns for $z_{i-2}, z_{i-1}, z_i \in \{A, B, C\}$.]

Page 76: Directed Graphical Models

Forward Probability

marginalize across the previous hidden state values:

$\alpha(i, B) = \sum_{s'} \alpha(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property

α(i, B) is the total probability of all paths to that state B from the beginning

[Figure: lattice columns for $z_{i-2}, z_{i-1}, z_i \in \{A, B, C\}$.]

Page 77: Directed Graphical Models

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

Page 78: Directed Graphical Models

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

the sum over $s'$: what are the immediate ways to get into state s?

$\alpha(i-1, s')$: what’s the total probability up until now?

$p(s \mid s') \cdot p(\text{obs at } i \mid s)$: how likely is it to get into state s this way?

Page 79: Directed Graphical Models

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

the sum over $s'$: what are the immediate ways to get into state s? $\alpha(i-1, s')$: what’s the total probability up until now? $p(s \mid s') \cdot p(\text{obs at } i \mid s)$: how likely is it to get into state s this way?

Q: What do we return? (How do we return the likelihood of the sequence?) A: α[N+1][END]
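A runnable sketch of the forward recurrence on the same 2-state example (again, the table encoding is my own; the END transition probabilities from the earlier table are included so the final value is the sequence likelihood):

states = ["N", "V"]
trans = {"start": {"N": .7, "V": .2, "end": .1},
         "N": {"N": .15, "V": .8, "end": .05},
         "V": {"N": .6, "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def forward(words):
    # alpha[i][s]: total probability of all paths that end in state s at step i
    # and emit the observations up to and including position i
    alpha = [{s: trans["start"][s] * emit[s][words[0]] for s in states}]
    for i in range(1, len(words)):
        alpha.append({s: sum(alpha[i - 1][sp] * trans[sp][s] for sp in states) * emit[s][words[i]]
                      for s in states})
    # marginal likelihood: sum over the ways of ending the sequence
    return sum(alpha[-1][s] * trans[s]["end"] for s in states)

print(forward(["w1", "w2", "w3", "w4"]))   # total probability of the word sequence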

Page 80: Directed Graphical Models

Outline

Recap of EM

Math: Lagrange Multipliers for constrained optimization

Probabilistic Modeling Example: Die Rolling

Directed Graphical Models: Naïve Bayes, Hidden Markov Models

Message Passing: Directed Graphical Model Inference (most likely sequence, total (marginal) probability, EM in D-PGMs)

Page 81: Directed Graphical Models

Forward & Backward Message Passing

[Figure: lattice columns for $z_{i-1}, z_i, z_{i+1} \in \{A, B, C\}$.]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end

3. (that emit the observation obs at i+1)

Page 82: Directed Graphical Models

Forward & Backward Message Passing

[Figure: in the lattice, α(i, B) summarizes everything to the left of state B at step i, and β(i, B) everything to its right.]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end

3. (that emit the observation obs at i+1)

Page 83: Directed Graphical Models

Forward & Backward Message Passing

[Figure: in the lattice, α(i, B) summarizes everything to the left of state B at step i, and β(i, B) everything to its right.]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end

3. (that emit the observation obs at i+1)

$\alpha(i, B) \cdot \beta(i, B)$ = total probability of paths through state B at step i

Page 84: Directed Graphical Models

Forward & Backward Message Passing

[Figure: the same lattice, with α(i, B) on the left and β(i+1, s) on the right of the arcs leaving state B at step i.]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end

3. (that emit the observation obs at i+1)

Page 85: Directed Graphical Models

Forward & Backward Message Passing

[Figure: the same lattice, with α(i, B) on the left and β(i+1, s′) on the right of the B→s′ arc.]

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

β(i, s) is the total probability of all paths:

1. that start at step i at state s

2. that terminate at the end

3. (that emit the observation obs at i+1)

$\alpha(i, B) \cdot p(s' \mid B) \cdot p(\text{obs at } i+1 \mid s') \cdot \beta(i+1, s')$ = total probability of paths through the B→s′ arc (at time i)

Page 86: Directed Graphical Models

With Both Forward and Backward Values

$\alpha(i, s) \cdot p(s' \mid s) \cdot p(\text{obs at } i+1 \mid s') \cdot \beta(i+1, s')$ = total probability of paths through the s→s′ arc (at time i)

$\alpha(i, s) \cdot \beta(i, s)$ = total probability of paths through state s at step i

$p(z_i = s \mid w_1, \cdots, w_N) = \dfrac{\alpha(i, s)\, \beta(i, s)}{\alpha(N+1, \text{END})}$

$p(z_i = s, z_{i+1} = s' \mid w_1, \cdots, w_N) = \dfrac{\alpha(i, s)\, p(s' \mid s)\, p(\text{obs}_{i+1} \mid s')\, \beta(i+1, s')}{\alpha(N+1, \text{END})}$

Page 87: Directed Graphical Models

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

Page 88: Directed Graphical Models

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

pobs(w | s)

ptrans(s’ | s)

Page 89: Directed Graphical Models

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

pobs(w | s)

ptrans(s’ | s)

π‘βˆ— 𝑧𝑖 = 𝑠 𝑀1, β‹― ,𝑀𝑁) =𝛼 𝑖, 𝑠 βˆ— 𝛽(𝑖, 𝑠)

𝛼(𝑁 + 1, END)

π‘βˆ— 𝑧𝑖 = 𝑠, 𝑧𝑖+1 = 𝑠′ 𝑀1, β‹― ,𝑀𝑁) =𝛼 𝑖, 𝑠 βˆ— 𝑝 𝑠′ 𝑠 βˆ— 𝑝 obs𝑖+1 𝑠′ βˆ— 𝛽(𝑖 + 1, 𝑠′)

𝛼(𝑁 + 1, END)

Page 90: Directed Graphical Models

M-Step

β€œmaximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{c(s \to s')}{\sum_x c(s \to x)}$

if we observed the hidden transitions…

Page 91: Directed Graphical Models

M-Step

β€œmaximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$

we don’t observe the hidden transitions, but we can approximately count them

Page 92: Directed Graphical Models

M-Step

β€œmaximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$

we don’t observe the hidden transitions, but we can approximately count them

we compute these expected counts in the E-step, with our α and β values

Page 93: Directed Graphical Models

EM For HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]                       // marginal likelihood of the observations
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    // expected emission count: posterior probability of being in state `next` at i+1
    cobs(obsi+1 | next) += α[i+1][next] * β[i+1][next] / L
    for(state = 0; state < K*; ++state) {
      // expected transition count: posterior probability of the state→next arc at time i
      u = pobs(obsi+1 | next) * ptrans(next | state)
      ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
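The slides use β but do not spell out its recurrence in code; here is a minimal sketch on the same 2-state example (my own encoding), mirroring the forward sketch given earlier:

states = ["N", "V"]
trans = {"start": {"N": .7, "V": .2, "end": .1},
         "N": {"N": .15, "V": .8, "end": .05},
         "V": {"N": .6, "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def backward(words):
    N = len(words)
    # beta[i][s]: total probability of finishing the sequence (positions i+1..N
    # plus the transition to END), given that we are in state s at position i
    beta = [dict() for _ in range(N)]
    beta[N - 1] = {s: trans[s]["end"] for s in states}
    for i in range(N - 2, -1, -1):
        beta[i] = {s: sum(trans[s][sp] * emit[sp][words[i + 1]] * beta[i + 1][sp] for sp in states)
                   for s in states}
    return beta

beta = backward(["w1", "w2", "w3", "w4"])
# Consistency check: combining the start transition, the first emission, and beta at
# position 0 reproduces the same sequence likelihood as the forward pass.
print(sum(trans["start"][s] * emit[s]["w1"] * beta[0][s] for s in states))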

Page 94: Directed Graphical Models

Bayesian Networks: Directed Acyclic Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

$\pi(\cdot)$ = “parents of”; the product follows a topological sort

Page 95: Directed Graphical Models

Bayesian Networks: Directed Acyclic Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

exact inference in general DAGs is NP-hard

inference in trees can be exact

Page 96: Directed Graphical Models

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:

P has a chain with an observed middle node

P has a fork with an observed parent node

P includes a “v-structure” or “collider” with all unobserved descendants

[Figure: the three path types: chain X → Z → Y, fork X ← Z → Y, and collider X → Z ← Y.]

Page 97: Directed Graphical Models

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:

P has a chain with an observed middle node (observing Z blocks the path from X to Y)

P has a fork with an observed parent node (observing Z blocks the path from X to Y)

P includes a “v-structure” or “collider” with all unobserved descendants (not observing Z blocks the path from X to Y)

[Figure: chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y.]

Page 98: Directed Graphical Models

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:

P has a chain with an observed middle node (observing Z blocks the path from X to Y)

P has a fork with an observed parent node (observing Z blocks the path from X to Y)

P includes a “v-structure” or “collider” with all unobserved descendants (not observing Z blocks the path from X to Y)

[Figure: chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y.]

For the collider: $p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y)$, so

$p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$

Page 99: Directed Graphical Models

Markov Blanket

The Markov blanket of a node x is its parents, children, and children’s parents: the set of nodes needed to form the complete conditional for a variable $x_i$.

$p(x_i \mid x_{j \ne i}) = \dfrac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i}
= \dfrac{\prod_k p(x_k \mid \pi(x_k))}{\int \prod_k p(x_k \mid \pi(x_k))\, dx_i}$   (factorization of the graph)

$= \dfrac{\prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))}{\int \prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))\, dx_i}$   (factor out terms not dependent on $x_i$)