Directed Graphical Models


Directed Probabilistic Graphical Models

CMSC 678

UMBC

Announcement 1: Assignment 3

Due Wednesday April 11th, 11:59 AM

Any questions?

Announcement 2: Progress Report on Project

Due Monday April 16th, 11:59 AM

Build on the proposal:
- Update to address comments
- Discuss the progress you've made
- Discuss what remains to be done
- Discuss any new blocks you've experienced (or anticipate experiencing)

Any questions?

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Recap from last time…

Expectation Maximization (EM): E-step

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

(Diagram: estimated counts $\mathrm{count}(z_i, w_i)\, p(z_i)$, computed under the current parameters $p^{(t)}(z)$, feed the update to the new parameters $p^{(t+1)}(z)$.)

EM Math

$\max_\theta \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\left[\log p_\theta(z, w)\right]$

E-step: count under uncertainty

M-step: maximize log-likelihood

old parameters: $\theta^{(t)}$; new parameters: $\theta$; posterior distribution: $p_{\theta^{(t)}}(\cdot \mid w)$

$\mathcal{C}(\theta)$ = log-likelihood of complete data $(X, Y)$

$\mathcal{P}(\theta)$ = posterior log-likelihood of incomplete data $Y$

$\mathcal{M}(\theta)$ = marginal log-likelihood of observed data $X$

$\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}\left[\mathcal{C}(\theta) \mid X\right] - \mathbb{E}_{Y \sim \theta^{(t)}}\left[\mathcal{P}(\theta) \mid X\right]$

EM does not decrease the marginal log-likelihood

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Lagrange multipliers

Assume an original (constrained) optimization problem. We convert it to a new, unconstrained optimization problem.

Lagrange multipliers: an equivalent problem?
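A minimal sketch of the conversion in general form, with a generic objective $f$ and a single equality constraint $g$ (these symbols are placeholders; the die-rolling example below instantiates them):

\begin{aligned}
&\text{Original problem:} && \max_{\theta}\; f(\theta) \quad \text{s.t.} \quad g(\theta) = 0 \\
&\text{New problem:} && \max_{\theta}\; \mathcal{F}(\theta, \lambda) = f(\theta) - \lambda\, g(\theta) \\
&\text{Setting derivatives to zero:} && \nabla_{\theta} f(\theta) = \lambda\, \nabla_{\theta} g(\theta), \qquad \frac{\partial \mathcal{F}}{\partial \lambda} = -g(\theta) = 0
\end{aligned}

Solving the second condition recovers the original constraint, so stationary points of $\mathcal{F}$ satisfy the constrained problem.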

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls, e.g. $w_1 = 1$, $w_2 = 5$, $w_3 = 4$

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

$\theta$ is a probability distribution over the 6 sides of the die: $\sum_{k=1}^{6} \theta_k = 1$, $0 \le \theta_k \le 1\ \forall k$

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls, e.g. $w_1 = 1$, $w_2 = 5$, $w_3 = 4$

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood:

$\mathcal{L}(\theta) = \sum_i \log p_\theta(w_i) = \sum_i \log \theta_{w_i}$

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood: $\mathcal{L}(\theta) = \sum_i \log \theta_{w_i}$

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Generative Story: for roll $i = 1$ to $N$: $w_i \sim \mathrm{Cat}(\theta)$

Maximize Log-likelihood: $\mathcal{L}(\theta) = \sum_i \log \theta_{w_i}$

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

A: Just keep increasing each $\theta_k$ without bound (we know $\theta$ must be a distribution, but that constraint is not yet part of the objective as written)

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):

$\mathcal{L}(\theta) = \sum_i \log \theta_{w_i} \quad \text{s.t.} \quad \sum_{k=1}^{6} \theta_k = 1$

(We could also include the inequality constraints $0 \le \theta_k$, but they complicate the problem and, right now, are not needed.)

Solve using Lagrange multipliers.

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

(We could also include the inequality constraints $0 \le \theta_k$, but they complicate the problem and, right now, are not needed.)

$\dfrac{\partial \mathcal{F}(\theta)}{\partial \theta_k} = \sum_{i : w_i = k} \dfrac{1}{\theta_{w_i}} - \lambda \qquad\qquad \dfrac{\partial \mathcal{F}(\theta)}{\partial \lambda} = -\sum_{k=1}^{6} \theta_k + 1$

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

(We could also include the inequality constraints $0 \le \theta_k$, but they complicate the problem and, right now, are not needed.)

Setting the derivatives to zero gives $\theta_k = \dfrac{\sum_{i : w_i = k} 1}{\lambda}$, with the optimal $\lambda$ chosen so that $\sum_{k=1}^{6} \theta_k = 1$.

Probabilistic Estimation of Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):

$\mathcal{F}(\theta) = \sum_i \log \theta_{w_i} - \lambda \left( \sum_{k=1}^{6} \theta_k - 1 \right)$

(We could also include the inequality constraints $0 \le \theta_k$, but they complicate the problem and, right now, are not needed.)

$\theta_k = \dfrac{\sum_{i : w_i = k} 1}{\lambda}$, and the optimal $\lambda$ (enforcing $\sum_{k=1}^{6} \theta_k = 1$) is $\lambda = \sum_k \sum_{i : w_i = k} 1 = N$, so $\theta_k = \dfrac{N_k}{N}$.
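A minimal sketch of this closed-form estimate in Python (the function name and toy data are illustrative):

from collections import Counter

def die_mle(rolls, num_sides=6):
    """Maximum-likelihood estimate theta_k = N_k / N for a categorical die."""
    counts = Counter(rolls)
    n = len(rolls)
    return [counts.get(k, 0) / n for k in range(1, num_sides + 1)]

rolls = [1, 5, 4, 5, 2, 5]      # toy data (w_1 = 1, w_2 = 5, w_3 = 4, ...)
theta_hat = die_mle(rolls)      # e.g. theta_5 = 3/6 = 0.5
print(theta_hat)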

Example: Conditionally Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

add complexity to better explain what we see

e.g. $w_1 = 1$, $z_1 = H$; $w_2 = 5$, $z_2 = T$

$p(\text{heads}) = \lambda$, $p(\text{tails}) = 1 - \lambda$
$p(\text{heads}) = \gamma$, $p(\text{tails}) = 1 - \gamma$
$p(\text{heads}) = \psi$, $p(\text{tails}) = 1 - \psi$

Example: Conditionally Rolling a Die

$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

add complexity to better explain what we see

$p(\text{heads}) = \lambda$, $p(\text{tails}) = 1 - \lambda$
$p(\text{heads}) = \gamma$, $p(\text{tails}) = 1 - \gamma$
$p(\text{heads}) = \psi$, $p(\text{tails}) = 1 - \psi$

Generative Story ($\lambda$ = distribution over the penny, $\gamma$ = distribution for the dollar coin, $\psi$ = distribution over the dime):

for item $i = 1$ to $N$:
  $z_i \sim \mathrm{Bernoulli}(\lambda)$
  if $z_i = H$: $w_i \sim \mathrm{Bernoulli}(\gamma)$
  else: $w_i \sim \mathrm{Bernoulli}(\psi)$
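A minimal sketch of this generative story as a sampler; the specific parameter values below are illustrative assumptions, not from the slides:

import random

def sample_coins(n, lam=0.5, gamma=0.8, psi=0.3):
    """Sample n (z_i, w_i) pairs: the penny z_i picks which coin generates w_i."""
    data = []
    for _ in range(n):
        z = 'H' if random.random() < lam else 'T'      # z_i ~ Bernoulli(lambda)
        p_heads = gamma if z == 'H' else psi           # dollar coin vs. dime
        w = 'H' if random.random() < p_heads else 'T'  # w_i ~ Bernoulli(gamma or psi)
        data.append((z, w))
    return data

print(sample_coins(5))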

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Classify with Bayes Rule

$\operatorname*{argmax}_Y \; p(Y \mid X) = \operatorname*{argmax}_Y \; \log p(X \mid Y) + \log p(Y)$

(likelihood term plus prior term)

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)


Bag of Words Representation

(Figure: a document is reduced to word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ..., which the classifier γ(·) maps to a class c.)

Adapted from Jurafsky & Martin (draft)

Naïve Bayes: A Generative Story

Generative Story (global parameters):
$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$: $\theta_k$ = generate parameters

Naïve Bayes: A Generative Story

Generative Story:
$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$: $\theta_k$ = generate parameters
for item $i = 1$ to $N$: $y_i \sim \mathrm{Cat}(\phi)$

Naïve Bayes: A Generative Story

Generative Story:
$\phi$ = distribution over $K$ labels   (global parameters)
for label $k = 1$ to $K$: $\theta_k$ = generate parameters
for item $i = 1$ to $N$:   (local variables)
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

(Diagram: label node $y$ with feature children $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$.)

Naïve Bayes: A Generative Story

Generative Story:
$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$: $\theta_k$ = generate parameters
for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

Each $x_{ij}$ is conditionally independent of the others, given the label.

(Diagram: label node $y$ with feature children $x_{i1}, \ldots, x_{i5}$.)

Naïve Bayes: A Generative Story

Generative Story:
$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$: $\theta_k$ = generate parameters
for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

Maximize Log-likelihood:

$\mathcal{L}(\theta) = \sum_i \sum_j \log F_{y_i}(x_{ij}; \theta_{y_i}) + \sum_i \log \phi_{y_i} \quad \text{s.t.} \quad \sum_k \phi_k = 1,\ \ \theta_k \text{ is valid for } F_k$

Multinomial Naïve Bayes: A Generative Story

Generative Story:
$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values
for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i})$

Maximize Log-likelihood:

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i} \quad \text{s.t.} \quad \sum_k \phi_k = 1,\ \ \sum_j \theta_{kj} = 1\ \forall k$

Multinomial Naïve Bayes: A Generative Story

Generative Story:
$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$: $\theta_k$ = distribution over $J$ feature values
for item $i = 1$ to $N$:
  $y_i \sim \mathrm{Cat}(\phi)$
  for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i})$

Maximize Log-likelihood via Lagrange Multipliers:

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i, x_{ij}} + \sum_i \log \phi_{y_i} - \mu \left( \sum_k \phi_k - 1 \right) - \sum_k \lambda_k \left( \sum_j \theta_{kj} - 1 \right)$

Multinomial Naïve Bayes: Learning

Calculate class priors. For each $k$:
  items$_k$ = all items with class = $k$
  $p(k) = \dfrac{|\text{items}_k|}{\#\,\text{items}}$

Calculate feature generation terms. For each $k$:
  obs$_k$ = single object containing all items labeled as $k$
  For each feature $j$: $n_{kj}$ = # of occurrences of $j$ in obs$_k$
  $p(j \mid k) = \dfrac{n_{kj}}{\sum_{j'} n_{kj'}}$
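A minimal sketch of these counting steps in Python; the function and variable names are illustrative, and smoothing is omitted to match the formulas above:

from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: list of class labels.
    Returns class priors p(k) and feature terms p(j | k) by raw counting."""
    n_items = len(docs)
    prior = {k: c / n_items for k, c in Counter(labels).items()}   # p(k) = |items_k| / #items

    counts = defaultdict(Counter)          # n_kj = # occurrences of feature j in class k
    for tokens, k in zip(docs, labels):
        counts[k].update(tokens)

    likelihood = {k: {j: n / sum(cnt.values()) for j, n in cnt.items()}
                  for k, cnt in counts.items()}                    # p(j | k) = n_kj / sum_j' n_kj'
    return prior, likelihood

prior, likelihood = train_multinomial_nb(
    [["seen", "sweet", "seen"], ["boring", "boring"]], ["pos", "neg"])
print(prior["pos"], likelihood["pos"]["seen"])   # 0.5, 2/3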

Brill and Banko (2001)

With enough data, the classifier may not matter

Adapted from Jurafsky & Martin (draft)

Summary: Naïve Bayes is Not So Naïve, but not without issue

Pros:
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- Dependable baseline for text classification (but often not the best)

Cons:
- Model the posterior in one go? (e.g., use conditional maxent)
- Are the features really uncorrelated?
- Are plain counts always appropriate?
- Are there "better" ways of handling missing/noisy data? (automated, more principled)

Adapted from Jurafsky & Martin (draft)

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Hidden Markov Models

p(British Left Waffles on Falkland Islands)

Two possible class (tag) sequences for the sentence:
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Class-based model: a bigram model of the classes, modeling all class sequences:

$\sum_{z_1, \ldots, z_N} p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = \sum_{z_1, \ldots, z_N} \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Hidden Markov Model

Goal: maximize (log-)likelihood

In practice: we don't actually observe these z values; we just see the words w.

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Hidden Markov Model

Goal: maximize (log-)likelihood

In practice: we don't actually observe these z values; we just see the words w.

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

If we knew the probability parameters, then we could estimate z and evaluate the likelihood... but we don't! :(

If we did observe z, estimating the probability parameters would be easy... but we don't! :(

Hidden Markov Model Terminology

Each $z_i$ can take the value of one of $K$ latent states.

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Hidden Markov Model Terminology

Each $z_i$ can take the value of one of $K$ latent states.

$p(z_1, w_1, \ldots, z_N, w_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

The $p(z_i \mid z_{i-1})$ terms are the transition probabilities/parameters.

Hidden Markov Model Terminology

Each $z_i$ can take the value of one of $K$ latent states.

$p(z_1, w_1, \ldots, z_N, w_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$.

Hidden Markov Model Terminology

Each $z_i$ can take the value of one of $K$ latent states.

Transition and emission distributions do not change over the sequence.

$p(z_1, w_1, \ldots, z_N, w_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$.

Hidden Markov Model Terminology

Each $z_i$ can take the value of one of $K$ latent states.

Transition and emission distributions do not change over the sequence.

$p(z_1, w_1, \ldots, z_N, w_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$.

Q: How many different probability values are there with K states and V vocab items?

Hidden Markov Model Terminology

Each $z_i$ can take the value of one of $K$ latent states.

Transition and emission distributions do not change over the sequence.

$p(z_1, w_1, \ldots, z_N, w_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$.

Q: How many different probability values are there with K states and V vocab items?

A: $VK$ emission values and $K^2$ transition values

Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$.

(Diagram: a chain of hidden states $z_1 \to z_2 \to z_3 \to z_4$, each emitting an observed word $w_1, w_2, w_3, w_4$.)

represent the probabilities and independence assumptions in a graph

Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

Emission probabilities/parameters: $p(w_i \mid z_i)$; transition probabilities/parameters: $p(z_i \mid z_{i-1})$.

(Diagram: the chain $z_1 \to z_2 \to z_3 \to z_4$ with transition edges labeled $p(z_1 \mid z_0), p(z_2 \mid z_1), p(z_3 \mid z_2), p(z_4 \mid z_3)$ and emission edges labeled $p(w_1 \mid z_1), p(w_2 \mid z_2), p(w_3 \mid z_3), p(w_4 \mid z_4)$; $p(z_1 \mid z_0)$ is the initial starting distribution ("__SEQ_SYM__").)

Each zi can take the value of one of K latent states

Transition and emission distributions do not change

Example: 2-state Hidden Markov Model as a Lattice

(Lattice: columns for positions 1-4, rows for the states N and V; each column's states connect to the next column's states, and the state at position $i$ emits the observed word $w_i$.)

Transition probabilities p(next | current):

         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(word | state):

        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

Example: 2-state Hidden Markov Model as a Lattice

(Lattice: the same two-row lattice, now with the emission edges labeled $p(w_1 \mid N), \ldots, p(w_4 \mid N)$ and $p(w_1 \mid V), \ldots, p(w_4 \mid V)$.)

(Transition and emission tables as above.)

Example: 2-state Hidden Markov Model as a Lattice

(Lattice: adds the same-state transition edges, labeled $p(N \mid \text{start}), p(N \mid N), p(N \mid N), p(N \mid N)$ and $p(V \mid \text{start}), p(V \mid V), p(V \mid V), p(V \mid V)$, alongside the emission edges.)

(Transition and emission tables as above.)

Example: 2-state Hidden Markov Model as a Lattice

(Lattice: the full lattice, with the cross-state transition edges $p(V \mid N)$ and $p(N \mid V)$ added between adjacent positions as well.)

(Transition and emission tables as above.)

A Latent Sequence is a Path through the Graph

(Lattice: the highlighted path uses the edges $p(N \mid \text{start}),\ p(w_1 \mid N),\ p(V \mid N),\ p(w_2 \mid V),\ p(V \mid V),\ p(w_3 \mid V),\ p(N \mid V),\ p(w_4 \mid N)$.)

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?

A: (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05) ≈ 0.000247

Transition probabilities p(next | current):

         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(word | state):

        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1
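A minimal sketch that scores this tagged path in Python; the dictionary layout is an assumed encoding of the tables above:

# transition[prev][next] and emission[state][word], from the tables above
transition = {"start": {"N": .7, "V": .2}, "N": {"N": .15, "V": .8}, "V": {"N": .6, "V": .35}}
emission = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
            "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def path_prob(tags, words):
    """Joint probability of a tag path and word sequence under the HMM."""
    prob, prev = 1.0, "start"
    for z, w in zip(tags, words):
        prob *= transition[prev][z] * emission[z][w]
        prev = z
    return prob

print(path_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))  # ~0.000247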

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.

If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.

If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch 16


What's the Maximum Weighted Path?

(Figure: a small edge-weighted graph worked step by step; at each step a node adds its incoming edge weight to the best running total received so far and passes the new total forward, so the maximum-weight path is found without enumerating every path.)

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

What’s the Maximum Value?

Consider any path ending in state B: the final step is A→B, B→B, or C→B.

Maximize across the previous hidden state values:

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

v(i, B) is the maximum probability of any path from the beginning to state B at step i (emitting the observation there).

(Diagram: states A, B, C at steps i-1 and i, with all arcs into B at step i.)

What’s the Maximum Value?

Consider any path ending in state B: the final step is A→B, B→B, or C→B.

Maximize across the previous hidden state values:

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

v(i, B) is the maximum probability of any path from the beginning to state B at step i (emitting the observation there).

(Diagram: states A, B, C at steps i-2, i-1, and i.)

What’s the Maximum Value?

Consider any path ending in state B: the final step is A→B, B→B, or C→B.

Maximize across the previous hidden state values:

$v(i, B) = \max_{s'} \; v(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

v(i, B) is the maximum probability of any path from the beginning to state B at step i (emitting the observation there).

Computing v at time i-1 already (correctly) maximizes over all paths through time i-2, so the recurrence obeys the Markov property.

Viterbi algorithm

Viterbi Algorithm

v = double[N+2][K*]      # best score of any path ending in each state at each position
b = int[N+2][K*]         # backpointers / book-keeping
v[*][*] = 0
v[0][START] = 1

for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = pemission(obs_i | state)
    for(old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      if(v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old        # remember which previous state gave the best score
      }
    }
  }
}
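A runnable sketch of the same recurrence in Python on the 2-state N/V example above; the end-of-sequence column is ignored here for simplicity:

def viterbi(words, states, transition, emission):
    """Return the most likely state sequence and its (unnormalized) probability."""
    v = [{}]          # v[i][s] = best score of any path ending in s after word i
    back = [{}]       # backpointers
    for s in states:  # initialization from the start state
        v[0][s] = transition["start"][s] * emission[s][words[0]]
        back[0][s] = "start"
    for i in range(1, len(words)):
        v.append({}); back.append({})
        for s in states:
            best_prev = max(states, key=lambda sp: v[i-1][sp] * transition[sp][s])
            v[i][s] = v[i-1][best_prev] * transition[best_prev][s] * emission[s][words[i]]
            back[i][s] = best_prev
    last = max(states, key=lambda s: v[-1][s])   # best final state
    path = [last]
    for i in range(len(words) - 1, 0, -1):       # trace the backpointers
        path.append(back[i][path[-1]])
    return list(reversed(path)), v[-1][last]

transition = {"start": {"N": .7, "V": .2}, "N": {"N": .15, "V": .8}, "V": {"N": .6, "V": .35}}
emission = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
            "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}
print(viterbi(["w1", "w2", "w3", "w4"], ["N", "V"], transition, emission))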

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Forward Probability

α(i, B) is the total probability of all paths from the beginning to state B at step i.

(Diagram: states A, B, C at steps i-2, i-1, and i, with all arcs into B at step i.)

Forward Probability

Marginalize across the previous hidden state values:

$\alpha(i, B) = \sum_{s'} \alpha(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

Computing α at time i-1 already (correctly) incorporates all paths through time i-2, so the recurrence obeys the Markov property.

α(i, B) is the total probability of all paths from the beginning to state B at step i.

(Diagram: states A, B, C at steps i-2, i-1, and i.)

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

how likely is it to get into state s this way?

what are the immediate ways to get into state s?

what’s the total probability up until now?

Forward Probability

α(i, s) is the total probability of all paths:

1. that start from the beginning

2. that end (currently) in s at step i

3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

how likely is it to get into state s this way?

what are the immediate ways to get into state s?

what’s the total probability up until now?

Q: What do we return? (How do we return the likelihood of the sequence?) A: α[N+1][end]
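A minimal sketch of the forward pass on the same illustrative tables, folding in the table's end column as an assumed end-of-sequence transition:

def forward_likelihood(words, states, transition, emission):
    """Total probability of the word sequence, summing over all state paths."""
    alpha = {s: transition["start"][s] * emission[s][words[0]] for s in states}
    for w in words[1:]:
        alpha = {s: sum(alpha[sp] * transition[sp][s] for sp in states) * emission[s][w]
                 for s in states}
    # fold in the end-of-sequence transition from the table's "end" column
    return sum(alpha[s] * transition[s]["end"] for s in states)

transition = {"start": {"N": .7, "V": .2, "end": .1},
              "N": {"N": .15, "V": .8, "end": .05},
              "V": {"N": .6, "V": .35, "end": .05}}
emission = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
            "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}
print(forward_likelihood(["w1", "w2", "w3", "w4"], ["N", "V"], transition, emission))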

Outline
- Recap of EM
- Math: Lagrange Multipliers for constrained optimization
- Probabilistic Modeling Example: Die Rolling
- Directed Graphical Models: Naïve Bayes, Hidden Markov Models
- Message Passing (Directed Graphical Model Inference): most likely sequence, total (marginal) probability, EM in D-PGMs

Forward & Backward Message Passing

(Diagram: states A, B, C at steps i-1, i, and i+1.)

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

Forward & Backward Message Passing

(Diagram: states A, B, C at steps i-1, i, and i+1; α(i, B) summarizes paths arriving at B, β(i, B) summarizes paths leaving B.)

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

Forward & Backward Message Passing

(Diagram: states A, B, C at steps i-1, i, and i+1; pairing α(i, B) with β(i, B).)

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) * β(i, B) = total probability of paths through state B at step i

Forward & Backward Message Passing

(Diagram: states A, B, C at steps i-1, i, and i+1; now pairing α(i, B) with β(i+1, s) across the arc from B to the next step.)

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

Forward & Backward Message Passing

(Diagram: states A, B, C at steps i-1, i, and i+1; the B→s' arc between steps i and i+1.)

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B→s' arc (at time i)

With Both Forward and Backward Values

α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s→s' arc (at time i)

α(i, s) * β(i, s) = total probability of paths through state s at step i

$p(z_i = s \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s) \cdot \beta(i, s)}{\alpha(N+1, \mathrm{END})}$

$p(z_i = s, z_{i+1} = s' \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s) \cdot p(s' \mid s) \cdot p(\mathrm{obs}_{i+1} \mid s') \cdot \beta(i+1, s')}{\alpha(N+1, \mathrm{END})}$

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

pobs(w | s)

ptrans(s’ | s)

Expectation Maximization (EM)

0. Assume some value for your parameters

Two step, iterative algorithm

1. E-step: count under uncertainty, assuming these parameters

2. M-step: maximize log-likelihood, assuming these uncertain counts

estimated counts

pobs(w | s)

ptrans(s’ | s)

$p^*(z_i = s \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s) \cdot \beta(i, s)}{\alpha(N+1, \mathrm{END})}$

$p^*(z_i = s, z_{i+1} = s' \mid w_1, \ldots, w_N) = \dfrac{\alpha(i, s) \cdot p(s' \mid s) \cdot p(\mathrm{obs}_{i+1} \mid s') \cdot \beta(i+1, s')}{\alpha(N+1, \mathrm{END})}$

M-Step

“maximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{c(s \to s')}{\sum_x c(s \to x)}$

if we observed the hidden transitions…

M-Step

“maximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$

we don't observe the hidden transitions, but we can approximately count them

M-Step

“maximize log-likelihood, assuming these uncertain counts”

$p_{\text{new}}(s' \mid s) = \dfrac{\mathbb{E}[c(s \to s')]}{\sum_x \mathbb{E}[c(s \to x)]}$

we don't observe the hidden transitions, but we can approximately count them

we compute these in the E-step, with our α and β values

EM for HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]                      # likelihood of the observed sequence

for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    # expected emission count: posterior probability of being in state `next` at i+1
    cobs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for(state = 0; state < K*; ++state) {
      u = pobs(obs_{i+1} | next) * ptrans(next | state)
      # expected transition count: posterior probability of the state→next arc
      ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
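A minimal sketch of the corresponding M-step re-normalization, assuming the expected counts from the E-step are stored as nested dictionaries:

def m_step(expected_counts):
    """expected_counts[s][s2] = E[c(s -> s2)]; returns p_new(s2 | s) = E[c(s->s2)] / sum_x E[c(s->x)]."""
    p_new = {}
    for s, row in expected_counts.items():
        total = sum(row.values())
        p_new[s] = {s2: c / total for s2, c in row.items()}
    return p_new

# toy expected transition counts from an E-step
print(m_step({"N": {"N": 1.2, "V": 3.8}, "V": {"N": 2.0, "V": 0.5}}))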

Bayesian Networks: Directed Acyclic Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

($\pi(x_i)$ denotes the "parents of" $x_i$; the indices follow a topological sort of the DAG.)

Bayesian Networks: Directed Acyclic Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

exact inference in general DAGs is NP-hard

inference in trees can be exact
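A small sketch of this factorization for a hypothetical three-node DAG x1 → x3 ← x2; the CPT values are made up for illustration:

# p(x1, x2, x3) = p(x1) * p(x2) * p(x3 | x1, x2) for the DAG x1 -> x3 <- x2
p_x1 = {True: 0.3, False: 0.7}
p_x2 = {True: 0.6, False: 0.4}
p_x3_given = {(True, True): 0.9, (True, False): 0.5,
              (False, True): 0.4, (False, False): 0.1}   # p(x3=True | x1, x2)

def joint(x1, x2, x3):
    """Joint probability as the product of each node's conditional, parents first."""
    p3 = p_x3_given[(x1, x2)]
    return p_x1[x1] * p_x2[x2] * (p3 if x3 else 1 - p3)

print(joint(True, False, True))   # 0.3 * 0.4 * 0.5 = 0.06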

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.

d-separation: X & Y are d-separated if, for every path P, one of the following is true:
- P has a chain with an observed middle node
- P has a fork with an observed parent node
- P includes a "v-structure" or "collider" whose node and descendants are all unobserved

(Diagrams: chain X → Z → Y; fork X ← Z → Y; collider X → Z ← Y.)

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.

d-separation: X & Y are d-separated if, for every path P, one of the following is true:
- P has a chain with an observed middle node (chain X → Z → Y: observing Z blocks the path from X to Y)
- P has a fork with an observed parent node (fork X ← Z → Y: observing Z blocks the path from X to Y)
- P includes a "v-structure" or "collider" whose node and descendants are all unobserved (collider X → Z ← Y: not observing Z blocks the path from X to Y)

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.

d-separation: X & Y are d-separated if, for every path P, one of the following is true:
- P has a chain with an observed middle node (chain X → Z → Y: observing Z blocks the path from X to Y)
- P has a fork with an observed parent node (fork X ← Z → Y: observing Z blocks the path from X to Y)
- P includes a "v-structure" or "collider" whose node and descendants are all unobserved (collider X → Z ← Y: not observing Z blocks the path from X to Y)

For the collider, $p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y)$, so marginalizing out the unobserved $z$ gives

$p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$

Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents.

$p(x_i \mid x_{j \ne i}) = \dfrac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i} = \dfrac{\prod_k p(x_k \mid \pi(x_k))}{\int \prod_k p(x_k \mid \pi(x_k))\, dx_i} = \dfrac{\prod_{k : k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))}{\int \prod_{k : k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))\, dx_i}$

(The second equality uses the factorization of the graph; the third factors out, and cancels, the terms that do not depend on $x_i$.)

The Markov blanket is the set of nodes needed to form the complete conditional for a variable $x_i$.
