directed graphical models
TRANSCRIPT
Directed Probabilistic Graphical Models
CMSC 678
UMBC
Announcement 1: Assignment 3
Due Wednesday April 11th, 11:59 AM
Any questions?
Announcement 2: Progress Report on Project
Due Monday April 16th, 11:59 AM
Build on the proposal:
Update to address comments
Discuss the progress you've made
Discuss what remains to be done
Discuss any new blocks you've experienced (or anticipate experiencing)
Any questions?
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Recap from last time…
Expectation Maximization (EM): E-step
0. Assume some value for your parameters
Two-step, iterative algorithm
1. E-step: count under uncertainty, assuming these parameters: use p^(t)(z) to compute estimated counts count(z_i, w_i)
2. M-step: maximize log-likelihood, assuming these uncertain counts: the estimated counts give the new parameters p^(t+1)(z)
EM Math
max_θ E_{z ∼ p_θ^(t)(·|w)} [log p_θ(z, w)]
E-step: count under uncertainty (the expectation is under the posterior distribution, using the old parameters)
M-step: maximize log-likelihood (solve for the new parameters)
ℒ(θ) = log-likelihood of complete data (X, Y)
𝒟(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X
ℳ(θ) = E_{Y ∼ p^(t)}[ℒ(θ) | X] − E_{Y ∼ p^(t)}[𝒟(θ) | X]
EM does not decrease the marginal log-likelihood
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Lagrange multipliers
Assume an original (constrained) optimization problem.
We convert it to a new optimization problem:
Lagrange multipliers: an equivalent problem?
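In symbols (using the same ℱ and λ notation as the die-rolling example that follows), the conversion is the standard one:

maximize f(θ) subject to g(θ) = 0
becomes
maximize ℱ(θ, λ) = f(θ) − λ g(θ)

Setting ∂ℱ/∂θ = 0 gives ∇f(θ) = λ ∇g(θ), and setting ∂ℱ/∂λ = 0 recovers the constraint g(θ) = 0.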
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
N different (independent) rolls: w_1 = 1, w_2 = 5, w_3 = 4, ⋯
Generative Story:
for roll i = 1 to N: w_i ∼ Cat(θ)
θ = a probability distribution over the 6 sides of the die:
∑_{k=1}^{6} θ_k = 1, 0 ≤ θ_k ≤ 1 ∀k
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i)
N different (independent) rolls: w_1 = 1, w_2 = 5, w_3 = 4, ⋯
Generative Story: for roll i = 1 to N: w_i ∼ Cat(θ)
Maximize Log-likelihood:
ℒ(θ) = ∑_i log p_θ(w_i) = ∑_i log θ_{w_i}
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i); N different (independent) rolls
Generative Story: for roll i = 1 to N: w_i ∼ Cat(θ)
Maximize Log-likelihood: ℒ(θ) = ∑_i log θ_{w_i}
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i); N different (independent) rolls
Generative Story: for roll i = 1 to N: w_i ∼ Cat(θ)
Maximize Log-likelihood: ℒ(θ) = ∑_i log θ_{w_i}
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing θ_k (we know θ must be a distribution, but that's not specified in the objective)
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i); N different (independent) rolls
Maximize Log-likelihood (with distribution constraints):
ℒ(θ) = ∑_i log θ_{w_i}  s.t.  ∑_{k=1}^{6} θ_k = 1
(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)
solve using Lagrange multipliers
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i); N different (independent) rolls
Maximize Log-likelihood (with distribution constraints):
ℱ(θ) = ∑_i log θ_{w_i} − λ (∑_{k=1}^{6} θ_k − 1)
(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)
∂ℱ(θ)/∂θ_k = ∑_{i: w_i = k} 1/θ_{w_i} − λ
∂ℱ(θ)/∂λ = −∑_{k=1}^{6} θ_k + 1
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i); N different (independent) rolls
Maximize Log-likelihood (with distribution constraints):
ℱ(θ) = ∑_i log θ_{w_i} − λ (∑_{k=1}^{6} θ_k − 1)
Setting ∂ℱ/∂θ_k = 0: θ_k = (∑_{i: w_i = k} 1) / λ; θ is optimal when ∑_{k=1}^{6} θ_k = 1
Probabilistic Estimation of Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i); N different (independent) rolls
Maximize Log-likelihood (with distribution constraints):
ℱ(θ) = ∑_i log θ_{w_i} − λ (∑_{k=1}^{6} θ_k − 1)
θ_k = (∑_{i: w_i = k} 1) / λ; enforcing ∑_{k=1}^{6} θ_k = 1 gives λ = ∑_k ∑_{i: w_i = k} 1 = N, so θ_k = N_k / N
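The closed-form answer θ_k = N_k / N is easy to sanity-check in code; a minimal sketch (the rolls beyond the three shown on the slide are made up):

```python
from collections import Counter

def mle_die(rolls, sides=6):
    """Lagrange-multiplier solution for a categorical die: theta_k = count(k) / N."""
    counts = Counter(rolls)
    n = len(rolls)
    return [counts.get(k, 0) / n for k in range(1, sides + 1)]

# hypothetical data, starting with w1=1, w2=5, w3=4 as on the slide
theta = mle_die([1, 5, 4, 1, 3, 1, 6, 5])
```

The returned vector sums to one by construction, which is exactly what the constraint bought us.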
Example: Conditionally Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i)
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)
add complexity to better explain what we see
w_1 = 1, w_2 = 5, ⋯ with z_1 = H, z_2 = T
p(heads) = λ, p(tails) = 1 − λ
p(heads) = γ, p(tails) = 1 − γ
p(heads) = ψ, p(tails) = 1 − ψ
Example: Conditionally Rolling a Die
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = ∏_i p(w_i|z_i) p(z_i)
add complexity to better explain what we see
λ = distribution over the penny; γ = distribution for the dollar coin; ψ = distribution over the dime
Generative Story:
for item i = 1 to N:
  z_i ∼ Bernoulli(λ)
  if z_i = H: w_i ∼ Bernoulli(γ)
  else: w_i ∼ Bernoulli(ψ)
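The generative story can be run directly; a minimal simulation sketch (the bias values for λ, γ, ψ are made up, and outcomes are coded as 'H'/'T'):

```python
import random

def generate(n, lam=0.6, gamma=0.3, psi=0.8, seed=0):
    """z_i ~ Bernoulli(lam) picks the coin: heads -> dollar coin (bias gamma),
    tails -> dime (bias psi); w_i is that coin's flip."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = 'H' if rng.random() < lam else 'T'
        p_heads = gamma if z == 'H' else psi
        w = 'H' if rng.random() < p_heads else 'T'
        pairs.append((z, w))
    return pairs

sample = generate(1000)
```

In the EM setting later, only the w column of this sample would be observed.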
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Classify with Bayes Rule
argmax_Y p(Y | X) = argmax_Y log p(X | Y) + log p(Y)
(likelihood term and prior term)
The Bag of Words Representation
Adapted from Jurafsky & Martin (draft)
Bag of Words Representation
γ(document) = c: the classifier maps a bag of word counts (seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ...) to a class c
Adapted from Jurafsky & Martin (draft)
Naïve Bayes: A Generative Story
Generative Story (global parameters):
τ = distribution over K labels
for label k = 1 to K: λ_k = generate parameters
Naïve Bayes: A Generative Story
Generative Story:
τ = distribution over K labels
for label k = 1 to K: λ_k = generate parameters
for item i = 1 to N: y_i ∼ Cat(τ)
Naïve Bayes: A Generative Story
Generative Story:
τ = distribution over K labels
for label k = 1 to K: λ_k = generate parameters
for item i = 1 to N:
  y_i ∼ Cat(τ)
  for each feature j: x_ij ∼ F_j(λ_{y_i})
(local variables: y_i and features x_i1, x_i2, x_i3, x_i4, x_i5)
Naïve Bayes: A Generative Story
Generative Story:
τ = distribution over K labels
for label k = 1 to K: λ_k = generate parameters
for item i = 1 to N:
  y_i ∼ Cat(τ)
  for each feature j: x_ij ∼ F_j(λ_{y_i})
each x_ij is conditionally independent of one another (given the label)
Naïve Bayes: A Generative Story
Generative Story:
τ = distribution over K labels
for label k = 1 to K: λ_k = generate parameters
for item i = 1 to N:
  y_i ∼ Cat(τ)
  for each feature j: x_ij ∼ F_j(λ_{y_i})
Maximize Log-likelihood:
ℒ = ∑_i ∑_j log F_{y_i}(x_ij; λ_{y_i}) + ∑_i log τ_{y_i}  s.t.  ∑_k τ_k = 1 and λ_k is valid for F_k
Multinomial Naïve Bayes: A Generative Story
Generative Story:
τ = distribution over K labels
for label k = 1 to K: λ_k = distribution over J feature values
for item i = 1 to N:
  y_i ∼ Cat(τ)
  for each feature j: x_ij ∼ Cat(λ_{y_i,j})
Maximize Log-likelihood:
ℒ = ∑_i ∑_j log λ_{y_i, x_ij} + ∑_i log τ_{y_i}  s.t.  ∑_k τ_k = 1, ∑_j λ_{kj} = 1 ∀k
Multinomial Naïve Bayes: A Generative Story
Generative Story:
τ = distribution over K labels
for label k = 1 to K: λ_k = distribution over J feature values
for item i = 1 to N:
  y_i ∼ Cat(τ)
  for each feature j: x_ij ∼ Cat(λ_{y_i,j})
Maximize Log-likelihood via Lagrange Multipliers (one multiplier per sum-to-one constraint):
ℱ = ∑_i ∑_j log λ_{y_i, x_ij} + ∑_i log τ_{y_i} − α (∑_k τ_k − 1) − ∑_k β_k (∑_j λ_{kj} − 1)
Multinomial Naïve Bayes: Learning
Calculate class priors. For each k: items_k = all items with class = k; p(k) = |items_k| / # items
Calculate feature generation terms. For each k: obs_k = single object containing all items labeled as k; for each feature j: n_kj = # of occurrences of j in obs_k; p(j|k) = n_kj / ∑_{j'} n_kj'
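The two count-based estimates can be written in a few lines; a sketch (the `train_multinomial_nb` helper and its input format, a list of (label, feature-value list) pairs, are assumptions for illustration):

```python
from collections import Counter, defaultdict

def train_multinomial_nb(items):
    """p(k) = |items_k| / #items and p(j|k) = n_kj / sum_j' n_kj'."""
    n = len(items)
    label_counts = Counter(label for label, _ in items)
    prior = {k: c / n for k, c in label_counts.items()}
    feature_counts = defaultdict(Counter)   # obs_k: pooled feature counts per class
    for label, feats in items:
        feature_counts[label].update(feats)
    likelihood = {k: {j: c / sum(cnt.values()) for j, c in cnt.items()}
                  for k, cnt in feature_counts.items()}
    return prior, likelihood
```

In practice one would smooth the counts (e.g. add-one) so unseen features don't zero out a class; the plain ratios above match the slide's MLE exactly.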
Brill and Banko (2001)
With enough data, the classifier may not matter
Adapted from Jurafsky & Martin (draft)
Summary: Naïve Bayes is Not So Naïve, but not without issue
Pro:
Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
Con:
Model the posterior in one go? (e.g., use conditional maxent)
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there "better" ways of handling missing/noisy data? (automated, more principled)
Adapted from Jurafsky & Martin (draft)
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Hidden Markov Models
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Class-based model: a bigram model of the classes; model all class sequences
∑_{z_1,…,z_N} p(z_1, w_1, z_2, w_2, …, z_N, w_N) = ∑_{z_1,…,z_N} ∏_i p(w_i|z_i) p(z_i|z_{i−1})
Hidden Markov Model
Goal: maximize (log-)likelihood
In practice: we don't actually observe these z values; we just see the words w
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1|z_0) p(w_1|z_1) ⋯ p(z_N|z_{N−1}) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i|z_{i−1})
Hidden Markov Model
Goal: maximize (log-)likelihood
p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i|z_i) p(z_i|z_{i−1})
if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
if we did observe z, estimating the probability parameters would be easy… but we don't! :(
Hidden Markov Model Terminology
Each z_i can take the value of one of K latent states
Transition and emission distributions do not change
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1|z_0) p(w_1|z_1) ⋯ p(z_N|z_{N−1}) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i|z_{i−1})
transition probabilities/parameters: p(z_i|z_{i−1}); emission probabilities/parameters: p(w_i|z_i)
Q: How many different probability values are there with K states and V vocab items?
A: V·K emission values and K² transition values
Hidden Markov Model Representation
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1|z_0) p(w_1|z_1) ⋯ p(z_N|z_{N−1}) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i|z_{i−1})
represent the probabilities and independence assumptions in a graph: a chain z_1 → z_2 → z_3 → z_4 → ⋯, with each state z_i emitting its observation w_i
transition arcs: p(z_1|z_0), p(z_2|z_1), p(z_3|z_2), p(z_4|z_3), where p(z_1|z_0) is the initial starting distribution
emission arcs: p(w_1|z_1), p(w_2|z_2), p(w_3|z_3), p(w_4|z_4)
Each z_i can take the value of one of K latent states
Transition and emission distributions do not change
Example: 2-state Hidden Markov Model as a Lattice
The lattice has one row per state (z_i = N and z_i = V) and one column per step, emitting w_1 w_2 w_3 w_4 …
Transition probabilities:
        N    V    end
start   .7   .2   .1
N       .15  .8   .05
V       .6   .35  .05
Emission probabilities:
     w1   w2   w3   w4
N    .7   .2   .05  .05
V    .2   .6   .1   .1
Each arc of the lattice carries a transition probability (p(N|start), p(V|start), p(N|N), p(V|N), p(N|V), p(V|V), …) and each node an emission probability (p(w_1|N), p(w_2|N), …, p(w_1|V), p(w_2|V), …), read off the two tables.
A Latent Sequence is a Path through the Graph
One path: start → N (emitting w_1) → V (emitting w_2) → V (emitting w_3) → N (emitting w_4), with probability p(N|start) p(w_1|N) · p(V|N) p(w_2|V) · p(V|V) p(w_3|V) · p(N|V) p(w_4|N)
Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7 × .7) × (.8 × .6) × (.35 × .1) × (.6 × .05) ≈ 0.000247
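The path probability can be checked by multiplying along the lattice; a sketch using the transition and emission tables from the 2-state example:

```python
trans = {('start', 'N'): .7, ('start', 'V'): .2, ('start', 'end'): .1,
         ('N', 'N'): .15, ('N', 'V'): .8, ('N', 'end'): .05,
         ('V', 'N'): .6, ('V', 'V'): .35, ('V', 'end'): .05}
emit = {('N', 'w1'): .7, ('N', 'w2'): .2, ('N', 'w3'): .05, ('N', 'w4'): .05,
        ('V', 'w1'): .2, ('V', 'w2'): .6, ('V', 'w3'): .1, ('V', 'w4'): .1}

def path_prob(states, words):
    """Probability of one latent path: product of transition * emission terms."""
    p, prev = 1.0, 'start'
    for s, w in zip(states, words):
        p *= trans[(prev, s)] * emit[(s, w)]
        prev = s
    return p

# (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05)
p = path_prob(['N', 'V', 'V', 'N'], ['w1', 'w2', 'w3', 'w4'])
```

This enumerates a single path only; summing over all K^N paths is what the forward algorithm later avoids doing explicitly.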
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number "one" to the soldier behind you.
If you are the rearmost soldier in the line, say the number "one" to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.
ITILA, Ch 16
What's the Maximum Weighted Path?
(figure: a small weighted graph with edge weights 9, 6, 7, 3, 3, 2, 1, 4; each node passes forward the best accumulated score over its incoming edges, so the maximum-weight path is found without enumerating every path)
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
What's the Maximum Value?
consider "any shared path ending with B (AB, BB, or CB)"
maximize across the previous hidden state values (z_{i−1} ∈ {A, B, C} → z_i ∈ {A, B, C}):
v(i, B) = max_{s′} v(i−1, s′) · p(B | s′) · p(obs at i | B)
v(i, B) is the maximum probability of any path to state B from the beginning (and emitting the observation)
computing v at time i−1 will correctly incorporate (maximize over) paths through time i−2: we correctly obey the Markov property
Viterbi algorithm

Viterbi Algorithm
v = double[N+2][K*]
b = int[N+2][K*]   // backpointers/book-keeping
v[*][*] = 0
v[0][START] = 1
for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    pobs = p_emission(obs_i | state)
    for (old = 0; old < K*; ++old) {
      pmove = p_transition(state | old)
      if (v[i-1][old] * pobs * pmove > v[i][state]) {
        v[i][state] = v[i-1][old] * pobs * pmove
        b[i][state] = old   // backpointers/book-keeping
      }
    }
  }
}
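A runnable sketch of the same max-product recurrence on the 2-state lattice tables (dictionary-keyed rather than array-indexed, and without an explicit END state, which here has equal probability .05 from both N and V, so the argmax is unaffected):

```python
def viterbi(obs, states, trans, emit):
    """Max-product message passing; returns the most likely state sequence."""
    v = [{'start': 1.0}]    # v[i][s]: best score of any path ending in s at step i
    back = [{}]             # backpointers / book-keeping
    for w in obs:
        cur, ptr = {}, {}
        for s in states:
            # maximize across the previous hidden state values
            prev, score = max(((p, pv * trans.get((p, s), 0.0))
                               for p, pv in v[-1].items()), key=lambda t: t[1])
            cur[s] = score * emit.get((s, w), 0.0)
            ptr[s] = prev
        v.append(cur)
        back.append(ptr)
    # best final state, then follow backpointers (back[1] maps to 'start')
    s = max(v[-1], key=v[-1].get)
    path = [s]
    for ptr in reversed(back[2:]):
        s = ptr[s]
        path.append(s)
    return list(reversed(path))

trans = {('start', 'N'): .7, ('start', 'V'): .2,
         ('N', 'N'): .15, ('N', 'V'): .8, ('V', 'N'): .6, ('V', 'V'): .35}
emit = {('N', 'w1'): .7, ('N', 'w2'): .2, ('N', 'w3'): .05, ('N', 'w4'): .05,
        ('V', 'w1'): .2, ('V', 'w2'): .6, ('V', 'w3'): .1, ('V', 'w4'): .1}
best = viterbi(['w1', 'w2', 'w3', 'w4'], ['N', 'V'], trans, emit)
```

For long sequences one would work in log space to avoid underflow; the structure of the recurrence is unchanged.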
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Forward Probability
α(i, B) is the total probability of all paths to state B from the beginning (lattice states z_{i−2}, z_{i−1}, z_i each range over {A, B, C})
Forward Probability
marginalize across the previous hidden state values:
α(i, B) = ∑_{s′} α(i−1, s′) · p(B | s′) · p(obs at i | B)
computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property
α(i, B) is the total probability of all paths to state B from the beginning
Forward Probability
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
α(i, s) = ∑_{s′} α(i−1, s′) · p(s | s′) · p(obs at i | s)
each term asks: what's the total probability up until now? what are the immediate ways to get into state s? how likely is it to get into state s this way?
Forward Probability
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
α(i, s) = ∑_{s′} α(i−1, s′) · p(s | s′) · p(obs at i | s)
Q: What do we return? (How do we return the likelihood of the sequence?) A: α[N+1][end]
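Replacing Viterbi's max with a sum gives a runnable forward pass; a sketch on the same 2-state tables, returning α[N+1][end]:

```python
def forward_likelihood(obs, states, trans, emit):
    """Sum-product version of the Viterbi recurrence: alpha(i, s) totals all paths."""
    alpha = [{'start': 1.0}]
    for w in obs:
        cur = {}
        for s in states:
            # marginalize across the previous hidden state values
            cur[s] = emit.get((s, w), 0.0) * sum(
                pv * trans.get((p, s), 0.0) for p, pv in alpha[-1].items())
        alpha.append(cur)
    # alpha[N+1][end]: fold in the transition to the end state
    return sum(pv * trans.get((p, 'end'), 0.0) for p, pv in alpha[-1].items())

trans = {('start', 'N'): .7, ('start', 'V'): .2,
         ('N', 'N'): .15, ('N', 'V'): .8, ('N', 'end'): .05,
         ('V', 'N'): .6, ('V', 'V'): .35, ('V', 'end'): .05}
emit = {('N', 'w1'): .7, ('N', 'w2'): .2, ('N', 'w3'): .05, ('N', 'w4'): .05,
        ('V', 'w1'): .2, ('V', 'w2'): .6, ('V', 'w3'): .1, ('V', 'w4'): .1}
L = forward_likelihood(['w1', 'w2', 'w3', 'w4'], ['N', 'V'], trans, emit)
```

The total likelihood necessarily exceeds the probability of any single path, which is a quick sanity check against the path-probability calculation earlier.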
Outline
Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models: Naïve Bayes, Hidden Markov Models
Message Passing (Directed Graphical Model Inference): Most likely sequence, Total (marginal) probability, EM in D-PGMs
Forward & Backward Message Passing
α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i at state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)
α(i, B) · β(i, B) = total probability of paths through state B at step i
Forward & Backward Message Passing
α(i, B) · p(s′ | B) · p(obs at i+1 | s′) · β(i+1, s′) = total probability of paths through the B → s′ arc (at time i)
With Both Forward and Backward Values
α(i, s) · β(i, s) = total probability of paths through state s at step i
α(i, s) · p(s′ | s) · p(obs at i+1 | s′) · β(i+1, s′) = total probability of paths through the s → s′ arc (at time i)
p(z_i = s | w_1, ⋯, w_N) = α(i, s) · β(i, s) / α(N+1, END)
p(z_i = s, z_{i+1} = s′ | w_1, ⋯, w_N) = α(i, s) · p(s′ | s) · p(obs_{i+1} | s′) · β(i+1, s′) / α(N+1, END)
Expectation Maximization (EM)
0. Assume some value for your parameters
Two-step, iterative algorithm
1. E-step: count under uncertainty, assuming these parameters (estimated counts)
2. M-step: maximize log-likelihood, assuming these uncertain counts
parameters: p_obs(w | s) and p_trans(s′ | s)
p̂(z_i = s | w_1, ⋯, w_N) = α(i, s) · β(i, s) / α(N+1, END)
p̂(z_i = s, z_{i+1} = s′ | w_1, ⋯, w_N) = α(i, s) · p(s′ | s) · p(obs_{i+1} | s′) · β(i+1, s′) / α(N+1, END)
M-Step
"maximize log-likelihood, assuming these uncertain counts"
if we observed the hidden transitions:
p_new(s′ | s) = c(s → s′) / ∑_x c(s → x)
we don't observe the hidden transitions, but we can approximately count:
p_new(s′ | s) = E_{s→s′}[c(s → s′)] / ∑_x E_{s→x}[c(s → x)]
we compute these expected counts in the E-step, with our α and β values
EM For HMMs (Baum-Welch Algorithm)
α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for (i = N; i ≥ 0; --i) {
  for (next = 0; next < K*; ++next) {
    c_obs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for (state = 0; state < K*; ++state) {
      u = p_obs(obs_{i+1} | next) * p_trans(next | state)
      c_trans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
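The E-step quantities can be checked numerically; a sketch on the 2-state tables that computes α, β, the likelihood L, and the posterior state occupancies γ(i, s) = α(i, s)·β(i, s)/L (E-step only; the M-step would renormalize the expected counts into new parameters):

```python
states, obs = ['N', 'V'], ['w1', 'w2', 'w3', 'w4']
trans = {('start', 'N'): .7, ('start', 'V'): .2,
         ('N', 'N'): .15, ('N', 'V'): .8, ('N', 'end'): .05,
         ('V', 'N'): .6, ('V', 'V'): .35, ('V', 'end'): .05}
emit = {('N', 'w1'): .7, ('N', 'w2'): .2, ('N', 'w3'): .05, ('N', 'w4'): .05,
        ('V', 'w1'): .2, ('V', 'w2'): .6, ('V', 'w3'): .1, ('V', 'w4'): .1}

# forward: alpha[i][s] = total probability of paths ending in s after i emissions
alpha = [{'start': 1.0}]
for w in obs:
    alpha.append({s: emit[(s, w)] * sum(pv * trans.get((p, s), 0.0)
                                        for p, pv in alpha[-1].items())
                  for s in states})

# backward: beta[i][s] = total probability of completing the sequence from s at step i
n = len(obs)
beta = [dict() for _ in range(n + 1)]
beta[n] = {s: trans[(s, 'end')] for s in states}
for i in range(n - 1, 0, -1):
    beta[i] = {s: sum(trans[(s, s2)] * emit[(s2, obs[i])] * beta[i + 1][s2]
                      for s2 in states)
               for s in states}

L = sum(alpha[n][s] * trans[(s, 'end')] for s in states)
# E-step estimated counts: posterior state occupancy gamma(i, s) = alpha * beta / L
gamma = [{s: alpha[i][s] * beta[i][s] / L for s in states} for i in range(1, n + 1)]
```

Each γ(i, ·) sums to one, i.e. α·β/L really is a posterior distribution over states at each step.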
Bayesian Networks: Directed Acyclic Graphs
p(x_1, x_2, x_3, …, x_n) = ∏_i p(x_i | π(x_i))
π(x_i): "parents of" x_i
(the product follows a topological sort of the graph)
Bayesian Networks: Directed Acyclic Graphs
p(x_1, x_2, x_3, …, x_n) = ∏_i p(x_i | π(x_i))
exact inference in general DAGs is NP-hard
inference in trees can be exact
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z
d-separation: X & Y are d-separated if for all paths P, one of the following is true:
P has a chain with an observed middle node (X → Z → Y)
P has a fork with an observed parent node (X ← Z → Y)
P includes a "v-structure" or "collider" with all unobserved descendants (X → Z ← Y)
For the chain and the fork, observing Z blocks the path from X to Y; for the v-structure, not observing Z blocks the path from X to Y.
For the v-structure X → Z ← Y:
p(x, y, z) = p(x) p(y) p(z | x, y)
p(x, y) = ∑_z p(x) p(y) p(z | x, y) = p(x) p(y)
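The last line can be verified by brute-force enumeration; a sketch with made-up CPTs for a binary v-structure:

```python
# v-structure: x -> z <- y, all variables binary; CPT numbers are arbitrary
px = {0: 0.7, 1: 0.3}
py = {0: 0.4, 1: 0.6}
pz = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
      (1, 0): {0: 0.2, 1: 0.8}, (1, 1): {0: 0.6, 1: 0.4}}

def p_xy(x, y):
    """p(x, y) = sum_z p(x) p(y) p(z | x, y): marginalize the collider out."""
    return sum(px[x] * py[y] * pz[(x, y)][z] for z in (0, 1))
```

Because p(z | x, y) sums to one for every (x, y), the marginal factorizes as p(x)p(y) regardless of the CPT values: the unobserved collider leaves x and y independent.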
Markov Blanket
The Markov blanket of a node x is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable x_i
p(x_i | x_{−i}) = p(x_1, …, x_n) / ∫ p(x_1, …, x_n) dx_i
= ∏_j p(x_j | π(x_j)) / ∫ ∏_j p(x_j | π(x_j)) dx_i   (factorization of the graph)
= ∏_{j: j=i or i∈π(x_j)} p(x_j | π(x_j)) / ∫ ∏_{j: j=i or i∈π(x_j)} p(x_j | π(x_j)) dx_i   (factor out terms not dependent on x_i)