
Page 1

Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson, [email protected]
University of Illinois at Urbana-Champaign, USA

Page 2

Lecture 8. Inference in Non-Tree Graphs

• The multiple-parent problem
• Solution #1: Parent merger
• Solution #2: Moralize, triangulate, and create a junction tree
• Inference in any DBN: the Sum-Product Algorithm in a junction tree
• Example: Factorial HMM
• Example: Zweig-triangle LVCSR
• Example: Articulatory phonology

Page 3

Example Problem: Find p(d|a)

[Figure: DAG with edges a → b, a → c, b → d, c → d]

• The correct answer is:
  p(d|a) = Σ_b Σ_c p(b|a) p(c|a) p(d|b,c)

• Try the sum-product algorithm. Propagate up, starting with node b: p(D_b|b) = p(d|b) = ???

• p(d|b) is no longer one of the parameters of the model; now it must be calculated from p(d|b,c).

• In fact, the calculation requires us to sum over every variable in the model:
  p(d|b) = Σ_c Σ_a p(a) p(c|a) p(d|b,c)

• High computational cost!
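As a concrete illustration of the cost, here is a minimal numpy sketch (not from the slides) of both computations for a toy discrete network; the table sizes and random values are hypothetical.

```python
import numpy as np

N = 3
rng = np.random.default_rng(0)

def random_cpt(shape):
    """Random table, normalized over its last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_a = random_cpt((N,))                 # p(a)
p_b_given_a = random_cpt((N, N))       # p(b|a), indexed [a, b]
p_c_given_a = random_cpt((N, N))       # p(c|a), indexed [a, c]
p_d_given_bc = random_cpt((N, N, N))   # p(d|b,c), indexed [b, c, d]

# Correct answer: p(d|a) = Σ_b Σ_c p(b|a) p(c|a) p(d|b,c)
p_d_given_a = np.einsum('ab,ac,bcd->ad', p_b_given_a, p_c_given_a, p_d_given_bc)
print(p_d_given_a.sum(axis=1))         # every row sums to 1

# The quantity a tree-structured sum-product pass would need, p(d|b), is NOT a
# model parameter; computing it drags in every other variable:
#   p(d|b) = Σ_c Σ_a p(a) p(c|a) p(d|b,c)
p_d_given_b = np.einsum('a,ac,bcd->bd', p_a, p_c_given_a, p_d_given_bc)
```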

Page 4

Conditional Independence of Descendants and Non-descendants

• The Sum-Product algorithm can use computations that are local at each node, v, because of the following theorem:

• Theorem: a Bayesian network is a tree if and only if, for every variable v, the descendants D_v and non-descendants N_v of v are conditionally independent given v: p(D_v, N_v | v) = p(D_v | v) p(N_v | v)

Page 5

Example: Descendants and Non-descendants in a Tree

• p(D_c | c) = p(d|c) p(e|c) p(f|c)

• p(N_c, c) = p(a) p(b|a) p(c|a)

[Figure: tree with edges a → b, a → c, c → d, c → e, c → f]

Page 6

Example: Descendants and Non-descendants in a Non-Tree

• p(D_c, c) = Σ_{a,b} p(a) p(b|a) p(c|a) p(d|b,c) p(e|c) p(f|c)

• p(N_c, c) = Σ_d p(a) p(b|a) p(c|a) p(d|b,c)

• So is it necessary for EVERY computation to be global?

[Figure: non-tree with edges a → b, a → c, b → d, c → d, c → e, c → f]

Page 7

Local Computations in a Non-Tree

• Here are some computations that can be local:
  – d depends only on the combination (b,c)
  – (b,c) depend only on a
  – e depends only on c, or equivalently, e depends on (b,c)

[Figure: same non-tree graph as the previous slide]

Page 8

The “Parent Merger” Algorithm

• Combine b,c into "super-node" bc
  – Number of possible values = (# of b values) × (# of c values)
  – p(bc | a) = p(b | a) p(c | a)
  – p(d | bc) = p(d | b,c)
  – p(e | bc) = p(e | c)

• Result is a tree

[Figure: before merging, a → b, a → c, b,c → d, c → e, c → f; after merging, a → bc, bc → d, bc → e, bc → f]
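The super-node construction above can be written down directly as table manipulations. The following numpy sketch (not from the slides) builds p(bc|a), p(d|bc), and p(e|bc) from the original CPTs; the table sizes and values are hypothetical.

```python
import numpy as np

Na, Nb, Nc, Nd, Ne = 2, 3, 4, 2, 2
rng = np.random.default_rng(1)
norm = lambda t: t / t.sum(axis=-1, keepdims=True)

p_b_given_a = norm(rng.random((Na, Nb)))       # p(b|a)
p_c_given_a = norm(rng.random((Na, Nc)))       # p(c|a)
p_d_given_bc = norm(rng.random((Nb, Nc, Nd)))  # p(d|b,c)
p_e_given_c = norm(rng.random((Nc, Ne)))       # p(e|c)

# Super-node bc has Nb * Nc possible values.
# p(bc | a) = p(b|a) p(c|a), flattened over (b, c)
p_bc_given_a = np.einsum('ab,ac->abc', p_b_given_a, p_c_given_a).reshape(Na, Nb * Nc)

# p(d | bc) = p(d | b, c): just reshape the original table
p_d_given_bc_merged = p_d_given_bc.reshape(Nb * Nc, Nd)

# p(e | bc) = p(e | c): repeat the original table over b, then flatten
p_e_given_bc = np.repeat(p_e_given_c[None, :, :], Nb, axis=0).reshape(Nb * Nc, Ne)

# Every row is still a proper distribution
assert np.allclose(p_bc_given_a.sum(axis=1), 1.0)
assert np.allclose(p_e_given_bc.sum(axis=1), 1.0)
```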

Page 9

Sum-Product Algorithm with Super-Nodes

• Propagate Up:
  – p(D_bc | bc) = p(d|bc) p(e|bc) Σ_f p(f|bc)
  – p(D_a | a) = Σ_bc p(D_bc | bc) p(bc | a)

• Propagate Down:
  – p(bc, N_bc) = Σ_a p(a) p(bc | a)
  – p(f, N_f) = Σ_bc p(bc, N_bc) p(f | bc) p(e | bc) p(d | bc)

• Multiply:
  – p(bc, d, e) = p(bc, N_bc) p(D_bc | bc)
  – p(c, d, e) = Σ_b p(bc, d, e)

[Figure: merged tree a → bc, bc → d, bc → e, bc → f]
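Here is a minimal numpy sketch (not from the slides) of the propagate-up / propagate-down / multiply steps on the merged tree, checked against brute-force enumeration; all tables and the observed values of d and e are hypothetical.

```python
import numpy as np

Na, Nb, Nc, Nd, Ne, Nf = 2, 2, 3, 2, 2, 2
rng = np.random.default_rng(2)
norm = lambda t: t / t.sum(axis=-1, keepdims=True)

p_a = norm(rng.random(Na))
p_b_a = norm(rng.random((Na, Nb)))       # p(b|a)
p_c_a = norm(rng.random((Na, Nc)))       # p(c|a)
p_d_bc = norm(rng.random((Nb, Nc, Nd)))  # p(d|b,c)
p_e_c = norm(rng.random((Nc, Ne)))       # p(e|c); in the merged tree p(e|bc) = p(e|c)
p_f_bc = norm(rng.random((Nb, Nc, Nf)))  # p(f|bc)

d_obs, e_obs = 1, 0   # hypothetical observed values of d and e

# Propagate up: p(D_bc | bc) = p(d|bc) p(e|bc) Σ_f p(f|bc)
up = p_d_bc[:, :, d_obs] * p_e_c[None, :, e_obs] * p_f_bc.sum(axis=2)
# Propagate down: p(bc, N_bc) = Σ_a p(a) p(bc|a), with p(bc|a) = p(b|a) p(c|a)
down = np.einsum('a,ab,ac->bc', p_a, p_b_a, p_c_a)
# Multiply, then marginalize b: p(c,d,e) = Σ_b p(bc, N_bc) p(D_bc | bc)
p_cde = (down * up).sum(axis=0)

# Brute-force check: sum the full joint over a, b, and f
check = np.einsum('a,ab,ac,bc,c,bcf->c', p_a, p_b_a, p_c_a,
                  p_d_bc[:, :, d_obs], p_e_c[:, e_obs], p_f_bc)
assert np.allclose(p_cde, check)
```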

Page 10

The "Parent Merger" Algorithm

• Algorithm #1 for turning a non-tree into a tree:
  – If any node has multiple parents, merge them
  – If any resulting supernode has multiple parents, merge them
  – Repeat until no node has multiple parents

• Why this algorithm is sometimes undesirable:
  – In an upward-branching graph, this results in a supernode with N_g N_h N_i N_j = many possible values
  – Many values → lots of computation

[Figure: an upward-branching graph in which repeated parent merger produces the supernodes {bc}, {def}, {ghij}, {klm}, and {no}]

Page 11

Algorithm #2: Junction Trees

• Moralize
• Triangulate
• Read off the cliques into a Junction Tree
• Add variables to cliques, as necessary, to ensure Locality of Influence

Page 12

Moralization

• "Moralization" is the process of connecting the parents of every node.
• Goal: to show that the values of the parents cannot really be independently computed.
• Once the graph has been moralized, we usually show it as an undirected graph --- the dependency structure will still be necessary for inference, but it is not needed for finding the best junction tree.

[Figure: a directed graph over a, b, c, d, e, f, n, o, p, and the same graph after moralization (parents of each node connected, edge directions dropped)]
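A minimal Python sketch of moralization, assuming the graph is given as a dict mapping each node to its parents (the example DAG is the non-tree from the earlier slides):

```python
from itertools import combinations

def moralize(parents):
    """Return an undirected graph (node -> set of neighbors) from a DAG
    given as a dict mapping each node to the list of its parents."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    undirected = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:                          # keep child-parent edges, now undirected
            undirected[child].add(p)
            undirected[p].add(child)
        for p1, p2 in combinations(ps, 2):    # "marry" every pair of parents
            undirected[p1].add(p2)
            undirected[p2].add(p1)
    return undirected

# The non-tree from the earlier slides: d has two parents, so moralization adds edge b--c.
dag = {'b': ['a'], 'c': ['a'], 'd': ['b', 'c'], 'e': ['c'], 'f': ['c']}
print(sorted(moralize(dag)['b']))             # ['a', 'c', 'd']
```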

Page 13

Triangulation

• A "triangulated" or "chordal" graph is a graph with no chord-free cycles of length greater than three.
• "Triangulation" is the process of adding edges to a graph in order to make it chordal.

[Figure: a directed graph over a, b, c, d, e, f, g, h, i, j; the same graph after moralization; and the same graph after triangulation]

Page 14

Digression: Why are Moralization and Triangulation Allowed?

• An edge connecting a→b means that, during inference, we must use a probability table of size N_a N_b: p(b|a)

• One special case of the p(b|a) probability table is the case in which every row is the same, i.e., p(b|a) = p(b)

• Therefore the graph without an a→b edge is a special case of the graph with an a→b edge.

• Put another way, information about the special features of a problem is coded by absent edges, not present edges.

• Adding edges is equivalent to forcing yourself to solve a harder, more general problem, rather than a simple specific one.

[Figure: two graphs over a, b, c --- one in which a and b are both parents of c but are not connected to each other, and one in which the edge a→b has been added]

Page 15

Cliques

• A "clique" is a group of nodes, all of which are connected together
• The "separator" of two cliques is the set of nodes that are members of both cliques. For example, clique {e,f,g} and clique {d,e,f} have {e,f} as their separator.

[Figure: the triangulated graph over a, b, c, d, e, f, g, h, i, j from the previous slide]

Page 16

Forming a Junction Tree

• It is always possible to create a Junction Tree from a triangulated graph using the following algorithm:
  – Start with any clique as the root node
  – Next comes the clique whose separator with the root node is largest
  – Locality of influence: if cliques A and B both contain node c, then node c must also be added to every clique on the path between A and B in the junction tree

[Figure: the triangulated graph over a, b, c, d, e, f, g, h, i, j and its junction tree with cliques (a,b,d) – (c,d,e) – (d,e,f) – (e,f,g) – (f,g,h) – (g,h,j) – (g,j)]
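Here is a minimal Python sketch (not the algorithm used by any particular toolkit) of triangulation by vertex elimination with a min-fill heuristic, followed by junction-tree assembly as a maximum spanning tree over separator sizes; the elimination heuristic and the example moralized graph are assumptions for illustration.

```python
from itertools import combinations

def triangulate(undirected):
    """Eliminate vertices (min-fill heuristic), adding chords as we go.
    Returns the maximal cliques of the resulting chordal graph."""
    g = {v: set(nbrs) for v, nbrs in undirected.items()}
    remaining = set(g)
    cliques = []
    while remaining:
        def fill_in(v):
            nbrs = g[v] & remaining
            return sum(1 for u, w in combinations(nbrs, 2) if w not in g[u])
        v = min(remaining, key=fill_in)        # cheapest node to eliminate next
        nbrs = g[v] & remaining
        for u, w in combinations(nbrs, 2):     # connect v's remaining neighbors
            g[u].add(w); g[w].add(u)
        cliques.append(frozenset({v} | nbrs))
        remaining.remove(v)
    return [c for c in cliques if not any(c < d for d in cliques)]  # drop subsumed cliques

def junction_tree(cliques):
    """Kruskal-style maximum spanning tree over separator sizes."""
    edges = [(len(a & b), a, b) for a, b in combinations(cliques, 2)]
    edges.sort(key=lambda e: e[0], reverse=True)
    parent = {c: c for c in cliques}
    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c
    tree = []
    for w, a, b in edges:
        if w > 0 and find(a) != find(b):
            parent[find(a)] = find(b)
            tree.append((a, b, a & b))
    return tree

# The moralized non-tree from the earlier slides: a-b, a-c, b-c, b-d, c-d, c-e, c-f
moral = {'a': {'b', 'c'}, 'b': {'a', 'c', 'd'}, 'c': {'a', 'b', 'd', 'e', 'f'},
         'd': {'b', 'c'}, 'e': {'c'}, 'f': {'c'}}
for a, b, sep in junction_tree(triangulate(moral)):
    print(sorted(a), '--', sorted(sep), '--', sorted(b))
```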

Page 17

Triangulation: A Hard Example

• In this example, the graph on the left is not yet fully triangulated (for example, the cycle d–b–c–f–o–n has no chord). Here is one possible triangulation algorithm: create a junction tree, then add variables to the cliques as necessary to maintain locality of influence.

[Figure: the graph over a, b, c, d, e, f, n, o, p, and a tentative junction tree with cliques (a,b,c) – (b,c,e?) – (b,d,e?) – (c,e,f?) – (d,e,n?) – (e,f,o?) – (e,n,o?) – (n,o,p); the question marks flag cliques that may still need extra variables]

Page 18

Example: Triangulation to Maintain Locality of Influence

• Every node that's in both the second clique and the fourth clique must also exist in the third clique

[Figure: same graph and tentative junction tree as the previous slide]

Page 19

Example: Triangulation to Maintain Locality of Influence

• Add node c to the 3rd clique, because c is in both 2nd and 4th cliques. Putting c into this clique is equivalent to drawing an edge between c and d, so that b,c,d,e are all interconnected.

[Figure: junction tree with cliques (a,b,c) – (b,c,e?) – (b,c,d,e) – (c,e,f?) – (d,e,n?) – (e,f,o?) – (e,n,o?) – (n,o,p)]

Page 20

Example: Triangulation to Maintain Locality of Influence

• … and then delete clique (b,c,e), because it’s now redundant with clique (b,c,d,e).

[Figure: junction tree with cliques (a,b,c) – (b,c,d,e) – (c,e,f?) – (d,e,n?) – (e,f,o?) – (e,n,o?) – (n,o,p)]

Page 21

Example: Triangulation to Maintain Locality of Influence

• Similar reasoning: d is in the 4th clique, so it had better be added to the 3rd. This is equivalent to drawing an edge between d and f.

[Figure: junction tree with cliques (a,b,c) – (b,c,d,e) – (c,d,e,f) – (d,e,n?) – (e,f,o?) – (e,n,o?) – (n,o,p)]

Page 22

Example: Triangulation to Maintain Locality of Influence

• … and f is in the 5th clique, so it had better be added to the 4th. This is equivalent to drawing an edge between f and n.

[Figure: junction tree with cliques (a,b,c) – (b,c,d,e) – (c,d,e,f) – (d,e,f,n) – (e,f,o?) – (e,n,o?) – (n,o,p)]

Page 23

Finishing Up

[Figure: final junction tree with cliques (a,b,c) – (b,c,d,e) – (c,d,e,f) – (d,e,f,n) – (e,f,n,o) – (n,o,p)]

Page 24

Inference in a Junction Tree

• The "clique" representation means that all inference is contained within one clique. For each clique:
  – Input: a probability table for the values of the lower separator. Size of this table = N^I, where N is the number of possible values of each variable, and I is the number of nodes in the input separator.
  – Product: multiply by information about the other nodes in the clique. The resulting table is of size N^C, where C is the number of nodes in the clique.
  – Sum: for each possible setting of the variables in the output separator (N^O possible settings), marginalize out the values of all other variables (a sum containing N^(C−O) terms). Total complexity: O{N^C}.
  – Pass the resulting table to the next clique (see the numpy sketch below).
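A minimal numpy sketch of one such product-and-sum step (not from the slides); the axis layout, table shapes, and values are hypothetical.

```python
import numpy as np

def clique_update(incoming, clique_potential, sum_axes):
    """One clique's step: multiply the incoming separator table into the
    clique table (O(N^C) products), then marginalize out the axes that are
    not in the output separator (O(N^C) additions)."""
    product = incoming * clique_potential
    return product.sum(axis=sum_axes)

# Example: N = 4 values per variable, a clique over (d, e, f) with axes (0, 1, 2).
N = 4
rng = np.random.default_rng(0)
msg_ef = rng.random((1, N, N))       # incoming message over separator {e, f}
phi_def = rng.random((N, N, N))      # clique table over (d, e, f)
msg_de = clique_update(msg_ef, phi_def, sum_axes=2)  # marginalize out f
print(msg_de.shape)                  # (4, 4): table over the output separator {d, e}
```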

Page 25

Inference Example

• Suppose b and o are observed, and we want to find p(♦, b, o) for all variables ♦.
• Clique noq:
  – Output separator variables are n,o
  – Product: p(observation, q | n,o) = p(q | n,o)
  – Sum: p(observation | n,o) = Σ_q p(q | n,o) = 1
• We could have skipped this step by observing that p(o | o) is always 1

[Figure: the graph over a, b, c, d, e, f, n, o, q with b and o observed, and its junction tree (a,b,c) – (b,c,d,e) – (c,d,e,f) – (d,e,f,n) – (e,f,n,o) – (n,o,q)]

Page 26

Inference Example

• Clique efno:
  – Input: p(observation | n,o)
  – Product: nodes not in the output separator are moved to the left:
    • p(observation, o | e,f,n) = p(observation | n,o) p(o | e,f,n)
  – Sum: over unobserved elements not in the output separator (o is observed, so there is nothing to sum):
    • p(observation | e,f,n) = p(observation, o | e,f,n)
  – Output: probabilities for every setting of the output separator:
    • p(observation | e,f,n)

[Figure: same graph and junction tree as the previous slide]

Page 27

Inference Example

• Clique defn:
  – Input: p(observation | e,f,n)
  – Product: nodes not in the output separator are moved to the left:
    • p(observation, n | d,e,f) = p(observation | e,f,n) p(n | d,e,f)
  – Sum: over unobserved elements not in the output separator:
    • p(observation | d,e,f) = Σ_n p(observation, n | d,e,f)
  – Output:
    • p(observation | d,e,f)

[Figure: same graph and junction tree as before]

Page 28

Inference Example: "Propagate Down"

• Clique abc:
  – Product: p(a,b,c) = p(b,c|a) p(a)
  – Sum: over every unobserved variable that's not in the output separator:
    • p(b,c) = Σ_a p(a,b,c)

[Figure: same graph and junction tree as before]

Page 29

Inference Example: "Propagate Down"

• Clique bcde:
  – Product: p(b,c,d,e) = p(d,e|b,c) p(b,c)
  – Sum: over every unobserved variable that's not in the output separator (b is observed, so there is nothing to sum):
    • p(observations above, c,d,e) = p(b,c,d,e)
  – Output: a probability table of size N^3:
    • p(observations above, c,d,e)

[Figure: same graph and junction tree as before]

Page 30

Inference Example: "Propagate Down"

• Clique cdef:
  – Input: p(observations above, c,d,e)
  – Product: p(observations above, c,d,e,f) = p(f|c,d,e) p(observations above, c,d,e)
  – Sum: over every unobserved variable that's not in the output separator:
    • p(observations above, d,e,f) = Σ_c p(observations above, c,d,e,f)
  – Output: a probability table of size N^3:
    • p(observations above, d,e,f)

[Figure: same graph and junction tree as before]

Page 31

… and so on…

Page 32

Computational Complexity

• Complexity of inference is O{N^C}, where N is the number of values each node takes and C is the number of nodes in the largest clique.
  – More precisely, complexity is O{max Π_i N_i}, where the ith variable in the clique takes values in the range 1 ≤ v_i ≤ N_i, and the max is taken over all cliques.
• Therefore, a triangulation algorithm should minimize the maximum clique.
• Unfortunately, automatic minimum-maximum-clique triangulation is NP-hard. Good approximate algorithms exist, but…
• Humans are better at this than machines: design your graph with small cliques.

Page 33

Factorial HMM (FHMM)

• Factorial HMM:
  – q_t and v_t represent two different types of background information, each with its own history
  – Observations x_t depend on both hidden processes
• Model parameters:
  – p(v_t+1|v_t), p(q_t+1|q_t), p(x_t|q_t,v_t)
• Computational complexity of the Sum-Product Algorithm:
  – O{N^4 T} using "parent-merger" triangulation
  – O{N^3 T} using a better triangulation (five slides from now)

[Figure: FHMM as a DBN --- two parallel Markov chains q_1 … q_T and v_1 … v_T, with each observation x_t depending on both q_t and v_t]

Page 34

Example: Speech in Music (Deoras and Hasegawa-Johnson, ICSLP 2004)

• q_t = one person speaking
  – Speech log spectrum given by p_s(y_t(e^jω)|q_t) = mixture Gaussian
• v_t = music playing in the background
  – Music log spectrum given by p_m(z_t(e^jω)|v_t) = mixture Gaussian
• Observed log spectrum = max(speech, music)
  – x_t(e^jω) ≈ max(y_t(e^jω), z_t(e^jω))     (x_t(e^jω) ≥ max(y_t(e^jω), z_t(e^jω)) ≥ x_t(e^jω) − 6 dB)
  – p(x_t|q_t,v_t) = p_s(x_t|q_t) ∫_{−∞}^{x_t} p_m(z|v_t) dz + p_m(x_t|v_t) ∫_{−∞}^{x_t} p_s(y|q_t) dy

[Figure: FHMM as a DBN, with q_t the speech state, v_t the music state, and x_t the observed log spectrum]
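A minimal sketch (not the paper's implementation) of the max-model observation likelihood above, using a single Gaussian per state per frequency bin instead of a mixture; the means, variances, and the 64-bin frame are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def max_model_loglik(x, speech_mean, speech_std, music_mean, music_std):
    """log p(x|q,v), with p(x|q,v) = p_s(x|q) CDF_m(x|v) + p_m(x|v) CDF_s(x|q),
    evaluated per log-spectral bin and summed in the log domain."""
    p = (norm.pdf(x, speech_mean, speech_std) * norm.cdf(x, music_mean, music_std)
         + norm.pdf(x, music_mean, music_std) * norm.cdf(x, speech_mean, speech_std))
    return np.sum(np.log(p + 1e-300))

# One hypothetical frame of 64 log-spectral bins.
rng = np.random.default_rng(0)
x_frame = rng.normal(0.0, 1.0, size=64)
print(max_model_loglik(x_frame, speech_mean=0.5, speech_std=1.0,
                       music_mean=-0.5, music_std=2.0))
```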

Page 35

AVSR: The Boltzmann Zipper (Hennecke, Stork, and Prasad, 1996)

• Same as the AVSR model from last time, except that now v_t has memory, independent of q_t. Model parameters:
  – p(q_t+1|q_t), p(x_t|q_t)
  – p(v_t+1|v_t,q_t+1), p(y_t|q_t)
• Sum-Product algorithm: O{N^3 T}, just like the FHMM
• The extra observations add complexity of only O{T}

[Figure: DBN with audio phoneme states q_t, audio spectral observations x_t, viseme states v_t, and video observations y_t]

Page 36

AVSR: The Coupled HMM (Chu and Huang, 2000)

• Advantage over the Boltzmann Zipper: more flexible, because neither vision nor sound is "privileged" over the other.
  – p(q_t+1|v_t,q_t), p(x_t|q_t)
  – p(v_t+1|v_t,q_t), p(y_t|q_t)
• Disadvantage: it can't be triangulated like the FHMM, so complexity is O{N^4 T} rather than O{N^3 T}

[Figure: coupled-HMM DBN with audio phoneme states q_t, audio spectral observations x_t, viseme states v_t, and video observations y_t]

Page 37

Inference using Parent Merger

• N_t = observed non-descendants of (q_t,v_t) = {x_1,…,x_t−1}
• D_t = observed descendants of (q_t,v_t) = {x_t,…,x_T}
• Forward algorithm:
  – p(N_t+1,q_t+1,v_t+1) = Σ_{q_t} Σ_{v_t} p(x_t | q_t,v_t) p(q_t+1 | q_t) p(v_t+1 | v_t) p(N_t,q_t,v_t)
• Backward algorithm:
  – p(D_t | q_t,v_t) = p(x_t | q_t,v_t) Σ_{q_t+1} Σ_{v_t+1} p(q_t+1 | q_t) p(v_t+1 | v_t) p(D_t+1 | q_t+1,v_t+1)
• Complexity:
  – (T frames) × (N^2 sums/frame) × (N^2 terms/sum) = O{N^4 T}

[Figure: FHMM DBN as before]
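A minimal numpy sketch (not from the slides) of this parent-merger forward pass, with an explicit O(N^4) update per frame; the transition and observation tables are random placeholders, and the frame-t observation is folded into the forward variable directly.

```python
import numpy as np

N, T = 3, 5
rng = np.random.default_rng(0)
norm = lambda t: t / t.sum(axis=-1, keepdims=True)

A_q = norm(rng.random((N, N)))      # p(q_{t+1} | q_t)
A_v = norm(rng.random((N, N)))      # p(v_{t+1} | v_t)
p_q1, p_v1 = norm(rng.random(N)), norm(rng.random(N))
obs_lik = rng.random((T, N, N))     # placeholder for p(x_t | q_t, v_t), per frame

alpha = p_q1[:, None] * p_v1[None, :] * obs_lik[0]      # p(x_1, q_1, v_1)
for t in range(T - 1):
    # Merged-state update: for every (q_{t+1}, v_{t+1}), sum over every (q_t, v_t) -- O(N^4)
    alpha = np.einsum('qv,qr,vs->rs', alpha, A_q, A_v) * obs_lik[t + 1]
print(alpha.sum())                   # p(x_1, ..., x_T)
```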

Page 38

A Smarter Triangulation

• Forward Algorithm, step 1:
  – p(q_t+1,v_t,N_t+1) = Σ_{q_t} p(x_t | q_t,v_t) p(q_t+1 | q_t) p(q_t,v_t,N_t)

[Figure: FHMM DBN as before]

Page 39

A Smarter Triangulation

• Forward Algorithm, step 1:
  – p(q_t+1,v_t,N_t+1) = Σ_{q_t} p(x_t | q_t,v_t) p(q_t+1 | q_t) p(q_t,v_t,N_t)
• Forward Algorithm, step 2:
  – p(q_t+1,v_t+1,N_t+1) = Σ_{v_t} p(v_t+1|v_t) p(q_t+1,v_t,N_t+1)
• Computational complexity:
  – (T frames) × (2N^2 sums/frame) × (N terms/sum) = O{N^3 T}
  – Complexity is N times higher than that of a one-stream HMM

[Figure: FHMM DBN as before]
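The same forward pass can be split into the two O(N^3) steps above. A minimal numpy sketch (not from the slides), using the same placeholder tables as the previous sketch so the two versions can be compared:

```python
import numpy as np

N, T = 3, 5
rng = np.random.default_rng(0)
norm = lambda t: t / t.sum(axis=-1, keepdims=True)

A_q = norm(rng.random((N, N)))      # p(q_{t+1} | q_t)
A_v = norm(rng.random((N, N)))      # p(v_{t+1} | v_t)
p_q1, p_v1 = norm(rng.random(N)), norm(rng.random(N))
obs_lik = rng.random((T, N, N))     # placeholder for p(x_t | q_t, v_t), per frame

alpha = p_q1[:, None] * p_v1[None, :] * obs_lik[0]            # p(x_1, q_1, v_1)
for t in range(T - 1):
    half = np.einsum('qv,qr->rv', alpha, A_q)                 # step 1: advance q, sum over q_t -- O(N^3)
    alpha = np.einsum('rv,vs->rs', half, A_v) * obs_lik[t + 1]  # step 2: advance v, sum over v_t -- O(N^3)
print(alpha.sum())                   # identical to the O(N^4) result above
```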

Page 40

"Compiling" an FHMM into an HMM

• Purpose: GMTK (Bilmes and Zweig, ICASSP 2002) can implement an FHMM directly, but by compiling the FHMM to an HMM we can also use HTK (Young, Evermann, Hain et al., 2002) and other software tools
• Method:
  – Each state specifies the variables in the output separator of one clique, e.g., (q_t+1,v_t) is the separator between cliques (q_t,q_t+1,v_t) and (q_t+1,v_t,v_t+1).
  – The transition probability matrix p(q_t+1,v_t | q_t,v_t) is N^2 × N^2, but only N^3 entries can be non-zero, so complexity is O{N^3}

[Figure: chain of compiled states (q_t,v_t) → (q_t+1,v_t) → (q_t+1,v_t+1), with observations x_t and x_t+1]
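A minimal numpy sketch (not from the slides) of the compiled half-step transition matrix p(q_t+1, v_t | q_t, v_t), showing that only N^3 of its N^4 entries are non-zero; the flattening convention and the table A_q are assumptions.

```python
import numpy as np

N = 3
rng = np.random.default_rng(0)
A_q = rng.random((N, N))
A_q /= A_q.sum(axis=1, keepdims=True)        # p(q_{t+1} | q_t)

def index(q, v):
    """Flatten a (q, v) pair into a single compiled-state index."""
    return q * N + v

trans = np.zeros((N * N, N * N))
for q in range(N):
    for v in range(N):
        for q_next in range(N):
            # v is copied unchanged; only q advances in this half-step
            trans[index(q, v), index(q_next, v)] = A_q[q, q_next]

print(np.count_nonzero(trans), "non-zero entries out of", trans.size)   # N^3 out of N^4
print(trans.sum(axis=1))                     # every row still sums to 1
```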

Page 41

A Note on Parameter Tying

• The transition probability table specifies p(separator|separator), e.g., p(q_t+1,v_t | q_t,v_t)
• The only non-zero entries are those specifying variables that differ between separators, e.g., p(q_t+1 | q_t,v_t)
• With "parameter tying," we can constrain different elements of the transition matrix to equal one another, thus forcing the model to match the condition p(q_t+1|q_t,v_t) = p(q_t+1|q_t).
  – If the two chains are known to be truly independent, e.g., speech and background music, parameter tying may help to avoid over-training the model.
  – If the two chains are possibly dependent, allow the full transition matrix p(q_t+1|q_t,v_t): the result is a Boltzmann zipper.

[Figure: chain of compiled states (q_t,v_t) → (q_t+1,v_t) → (q_t+1,v_t+1), as on the previous slide]

Page 42

"Compiling" an FHMM into an HMM

• In order to handle non-emitting states, we need a total of 2N^2 junction states: N^2 emitting, N^2 non-emitting
• The finite-state diagram looks like this (NOT A DBN --- this is here to help you design the HTK configuration, if desired):

[Figure: for N=2, emitting states (q_t,v_t) ∈ {(1,1),(1,2),(2,1),(2,2)} alternate with non-emitting states (q_t+1,v_t); blue arrows mark left-to-right transitions, red arrows right-to-left transitions, and black arrows both; there are no self-loops; each emitting state (q,v) has observation PDF p(x_t | q_t=q, v_t=v)]

Page 43

Graphical Models for Large-Vocabulary Speech Recognition

Page 44

"Zweig Triangles" (Zweig, 1998)

• w_t: word. 1 ≤ w_t ≤ N_w
  – N_w = # of words in the vocabulary
  – p(w_t+1 = w_t | w_t, wdTr_t = 0) = 1
  – p(w_t+1 | w_t, wdTr_t = 1) = bigram word grammar
• i_t: segment index. 1 ≤ i_t ≤ N_i
  – p(i_t+1 | i_t, wdTr_t = 0) > 0 iff i_t ≤ i_t+1 ≤ i_t + 1
  – p(i_t+1 = 1 | i_t, wdTr_t = 1) = 1
• wdTr_t: is there a word transition?
  – p(wdTr_t = 0 | i_t < N_i) = 1
  – p(wdTr_t = 1 | i_t = N_i) = probability the word ends
• q_t: segment label; for example, q_t could equal "/aa/ state 3."
  – p(q_t | i_t, w_t) = probability that the i_t-th phonetic segment in w_t is q_t
  – Often deterministic: p(q_t | i_t, w_t) = 1 iff q_t is the i_t-th phone of w_t
• x_t: observation
  – p(x_t | q_t) usually a mixture Gaussian

[Figure: DBN with variables w_t, i_t, q_t, x_t, wdTr_t in frame t and w_t+1, i_t+1, q_t+1, x_t+1, wdTr_t+1 in frame t+1]
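A minimal Python sketch (not Zweig's implementation, and not GMTK) of the deterministic parts of these CPDs; the two-word lexicon, the per-word segment count used in place of a fixed N_i, the end-of-word probability, and the uniform stay/advance choice are all hypothetical.

```python
# A two-word toy lexicon; all names and probability values below are hypothetical.
LEXICON = {0: ['hh', 'ax', 'l', 'ow'],    # word 0
           1: ['w', 'er', 'l', 'd']}      # word 1

def p_segment_label(q, i, w):
    """p(q_t | i_t, w_t): deterministic -- 1 iff q is the i-th phone of word w (i is 1-based)."""
    return 1.0 if LEXICON[w][i - 1] == q else 0.0

def p_word_transition(wdTr, i, w, p_end=0.3):
    """p(wdTr_t | i_t, w_t): a word transition is only possible at the word's last segment."""
    if i < len(LEXICON[w]):
        return 1.0 if wdTr == 0 else 0.0
    return p_end if wdTr == 1 else 1.0 - p_end

def p_next_index(i_next, i, wdTr, n_segments):
    """p(i_t+1 | i_t, wdTr_t): restart at 1 after a word transition; otherwise
    stay or advance by one (a uniform choice over whatever is allowed)."""
    if wdTr == 1:
        return 1.0 if i_next == 1 else 0.0
    allowed = {i, min(i + 1, n_segments)}
    return 1.0 / len(allowed) if i_next in allowed else 0.0

print(p_segment_label('l', 3, 0))   # 1.0: 'l' is the 3rd phone of word 0
print(p_word_transition(1, 4, 0))   # 0.3: the word may end at its last segment
print(p_next_index(3, 2, 0, 4))     # 0.5: advance from segment 2 to segment 3
```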

Page 45

Example: Pronunciation Variability

• Pronunciation variability (e.g., apparent deletions or substitutions of phonemes) can be parsimoniously described as resulting from asynchrony and reduction of quasi-independent tract variables such as the LIP-OPENING and TONGUE-TIP-OPENING (Browman & Goldstein 1990):

[Figure: gestural score with tract variables LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, and VOICING]

Page 46

A DBN Model of Articulatory Phonology for Speech Recognition

(Livescu and Glass, 2004)

• word_t: word ID at frame #t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i,j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator #i
• S_t^i: surface setting of articulator #i

Page 47

Summary

• Multiple parents violate conditional independence of descendants and non-descendants → the sum-product algorithm fails
• A fast solution: parent merger
• A more computationally efficient solution:
  – Moralize
  – Triangulate
  – Create a junction tree
• The Sum-Product Algorithm in a junction tree has complexity O{N^C}, where C is the number of nodes in the largest clique
• Example: Factorial HMM
  – Applications: speech with background noise, audiovisual speech
  – Complexity: O{N^4} with parent merger, O{N^3} with triangulation
• Example: Large Vocabulary Speech Recognition
  – Zweig triangles: word grammar and phone model in one graph
  – Livescu model: a DBN for pronunciation variability