
Page 1: Discriminative Learning for Hidden Markov Models


Discriminative Learning for Hidden Markov Models

Li Deng

Microsoft Research

EE 516; UW Spring 2009

Page 2: Discriminative Learning for Hidden Markov Models


Minimum Classification Error (MCE)

The objective function of MCE training is a smoothed recognition error rate.

Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., the generalized probabilistic descent, GPD, algorithm).

In this work, we propose a growth-transformation (GT) based method for MCE model estimation.

Page 3: Discriminative Learning for Hidden Markov Models


Automatic Speech Recognition (ASR)

Speech signal of the r-th utterance: (sil) OH (sil) SIX EIGHT (sil)

Segment into frames: 1, 2, 3, 4, …, T

Spectrum analysis: $X_r = x_1, x_2, x_3, x_4, \ldots, x_t, \ldots, x_T$

Speech recognition (decoding):

$$s_r^* = \arg\max_{s_r} \log p_\Lambda(s_r \mid X_r) = \arg\max_{s_r} \log p_\Lambda(X_r, s_r)$$
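As a toy sketch of this decoding rule (the candidate list and scores below are hypothetical; a real decoder searches a lattice rather than enumerating strings):

```python
def decode(candidates, log_joint):
    """Return the string s maximizing log p_Lambda(X, s) over the candidates."""
    return max(candidates, key=log_joint)

# Hypothetical log-joint scores for two candidate transcriptions of X_r.
scores = {"OH SIX EIGHT": -42.7, "OH EIGHT SIX": -40.3}
print(decode(scores, scores.get))  # -> "OH EIGHT SIX", the higher log p(X, s)
```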

Page 4: Discriminative Learning for Hidden Markov Models


Models (feature functions) in ASR

ASR in the log-linear framework:

$$p_\Lambda(X_r, s_r) \propto \exp\Big(\sum_{m=1}^{3} \lambda_m h_m(s_r, X_r)\Big)$$

with feature functions

$h_1(s_r, X_r) = \log p(X_r \mid s_r; \Lambda)$ (acoustic model)

$h_2(s_r, X_r) = \log p(s_r)$ (language model)

$h_3(s_r, X_r) = |s_r|$ (word count)

and weights $\lambda_1 = 1$, $\lambda_2 = s$ (LM scale), $\lambda_3 = p$ (word insertion penalty).

$\Lambda$ is the parameter set of the acoustic model (HMM); it is the quantity estimated by MCE training in this work.
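A small illustration of how the three feature functions combine under these weights (the numeric values below are hypothetical, not from the slides):

```python
def log_linear_score(am_logprob, lm_logprob, num_words,
                     lm_scale=12.0, word_penalty=-0.5):
    """Combine h_1 (acoustic), h_2 (language model), h_3 (#words) with
    weights (lambda1, lambda2, lambda3) = (1, s, p); the s and p values
    here are made-up examples of an LM scale and insertion penalty."""
    return 1.0 * am_logprob + lm_scale * lm_logprob + word_penalty * num_words

# Unnormalized log-score of one hypothesis; decoding compares such scores.
print(log_linear_score(am_logprob=-35.2, lm_logprob=-4.1, num_words=3))
```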

Page 5: Discriminative Learning for Hidden Markov Models


MCE: Mis-classification measure

Correct label $S_r$: OH EIGHT THREE

Top competitor $s_{r,1}$: OH EIGHT SIX

Observation sequence $X_r$: $x_1, x_2, x_3, x_4, \ldots, x_t, \ldots, x_T$

Define the misclassification measure (in the case of using the correct and top-one incorrect competing tokens):

$$d_r(X_r, \Lambda) = \log p(X_r, s_{r,1} \mid \Lambda) - \log p(X_r, S_r \mid \Lambda)$$

where $s_{r,1}$ is the top-one incorrect (not equal to $S_r$) competing string.

Page 6: Discriminative Learning for Hidden Markov Models


MCE: Loss function

Classification: $s_r^* = \arg\max_{s_r} \log p(X_r, s_r \mid \Lambda)$

Classification error: $d_r(X_r, \Lambda) > 0$ gives 1 classification error; $d_r(X_r, \Lambda) < 0$ gives 0 classification errors.

Loss function: a smoothed error-count function, a sigmoid that rises from 0 to 1 as $d_r$ crosses 0:

$$l(d_r(X_r, \Lambda)) = \frac{1}{1 + e^{-d_r(X_r, \Lambda)}}$$
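A minimal numeric sketch of the misclassification measure and the sigmoid loss (the log-probabilities are made up):

```python
import math

def mce_loss(logp_competitor, logp_correct):
    """Sigmoid-smoothed 0/1 error: d_r > 0 (competitor scores higher)
    gives a loss near 1; d_r < 0 (correct string wins) a loss near 0."""
    d = logp_competitor - logp_correct  # misclassification measure d_r
    return 1.0 / (1.0 + math.exp(-d))

print(mce_loss(-40.3, -42.7))  # correct string loses: loss ~ 0.92
print(mce_loss(-45.0, -40.0))  # correct string wins:  loss ~ 0.007
```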

Page 7: Discriminative Learning for Hidden Markov Models


MCE: Objective function

MCE objective function:

$$L_{\mathrm{MCE}}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} l(d_r(X_r, \Lambda))$$

$L_{\mathrm{MCE}}(\Lambda)$ is the smoothed recognition error rate at the string (token) level.

The model (acoustic model) is trained to minimize $L_{\mathrm{MCE}}(\Lambda)$, i.e., $\Lambda^* = \arg\min_{\Lambda} L_{\mathrm{MCE}}(\Lambda)$.
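Continuing the sketch, the corpus-level objective averages the per-utterance sigmoid losses (the utterance scores are again hypothetical):

```python
import math

def mce_objective(utterances):
    """L_MCE: mean sigmoid loss over R utterances, each given as a
    (logp_competitor, logp_correct) pair of joint log-probabilities."""
    losses = (1.0 / (1.0 + math.exp(-(lc - ls))) for lc, ls in utterances)
    return sum(losses) / len(utterances)

# Three toy utterances; training seeks the Lambda minimizing this quantity.
print(mce_objective([(-40.3, -42.7), (-45.0, -40.0), (-30.1, -30.0)]))
```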

Page 8: Discriminative Learning for Hidden Markov Models


MCE: Optimization

Traditional stochastic GD:

- Gradient-descent-based online optimization
- Convergence is unstable
- Training process is difficult to parallelize

New growth transformation:

- Extends the Baum-Welch based batch-mode method
- Stable convergence
- Ready for parallelized processing

Page 9: Discriminative Learning for Hidden Markov Models


MCE: Optimization

Growth-transformation-based MCE:

Minimizing $L_{\mathrm{MCE}}(\Lambda) = \sum_r l(d_r(\cdot))$

→ Maximizing $P(\Lambda) = G(\Lambda)/H(\Lambda)$

→ Maximizing $F(\Lambda;\Lambda') = G(\Lambda) - P(\Lambda')\,H(\Lambda) + D$

→ Maximizing $F(\Lambda;\Lambda') = \sum f(\cdot)$

→ Maximizing $U(\Lambda;\Lambda') = \sum f'(\cdot)\,\log f(\cdot)$

→ GT formula: $\partial U(\cdot)/\partial \Lambda = 0 \;\Rightarrow\; \Lambda = T(\Lambda')$

If $\Lambda = T(\Lambda')$ ensures $P(\Lambda) > P(\Lambda')$, i.e., $P(\Lambda)$ grows, then $T(\cdot)$ is called a growth transformation of $\Lambda$ for $P(\Lambda)$.

Page 10: Discriminative Learning for Hidden Markov Models


MCE: Optimization

Rewrite the MCE loss function as

$$l(d_r(X_r, \Lambda)) = \frac{p(X_r, s_{r,1} \mid \Lambda)}{p(X_r, s_{r,1} \mid \Lambda) + p(X_r, S_r \mid \Lambda)}$$

Then minimizing $L_{\mathrm{MCE}}(\Lambda)$ is equivalent to maximizing $Q(\Lambda)$, where

$$Q(\Lambda) = R\,\big(1 - L_{\mathrm{MCE}}(\Lambda)\big) = \sum_{r=1}^{R} \frac{\sum_{s_r \in \{s_{r,1}, S_r\}} p(X_r, s_r \mid \Lambda)\,\delta(s_r, S_r)}{\sum_{s_r \in \{s_{r,1}, S_r\}} p(X_r, s_r \mid \Lambda)} = \sum_{r=1}^{R} \frac{p(X_r, S_r \mid \Lambda)}{p(X_r, s_{r,1} \mid \Lambda) + p(X_r, S_r \mid \Lambda)}$$
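To verify that this rewriting is exact, substitute the definition of $d_r$ into the sigmoid loss; since $e^{-d_r} = p(X_r, S_r \mid \Lambda)\,/\,p(X_r, s_{r,1} \mid \Lambda)$,

$$l(d_r) = \frac{1}{1 + e^{-d_r}} = \frac{1}{1 + p(X_r, S_r \mid \Lambda)\,/\,p(X_r, s_{r,1} \mid \Lambda)} = \frac{p(X_r, s_{r,1} \mid \Lambda)}{p(X_r, s_{r,1} \mid \Lambda) + p(X_r, S_r \mid \Lambda)}$$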

Page 11: Discriminative Learning for Hidden Markov Models


MCE: Optimization

$Q(\Lambda)$ is further re-formulated into a single rational function

$$P(\Lambda) = \frac{G(\Lambda)}{H(\Lambda)}$$

where

$$G(\Lambda) = \sum_{s_1} \cdots \sum_{s_R} p(X_1, \ldots, X_R, s_1, \ldots, s_R \mid \Lambda) \sum_{r=1}^{R} \delta(s_r, S_r)$$

$$H(\Lambda) = \sum_{s_1} \cdots \sum_{s_R} p(X_1, \ldots, X_R, s_1, \ldots, s_R \mid \Lambda)$$

Page 12: Discriminative Learning for Hidden Markov Models


MCE: Optimization

Increasing $P(\Lambda)$ can be achieved by maximizing

$$F(\Lambda;\Lambda') = G(\Lambda) - P(\Lambda')\,H(\Lambda) + D$$

i.e.,

$$P(\Lambda) - P(\Lambda') = H^{-1}(\Lambda)\,\big[F(\Lambda;\Lambda') - F(\Lambda';\Lambda')\big]$$

as long as $D$ is a $\Lambda$-independent constant. ($\Lambda'$ is the parameter set obtained from the last iteration.)

Substituting $G(\Lambda)$ and $H(\Lambda)$ into $F(\Lambda;\Lambda')$ gives

$$F(\Lambda;\Lambda') = \sum_{q} \sum_{s} p(X, q, s \mid \Lambda)\,\big[C(s) - P(\Lambda')\big] + D$$

where $q$ is the HMM state sequence, $X = (X_1, \ldots, X_R)$, and $s = (s_1, \ldots, s_R)$.
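The growth property stated above follows in one line from the definitions, since $F(\Lambda';\Lambda') = G(\Lambda') - P(\Lambda')H(\Lambda') + D = D$:

$$P(\Lambda) - P(\Lambda') = \frac{G(\Lambda) - P(\Lambda')\,H(\Lambda)}{H(\Lambda)} = \frac{F(\Lambda;\Lambda') - D}{H(\Lambda)} = \frac{F(\Lambda;\Lambda') - F(\Lambda';\Lambda')}{H(\Lambda)}$$

and $H(\Lambda) > 0$ (a sum of probabilities), so any $\Lambda$ with $F(\Lambda;\Lambda') > F(\Lambda';\Lambda')$ also has $P(\Lambda) > P(\Lambda')$.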

Page 13: Discriminative Learning for Hidden Markov Models


MCE: Optimization

Reformulate $F(\Lambda;\Lambda')$ as

$$F(\Lambda;\Lambda') = \sum_{s} \sum_{q} \int f(\chi, q, s, \Lambda; \Lambda')\, d\chi$$

where

$$f(\chi, q, s, \Lambda; \Lambda') = \delta(\chi, X)\, p(\chi, q, s \mid \Lambda)\,\big[C(s) - P(\Lambda')\big] + d(s)\,\Gamma(\Lambda')\, p(\chi, q \mid s, \Lambda)$$

$$C(s) = \sum_{r=1}^{R} \delta(s_r, S_r)$$

and $\delta(\chi, X)$ concentrates on the observed training data.

$F(\Lambda;\Lambda')$ is now ready for EM-style optimization.

Note: $\Gamma(\Lambda')$ is a constant, and $\log p(\chi, q \mid s, \Lambda)$ is easy to decompose.
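A quick check that the second term of $f$ contributes only the constant $D$ (under the reconstruction above), so the growth argument of the previous slide is unaffected:

$$\sum_{s} \sum_{q} \int d(s)\,\Gamma(\Lambda')\,p(\chi, q \mid s, \Lambda)\,d\chi = \sum_{s} d(s)\,\Gamma(\Lambda') \underbrace{\int \sum_{q} p(\chi, q \mid s, \Lambda)\,d\chi}_{=\,1} = \sum_{s} d(s)\,\Gamma(\Lambda') = D$$

which is $\Lambda$-independent, exactly as the construction of $F$ requires.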

Page 14: Discriminative Learning for Hidden Markov Models


MCE: Optimization

Increasing $F(\Lambda;\Lambda')$ can be achieved by maximizing

$$U(\Lambda;\Lambda') = \sum_{s} \sum_{q} \int f(\chi, q, s, \Lambda'; \Lambda')\,\log f(\chi, q, s, \Lambda; \Lambda')\,d\chi$$

So the growth transformation of $\Lambda$ for the CDHMM is obtained from

$$\frac{\partial U(\Lambda;\Lambda')}{\partial \Lambda} = 0 \;\Rightarrow\; \Lambda = T(\Lambda')$$

Use the extended Baum-Welch algorithm for the E-step.

$\log f(\chi, q, s, \Lambda; \Lambda')$ is decomposable w.r.t. $\Lambda$, so the M-step is easy to compute.
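Why maximizing $U$ increases $F$ (the standard auxiliary-function step, valid when $f > 0$, which is exactly what the setting of $D_m$ two slides below enforces): abbreviating $f = f(\chi,q,s,\Lambda;\Lambda')$ and $f' = f(\chi,q,s,\Lambda';\Lambda')$, and writing $\sum$ for the combined sum/integral over $(s, q, \chi)$, Jensen's inequality for the concave log gives

$$U(\Lambda;\Lambda') - U(\Lambda';\Lambda') = \sum f' \log\frac{f}{f'} \;\le\; \Big(\sum f'\Big)\,\log\frac{\sum f}{\sum f'} = F(\Lambda';\Lambda')\,\log\frac{F(\Lambda;\Lambda')}{F(\Lambda';\Lambda')}$$

so $U(\Lambda;\Lambda') > U(\Lambda';\Lambda')$ forces $F(\Lambda;\Lambda') > F(\Lambda';\Lambda')$.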

Page 15: Discriminative Learning for Hidden Markov Models


MCE: Model estimation formulas

For a Gaussian-mixture CDHMM, the growth transformation of the mean and covariance of Gaussian $m$ is

$$\mu_m = \frac{\sum_{r}\sum_{t} \Delta\gamma_{m,r}(t)\,x_{r,t} + D_m\,\mu'_m}{\sum_{r}\sum_{t} \Delta\gamma_{m,r}(t) + D_m}$$

$$\Sigma_m = \frac{\sum_{r}\sum_{t} \Delta\gamma_{m,r}(t)\,(x_{r,t} - \mu_m)(x_{r,t} - \mu_m)^T + D_m\,\Sigma'_m + D_m\,(\mu_m - \mu'_m)(\mu_m - \mu'_m)^T}{\sum_{r}\sum_{t} \Delta\gamma_{m,r}(t) + D_m}$$

where

$$\Delta\gamma_{m,r}(t) = p(S_r \mid X_r, \Lambda')\,p(s_{r,1} \mid X_r, \Lambda')\,\big[\gamma_{m,r,S_r}(t) - \gamma_{m,r,s_{r,1}}(t)\big]$$

$\gamma_{m,r,s}(t)$ is the occupation probability of Gaussian $m$ at time $t$ given string $s$, computed with the Gaussian density

$$p(x \mid \mu, \Sigma) = (2\pi)^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\,\exp\Big(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big)$$
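A compact NumPy sketch of these two update formulas (the statistics arrays are assumed precomputed by the forward-backward passes; the names are illustrative):

```python
import numpy as np

def gt_update(dgamma, x, mu_prev, sigma_prev, D_m):
    """Growth-transformation re-estimation for one Gaussian m.
    dgamma:     (T,) pooled weights dgamma_{m,r}(t) over all utterances/frames
    x:          (T, d) the corresponding feature vectors x_{r,t}
    mu_prev:    (d,) previous-iteration mean mu'_m
    sigma_prev: (d, d) previous-iteration covariance Sigma'_m
    D_m:        smoothing constant for this Gaussian
    """
    denom = dgamma.sum() + D_m
    mu = (dgamma @ x + D_m * mu_prev) / denom
    diff = x - mu                        # deviations from the NEW mean
    dmu = mu - mu_prev                   # how far the mean moved
    sigma = ((diff * dgamma[:, None]).T @ diff
             + D_m * sigma_prev
             + D_m * np.outer(dmu, dmu)) / denom
    return mu, sigma
```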

Page 16: Discriminative Learning for Hidden Markov Models


MCE: Model estimation formulas

Setting of Dm

Theoretically: set $D_m$ so that $f(\chi, q, s, \Lambda; \Lambda') > 0$.

Empirically:

$$D_m = E \sum_{r=1}^{R} \sum_{t} p(S_r \mid X_r, \Lambda')\,p(s_{r,1} \mid X_r, \Lambda')\,\big[\gamma_{m,r,S_r}(t) + \gamma_{m,r,s_{r,1}}(t)\big]$$

where $E$ is an empirical constant.
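A sketch of the empirical rule (names hypothetical; the `gamma_*` arguments are the per-frame occupation arrays for $S_r$ and $s_{r,1}$, and the `p_*` arguments the string posteriors under $\Lambda'$):

```python
def empirical_D_m(stats, E=2.0):
    """D_m = E * sum_r p(S_r|X_r) p(s_r1|X_r) * sum_t [gamma_S(t) + gamma_s1(t)].
    stats: iterable of (p_correct, p_competitor, gamma_correct, gamma_competitor)
    per utterance; E is the empirical multiplier (the experiments try 1.0-2.5)."""
    return E * sum(pS * ps1 * (sum(gc) + sum(gs))
                   for pS, ps1, gc, gs in stats)
```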

Page 17: Discriminative Learning for Hidden Markov Models

MCE: Workflow


Each GT-MCE iteration: run recognition on the training utterances with the last-iteration model $\Lambda'$ to generate competing strings; feed the training transcripts and the competing strings to GT-MCE estimation to produce the new model $\Lambda$; pass $\Lambda$ on as $\Lambda'$ for the next iteration. (A code sketch of this loop follows.)
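The workflow as a loop (`recognize` and `gt_mce_step` are placeholders for the decoder and the growth-transformation updates sketched earlier):

```python
def gt_mce_train(model, utterances, transcripts, recognize, gt_mce_step,
                 num_iters=10):
    """One pass of the GT-MCE workflow per iteration:
    1) recognition with the last model yields competing strings,
    2) GT-MCE estimation from transcripts + competitors yields the new model."""
    for _ in range(num_iters):
        competitors = [recognize(model, x) for x in utterances]
        model = gt_mce_step(model, utterances, transcripts, competitors)
    return model
```

Because each iteration is a batch E/M-style pass rather than an online gradient update, the per-utterance statistics can be accumulated in parallel, which is the practical advantage claimed over GPD.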

Page 18: Discriminative Learning for Hidden Markov Models


Experiment: TI-DIGITS

Vocabulary: “1” to “9”, plus “oh” and “zero”

Training set: 8623 utterances / 28329 words

Test set: 8700 utterances / 28583 words

33-dimensional spectral features: energy + 10 MFCCs, plus ∆ and ∆∆ features.

Model: Continuous Density HMMs

Total number of Gaussian components: 3284

Page 19: Discriminative Learning for Hidden Markov Models


Experiment: TI-DIGITS

GT-MCE vs. the ML (maximum likelihood) baseline:

Obtains the lowest error rate on this task

Reduces recognition word error rate (WER) by 23%

Fast and stable convergence

[Figures: "MCE training - TIdigits", with curves for E = 1.0, 2.0, 2.5. Left: loss function (sigmoid error count, range 600-1100) vs. MCE iteration (0-10). Right: WER (%, range 0.2-0.4) vs. MCE iteration (0-10).]

Page 20: Discriminative Learning for Hidden Markov Models


Experiment: Microsoft Tele. ASR

Microsoft Speech Server (ENUTEL): a telephony speech recognition system

Training set: 2000 hours of speech / 2.7 million utterances

33-dim spectrum features: (E+MFCCs) +∆ +∆∆

Acoustic Model: Gaussian mixture HMM

Total number of Gaussian components: 100K

Vocabulary: 120K (delivered vendor lexicon)

CPU Cluster: 100 CPUs @ 1.8GHz – 3.4GHz

Training Cost: 4~5 hours per iteration

Page 21: Discriminative Learning for Hidden Markov Models


Experiment: Microsoft Tele. ASR

Name | Vocab. size | # words | Description
MSCT | 70K | 4,356 | Enterprise call-center system (the MS call center we use daily)
SA | 20K | 43,966 | Major commercial applications (includes much cell-phone data)
QSR | 55K | 5,718 | Name-dialing system (many names are OOV; relies on LTS)
ACNT | 20K | 3,219 | Foreign-accented speech recognition (designed to test system robustness)

Evaluated on four corpus-independent test sets

Collected from sites other than the training-data providers

Covering major commercial telephony ASR scenarios

Page 22: Discriminative Learning for Hidden Markov Models


Experiment: Microsoft Tele. ASR

Test set | ML WER | GT-MCE WER | Relative WER reduction
MSCT | 11.59% | 9.73% | 16.04%
SA | 11.24% | 10.07% | 10.40%
QSR | 9.55% | 8.58% | 10.07%
ACNT | 32.68% | 29.00% | 11.25%

Significant performance improvements across the board

The first time MCE has been successfully applied to a 2000-hour speech database

Growth-transformation-based MCE training is well suited to large-scale modeling tasks