
Introduction to Discriminative Training in Speech Recognition

Ralf Schlüter, Georg Heigold

Lehrstuhl für Informatik 6, Human Language Technology and Pattern Recognition
Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany

January 14, 2010


Contents

Introduction

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept

Conclusions

Annex


Outline

Introduction
  Motivation
  Overview

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept

Conclusions

Annex

Motivation

Aim of discriminative methods: improve class separation

- standard maximum likelihood (ML) training: maximize the reference class conditional $p_\theta(x|c)$
- maximum mutual information (MMI) training: maximize the reference class posterior

  $p_\theta(c|x) = \frac{p(c) \cdot p_\theta(x|c)}{\sum_{c'} p(c') \cdot p_\theta(x|c')}$

Where's the difference?

- Ideally: (almost) no difference! In case of infinite training data and correct model assumptions, the true probabilities are obtained in both cases. They lead to equal decisions, provided the class prior $p(c)$ is known. (Proof: model-free optimization.)
- ML training: classes are handled independently, therefore decision boundaries are not considered explicitly in training.
- In MMI training, and in discriminative training generally, the reference class directly competes against all other classes, so decision boundaries become relevant in training.


Motivation

- In practice, model assumptions are incorrect, and training data is limited. Here discriminative training can be beneficial.

Example: a two-class problem (with pooled covariance matrix)

[Figure: two scatter plots (class -1 vs. class +1) over x and y with ML and MMI decision boundaries; left panel labeled ML/MMI, right panel showing the ML and MMI boundaries separately.]

- Clearly, in case of ML training, the outlier deteriorates the decision boundary, whereas MMI training registers the minor importance of the outlier.
- MMI captures the decision boundary, although the model assumption (pooled covariance) does not fit in the second case.
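The toy experiment can be reproduced in spirit with a few lines of NumPy. The sketch below is illustrative only (made-up data; the step size and iteration count are arbitrary choices, not the original setup): it fits two Gaussian classes with a pooled covariance by ML in closed form, then re-estimates the means discriminatively by gradient ascent on the log class posterior (MMI), so that the outlier's pull on the decision boundary can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)
# two 2-D classes; class "-1" additionally contains one far-away outlier
x_pos = rng.normal([1.0, 0.5], 0.3, size=(50, 2))
x_neg = np.vstack([rng.normal([-1.0, 0.5], 0.3, size=(49, 2)), [[-5.0, 0.5]]])
X = np.vstack([x_pos, x_neg])
y = np.array([1] * 50 + [0] * 50)                        # 1 = class +1, 0 = class -1

def log_gauss(X, mu, cov):
    """log N(x | mu, cov) for every row of X."""
    d = X - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.einsum('ni,ij,nj->n', d, inv, d) + logdet + 2.0 * np.log(2.0 * np.pi))

# ML: class means and pooled covariance in closed form
mu_ml = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
centered = X - mu_ml[y]
cov = centered.T @ centered / len(X)

# MMI: keep the pooled covariance, re-estimate the means by gradient ascent
# on sum_n log p(y_n | x_n) (uniform class priors, equal class sizes)
mu_mmi = mu_ml.copy()
for _ in range(200):
    logp = np.stack([log_gauss(X, mu_mmi[c], cov) for c in (0, 1)], axis=1)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)              # class posteriors p(c | x_n)
    for c in (0, 1):
        ind = (y == c).astype(float)
        # d/dmu_c sum_n log p(y_n|x_n) = Sigma^-1 sum_n (delta(y_n,c) - p(c|x_n)) (x_n - mu_c)
        grad = np.linalg.inv(cov) @ ((ind - post[:, c])[:, None] * (X - mu_mmi[c])).sum(axis=0)
        mu_mmi[c] += 0.01 * grad

print("ML means :", mu_ml)    # the class "-1" mean is pulled towards the outlier
print("MMI means:", mu_mmi)   # the boundary-relevant statistics dominate instead
```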



Overview

Questions:

- Which discriminative criterion to take?
- Relation to decision rule and evaluation measure?
- How to optimize the criterion?
- Efficiency?
- Influence of modeling?
- Uniqueness of solution?
- Generalization?

Bottom line:

- How to utilize the available training material to obtain optimum recognition performance?


Outline

Introduction

Training Criteria
  Notation
  General Approach
  Probabilistic Training Criteria
  Error-Based Training Criteria
  Practical Issues
  Comparative Experimental Results

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept

Conclusions

Annex

Notation

$X_r$              sequence $x_{r,1}, x_{r,2}, \ldots, x_{r,T_r}$ of acoustic observation vectors
$W_r$              spoken word sequence $w_{r,1}, w_{r,2}, \ldots, w_{r,N_r}$ in training utterance $r$
$W$                any word sequence
$p(W)$             language model probability, supposed to be given
$p_\theta(X_r|W)$  acoustic emission probability / acoustic model
$\theta$           set of all parameters of the acoustic model
$M_r$              set of competing word sequences to be considered
$f$                smoothing function


General Approach

Training

- input: training data and stochastic model $p_\theta(X, W)$ with free model parameters $\theta$
- output: "optimal" model parameters $\theta$
- optimality defined via a training criterion:

  $\theta := \arg\max_\theta \{ F(\theta) \}$

Unified training criterion [Macherey+ 2005]:

  $F(\theta) = \sum_{r=1}^{R} f\!\left( \log \frac{\sum_{W} p(W)\, p_\theta(X_r|W) \cdot A(W, W_r)}{\sum_{W \in M_r} p(W)\, p_\theta(X_r|W)} \right)$

- covers maximum mutual information (MMI), minimum classification error (MCE), minimum phone/word error (MPE/MWE)
- control: set $M_r$ of competing hypotheses, cost function, smoothing function, scaling of models (not shown)
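As a concrete reading of the unified criterion, the following sketch evaluates two of the special cases it covers for a single utterance over a hypothetical N-best list (all scores and accuracies are made up): with $A(W, W_r) = \delta(W, W_r)$ and $f$ the identity it reduces to the MMI log posterior of the spoken word sequence, while taking the expectation of an accuracy function gives the MPE/MWE expected accuracy discussed later.

```python
import numpy as np
from scipy.special import logsumexp

# hypothetical 3-entry N-best list for one utterance:
# combined log scores log p(W) + log p_theta(X_r|W) and accuracies A(W, W_r)
log_scores = np.array([-10.0, -11.0, -13.0])
accuracies = np.array([5.0, 3.0, 4.0])
is_reference = np.array([1.0, 0.0, 0.0])        # marks W == W_r

log_post = log_scores - logsumexp(log_scores)   # hypothesis posteriors (log domain)
F_MMI = float(np.dot(is_reference, log_post))   # log posterior of the spoken word sequence
F_MPE = float(np.dot(np.exp(log_post), accuracies))  # expected accuracy E_theta[A(., W_r)]
print(F_MMI, F_MPE)
```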



Probabilistic Training Criteria

Objective

- find a good estimate of the probability distribution
- optimality regarding error via Bayes decoding (asymptotic w.r.t. the amount of training data)

Maximum Likelihood (ML)

- optimization of the joint probability

  $\arg\max_\theta \sum_r \log\bigl(p(W_r)\, p_\theta(X_r|W_r)\bigr) = \arg\max_\theta \sum_r \log p_\theta(X_r|W_r)$

- Tutorial on HMMs [Rabiner 1989].
- Maximization of the probability of the reference word sequences (classes).
- Model correctness important.
- HMM: maximization for each class separately.
- Neglects competing classes.
- Expectation-maximization: local convergence guaranteed.
- Estimation efficient, easily parallelizable.

Maximum Mutual Information (MMI)

- optimization of the conditional probability

  $\arg\max_\theta \sum_r \log p_\theta(W_r|X_r) = \arg\max_\theta \sum_r \log \frac{p(W_r)\, p_\theta(X_r|W_r)}{\sum_V p(V)\, p_\theta(X_r|V)}$

- Considers competing classes and therefore decision boundaries.
- Necessitates a set of competing classes on the training data.
- Optimization for standard modeling (HMMs, mixture distributions): only gradient descent or similar.
- Optimization using log-linear modeling: convex problem.
- First application of MMI for ASR using discrete HMMs [Bahl+ 1986]:
  - 2000 isolated words, 18% rel. improvement in word error rate.
- MMI for discrete and continuous probability densities [Brown 1987]:
  - isolated E-set letters, 18% rel. improvement in recognition rate.
- MMI for discrete and continuous probability densities [Normandin 1991]:
  - digit strings, up to 50% rel. improvement in string error rate.


Error-Based Training Criteria

Objective: optimize some error measure directly, e.g.:

- Empirical recognition error on the training data
  - Advantage: direct relation to the decision rule
  - Problem: non-differentiable training criterion; differentiable approximations are used in practice
  - Problem: ASR classes (words/word sequences) difficult to handle
- Model-based expected error on the training data
  - Advantage: word or phoneme error easy to handle
  - Usually an approximated word/phoneme error, but the correct edit distance is also viable [Heigold+ 2005]
  - Relation to the decision rule less straightforward.
  - Over-training and generalization become an issue (→ regularization, margin)

Minimum classification error (MCE)

- For ASR: minimization of the smoothed empirical sentence error [Juang & Katagiri 1992, Chou+ 1992].

  $\arg\min_\theta \frac{1}{R} \sum_{r=1}^{R} \frac{1}{1 + \left[ \dfrac{p_\theta^\alpha(X_r|W_r) \cdot p^\alpha(W_r)}{\sum_{W \neq W_r} p_\theta^\alpha(X_r|W) \cdot p^\alpha(W)} \right]^{2\varrho}}$

- Smoothing parameters $\alpha$ and $\varrho$.
- Upper bound to the Bayes error rate for any acoustic model [Schlüter+ 2001].
- Lesser effect of incorrect model assumptions.
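A small numerical sketch of this smoothed sentence error for one utterance (hypothetical log scores; the values of alpha and rho are illustrative, not from the talk): the loss is a sigmoid-like function of the scaled log ratio between the reference score and the sum over the competing hypotheses.

```python
import numpy as np
from scipy.special import logsumexp

alpha, rho = 0.5, 1.0                               # smoothing parameters (illustrative)
log_ref = -100.0                                    # log p(W_r) + log p_theta(X_r|W_r)
log_competitors = np.array([-104.0, -108.0])        # log scores of hypotheses W != W_r

# log of [ reference score^alpha / sum over competitor scores^alpha ]
d = alpha * log_ref - logsumexp(alpha * log_competitors)
loss = 1.0 / (1.0 + np.exp(2.0 * rho * d))          # smoothed 0/1 sentence error
print(loss)   # near 0 here, since the reference scores clearly above the competitors
```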

Minimum word/phone error (MWE/MPE)

- minimization of the model-based expected word/phone error on the training data [Povey & Woodland 2002]

  $\arg\max_\theta \sum_{r=1}^{R} \frac{\sum_W A(W, W_r)\, p(W)\, p_\theta(X_r|W)}{\sum_W p(W)\, p_\theta(X_r|W)}$

- Criterion: maximum expected accuracy $A(W, W_r)$.
- Accuracy usually approximate, but the exact case based on the edit (Levenshtein) distance is also possible [Heigold+ 2005].
- Regularization (e.g. I-smoothing [Povey & Woodland 2002]) necessary due to overtraining.
- Usually better than MMI and MCE.


Practical Issues

- Importance of the language model in training of the acoustic model.
- Relative and absolute scaling of the language and acoustic models in training.
- Necessity for recognition of the training data.
- Efficient calculation of discriminative training statistics using word lattices.

Language Models for Discriminative Training

Potential importance of the language model choice:

- language model for recognition of alternative word sequences
- language model dependence of the discriminative training criterion itself
- interaction of the language model and the acoustic model parameters

Correlation hypothesis:
only those acoustic models need optimization which, even together with a language model, do not sufficiently discriminate.
→ the language model choice would correlate for training and recognition

Masking hypothesis:
the language model usually improves recognition accuracy considerably and might mask deficiencies of the acoustic models.
→ suboptimal language models for training would give better performance

Language Model for Discriminative Training

- Discriminative training includes the language model.
- In training, a unigram language model usually leads to the best word error rates [Schlüter+ 1999] (WSJ 5k):

  language models         criterion   word error rates [%]
  recog      train                    dev     eval    dev & eval
  bi         -             ML         6.91    6.78    6.86
             zero          MMI        6.71    6.03    6.41
             uni           MMI        6.59    6.00    6.33   (-8%)
             bi            MMI        6.71    6.20    6.48
             tri           MMI        6.87    6.54    6.72
  tri        -             ML         4.82    4.11    4.51
             zero          MMI        4.63    4.05    4.38
             uni           MMI        4.30    3.64    4.01   (-11%)
             bi            MMI        4.48    3.94    4.24
             tri           MMI        4.58    4.00    4.33

Scaling of likelihoods

- recognition: absolute scaling of likelihoods is irrelevant (language model scale vs. acoustic model scale)
- absolute scaling does have an impact on word posterior calculation [Wessel+ 1998, Woodland & Povey 2000]
- use the language model scale $\beta$ also in training:

  $p(X, W) = p(W)^\beta\, p_\theta(X|W)$

- replace $p(X, W)$ with

  $p(X, W)^\gamma = p(W)^{\beta\gamma}\, p_\theta(X|W)^\gamma \quad \text{for } \gamma \in [0, 1]$

- optimum approximately at $\gamma = \frac{1}{\beta}$, i.e. use

  $p(X, W)^{\frac{1}{\beta}} = p(W)\, p_\theta(X|W)^{\frac{1}{\beta}}$

- For simplicity, this scaling is usually omitted in the equations here.
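A tiny numerical illustration (made-up joint log scores; the value beta = 15 is an assumption) of why the global scale gamma = 1/beta matters for posterior computation: unscaled scores produce near-degenerate word posteriors, while the scaled scores do not.

```python
import numpy as np
from scipy.special import logsumexp

beta = 15.0                                          # language model scale (assumed value)
log_joint = np.array([-500.0, -510.0, -520.0])       # log p(W)^beta p_theta(X|W) per hypothesis

for gamma in (1.0, 1.0 / beta):
    post = np.exp(gamma * log_joint - logsumexp(gamma * log_joint))
    print(gamma, post)   # gamma = 1: winner takes (almost) all; gamma = 1/beta: smoother posteriors
```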

Competing Word Sequences

- Problem: exponential number of competing word sequences.
- Competing word sequences need to be estimated:
  - Hypothesis generation on the training data using the recognizer.
  - Initial lattice generation using the recognizer is sufficient.
  - Later acoustic model rescoring is constrained to the lattice.
- Representation and processing of competing word sequences:
  - Efficient algorithms to process word lattices.
  - Generic implementation: weighted finite state transducers.

Competing Word Sequences

History:

- best recognized word sequence for MMI (Corrective Training) [Normandin 1991]:
  - considers incorrectly recognized training sentences only
- best incorrectly recognized word sequence for MCE [Juang & Katagiri 1992]:
  - interpretation as smoothed sentence error still valid
- N-best recognized word sequences for MMI [Chow 1990]:
  - continuous speech recognition, 1000 words
  - only minor improvements in word error rate
- word graphs from recognition for MMI training [Valtchev+ 1997]:
  - large vocabulary, 64k words
  - efficient implementation
  - 5-10% relative improvement in word error rate


Comparative Experimental Results

                      WER [%]
            SieTill   WSJ 5k         EPPS English            Mandarin BN/BC
  Crit.     Test      Dev    Evl     Dev06   Evl06   Evl07   Dev07   Evl06
  ML        1.81      4.55   3.74    14.4    10.8    12.0    15.1    21.9
  MMI       1.79      4.07   3.53    13.8    11.0    12.0    14.4    20.8
  MCE       1.69      4.02   3.47    13.8    11.0    11.9
  MWE                 3.98   3.44
  MPE                 4.17   3.62    13.4    10.2    11.5    14.2    20.6

- SieTill [Schlüter 2000]
- WSJ 5k [Macherey 2010]
- EPPS/broadcasts [Heigold 2010]

Outline

Introduction

Training Criteria

Parameter Optimization
  Motivation
  Gradient descent
  Rprop
  Formal gradient of MMI
  Formal gradient of MPE

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept

Conclusions

Annex

Motivation

Goal: an optimization method for discriminative training criteria $F(\theta)$ w.r.t. the set of parameters $\theta$ which provides reasonable convergence.

- Various approaches, e.g.:
  - extended Baum-Welch (EBW) [Normandin 1991]
  - gradient descent, study: e.g. [Valtchev 1995]
  - MMI with log-linear models: generalized iterative scaling (GIS)
  - generalization of GIS to log-linear models with hidden variables and further criteria like MPE and MCE [Heigold+ 2008a]
- Problems:
  - robust setting of step sizes/iteration constants (EBW and gradient descent),
  - convergence speed (especially GIS).

Extended Baum-Welch

- Motivated by a growth transformation [Gopalakrishnan+ 1991].
- Widely used for discriminative training of Gaussian mixture HMMs, e.g. [Normandin 1991, Valtchev+ 1997, Schlüter 2000, Woodland & Povey 2002].
- Highly optimized heuristics for finding the right order of magnitude of the iteration constants.
- Training of Gaussian mixture HMMs: the requirement of positive variances yields an estimate for the iteration constants.
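The slide describes EBW only qualitatively. For concreteness, here is a sketch of the commonly used EBW mean re-estimation from the literature; this exact formula is not on the slide and assumes Gaussian mixture HMM state models, with the per-state iteration constant D chosen heuristically, e.g. large enough to keep the updated variances positive.

```python
import numpy as np

def ebw_mean_update(x_num, x_den, occ_num, occ_den, mu_old, D):
    """x_num/x_den: (S, F) occupancy-weighted feature sums from numerator/denominator statistics;
    occ_num/occ_den: (S,) summed occupancies; mu_old: (S, F) current means; D: (S,) constants."""
    # mu_s = (x_num(s) - x_den(s) + D_s * mu_old_s) / (occ_num(s) - occ_den(s) + D_s)
    return (x_num - x_den + D[:, None] * mu_old) / (occ_num - occ_den + D)[:, None]
```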


Gradient descent

Follow the gradient to optimize the parameters:

  $\theta = \theta + \gamma\, \nabla_\theta F(\theta)$

Step sizes:

- heuristic, e.g. for MCE [Chou+ 1992]
- by comparison to EBW [Schlüter 2000]

Convergence:

- local optimum
- better convergence: general purpose approaches, e.g. Qprop, Rprop, or L-BFGS; for experimental comparisons see [McDermott & Katagiri 2005, McDermott+ 2007, Gunawardana+ 2005, Mahajan+ 2006]


Rprop [Riedmiller & Braun 1993]

General purpose gradient-based optimization:

- assume iteration $n$
- parameter update:

  $\theta_i^{(n+1)} = \theta_i^{(n)} + \gamma_i^{(n)} \,\mathrm{sign}\!\left( \frac{\partial F(\theta^{(n)})}{\partial \theta_i} \right)$

- update of the step sizes $\gamma_i^{(n)}$:

  $\gamma_i^{(n+1)} = \begin{cases} \min\{\gamma_i^{(n)} \cdot \eta^{+},\, \gamma_{\max}\} & \text{if } \frac{\partial F(\theta^{(n)})}{\partial \theta_i} \cdot \frac{\partial F(\theta^{(n-1)})}{\partial \theta_i} > 0 \\[4pt] \max\{\gamma_i^{(n)} \cdot \eta^{-},\, \gamma_{\min}\} & \text{if } \frac{\partial F(\theta^{(n)})}{\partial \theta_i} \cdot \frac{\partial F(\theta^{(n-1)})}{\partial \theta_i} < 0 \\[4pt] \gamma_i^{(n)} & \text{otherwise} \end{cases}$

- $\eta^{+} \in (1, \infty)$, $\eta^{-} \in (0, 1)$
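The update rules above translate directly into code. The following is a compact sketch for a maximization problem (no weight backtracking; the values of eta_plus, eta_minus, and the step-size bounds are illustrative defaults, not values from the talk).

```python
import numpy as np

def rprop_maximize(grad_fn, theta, n_iter=100, eta_plus=1.2, eta_minus=0.5,
                   gamma_init=0.01, gamma_min=1e-6, gamma_max=1.0):
    """Rprop as on the slide: per-parameter step sizes, sign-based updates (maximization)."""
    theta = np.asarray(theta, dtype=float).copy()
    gamma = np.full_like(theta, gamma_init)              # gamma_i^(n)
    prev_grad = np.zeros_like(theta)
    for _ in range(n_iter):
        grad = grad_fn(theta)
        theta = theta + gamma * np.sign(grad)            # theta_i += gamma_i * sign(dF/dtheta_i)
        change = grad * prev_grad                        # sign of dF^(n) * dF^(n-1)
        gamma = np.where(change > 0, np.minimum(gamma * eta_plus, gamma_max), gamma)
        gamma = np.where(change < 0, np.maximum(gamma * eta_minus, gamma_min), gamma)
        prev_grad = grad
    return theta

# toy usage: maximize F(theta) = -(theta - 3)^2, so grad F = -2 (theta - 3)
print(rprop_maximize(lambda t: -2.0 * (t - 3.0), np.array([0.0])))   # converges towards 3
```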


Formal gradient of MMI

- notation:
  - $W$: word sequence $w_1, \ldots, w_N$
  - $r$: index of the training segment/utterance given by $(X_r, W_r)$
  - $X_r$: acoustic observation vector sequence $x_{r1}, \ldots, x_{rT}$
  - $W_r$: reference/spoken word sequence $w_{r1}, \ldots, w_{rN}$
  - $s_1^T$: HMM state sequence $s_1, \ldots, s_T$

- MMI training criterion:

  $F_{\mathrm{MMI}}(\theta) = \sum_r \log\left( \frac{p(W_r)\, p_\theta(X_r|W_r)}{\sum_W p(W)\, p_\theta(X_r|W)} \right) = \sum_r \left( \log p(W_r)\, p_\theta(X_r|W_r) - \log \sum_W p(W)\, p_\theta(X_r|W) \right)$

- acoustic model (HMM):

  $p_\theta(X_r, W) = \sum_{s_1^{T_r}:W} \prod_{t=1}^{T_r} p(s_t|s_{t-1})\, p_\theta(x_{rt}|s_t)$

Derivative of MMI w.r.t. Parameter

Gradient of the MMI criterion:

  $\nabla_\theta F_{\mathrm{MMI}}(\theta) = \sum_r \left( \nabla_\theta \log p_\theta(X_r|W_r) - \frac{\sum_W p(W)\, p_\theta(X_r|W)\, \nabla_\theta \log p_\theta(X_r|W)}{\sum_{W'} p(W')\, p_\theta(X_r|W')} \right)$

For efficient evaluation, consider the derivative of the acoustic model, $\nabla_\theta \log p_\theta(X_r|W)$.

Derivative of MMI w.r.t. Parameter

Derivative of the acoustic model:

  $\nabla_\theta \log p_\theta(X_r|W) = \nabla_\theta \log \sum_{s_1^{T_r}:W} \prod_{t=1}^{T_r} p_\theta(x_{rt}|s_t)\, p(s_t|s_{t-1})$

  $= \sum_{t=1}^{T_r} \sum_{s_1^{T_r}:W} \bigl( \nabla_\theta \log p_\theta(x_{rt}|s_t) \bigr) \cdot \frac{\prod_{t'=1}^{T_r} p_\theta(x_{rt'}|s_{t'})\, p(s_{t'}|s_{t'-1})}{\sum_{\sigma_1^{T_r}:W} \prod_{\tau=1}^{T_r} p_\theta(x_{r\tau}|\sigma_\tau)\, p(\sigma_\tau|\sigma_{\tau-1})}$

  $= \sum_{t=1}^{T_r} \sum_{s} \bigl( \nabla_\theta \log p_\theta(x_{rt}|s) \bigr) \cdot \frac{\sum_{s_1^{T_r}: s_t = s} p_\theta(X_r, s_1^{T_r}|W)}{p_\theta(X_r|W)}$

  $= \sum_{t=1}^{T_r} \sum_{s} \gamma_{rt}(s|W) \cdot \nabla_\theta \log p_\theta(x_{rt}|s)$

with the word sequence conditioned state posterior (occupancy):

  $\gamma_{rt}(s|W) = \frac{\sum_{s_1^{T_r}: s_t = s} p_\theta(X_r, s_1^{T_r}|W)}{p_\theta(X_r|W)} = p_{\theta,t}(s|X_r, W)$

Derivative of MMI w.r.t. Parameter

Resubstituting the derivative of the acoustic model into the derivative of the MMI criterion:

  $\nabla_\theta F_{\mathrm{MMI}}(\theta) = \sum_r \sum_{t=1}^{T_r} \sum_s \bigl( \nabla_\theta \log p_\theta(x_{rt}|s) \bigr) \cdot \left( \gamma_{rt}(s|W_r) - \frac{\sum_W p(W)\, p_\theta(X_r|W)\, \gamma_{rt}(s|W)}{\sum_{W'} p(W')\, p_\theta(X_r|W')} \right)$

  $= \sum_r \sum_{t=1}^{T_r} \sum_s \bigl( \nabla_\theta \log p_\theta(x_{rt}|s) \bigr) \cdot \bigl( \gamma_{rt}(s|W_r) - \gamma_{rt}(s) \bigr)$

with the general state posterior (occupancy):

  $\gamma_{rt}(s) = \frac{\sum_W p(W)\, p_\theta(X_r|W)\, \gamma_{rt}(s|W)}{\sum_{W'} p(W')\, p_\theta(X_r|W')} = p_{\theta,t}(s|X_r)$
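Given the two occupancy fields, the gradient accumulation itself is straightforward. A minimal sketch for diagonal-covariance Gaussian state models follows (single utterance; the function name and array shapes are assumptions for illustration). The resulting gradient can then be fed to a gradient-based optimizer such as Rprop above.

```python
import numpy as np

def mmi_mean_gradient(X, gamma_num, gamma_den, mu, var):
    """X: (T, F) features; gamma_num/gamma_den: (T, S) occupancies gamma_rt(s|W_r) and gamma_rt(s);
    mu, var: (S, F) Gaussian means and diagonal variances. Returns the (S, F) mean gradient."""
    diff = gamma_num - gamma_den                         # discriminative weight per frame and state
    grad = np.zeros_like(mu)
    for s in range(mu.shape[0]):
        # d/dmu_s log N(x_t | mu_s, var_s) = (x_t - mu_s) / var_s, weighted and summed over t
        grad[s] = (diff[:, s:s + 1] * (X - mu[s]) / var[s]).sum(axis=0)
    return grad
```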

Efficient Calculation of State Occupancies

In general:

- efficient calculation of the spoken word sequence conditioned state occupancy $\gamma_{rt}(s|W_r)$: forward-backward state probabilities on the trellis of the word sequence
- efficient calculation of the general state occupancy $\gamma_{rt}(s)$: forward-backward probabilities on the trellis of the word lattice

Viterbi approximation:

- $\gamma_{rt}(s|W) = \delta_{s, s_{rt}(W)}$ with the forced alignment $S_r(W) = s_{r1}(W), \ldots, s_{rT_r}(W)$ of the spoken word sequence
- assume a (word) lattice $M_r$ for utterance $r$, with edges $\omega$ representing a word $w(\omega)$ (in context) with start time $t_s(\omega)$ and end time $t_e(\omega)$, and a corresponding forced alignment $s_{t_s}^{t_e}(\omega)$. An edge sequence $\mathcal{W} \in M_r$ then corresponds to the word sequence $W(\mathcal{W})$. Consequently, the language model and acoustic model can also be defined for an edge sequence, which then might specify word boundaries, phonetic and language model context.
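For the word-sequence conditioned occupancies, the forward-backward computation on the HMM trellis can be sketched as follows (illustrative toy code in the log domain; the function name and the 2-state example are assumptions for illustration, not the lattice-based implementation discussed here).

```python
import numpy as np
from scipy.special import logsumexp

def state_occupancies(log_b, log_a, log_pi):
    """log_b: (T, S) emission log-probs per frame; log_a: (S, S) transition log-probs;
    log_pi: (S,) initial log-probs. Returns gamma_rt(s|W) as a (T, S) array."""
    T, S = log_b.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = log_pi + log_b[0]
    for t in range(1, T):                                # forward pass
        alpha[t] = log_b[t] + logsumexp(alpha[t - 1][:, None] + log_a, axis=0)
    for t in range(T - 2, -1, -1):                       # backward pass (beta[T-1] = log 1 = 0)
        beta[t] = logsumexp(log_a + log_b[t + 1] + beta[t + 1], axis=1)
    log_gamma = alpha + beta
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)   # normalize per frame
    return np.exp(log_gamma)

# toy usage: 2-state left-to-right HMM, 3 frames (all numbers are made up)
log_b = np.log([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
log_a = np.log([[0.6, 0.4], [0.01, 0.99]])
print(state_occupancies(log_b, log_a, np.log([0.99, 0.01])))   # rows sum to 1
```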

Word Posterior Probabilities

For the general state occupancy in Viterbi approximation we obtain:

  $\gamma_{rt}(s) = \frac{\sum_{\mathcal{W}} p(\mathcal{W})\, p_\theta(X_r|\mathcal{W})\, \delta_{s, s_{rt}(\mathcal{W})}}{p_\theta(X_r)} = \sum_\omega \delta_{s, s_{rt}(\omega)} \frac{\sum_{\mathcal{W}: \omega \in \mathcal{W}} p(\mathcal{W})\, p_\theta(X_r|\mathcal{W})}{p_\theta(X_r)} = \sum_\omega \delta_{s, s_{rt}(\omega)}\, p(\omega|X_r)$

with the edge (or word in context) posterior

  $p(\omega|X_r) = \frac{\sum_{\mathcal{W}: \omega \in \mathcal{W}} p(\mathcal{W})\, p_\theta(X_r|\mathcal{W})}{p_\theta(X_r)}$

A forward-backward algorithm is used to efficiently compute edge (word in context) posterior probabilities using word lattices.


Formal gradient of MPE

- $A(W, W_r)$: accuracy (negated error) between string $W$ and $W_r$
- example (MPE): approximate phone accuracy [Povey & Woodland 2002]
- expectation of the accuracy:

  $E_\theta[A(\cdot, W_r)] := \sum_W A(W, W_r) \cdot \frac{p(W)\, p_\theta(X_r|W)}{\sum_{W'} p(W')\, p_\theta(X_r|W')}$

- MPE training criterion:

  $F_{\mathrm{MPE}}(\theta) = \sum_r E_\theta[A(\cdot, W_r)]$

Derivative of MPE w.r.t. Parameter

Derivative of the MPE criterion, using $\nabla_\theta\, p_\theta(X_r|W) = p_\theta(X_r|W) \cdot \bigl( \nabla_\theta \log p_\theta(X_r|W) \bigr)$:

  $\nabla_\theta F_{\mathrm{MPE}}(\theta) = \sum_r \sum_W \bigl( A(W, W_r) - E_\theta[A(\cdot, W_r)] \bigr) \cdot \bigl( \nabla_\theta \log p_\theta(X_r|W) \bigr) \cdot \frac{p(W)\, p_\theta(X_r|W)}{\sum_{W'} p(W')\, p_\theta(X_r|W')}$

For efficient evaluation, consider the derivative of the acoustic model:

  $\nabla_\theta \log p_\theta(X_r|W) = \sum_{t=1}^{T_r} \sum_s \bigl( \nabla_\theta \log p_\theta(x_{rt}|s) \bigr) \cdot \frac{\sum_{s_1^{T_r}: s_t = s} p_\theta(X_r, s_1^{T_r}|W)}{p_\theta(X_r|W)}$

Derivative of MPE w.r.t. Parameter

Resubstituting the derivative of the acoustic model into the derivative of the MPE criterion:

  $\nabla_\theta F_{\mathrm{MPE}}(\theta) = \sum_r \sum_{t=1}^{T_r} \sum_s \bigl( \nabla_\theta \log p_\theta(x_{rt}|s) \bigr) \cdot \gamma_{rt}(s)$

with the general state accuracy:

  $\gamma_{rt}(s) = \frac{\sum_W \bigl( A(W, W_r) - E_\theta[A(\cdot, W_r)] \bigr) \sum_{s_1^{T_r}: s_t = s} p(W)\, p_\theta(X_r, s_1^{T_r}|W)}{\sum_{W'} p(W')\, p_\theta(X_r|W')}$

which can be computed efficiently, similar to the case of the general state occupancies.

Efficient Calculation of State Accuracy

In general:

- assumption: $A(W, W_r) = \sum_{t=1}^{T_r} A\bigl(s_{rt}(W), s_{rt}(W_r)\bigr)$
- example: approximate phone accuracy [Povey & Woodland 2002]
- efficient calculation of the general state accuracy $\gamma_{rt}(s)$: forward-backward accuracies on the trellis of the word lattice [Povey & Woodland 2002]

Word Posterior Accuracies

For the general state accuracy we obtain in Viterbi approximation:

  $\gamma_{rt}(s) = \frac{\sum_{\mathcal{W}} \bigl( A(\mathcal{W}, W_r) - E_\theta[A(\cdot, W_r)] \bigr) \cdot p(\mathcal{W})\, p_\theta(X_r|\mathcal{W})\, \delta_{s, s_{rt}(\mathcal{W})}}{p_\theta(X_r)} = \sum_\omega \delta_{s, s_{rt}(\omega)} \frac{\sum_{\mathcal{W}: \omega \in \mathcal{W}} \bigl( A(\mathcal{W}, W_r) - E_\theta[A(\cdot, W_r)] \bigr) \cdot p(\mathcal{W})\, p_\theta(X_r|\mathcal{W})}{p_\theta(X_r)} = \sum_\omega \delta_{s, s_{rt}(\omega)}\, p(\omega|X_r)$

with the edge (or word in context) posterior accuracies

  $p(\omega|X_r) = \frac{\sum_{\mathcal{W}: \omega \in \mathcal{W}} \bigl( A(\mathcal{W}, W_r) - E_\theta[A(\cdot, W_r)] \bigr) \cdot p(\mathcal{W})\, p_\theta(X_r|\mathcal{W})}{p_\theta(X_r)}$

Later, an efficient way of computing edge (word in context) posterior accuracies using word lattices will be presented.

Outline

Introduction

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics
  Forward/Backward Probabilities on Word Lattices
  Generalized FB Probabilities on WFSTs

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept

Conclusions

Annex

Forward/Backward Probabilities on Word Lattices

- Let $\omega_s(\mathcal{W})$ and $\omega_e(\mathcal{W})$ be the first and last edge of a continuous edge sequence $\mathcal{W}$ on a word lattice.
- Assume that the lattice fully encodes the language model context:

  $p(W(\mathcal{W})) = p(\mathcal{W} = \omega_1^N) = \prod_{n=1}^{N} p(\omega_n|\omega_{n-1})$

Let $\omega_{ri}$ and $\omega_{rf}$ be the initial and final edges of a word lattice for utterance $r$. Then define the following forward ($\Phi$) and backward ($\Psi$) probabilities on initial and final partial edge sequences on the word lattice, respectively:

  $\Phi(\omega) = \sum_{\substack{\mathcal{W}: \omega_s(\mathcal{W}) = \omega_{ri} \\ \omega_e(\mathcal{W}) = \omega}} p(\mathcal{W})\, p_\theta\bigl(x_{r,1}^{t_e(\mathcal{W})}\big|\mathcal{W}\bigr)$

  $\Psi(\omega) = \sum_{\substack{\mathcal{W}: \omega_s(\mathcal{W}) = \omega \\ \omega_e(\mathcal{W}) = \omega_{rf}}} p(\mathcal{W})\, p_\theta\bigl(x_{r,t_s(\mathcal{W})}^{T_r}\big|\mathcal{W}\bigr)$

Forward/Backward Probabilities on Word Lattices

For the forward probability, a recursion formula can be derived by separating the last edge from the edge sequence in the summation, with $\prec$ denoting direct predecessor edges:

  $\Phi(\omega) = \sum_{\substack{\mathcal{W}: \omega_s(\mathcal{W}) = \omega_{ri} \\ \omega_e(\mathcal{W}) = \omega}} p(\mathcal{W})\, p_\theta\bigl(x_{r,1}^{t_e(\mathcal{W})}\big|\mathcal{W}\bigr)$

  $= \sum_{\omega' \prec \omega} \; \sum_{\substack{\mathcal{W}': \omega_s(\mathcal{W}') = \omega_{ri} \\ \omega_e(\mathcal{W}') = \omega'}} p(\mathcal{W}')\, p(\omega|\omega')\, p_\theta\bigl(x_{r,1}^{t_e(\mathcal{W}')}\big|\mathcal{W}'\bigr)\, p_\theta\bigl(x_{r,t_s(\omega)}^{t_e(\omega)}\big|\omega\bigr)$

  $= \sum_{\omega' \prec \omega} \Phi(\omega')\, p(\omega|\omega')\, p_\theta\bigl(x_{r,t_s(\omega)}^{t_e(\omega)}\big|\omega\bigr)$

Using this recursion formula, the forward probabilities can be calculated efficiently on word lattices.

Forward/Backward Probabilities on Word Lattices

Similar to the forward probabilities, a recursion formula can be derived for efficient calculation of the backward probabilities, with $\succ$ denoting direct successor edges:

  $\Psi(\omega) = \sum_{\substack{\mathcal{W}: \omega_s(\mathcal{W}) \succ \omega \\ \omega_e(\mathcal{W}) = \omega_{rf}}} p(\mathcal{W})\, p_\theta\bigl(x_{r,t_s(\omega)}^{T_r}\big|\omega\mathcal{W}\bigr)$

  $= \sum_{\omega' \succ \omega} \; \sum_{\substack{\mathcal{W}': \omega_s(\mathcal{W}') \succ \omega' \\ \omega_e(\mathcal{W}') = \omega_{rf}}} p(\omega'|\omega)\, p(\mathcal{W}')\, p_\theta\bigl(x_{r,t_s(\omega)}^{t_e(\omega)}\big|\omega\bigr)\, p_\theta\bigl(x_{r,t_s(\omega')}^{T_r}\big|\omega'\mathcal{W}'\bigr)$

  $= \sum_{\omega' \succ \omega} p_\theta\bigl(x_{r,t_s(\omega)}^{t_e(\omega)}\big|\omega\bigr)\, p(\omega'|\omega)\, \Psi(\omega')$

Page 75: Introduction to Discriminative Training in Speech …...2010/01/14  · Schluter: Introduction to Discriminative Training in Speech Recognition 4 January 14, 2010 Motivation I In practice,

Forward/Backward Probabilities on Word Lattices

Using the forward and backward probabilities, the edge/word posterior on a word lattice can be written as

    p(ω|Xr) = Φ(ω) · ∑_{ω′≻ω} p(ω′|ω) · Ψ(ω′) / Φ(ω_rf)

with pθ(Xr) = Φ(ω_rf) = Ψ(ω_ri).

Word posterior probabilities follow naturally from MPE and similar discriminative training criteria. They also are the basis for confidence measures, which are used for unsupervised training, adaptation, or dialog systems. They are also part of approximate approaches to Bayes' decision rule with word error cost, like confusion networks [Mangu+ 1999], or minimum frame word error [Wessel+ 2001a, Hoffmeister+ 2006].
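A short companion sketch to the forward pass above (again illustrative, not from the slides): given the forward and backward log-probabilities and each edge's list of (successor, LM log-probability) pairs, the edge posterior of the formula above is one normalized combination per edge. The 'succs' field is an assumption mirroring 'preds'; logsumexp is reused from the earlier sketch.

def edge_log_posteriors(edges, log_phi, log_psi, log_p_xr):
    """Edge posteriors log p(omega | X_r) from forward/backward log-probabilities.

    'succs' is a list of (succ_id, log_lm) pairs with log_lm = log p(omega' | omega);
    the final edge omega_rf has an empty list.
    log_p_xr is log Phi(omega_rf) (= log Psi(omega_ri) = log p_theta(X_r)).
    """
    log_post = {}
    for edge in edges:
        if edge['succs']:
            tail = logsumexp([log_psi[s] + lm for s, lm in edge['succs']])
        else:
            tail = 0.0                 # final edge: nothing follows
        log_post[edge['id']] = log_phi[edge['id']] + tail - log_p_xr
    return log_post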


FB Probabilities: Generalization to WFSTs

Replace word lattice with WFST

I edge label: word with pronunciation

I weight of edge ω: p ← p(ω|ω′) · pθ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

I semiring: substitute arithmetic operations (multiplication, addition, inversion) with operations of the probability semiring

Semiring     | IK  | p ⊕ p′ | p ⊗ p′ | 0 | 1 | inv(p)
probability  | IR+ | p + p′ | p · p′ | 0 | 1 | 1/p

Example WFST from SieTill, Wr = "drei sechs neun" (in red)

[Figure: pruned example WFST with nodes 0-9; edge labels of the form word/pronunciation/weight, e.g. "drei /3//−384", "sechs /6//−1014", "neun /9//−480", "[SIL] /si//−2618", plus competing words such as "zwei", "fünf", "sieben", "null"]


FB Probabilities: Generalization to WFSTs

Forward probabilities (pre(ω) ∈ W such that pre(ω) ≺ ω):

    Φ(ω) := ⊕_{W: ω_s(W)=ω_ri, ω_e(W)=ω}  ⊗_{ω∈W} p(ω|pre(ω)) ⊗ pθ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

          = ⊕_{ω′≺ω} Φ(ω′) ⊗ p(ω|ω′) ⊗ pθ(x_{r,t_s(ω)}^{t_e(ω)} | ω)

Backward probabilities: similar

Using the forward and backward probabilities, the edge posterior on a WFST Xr can be written as

    p(ω|Xr) = Φ(ω) ⊗ ( ⊕_{ω′≻ω} p(ω′|ω) ⊗ Ψ(ω′) ) ⊗ inv(Φ(ω_rf))


Expectation Semiring

vector weight (p, v) of edge ω with

I p ← p(ω|ω′) · pθ(x_{r,t_s(ω)}^{t_e(ω)} | ω)
I v ← A(ω) · p
I accuracy A(ω) of edge ω such that ∑_{ω∈W} A(ω) = A(W, Wr)
  I approximate phone accuracy [Povey & Woodland 2002] can be decomposed in this way
  I such a decomposition is not possible in general

expectation semiring [Eisner 2001]: vector semiring whose first component is a probability semiring

Semiring    | IK       | (p,v) ⊕ (p′,v′) | (p,v) ⊗ (p′,v′)     | 0     | 1     | inv(p,v)
expectation | IR+ × IR | (p+p′, v+v′)    | (p·p′, p·v′ + p′·v) | (0,0) | (1,0) | (1/p, −v/p²)
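To make the table concrete, here is a minimal sketch (not from the slides) of the expectation semiring operations on pairs (p, v); running the generalized forward/backward recursion with ⊕/⊗ replaced by these operations accumulates the probability-weighted accuracy in the v-component.

# expectation semiring elements are pairs (p, v)
ZERO = (0.0, 0.0)   # additive identity
ONE  = (1.0, 0.0)   # multiplicative identity

def eplus(a, b):
    """(p, v) (+) (p', v') = (p + p', v + v')"""
    return (a[0] + b[0], a[1] + b[1])

def etimes(a, b):
    """(p, v) (x) (p', v') = (p * p', p * v' + p' * v)"""
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def einv(a):
    """inv(p, v) = (1/p, -v/p^2)"""
    p, v = a
    return (1.0 / p, -v / (p * p))

def edge_weight(p, accuracy):
    """Edge weight (p, v) with p = p(w|w') * p_theta(x segment | w) and v = A(w) * p."""
    return (p, accuracy * p)

# combining two edges multiplies the probabilities and accumulates the
# probability-weighted accuracy in the v-component, e.g.
# etimes(edge_weight(0.5, 1.0), edge_weight(0.2, 0.0)) == (0.1, 0.1)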


Edge Posteriors & Expectation Semiring

probability semiring
I word posterior probabilities (see MMI derivative) identical to edge posteriors using the probability semiring

    p(ω|Xr) = p_probability(ω|Xr)

I intuitive and classical result [Rabiner 1989]

expectation semiring
I word posterior accuracies (see MPE derivative) identical to the v-component of edge posteriors using the expectation semiring [Heigold+ 2008b]

    p(ω|Xr) = p_expectation,v(ω|Xr)

I also use this identity to efficiently calculate
  I derivative of unified training criterion
  I covariance between two random additive variables (related to MPE derivative)


Outline

Introduction

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling
  Transformation: Gaussian into Log-Linear Model
  Transformation from Log-Linear Model into Gaussian

Convex Optimization

Incorporation of Margin Concept

Conclusions

Annex


Definition of Models

assume feature vector x ∈ IR^D and class c ∈ {1, . . . , C}

Gaussian model N(x|μ_c, Σ_c) with
I means μ_c ∈ IR^D
I positive-definite covariance matrices Σ_c ∈ IR^{D×D}
I include priors p(c) ∈ IR+

induces posterior

    pθ(c|x) = p(c) · N(x|μ_c, Σ_c) / ∑_{c′} p(c′) · N(x|μ_{c′}, Σ_{c′})

Log-linear model with unconstrained parameters
I λ_{c0} ∈ IR
I λ_{c1} ∈ IR^D
I λ_{c2} ∈ IR^{D×D}

and posterior

    exp(x^T λ_{c2} x + λ_{c1}^T x + λ_{c0}) / ∑_{c′} exp(x^T λ_{c′2} x + λ_{c′1}^T x + λ_{c′0})


Transformation: Gaussian into Log-Linear Model

Comparison of the terms quadratic, linear, and constant in the observations x leads to the transformation rules [Saul & Lee 2002, Gunawardana+ 2005]:

  1. λ_{c2} = −(1/2) Σ_c^{−1}
  2. λ_{c1} = Σ_c^{−1} μ_c
  3. λ_{c0} = −(1/2) (μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c|) + log p(c)
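A small NumPy sketch of these transformation rules (illustrative only; function and variable names are assumptions, not from the slides). It maps per-class Gaussian parameters to the unconstrained log-linear parameters; a softmax over the resulting class scores reproduces the Gaussian posterior.

import numpy as np

def gaussian_to_loglinear(mu, Sigma, prior):
    """Map one class-conditional Gaussian N(x|mu, Sigma) with prior p(c)
    to unconstrained log-linear parameters (lambda_c2, lambda_c1, lambda_c0)."""
    Sigma_inv = np.linalg.inv(Sigma)
    lam2 = -0.5 * Sigma_inv                                        # rule 1
    lam1 = Sigma_inv @ mu                                          # rule 2
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma)
    lam0 = -0.5 * (mu @ Sigma_inv @ mu + logdet) + np.log(prior)   # rule 3
    return lam2, lam1, lam0

def loglinear_score(x, lam2, lam1, lam0):
    """Unnormalized log-score x^T lam2 x + lam1^T x + lam0;
    a softmax of these scores over the classes gives the posterior p(c|x)."""
    return x @ lam2 @ x + lam1 @ x + lam0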


Transformation: Log-Linear into Gaussian Model

I invert transformation from Gaussian to log-linear model:

  1. Σ_c = −(1/2) λ_{c2}^{−1}
  2. μ_c = Σ_c λ_{c1}
  3. p(c) = exp(λ_{c0} + (1/2) (μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c|))

I problem: parameter constraints not satisfied in general
  I covariance matrices Σ_c must be positive-definite
  I priors p(c) must be normalized
I solution: model parameters for the posterior are ambiguous, e.g. for Δλ_2 ∈ IR^{D×D}, Δλ_0 ∈ IR

    exp(x^T (λ_{c2}+Δλ_2) x + λ_{c1}^T x + (λ_{c0}+Δλ_0)) / ∑_{c′} exp(x^T (λ_{c′2}+Δλ_2) x + λ_{c′1}^T x + (λ_{c′0}+Δλ_0))
    = exp(x^T λ_{c2} x + λ_{c1}^T x + λ_{c0}) / ∑_{c′} exp(x^T λ_{c′2} x + λ_{c′1}^T x + λ_{c′0})


Transformation: Log-Linear into Gaussian Model

invert transformation rules for transformed log-linear model

  1. Σ_c = −(1/2) (λ_{c2} + Δλ_2)^{−1}
  2. μ_c = Σ_c λ_{c1}
  3. p(c) = exp((λ_{c0} + Δλ_0) + (1/2) (μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c|))

use additional degrees of freedom to impose the parameter constraints:

I choose Δλ_2 ∈ IR^{D×D} such that all λ_{c2} + Δλ_2 are negative-definite
I choose Δλ_0 such that p(c) is normalized, i.e.,

    Δλ_0 := −log ∑_c exp(λ_{c0} + (1/2) (μ_c^T Σ_c^{−1} μ_c + log |2πΣ_c|))
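A complementary sketch of the inverse direction (illustrative, not from the slides): shift all λ_{c2} by a class-independent Δλ_2 so that they become negative-definite, read off Σ_c and μ_c, and renormalize the priors. The eigenvalue-shift construction and the eps margin are assumptions for this example.

import numpy as np

def loglinear_to_gaussian(lam2_list, lam1_list, lam0_list, eps=1e-3):
    """Map unconstrained log-linear parameters back to Gaussians with priors.
    lam2_list/lam1_list/lam0_list hold one entry per class."""
    # class-independent shift: make every lam2 + dlam2 negative-definite
    max_eig = max(np.linalg.eigvalsh(l2).max() for l2 in lam2_list)
    dlam2 = -(max_eig + eps) * np.eye(lam2_list[0].shape[0])

    Sigmas, mus, log_unnorm = [], [], []
    for lam2, lam1, lam0 in zip(lam2_list, lam1_list, lam0_list):
        Sigma = -0.5 * np.linalg.inv(lam2 + dlam2)     # Sigma_c = -1/2 (lam2 + dlam2)^{-1}
        mu = Sigma @ lam1                              # mu_c = Sigma_c lam1
        _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma)
        log_unnorm.append(lam0 + 0.5 * (mu @ np.linalg.inv(Sigma) @ mu + logdet))
        Sigmas.append(Sigma)
        mus.append(mu)

    # dlam0 = -log sum_c exp(...) so that the priors sum to one
    log_unnorm = np.array(log_unnorm)
    priors = np.exp(log_unnorm - np.logaddexp.reduce(log_unnorm))
    return Sigmas, mus, priors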


Outline

Introduction

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization
  Motivation
  Convex Training Criteria in Speech Recognition
  Experimental Results

Incorporation of Margin Concept

Conclusions

Annex


Motivation

Conventional approach:

I depends on initialization and choice of optimization algorithm

I spurious local optima (non-convex training criterion)

I many heuristics required

I i.e., involves much engineering work

“Fool-proof” approach:

I unique optimum (independent of initialization)

I accessibility of global optimum (convex training criterion)

I joint optimization of all model parameters, no parameters to be tuned


Assumptions

Assumptions to cast HCRF into CRF

I log-linear parameterization, e.g. p(x|s) = exp(x^T λ_{s2} x + λ_{s1}^T x + λ_{s0}) and p(s|s′) = exp(α_{s′s})

I MMI-like training criterion

I alignment represents spoken sequence

I alignment of spoken sequence known and kept fixed

I use single densities with augmented features instead of mixtures

I exact normalization constant


Lattice-Based MMI

    F_lattice(λ) = ∑_r log [ ∑_{s_1^{T_r} ∈ N_r} p_λ(x_1^{T_r}, s_1^{T_r}) / ∑_{s_1^{T_r} ∈ D_r} p_λ(x_1^{T_r}, s_1^{T_r}) ]

I numerator word lattice N_r: state sequences s_1^{T_r} representing the correct hypothesis
I denominator word lattice D_r: correct and competing state sequences, use word pair approximation and pruning
I non-convex

word lattice:

[Figure: pruned example word lattice over digit-string hypotheses such as "fünf", "eins", "sechs", "acht", "zwei", "neun", "drei", and "[SIL]", with word boundary times]


Fool-Proof MMI

    F_fool(λ) = ∑_r log [ p_λ(x_1^{T_r}, s_1^{T_r}) / ∑_{s_1^{T_r} ∈ S_r} p_λ(x_1^{T_r}, s_1^{T_r}) ]

I consider only the best state sequence s_1^{T_r} in the numerator, kept fixed
I sum over the full state sequence network S_r in the denominator
I convex

[Figure: part of the (pruned) HMM state network, shown as a transducer with HMM state indices and word labels such as "7:acht" and "18:acht"]


Frame-Based MMI

    F_frame(λ) = ∑_r ∑_{t=1}^{T_r} log [ p_λ(x_t, s_t) / ∑_{s=1}^S p_λ(x_t, s) ]

I frame discrimination, cf. hybrid approach
I assume alignment s_1^{T_r} for the numerator, kept fixed
I summation over all HMM states s ∈ {1, . . . , S} in the denominator
I convex
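Under the log-linear parameterization above, frame-based MMI is essentially a multiclass logistic-regression criterion per frame. The following NumPy sketch (illustrative; the feature map, variable names, and the fact that only criterion and gradient are shown are assumptions, not the optimization setup of the slides) computes the criterion and its gradient for fixed state alignments.

import numpy as np

def frame_mmi_and_grad(Lambda, feats, states):
    """Frame-based MMI for a log-linear model p_lambda(s|x) proportional to exp(Lambda[s] @ f(x)).

    Lambda : (S, D) parameter matrix (one row per HMM state)
    feats  : (T, D) feature vectors f(x_t) (e.g. augmented with x, quadratic terms, 1)
    states : (T,)  fixed numerator alignment s_1..s_T
    Returns the criterion value and its gradient w.r.t. Lambda.
    """
    scores = feats @ Lambda.T                                  # (T, S)
    scores -= scores.max(axis=1, keepdims=True)                # stabilize softmax
    log_post = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    value = log_post[np.arange(len(states)), states].sum()     # sum_t log p(s_t | x_t)

    post = np.exp(log_post)                                    # (T, S) denominator posteriors
    grad = np.zeros_like(Lambda)
    np.add.at(grad, states, feats)                             # numerator statistics
    grad -= post.T @ feats                                     # denominator statistics
    return value, grad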


Refinements to MMI

These refinements do not break convexity:

I ℓ2-regularization

I margin term


Initialization

Analyze effect of initial parameters on training.

I vary initialization for different training criteria
I experiments: digit strings (SieTill, German, telephone)

[Plots: training criterion F and WER [%] over the iteration index for frame-based, lattice-based, and fool-proof M-MMI, each initialized from scratch, from ML, and from the frame-based model]


Read Speech (WSJ)

I 5k-vocabulary, trigram language model

I phone-based HMMs, 1,500 CART-tied triphones

I audio data: 15h (training), 0.4h (test)

I log-linear model with kernel-like features f(x)
  I first-order (f_d(x) = x_d) and second-order (f_{dd′}(x) = x_d · x_{d′}) features
  I cluster features: assume GMM of marginal distribution, p(x) = ∑_l p(x, l), and set (see the sketch after this list)

        f_l(x) = p(l|x)   if p(l|x) ≥ threshold
                 0         otherwise

I starting from scratch (model) and linear segmentation

I frame-based MMI, with re-alignments

I details: [Wiesler+ 2009]
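A small sketch of the thresholded cluster (GMM posterior) features (illustrative only; scikit-learn's GaussianMixture stands in for whatever GMM estimator was actually used, and the default threshold value here is an arbitrary assumption).

import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_features(X_train, X, n_clusters=210, threshold=0.1):
    """Thresholded GMM-posterior ('cluster') features.

    Fit a GMM to the marginal feature distribution p(x) = sum_l p(x, l),
    then use f_l(x) = p(l|x) if p(l|x) >= threshold, else 0.
    """
    gmm = GaussianMixture(n_components=n_clusters, covariance_type='diag').fit(X_train)
    post = gmm.predict_proba(X)           # (N, n_clusters) posteriors p(l|x)
    post[post < threshold] = 0.0          # sparsify: keep only confident clusters
    return post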


Read Speech (WSJ)

Feature setup                                        | WER [%]
First order features, monophones                     | 22.7
+ second order features                              | 10.3
+ 210 cluster features + temporal context of size 9  |  6.2
+ 1,500 CART-tied HMM states (triphones)             |  3.9
+ realignment                                        |  3.6
GHMM (ML)                                            |  3.6
GHMM (MMI)                                           |  3.0


Outline

Introduction

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept
  Motivation
  Support Vector Machines (Hinge Loss)
  Smooth Approximation to SVM: Margin-MMI
  Support Vector Machines (Margin Error)
  Smooth Approximation to SVM: Margin-MPE
  Experimental Evaluation of Margin

Conclusions

Annex


Motivation

Goal: incorporation of margin term into conventional training criteria

I replace likelihoods p(W)p(X|W) with margin-likelihoods p(W)p(X|W) exp(−ρA(W,Wr))
I A(W,Wr): accuracy between hypothesis W and reference Wr
I interpretation (boosting): emphasize incorrect hypotheses by up-weighting
I interpretation (large margin): next slides

[Diagram: margin vs. low complexity task, different loss functions, convergence/local optima, different parameterization, optimization]

Margin in training is promising. Individual contribution of margin in LVCSR training?


Support Vector Machines (Hinge Loss)

Optimization problem for SVMs

    SVM(λ) = −(C/2) ‖λ‖² − ∑_{r=1}^R l(Wr, d_r; ρ)

I feature functions f(X,W), model parameters λ
I distance d_rW := λ^T (f(Xr,Wr) − f(Xr,W))
I hinge loss function
      l^(hinge)(Wr, d_r; ρ) := max_{W≠Wr} { max {−d_rW + ρ(A(Wr,Wr) − A(W,Wr)), 0} }
I ℓ2-regularization with constant C > 0

I [Altun+ 2003, Taskar+ 2003]


Smooth Approximation to SVM: Margin-MMI

Margin-based/modified MMI (M-MMI)

    F_{M-MMI,γ}(λ) = −(C/2) ‖λ‖²
                     + ∑_{r=1}^R (1/γ) log ( exp(γ(λ^T f(Xr,Wr) − ρA(Wr,Wr))) / ∑_W exp(γ(λ^T f(Xr,W) − ρA(W,Wr))) )

Lemma: F_{M-MMI,γ} → SVM_hinge for γ → ∞ (pointwise convergence).

I [Heigold+ 2008b]


Proof

    ΔA(W,Wr) := A(Wr,Wr) − A(W,Wr)

    −(1/γ) log [ exp(γ(λ^T f(Xr,Wr) − ρA(Wr,Wr))) / ∑_W exp(γ(λ^T f(Xr,W) − ρA(W,Wr))) ]

    = (1/γ) log [ 1 + ∑_{W≠Wr} exp(γ(−d_rW + ρΔA(W,Wr))) ]

    →(γ→∞)   max_{W≠Wr} {−d_rW + ρΔA(W,Wr)}   if ∃ W≠Wr: d_rW < ρΔA(W,Wr)
              0                                 otherwise

    = max_{W≠Wr} { max {−d_rW + ρΔA(W,Wr), 0} }

    =: l^(hinge)(Wr, d_r; ρ).
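A tiny numeric check of this limit (purely illustrative, not from the slides): for a toy set of competing hypotheses, the smoothed term (1/γ) log(1 + Σ exp(·)) approaches the hinge loss as γ grows. The toy distances and accuracy differences below are made up.

import numpy as np

def smoothed_loss(d, delta_a, rho, gamma):
    """(1/gamma) * log(1 + sum_W exp(gamma * (-d_W + rho * deltaA_W)))"""
    z = gamma * (-d + rho * delta_a)
    return np.log1p(np.exp(z).sum()) / gamma

def hinge_loss(d, delta_a, rho):
    return max(np.max(-d + rho * delta_a), 0.0)

d = np.array([1.5, 0.3, 2.0])        # distances d_rW to three competitors
dA = np.array([1.0, 2.0, 1.0])       # accuracy differences DeltaA(W, Wr)
for gamma in (1, 10, 100, 1000):
    print(gamma, smoothed_loss(d, dA, rho=0.5, gamma=gamma))
print('hinge', hinge_loss(d, dA, rho=0.5))   # the smoothed values approach 0.7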


Support Vector Machines (Margin Error)

Optimization problem for SVMs

    SVM(λ) = −(C/2) ‖λ‖² − ∑_{r=1}^R l(Wr, d_r; ρ)

I feature functions f(X,W), model parameters λ
I distance d_rW := λ^T (f(Xr,Wr) − f(Xr,W))
I margin error loss function
      l^(error)(Wr, d_r; ρ) := E( A( argmin_W [d_rW + ρA(W,Wr)], Wr ) )
I ℓ2-regularization with constant C > 0

I [Heigold+ 2008b]


Smooth Approximation to SVM: Margin-MPE

Margin-based/modified MPE (M-MPE)

    F_{M-MPE,γ}(λ) = −(C/2) ‖λ‖²
                     + ∑_{r=1}^R ∑_W E(W,Wr) · ( exp(γ(λ^T f(Xr,W) − ρA(W,Wr))) / ∑_V exp(γ(λ^T f(Xr,V) − ρA(V,Wr))) )

Lemma: F_{M-MPE,γ} → SVM_error for γ → ∞.

I [Heigold+ 2008b]


Experimental Evaluation of Margin

Digit strings (SieTill, German, telephone)

dns/state | feature orders         | # param | criterion | WER [%]
1         | first                  | 11k     | ML        | 3.8
          |                        |         | MMI       | 2.9
          |                        |         | M-MMI     | 2.7
64        | first                  | 690k    | ML        | 1.8
          |                        |         | MMI       | 1.8
          |                        |         | M-MMI     | 1.6
1         | first, second, third   | 1,409k  | Frame     | 1.8
          |                        |         | MMI       | 1.7
          |                        |         | M-MMI     | 1.5


Experimental Evaluation of Margin

European Parliament plenary sessions in English (EPPS) and Mandarin broadcasts

                          WER [%]
Criterion | EPPS En 90h | Mandarin BN/BC 230h | Mandarin BN/BC 1500h
ML        | 12.0        | 21.9                | 17.9
MMI       |             | 20.8                |
M-MMI     |             | 20.6                |
MPE       | 11.5        | 20.6                | 16.5
M-MPE     | 11.3        | 20.3                | 16.3


Experimental Evaluation of Margin

Handwriting Recognition (IFN/ENIT)

I isolated town names, handwritten

I choose slice features to use 1D HMM

I details: see [Dreuw+ 2009]

              WER [%]
Criterion | abc-d | abd-c | acd-b | bcd-a | abcd-e
ML        | 7.8   | 8.7   | 7.8   | 8.7   | 16.8
MMI       | 7.4   | 8.2   | 7.6   | 8.4   | 16.4
M-MMI     | 6.1   | 6.8   | 6.1   | 7.0   | 15.4


Outline

Introduction

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept

Conclusions
  Effective Discriminative Training

Annex


Effective Discriminative Training

I Discriminative Criteria
  I fit decision rule: minimize training error
  I limit overfitting: include regularization and margin to exploit remaining degrees of freedom of the parameters
I Optimization Methods
  I general purpose methods give robust estimates
  I in convex case gradient descent still faster than growth transform (GIS)
I Log-Linear Modeling
  I convex (w/o hidden variables)
  I covers Gaussians completely, w/o constraints on e.g. variance
  I opens modeling up to arbitrary features
  I initialization: from scratch or from Gaussians
I Estimation of Statistics
  I efficiency: use word lattice to represent competing word sequences
  I implementation: generic approach using WFSTs, covers class of criteria


Thanks for your attention!


Outline

Introduction

Training Criteria

Parameter Optimization

Efficient Calculation of Discriminative Statistics

Generalisation to Log-Linear Modeling

Convex Optimization

Incorporation of Margin Concept

Conclusions

Annex
  References
  Speech Tasks: Corpus Statistics & Setups
  Handwriting Recognition Tasks


References

Y. Altun, I. Tsochantaridis, and T. Hofmann, "Hidden Markov support vector machines," in International Conference on Machine Learning (ICML) 2003, Washington, DC, USA, 2003.

L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition," Proc. 1986 Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, pp. 49-52, Tokyo, Japan, May 1986.

P. F. Brown, The Acoustic-Modeling Problem in Automatic Speech Recognition, Ph.D. thesis, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 1987.

W. Chou, B.-H. Juang, and C.-H. Lee, "Segmental GPD training of HMM based speech recognizer," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1992, San Francisco, CA, USA, March 1992, pp. 473-476.

Y.-L. Chow, "Maximum Mutual Information Estimation of HMM Parameters for Continuous Speech Recognition using the N-best Algorithm," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 701-704, Albuquerque, NM, April 1990.

P. Dreuw, G. Heigold, and H. Ney, "Confidence-based discriminative training for model adaptation in offline Arabic handwriting recognition," in International Conference on Document Analysis and Recognition (ICDAR), Barcelona, Spain, July 2009.

J. Eisner, "Expectation semirings: Flexible EM for finite-state transducers," in International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP), Helsinki, Finland, August 2001.

P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo, "An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems," IEEE Transactions on Information Theory, Vol. 37, No. 1, pp. 107-113, January 1991.


A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, "Hidden conditional random fields for phone classification," in Interspeech, pp. 117-120, Lisbon, Portugal, Sept. 2005.

G. Heigold, W. Macherey, R. Schlüter, and H. Ney, "Minimum Exact Word Error Training," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 186-190, San Juan, Puerto Rico, November 2005.

G. Heigold, T. Deselaers, R. Schlüter, H. Ney, "GIS-like Estimation of Log-Linear Models with Hidden Variables," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4045-4048, Las Vegas, NV, USA, April 2008.

G. Heigold, T. Deselaers, R. Schlüter, H. Ney, "Modified MMI/MPE: A direct evaluation of the margin in speech recognition," in International Conference on Machine Learning (ICML), pp. 384-391, Helsinki, Finland, July 2008.

G. Heigold, A Log-Linear Modeling Framework for Speech Recognition, Doctoral Thesis to be submitted, RWTH Aachen University, Aachen, Germany, 2010.

B. Hoffmeister, T. Klein, R. Schlüter, and H. Ney, "Frame Based System Combination and a Comparison with Weighted ROVER and CNC," in Proc. Interspeech, pp. 537-540, Pittsburgh, PA, USA, September 2006.

B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, Vol. 40, No. 12, pp. 3043-3054, 1992.

D. Kanevsky, "Extended Baum-Welch transformations for general functions," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 821-824, Montreal, Quebec, Canada, May 2004.

W. Macherey, L. Haferkamp, R. Schlüter, H. Ney, "Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition," in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, September 2005.


W. Macherey, "Discriminative training and acoustic modeling for automatic speech recognition," Ph.D. thesis to be submitted, RWTH Aachen University, 2010.

M. Mahajan, A. Gunawardana, A. Acero, "Training algorithms for hidden conditional random fields," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.

L. Mangu, E. Brill, A. Stolcke, "Finding Consensus Among Words: Lattice-Based Word Error Minimization," Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 495-498, Budapest, Hungary, Sept. 1999.

E. McDermott, S. Katagiri, "Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.

E. McDermott, T. Hazen, J. Le Roux, A. Nakamura, S. Katagiri, "Discriminative training for large vocabulary speech recognition using Minimum Classification Error," IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, No. 1, pp. 203-223, April 2007.

Y. Normandin, "Hidden Markov Models, Maximum Mutual Information, and the Speech Recognition Problem," Ph.D. thesis, McGill University, Montreal, Canada, 1991.

D. Povey and P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2002, Orlando, FL, May 2002, Vol. 1, pp. 105-108.

L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The Rprop algorithm," in IEEE International Conference on Neural Networks (ICNN) 1993, San Francisco, CA, USA, 1993, pp. 586-591.


L. Saul and D. Lee, "Multiplicative updates for classification by mixture models," in T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), MIT Press, 2002.

R. Schlüter, B. Müller, F. Wessel, and H. Ney, "Interdependence of Language Models and Discriminative Training," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. 1, pp. 119-122, Keystone, CO, December 1999.

R. Schlüter, Investigations on Discriminative Training Criteria, Doctoral Thesis, RWTH Aachen University, Aachen, Germany, Sept. 2000.

R. Schlüter, H. Ney, "Model-based MCE Bound to the True Bayes' Error," IEEE Signal Processing Letters, Vol. 8, No. 5, pp. 131-133, May 2001.

B. Taskar, C. Guestrin, and D. Koller, "Max-margin Markov networks," in Advances in Neural Information Processing Systems (NIPS) 2003, 2003.

V. Valtchev, Discriminative Methods in HMM-based Speech Recognition, Ph.D. thesis, St. John's College, University of Cambridge, Cambridge, March 1995.

V. Valtchev, J. J. Odell, P. C. Woodland, S. J. Young, "MMIE Training of Large Vocabulary Recognition Systems," Speech Communication, Vol. 22, No. 4, pp. 303-314, September 1997.

F. Wessel, K. Macherey, and R. Schlüter, "Using word probabilities as confidence measures," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1998, Seattle, WA, USA, May 1998, pp. 225-228.

F. Wessel, R. Schlüter, and H. Ney, "Explicit Word Error Minimization using Word Hypothesis Posterior Probabilities," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 33-36, Salt Lake City, Utah, May 2001.


S. Wiesler, et al., "Investigations on features for log-linear acoustic models in continuous speech recognition," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, Dec. 2009.

P. C. Woodland and D. Povey, "Large scale discriminative training for speech recognition," in Automatic Speech Recognition (ASR) 2000, Paris, France, September 2000, pp. 7-16.

P. C. Woodland, D. Povey, "Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition," Computer Speech and Language, Vol. 16, No. 1, pp. 25-48, 2002.


Speech Tasks: Corpus Statistics & Setups

Identifier       | Description                                  | Train/Test [h]
SieTill          | German digit strings                         | 11 / 11 (Test)
EPPS En          | English European Parliament plenary speech   | 92 / 2.9 (Evl07)
BNBC Cn 230h     | Mandarin broadcasts                          | 230 / 2.2 (Evl06)
BNBC Cn 1500h    | Mandarin broadcasts                          | 1,500 / 2.2 (Evl06)

Identifier       | #States/#Dns | Features
SieTill          | 430/27k      | 25 LDA(MFCC)
EPPS En          | 4,500/830k   | 45 LDA(MFCC+voicing)+VTLN+SAT/CMLLR
BNBC Cn 230h     | 4,500/1,100k | 45 LDA(MFCC)+3 tones+VTLN+SAT/CMLLR
BNBC Cn 1500h    | 4,500/1,200k | 45 SAT/CMLLR(PLP+voicing+3 tones+32 NN)+VTLN


Handwriting Recognition Tasks

IFN/ENIT:

I isolated Tunisian town names

I 4 training folds + 1 additional fold for testing

I simple appearance-based image slice features

I each fold comprises approximately 500,000 frames

             #Observations [k]
Corpus       | Towns | Frames
IFN/ENIT a   | 6.5   | 452
IFN/ENIT b   | 6.7   | 459
IFN/ENIT c   | 6.5   | 452
IFN/ENIT d   | 6.7   | 451
IFN/ENIT e   | 6.0   | 404
