

Available online at www.sciencedirect.com

Computer Speech and Language 22 (2008) 130–147

www.elsevier.com/locate/csl

On maximum mutual information speaker-adapted training

John McDonough a,b,*, Matthias Wölfel c, Emilian Stoimenov c

a Spoken Language Systems, Saarland University, D-66123 Saarbrücken, Germany
b Intelligent Sensor-Actuator Systems, Universität Karlsruhe, Kaiserstrasse 12, D-76128 Karlsruhe, Germany

c Institut für Theoretische Informatik, Universität Karlsruhe, Am Fasanengarten 5, D-76131 Karlsruhe, Germany

Received 6 October 2006; received in revised form 10 June 2007; accepted 22 June 2007
Available online 3 August 2007

Abstract

In this work, we combine maximum mutual information parameter estimation with speaker-adapted training (SAT). As will be shown, this can be achieved by performing unsupervised estimation of speaker adaptation parameters on the test data, a distinct advantage for many recognition tasks involving conversational speech. We derive re-estimation formulae for the basic speaker-independent means and variances, the optimal regression class for each Gaussian component when multiple speaker-dependent linear transforms are used for adaptation, as well as the optimal feature-space transformation matrix for use with semi-tied covariance matrices. We also propose an approximation to the maximum likelihood and maximum mutual information SAT re-estimation formulae that greatly reduces the amount of disk space required to conduct training on corpora which contain speech from hundreds or thousands of speakers. We also present empirical evidence of the importance of combining speaker adaptation with discriminative training. In particular, on a subset of the data used for the NIST RT05 evaluation, we show that including maximum likelihood linear regression transformations in the MMI re-estimation formulae provides a WER of 35.2% compared with 39.1% obtained when speaker adaptation is ignored during discriminative training.
© 2007 Published by Elsevier Ltd.

Keywords: Speech recognition; Speaker-adapted training; Discriminative training

1. Introduction

Since the publications of Digalakis et al. (1995) and Leggetter and Woodland (1995), linear transform techniques have become extremely popular for performing speaker adaptation on continuous density hidden Markov models (HMMs). A good summary of many of the refinements of linear regression-based speaker adaptation is given by Gales (1998). Other work proposes a set of linear transforms that are specified by very few free parameters, and hence able to be robustly estimated with little speaker-dependent enrollment data

0885-2308/$ - see front matter © 2007 Published by Elsevier Ltd.
doi:10.1016/j.csl.2007.06.005

* Corresponding author. Address: Institut für Theoretische Informatik, Universität Karlsruhe, Am Fasanengarten 5, D-76131 Karlsruhe, Germany.

E-mail addresses: [email protected] (J. McDonough), [email protected] (M. Wölfel), [email protected] (E. Stoimenov).


(McDonough and Waibel, 2004; McDonough et al., 2004). Regardless of the precise formulation of the linear transform, these techniques, until very recently, have been based on maximum likelihood (ML) parameter estimation. This is also true of the technique of speaker-adapted training (SAT) introduced by Anastasakos et al. (1996), in which transform parameters are estimated for all speakers in a training set, and then used during the re-estimation of the speaker-independent means and variances.

Gopalakrishnan et al. (1991) first proposed a practical technique for performing maximum mutual information (MMI) training of hidden Markov models, and commented on the fact that MMI- is superior to ML-based parameter estimation given that the amount of available training data is always limited, and that the HMM is not the actual model of speech production. Gopalakrishnan's development was subsequently extended by Normandin (1991) to the case of continuous density HMMs. While these initial works were of theoretical interest, for several years it was believed that the marginal performance gains that could be obtained with MMI did not justify the increase in computational effort it entailed with respect to ML training. This changed when Woodland and Povey (2000) discovered that these gains could be greatly increased by scaling all acoustic log-likelihoods during training. Since the publication of Woodland and Povey (2000), MMI training has enjoyed a spate of renewed interest and a concomitant flurry of publications, including Uebel and Woodland (2001), in which an MMI criterion is used for estimating linear regression parameters, and Zheng et al. (2001), in which different update formulae are proposed for the standard MMI-based mean and covariance re-estimation. Schlüter (2000) expounds a unifying framework for a variety of popular discriminative training techniques. Gunawardana (2001) sets forth a much simplified derivation of Normandin's original continuous density re-estimation formulae which does not require the discrete density approximations Normandin used. While Gunawardana's original proof contained a fallacy, a correct proof of Gunawardana's main result was provided by Axelrod et al. (2004).

In the still more recent past, a great deal of research effort has been devoted to the development of optimization criteria and training paradigms for discriminative training of hidden Markov models. Povey and Woodland (2002) explored the notion of using minimum word or phone error as an optimization criterion during parameter estimation, and found that this provided a significant reduction in WER with respect to MMI training. Povey et al. (2005) then extended this notion to the estimation of feature transformations. More recently, several authors have applied the concept of large margin estimation, on which the field of support vector machines is based, to discriminative training of HMMs. Representative works in this direction are those of Liu et al. (2005) and Sha and Saul (2006). Another approach to discriminative training is based on the concept of conditional random fields, which model the conditional probability of an entire state sequence given a sequence of observations; this approach has been explored by Gunawardana et al. (2005) and Mahajan et al. (2006).

Although largely ignored in the literature, the use of speaker adaptation during discriminative training has a large effect on final system performance, as shown in (McDonough et al., 2007). In this work, we make a full exposition of the steps necessary to perform speaker-adapted training under an MMI criterion. As will be shown, this can be achieved by performing unsupervised parameter estimation on the test data, a distinct advantage for many recognition tasks involving conversational speech. In (McDonough et al., 2002), we used Gunawardana's theorem to derive re-estimation formulae for the speaker-independent (SI) means and variances of a continuous density HMM when SAT is conducted on the latter under an MMI criterion. We also proposed an approximation to the basic SAT re-estimation formulae that greatly reduces the amount of disk space required to conduct training. In this work, we provide the mathematical details omitted in previous publications for reasons of brevity. We also derive re-estimation formulae with which the regression class of each Gaussian component can be optimally assigned.

Tsakalidis et al. (2002) presented a scheme for MMI-SAT in which all parameters, including the speaker-dependent (SD) adaptation parameters, were estimated with an MMI criterion. The same authors also presented a technique for performing MMI estimation of a transformation matrix suitable for use with the semi-tied covariance (STC) matrices proposed by Gales (1999), but commented on the tendency of their technique to produce a transformation that was ''effectively identity.''

In this work, we also present re-estimation formulae for STC transformation matrices based on an MMI criterion. As will be shown, these re-estimation formulae are different from those in (Tsakalidis et al., 2002), and the transformation matrices therewith obtained are distinctly different from the identity. Moreover, we


present a positive definiteness criterion, with which the regularization constant present in all MMI re-estimation formulae can be reliably set to provide both consistent improvements in the total mutual information of the training set, as well as fast convergence. We also combine the STC re-estimation formulae with their counterparts for the SI means and variances, and update all parameters during MMI-SAT. To demonstrate the effectiveness of the proposed techniques, we present results on a subset of the data used for the NIST RT05 evaluation.

This work is intended to provide a complete exposition of the re-estimation formulae required for speaker-adapted training under an MMI criterion. We also present sufficient empirical evidence to demonstrate the overriding importance in terms of final system performance of incorporating speaker adaptation into discriminative training, so as to achieve a matched condition between training and test. The balance of this work is organized as follows. In Section 2, we briefly describe the basic SAT mean and covariance re-estimation formulae for ML-based training, and introduce a novel way to reduce the amount of hard disk space required to conduct this training. We then present re-estimation formulae for SAT re-estimation of means of a continuous-density HMM under an MMI criterion. The re-estimation formulae for HMM covariances, regression class assignment of each Gaussian, as well as the transformation matrices used with STC are quoted in the balance of Section 2. Full derivations of these remaining estimation formulae are provided in Appendix A. In Section 3, we present the results of a set of experiments illustrating both the effectiveness of the HMM training techniques proposed here, as well as the importance of including speaker adaptation in discriminative training. Finally, in Section 4, we summarize our efforts, and present plans for further work.

2. Maximum mutual information estimation

We begin this section with a summary of the ML-SAT formulae, then show how they can be modified to support MMI-SAT estimation.

2.1. ML-SAT estimation formulae

The re-estimation formulae used in maximum likelihood speaker-adapted training (ML-SAT) have appeared previously in the literature (Anastasakos et al., 1996). We summarize them here only as an aid for our subsequent discussion. Assume we wish to estimate the kth mean $\mu_k$ and diagonal covariance matrix

$$D_k = \mathrm{diag}\{\sigma^2_{k,0}\ \sigma^2_{k,1}\ \cdots\ \sigma^2_{k,N-1}\}$$

of a continuous density hidden Markov model. Let $s$ be an index over all speakers in the training set, and let $x_t^{(s)}$ denote the tth observation from speaker $s$. Also let $c_{k,t}^{(s)}$ denote the posterior probability that $x_t^{(s)}$ was drawn from the kth Gaussian in the HMM whose parameters we wish to re-estimate. Let us define the quantities

$$c_k^{(s)} = \sum_t c_{k,t}^{(s)} \qquad (2.1)$$

$$o_k^{(s)} = \sum_t c_{k,t}^{(s)} x_t^{(s)} \qquad (2.2)$$

$$s_k^{(s)} = \sum_t c_{k,t}^{(s)} x_t^{(s)2} \qquad (2.3)$$

where $x_t^{(s)2}$ denotes the vector obtained by squaring $x_t^{(s)}$ component-wise. The statistics (2.1)–(2.3) are typically accumulated during forward–backward training (Deller et al., 1993, Chapter 12). The ML re-estimation formula for $\mu_k$ is then given by

$$\mu_k = M_k^{-1} v_k \qquad (2.4)$$

where

$$M_k = \sum_s c_k^{(s)}\, A^{(s)T} D_k^{-1} A^{(s)} \qquad (2.5)$$

$$v_k = \sum_s A^{(s)T} D_k^{-1} \left( o_k^{(s)} - c_k^{(s)} b^{(s)} \right) \qquad (2.6)$$


and $(A^{(s)}, b^{(s)})$ are the matrix and bias components for speaker $s$ obtained with maximum likelihood linear regression (Leggetter and Woodland, 1995). Moreover, the nth diagonal covariance component of the kth Gaussian can be re-estimated from

$$\sigma_{kn}^2 = \frac{1}{c_k} \sum_s \left( s_{kn}^{(s)} - 2\, o_{kn}^{(s)} \mu_{kn}^{(s)} + c_k^{(s)} \mu_{kn}^{(s)2} \right) \qquad (2.7)$$

where $\mu_{kn}^{(s)}$ is the nth component of $\mu_k^{(s)} = A^{(s)} \mu_k + b^{(s)}$ and $c_k = \sum_s c_k^{(s)}$.

A typical implementation of SAT requires writing out the quantities $c_k^{(s)}$, $o_k^{(s)}$, and $s_k^{(s)}$, in addition to $(A^{(s)}, b^{(s)})$, for every speaker in the training set. A moment's thought will reveal, however, that $M_k$ can be accumulated from only $\{(A^{(s)}, b^{(s)})\}$ and $\{c_k^{(s)}\}$. Moreover, the sum $v_k$ can be accumulated for all speakers and written to disk just once per training iteration, or else once per iteration for each processor used in parallel forward–backward training, after which these partial sums can be added together. For a training corpus with several thousand speakers such as Broadcast News (BN), this results in a tremendous savings in the disk space required for SAT, as first noted by Matsoukas et al. (1996). Although the sum in (2.7) requires that the newly-updated mean $\mu_k$ be used in calculating $\mu_{kn}^{(s)}$, experience has proven that the re-estimation works just as well if the prior value of $\mu_k$ is used instead, in which case the partial sums in (2.7) can also be written out just once for each parallel processor. This novel and useful approximation has been dubbed fast¹ speaker-adapted training (FSAT).
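The accumulate-and-solve structure of (2.4)–(2.6) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `ml_sat_mean_update` and the per-speaker statistics layout are assumptions made here for clarity.

```python
import numpy as np

def ml_sat_mean_update(stats, D_k):
    """ML-SAT mean update (2.4)-(2.6): mu_k = M_k^{-1} v_k.

    stats: iterable of per-speaker tuples (c_k, o_k, A, b), where c_k is
    the scalar occupancy (2.1), o_k the weighted observation sum (2.2),
    and (A, b) the speaker's MLLR transform.  D_k holds the diagonal of
    the covariance matrix.
    """
    n = len(D_k)
    D_inv = np.diag(1.0 / np.asarray(D_k))
    M_k = np.zeros((n, n))
    v_k = np.zeros(n)
    for c_k, o_k, A, b in stats:
        M_k += c_k * A.T @ D_inv @ A           # Eq. (2.5)
        v_k += A.T @ D_inv @ (o_k - c_k * b)   # Eq. (2.6)
    return np.linalg.solve(M_k, v_k)           # Eq. (2.4)
```

With identity transforms (A = I, b = 0) the update collapses to the usual ML mean, the total weighted observation sum divided by the total occupancy, which provides a quick sanity check.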

2.2. MMI-SAT estimation formulae

The re-estimation formulae for maximum mutual information speaker-adapted training (MMI-SAT) are quite similar to their maximum likelihood counterparts. In this section, we provide the full derivation for re-estimating the HMM mean components under an MMI criterion. The formulae for re-estimation of the diagonal covariances, regression classes and semi-tied covariance transformation matrices can be obtained through similar development. To avoid obscuring our main arguments under a mass of detail, we only quote the re-estimation formulae for these latter components here; a full derivation is provided in Appendix A.

To begin our development, let $x^{(s)}$, $g^{(s)}$, and $w^{(s)}$, respectively, denote the observation, Gaussian component, and word sequences associated with an utterance of speaker $s$. Define the mutual information between the ensemble of words $W$ and that of observations $X$ as

$$I(W; X; \Lambda) = \sum_s \log \frac{p(w^{(s)}, x^{(s)}; A^{(s)}, \Lambda)}{p(w^{(s)})\, p(x^{(s)}; A^{(s)}, \Lambda)}$$

Let $\Lambda_0$ denote the current set of parameter values, and

$$c^{(s)} = p(g^{(s)} \,|\, w^{(s)}, x^{(s)}; A^{(s)}, \Lambda_0) - p(g^{(s)} \,|\, x^{(s)}; A^{(s)}, \Lambda_0)$$

the difference in posterior probabilities of $g^{(s)}$ that comes from knowledge of the correct word transcription. Also define the auxiliary function

$$Q(\Lambda|\Lambda_0) = S^{(1)}(\Lambda|\Lambda_0) + S^{(2)}(\Lambda|\Lambda_0) \qquad (2.8)$$

where

$$S^{(1)}(\Lambda|\Lambda_0) = \sum_{s,\,g^{(s)}} c^{(s)} \log p(x^{(s)} \,|\, g^{(s)}; A^{(s)}, \Lambda) \qquad (2.9)$$

$$S^{(2)}(\Lambda|\Lambda_0) = \sum_{s,\,g^{(s)}} d'(g^{(s)}) \int_{\mathbb{R}^{L_{x^{(s)}}}} p(x^{(s)} \,|\, g^{(s)}; A^{(s)}, \Lambda_0) \log p(x^{(s)} \,|\, g^{(s)}; A^{(s)}, \Lambda)\, dx^{(s)} \qquad (2.10)$$

where $L_{x^{(s)}}$ is the total length of $x^{(s)}$ and $d'(g^{(s)})$ is a normalization constant that is typically chosen heuristically to achieve a balance between speed of convergence and robustness. The integral in (2.10) also appears in the

¹ The German word ''fast'' means ''almost'' in English.


definition of differential entropy (Gallager, 1968, §7.4), which is known to be finite for the Gaussian random vectors considered in the sequel. Gunawardana (2001) sought to demonstrate

$$Q(\Lambda|\Lambda_0) > Q(\Lambda_0|\Lambda_0) \;\Rightarrow\; I(W; X; \Lambda) > I(W; X; \Lambda_0) \qquad (2.11)$$

While a fallacy in Gunawardana's proof has been revealed, Axelrod et al. (2004) have devised a correct proof of (2.11) that is valid for a very general class of acoustic models.

To simplify what follows, let us decompose (2.8) as a sum over individual Gaussian components. From the conditional independence assumption inherent in the hidden Markov model, we know

$$\log p(x^{(s)} \,|\, g^{(s)}; A^{(s)}, \Lambda) = \sum_t \log p\left(x_t^{(s)} \,|\, g_t^{(s)}; A^{(s)}, \Lambda\right) = \sum_t \log p\left(x_t^{(s)}; A^{(s)}, \Lambda_k\right)\Big|_{k = g_t^{(s)}}$$

where $\Lambda_k$ are the parameters of the kth Gaussian component. Hence,

$$S^{(1)}(\Lambda|\Lambda_0) = \sum_{t,s,\,g^{(s)}} c^{(s)} \log p\left(x_t^{(s)}; A^{(s)}, \Lambda_k\right)\Big|_{k=g_t^{(s)}} = \sum_k \sum_{t,s,\,g^{(s)}: g_t^{(s)} = k} c^{(s)} \log p\left(x_t^{(s)}; A^{(s)}, \Lambda_k\right) = \sum_k \sum_{t,s} c_{k,t}^{(s)} \log p\left(x_t^{(s)}; A^{(s)}, \Lambda_k\right) \qquad (2.12)$$

where it is necessary to redefine $c_{k,t}^{(s)}$ as

$$c_{k,t}^{(s)} = p\left(g_t^{(s)} = k \,|\, w^{(s)}, x^{(s)}; A^{(s)}, \Lambda_0\right) - p\left(g_t^{(s)} = k \,|\, x^{(s)}; A^{(s)}, \Lambda_0\right) \qquad (2.13)$$

Also invoking conditional independence, we have

$$p(x^{(s)} \,|\, g^{(s)}; A^{(s)}, \Lambda_0) \log p(x^{(s)} \,|\, g^{(s)}; A^{(s)}, \Lambda) = \prod_t p\left(x_t^{(s)} \,|\, g_t^{(s)}; A^{(s)}, \Lambda_k^0\right)\Big|_{k=g_t^{(s)}} \times \sum_t \log p\left(x_t^{(s)} \,|\, g_t^{(s)}; A^{(s)}, \Lambda_k\right)\Big|_{k=g_t^{(s)}}$$

which implies

$$S^{(2)}(\Lambda|\Lambda_0) = \sum_k \sum_{t,s,\,g^{(s)}: g_t^{(s)}=k} d'(g^{(s)}) \int_{\mathbb{R}^{L_x}} p\left(x; A^{(s)}, \Lambda_k^0\right) \log p(x; A^{(s)}, \Lambda_k)\, dx = \sum_k \sum_s d_k^{(s)} \int_{\mathbb{R}^{L_x}} p\left(x; A^{(s)}, \Lambda_k^0\right) \log p(x; A^{(s)}, \Lambda_k)\, dx \qquad (2.14)$$

where

$$d_k^{(s)} = \sum_{t,\, g^{(s)}: g_t^{(s)} = k} d'(g^{(s)}) \qquad (2.15)$$

Finally, we have shown

$$Q(\Lambda|\Lambda_0) = \sum_k Q_k(\Lambda_k|\Lambda_k^0) = \sum_k \left[ S_k^{(1)}(\Lambda|\Lambda_0) + S_k^{(2)}(\Lambda|\Lambda_0) \right] \qquad (2.16)$$

where

$$S_k^{(1)}(\Lambda|\Lambda_0) = \sum_{t,s} c_{k,t}^{(s)} \log p\left(x_t^{(s)}; A^{(s)}, \Lambda_k\right) \qquad (2.17)$$

$$S_k^{(2)}(\Lambda|\Lambda_0) = \sum_s d_k^{(s)} \int_{\mathbb{R}^{L_x}} p\left(x; A^{(s)}, \Lambda_k^0\right) \log p(x; A^{(s)}, \Lambda_k)\, dx \qquad (2.18)$$

Now let us set $\Lambda_k = \{\mu_k, D_k\}$ and observe that the acoustic log-likelihood for the kth Gaussian can be expressed as

$$\log p(x; A^{(s)}, \Lambda_k) = -\frac{1}{2}\left[ \log |2\pi D_k| + (x - A^{(s)}\mu_k - b^{(s)})^T D_k^{-1} (x - A^{(s)}\mu_k - b^{(s)}) \right]$$

from which it follows

$$\nabla_{\mu_k} \log p(x; A^{(s)}, \Lambda_k) = A^{(s)T} D_k^{-1} x - A^{(s)T} D_k^{-1} (A^{(s)}\mu_k + b^{(s)}) \qquad (2.19)$$

$$\nabla_{\sigma_{k,n}^2} \log p(x; A^{(s)}, \Lambda_k) = \frac{1}{2}\left[ \frac{\left(x_n - \mu_{kn}^{(s)}\right)^2}{\left(\sigma_{k,n}^2\right)^2} - \frac{1}{\sigma_{k,n}^2} \right] \qquad (2.20)$$

2.2.1. Mean estimation

From (2.16)–(2.18) it follows that

$$\nabla_{\mu_k} Q_k\left(\Lambda_k|\Lambda_k^0\right) = \sum_{t,s} c_{k,t}^{(s)} \nabla_{\mu_k} \log p\left(x_t^{(s)}; A^{(s)}, \Lambda_k\right) + \sum_s d_k^{(s)} \int_{\mathbb{R}^{L_x}} p\left(x; A^{(s)}, \Lambda_k^0\right) \nabla_{\mu_k} \log p(x; A^{(s)}, \Lambda_k)\, dx$$

Substituting (2.19) into the last equation gives

$$\begin{aligned}
\nabla_{\mu_k} Q_k\left(\Lambda_k|\Lambda_k^0\right) &= \sum_{t,s} c_{k,t}^{(s)} \left[ A^{(s)T} D_k^{-1} x_t^{(s)} - A^{(s)T} D_k^{-1}(A^{(s)}\mu_k + b^{(s)}) \right] \\
&\quad + \sum_s d_k^{(s)} \left[ A^{(s)T} D_k^{-1} \int_{\mathbb{R}^{L_x}} p\left(x; A^{(s)}, \Lambda_k^0\right) x\, dx - A^{(s)T} D_k^{-1}(A^{(s)}\mu_k + b^{(s)}) \right] \\
&= \sum_s \left[ A^{(s)T} D_k^{-1} o_k^{(s)} - c_k^{(s)} A^{(s)T} D_k^{-1}(A^{(s)}\mu_k + b^{(s)}) \right] \\
&\quad + \sum_s d_k^{(s)} \left[ A^{(s)T} D_k^{-1}(A^{(s)}\mu_k^0 + b^{(s)}) - A^{(s)T} D_k^{-1}(A^{(s)}\mu_k + b^{(s)}) \right]
\end{aligned}$$

where $\mu_k^0$ is the current value of the kth mean. Grouping terms and equating the result to zero, we find that the new value of $\mu_k$ can be calculated from (2.4) as before, provided that we define

$$M_k = \sum_s \left( c_k^{(s)} + d_k^{(s)} \right) A^{(s)T} D_k^{-1} A^{(s)} \qquad (2.21)$$

$$v_k = \sum_s A^{(s)T} D_k^{-1} \left[ \left( o_k^{(s)} - c_k^{(s)} b^{(s)} \right) + d_k^{(s)} A^{(s)} \mu_k^0 \right] \qquad (2.22)$$

In the above, $c_k^{(s)}$ and $o_k^{(s)}$ are still defined as in (2.1) and (2.2), but now the definition of $c_{k,t}^{(s)}$ in (2.13) must be used. It remains only to choose a value for $d_k^{(s)}$; good results have been obtained with

$$d_k^{(s)} = E \sum_t p\left(g_t^{(s)} = k \,|\, x^{(s)}; A^{(s)}, \Lambda_0\right) \qquad (2.23)$$

for E = 1.0 or 2.0 as recommended in (Woodland and Povey, 2000). Setting E ≥ 1.0 also ensures that $M_k$ is positive definite, which is necessary if $\mu_k = M_k^{-1} v_k$ is to be an optimal solution.
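As a hedged NumPy sketch (the function name and statistics layout are assumptions made here, not the paper's code), the MMI-SAT mean update of (2.4) with (2.21)–(2.23) differs from its ML counterpart only in the smoothing terms involving $d_k^{(s)}$ and the previous mean:

```python
import numpy as np

def mmi_sat_mean_update(stats, D_k, mu0_k, E=1.0):
    """MMI-SAT mean update via Eqs. (2.4) and (2.21)-(2.23).

    stats: per-speaker tuples (c_k, o_k, den_k, A, b).  Here c_k and o_k
    are accumulated with the *difference* posteriors of (2.13), den_k is
    the denominator occupancy sum_t p(g_t = k | x; A, Lambda_0), and
    (A, b) is the speaker's MLLR transform.  mu0_k is the current mean.
    """
    n = len(D_k)
    D_inv = np.diag(1.0 / np.asarray(D_k))
    M_k = np.zeros((n, n))
    v_k = np.zeros(n)
    for c_k, o_k, den_k, A, b in stats:
        d_k = E * den_k                                           # Eq. (2.23)
        M_k += (c_k + d_k) * A.T @ D_inv @ A                      # Eq. (2.21)
        v_k += A.T @ D_inv @ (o_k - c_k * b + d_k * (A @ mu0_k))  # Eq. (2.22)
    return np.linalg.solve(M_k, v_k)                              # Eq. (2.4)
```

When the accumulated statistics already agree with the current model (for instance $o_k^{(s)} = c_k^{(s)} \mu_k^0$ with identity transforms), the update leaves the mean at $\mu_k^0$, as expected of a stationary point.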

2.2.2. Diagonal covariance estimation

Using development similar to that in the prior section, it can be shown that the diagonal covariance matrices can be estimated under an MMI criterion according to the formula

$$\sigma_{kn}^2 = \frac{\sum_s \left[ \left( s_{kn}^{(s)} - 2\, o_{kn}^{(s)} \mu_{kn}^{(s)} + c_k^{(s)} \mu_{kn}^{(s)2} \right) + d_k^{(s)} \left( \sigma_{kn}^{0\,2} + \left( \mu_{kn}^{0(s)} - \mu_{kn}^{(s)} \right)^2 \right) \right]}{\sum_s \left( c_k^{(s)} + d_k^{(s)} \right)} \qquad (2.24)$$

where $\mu_{kn}^{0(s)}$ is the nth component of $\mu_k^{0(s)} = A^{(s)} \mu_k^0 + b^{(s)}$ and $\sigma_{kn}^{0\,2}$ is the current value of the variance. In order to prevent $\sigma_{kn}^2$ from becoming too small or even negative, a floor value is used. As before, $s_k^{(s)}$ is defined in (2.3) with $c_{k,t}^{(s)}$ as given in (2.13). In accumulating the term

$$\sum_s \left( s_{kn}^{(s)} - 2\, o_{kn}^{(s)} \mu_{kn}^{(s)} + c_k^{(s)} \mu_{kn}^{(s)2} \right)$$

the same FSAT approximation can be made as before in order to economize on disk space; that is, $\mu_k^{(s)}$ can be calculated with $\mu_k^0$ instead of $\mu_k$.
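For a single dimension n of Gaussian k, (2.24) is a scalar computation. The following pure-Python sketch (the function name and statistics layout are illustrative assumptions) shows the accumulation and the variance floor:

```python
def mmi_var_update(stats, var0_kn, floor=1e-4):
    """Per-dimension MMI-SAT variance update, Eq. (2.24).

    stats: per-speaker tuples (c_k, o_kn, s_kn, d_k, mu_kn, mu0_kn),
    where mu_kn and mu0_kn are the n-th components of the adapted new
    and previous means A mu + b for this speaker; var0_kn is the current
    variance.  The result is floored to keep it positive.
    """
    num = 0.0
    den = 0.0
    for c_k, o_kn, s_kn, d_k, mu_kn, mu0_kn in stats:
        # numerator terms of (2.24): difference statistics plus the
        # smoothing term weighted by d_k
        num += (s_kn - 2.0 * o_kn * mu_kn + c_k * mu_kn ** 2
                + d_k * (var0_kn + (mu0_kn - mu_kn) ** 2))
        den += c_k + d_k
    return max(num / den, floor)
```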

2.2.3. Regression class estimation

To obtain the largest possible reduction in word error rate, quite often we estimate not a single global transformation $(A^{(s)}, b^{(s)})$ for each speaker, but a set of transformations $\{(A_r^{(s)}, b_r^{(s)})\}$. In this case, it is of interest to reassign the kth Gaussian to a regression class $r$ so as to maximize the mutual information. As shown in Appendix A.2, this can be achieved by choosing that regression class $r$ which maximizes the auxiliary function

$$Q_{kr}\left(\Lambda_{kr}|\Lambda_{kr}^0\right) = -\frac{1}{2} a_{kr} + \mu_k^T v_{kr} - \frac{1}{2} \mu_k^T M_{kr} \mu_k$$

where

$$a_{kr} = \sum_s \left[ \left( c_k^{(s)} b_r^{(s)} - o_k^{(s)} \right)^T D_k^{-1} \left( c_k^{(s)} b_r^{(s)} - o_k^{(s)} \right) \Big/ c_k^{(s)} + d_k^{(s)} b_r^{(s)T} D_k^{-1} \left( b_r^{(s)} - 2\mu_k^0 \right) \right]$$

$$v_{kr} = \sum_s A_r^{(s)T} D_k^{-1} \left[ o_k^{(s)} - \left( c_k^{(s)} + d_k^{(s)} \right) b_r^{(s)} + d_k^{(s)} \mu_k^0 \right]$$

$$M_{kr} = \sum_s \left( c_k^{(s)} + d_k^{(s)} \right) A_r^{(s)T} D_k^{-1} A_r^{(s)}$$
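Once $a_{kr}$, $v_{kr}$, and $M_{kr}$ have been accumulated for each candidate class, the reassignment itself is a simple argmax over the quadratic auxiliary function. A minimal sketch (the data layout assumed here is illustrative):

```python
import numpy as np

def best_regression_class(mu_k, aux_stats):
    """Pick the regression class r maximizing
    Q_kr = -a_kr / 2 + mu_k^T v_kr - mu_k^T M_kr mu_k / 2 (Section 2.2.3).

    aux_stats: dict mapping r -> (a_kr, v_kr, M_kr).
    """
    def Q(a_kr, v_kr, M_kr):
        return -0.5 * a_kr + mu_k @ v_kr - 0.5 * mu_k @ M_kr @ mu_k
    return max(aux_stats, key=lambda r: Q(*aux_stats[r]))
```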

2.2.4. Semi-tied covariance estimation

Gales (1999) defines a semi-tied covariance matrix as

$$\Sigma_k = P D_k P^T$$

where $D_k$ is, as before, the diagonal covariance matrix for the kth Gaussian component, and $P$ is a transformation matrix shared by many Gaussian components. Both $D_k$ and $P$ can be updated using an ML criterion. Assuming $P$ is non-singular, however, it is possible to define

$$M = P^{-1}$$

The transformation $M$ can then be applied to each feature $x$, which is more computationally efficient than transforming the covariances. Let $m_i^T$ denote the ith row of $M$, and let $f_i^T$ denote the ith row of the cofactor matrix of $M$. Moreover, define

$$W_j = W_j^{(1)} + W_j^{(2)} = \sum_k \frac{1}{\sigma_{k,j}^2} \left[ \sum_{t,s} c_{k,t}^{(s)}\, o_{k,t}^{(s)} o_{k,t}^{(s)T} + d_k \sum_i \sigma_{k,i}^{0\,2}\, p_i^0 p_i^{0T} \right] \qquad (2.25)$$

where $p_i^0$ is the ith column of $P$. A procedure for estimating an optimal transformation $M$ can be developed directly along the lines proposed by Gales (1999). As shown in Appendix A.3, such a procedure proceeds by iteratively updating each row $m_j^T$ according to

$$m_j^T = f_j^T W_j^{-1} \sqrt{\frac{c + d}{f_j^T W_j^{-1} f_j}}$$

As explained in Appendix A.3, in order to assure convergence, $d_k$ in (2.25) must be chosen to ensure that $W_j$ is positive definite.
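The row-wise update has the same shape as Gales's ML procedure. A NumPy sketch of a single row update follows; it is illustrative only: for clarity the cofactor row is computed via det(M) inv(M), which assumes M is non-singular, and `beta` stands in for the constant c + d.

```python
import numpy as np

def stc_row_update(M, W_j, j, beta):
    """One iteration of the row update of Section 2.2.4:
    m_j^T = f_j^T W_j^{-1} sqrt(beta / (f_j^T W_j^{-1} f_j)),
    where f_j^T is the j-th row of the cofactor matrix of M and W_j must
    be positive definite for the procedure to converge.
    """
    # cof(M) = det(M) inv(M)^T, so row j of cof(M) is det(M) * inv(M)[:, j]
    f_j = np.linalg.det(M) * np.linalg.inv(M)[:, j]
    fW = f_j @ np.linalg.inv(W_j)
    M_new = M.copy()
    M_new[j] = fW * np.sqrt(beta / (fW @ f_j))
    return M_new
```

At the identity with W_j = I and beta = 1, the update is a fixed point, which provides a quick sanity check of the formula.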

2.3. Conventional MMI estimation formulae

As shown in the seminal work of Gopalakrishnan et al. (1991), the re-estimation formula for the Gaussian mixture weights is given by

$$q_k = \frac{q_k^0 \{ f_k + C \}}{\sum_{k'} q_{k'}^0 \{ f_{k'} + C \}} \qquad (2.26)$$

where $q_k^0$ is the current value of the mixture weight,

$$f_k = \frac{c_k}{q_k^0} \qquad (2.27)$$

and $c_k$ is obtained by summing (2.1) over all training speakers. The constant $C$ is chosen to ensure all mixture weights are positive.

In the absence of MLLR, the re-estimation formulae for a mean $\mu_k$ and diagonal variance $\sigma_k^2$ for continuous density HMMs can be written as

$$\mu_k = \frac{o_k + D \mu_k^0}{c_k + D} \qquad (2.28)$$

$$\sigma_k^2 = \frac{s_k + D\left( \sigma_k^{0\,2} + \mu_k^{0\,2} \right)}{c_k + D} - \mu_k^2 \qquad (2.29)$$

where $\mu_k^0$ and $\sigma_k^{0\,2}$ are the mean and diagonal variance, respectively, from the prior iteration, and $c_k$, $o_k$, and $s_k$ are obtained by summing the relevant quantities in (2.1)–(2.3) over all speakers in the training set. Moreover, $\mu_k^{0\,2}$ denotes the vector obtained by squaring $\mu_k^0$ component-wise, and similarly for $\mu_k^2$. The normalization constant $D$ can be chosen as in Woodland and Povey (2000). It is interesting to note that the original derivation of (2.28) and (2.29) in (Normandin, 1991) was based on an approximation, but a simpler and more satisfying derivation is provided in (Gunawardana and Byrne, 2000).

While (2.28) and (2.29) were originally derived for estimating the parameters of an unadapted HMM, they also lend themselves to use with CMLLR (Gales, 1998) during training.
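These conventional updates are easily stated in NumPy. The sketch below is illustrative (function names are assumed here, not from the paper); it treats q, c, o, and s as arrays over mixture components or feature dimensions:

```python
import numpy as np

def mmi_mixture_weights(q0, c, C):
    """Eqs. (2.26)-(2.27): q_k proportional to q0_k * (f_k + C)."""
    f = c / q0                     # Eq. (2.27)
    unnorm = q0 * (f + C)
    return unnorm / unnorm.sum()   # Eq. (2.26)

def mmi_mean_var(c, o, s, mu0, var0, D):
    """Eqs. (2.28)-(2.29): D-smoothed mean and variance updates."""
    mu = (o + D * mu0) / (c + D)                           # Eq. (2.28)
    var = (s + D * (var0 + mu0 ** 2)) / (c + D) - mu ** 2  # Eq. (2.29)
    return mu, var
```

When the accumulated statistics agree with the prior model, the smoothing terms leave the parameters unchanged, so D trades off convergence speed against stability.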

3. Speech recognition experiments

The speech experiments described below were conducted with the Enigma weighted finite-state transducer (WFST) library and the Millenium automatic speech recognition system. Feature extraction was performed with the Speech Frontend. All packages are implemented as extension modules of the Python scripting language.

The front-end used in these experiments was based on a spectral envelope obtained from the warped minimum variance distortionless response (MVDR) (Wölfel and McDonough, 2005) with model order 30. Due to the properties of the warped MVDR, neither the Mel-filterbank nor any other filterbank was used. The advantages of the MVDR approach are an increase in resolution in low frequency regions relative to conventional Mel-filterbanks, and the increased accuracy in the modeling of spectral peaks relative to spectral valleys, which leads to improved robustness in the presence of noise. Front-end analysis involved extracting 20 cepstral coefficients per frame of speech and performing global cepstral mean subtraction (CMS) with variance normalization. The final features were obtained by concatenating 15 consecutive frames of cepstral features together, then performing a linear discriminant analysis (LDA) to obtain a feature of length 42. The LDA transformation was followed by a second global CMS, then a global STC transform (Gales, 1999).
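The pipeline just described (cepstra, CMS with variance normalization, stacking of 15 frames, LDA to 42 dimensions, a second CMS, then STC) can be sketched as below. This is a simplified illustration, not the authors' Speech Frontend: CMS is applied per utterance rather than globally, and the LDA and STC matrices are assumed to have been trained elsewhere.

```python
import numpy as np

def frontend(cepstra, lda, stc, context=7):
    """Sketch of the feature pipeline of Section 3.

    cepstra: (T, 20) array of cepstral coefficients per frame.
    lda:     (42, 300) LDA projection for 15 stacked frames of 20 cepstra.
    stc:     (42, 42) global semi-tied covariance transform.
    """
    # cepstral mean subtraction with variance normalization
    x = (cepstra - cepstra.mean(0)) / (cepstra.std(0) + 1e-8)
    # stack 15 consecutive frames (+/- `context` frames around each frame)
    padded = np.pad(x, ((context, context), (0, 0)), mode='edge')
    stacked = np.stack([padded[t:t + 2 * context + 1].ravel()
                        for t in range(len(x))])
    y = stacked @ lda.T      # LDA: 15 * 20 = 300 -> 42 dimensions
    y = y - y.mean(0)        # second CMS
    return y @ stc.T         # global STC transform
```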

The training data used for the experiments reported here was taken from the ICSI, NIST, and CMU meeting corpora, as well as the Transenglish Database (TED) corpus, for a total of 100 h of training material.

3.1. Discriminative training paradigm

MMI training was conducted by first writing a word lattice for each utterance in the training set using a bigram language model (LM). The LM was converted to a weighted finite-state acceptor with a Perl script. Thereafter, the LM was composed with a lexicon transducer to obtain a transducer that translates phone sequences to words. This transducer was in turn composed with an HC transducer (Stoimenov and McDonough, 2006) to obtain a transducer that translates sequences of Gaussian mixture models (GMMs) to sequences of words. This final transducer was then determinized, its weights pushed forward, and minimized. The word trace decoder used for lattice writing was developed along the lines suggested by Saon et al. (2005). Proceeding as in (Ljolje et al., 1999), after the raw lattice was obtained from the decoder, it was projected onto the output side to discard all arc information except for the word identities. The correct transcription, as


required for discriminative training, was then inserted into the lattice, and this modified lattice was then compacted through epsilon removal, determinization and minimization (Mohri and Riley, 1998). Lattice generation was relatively efficient, with all steps completed in less than 5× real time on a 3 GHz Intel x86 workstation.

A recognition network was generated from the corresponding word lattice for each training utterance, through composition with a unigram LM, then with a lexicon, and finally with HC. The final utterance recognition networks were optimized through determinization, weight pushing and minimization. To accumulate the statistics required for MMI training, the constrained network was used to recognize each utterance with a state trace decoder, which maintains the time alignment of each feature to a state in the HMM. Unlike the procedures described in (Woodland and Povey, 2000), this procedure is optimal in that neither the phone nor word boundaries are fixed on the basis of the original lattices. Rather, these boundaries are determined during the Viterbi alignment with the current model. Based on the Viterbi alignment, a new lattice was generated with the time boundaries and log-likelihood scores for each state in the HMM. This information was used to derive the posterior probabilities of the edges in the lattice as described by Valtchev et al. (1997). Because the utterance recognition networks had been highly optimized through WFST operations, the final constrained decoding and accumulation of training statistics for each MMI-SAT iteration ran in approximately real time.

3.2. Experiments with VTLN and CMLLR

The test set used for the experiments described below was the lecture meeting portion of the NIST RT-05s evaluation set. It consisted of 22,258 words, a total of 3.5 h of data.

The speaker adaptation parameters estimated during ML training were used unaltered during MMI training. Similarly, the adaptation parameters estimated with the corresponding ML-trained model were used unaltered when testing the MMI-trained model.

The first set of experiments was conducted using VTLN and CMLLR without MLLR in training. On the testing side, VTLN, CMLLR and MLLR were used in all cases. Mixture weight, mean, and covariance re-estimation were based on the conventional MMI formulae (2.26)–(2.29). The results of these experiments are tabulated in Table 1. For the sake of comparison, we show the results for two models trained with the ML criterion: the first with VTLN but without CMLLR during training, which achieved 39.7% WER; the second with both VTLN and CMLLR during training, which achieved 37.8% WER. The row labeled ''Unadapted MMI'' shows the results for the model trained with only VTLN using the re-estimation formulae (2.28) and (2.29). The best result of 39.1% with this model came after the third iteration, which represented a slight improvement over the comparable result of 39.7% obtained with only ML training. The row labeled MMI-CMLLR shows the results obtained with the model trained with both VTLN and CMLLR but no MLLR, once more using the re-estimation formulae (2.28) and (2.29). This scenario provided a significant gain over the VTLN-only scenario. Curiously, however, the best result of 38.0% here, obtained after the second iteration, was not better than the 37.8% result obtained with ML training using VTLN and CMLLR.

3.3. Experiments with VTLN, CMLLR and MLLR

For the final set of experiments, VTLN, CMLLR and MLLR were always used during training and testing. Mean estimation during MMI training was performed according to (2.4), (2.21) and (2.22), covariance estimation was based on (2.24), and mixture weight estimation was according to (2.26) and (2.27). In all experiments reported here, a maximum of eight MLLR transforms per speaker were used. The regression classes for these transforms were obtained from a divisive K-means clustering procedure that yielded a tree structure. If a leaf received less than 1500 frames of enrollment data during adaptation for a particular speaker, then parameter estimation was based on the first ancestor of the given leaf containing sufficient data. The results of these experiments are tabulated in Table 2. The best result of 35.2% was obtained by applying VTLN, CMLLR and MLLR during training, which represents the matched condition between training and test, as shown in the row labeled MMI-SAT. The result obtained with the MMI-SAT paradigm was better than the ML-SAT result of 36.7%, as well as all of the results shown in Table 1, for which no MLLR was used during training.

Table 1
ASR results using VTLN and CMLLR during training

Training scenario     Iteration
                      1       2       3
Unadapted ML          39.7
ML-CMLLR              37.8
Unadapted MMI         40.0    39.3    39.1
MMI-CMLLR             39.2    38.0    38.2

Table 2
ASR results using VTLN, CMLLR and MLLR in training

Training scenario     Iteration
                      1       2       3
ML-SAT                36.7
MMI-SAT               35.6    35.2    35.3
MMI-STC               35.4    34.7    35.1
MMI-ORC               35.4

J. McDonough et al. / Computer Speech and Language 22 (2008) 130–147 139
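The leaf-to-ancestor backoff described above can be sketched as a simple walk up the regression class tree. The node structure and names below are illustrative; only the 1500-frame threshold is taken from the text:

```python
class RegressionNode:
    """Node in the regression class tree produced by divisive clustering."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.frames = 0  # enrollment frames assigned to this node's subtree

def propagate_counts(leaves):
    """Add each leaf's frame count to all of its ancestors."""
    for leaf in leaves:
        node = leaf.parent
        while node is not None:
            node.frames += leaf.frames
            node = node.parent

def estimation_node(leaf, min_frames=1500):
    """Return the first node on the leaf-to-root path with enough data."""
    node = leaf
    while node.parent is not None and node.frames < min_frames:
        node = node.parent
    return node

# Tiny tree: a root with two leaves; one leaf is data-starved.
root = RegressionNode("root")
rich = RegressionNode("rich", parent=root); rich.frames = 2400
poor = RegressionNode("poor", parent=root); poor.frames = 300
propagate_counts([rich, poor])
```

Under this scheme the well-populated leaf keeps its own transform, while the data-starved leaf falls back to the root, whose count pools all descendants.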

A further reduction in WER from 35.2% to 34.7% was obtained by re-estimating the global STC transformation matrix during adapted training using an MMI criterion; these results are shown in the row labeled MMI-STC. The STC transformation was estimated during an intermediate pass through the training data between the passes devoted to the estimation of the means, covariances and mixture weights. The STC estimation formulae appeared in (McDonough and Waibel, 2003), but were successfully applied for the first time in (McDonough et al., 2007).

The final row of Table 2 tabulates the results obtained when the regression classes used for MLLR adaptation were re-estimated with an MMI criterion. Unfortunately, this reassignment of the regression classes yielded no improvement over the MMI-SAT case, wherein all regression classes were held fixed.

4. Conclusions

In the recent past, a great deal of research effort has been devoted to the development of optimization criteria and training paradigms for the discriminative training of hidden Markov models (HMMs). While many of the optimization criteria and training algorithms have proven better than the maximum mutual information (MMI) criterion investigated by Normandin (1991), most authors have ignored speaker adaptation during discriminative HMM training. In this work, we have demonstrated the importance of speaker adaptation during discriminative training. In particular, we have shown that on a subset of the data used for the NIST RT05 evaluation, including MLLR transformations in the MMI re-estimation formulae provides a WER of 35.2% compared with 39.1% obtained when speaker adaptation is ignored during discriminative training. When the global STC transformation was re-estimated during discriminative speaker-adapted training, a further reduction in WER from 35.2% to 34.7% was achieved.

The sine-log all-pass transform (SLAPT) (McDonough, 2000) has proven to provide a reduction in WER superior to MLLR in all known cases (McDonough and Waibel, 2004; McDonough et al., 2004). In (McDonough and Waibel, 2004), SLAPT was effectively combined with an STC transformation. To date, however, SLAPT has never been used in the MMI-SAT training paradigm. Future work will investigate such a training paradigm.

Acknowledgements

This work was supported by the European Union (EU) under the integrated project CHIL, Computers in the Human Interaction Loop, Contract Number 506909, as well as the German Ministry of Research and Technology (BMBF) under the SmartWeb project, Grant Number 01IMD01A. The authors gratefully thank the EU and the Republic of Germany for their financial support.

Appendix A. Maximum mutual information estimation formulae

In this appendix, we derive MMI re-estimation formulae for diagonal covariance matrices, optimal regression classes (McDonough, 2000, §4) and STC transform matrices (Gales, 1999).

A.1. Diagonal covariance estimation

In order to solve for the optimal covariances, let us begin again from (2.16)–(2.18) and note

\[
\nabla_{\sigma^2_{k,n}} Q_k\bigl(\Lambda_k|\Lambda'_k\bigr)
= \sum_{t,s} c^{(s)}_{k,t}\, \nabla_{\sigma^2_{k,n}} \log p\bigl(x^{(s)}_t; A^{(s)}, \Lambda_k\bigr)
+ \sum_s d^{(s)}_k \int_{\mathbb{R}^{L_x}} p\bigl(x; A^{(s)}, \Lambda'_k\bigr)\, \nabla_{\sigma^2_{k,n}} \log p\bigl(x; A^{(s)}, \Lambda_k\bigr)\, dx
\]

and substituting (2.20) into the latter provides

\[
\begin{aligned}
\nabla_{\sigma^2_{k,n}} Q_k\bigl(\Lambda_k|\Lambda'_k\bigr)
&= \frac{1}{2} \sum_{t,s} c^{(s)}_{k,t} \left[ \frac{\bigl(x^{(s)}_{t,n} - \mu^{(s)}_{kn}\bigr)^2}{\bigl(\sigma^2_{k,n}\bigr)^2} - \frac{1}{\sigma^2_{k,n}} \right]
+ \frac{1}{2} \sum_s d^{(s)}_k \int_{\mathbb{R}^{L_x}} p\bigl(x; A^{(s)}, \Lambda'_k\bigr) \left[ \frac{\bigl(x_n - \mu^{(s)}_{kn}\bigr)^2}{\bigl(\sigma^2_{k,n}\bigr)^2} - \frac{1}{\sigma^2_{k,n}} \right] dx \\
&= \frac{1}{2} \sum_s \left[ \frac{s^{(s)}_k - 2\, o^{(s)}_k \mu^{(s)}_{kn} + c^{(s)}_k \mu^{(s)2}_{kn}}{\bigl(\sigma^2_{k,n}\bigr)^2} - \frac{c^{(s)}_k}{\sigma^2_{k,n}} \right]
+ \frac{1}{2} \sum_s d^{(s)}_k \left[ \frac{\sigma'^2_{kn} + \bigl(\mu'^{(s)}_{kn} - \mu^{(s)}_{kn}\bigr)^2}{\bigl(\sigma^2_{k,n}\bigr)^2} - \frac{1}{\sigma^2_{k,n}} \right]
\end{aligned}
\]

Equating the last expression to zero and grouping terms we learn

\[
\sigma^2_{kn} = \frac{\displaystyle \sum_s \Bigl[ s^{(s)}_k - 2\, o^{(s)}_k \mu^{(s)}_{kn} + c^{(s)}_k \mu^{(s)2}_{kn} + d^{(s)}_k \Bigl( \sigma'^2_{kn} + \bigl(\mu'^{(s)}_{kn} - \mu^{(s)}_{kn}\bigr)^2 \Bigr) \Bigr]}{\displaystyle \sum_s \bigl( c^{(s)}_k + d^{(s)}_k \bigr)} \tag{A.1}
\]

where $\mu'^{(s)}_{kn}$ is the $n$th component of $\mu'^{(s)}_k = A^{(s)} \mu'_k + b^{(s)}$ and $\sigma'^2_{kn}$ is the current value of the variance. In accumulating the term

\[
\sum_s \Bigl( s^{(s)}_k - 2\, o^{(s)}_k \mu^{(s)}_{kn} + c^{(s)}_k \mu^{(s)2}_{kn} \Bigr)
\]

the same FSAT approximation can be made as before in order to economize on disk space; that is, $\mu^{(s)}_k$ can be calculated with $\mu'_k$ instead of $\mu_k$.
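For concreteness, the per-dimension update (A.1) can be coded directly from the accumulated statistics. The dictionary layout and numerical values below are invented for illustration; when every $d^{(s)}_k = 0$ the update reduces to the usual speaker-adapted ML variance estimate:

```python
def mmi_variance_update(stats):
    """Per-dimension MMI-SAT variance update following (A.1).
    Each entry of `stats` holds, for one speaker s: the second-order
    accumulator s_k ("s"), first-order accumulator o_k ("o"), occupancy
    c_k ("c"), smoothing count d_k ("d"), the adapted mean mu_kn ("mu"),
    the current variance ("var0"), and the current adapted mean ("mu0")."""
    num = 0.0
    den = 0.0
    for sp in stats:
        num += sp["s"] - 2.0 * sp["o"] * sp["mu"] + sp["c"] * sp["mu"] ** 2
        num += sp["d"] * (sp["var0"] + (sp["mu0"] - sp["mu"]) ** 2)
        den += sp["c"] + sp["d"]
    return num / den

# One speaker, d = 0: the update reduces to the ML variance estimate.
ml_only = [{"s": 25.0, "o": 9.0, "c": 4.0, "mu": 2.0, "d": 0.0,
            "var0": 1.0, "mu0": 2.0}]
var_ml = mmi_variance_update(ml_only)

# With d > 0 the current variance and mean shift regularize the update.
smoothed = [{"s": 25.0, "o": 9.0, "c": 4.0, "mu": 2.0, "d": 2.0,
             "var0": 1.0, "mu0": 3.0}]
var_mmi = mmi_variance_update(smoothed)
```

The positive smoothing counts $d^{(s)}_k$ pull the update toward the current variance, which also keeps the estimate positive when the discriminative occupancies are small.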

A.2. Optimal regression classes

In order to obtain the largest possible reduction in word error rate, quite often we estimate not a single global transformation $(A^{(s)}, b^{(s)})$ for each speaker, but a set of transformations $\bigl\{\bigl(A^{(s)}_r, b^{(s)}_r\bigr)\bigr\}$. In this case, it is of interest to reassign the $k$th Gaussian to a regression class $r$ so as to maximize the mutual information. This can be done with a slight extension of the prior analysis, to wit: it is necessary to calculate the actual value of the auxiliary function in addition to the optimal parameters $\Lambda_k$ achieving its maximum. Beginning once more with (2.17), note that

\[
\begin{aligned}
S^{(1)}_{kr} &= -\frac{1}{2} \sum_{t,s} c^{(s)}_{k,t} \Bigl[ \log|2\pi D_k| + \bigl(x^{(s)}_t - A^{(s)}_r \mu_k - b^{(s)}_r\bigr)^T D_k^{-1} \bigl(x^{(s)}_t - A^{(s)}_r \mu_k - b^{(s)}_r\bigr) \Bigr] \\
&= -\frac{1}{2} \sum_{t,s} c^{(s)}_{k,t} \Bigl[ \log|2\pi D_k| + x^{(s)T}_t D_k^{-1} x^{(s)}_t + b^{(s)T}_r D_k^{-1} b^{(s)}_r + \mu_k^T A^{(s)T}_r D_k^{-1} A^{(s)}_r \mu_k \\
&\qquad\qquad - 2\, b^{(s)T}_r D_k^{-1} x^{(s)}_t + 2\, \mu_k^T A^{(s)T}_r D_k^{-1} b^{(s)}_r - 2\, \mu_k^T A^{(s)T}_r D_k^{-1} x^{(s)}_t \Bigr]
\end{aligned}
\]

Upon ignoring the terms which do not depend on $\bigl(A^{(s)}_r, b^{(s)}_r\bigr)$, this becomes

\[
S^{(1)}_{kr} = -\frac{1}{2} \sum_s \Bigl[ c^{(s)}_k b^{(s)T}_r D_k^{-1} b^{(s)}_r - 2\, b^{(s)T}_r D_k^{-1} o^{(s)}_k - 2\, \mu_k^T A^{(s)T}_r D_k^{-1} \bigl(o^{(s)}_k - c^{(s)}_k b^{(s)}_r\bigr) + c^{(s)}_k \mu_k^T A^{(s)T}_r D_k^{-1} A^{(s)}_r \mu_k \Bigr]
\]

Note, however, that

\[
\sum_s \Bigl[ c^{(s)}_k b^{(s)T}_r D_k^{-1} b^{(s)}_r - 2\, b^{(s)T}_r D_k^{-1} o^{(s)}_k \Bigr]
= \sum_s c^{(s)}_k \Bigl[ \bigl(b^{(s)}_r - o^{(s)}_k/c^{(s)}_k\bigr)^T D_k^{-1} \bigl(b^{(s)}_r - o^{(s)}_k/c^{(s)}_k\bigr) - o^{(s)T}_k D_k^{-1} o^{(s)}_k \big/ c^{(s)2}_k \Bigr]
\]

Once more ignoring that term which is independent of $\bigl(A^{(s)}_r, b^{(s)}_r\bigr)$,

\[
S^{(1)}_{kr} = -\frac{1}{2} \sum_s \Bigl[ \bigl(c^{(s)}_k b^{(s)}_r - o^{(s)}_k\bigr)^T D_k^{-1} \bigl(c^{(s)}_k b^{(s)}_r - o^{(s)}_k\bigr) \big/ c^{(s)}_k - 2\, \mu_k^T A^{(s)T}_r D_k^{-1} \bigl(o^{(s)}_k - c^{(s)}_k b^{(s)}_r\bigr) + c^{(s)}_k \mu_k^T A^{(s)T}_r D_k^{-1} A^{(s)}_r \mu_k \Bigr] \tag{A.2}
\]

Considering now (2.18), we have

\[
\begin{aligned}
S^{(2)}_{kr} &= \sum_s d^{(s)}_k \int_{\mathbb{R}^{L_x}} p\bigl(x; A^{(s)}_0, b^{(s)}_0, \Lambda'_k\bigr) \log p\bigl(x; A^{(s)}_r, b^{(s)}_r, \Lambda_k\bigr)\, dx \\
&= -\frac{1}{2} \sum_s d^{(s)}_k \int_{\mathbb{R}^{L_x}} p\bigl(x; A^{(s)}_0, b^{(s)}_0, \Lambda'_k\bigr) \Bigl[ \log|2\pi D_k| + \bigl(x - A^{(s)}_r \mu_k - b^{(s)}_r\bigr)^T D_k^{-1} \bigl(x - A^{(s)}_r \mu_k - b^{(s)}_r\bigr) \Bigr] dx \\
&= -\frac{1}{2} \sum_s d^{(s)}_k \int_{\mathbb{R}^{L_x}} p\bigl(x; A^{(s)}_0, b^{(s)}_0, \Lambda'_k\bigr) \Bigl[ \log|2\pi D_k| + x^T D_k^{-1} x + \mu_k^T A^{(s)T}_r D_k^{-1} A^{(s)}_r \mu_k + b^{(s)T}_r D_k^{-1} b^{(s)}_r \\
&\qquad\qquad - 2\, b^{(s)T}_r D_k^{-1} x + 2\, \mu_k^T A^{(s)T}_r D_k^{-1} b^{(s)}_r - 2\, \mu_k^T A^{(s)T}_r D_k^{-1} x \Bigr] dx
\end{aligned}
\tag{A.3}
\]

where $\bigl(A^{(s)}_0, b^{(s)}_0\bigr)$ are the current transformation parameters for the $k$th Gaussian. The integral of the term $x^T D_k^{-1} x$ is independent of $\bigl(A^{(s)}_r, b^{(s)}_r\bigr)$ and hence can be ignored. From the other terms in (A.3) we can write

\[
\begin{aligned}
S^{(2)}_{kr} &= -\frac{1}{2} \sum_s d^{(s)}_k \Bigl[ \mu_k^T A^{(s)T}_r D_k^{-1} A^{(s)}_r \mu_k + b^{(s)T}_r D_k^{-1} b^{(s)}_r - 2\, b^{(s)T}_r D_k^{-1} \bigl(A^{(s)}_0 \mu'_k + b^{(s)}_0\bigr) \\
&\qquad\qquad + 2\, \mu_k^T A^{(s)T}_r D_k^{-1} b^{(s)}_r - 2\, \mu_k^T A^{(s)T}_r D_k^{-1} \bigl(A^{(s)}_0 \mu'_k + b^{(s)}_0\bigr) \Bigr] \\
&= -\frac{1}{2} \sum_s d^{(s)}_k \Bigl[ \mu_k^T A^{(s)T}_r D_k^{-1} A^{(s)}_r \mu_k + b^{(s)T}_r D_k^{-1} \bigl(b^{(s)}_r - 2\bar{\mu}'_k\bigr) - 2\, \mu_k^T A^{(s)T}_r D_k^{-1} \bigl(\bar{\mu}'_k - b^{(s)}_r\bigr) \Bigr]
\end{aligned}
\tag{A.4}
\]

where

\[
\bar{\mu}'_k = A^{(s)}_0 \mu'_k + b^{(s)}_0
\]

Finally, combining (A.2) and (A.4), we find

\[
Q_{kr}\bigl(\Lambda_{kr}|\Lambda'_{kr}\bigr) = S^{(1)}_{kr} + S^{(2)}_{kr} = -\frac{1}{2} a_{kr} + \mu_k^T v_{kr} - \frac{1}{2} \mu_k^T M_{kr} \mu_k \tag{A.5}
\]

where

\[
\begin{aligned}
a_{kr} &= \sum_s \Bigl[ \bigl(c^{(s)}_k b^{(s)}_r - o^{(s)}_k\bigr)^T D_k^{-1} \bigl(c^{(s)}_k b^{(s)}_r - o^{(s)}_k\bigr) \big/ c^{(s)}_k + d^{(s)}_k\, b^{(s)T}_r D_k^{-1} \bigl(b^{(s)}_r - 2\bar{\mu}'_k\bigr) \Bigr] \\
v_{kr} &= \sum_s A^{(s)T}_r D_k^{-1} \Bigl[ o^{(s)}_k - \bigl(c^{(s)}_k + d^{(s)}_k\bigr) b^{(s)}_r + d^{(s)}_k \bar{\mu}'_k \Bigr] \\
M_{kr} &= \sum_s \bigl(c^{(s)}_k + d^{(s)}_k\bigr) A^{(s)T}_r D_k^{-1} A^{(s)}_r
\end{aligned}
\]

With these definitions, it is possible to choose the optimal regression class $r^*$ as that class which maximizes (A.5); see (McDonough, 2000, §4.5).
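Given the accumulators $a_{kr}$, $v_{kr}$ and $M_{kr}$, choosing $r^*$ is just an argmax of the quadratic form (A.5) over the candidate classes. A small self-contained sketch with invented two-dimensional statistics (not the authors' implementation):

```python
def quadratic_aux(a, v, M, mu):
    """Evaluate Q = -a/2 + mu^T v - mu^T M mu / 2, as in (A.5)."""
    lin = sum(m * x for m, x in zip(mu, v))
    quad = sum(mu[i] * sum(M[i][j] * mu[j] for j in range(len(mu)))
               for i in range(len(mu)))
    return -0.5 * a + lin - 0.5 * quad

def best_regression_class(classes, mu):
    """Return the index r* of the class maximizing the auxiliary function."""
    scores = [quadratic_aux(c["a"], c["v"], c["M"], mu) for c in classes]
    return max(range(len(scores)), key=scores.__getitem__)

# Two candidate regression classes with invented accumulators.
classes = [
    {"a": 4.0, "v": [1.0, 0.0], "M": [[1.0, 0.0], [0.0, 1.0]]},
    {"a": 1.0, "v": [2.0, 1.0], "M": [[1.0, 0.0], [0.0, 1.0]]},
]
r_star = best_regression_class(classes, mu=[1.0, 1.0])
```

In a full system the same loop would run once per Gaussian component, reassigning each to the class with the highest auxiliary function value.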


A.3. Semi-tied covariance estimation

Gales (1999) defines a semi-tied covariance matrix as

\[
\Sigma_k = P D_k P^T
\]

where $D_k$ is, as before, the diagonal covariance matrix for the $k$th Gaussian component, and $P$ is a transformation matrix shared by many Gaussian components. Let $\mathcal{N}(\mu, \Sigma)$ denote a normal distribution with mean $\mu$ and covariance matrix $\Sigma$. The likelihood of a feature $x \sim \mathcal{N}\bigl(\mu_k, P D_k P^T\bigr)$ can then be calculated as

\[
\log p(x; \{\mu_k, D_k\}, P) = \log p\bigl(x; \{\mu_k, P D_k P^T\}\bigr) = -\frac{1}{2} \Bigl[ \log|2\pi P D_k P^T| + (x - \mu_k)^T \bigl(P D_k P^T\bigr)^{-1} (x - \mu_k) \Bigr] \tag{A.6}
\]

Both $D_k$ and $P$ can be updated using an ML criterion. Assuming $P$ is non-singular, however, it is possible to define

\[
M = P^{-1}
\]

If the transformation $M$ is applied to the feature $x$, the log-likelihood of the latter can be expressed as (Anderson, 1984, §2.2)

\[
\log p(x; \{\mu_k, D_k\}, M) = \log|M| + \log p\bigl(Mx; \{M\mu_k, D_k\}\bigr) = \log|M| - \frac{1}{2} \Bigl[ \log|2\pi D_k| + (x - \mu_k)^T M^T D_k^{-1} M (x - \mu_k) \Bigr] \tag{A.7}
\]

The following development is based on the fact that (A.6) and (A.7) return equivalent likelihoods, which, resorting to an abuse of notation, we will express as

\[
\log p(x; \Lambda_k, P) = \log p(x; \Lambda_k, M) \tag{A.8}
\]

Given this equivalence, the calculation in (A.7) is to be preferred, as it entails less computation during decoding.
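The equivalence (A.8) is easy to check numerically: evaluating the Gaussian with full covariance $P D_k P^T$, as in (A.6), gives the same log-likelihood as transforming the feature by $M = P^{-1}$ and adding $\log|M|$, as in (A.7). A sketch with hand-rolled 2x2 linear algebra and invented values:

```python
import math

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def log_gauss(x, mu, S):
    """log N(x; mu, S) for a 2x2 covariance matrix S."""
    Si = inv2(S)
    r = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(r[i] * Si[i][j] * r[j] for i in range(2) for j in range(2))
    return -0.5 * (math.log((2 * math.pi) ** 2 * det2(S)) + quad)

# Semi-tied covariance Sigma = P D P^T with diagonal D (invented values).
P = [[1.0, 0.5], [0.0, 1.0]]
D = [[2.0, 0.0], [0.0, 0.5]]
PD = [[sum(P[i][k] * D[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
Sigma = [[sum(PD[i][k] * P[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]

x, mu = [1.0, -0.5], [0.2, 0.3]
lhs = log_gauss(x, mu, Sigma)                         # full covariance, (A.6)

M = inv2(P)                                           # M = P^{-1}
Mx = [sum(M[i][j] * x[j] for j in range(2)) for i in range(2)]
Mmu = [sum(M[i][j] * mu[j] for j in range(2)) for i in range(2)]
rhs = math.log(abs(det2(M))) + log_gauss(Mx, Mmu, D)  # transformed, (A.7)
```

For this particular $P$ the determinant is one, so the $\log|M|$ term vanishes; it is included for generality. The point of (A.7) is that at decode time the transform is applied once per feature, while $D_k$ stays diagonal for every component.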

Gales (1999) showed that the optimal $M$ can be estimated by maximizing the auxiliary function

\[
S^{(1)} = \sum_k \Bigl\{ c_k \log|M| - \frac{1}{2} c_k \log|2\pi D_k| - \frac{1}{2} \sum_{t,s} c^{(s)}_{k,t} \Bigl[ o^{(s)T}_{k,t} M^T D_k^{-1} M o^{(s)}_{k,t} \Bigr] \Bigr\} \tag{A.9}
\]

where

\[
o^{(s)}_{k,t} = x^{(s)}_t - \mu_k, \qquad c_k = \sum_{t,s} c^{(s)}_{k,t}
\]

and $c^{(s)}_{k,t}$ is the posterior probability that observation $x^{(s)}_t$ was drawn from Gaussian component $k$. If speaker adaptation is performed in the model space, Gales' development can be readily extended by setting

\[
o^{(s)}_{k,t} = x^{(s)}_t - \bigl(A^{(s)} \mu_k + b^{(s)}\bigr) \tag{A.10}
\]

After the auxiliary function (A.9) is calculated, the rows $m_i^T$ of $M$ can be updated using a recursive procedure similar to the one described below.

In order to estimate $M$ using an MMI criterion, we need only modify (A.9) slightly. Firstly, we redefine $c^{(s)}_{k,t}$ as the difference in posterior probabilities (2.13). Now assume that we wish only to optimize $M$; the other parameters $\{\mu_k, D_k\}$ can be updated in a separate step. Hence the Gaussian normalization factor $-\frac{1}{2}\log|2\pi D_k|$ can be excluded, as it does not depend on $M$; what remains is

\[
S^{(1)} = \sum_k \Bigl\{ c_k \log|M| - \frac{1}{2} \sum_{t,s} c^{(s)}_{k,t} \Bigl[ o^{(s)T}_{k,t} M^T D_k^{-1} M o^{(s)}_{k,t} \Bigr] \Bigr\}
\]

Let $m_i^T$ denote the $i$th row of $M$, and let $f_i^T$ denote the $i$th row of the cofactor matrix of $M$, so that $|M| = m_i^T f_i$ for any $i$. Hence $S^{(1)}$ can be rewritten as

\[
\begin{aligned}
S^{(1)} &= \frac{1}{2} \sum_k \Bigl\{ c_k \log\bigl(m_i^T f_i\bigr)^2 - \sum_{t,s} c^{(s)}_{k,t} \sum_j \frac{\bigl(m_j^T o^{(s)}_{k,t}\bigr)^2}{\sigma^2_{k,j}} \Bigr\} \\
&= \frac{1}{2} \sum_k \Bigl\{ c_k \log\bigl(m_i^T f_i\bigr)^2 - \sum_{t,s} c^{(s)}_{k,t} \sum_j \frac{1}{\sigma^2_{k,j}}\, m_j^T o^{(s)}_{k,t} o^{(s)T}_{k,t} m_j \Bigr\}
\end{aligned}
\]

or equivalently

\[
S^{(1)} = \frac{1}{2} \Bigl\{ c \log\bigl(m_i^T f_i\bigr)^2 - \sum_j m_j^T W^{(1)}_j m_j \Bigr\} \tag{A.11}
\]

where

\[
c = \sum_k c_k, \qquad
W^{(1)}_j = \sum_k \frac{1}{\sigma^2_{k,j}} \sum_{t,s} c^{(s)}_{k,t}\, o^{(s)}_{k,t} o^{(s)T}_{k,t} \tag{A.12}
\]

Note that $c = 0$ ideally.

Turning now to the second term in the MMI auxiliary function, Eq. (2.14) must be modified to read

\[
S^{(2)} = \sum_k \sum_s d^{(s)}_k \int_{\mathbb{R}^{L_x}} p\bigl(x; A^{(s)}, \Lambda'_k, P'\bigr) \log p\bigl(x; A^{(s)}, \Lambda_k, P\bigr)\, dx \tag{A.13}
\]

where $d^{(s)}_k$ is once more given by (2.23). For the case of model-space speaker adaptation, Eqs. (A.7) and (A.8) must be rewritten as

\[
\log p\bigl(x; A^{(s)}, \Lambda_k, P\bigr) = \log|M| - \frac{1}{2} \Bigl[ \log|2\pi D_k| + \bigl(x - A^{(s)}\mu_k - b^{(s)}\bigr)^T M^T D_k^{-1} M \bigl(x - A^{(s)}\mu_k - b^{(s)}\bigr) \Bigr]
\]

Substituting this expression into (A.13), we have

\[
\begin{aligned}
S^{(2)} &= \sum_k \sum_s d^{(s)}_k \Bigl\{ \log|M| - \frac{1}{2} \log|2\pi D_k| \\
&\qquad - \frac{1}{2} \int_{\mathbb{R}^{L_x}} p\bigl(x; A^{(s)}, \Lambda'_k, P'\bigr) \bigl[x - \bigl(A^{(s)}\mu_k + b^{(s)}\bigr)\bigr]^T M^T D_k^{-1} M \bigl[x - \bigl(A^{(s)}\mu_k + b^{(s)}\bigr)\bigr]\, dx \Bigr\}
\end{aligned}
\tag{A.14}
\]

If $x$ is drawn from $p\bigl(x; A^{(s)}, \Lambda'_k, P'\bigr)$, clearly

\[
x \sim \mathcal{N}\bigl(A^{(s)}\mu'_k + b^{(s)},\, P' D'_k P'^T\bigr)
\]

so that

\[
o = x - \bigl(A^{(s)}\mu'_k + b^{(s)}\bigr) \sim \mathcal{N}\bigl(0,\, P' D'_k P'^T\bigr)
\]

In evaluating the integral in (A.14), it would be simpler to work with an uncorrelated random vector. Hence, define $y$ such that

\[
y = M_0 o = M_0 \bigl[x - \bigl(A^{(s)}\mu'_k + b^{(s)}\bigr)\bigr] \tag{A.15}
\]

where $M_0 = (P')^{-1}$. Clearly (Anderson, 1984, §2.4),

\[
y \sim \mathcal{N}\bigl(0, D'_k\bigr)
\]

From (A.15) it also follows that

\[
x = P'y + \bigl(A^{(s)}\mu'_k + b^{(s)}\bigr)
\]

Substituting the last equation into (A.10), we find

\[
x - \bigl(A^{(s)}\mu_k + b^{(s)}\bigr) = P'y + \bigl(A^{(s)}\mu'_k + b^{(s)}\bigr) - \bigl(A^{(s)}\mu_k + b^{(s)}\bigr) = P'y + A^{(s)}\bigl(\mu'_k - \mu_k\bigr)
\]

which implies $S^{(2)}$ can be rewritten as²

\[
\begin{aligned}
S^{(2)} &= \sum_k \sum_s d^{(s)}_k \Bigl\{ \log|M| - \frac{1}{2} \log|2\pi D_k| \\
&\qquad - \frac{1}{2} \int_{\mathbb{R}^{L_y}} p\bigl(y; D'_k\bigr) \bigl[P'y + A^{(s)}\bigl(\mu'_k - \mu_k\bigr)\bigr]^T M^T D_k^{-1} M \bigl[P'y + A^{(s)}\bigl(\mu'_k - \mu_k\bigr)\bigr]\, dy \Bigr\}
\end{aligned}
\tag{A.16}
\]

\[
\begin{aligned}
S^{(2)} &= \sum_k \sum_s d^{(s)}_k \Bigl\{ \log|M| - \frac{1}{2} \log|2\pi D_k| \\
&\qquad - \frac{1}{2} \Bigl[ \int_{\mathbb{R}^{L_y}} p\bigl(y; D'_k\bigr)\, y^T P'^T M^T D_k^{-1} M P' y\, dy + \bigl(\mu'_k - \mu_k\bigr)^T A^{(s)T} M^T D_k^{-1} M A^{(s)} \bigl(\mu'_k - \mu_k\bigr) \Bigr] \Bigr\}
\end{aligned}
\tag{A.17}
\]

where $L_y$ is the length of $y$, and (A.17) follows from (A.16) in consequence of $E\{y\} = 0$.

Again let us assume that the diagonal covariance components are not updated, so that the Gaussian normalization constant can be excluded:

\[
\begin{aligned}
S^{(2)} &= \sum_k \sum_s d^{(s)}_k \Bigl\{ \log|M| - \frac{1}{2} \int_{\mathbb{R}^{L_y}} p\bigl(y; D'_k\bigr)\, y^T P'^T M^T D_k^{-1} M P' y\, dy \\
&\qquad - \frac{1}{2} \bigl(\mu'_k - \mu_k\bigr)^T A^{(s)T} M^T D_k^{-1} M A^{(s)} \bigl(\mu'_k - \mu_k\bigr) \Bigr\} \\
&= \sum_k \Bigl\{ d_k \Bigl[ \log|M| - \frac{1}{2} \int_{\mathbb{R}^{L_y}} p\bigl(y; D'_k\bigr)\, y^T P'^T M^T D_k^{-1} M P' y\, dy \Bigr] \\
&\qquad - \frac{1}{2} \sum_s d^{(s)}_k \bigl(\mu'_k - \mu_k\bigr)^T A^{(s)T} M^T D_k^{-1} M A^{(s)} \bigl(\mu'_k - \mu_k\bigr) \Bigr\}
\end{aligned}
\tag{A.18}
\]

where

\[
d_k = \sum_s d^{(s)}_k
\]

Note that $y$ can be expressed as

\[
y = \sum_i y_i e_i
\]

where $e_i$ is the $i$th unit vector, $y_i \sim \mathcal{N}\bigl(0, \sigma'^2_{k,i}\bigr)$, and

\[
E\{y_i y_j\} =
\begin{cases}
\sigma'^2_{k,i}, & \text{for } i = j \\
0, & \text{otherwise}
\end{cases}
\]

Hence,

\[
\int_{\mathbb{R}^{L_y}} p\bigl(y; D'_k\bigr)\, y^T P'^T M^T D_k^{-1} M P' y\, dy
= \int_{\mathbb{R}^{L_y}} \prod_i p\bigl(y_i; \sigma'^2_{k,i}\bigr) \Bigl( \sum_j y_j e_j \Bigr)^T P'^T M^T D_k^{-1} M P' \Bigl( \sum_l y_l e_l \Bigr) dy
= \sum_i \sigma'^2_{k,i} \bigl( p'^T_i M^T D_k^{-1} M p'_i \bigr) \tag{A.19}
\]

where $p'_i$ denotes the $i$th column of $P'$. Substituting (A.19) into (A.18) provides

² The validity of this step becomes apparent upon consideration of the following. Let $Y$ be a random vector (r.v.) and let $X = g(Y)$ be another r.v., where $g$ is a deterministic function. Then for some deterministic function $f$, $E_X\{f(X)\} = E_Y\{(f \circ g)(Y)\}$.


\[
S^{(2)} = \sum_k \Bigl\{ d_k \Bigl[ \log|M| - \frac{1}{2} \sum_i \sigma'^2_{k,i} \bigl( p'^T_i M^T D_k^{-1} M p'_i \bigr) \Bigr] - \frac{1}{2} \sum_s d^{(s)}_k \bigl(\mu'_k - \mu_k\bigr)^T A^{(s)T} M^T D_k^{-1} M A^{(s)} \bigl(\mu'_k - \mu_k\bigr) \Bigr\} \tag{A.20}
\]

Now note,

\[
\sum_i \sigma'^2_{k,i} \bigl( p'^T_i M^T D_k^{-1} M p'_i \bigr)
= \sum_i \sigma'^2_{k,i} \sum_j \frac{\bigl(m_j^T p'_i\bigr)^2}{\sigma^2_{k,j}}
= \sum_j \frac{1}{\sigma^2_{k,j}}\, m_j^T \Bigl( \sum_i \sigma'^2_{k,i}\, p'_i p'^T_i \Bigr) m_j
\]

Moreover, defining

\[
\Delta^{(s)}_k = A^{(s)} \bigl(\mu'_k - \mu_k\bigr)
\]

it is readily found that

\[
\sum_s d^{(s)}_k \bigl(\mu'_k - \mu_k\bigr)^T A^{(s)T} M^T D_k^{-1} M A^{(s)} \bigl(\mu'_k - \mu_k\bigr)
= \sum_j \frac{1}{\sigma^2_{k,j}}\, m_j^T \Bigl( \sum_s d^{(s)}_k \Delta^{(s)}_k \Delta^{(s)T}_k \Bigr) m_j
\]

Substituting this into (A.20), we find

\[
S^{(2)} = \frac{1}{2} \Bigl\{ d \log\bigl(m_i^T f_i\bigr)^2 - \sum_j m_j^T W^{(2)}_j m_j \Bigr\} \tag{A.21}
\]

where

\[
d = \sum_k d_k, \qquad
W^{(2)}_j = \sum_k \frac{1}{\sigma^2_{k,j}} \Bigl[ d_k \sum_i \sigma'^2_{k,i}\, p'_i p'^T_i + \sum_s d^{(s)}_k \Delta^{(s)}_k \Delta^{(s)T}_k \Bigr]
\]

Finally, combining the intermediate results from (A.11) and (A.21), we arrive at

\[
Q(\Lambda|\Lambda') = S^{(1)} + S^{(2)} = \frac{1}{2} \Bigl\{ (c + d) \log\bigl(m_i^T f_i\bigr)^2 - \sum_j m_j^T W_j m_j \Bigr\} \tag{A.22}
\]

where

\[
W_j = W^{(1)}_j + W^{(2)}_j = \sum_k \frac{1}{\sigma^2_{k,j}} \Bigl[ \sum_{t,s} c^{(s)}_{k,t}\, o^{(s)}_{k,t} o^{(s)T}_{k,t} + d_k \sum_i \sigma'^2_{k,i}\, p'_i p'^T_i + \sum_s d^{(s)}_k \Delta^{(s)}_k \Delta^{(s)T}_k \Bigr] \tag{A.23}
\]

Note that if the means are updated in a separate step, then $\Delta^{(s)}_k = 0$ and (A.23) simplifies to

\[
W_j = \sum_k \frac{1}{\sigma^2_{k,j}} \Bigl[ \sum_{t,s} c^{(s)}_{k,t}\, o^{(s)}_{k,t} o^{(s)T}_{k,t} + d_k \sum_i \sigma'^2_{k,i}\, p'_i p'^T_i \Bigr] \tag{A.24}
\]

A procedure for optimizing (A.22) can be developed directly along the lines proposed by Gales (1999). We begin by taking the partial derivative with respect to $m_i^T$ as

\[
\frac{\partial Q}{\partial m_i^T} = (c + d) \frac{f_i^T}{m_i^T f_i} - m_i^T W_i \tag{A.25}
\]

Equating to zero and rearranging, this provides

\[
(c + d)\, f_i^T W_i^{-1} = \bigl(m_i^T f_i\bigr)\, m_i^T \tag{A.26}
\]

From the latter, it is clear that $m_i^T$ must be in the direction of $f_i^T W_i^{-1}$. Hence, set


\[
m_i^T = \alpha\, f_i^T W_i^{-1}
\]

for some real $\alpha$, and substitute into (A.26) to obtain

\[
\alpha^2 \bigl( f_i^T W_i^{-1} f_i \bigr)\, f_i^T W_i^{-1} = (c + d)\, f_i^T W_i^{-1}
\]

Hence $\alpha$ is given by

\[
\alpha = \sqrt{\frac{c + d}{f_i^T W_i^{-1} f_i}}
\]

and $Q(\Lambda|\Lambda')$ can be maximized by iteratively updating the rows of $M$ according to

\[
m_j^T = f_j^T W_j^{-1} \sqrt{\frac{c + d}{f_j^T W_j^{-1} f_j}} \tag{A.27}
\]

Apart from the redefinition of $c^{(s)}_{k,t}$ in (A.12) and the regularization term introduced by the MMI criterion, this expression is identical to that derived by Gales (1999).

Eq. (A.27) cannot be used indiscriminately. In particular, care must be taken to choose the $\bigl\{d^{(s)}_k\bigr\}$ appearing in (A.23) large enough so that each $W_j$ is positive definite, as we now show. Proceeding once more from (A.25) we find

\[
\frac{\partial^2 Q(\Lambda|\Lambda')}{\partial m_i\, \partial m_i^T} = -(c + d) \frac{f_i f_i^T}{\bigl(m_i^T f_i\bigr)^2} - W_i
\]

In order to have a maximum of $Q(\Lambda|\Lambda')$, we require that $\partial^2 Q / \partial m_i\, \partial m_i^T$ be negative definite. As $c + d = d > 0$ (recall that $c = 0$ ideally), and $f_i f_i^T$ is a symmetric rank-one matrix, and hence positive semi-definite, this will certainly hold if $W_i$ is positive definite. It may be possible, however, to derive a weaker condition.
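The iterative row update (A.27) can be illustrated for a 2x2 transform: the cofactor rows $f_i^T$ are obtained from $\det(M)\, M^{-T}$, and one sweep updates each row in turn. The accumulators $W_j$ and the count $c + d$ below are invented (with each $W_j$ positive definite, as required); a sweep should not decrease the auxiliary function (A.22):

```python
import math

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def cofactor_row(M, i):
    """Row i of the cofactor matrix of M: det(M) * (M^{-1})^T, row i."""
    Mi = inv2(M)
    d = det2(M)
    return [d * Mi[0][i], d * Mi[1][i]]

def aux(M, W, cd):
    """MMI auxiliary function (A.22) for the 2x2 case."""
    val = cd * math.log(det2(M) ** 2)
    for j in range(2):
        val -= sum(M[j][a] * W[j][a][b] * M[j][b]
                   for a in range(2) for b in range(2))
    return 0.5 * val

def stc_sweep(M, W, cd):
    """One pass of the row update (A.27) over the rows of M."""
    for j in range(2):
        f = cofactor_row(M, j)
        Winv = inv2(W[j])
        fw = [sum(f[a] * Winv[a][b] for a in range(2)) for b in range(2)]
        scale = math.sqrt(cd / sum(fw[a] * f[a] for a in range(2)))
        M[j] = [scale * fw[a] for a in range(2)]
    return M

# Invented positive-definite accumulators W_j and total count c + d.
W = [[[3.0, 0.5], [0.5, 2.0]], [[2.0, -0.3], [-0.3, 4.0]]]
cd = 5.0
M = [[1.0, 0.0], [0.0, 1.0]]
before = aux(M, W, cd)
M = stc_sweep(M, W, cd)
after = aux(M, W, cd)
```

After each row update $m_j^T W_j m_j = c + d$, which provides a convenient sanity check on an implementation.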

References

Anastasakos, T., McDonough, J., Schwarz, R., Makhoul, J., 1996. A compact model for speaker-adaptive training. In: Proceedings of the ICSLP, pp. 1137–1140.

Anderson, T.W., 1984. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, New York.

Axelrod, S., Goel, V., Gopinath, R., Olsen, P., Visweswariah, K., 2004. Discriminative estimation of subspace constrained Gaussian mixture models for speech recognition. IEEE Transactions on Speech and Audio Processing.

Deller, J., Hansen, J., Proakis, J., 1993. Discrete-Time Processing of Speech Signals. Macmillan Publishing, New York.

Digalakis, V., Rtischev, D., Neumeyer, L., 1995. Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Transactions on Speech and Audio Processing 3, 357–366.

Gales, M.J.F., 1998. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language 12.

Gales, M.J.F., 1999. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing 7, 272–281.

Gallager, R., 1968. Information Theory and Reliable Communication. Wiley, New York.

Gopalakrishnan, P., Kanevsky, D., Nadas, A., Nahamoo, D., 1991. An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory 37 (January), 107–113.

Gunawardana, A., 2001. Maximum mutual information estimation of acoustic HMM emission densities. Tech. Rep. 40, Center for Language and Speech Processing, The Johns Hopkins University, 3400 N. Charles St., Baltimore, MD 21218, USA.

Gunawardana, A., Byrne, W., 2000. Discriminative speaker adaptation with conditional maximum likelihood linear regression. In: Proceedings of the ITRW ASR, ISCA.

Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C., 2005. Hidden conditional random fields for phone classification. In: Proceedings of the Interspeech, Lisbon, Portugal.

Leggetter, C.J., Woodland, P.C., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 9 (April), 171–185.

Liu, C., Jiang, H., Li, X., 2005. Discriminative training of CDHMMs for maximum relative separation margin. In: Proceedings of the ICASSP, Philadelphia, Pennsylvania, USA.

Ljolje, A., Pereira, F., Riley, M., 1999. Efficient general lattice generation and rescoring. In: Proceedings of the Eurospeech, Budapest, Hungary.

Mahajan, M., Gunawardana, A., Acero, A., 2006. Training algorithms for hidden conditional random fields. In: Proceedings of the ICASSP, Toulouse, France.

Matsoukas, S., Schwartz, R., Jin, H., Nguyen, L., 1996. Practical implementations of speaker-adaptive training. In: Proceedings of the ICASSP.

McDonough, J., 2000. Speaker compensation with all-pass transforms. Ph.D. Thesis, The Johns Hopkins University.

McDonough, J., Waibel, A., 2003. Maximum mutual information speaker-adapted training with semi-tied covariance matrices. In: Proceedings of the ICASSP.

McDonough, J., Waibel, A., 2004. Performance comparisons of all-pass transform adaptation with maximum likelihood linear regression. In: Proceedings of the ICASSP, Montreal, Canada.

McDonough, J., Schaaf, T., Waibel, A., 2002. On maximum mutual information speaker-adapted training. In: Proceedings of the ICASSP.

McDonough, J., Schaaf, T., Waibel, A., 2004. Speaker adaptation with all-pass transforms. Speech Communication Special Issue on Adaptive Methods in Speech Recognition.

McDonough, J., Wölfel, M., Stoimenov, E., 2007. Comparison of techniques for combining speaker adaptation with discriminative training. In: Proceedings of the ICASSP, Honolulu, Hawaii, USA.

Mohri, M., Riley, M., 1998. Network optimizations for large vocabulary speech recognition. Speech Communication 25 (3).

Normandin, Y., 1991. Hidden Markov models, maximum mutual information, and the speech recognition problem. Ph.D. Thesis, McGill University.

Povey, D., Woodland, P., 2002. Minimum phone error and I-smoothing for improved discriminative training. In: Proceedings of the ICASSP, Orlando, Florida, USA.

Povey, D., Kingsbury, B., Mangu, L., Saon, G., Soltau, H., Zweig, G., 2005. fMPE: Discriminatively trained features for speech recognition. In: Proceedings of the ICASSP, Philadelphia, Pennsylvania, USA.

Saon, G., Povey, D., Zweig, G., 2005. Anatomy of an extremely fast LVCSR decoder. In: Proceedings of the Interspeech, Lisbon, Portugal.

Schlüter, R., 2000. Investigations on discriminative training criteria. Ph.D. Thesis, Rheinisch-Westfälische Technische Hochschule Aachen.

Sha, F., Saul, L.K., 2006. Large margin Gaussian mixture modeling for phonetic classification and recognition. In: Proceedings of the ICASSP, Toulouse, France.

Stoimenov, E., McDonough, J., 2006. Modeling polyphone context with weighted finite-state transducers. In: Proceedings of the ICASSP, Toulouse, France.

Tsakalidis, S., Doumpiotis, V., Byrne, W., 2002. Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. In: Proceedings of the ICSLP, pp. 2585–2588.

Uebel, L., Woodland, P., 2001. Improvements in linear transform based speaker adaptation. In: Proceedings of the ICASSP.

Valtchev, V., Woodland, P.C., Young, S.J., 1997. MMIE training of large vocabulary speech recognition systems. Speech Communication 22, 303–314.

Wölfel, M., McDonough, J., 2005. Minimum variance distortionless response spectral estimation: review and refinements. IEEE Signal Processing Magazine.

Woodland, P., Povey, D., 2000. Large scale discriminative training for speech recognition. In: ISCA ITRW Automatic Speech Recognition: Challenges for the Millennium, pp. 7–16.

Zheng, J., Butzberger, J., Franco, H., Stolcke, A., 2001. Improved maximum mutual information estimation training of continuous density HMMs. In: Proceedings of the Eurospeech.