
Philips J. Res. 43, 224-245, 1988 R 1184

IMPROVED HIDDEN MARKOV MODELS FOR SPEECH RECOGNITION

by X. AUBERT, H. BOURLARD, Y. KAMP and C.J. WELLEKENS
Philips Research Laboratory Brussels, Av. Van Becelaere 2, Box 8, B-1170 Brussels, Belgium.

Abstract

The basic hidden Markov model used in most statistical approaches to speech recognition needs to be considerably expanded and modified in order to become fully efficient. The purpose of this paper is to describe some of these improvements brought to the original scheme. First, good starting values are needed for the training procedure by which the model parameters are adjusted to their optimal value. These initialization values have to be provided by a separate procedure performing supervised segmentation of known utterances. Next, adaptation to specific phenomena, such as the duration of acoustic events or the correlation between adjacent speech segments, requires corresponding modifications of the original model. Finally, the discriminative power of the models can be increased to improve recognition accuracy.

Keywords: Dynamic programming, hidden Markov models, multilayer perceptron, pattern recognition, speech recognition.

1. Introduction

One of the specific difficulties in speech recognition is to take into account the inherent statistical variations in pronunciation and speaking rate, which quite adversely affect the performance of recognition systems even when operating in speaker-dependent mode. An efficient approach to this problem consists in modelling each word or subword unit (e.g. a phoneme) by a (first order) Markov chain. Indeed, the introduction of this technique around 1975 (see e.g. refs 1 and 2) has brought considerable progress in recognition rate. However, more extended experience has quickly revealed that Markov models are still rather crude approximations of the speech production process and that they need to be considerably refined and equipped with auxiliary tools to produce their full efficiency. The purpose of this paper is to describe some of the improvements worked out at PRLB, to explain why they were felt necessary and how they could be integrated in a complete recognition system.


The basic concept of a Markov chain can be introduced as follows (see e.g. ref. 3). Consider a discrete time, finite state automaton having a set of states $Q = \{q_1, q_2, \ldots, q_S\}$. The successive states visited in the course of time form a sequence of random variables $\{q_{i_k}, k = 1, 2, \ldots\}$ with $q_{i_k} \in Q$. This sequence forms a first order stationary Markov chain if

$$P(q_{i_k} \mid q_{i_1}, q_{i_2}, \ldots, q_{i_{k-1}}) = P(q_{i_k} \mid q_{i_{k-1}}) \qquad (1.1)$$

and if the right-hand side in (1.1) is independent of the time index $k$. The parameters of such a Markov chain are the transition probabilities $p_{ij} \triangleq P(q_j^k \mid q_i^{k-1})$ satisfying the condition $\sum_{j=1}^{S} p_{ij} = 1$ for all $i$. In order to avoid separate specification of initial probabilities, it will be assumed that all state sequences start from an additional initial state $I$. The preprocessor of a recognition system extracts vectors of $d$ acoustic parameters from the speech waveform at regular intervals, typically every centisecond.

In this way, the speech signal is transformed into a sequence of acoustic vectors $Y = \{y_1, y_2, \ldots\}$, $y_k \in \mathbb{R}^d$. Accordingly, each state $q_i$ of the Markov model (except the initial state) has an emission probability $P(y_k \mid q_i^k)$ associated with it, specifying the probability that a particular acoustic vector $y_k$ is produced in this state, where the notation $q_i^k$ means that, at time $k$, the observed state is $q_i$. The emission probability is also assumed time-independent and is thus written $P(y_k \mid q_i)$. In the following, we will essentially consider continuous multivariate Gaussian emission probabilities, characterized by the density function

$$P(y_k \mid q_i) = (2\pi)^{-d/2}\, |\Sigma_i|^{-1} \exp\!\left(-\tfrac{1}{2}\,(y_k - \mu_i)^T (\Sigma_i \Sigma_i^T)^{-1} (y_k - \mu_i)\right) \qquad (1.2)$$

where $\mu_i$ and $\Sigma_i \Sigma_i^T$ are, respectively, the mean vector and the covariance matrix associated with state $q_i$. Let us also mention the important alternative of discrete emission probabilities, which are used when the acoustic vectors have been vector-quantized into a finite alphabet. Speech is represented by an ordered sequence of acoustic vectors and this is reflected in the Markov model by forcing a unidirectional flow in the state sequence via the constraint $p_{ij} = 0$ if $j < i$. A word or subword unit is then modelled by a Markov chain such as shown in fig. 1, where $F$ is a final non-emitting state. The name hidden Markov model (HMM) was coined to stress the fact that one cannot observe directly the sequence of states but only the sequence of acoustic vectors produced in each state.


Fig. 1. A typical 3-state phonemic Markov model.

In other words, the conditional emission probabilities have a masking effect on the actual sequence of states, to the effect that there is no one-to-one correspondence between an acoustic vector and the state producing it. The fact that states can be skipped or repeated according to the transition probabilities allows for shortening or elongation of the utterance; the variability in pronunciation is captured by the emission probabilities. In the following, phonemes will be used as the fundamental acoustic units and the recognition procedures will thus be based on phonemic hidden Markov models, a typical example of which is given in fig. 1, consisting of 3 states with only loops and jumps to the next state. Word models are derived by concatenation of the constituting phoneme models according to the phonetic transcription of the lexicon, and a word sequence or sentence model is formed by cascading word models. Consequently, if some phoneme model appears several times in a word or sentence model, then the corresponding states of its HMM are tied in the sense that they are bound to have the same emission and transition probabilities.
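For concreteness, the following sketch (Python with NumPy; the function and variable names are ours, not the paper's) evaluates the logarithm of the Gaussian emission density (1.2), with the covariance parametrized through its square-root factor $\Sigma_i$:

```python
import numpy as np

def log_emission(y, mu, sigma_factor):
    """Log of the Gaussian emission density (1.2).

    y            : acoustic vector, shape (d,)
    mu           : state mean vector mu_i, shape (d,)
    sigma_factor : square-root factor Sigma_i of the covariance,
                   so that the covariance is Sigma_i @ Sigma_i.T
    """
    d = y.shape[0]
    cov = sigma_factor @ sigma_factor.T
    diff = y - mu
    # Solve a linear system instead of inverting, for numerical stability.
    quad = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# Example with a 2-dimensional acoustic vector.
rng = np.random.default_rng(0)
y = rng.standard_normal(2)
print(log_emission(y, np.zeros(2), np.eye(2)))
```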

The complete specification of a model thus involves the determination of two parameter sets: the transition probabilities $p_{ij}$ and the emission probabilities $P(y \mid q_i)$. These parameters are adjusted to their optimal values in a preliminary training phase in which known utterances are matched against their associated models. The two main estimation techniques are based on an evaluation of the probability that a given utterance $Y = \{y_1, \ldots, y_K\}$ has been produced by its corresponding Markov model $M$ having a set of states denoted by $Q$. Consider a path of length $K$ in $M$, i.e. a sequence of $K$ states $\Pi = \{q_{i_1}, q_{i_2}, \ldots, q_{i_K} \mid q_{i_k} \in Q\}$. The probability that utterance $Y$ has been produced along this path can be computed as

$$P(Y \mid \Pi) = \prod_{k=1}^{K} p_{i_k, i_{k+1}}\, P(y_k \mid q_{i_k}), \qquad (1.3)$$

where the transition probabilities from the initial state $I$ are neglected and where $p_{i_K, i_{K+1}}$ denotes the transition probability from state $q_{i_K}$ to the final non-emitting state $F$.
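A direct transcription of (1.3) may help fix ideas. The sketch below (our illustration, not the paper's code) accumulates the log probability of an utterance along a fixed state path, neglecting the transition out of the initial state as in the text:

```python
import numpy as np

def log_path_probability(log_emis, path, log_trans, final_state):
    """log P(Y|Pi) along a fixed state path, following (1.3).

    log_emis   : (K, S) array of log P(y_k | q_i) for every frame and state
    path       : list of K state indices i_1, ..., i_K
    log_trans  : (S+1, S+1) log transition probabilities, the last
                 index being the final non-emitting state F
    final_state: index of F
    The transition out of the initial state I is neglected, as in the text.
    """
    total = 0.0
    K = len(path)
    for k, state in enumerate(path):
        total += log_emis[k, state]
        nxt = path[k + 1] if k + 1 < K else final_state
        total += log_trans[state, nxt]
    return total

# Toy example: 3 emitting states plus F (index 3), 4 frames.
rng = np.random.default_rng(0)
log_emis = np.log(rng.uniform(0.1, 1.0, (4, 3)))
log_trans = np.log(np.full((4, 4), 0.25))
print(log_path_probability(log_emis, [0, 0, 1, 2], log_trans, final_state=3))
```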


The probability that model $M$ has produced $Y$ is then given by

$$P(Y \mid M) = \sum_{\Pi \in H} P(Y \mid \Pi), \qquad (1.4)$$

where $H$ stands for the set of all paths of length $K$ in $M$. If $\theta$ denotes the parameter set of model $M$, the maximum likelihood estimation procedure consists in finding the optimal set $\hat{\theta}$ such that $P_{\hat{\theta}}(Y \mid M) \geq P_{\theta}(Y \mid M)$. A recursive solution to this maximization problem is provided by the forward-backward reestimation formulas 6,7). The Viterbi estimation is a simplified version of the preceding one in the sense that, instead of considering all paths in $M$, it selects only the best one $\hat{\Pi}$ such that $P(Y \mid \hat{\Pi}) \geq P(Y \mid \Pi)$, $\forall \Pi \in H$, and the parameters are then adjusted in order to satisfy $P_{\hat{\theta}}(Y \mid \hat{\Pi}) \geq P_{\theta}(Y \mid \hat{\Pi})$. Determination of the best path can be performed via a dynamic programming algorithm 8) as follows. If one denotes by $P(Y, q_j^K)$ the probability of the best path in the model which jointly produces $Y$ and terminates in state $q_j$, then, by definition,

$$P(Y \mid \hat{\Pi}) = \max_{q_j \in Q} P(Y, q_j^K). \qquad (1.5)$$

According to the principle of optimality 8), the right-hand side can be computed by the recurrence

$$P(Y_1^k, q_j^k) = \max_{q_i \in Q} \left[ P(Y_1^{k-1}, q_i^{k-1})\, p_{ij} \right] P(y_k \mid q_j), \qquad (1.6)$$

where $Y_1^k$ stands for the partial utterance $\{y_1, y_2, \ldots, y_k\}$. In the speech recognition literature, eq. (1.6) is often rewritten by taking logarithms of both sides, in which case logarithms of emission probabilities are interpreted as 'local distances' and logarithms of transition probabilities as 'time distortion penalties'. The left-hand side of (1.6) is often called the 'accumulated probability' of emitting the first $k$ acoustic vectors and terminating in state $q_j$.
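In log-domain form, recurrence (1.6) is a few lines of dynamic programming. The following sketch (our notation; emission and transition scores are assumed precomputed) fills the table of accumulated log probabilities:

```python
import numpy as np

def viterbi_accumulated(log_emis, log_trans):
    """Accumulated log probabilities of (1.6) by dynamic programming.

    log_emis : (K, S) log emission probabilities log P(y_k | q_j)
    log_trans: (S, S) log transition probabilities log p_ij
    Returns the (K, S) table of accumulated log probabilities
    log P(Y_1^k, q_j^k); by (1.5), the best path score is the maximum
    of the last row (up to the final transition into F).
    """
    K, S = log_emis.shape
    acc = np.full((K, S), -np.inf)
    acc[0] = log_emis[0]      # transitions from the initial state neglected
    for k in range(1, K):
        for j in range(S):
            acc[k, j] = np.max(acc[k - 1] + log_trans[:, j]) + log_emis[k, j]
    return acc

rng = np.random.default_rng(0)
acc = viterbi_accumulated(np.log(rng.uniform(0.1, 1, (5, 3))),
                          np.log(np.full((3, 3), 1 / 3)))
print(acc[-1].max())
```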

Although recognition algorithms are not the main topic of this paper, it seems appropriate for the sake of completeness to point out that they follow essentially the same line of thought as the training techniques. If $M$ now denotes a word model, then an unknown utterance $Y$ is recognized as word $\hat{M}$ if

$$P(\hat{M} \mid Y) \geq P(M \mid Y), \quad \forall M. \qquad (1.7)$$


According to Bayes' rule, $P(M \mid Y) = P(Y \mid M)\, P(M) / P(Y)$, where $P(M)$ is the a priori probability of word $M$ provided by the language model. For recognition according to the maximum likelihood criterion, $P(Y \mid M)$ is computed via (1.4) and (1.3). Extension of this criterion to continuous speech, i.e. with no artificial pauses between words, is also possible, although not straightforward 9). The Viterbi recognition criterion bases its decision on the same inequality (1.7) but, instead of using (1.4), assumes that $P(Y \mid M)$ is approximated by $P(Y \mid \hat{\Pi})$, where $\hat{\Pi}$ is the best path in $M$. The main advantage of the latter criterion is that it can be efficiently implemented by dynamic programming, called dynamic time warping (DTW) in this context, and that it is easily extended to continuous speech (see ref. 10 and the references therein).

Since the training algorithms mentioned above all lead to a local optimum of the parameters, the issue of good initial values is of prime importance. Section 2 describes an automatic procedure which performs supervised segmentation and phonetic labelling of continuous speech in a speaker-independent mode, thereby providing first estimates of the emission probabilities.

The transition probabilities inherited from the underlying Markov chain suffer from the drawback that they are unable to reflect the most probable duration of an acoustic event. The semi-Markov model introduced in sec. 3 offers a more realistic approach to this problem by modelling the state occupancy by an appropriate probability density function, of which the Poisson distribution is a simple example.

Another limitation of the classical hidden Markov model is that it ignores the correlation existing between successive acoustic vectors. To take this phenomenon into account, correlated emission probabilities, which are conditionally dependent on the previous acoustic pattern, are introduced in sec. 4.

Extended experience with recognition systems based on Markovian models reveals that the differences between the matching scores of the best and second best hypotheses are too small, which may lead to fragile decisions. This observation stresses the need for models with sharper discrimination: replacing emission probabilities by trained discriminant functions, as discussed in sec. 5, provides a solution in this direction. It will be shown how this goal can be achieved either by conventional methods in the linear case or by new approaches such as multilayer perceptrons in the nonlinear case.


2. Supervised phonetic segmentation

Supervised segmentation aims at locating the successive acoustic events produced during the utterance of a known sentence. Having chosen the phoneme as the basic acoustic unit, a set of about 40 phoneme-like units is needed to completely describe a particular European language. The problem thus consists in partitioning the speech signal into intervals that correspond best to the articulated phonemes.

Though the sequence of words making up the utterance is known from the outset, the task is not a trivial one. First, the speaker has some degree of freedom with respect to the standard phonetic transcription; hence some uncertainty concerning the phoneme sequence that has effectively been pronounced. Then, the realizations of most phonemes show considerable variations, both in their spectrum and duration distributions, this aspect being inherent to the speech production process. The segmentation problem has been tackled many times in the past using various methods. A first group of algorithms resorts to time-constrained clustering techniques applied to the whole utterance with little use of the phonetic transcription (see e.g. refs 11 and 12), while other approaches mainly rely on acoustic-phonetic and phonological knowledge, formalized and structured into a (possibly very complex) set of rules (see e.g. refs 13 and 14).

Our approach combines explicit use of speech-specific knowledge within the framework of classical tools like DTW. The system works in a speaker-independent mode and is able to adapt, to a certain extent, the standard phonetic transcription to what has really been uttered. A sentence is segmented through three consecutive stages. First, a 'reference' phonetic transcription is produced from the known word sequence. Second, the speech signal is aligned with this phonemic string by globally optimizing an objective function taking spectral similarity, duration distribution and phonological rules into account. Third, starting from this time-alignment, each consecutive phoneme pair is segmented using specialized procedures devoted to particular phonetic contexts. This second pass leads to a refined segmentation while making checks on the validity of the labelling and the boundary location. The final output yields a 'realized' phonetic transcription together with the landmarks in the speech signal.

2.1. Handling of the phonetic transcription

We proceed by concatenating the standard phonetic transcription of each word, together with some special allophones. Although the system is aimed at connected speech, we always insert an 'inter-word' symbol to absorb possible pauses or breath noise.


Plosives are split into their occlusion and burst portions, while a distinction is made between the intervocalic and postvocalic 'R' and 'L'. The assembled phoneme string corresponds to the longest and most detailed utterance of the sentence, as allowed by our system. However, in connected speech the coarticulation effects are particularly important between neighbouring words, thus leading to shorter realizations of the same sentence. It is the role of the phonological rules embedded in the time-alignment module to provide some flexibility with respect to the basic transcription, by allowing for the skipping or assimilation of some phonemes in particular contexts.

2.2. Globally optimized time-alignment

Each phoneme is characterized by 3 (single) values: a loudness index, a spectral-distribution index and a typical duration. These parameters provide enough information for the time-alignment, making the manual extraction of prototypes unnecessary. Let $i$ denote the $i$-th 'centi-second frame' in the signal, $j$ the $j$-th symbol in the transcription and $p(j)$ the corresponding phoneme. Then a spectral similarity measure between the $i$-th frame and the $j$-th symbol is given by:

$$d[i,j] = C_1[p(j)] \times |LR[p(j)] - LF[i]| + C_2[p(j)] \times |SDR[p(j)] - SDF[i]|,$$

where $C_1[\cdot]$, $C_2[\cdot]$ are weights depending on the phoneme $p(j)$; $LR[\cdot]$, $SDR[\cdot]$ are the reference values provided by a table for (resp.) the loudness and spectral distribution parameters; $LF[\cdot]$, $SDF[\cdot]$ are the current values of the same parameters calculated at a particular centi-second frame in the signal; and $|\cdot|$ denotes the absolute value operator.

The spectral distribution parameter is obtained from the first normalized autocorrelation coefficient, while the loudness parameter comes from the root mean square energy, both parameters being processed through adaptive clipping and scaling to provide a kind of voice normalization 14).

Concerning duration modelling, each phoneme $p(j)$ is characterized by a reference duration value $DR[p(j)]$ provided by the same table as before. The first step consists in adapting these values to the sentence-averaged speaking rate. Let $\rho$ denote the ratio between the expected length (obtained by summing the $DR[\cdot]$ for all the symbols appearing in the transcription) and the observed length of the sentence (obtained through endpoint detection on the signal). Then, for all phonemes but a few exceptions (plosives), the duration values are linearly adapted following $a[p(j)] = DR[p(j)]/\rho$.


The second step consists in associating to each $a[\cdot]$ a discrete Poisson distribution 15), which is modified for some phonemes (fricatives, long vowels) to cope with strong lengthening effects.

Now, using $d[i,j]$ as a local distance and the negative logarithm of the duration distribution as an additive weight, the globally optimized time-alignment is obtained by a DTW algorithm, the phonological rules controlling the phoneme transitions in the recurrence. By taking the whole utterance into account, this global alignment provides a first segmentation which is almost free of gross errors. However, the precise location of the inter-phoneme boundaries is not necessarily achieved at once.
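As a rough illustration of the local costs entering this DTW, the sketch below computes the spectral similarity measure $d[i,j]$ and a $-\log$ Poisson duration weight; all table values and weights here are hypothetical, not taken from the paper:

```python
import math

def spectral_similarity(c1, c2, lr, sdr, lf, sdf):
    """d[i,j] of sec. 2.2: weighted L1 distance between reference and
    frame values of the loudness (L) and spectral distribution (SD)
    parameters."""
    return c1 * abs(lr - lf) + c2 * abs(sdr - sdf)

def duration_penalty(a, k):
    """-log of a Poisson duration probability with adapted mean a,
    used as an additive weight in the DTW recurrence."""
    return a - k * math.log(a) + math.lgamma(k + 1)   # lgamma(k+1) = log k!

# Hypothetical reference-table entry for one phoneme and one frame.
print(spectral_similarity(c1=1.0, c2=0.5, lr=0.8, sdr=0.3, lf=0.7, sdf=0.4))
print(duration_penalty(a=8.0, k=6))
```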

2.3. Local segmentation procedure

For each pair of consecutive phonemes, an accurate boundary location is sought by means of specialized procedures which are applied on the respective sub-interval determined by the global time-alignment. It is thus assumed that the latter is accurate enough to ensure that the true boundary lies within this sub-interval. As there are about 40 phonemes, it is not possible to give a particular treatment to every phoneme pair. Rather, we resort to a Broad Phonetic Class (BPC) organization: each BPC 16) includes several phonemes sharing gross spectral characteristics, thus making it possible to group many phoneme pairs into one BPC sequence. A particular procedure is devoted to each of the following cases, which cover most of the possible phonemic contexts: Voiced-Unvoiced pair, Occlusion-Burst sequence, Burst-Fricative sequence, Vowel-Nasal pair, Vowel-Liquid pair, Vowel-Glide pair and Diphthong segmentation. For each procedure, the general principle is the following 14): within the pre-assigned sub-interval, specialized features are extracted and processed into an appropriate similarity measure. At this level, the speech parameters consist of 5 spectral energy bands in the range 0 to 6.5 kHz, together with their centre of gravity. A time-constrained clustering algorithm provides the boundary location, which is checked with respect to a minimum and maximum allowable duration.

2.4. Segmentation results

The system has been tested on a connected digit database and on a phonetically-balanced database including 341 vocabulary words and all phonemes of the German language. For 7 speakers (5 male, 2 female), the results appear robust against voice details and speaking rate; the phonological rules are appropriately used to adapt the phonetic transcription and, due to the duration control, the system is almost free of gross errors. Some inaccuracies may however occur in the most heavily coarticulated regions.


Fig. 2. Automatic segmentation for the utterance of the phone number 9067340 (in German) by a male speaker. The phoneme boundaries have been superimposed on the digitized speech signal.


Figure 2 is an example of the automatic segmentation obtained for the utterance (in German) of the phone number 9067340 by a male native speaker. Let us emphasize that the standard phonetic transcription has been correctly adapted to fit more closely the phoneme sequence effectively pronounced: the first two words share the same phonemic segment 'N', the 'Z' in front of the fourth word 'Z I B E N' has been assimilated to an 'S' that extends the ending 'S' of the third word, and all inter-word segments have been skipped, except between the fourth and the fifth word, where it has been converted into an occlusion 'di' preceding the burst of the 'D'.

To further assess the quality of the segmentation, recognition experiments have been performed by estimating the parameters required by the phoneme HMMs from the automatic segmentation, without any additional step. Hence, the training phase is decoupled into two separate parts, namely the phonemic segmentation of the training sentences and the parameter estimation (by collecting all the samples corresponding to the same phonemic class). On a seven-digit phone number database, the results compare favourably with those obtained by a Viterbi training initialized with a linear segmentation 17). On a thousand-word vocabulary task, recognition results are still acceptable, thus illustrating the consistency of the a priori segmentation together with its appropriateness for initializing the iterative training phase.

3. State occupancy modelling in HMM

One of the inherent limitations of the HMM lies in the fact that the time spent in a state is geometrically distributed. Indeed, if $p_{ii}$ represents the 'loop probability' associated with state $q_i$, the probability $P_i(k)$ of making $k$ consecutive loops on this state and then leaving it is given by:

$$P_i(k) = p_{ii}^k\,(1 - p_{ii}), \qquad (3.1)$$

where $1 - p_{ii}$ represents the probability of leaving the state. Such a distribution is unable to model the state occupancy adequately, since its mean value $p_{ii}/(1 - p_{ii})$ does not correspond to the maximum probability; consequently, the mean duration of a word or a phoneme cannot be made most probable. This is inherent in the Markov property.

However, it is known 18-21) that incorporating state duration information in the recognizer improves the performance. This can be achieved by an alternative phonemic HMM where the state occupancy is explicitly modelled by a better suited probability density function.


As suggested in ref. 22, a process of this kind will be called a 'semi-Markov process'; it affords much greater generality by permitting the time spent in each state to obey an arbitrary probability distribution. An example of such a function is the Poisson distribution, where $P_i(k)$, the probability of making $k$ (and only $k$) consecutive loops on state $q_i$ and then leaving it, is given by:

$$P_i(k) = \frac{a_i^k}{k!}\, e^{-a_i}, \qquad (3.2)$$

where $a_i$ is the mean (and the variance) of the distribution associated with state $q_i$ and corresponds to the most probable number of loops on this state.
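The contrast between the two duration laws is easy to check numerically. In the sketch below (illustrative values), the geometric law (3.1) always peaks at $k = 0$, whereas the Poisson law (3.2) peaks near its mean, which matches a typical phoneme duration:

```python
import math

def geometric_duration(p_loop, k):
    """P_i(k) of (3.1): k consecutive loops, then leave."""
    return p_loop ** k * (1.0 - p_loop)

def poisson_duration(a, k):
    """P_i(k) of (3.2): Poisson occupancy with mean (and variance) a."""
    return a ** k * math.exp(-a) / math.factorial(k)

for k in range(6):
    print(k,
          round(geometric_duration(0.7, k), 3),
          round(poisson_duration(3.0, k), 3))
```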

3.1. Dynamic programming in semi-Markov models

A semi-Markov model (SMM) is defined by a set of states $q_j$ $(j = 1, 2, \ldots, S)$ where, in addition to new transition probabilities to be defined below, each state is associated with two probability density functions: one for the state occupancy probability $p_j(k)$ (the probability of making exactly $k$ loops on state $q_j$) and one for the emission probabilities $p(y_k \mid q_j)$ (the probability of emitting a particular acoustic vector $y_k$ on state $q_j$).

Let $Y = \{y_1, y_2, \ldots, y_K\}$ be an utterance of $K$ acoustic vectors and $\Pi = \{q_{i_1}, \ldots, q_{i_K}\}$ be a possible path of length $K$ in the SMM. Neglecting the loops in the path description leads to a new state sequence labelling $\{q_{s_1}, \ldots, q_{s_T}\}$ with $s_j \neq s_{j+1}$. The probability of producing $Y$ along this path will then be given by:

$$P(Y \mid \Pi) = \prod_{j=1}^{T} p_{s_j}(k_{s_j})\, r_{s_j, s_{j+1}} \cdot \prod_{k=1}^{K} p(y_k \mid q_{i_k}), \qquad (3.3)$$

where $k_{s_j}$ is the number of loops on state $q_{s_j}$ $(\sum_{j=1}^{T} k_{s_j} = K)$ and $r_{s_j, s_{j+1}}$ represents the 'relative' transition probability from state $q_{s_j}$ to state $q_{s_{j+1}}$ (with $s_j \neq s_{j+1}$). Of course, by definition $\sum_{j=1}^{S} r_{ij} = 1$ and $r_{ii} = 0$.

Using the Viterbi criterion, the path $\hat{\Pi}$ maximizing the probability $P(Y \mid \Pi)$ given by (3.3) can be obtained by dynamic programming using the following recurrences

$$\hat{P}(i,j,k) = \hat{P}(i-1, j, k-1) \cdot \frac{p_j(k)}{p_j(k-1)} \cdot p(y_i \mid q_j) \quad \text{if } k > 1, \qquad (3.4)$$


$$\hat{P}(i,j,1) = \max_{q_l \in Q_j} \max_k \left[ \hat{P}(i-1, l, k) \cdot r_{lj} \right] \cdot p_j(0) \cdot p(y_i \mid q_j),$$

where:

- $\hat{P}(i,j,k)$ represents the 'accumulated' probability of emitting the first $i$ acoustic vectors on the first $j$ states, with $k$ loops on state $q_j$, and

- $Q_j$ represents the set of possible predecessor states of $q_j$, i.e. the states $q_l$ for which $r_{lj} \neq 0$.

In the particular case of our 3-state phonemic model (fig. 1) and the Poisson distribution, recurrences (3.4) can be rewritten as:

$$\hat{P}(i,j,k) = \hat{P}(i-1, j, k-1) \cdot \frac{a_j}{k} \cdot p(y_i \mid q_j) \quad \text{if } k > 1, \qquad (3.5)$$
$$\hat{P}(i,j,1) = \max_k \left[ \hat{P}(i-1, j-1, k) \right] \cdot e^{-a_j} \cdot p(y_i \mid q_j).$$
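The following sketch implements recurrences (3.5) for the 3-state left-to-right model, in the probability domain for readability (a real implementation would work with logarithms); the indexing convention, with $k = 0$ encoding the 'just entered' case written $\hat{P}(i,j,1)$ in the text, is ours:

```python
import math
import numpy as np

def smm_viterbi(emis, a):
    """Viterbi recurrences (3.5) for a left-to-right SMM with Poisson
    state occupancy.

    emis: (K, S) emission probabilities p(y_i | q_j)
    a   : (S,) Poisson means a_j
    P[i, j, k] = accumulated probability of emitting the first i+1
    vectors, being on state j with k loops made on it.
    """
    K, S = emis.shape
    P = np.zeros((K, S, K))
    P[0, 0, 0] = math.exp(-a[0]) * emis[0, 0]        # enter first state
    for i in range(1, K):
        for j in range(S):
            # Stay on state j: one more loop, Poisson ratio a_j / k.
            for k in range(1, K):
                P[i, j, k] = P[i - 1, j, k - 1] * (a[j] / k) * emis[i, j]
            # Enter state j from state j-1 (here r_{j-1,j} = 1).
            if j > 0:
                P[i, j, 0] = P[i - 1, j - 1].max() * math.exp(-a[j]) * emis[i, j]
    return P

emis = np.full((6, 3), 0.5)                  # toy emission probabilities
print(smm_viterbi(emis, a=np.array([2.0, 2.0, 2.0]))[-1, -1].max())
```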

3.2. Training of semi-Markov models

The parameters associated with each state of an SMM are:
- the 'relative' transition probabilities $r_{ij}$,
- the parameters describing the state occupancy,
- the parameters defining the emission probabilities.
To estimate these parameters, we are given, for each phonetic unit, a training sequence of independent pronunciations $Y_j$ $(1 \leq j \leq J)$ of this speech unit. Training along a maximum likelihood criterion for Poisson and Gamma distributions has been addressed in refs 21 and 23, respectively. For training with a Viterbi criterion and Poisson distribution, as discussed in ref. 15, the probability (3.6) of the best paths of the whole training sequence

$$P = \prod_{j=1}^{J} P(Y_j \mid \hat{\Pi}_j) \qquad (3.6)$$

must be maximized, where $\hat{\Pi}_j$ denotes the best path associated with $Y_j$. Unfortunately, it is not known how to solve this optimization problem rigorously, i.e. such as to guarantee a global optimum. Therefore, a two-step iterative procedure is required. In the first step, the old parameter values are used to determine new best paths (by recurrences (3.4)).



In the second step, the parameters are reestimated from the set of new best paths by maximizing the score (3.6). As a consequence, the overall process is an ascent algorithm and converges to a (local) maximum.

More specifically, the duration probabilities are obtained as follows. Let us denote by $\Phi_i$ a set of tied states (i.e. a set of states whose emission and state occupancy probabilities have to be identical). Making apparent the contribution of the states of $\Phi_i$, (3.6) can be rewritten as:

$$P = U \cdot \prod_{k=1}^{N_i} \frac{e^{-a_i}}{n_i^{(k)}!}\; a_i^{n_i^{(k)}} \cdot \prod_{j \neq i} r_{ij}^{n_{ij}}, \qquad (3.7)$$

where:
- $U$ represents the contribution of the emission probabilities and the transition probabilities of all states $\notin \Phi_i$,
- $N_i = \mathrm{card}(\Phi_i)$,
- $n_i^{(k)}$ = number of loops on a particular state (index $k$) $\in \Phi_i$,
- $n_{ij}$ = number of transitions from states of $\Phi_i$ to states of $\Phi_j$.

It can be shown 15) that the parameters $a_i$ and $r_{ij}$ $(i \neq j)$ maximizing $P$ are given by:

$$r_{ij} = n_{ij} / N_i, \qquad a_i = N_{ii} / N_i, \qquad (3.8)$$

where $N_{ii} = \sum_{k=1}^{N_i} n_i^{(k)}$ represents the observed total number of loops on the states of $\Phi_i$ for the whole training set. The determination of the emission probability parameters is unchanged and has been presented in refs 5 and 10.
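A minimal sketch of the reestimation step (3.8), reading $N_i$ as the number of visits to states of $\Phi_i$ along the new best paths (this reading of the counts is our interpretation):

```python
def reestimate_duration_params(loop_counts, transition_counts):
    """Viterbi reestimation (3.8) for one set of tied states Phi_i.

    loop_counts      : list of loop counts n_i^(k), one entry per visit
                       of a state of Phi_i along the new best paths
    transition_counts: dict {j: n_ij} of exits from Phi_i towards Phi_j
    Returns the Poisson mean a_i and the relative transitions r_ij.
    """
    N_i = len(loop_counts)                 # number of visits
    a_i = sum(loop_counts) / N_i           # mean number of loops
    r_i = {j: n / N_i for j, n in transition_counts.items()}
    return a_i, r_i

# Example: three visits with 2, 4 and 3 loops, all exits towards Phi_2.
print(reestimate_duration_params([2, 4, 3], {2: 3}))
```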


4. Explicit time correlation in HMM

The speech production mechanism implies strong correlations between successive acoustic events, which are obviously also observed in the time sequence of acoustic vectors. This has already been observed and taken into account in ref. 24 by adding some dynamic properties of the cepstral coefficients to the speech feature vectors.

In the HMM, this correlation is not explicitly taken into account, but the assumptions underlying its derivation can be relaxed to include the correlation between successive acoustic vectors in the model itself. This is achieved by the definition of a 'correlated emission' probability density. As for the classical HMM, the parameters of these new densities can be estimated by a training procedure using the Viterbi or forward-backward algorithm, according to the selected optimization criterion, as described in sec. 1.


Let us start by recalling the assumptions made when defining the classical HMM. The joint probability that a vector sequence $Y = \{y_1, \ldots, y_K\}$ is produced by a given Markov model with $y_k$ emitted on state $q_j$ is denoted by $P(q_j^k, Y)$. This probability can also be written as:

$$P(q_j^k, Y) = P(q_j^k, Y_1^k)\, P(Y_{k+1}^K \mid q_j^k, Y_1^k), \qquad (4.1)$$

where $Y_k^l$ denotes the vector sequence $\{y_k, y_{k+1}, \ldots, y_l\}$. Trivial manipulations show that

$$P(q_j^k, Y_1^k) = \sum_{i=1}^{S} P(q_i^{k-1}, Y_1^{k-1})\, P(q_j^k, y_k \mid q_i^{k-1}, Y_1^{k-1}), \qquad (4.2)$$

where $S$ is the number of states of the model. Equation (4.2) is an ascending recurrence known as the forward recurrence of the Baum-Welch algorithm 6,10).

In the classical HMM, it is assumed that the conditional probability in (4.2) is independent of the previously emitted vectors $Y_1^{k-1}$. As explained in sec. 1, an emission probability can be defined (1.2) which depends only on the emitting state and is independent of time (i.e. of $k$), of the preceding state and of the previously emitted vectors. A transition probability $p_{ij}$ is also defined, assumed independent of time and of the emitted vectors. With these assumptions, the conditional probability in (4.2) takes the classical form

$$P(q_j^k, y_k \mid q_i^{k-1}, Y_1^{k-1}) = p_{ij}\, P(y_k \mid q_j). \qquad (4.3)$$

A first-order degree of correlation can be incorporated in the model by assuming that the conditional probability depends on the previously emitted vector $y_{k-1}$. As a consequence, one has

$$P(q_j^k, y_k \mid q_i^{k-1}, Y_1^{k-1}) = p_{ij}\, P(y_k \mid q_i^{k-1}, q_j^k, y_{k-1}). \qquad (4.4)$$

The definition of the transition probability $p_{ij}$ remains unchanged,

$$P(q_j^k \mid q_i^{k-1}, y_{k-1}) = P(q_j^k \mid q_i^{k-1}) = p_{ij}, \qquad (4.5)$$

assumed independent of the emitted vectors and time invariant. The second factor of the right-hand member of (4.4) is the definition of a 'correlated' emission probability, which can also be written


$$P(y_k \mid q_i^{k-1}, q_j^k, y_{k-1}) = \frac{P(y_k, y_{k-1} \mid q_j^k, q_i^{k-1})}{P(y_{k-1} \mid q_j^k, q_i^{k-1})}. \qquad (4.6)$$

The numerator in (4.6) will be assumed to be a correlated Gaussian probability density function:

$$P(y_k, y_{k-1} \mid q_j^k, q_i^{k-1}) = (2\pi)^{-d}\, |S_{ij}|^{-1}\, |C_{ij}|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, x^T \Lambda_{ij}^{-1} x\right), \qquad (4.7)$$

where $x^T$ stands for $((y_k - \mu_{ij})^T, (y_{k-1} - \nu_{ij})^T)$, the mean vectors $\mu_{ij}$ and $\nu_{ij}$ are associated with the emitting pair of states $\{q_i, q_j\}$, $\Lambda_{ij} = S_{ij} C_{ij} S_{ij}^T$ and
$$C_{ij} = \begin{pmatrix} I & R_{ij} \\ R_{ij}^T & I \end{pmatrix} \quad \text{and} \quad S_{ij} = \begin{pmatrix} \Sigma_{ij} & 0 \\ 0 & T_{ij} \end{pmatrix}$$

are the corresponding correlation and standard deviation symmetric block matrices. Matrix $\Lambda_{ij}$ must be positive definite. The emission probability of $y_{k-1}$ on $q_i$ can be obtained from (4.7). Indeed, the integration of (4.7) over $y_k$ yields

$$P(y_{k-1} \mid q_j^k, q_i^{k-1}) = (2\pi)^{-d/2}\, |T_{ij}|^{-1} \exp\!\left(-\tfrac{1}{2}\,(y_{k-1} - \nu_{ij})^T (T_{ij} T_{ij}^T)^{-1} (y_{k-1} - \nu_{ij})\right). \qquad (4.8)$$

Using (4.5), this probability can be shown to be independent of $q_j$. As a result, $\nu_{ij}$ and $T_{ij}$ must be assumed independent of $j$ and are thus written from now on with a single subscript. The correlated emission probability of $y_k$ on $q_j$ can then be written, using (4.6), (4.7) and (4.8), as
$$P(y_k \mid q_i^{k-1}, q_j^k, y_{k-1}) = (2\pi)^{-d/2}\, |\Sigma_{ij}|^{-1}\, |I - R_{ij} R_{ij}^T|^{-1/2} \exp\!\left(-\tfrac{1}{2}\,(u - R_{ij} v)^T (I - R_{ij} R_{ij}^T)^{-1} (u - R_{ij} v)\right),$$

where $u$ and $v$ stand respectively for $\Sigma_{ij}^{-1}(y_k - \mu_{ij})$ and $T_i^{-1}(y_{k-1} - \nu_i)$. In order to simplify the training algorithm and to reduce the number of parameters, this emission probability will be assumed independent of $q_i$. Hence, the following expressions must be assumed independent of $i$ and are thus written with a single subscript $j$:


$$\Sigma_{ij} R_{ij} T_i^{-1} = X_j, \qquad \mu_{ij} - X_j \nu_i = \mu_j, \qquad \Sigma_{ij} (I - R_{ij} R_{ij}^T)\, \Sigma_{ij}^T = \Sigma_j \Sigma_j^T. \qquad (4.9)$$

With these notations, the correlated emission probability is written

$$P(y_k \mid y_{k-1}, q_j) = \frac{\exp\!\left(-\tfrac{1}{2}\,(y_k - X_j y_{k-1} - \mu_j)^T (\Sigma_j \Sigma_j^T)^{-1} (y_k - X_j y_{k-1} - \mu_j)\right)}{(2\pi)^{d/2}\, |\Sigma_j|}. \qquad (4.10)$$

It is exactly the Gaussian density function of a first order autoregressive process. Let us observe that (4.10) can be obtained directly by assuming at the outset that the numerator of (4.6) is Gaussian and independent of $q_i$. The generalization to long-range correlation is then straightforward and leads to the density function of a higher order autoregressive process instead of (4.10). For $X_j = 0$, the correlated emission probability (4.10) restores the classical one.
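The sketch below evaluates the log of the autoregressive density (4.10); the parametrization through a square-root factor of the conditional covariance follows the text, while the function and argument names are ours:

```python
import numpy as np

def log_correlated_emission(y_k, y_prev, X, mu, sigma_factor):
    """Log of the correlated emission density (4.10): the Gaussian
    density of a first order autoregressive process.

    X            : regression matrix X_j of (4.9)
    mu           : offset mu_j
    sigma_factor : square-root factor Sigma_j of the conditional
                   covariance Sigma_j @ Sigma_j.T
    For X = 0 this restores the classical density (1.2).
    """
    d = y_k.shape[0]
    cov = sigma_factor @ sigma_factor.T
    resid = y_k - X @ y_prev - mu
    quad = resid @ np.linalg.solve(cov, resid)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(1)
y0, y1 = rng.standard_normal(2), rng.standard_normal(2)
print(log_correlated_emission(y1, y0, 0.5 * np.eye(2), np.zeros(2), np.eye(2)))
```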

It is worthwhile to notice that matrix $\Lambda_{ij}$ in (4.7) will be positive definite simultaneously with $\Sigma_j \Sigma_j^T$ and $T_i T_i^T$.

The remaining problem is to train the parameters of these new probabilities on a speech database according to one of the criteria described in the introduction. In the case of a classical emission probability, reestimation formulas satisfying the criterion of the most probable path (Viterbi algorithm) or the maximum likelihood criterion (forward-backward algorithm) are well known 10). They have the advantage of allowing a state-by-state reestimation of each parameter independently. Our hypothesis of independence of the correlated emission probability with respect to the previous state also allows simple training algorithms along both criteria. The corresponding reestimation formulas have been published in ref. 25.

Speech recognition using this new model proceeds along the same techniques used with the classical HMM (e.g. using a Viterbi algorithm 10)), replacing the classical emission probabilities by the correlated ones.

5. Discriminant functions as local distances

In the following, it will be assumed that the acoustic vectors emitted on the different states $q_j$ of the HMMs form vector classes $\omega_j$ in the feature space $\mathbb{R}^d$. In ref. 26, it was pointed out that a drawback of the classical HMMs is the weak discriminating power of the emission probabilities. Indeed, the probability of emitting a given vector may be nearly equal for quite different classes. For this reason, discriminant functions are an interesting alternative tool.

On the one hand, linear discriminant functions are well-known tools and can be trained either by a perceptron-like algorithm (in the linearly separable case) or by minimization of a least mean squared error criterion 27,28). However, due to their restricted modelling capabilities, they lead to a limited improvement of the recognition scores. On the other hand, nonlinear discriminant functions are the most general and powerful tool, but they need an a priori choice among the infinite set of possible nonlinear functions and drastically increase the number of parameters.

Multilayer perceptrons (MLP) 30), which are closely related to high order nonlinear discriminant functions 32), could circumvent all these drawbacks. When compared with HMMs, the main feature of these machines is their ability to capture high-order constraints between data while stressing discrimination. Their training procedure simply performs the optimization of a single machine in the parameter space and thus differs from the HMM approach, which uses a separate model for each class, estimates the parameters according to a maximum likelihood criterion and usually makes assumptions about the distribution of the acoustic vectors.


5.1. Linear discriminant function and discriminative distance

It is assumed that each class $\omega_j$ can be associated with a linear discriminant function defined as:

$$g_j(y) = w_j^T \cdot \tilde{y}, \qquad (5.1)$$

where $\tilde{y} = (y^T, 1)^T$ is an augmented acoustic vector and $w_j$ is a $(d+1)$-dimensional weight vector. Grouping equations (5.1) for all classes leads to the matrix notation:

$$g(y) = W^T \cdot \tilde{y},$$

where $g(y) = (g_1(y), \ldots, g_n(y))^T$, $W$ is a $(d+1) \times n$ weight matrix and $n$ is the number of classes. Classification relies on the 'logical decision'

$$\text{if } g_i(y) - g_j(y) > 0, \ \forall j \neq i, \text{ then } y \in \omega_i. \qquad (5.2)$$

The sign decision in (5.2) is often transformed into a binary one via a threshold logic unit. Such a system is then known as a 'perceptron'. In that case,


the entries of $W$ can be obtained by the classical perceptron training algorithm 28) on a preclassified vector set, according to the reciprocal of the decision rule (5.2):

$$\text{if } y \in \omega_i \text{ then } g_i(y) > g_j(y), \ \forall j \neq i. \qquad (5.3)$$

This procedure sequentially updates the parameters vector-by-vector and guarantees convergence in the case of linearly separable classes. However, if the classes are not linearly separable, convergence problems will arise in the perceptron training, which is then often replaced by the minimization of a Least Mean Squared Error (LMSE) criterion. In that case, given a known partition of the training vectors into classes, the entries of $W$ are determined by minimizing a cost function $E$ expressing the sum, over all the training vectors, of the squared errors between $g_i(y)$ and the target, viz. 1 if $y \in \omega_i$ and 0 if $y \notin \omega_i$. This criterion function is explicitly written as:

$$E = \sum_{k=1}^{n} \sum_{y \in \omega_k} \| g(y) - \Delta_k \|^2, \qquad (5.4)$$

where $\Delta_k$ is an $n$-vector with all zero components except the $k$-th one, which will be referred to as the index vector in the following. In that case, the discriminant functions thus realize an approximation in 'value' (0 or 1), which significantly differs from the sign approximation of the perceptron. Computation of the weight matrix $W$ minimizing $E$ is straightforward. It is worthwhile to observe that the cost function $E$ in (5.4) can be rewritten as:

$$E = \sum_{k=1}^{n} \sum_{y \in \omega_k} d_k(y), \qquad (5.5)$$

inducing the definition of a discriminative distance between a particular vector $y$ and a class $\omega_k$ as follows:

$$d_k(y) = \| g(y) - \Delta_k \|^2 = \sum_{l=1}^{n} (g_l(y) - \delta_{kl})^2. \qquad (5.6)$$

Thus, an unknown vector $y$ will be assigned to class $\omega_k$ if $d_k(y) \leq d_i(y)$, $\forall i \neq k$. When compared with the local distances $-\ln P(y \mid q_k)$ defined by (1.2) and classically used in HMMs, the main advantage of $d_k(y)$ is that it simultaneously accounts for the proximity of a class and the distance to the other ones.


Moreover, it does not make any assumption about the distribution of the vectors in each class (as e.g. the Gaussian case). By expanding the square in (5.6), it is observed that $d_k(y)$ is proportional to $-g_k(y)$ within an additive constant independent of the class $k$, which may be dropped in a DTW process.
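Both the LMSE solution of (5.4) and the discriminative distance (5.6) are compact in matrix form. The following sketch (toy data; naming is ours) fits $W$ by ordinary least squares and evaluates $d_k(y)$:

```python
import numpy as np

def train_lmse(vectors, labels, n_classes):
    """LMSE solution of (5.4): fit W so that W^T y_tilde approximates
    the index vector of the class of y."""
    Y = np.hstack([vectors, np.ones((len(vectors), 1))])   # augmented
    targets = np.eye(n_classes)[labels]                    # index vectors
    W, *_ = np.linalg.lstsq(Y, targets, rcond=None)
    return W                                               # (d+1, n)

def discriminative_distance(W, y, k):
    """d_k(y) of (5.6): accounts at once for the proximity to class k
    and the distance to all the other classes."""
    g = W.T @ np.append(y, 1.0)
    target = np.eye(W.shape[1])[k]
    return float(np.sum((g - target) ** 2))

# Toy data: two Gaussian clouds in the plane.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(1, 0.3, (20, 2))])
lab = np.array([0] * 20 + [1] * 20)
W = train_lmse(X, lab, 2)
print(discriminative_distance(W, np.array([-1.0, -1.0]), 0))
```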

In the spirit of the classical Viterbi training, it has been proved in ref. 29 that the LMSE criterion described above can be embedded in an iterative procedure alternately combining a DTW and an LMSE solution. This possibility of embedding the determination of the discriminant functions in a DTW process which iteratively improves the segmentation points still holds in the more general approaches described in the following.

5.2. Multilayer perceptrons (MLP)

It has been seen that minimization of (5.4) avoids convergence problems, but at the price of the logical decision. A trade-off is then to simulate the threshold logic unit by a smooth, differentiable nonlinear function (sigmoid). In that case we define the new error criterion:

$$E = \sum_{k=1}^{n} \sum_{y \in \omega_k} \| F[g(y)] - \Delta_k \|^2, \qquad (5.7)$$

where the modified discriminant function vector $F[g(y)]$ is related to $g(y)$ by

$$F_i[g(y)] = \frac{1}{1 + \exp(-g_i(y))}.$$

In this way, one can expect that the LMSE process will approximate the perceptron solution for linearly separable sets. However, such a single layer perceptron inevitably inherits the limitations of linear discriminant functions.

If nonlinear mappings are required to allow logical decisions, a possible solution is to insert, between the input and the output, intermediate layers of nonlinear computational units known as hidden units. This kind of network is called a multilayer perceptron (MLP) and it can be proved 31,32) that it is always possible to perform, with a restricted number of parameters, any nonlinear mapping between input/output pairs with only one or two layers of hidden units, depending on whether the input vector is binary or real-valued.

An $N$-layered perceptron thus consists of $(N+1)$ layers $L_k$ $(k = 0, \ldots, N)$, where $L_0$ corresponds to the input layer, $L_N$ to the output layer and $L_j$ $(j = 1, \ldots, N-1)$ to the hidden layers.


Let $x_k$ denote the vector formed by the $n_k$ unit values on layer $L_k$ and let $\tilde{x}_k = (x_k^T, 1)^T$ be the corresponding augmented vector. The values on the input layer $L_0$ are externally dictated by the $d$-dimensional acoustic vector $y$ to be processed, and thus $n_0 = d$ and $x_0 = y$. In all following layers $L_k$ $(k = 1, \ldots, N)$, each unit computes a weighted sum of the unit values of the preceding layer $L_{k-1}$ and passes this result through a sigmoidal function

$$F(x) = \frac{1}{1 + e^{-x}}. \qquad (5.8)$$

The state propagation through the network is thus ruled by the followingequation

$$x_k = F(W_k^T \cdot \tilde{x}_{k-1}), \quad k = 1, \ldots, N, \qquad (5.9)$$

where $W_k$ is an $(n_{k-1}+1) \times n_k$ weight matrix and where the nonlinear function $F$ operates componentwise. Considering the global input-output relation of the MLP, one can write symbolically $x_N = \Phi(x_0)$, where $\Phi$ is a highly nonlinear function depending on the parameters $W_k$. If the goal pursued by the MLP is the classification of the acoustic vector $y$, $x_N$ will be an $n$-dimensional vector where each component $x_{N,k} = \Phi_k(y)$, $(k = 1, \ldots, n)$, is the nonlinear equivalent of the linear discriminant function $g_k(y)$ and, in the same way, can also be used as a local distance in a DTW.
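The forward propagation (5.9) can be sketched as follows (our naming; the weight shapes include the row acting on the constant 1 of the augmented vector):

```python
import numpy as np

def sigmoid(x):
    """The sigmoidal function (5.8)."""
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(y, weights):
    """State propagation (5.9): x_k = F(W_k^T x~_{k-1}), k = 1..N.

    weights: list of W_k matrices of shape (n_{k-1} + 1, n_k).
    Returns x_N, whose components play the role of nonlinear
    discriminant functions and can serve as local distances in a DTW.
    """
    x = y
    for W in weights:
        x_aug = np.append(x, 1.0)        # augmented vector x~_{k-1}
        x = sigmoid(W.T @ x_aug)
    return x

rng = np.random.default_rng(3)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((5, 2))]
print(mlp_forward(rng.standard_normal(2), weights))   # d = 2, n = 2
```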

The weight matrices $W_k$ are obtained from a training set of input and associated desired output pairs by minimizing, in the parameter space, the error criterion defined as:

$$E = \sum_{t=1}^{T} \| O_t - D_t \|^2, \qquad (5.10)$$

where, for each training input acoustic vector $y_t$ (determining a particular $x_0$), $O_t$ represents the output vector $x_N$ generated by the MLP, and $D_t$ the desired output associated with $y_t$. The number of training patterns is denoted by $T$. As explained in refs 30 and 33, the weight matrices are iteratively updated via a gradient correction to reduce the error (5.10). This gradient is computed by recursively back-propagating the error, hence the name 'error back-propagation' given to this algorithm. The process is iterated until, e.g., the absolute value of the relative correction on the parameters falls under a given threshold.


If the input units are directly connected to the output units without hidden units and if $D_t$ is an index vector, it is easily observed that criterion (5.10) reduces to (5.7), which can thus be minimized by the same procedure.
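For completeness, here is a minimal sketch of one error back-propagation step on criterion (5.10) for a single hidden layer; it illustrates the gradient computation, not the implementation of refs 30 and 33:

```python
import numpy as np

def backprop_step(y, target, W1, W2, lr=0.1):
    """One gradient step on ||O_t - D_t||^2 for a 1-hidden-layer MLP."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    x0 = np.append(y, 1.0)               # augmented input
    x1 = sig(W1.T @ x0)                  # hidden layer
    x1a = np.append(x1, 1.0)
    out = sig(W2.T @ x1a)                # output layer O_t
    err = out - target                   # gradient of the squared error
    delta2 = err * out * (1.0 - out)     # through the output sigmoid
    grad_W2 = np.outer(x1a, delta2)
    # Back-propagate through W2 (dropping its bias row) and the sigmoid.
    delta1 = (W2[:-1] @ delta2) * x1 * (1.0 - x1)
    grad_W1 = np.outer(x0, delta1)
    return W1 - lr * grad_W1, W2 - lr * grad_W2

rng = np.random.default_rng(4)
W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((5, 2))
W1, W2 = backprop_step(rng.standard_normal(2), np.array([1.0, 0.0]), W1, W2)
```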

In refs 26 and 34, these discriminant approaches have been used for connected speech recognition applications and the results compare favourably with those obtained by classical HMMs.

REFERENCES
1) L.R. Bahl and F. Jelinek, IEEE Trans. Inform. Theory, IT-21, 401 (1975).
2) J.K. Baker, IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-23, 24 (1975).
3) T.L. Booth, Sequential machines and automata theory, Wiley, New York, 1967.
4) H. Bourlard, H. Ney and C.J. Wellekens, Proc. ICASSP-84, San Diego, CA, 1984, p. 26.10.1.
5) H. Bourlard, Y. Kamp and C.J. Wellekens, Proc. ICASSP-85, Tampa, FL, 1985, p. 31.5.1.
6) L.E. Baum, Inequalities, 3 (1972).
7) L.A. Liporace, IEEE Trans. Inform. Theory, IT-28, 729 (1982).
8) R. Bellman, Dynamic programming, Princeton University Press, Princeton, 1957.
9) C.J. Wellekens, Proc. ICASSP-86, Tokyo, 1986, p. 25.5.1.
10) H. Bourlard, Y. Kamp, H. Ney and C.J. Wellekens, Speaker-Dependent Connected Speech Recognition via Dynamic Programming and Statistical Methods, in Speech and Speaker Recognition, ed. M.R. Schroeder, Karger, 1985.
11) J.S. Bridle and N.C. Sedgwick, Proc. ICASSP-77, Hartford, CT, 1977, p. 656.
12) T. Sakai, K. Maenobu and Y. Ariki, Information Science, 33, 31 (1984).
13) R. De Mori, P. Laface and Y. Mong, IEEE Trans. on PAMI, 7-1, 56 (1985).
14) H.C. Leung, A procedure for automatic alignment of phonetic transcriptions with continuous speech, S.M. Thesis, MIT, Cambridge, MA, 1985.
15) H. Bourlard and C.J. Wellekens, Proc. EUSIPCO-86, The Hague, p. 511, ed. I.T. Young, J. Biemond, R.P.W. Duin and J.J. Gerbrands, 1986.
16) V.W. Zue, Proc. IEEE, 73-11, 1985, p. 1602.
17) X.L. Aubert, Proc. Eur. Conf. Speech Technology, Edinburgh, 1987, p. 161.
18) L.R. Rabiner, B.H. Juang, S.E. Levinson and M.M. Sondhi, Bell System Technical J., 64, 1211 (1985).
19) H.F. Silverman and N.R. Dixon, Proc. ICASSP-80, Denver, CO, 1980, p. 169.
20) R.K. Moore, M.J. Russell and M.J. Tomlinson, Proc. ICASSP-82, Paris, 1982, p. 1270.
21) M.J. Russell and R.K. Moore, Proc. ICASSP-85, Tampa, FL, 1985, p. 1.2.1.
22) L. Kleinrock, Queueing Systems, Vol. 1: Theory, Wiley, 1975.
23) S.E. Levinson, Computer, Speech and Language, 1, 29 (1986).
24) S. Furui, IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-34, 52 (1986).
25) C.J. Wellekens, Proc. ICASSP-87, Dallas, TX, 1987, p. 10.7.1.
26) H. Bourlard and C.J. Wellekens, Speech Pattern Discrimination and Multilayer Perceptrons, to appear in Computer, Speech and Language, 1988.
27) N.J. Nilsson, Learning Machines, McGraw-Hill, 1965.
28) K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1972.
29) H. Bourlard and C.J. Wellekens, Proc. EUSIPCO-86, The Hague, p. 507, ed. I.T. Young, J. Biemond, R.P.W. Duin and J.J. Gerbrands, 1986.
30) D.E. Rumelhart, G.E. Hinton and R.J. Williams, in Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Vol. 1: Foundations, ed. D.E. Rumelhart and J.L. McClelland, MIT Press, 1986.
31) M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.
32) R.P. Lippmann, IEEE ASSP Magazine, 4 (1987).


33) T.J. Sejnowski and C.R. Rosenberg, Johns Hopkins University Technical Report JHU/EECS-86/01 (also Technical Report MS-CIS-86-78), 1986.
34) H. Bourlard and C.J. Wellekens, Proc. 1st Intern. Conf. Neural Networks, San Diego, p. IV.407 (1987).

Authors

Xavier L. Aubert: Engineer in Applied Mathematics and Doctor in Applied Sciences, University of Louvain, Louvain-la-Neuve (Belgium), 1977 and 1983; Philips Research Laboratory, Brussels, 1985- . His current interests include hearing models and speech recognition.

Hervé Bourlard: Ir. degree (Electrical Engineering), Faculté Polytechnique de Mons, Mons (Belgium), 1982; Philips Research Laboratory, Brussels, 1982- . Initially he was concerned with speech synthesis and signal processing. His current interests include speech recognition and neural network modelling. He is a member of EURASIP.

Y. Kamp: Ir. degree (Electrical and Mechanical Engineering), University of Louvain (Belgium), 1959; Doctoral degree in Applied Sciences, University of Louvain, 1966; Professor of Electrical Engineering, University of Lovanium (Zaïre), 1961-1967; Philips Research Laboratories, Brussels, 1967- . His research interests include speech recognition, fast algorithms for signal processing and stability problems of multidimensional systems. He is a member of SIAM.

Christian J. Wellekens: Ir. degree (Electrical and Mechanical Engineering), University of Louvain, Leuven (Belgium), 1965; Doctor in Technical Sciences, École Polytechnique Fédérale de Lausanne, Lausanne (Switzerland), 1974; MBLE Development (Brussels, Belgium), 1965-1968; Chargé de Cours at École Centrale des Arts et Métiers, Brussels, 1968- ; Philips Research Laboratory, Brussels, 1968- . His main interests are analog and digital circuit theory, signal processing, speech recognition, neural networks and applied mathematics. He is a member of IEEE, of the Société des Ingénieurs de Télécommunication (SITEL), of INNS (International Neural Network Society) and of EURASIP (European Association for Signal Processing).