Hidden Markov Models for Speech Recognition
B. H. Juang; L. R. Rabiner
Technometrics, Vol. 33, No. 3. (Aug., 1991), pp. 251-272.
Stable URL:
http://links.jstor.org/sici?sici=0040-1706%28199108%2933%3A3%3C251%3AHMMFSR%3E2.0.CO%3B2-N
Technometrics is currently published by American Statistical Association.
© 1991 American Statistical Association and the American Society for Quality Control
TECHNOMETRICS, AUGUST 1991, VOL. 33, NO. 3

Hidden Markov Models for Speech Recognition

B. H. Juang and L. R. Rabiner
Speech Research Department
AT&T Bell Laboratories
Murray Hill, NJ 07974
The use of hidden Markov models for speech recognition has become predominant in the last several years, as evidenced by the number of published papers and talks at major speech conferences. The reasons this method has become so popular are the inherent statistical (mathematically precise) framework; the ease and availability of training algorithms for estimating the parameters of the models from finite training sets of speech data; the flexibility of the resulting recognition system, in which one can easily change the size, type, or architecture of the models to suit particular words, sounds, and so forth; and the ease of implementation of the overall recognition system. In this expository article, we address the role of statistical methods in this powerful technology as applied to speech recognition and discuss a range of theoretical and practical issues that are as yet unsolved in terms of their importance and their effect on performance for different system implementations.

KEY WORDS: Baum-Welch algorithm; Incomplete data problem; Maximum a posteriori decoding; Maximum likelihood.
Speech recognition by machine has come of age in a practical sense. Numerous speech-recognition systems are currently in operation in applications ranging from a voice dialer for the telephone to a voice response system that quotes stock prices on verbal inquiry. What makes these practical benefits happen is the recent technological advances that enable speech-recognition systems to respond reliably to nonspecific talkers with a reasonably sized recognition vocabulary. One such major advance is the use of statistical methods, of which the hidden Markov model (HMM) is a particularly interesting one.

The use of HMM's for speech recognition has become popular in the past decade. Although the number of reported recognition systems based on HMM's is too large to discuss in detail here, it is worthwhile to point out some of the most important as well as successful of these systems. These include the early work of the Dragon System at Carnegie Mellon University (Baker 1975), the longstanding effort of IBM on a voice-dictation system (Averbuch et al. 1987; Bahl, Jelinek, and Mercer 1983; Jelinek 1976), the work at AT&T Bell Laboratories, Institute for Defense Analyses, MIT Lincoln Labs, and Philips on whole-word recognition using HMM's (Bourlard, Kamp, Ney, and Wellekens 1985; Lee, Soong, and Juang 1988; Lippman, Martin, and Paul 1987; Poritz and Richter 1986; Rabiner, Juang, Levinson, and Sondhi 1986; Rabiner, Levinson, and Sondhi 1983; Rabiner, Wilpon, and Soong 1989), the DARPA Resource Management task (Chow et al. 1987; Lee 1989), and other related efforts (Derouault 1987; Gupta, Lennig, and Mermelstein 1987). The widespread popularity of the HMM framework can be attributed to its simple algorithmic structure, which is straightforward to implement, and to its clear performance superiority over alternative recognition structures.

Performance, particularly in terms of accuracy, is a critical factor in determining the practical value of a speech-recognition system. A speech-recognition task is often taxonomized according to its requirements in handling specific or nonspecific talkers (speaker-dependent vs. speaker-independent) and in accepting only isolated utterances or fluent speech (isolated word vs. connected word). At present, the state-of-the-art technology can easily achieve almost perfect accuracy in speaker-independent isolated-digit recognition and would commit only 2-3% digit-string errors when the digit sequence is spoken in a naturally connected manner by nonspecific talkers. Furthermore, in speaker-independent continuous speech environments with a 1,000-word vocabulary and certain grammatical constraints, several advanced systems based on HMM have been demonstrated to be able to achieve 96% word accuracy. These results sometimes rival human performance and thus, of course, affirm the potential usefulness of an automatic speech-recognition system in designated applications.
Although hidden Markov modeling has significantly improved the performance of current
speech-recognition systems, the general problem of completely fluent, speaker-independent speech recognition is still far from being solved. For example, there is no system that is capable of reliably recognizing unconstrained conversational speech, nor does there exist a good way to infer statistically the language structure from a limited corpus of spoken sentences. The purpose of this expository article is, therefore, to provide an overview of the theory of HMM, discuss the role of statistical methods, and point out a range of theoretical and practical issues that deserve attention and are necessary to understand so as to further advance research in the field of speech recognition.
1. MEASUREMENTS AND MODELING OF SPEECH
Speech is a nonstationary signal. When we speak, our articulatory apparatus (the lips, jaw, tongue, and velum, as shown in Fig. 1) modulates the air pressure and flow to produce an audible sequence of sounds. Although the spectral content of any particular sound may include frequencies up to several thousand hertz, our articulatory configuration (vocal-tract shape, tongue movement, etc.) often does not undergo dramatic changes more than 10 times per second. Speech modeling thus involves two aspects: (1) analysis of the short-time spectral properties of individual sounds, performed at an interval on the order of 10 milliseconds (msec), and (2) characterization of the long-time development of sound sequences, on the order of 100 msec, due to articulatory configuration changes.
To see how speech may be viewed as a nonstationary signal, we show in Figure 2 a short segment (approximately 450 msec long) of the speech waveform corresponding to a recorded utterance of the word judge. Note that digital processing of a speech signal requires discrete time sampling and quantization of the waveform. Typically, an analog speech signal is sampled at a rate of 8-20 kilohertz (kHz), and the amplitude of each waveform sample is usually represented by one of 2^16 = 65,536 values, that is, 16-bit quantization of the discrete time signal.
Short-time spectral properties of the digital speech signal are analyzed by successively placing a window over the sampled waveform as illustrated in Figure 2. The window generally has the property that it tapers toward 0 at the ends so as to minimize the discontinuity of the signal outside the window. A short-time spectral window has a typical analysis width of 10-50 msec, and successive windows are normally positioned 10-30 msec apart. A spectral-analysis method is then applied to the windowed signal to produce a parsimonious representation of the spectral properties of the speech waveform within the window. Many spectral-analysis methods have been proposed for speech-signal modeling. These include such standard methods as measurement of the discrete (fast) Fourier transform (FFT), all-pole minimum-phase linear prediction (LPC) methods, and autoregressive/moving average models (Allen and Rabiner 1977; Atal and Hanauer 1971; Cadzow 1982; Makhoul 1975; Markel and Gray 1976; Schafer and Rabiner 1971). Even the more traditional filter-bank method of spectral analysis is still used in some systems (Dautrich, Rabiner, and Martin 1983), particularly in hardware implementations. To emphasize spectral properties that are known to be important to a human listener, auditory models can be incorporated in the overall spectral representation (Cohen 1985; Ghitza 1986). In speech modeling, we often call this short-time spectral vector an observation vector or simply an observation. In Figure 2, where the analysis mechanism is illustrated, we use a 30-msec Hamming window (frame) with successive spectral frames spaced 15 msec apart. FFT spectra of the first four frames of the signal are plotted in the figure, each being fitted with a 10th-order LPC all-pole smoothed model spectrum.
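The windowing step described above can be sketched in a few lines of code. The following Python fragment is an illustration only (the synthetic sinusoid, sampling rate, and function name are our own assumptions, not part of the article); it uses the 30-msec Hamming window with 15-msec steps cited in the text:

```python
import math

def frame_signal(x, fs, frame_ms=30.0, step_ms=15.0):
    """Slice a sampled waveform into overlapping Hamming-windowed frames."""
    n_win = int(fs * frame_ms / 1000.0)    # samples per analysis window
    n_step = int(fs * step_ms / 1000.0)    # samples between window starts
    # The Hamming window tapers toward 0 at both ends, reducing the
    # discontinuity at the window edges, as described in the text.
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_win - 1))
         for n in range(n_win)]
    return [[x[s + n] * w[n] for n in range(n_win)]
            for s in range(0, len(x) - n_win + 1, n_step)]

# 450 msec of a synthetic 8-kHz sinusoid stands in for real speech here.
fs = 8000
x = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(int(0.45 * fs))]
frames = frame_signal(x, fs)    # 29 frames of 240 samples each
```

Each frame would then be handed to one of the spectral-analysis methods above (FFT, LPC, etc.) to produce an observation vector.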
To see the development of sound sequences on a relatively long-time basis, we show in Figure 3 a speech waveform corresponding to a sentence, "My cap is off for the judge" (approximately two seconds long), together with a spectrogram plot of the signal. A spectrogram is a plot of successive spectra in which the horizontal and vertical coordinates are time and frequency, respectively, and the darkness at each time-frequency point represents the corresponding spectral magnitude.
Figure 1. Schematic Description of the Human Vocal System.
Figure 2. A Segment of Speech Waveform Corresponding to the Word Judge and the Resulting Short-Time Spectral Analysis of the First Four Frames. (Panels show the windowed time waveform and the log magnitude FFT/LPC spectra versus frequency.)
It has been found (Juang, Rabiner, and Wilpon 1987) that spectral vectors, as represented by the so-called cepstrum, which is defined as the Fourier transform of the log magnitude spectrum, particularly the log magnitude LPC-model spectrum, have several advantages in statistical modeling for speech recognition. Computationally, the cepstrum of a stable all-pole system can be found recursively. Let the polynomial P(z) = 1 + p_1 z^{-1} + p_2 z^{-2} + ... + p_K z^{-K} have all of its roots inside the unit circle. The LPC (smoothed) spectrum of a frame of speech has the form σ/P(z), in terms of the z transform, where σ is the gain term and K is typically on the order of 10-16. Since ln P(z^{-1}) is analytic inside the unit circle, it can be represented in a Taylor series, leading to the Laurent expansion

ln [σ / P(z)] = Σ_{k=0}^{∞} c(k) z^{-k},  with c(0) = ln σ.
The coefficients c(k) defined previously are often called the LPC cepstrum. Figure 4 shows histograms of the first 12 LPC-cepstral coefficients (derived from 10th-order all-pole models) obtained from a speech data base of 70 sentences spoken by seven people. (The speech material in the data base includes many different sentences and spans a wide range of speech sounds. The bandwidth of the speech signal was limited to 4 kHz, and a sampling rate of 8 kHz was used.)
Figure 3. Speech Amplitude and Resulting Spectrogram for the Sentence, My Cap Is Off for the Judge.

Figure 4. Histograms of the First 12 LPC Cepstral Coefficients From a Seven-Speaker, 70-Sentence Speech Data Base.

For speech recognition, a weighting is generally applied to the LPC cepstrum before further processing (Juang et al. 1987). A vector, such as the cepstrum, that represents a short-time speech spectrum is considered an observation of speech.
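The recursive computation of the LPC cepstrum mentioned above can be sketched as follows. This is a minimal Python illustration under the minimum-phase all-pole convention P(z) = 1 + p_1 z^{-1} + ... + p_K z^{-K}; the function name and the toy coefficient are our own, not the article's:

```python
def lpc_cepstrum(p, n_ceps):
    """Cepstral coefficients c(1)..c(n_ceps) of the all-pole model
    sigma / P(z), P(z) = 1 + p[0] z^-1 + ... + p[K-1] z^-K with all
    roots inside the unit circle.  c(0) = ln(sigma) is omitted.
    Recursion: c(n) = -p_n - sum_{k} (k/n) c(k) p_{n-k}."""
    K = len(p)
    c = [0.0] * (n_ceps + 1)            # c[0] unused here
    for n in range(1, n_ceps + 1):
        acc = -p[n - 1] if n <= K else 0.0
        for k in range(max(1, n - K), n):
            acc -= (k / n) * c[k] * p[n - k - 1]
        c[n] = acc
    return c[1:]

# Single-pole sanity case: for P(z) = 1 + 0.5 z^-1,
# c(k) = (-1)^k (0.5)^k / k, so c(1) = -0.5, c(2) = 0.125, c(3) = -1/24.
cep = lpc_cepstrum([0.5], 3)
```

The single-pole case can be checked against the closed form ln(1/(1 + p z^{-1})) = Σ_k (-1)^k p^k z^{-k} / k, which the recursion reproduces.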
Another type of speech observation that is often used is a discrete-symbol representation of the spectral vector of each frame that results from a classification procedure called spectral labeling. The discrete symbol is obtained by choosing one out of a finite collection of several hundred spectral prototypes. The chosen spectral prototype is the one that is closest (in some well-defined spectral sense) to the input speech spectrum. Statistical modeling is performed on the index sequence of the closest spectral prototypes. The concept of observation distribution is very different in this case of discrete symbols from that of the continuous distribution of parameters that define a spectrum.
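A minimal sketch of spectral labeling, assuming squared Euclidean distance as the "well-defined spectral sense" (the article does not fix a particular distance, and the two-dimensional toy codebook below is purely illustrative):

```python
def label_frames(observations, codebook):
    """Map each spectral vector to the index of its nearest prototype."""
    def d2(u, v):
        # squared Euclidean distance between two spectral vectors
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda j: d2(o, codebook[j]))
            for o in observations]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy 3-symbol codebook
obs = [[0.1, -0.2], [0.9, 0.1], [0.2, 0.8]]
symbols = label_frames(obs, codebook)              # discrete-symbol sequence
```

The resulting index sequence is what a discrete-observation HMM would model; in practice the codebook is designed by a clustering procedure over training spectra.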
On a longer time basis, there are many ways to characterize the sequence of sounds, that is, running speech, as represented by a sequence of spectral observations. The most direct way is to register the spectral sequence directly without further modeling. If we denote the spectral vector at time t by O_t and the observed spectral sequence corresponding to the sequence of speech sounds lasts from t = 1 to t = T, a direct spectral sequence representation is then simply {O_t}_{t=1}^{T} = (O_1, O_2, ..., O_T). Alternatively, one can model the sequence of spectra in terms of a Markov chain that describes the way one sound changes to another. Some perspectives as to how these two seemingly different methodologies relate to each other were given by Juang (1984) and Bridle (1984). In this article, we only discuss the latter case in which an explicit probabilistic structure is imposed on the sound-sequence representation.
HMM Formulation
Consider a first-order N-state Markov chain as illustrated for N = 3 in Figure 5. The system can be described as being in one of the N distinct states, 1, 2, ..., N, at any discrete time instant t. We use the state variable q_t as the state of the system at discrete time t. The Markov chain is then described by a state transition probability matrix A = [a_ij], where

a_ij = Pr(q_t = j | q_{t-1} = i),  (1)

with the following axiomatic constraints:

a_ij ≥ 0  (2)

and

Σ_{j=1}^{N} a_ij = 1 for all i.  (3)
Note that in (1) we have assumed homogeneity of the Markov chain so that the transition probabilities do not depend on time. Assume that at t = 0 the state of the system q_0 is specified by an initial state probability π_i = Pr(q_0 = i). Then, for any state sequence q = (q_0, q_1, q_2, ..., q_T), the probability of q being generated by the Markov chain is

Pr(q | π, A) = π_{q_0} a_{q_0 q_1} a_{q_1 q_2} ... a_{q_{T-1} q_T}.  (4)
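The chain probability (4) is a direct product of table lookups. A short Python sketch (the three-state toy chain below is our own illustration, not taken from the article):

```python
def chain_probability(q, pi, A):
    """Pr(q | pi, A) as in (4): the initial-state probability times the
    product of one-step transition probabilities along the sequence."""
    prob = pi[q[0]]
    for t in range(1, len(q)):
        prob *= A[q[t - 1]][q[t]]
    return prob

pi = [1.0, 0.0, 0.0]                 # always start in state 0
A = [[0.6, 0.4, 0.0],                # rows are nonnegative and sum to 1,
     [0.0, 0.7, 0.3],                # satisfying constraints (2) and (3)
     [0.0, 0.0, 1.0]]
p = chain_probability([0, 0, 1, 2], pi, A)   # 1.0 * 0.6 * 0.4 * 0.3
```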
Suppose now that the state sequence q cannot be readily observed. Instead, we envision each observation O_t, say a cepstral vector as mentioned previously, as being produced with the system in state q_t, q_t ∈ {1, 2, ..., N}. We assume that the production of O_t in each possible state i (i = 1, 2, ..., N) is stochastic and is characterized by a set of observation probability measures B = {b_i(O_t)}_{i=1}^{N},

Figure 5. A First-Order Three-State Markov Chain With Associated Processes.
where

b_i(O_t) = Pr(O_t | q_t = i).  (5)
If the state sequence q that led to the observation sequence O = (O_1, O_2, ..., O_T) is known, the probability of O being generated by the system is assumed to be

Pr(O | q, B) = b_{q_1}(O_1) b_{q_2}(O_2) ... b_{q_T}(O_T).  (6)

The joint probability of O and q being produced by the system is simply the product of (4) and (6), written as

Pr(O, q | π, A, B) = π_{q_0} ∏_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t).  (7)
It then follows that the stochastic process, represented by the observation sequence O, is characterized by

Pr(O | π, A, B) = Σ_q Pr(O, q | π, A, B),  (8)

which describes the probability of O being produced by the system without assuming knowledge of the state sequence in which it was generated. The triple λ = (π, A, B) thus defines an HMM through (8). In the following, we shall refer to λ as the model and the model parameter set interchangeably without ambiguity.
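For discrete (spectrally labeled) observations, the joint probability (7) and the marginal (8) can be sketched directly. This Python illustration uses the common simplification that the first observation is emitted from the initial state (rather than carrying a separate q_0), and the two-state toy model is our own assumption:

```python
from itertools import product

def joint_prob(O, q, pi, A, B):
    """Pr(O, q | pi, A, B) as in (7): chain probability of q times the
    per-state observation probabilities b_{q_t}(O_t)."""
    p = pi[q[0]] * B[q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[q[t - 1]][q[t]] * B[q[t]][O[t]]
    return p

def marginal_prob(O, pi, A, B):
    """Pr(O | pi, A, B) as in (8): sum the joint over all N^T state
    sequences (feasible only for tiny T; see the forward procedure)."""
    N = len(pi)
    return sum(joint_prob(O, q, pi, A, B)
               for q in product(range(N), repeat=len(O)))

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]        # two discrete symbols per state
p_joint = joint_prob([0, 1], (0, 1), pi, A, B)   # one particular path
p_total = marginal_prob([0, 1], pi, A, B)        # sum over all paths
```

The brute-force sum in `marginal_prob` grows exponentially in T, which is exactly the computational problem the forward procedure of Section 2.1 removes.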
The particular formulation of (8) is quite similar to that of the incomplete data problem in statistics (Dempster, Laird, and Rubin 1977). In terms of the physical process of a speech signal, one interpretation that may be helpful for initial understanding of the problem is that a state represents an abstract speech code (such as a phoneme) embedded in a sequence of spectral observations, and because speech is normally produced in a continuous manner, it is often difficult and sometimes unnecessary to determine how and when a state transition (from one abstract speech code to another) is made. Therefore, in (8) we do not assume explicit, definitive observation of the state sequence q, although the Markovian structure of the state sequence is strictly implied. This is why it is called a "hidden" Markov model.
2. THE STATISTICAL METHOD OF THE HIDDEN MARKOV MODEL

In the development of the HMM methodology, the following problems are of particular interest. First, given the observation sequence O and a model λ, how do we efficiently evaluate the probability of O being produced by the source model λ, that is, Pr(O | λ)? Second, given the observation O, how do we solve the inverse problem of estimating the parameters in λ? Although the probability measure of (8) does not depend explicitly on q, the knowledge of the most likely state sequence q that led to the observation O is desirable in many applications. The third problem then is how to deduce from O the most likely state sequence q in a meaningful manner. According to convention (Ferguson 1980), we call these three problems (1) the evaluation problem, (2) the estimation problem, and (3) the decoding problem. In the following sections, we describe several conventional solutions to these three standard problems.
2.1 The Evaluation Problem

The main concern in the evaluation problem is computational efficiency. Without complexity constraints, one can simply evaluate Pr(O | λ) directly from the definition of (8). Since the summation in (8) involves N^{T+1} possible q sequences, the total computational requirements are on the order of 2T · N^{T+1} operations. The need to compute (8) without the exponential growth of computation, as a function of the sequence length T, is the first challenge for implementation of the HMM technique. Fortunately, using the well-known forward-backward procedure (Baum 1972), this exorbitant computational requirement of the direct summation can be easily alleviated.

A forward induction procedure allows evaluation of the probability Pr(O | λ) to be carried out with only a computational requirement linear in the sequence length T and quadratic in the number of states N. To see how this is done, let us define the forward variable α_t(i) as α_t(i) = Pr(O_1, O_2, ..., O_t, q_t = i | λ), that is, the probability of the partial observation sequence up to time t and state q_t = i at time t. With reference to Figure 6, which shows a trellis structure implementation of the computation of α_t(i), we see that the forward variable can be calculated inductively by

α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}).

The desired result is simply Pr(O | λ) = Σ_{i=1}^{N} α_T(i). This tremendous reduction in computation makes the HMM method attractive and viable for speech-recognizer designs because the evaluation problem can be viewed as one of scoring how well an unknown observation sequence (corresponding to the speech to be recognized) matches a given model (or sequence of models) source, thus providing an efficient mechanism for classification.
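The forward induction above can be sketched as follows for discrete observations (a minimal Python illustration; the two-state toy model is our own assumption, and a practical implementation would also scale or work in log probabilities to avoid underflow on long sequences):

```python
def forward_prob(O, pi, A, B):
    """Pr(O | lambda) by forward induction: O(T * N^2) operations
    rather than a sum over exponentially many state sequences."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]      # alpha_1(i)
    for t in range(1, len(O)):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)                                   # sum_i alpha_T(i)

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward_prob([0, 1], pi, A, B)
```

On this toy model the result agrees exactly with the brute-force sum over all state sequences, while touching each trellis node only once.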
Figure 6. A Trellis Structure for the Calculation of the Forward Partial Probabilities α_t(i). (The figure annotates node (t, 1): α_t(1) = [α_{t-1}(1)a_11 + α_{t-1}(2)a_21 + α_{t-1}(3)a_31 + α_{t-1}(4)a_41] b_1(O_t).)
2.2 The Estimation Problem

Given an observation sequence (or a set of sequences) O, the estimation problem involves finding the right model parameter values that specify a model most likely to produce the given sequence. In speech recognition, this is often called "training," and the given sequence, on the basis of which we obtain the model parameters, is called the training sequence, even though the formulation here is statistical.

In solving the estimation problem, we often follow the method of maximum likelihood (ML); that is, we choose λ such that Pr(O | λ), as defined by (8), is maximized for the given training sequence O. The Baum-Welch algorithm (Baum and Egon 1967; Baum and Petrie 1966; Baum, Petrie, Soules, and Weiss 1970; Baum and Sell 1968) (often blended with the forward-backward algorithm because of its interpretation as an extension of the forward induction procedure to the evaluation problem) cleverly accomplishes this maximization objective in a two-step procedure. Based on an existing model λ' (possibly obtained randomly), the first step transforms the objective function Pr(O | λ) into a new function Q(λ', λ) that essentially measures a divergence between the initial model λ' and an updated model λ. The Q function is defined, for the simplest case, as

Q(λ', λ) = Σ_q Pr(O, q | λ') log Pr(O, q | λ),  (9)

where Pr(O, q | λ) is given in (7). Because Q(λ', λ) ≥ Q(λ', λ') implies Pr(O | λ) ≥ Pr(O | λ'), we can then simply maximize the function Q(λ', λ) over λ to improve λ' in the sense of increasing the likelihood Pr(O | λ). The maximization of the Q function over λ is the second step of the algorithm. The algorithm continues by replacing λ' with λ and repeating the two steps until some stopping criterion is met. The algorithm is of a general hill-climbing type and is only guaranteed to produce fixed-point solutions, although in practice the lack of global optimality does not seem to cause serious problems in recognition performance (Paul 1985). Note that the classical EM algorithm of Dempster et al. (1977) parallels closely the Baum-Welch algorithm. As noted by Baum in the discussion section of Dempster et al. (1977), the incomplete data formulation of Dempster et al. is essentially identical to the HMM formulation without the Markov-chain constraints. The Q function of (9) is clearly an expectation operation, so the two-step algorithm is identical to the E(xpectation)-M(aximization) algorithm.

The ML method is, however, not the only possible choice for solving the estimation problem. As will be discussed later, other alternatives are attractive and offer different modeling perspectives.

2.3 The Decoding Problem

As noted previously, we often are interested in uncovering the most likely state sequence that led to the observation sequence O. Although the probability measure of an HMM, by definition, does not explicitly involve the state sequence, it is important in many applications to have the knowledge of the most likely state sequence for several reasons. As an example, if we use the states of a word model to
represent the distinct sounds in the word, it may be desirable to know the correspondence between the speech segments and the sounds of the word, because the duration of the individual speech segments provides useful information for speech recognition.
As with the second problem, there are several ways to define the decoding objective. The most trivial choice is, following the Bayesian framework, to maximize the (instantaneous) a posteriori probability

γ_t(i) = Pr(q_t = i | O, λ);  (10)

that is, we decode the state at time t by choosing

q̂_t = arg max_{1≤i≤N} γ_t(i).  (11)

It is also possible to extend the definition of (10) to the cases of pairs of states or triples of states and so on. For example, the rule

(q̂_t, q̂_{t+1}) = arg max_{1≤i,j≤N} Pr(q_t = i, q_{t+1} = j | O, λ)  (12)

will produce the maximum a posteriori (MAP) result of the minimum number of incorrectly decoded state pairs, given the observation sequence O.
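The per-frame MAP rule of (10)-(11) can be sketched by combining forward and backward partial probabilities, since γ_t(i) is proportional to α_t(i)β_t(i). A Python illustration for discrete observations (the toy model is our own assumption; real implementations would normalize to avoid underflow):

```python
def posterior_decode(O, pi, A, B):
    """Decode q_t = argmax_i gamma_t(i) independently at each time,
    with gamma_t(i) proportional to alpha_t(i) * beta_t(i)."""
    N, T = len(pi), len(O)
    # forward: alpha[t][i] = Pr(O_1..O_t, q_t = i)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N))
                      * B[j][O[t]] for j in range(N)])
    # backward: beta[t][i] = Pr(O_{t+1}..O_T | q_t = i)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return [max(range(N), key=lambda i: alpha[t][i] * beta[t][i])
            for t in range(T)]

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]    # state 0 favors symbol 0, state 1 symbol 1
states = posterior_decode([0, 0, 1, 1], pi, A, B)
```

Note that, unlike the Viterbi path discussed next, this localized rule can in principle produce a state sequence containing transitions of zero probability.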
Although the preceding discussion shows the flexibility of possible localized decoding, we often choose to work on the entire state sequence q by maximizing Pr(q | O, λ) for three reasons: (1) It is optimal for the unknown observation O in the MAP sense, (2) speech utterances are usually not prohibitively long so as to require locally (rather than globally) optimal decoding, and (3) it is possible to formulate the maximization of Pr(q | O, λ) in a sequential manner to be solved by dynamic programming methods such as the Viterbi algorithm (Forney 1973).

Maximization of Pr(q | O, λ) is equivalent to maximization of Pr(q, O | λ) because Pr(O | λ) is not involved in the optimization process. From (7), we see that

Pr(q, O | λ) = π_{q_0} ∏_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t).  (13)

If we define

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} Pr(q_1, q_2, ..., q_t = i, O_1, O_2, ..., O_t | λ),  (14)

then the following recursion is true:

δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1}).  (15)

The optimal state sequence is thus the one that leads to δ_T(q̂_T) = max_i δ_T(i). This recursion is in a form suitable for the application of the Viterbi algorithm.
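The recursion (14)-(15) with backpointers is the Viterbi algorithm; a minimal Python sketch for discrete observations (the toy model is our own assumption, and log probabilities would be used in practice):

```python
def viterbi(O, pi, A, B):
    """Most likely state sequence argmax_q Pr(q, O | lambda), using the
    recursion (15) with backpointers to recover the optimal path."""
    N, T = len(pi), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]
    back = []
    for t in range(1, T):
        prev = delta
        # best predecessor i for each state j at time t
        bp = [max(range(N), key=lambda i: prev[i] * A[i][j])
              for j in range(N)]
        back.append(bp)
        delta = [prev[bp[j]] * A[bp[j]][j] * B[j][O[t]] for j in range(N)]
    q = [max(range(N), key=lambda i: delta[i])]   # best terminal state
    for bp in reversed(back):                     # trace backpointers
        q.append(bp[q[-1]])
    return list(reversed(q))

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path = viterbi([0, 0, 1, 1], pi, A, B)
```

Like the forward procedure, the cost is linear in T and quadratic in N; only the sum over predecessors is replaced by a maximization.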
2.4 Speech Recognition Using HMM's

The typical use of HMM's in speech recognition is not very different from the traditional pattern-matching paradigm (Duda and Hart 1973). Successful application of HMM methods usually involves the following steps:

1. Define a set of L sound classes for modeling, such as phonemes or words; call the sound classes V = {v_1, v_2, ..., v_L}.
2. For each class, collect a sizable set (the training set) of labeled utterances that are known to be in the class.
3. Based on each training set, solve the estimation problem to obtain a "best" model λ_l for each class (l = 1, 2, ..., L).
4. During recognition, evaluate Pr(O | λ_l) (l = 1, 2, ..., L) for the unknown utterance O and identify the speech that produced O as class v_s if

Pr(O | λ_s) = max_{1≤l≤L} Pr(O | λ_l).  (16)

Since the detailed characteristics of how to implement an HMM recognizer are not essential to this article, we will omit them here. Interested readers should consult Jelinek, Bahl, and Mercer (1975) and Levinson, Rabiner, and Sondhi (1983) for more specifics related to individual applications.
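The classification rule (16) can be sketched directly on top of the forward evaluator. In this Python illustration the class labels, the one-state toy models, and the dictionary layout are all our own assumptions:

```python
def recognize(O, models):
    """Rule (16): score the unknown observation sequence against each
    class model and return the label of the highest-scoring one.
    `models` maps class label -> (pi, A, B) for a discrete HMM."""
    def forward_prob(O, pi, A, B):
        # forward induction, as sketched in Section 2.1
        N = len(pi)
        alpha = [pi[i] * B[i][O[0]] for i in range(N)]
        for t in range(1, len(O)):
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                     for j in range(N)]
        return sum(alpha)
    return max(models, key=lambda label: forward_prob(O, *models[label]))

# Two hypothetical one-state "word" models: one favors symbol 0, one symbol 1.
models = {
    "zero": ([1.0], [[1.0]], [[0.9, 0.1]]),
    "one":  ([1.0], [[1.0]], [[0.1, 0.9]]),
}
guess_a = recognize([0, 0, 0], models)
guess_b = recognize([1, 1, 1], models)
```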
3. STRENGTHS OF THE METHOD OF HIDDEN MARKOV MODELS AS APPLIED TO SPEECH RECOGNITION

The strengths of the HMM method lie in two broad areas: (1) its mathematical framework and (2) its implementational structure. In terms of the mathematical framework, we discuss the method's consistent statistical methodology and the way it provides straightforward solutions to related problems. In terms of the implementational structure, we discuss the inherent flexibility the method provides in dealing with various sophisticated speech-recognition tasks and the ease of implementation, which is one of the crucial considerations in many practical engineering systems.

3.1 The Consistent Statistical Framework of the HMM Methodology

The foundation of the HMM methodology is built on the well-established field of statistics and probability theory. That is to say, the development of the methodology follows a tractable mathematical structure that can be examined and studied analytically.
The basic theoretical strength of the HMM is that it combines modeling of stationary stochastic processes (for the short-time spectra) and the temporal relationship among the processes (via a Markov chain) together in a well-defined probability space. The measure of such a probability space is defined by (8). This combination allows us to study these two separate aspects of modeling a dynamic process (like speech) using one consistent framework.

In addition, this combination of short-time static characterization of the spectrum within a state and the dynamics of change across states is rather elegant because the measure of (8) can be decomposed simply into a summation of the joint probability of O, the observation, and q, the state sequence, as defined by (7). The decomposition permits independent study and analysis of the behavior of the short-time processes and the long-term characteristic transitions. Since decoding and recognition are our main concerns, this also provides an intermediate level of decision that can be used to choose among alternate configurations of the models for the recognition task. This kind of flexibility with consistency is particularly useful for converting a time-varying signal such as speech, without clear anchor points that mark each sound change, into a sequence of (sound) codes.
3.2 The Training Algorithm for HMM's

Another attractive feature of HMM's comes from the fact that it is relatively easy and straightforward to train a model from a given set of labeled training data (one or more sequences of observations). When the ML criterion is chosen as the estimation objective, that is, maximization of Pr(O | λ) over λ, the well-known Baum-Welch algorithm is an iterative hill-climbing procedure that leads to, at least, a fixed-point solution as explained in Section 2.2. If we choose the state-optimized (or decoded) likelihood defined by

L_λ(q̂) = max_q Pr(O, q | λ),  (17)

where

q̂ = arg max_q Pr(O, q | λ),  (18)

as the optimization criterion, the segmental k-means algorithm (Juang and Rabiner 1990; Rabiner, Wilpon, and Juang 1986), which is an extended version of the Viterbi training/segmentation algorithm (Jelinek 1976), can be conveniently used to accomplish the parameter training task.
The segmental k-means algorithm, as can be seen from the objective function of (17), involves two optimization steps, namely the segmentation step and the optimization step. In the segmentation step, we find a state sequence q̄ such that (17) is attained for a given model λ and an observation sequence O. Then, given the state sequence q̄ and the observation O, the optimization step finds a new set of model parameters so as to maximize (17); that is,

λ̄ = arg max_λ {max_q Pr(O, q | λ)}.   (19)

Equation (19) can be rewritten as

λ̄ = arg max_λ {max_q [log Pr(O | q, λ) + log Pr(q | λ)]}.   (20)

Note that max_q [log Pr(O | q, λ) + log Pr(q | λ)] consists of two terms that can be separately optimized, since log Pr(q | λ) is a function only of A, the state-transition probability matrix, and log Pr(O | q, λ) is a function only of B, the family of (intrastate) observation distributions. (We neglect the initial state probability for simplicity of presentation.) This separate optimization is the main distinction between the Baum-Welch algorithm and the segmental k-means algorithm.
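The two steps can be sketched roughly as follows. This is a simplified illustration with toy discrete observations, a log-domain Viterbi segmentation step, and count-based reestimation with a small probability floor; it is not the authors' implementation:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Segmentation step: find the state sequence q maximizing
    Pr(O, q | lambda), working in the log domain for stability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                   # backtrack
        q[t] = psi[t + 1, q[t + 1]]
    return q

def reestimate(obs, q, N, M):
    """Optimization step: given the segmentation q, A and B are
    reestimated separately by normalized counts (with a small floor
    against zero probabilities), mirroring the split in (20)."""
    A = np.full((N, N), 1e-6)
    B = np.full((N, M), 1e-6)
    for t in range(len(q) - 1):
        A[q[t], q[t + 1]] += 1
    for t, s in enumerate(q):
        B[s, obs[t]] += 1
    return A / A.sum(1, keepdims=True), B / B.sum(1, keepdims=True)

# One iteration on toy data (invented model and observations)
pi = np.array([0.5, 0.5])
A0 = np.array([[0.9, 0.1], [0.1, 0.9]])
B0 = np.array([[0.8, 0.2], [0.2, 0.8]])
obs = [0, 0, 0, 1, 1, 1]
q = viterbi(obs, pi, A0, B0)
A1, B1 = reestimate(obs, q, 2, 2)
```

In practice the two steps are alternated until the segmentation stabilizes.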
These two training algorithms (Baum-Welch and segmental k-means) both result in well-formulated and well-behaved solutions. [For a theoretical comparison of the two methods in terms of likelihood differences and state posterior probability deviations, interested readers should consult Merhav and Ephraim (in press).] The segmental k-means algorithm, however, because of the separate optimization of the components of the model parameter set, leads to a more straightforward implementation (simpler, with less computation and fewer numerical difficulties).
The ease of HMM training also extends to the
choice of observation distributions. It is known (Juang
1985; Juang and Rabiner 1985; Liporace 1982) that
these algorithms can accommodate observation den-
sities that are (a) strictly log-concave densities, (b)
elliptically symmetric densities, (c) mixtures of dis-
tributions of the preceding two categories, and (d)
discrete distributions. These choices of observation
distribution in each state of the model allow accurate modeling of virtually unlimited types of data.
3.3 Modeling Flexibility
The flexibility of the basic HMM is manifested in three aspects of the model, namely model topology, observation distributions, and decoding hierarchy.
TECHNOMETRICS, AUGUST 1991, VOL. 33, NO. 3

Figure 7. (a) Left-to-Right Hidden Markov Model; (b) Ergodic Hidden Markov Model.

Many topological structures for HMMs have been studied for speech modeling. For modeling isolated utterances (i.e., whole words or phrases), we often use left-to-right models (Bakis 1976; Rabiner et al. 1983) of the type shown in Figure 7a, since the utterance begins and ends at well-identified time instants (except in the case of very noisy or corrupted speech) and the sequential behavior of the speech is well represented by a sequential HMM.
For other speech-modeling tasks, the use of ergodic models (Levinson 1987) of the type shown in Figure 7b is often more appropriate. The choice of topological configuration and the number of states in the model is generally a reflection of a priori knowledge of the particular speech source to be modeled and is not in any way driven by mathematical tractability or implementation considerations.
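The two topologies can be characterized by the structure of their transition matrices; in the sketch below the uniform row values are placeholders, since an actual model would estimate them from data:

```python
import numpy as np

N = 4  # number of states (arbitrary for this sketch)

# Left-to-right (Bakis-style) topology: only self-loops and forward
# transitions, giving the sequential structure of Figure 7a.
left_to_right = np.triu(np.ones((N, N)))
left_to_right = left_to_right / left_to_right.sum(axis=1, keepdims=True)

# Ergodic topology (Figure 7b): every state reachable from every state.
ergodic = np.full((N, N), 1.0 / N)
```

The zeros below the diagonal in the left-to-right matrix are structural: training never moves a zero transition probability off zero, so the topology is fixed once chosen.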
In Section 3.2, we pointed out that the range of observation distributions that can be accommodated by well-developed training algorithms is rather large. There are no real analytical problems that make the use of any of this rather rich class of distributions impractical. Since speech has been shown to display quite irregular probability distributions (Jayant and Noll 1984; Juang, Rabiner, Levinson, and Sondhi 1985), both in waveform and in spectral parameters, one indeed needs the freedom to choose an appropriate distribution model that fits the observations well and yet is easy to obtain.
In modeling spectral observations, we have found the use of mixture densities (Juang 1985; Juang and Rabiner 1985) beneficial. With f(·) denoting the kernel density function, the mixture density assumes the form

b(O) = Σ_{m=1}^{M} c_m f_m(O),   (21)

where c_m is the mixture component weight, Σ_m c_m = 1, and M is the number of mixture components. This mixture distribution function is used to characterize the distribution of the observations in each state. By varying the number of mixture components, M, it can be shown that it is possible to approximate densities of virtually any shape (unimodal, multimodal, heavy-tailed, etc.).
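A minimal sketch of the mixture density of (21), assuming scalar observations and Gaussian kernels (one common choice of kernel; the weights, means, and variances are invented toy values):

```python
import numpy as np

def mixture_density(o, c, mu, var):
    """b(o) = sum_m c_m f(o; mu_m, var_m), as in (21), with scalar
    Gaussian kernels f for this illustration."""
    f = np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return float(np.sum(c * f))

# Toy two-component mixture (invented parameter values)
c = np.array([0.6, 0.4])     # mixture weights, summing to 1
mu = np.array([0.0, 3.0])    # component means
var = np.array([1.0, 0.5])   # component variances
```

Because the weights sum to 1 and each kernel integrates to 1, b(·) itself integrates to 1; adding components lets b(·) take on multimodal or heavy-tailed shapes.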
With specific constraints, the basic form of the mixture distribution of (21) can be modified to accommodate several other types of distributions, giving rise to the so-called vector quantizer HMM (Rabiner et al. 1983), semicontinuous HMM (Huang and Jack 1989), or continuous HMM (Bahl, Brown, de Souza, and Mercer 1988b; Poritz and Richter 1986; Rabiner et al. 1986).

The choice of observation distributions also extends to the case of the HMM itself. One can form an HMM with each state characterized by another HMM; that is, b_i(O) for each state i = 1, 2, ..., N can assume the form of an HMM probability measure as defined by (8). This principle is the basis of many of the subword unit-based speech-recognition algorithms (Lee, Juang, Soong, and Rabiner 1989; Lee et al. 1988).
3.4 Ease of Implementation
Two areas of concern in the implementation of any algorithm are the potential for numerical difficulties and the computational complexity. The HMM is no exception.

The potential numerical difficulties in implementing HMM systems come from the fact that the terms in the HMM probability measures of (7) and (8) are multiplicative. A direct outcome of the multiplicative chain is the need for excessive dynamic range in numerical values to prevent overflow or underflow problems in digital implementations.
Numerical scaling and interpolation are two reasonable ways of avoiding such numerical problems. The scaling algorithm, well documented by Levinson et al. (1983) and Juang and Rabiner (1985), alleviates the dynamic-range problem by normalizing the partial probabilities, such as the forward variable defined in Section 2.1, at each time instant before they cause overflow or underflow. The scaling algorithm blends naturally into the forward-backward procedure. Normalization alone, however, does not entirely solve the numeric problems that result from insufficient data support, which can cause spurious singularities in the model parameter space. One may resort to parameter smoothing and interpolation to alleviate such numerical singularity problems. A particularly interesting method for dealing with sparse-data problems is the scheme of deleted interpolation proposed by Jelinek and Mercer (1980). For HMM speech recognition, some simple measures, such as setting a numeric floor to prevent singularities, are often found beneficial and are straightforward to implement (Lee, Lin, and Juang 1991; Rabiner et al. 1986).
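The scaling idea can be sketched as follows: normalize the forward variable at each frame and accumulate the log scale factors, whose sum recovers log Pr(O | λ) without ever forming the underflow-prone product. The model values below are toy numbers:

```python
import numpy as np

def scaled_forward(obs, pi, A, B):
    """Forward recursion with per-frame normalization: alpha_t is rescaled
    to sum to 1 at each frame, and log Pr(O | lambda) is recovered as the
    sum of log scale factors, so the multiplicative chain never underflows."""
    alpha = pi * B[:, obs[0]]
    log_like = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()          # scale factor for this frame
        log_like += np.log(c)
        alpha = alpha / c
    return log_like

# Toy two-state model (invented numbers)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
log_p = scaled_forward([0, 1, 0], pi, A, B)
```

For a sequence this short the unscaled forward recursion gives the same Pr(O | λ); the scaling matters when T runs into the hundreds of frames and the raw product would underflow double precision.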
With an understanding of the relationship between the trellis structure (Juang 1984) and the decoding structure of the HMM, as discussed in the evaluation problem, we are able to apply the HMM to many complicated problems without much concern about computational complexity. A simple calculation verifies that a typical off-the-shelf digital signal processor capable of 10 million floating-point operations per second can support a 100-word recognition vocabulary in real time. Even as recognition vocabularies grow to 1,000 words or more, the required processing often remains much the same because of grammatical constraints that limit the average number of words that can follow a given word to somewhere on the order of 100. [This effect is called the average word branching factor, or perplexity, and has been shown to be on the order of 100 for several large-vocabulary recognition tasks (Bahl et al. 1980; Chow et al. 1987).]
4. HIDDEN MARKOV MODEL ISSUES FOR FURTHER CONSIDERATION
The basic theory of hidden Markov modeling has been developed over the last two decades. When applied to speech recognition, however, there are still some remaining issues to be resolved. We begin with a discussion of parameter-estimation criteria as applied to optimal decoding of the observation sequence.
4.1 Parameter-Estimation Criteria
The original HMM parameter estimation was formulated as an inverse problem: Given an observation sequence O and an assumed source model, estimate the (source) parameter set λ which maximizes the probability that O was produced by the source. The ML method, which seeks to maximize Pr(O | λ), is optimal according to this criterion. The Baum-Welch reestimation algorithm, as described previously, is a convenient, straightforwardly implementable solution to the ML HMM estimation problem.
The ML method, however, need not be optimal in terms of minimizing the classification error rate in recognition tasks in which the observation O is said to be produced by one of many (say L) source classes, {C_l}_{l=1}^{L}. This is the classic problem in isolated and connected word-recognition tasks. To achieve the minimum classification error rate, the classical Bayes rule (Duda and Hart 1973) requires that

C*(O) = C_i   if   C_i = arg max_{C_l} Pr(C_l | O),   (22)

where O is the unknown observation to be classified into (recognized as) one of the L classes, C*(·) denotes the decoded class of O, and Pr(C_l | O) is the (true) a posteriori probability of C_l given the observation O. The decision rule of (22) is the well-known MAP decoder. It is often written as

C*(O) = C_i   if   C_i = arg max_{C_l} Pr(O | C_l) Pr(C_l)   (23)

in terms of the class prior Pr(C_l) and the conditional probability Pr(O | C_l). It is clear that the difficulty in minimizing the misclassification rate stems from the fact that both the prior distribution Pr(C_l) and the conditional distribution Pr(O | C_l) are generally unknown and have to be estimated from a given, finite training set. There are practical reasons why this difficulty is hard to overcome.
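A minimal sketch of the MAP rule of (22)-(23), with toy likelihoods and priors (invented values), shows how the prior can override the likelihood:

```python
import numpy as np

def map_decode(likelihoods, priors):
    """MAP rule of (22)-(23): choose the class maximizing
    Pr(O | C_j) * Pr(C_j), which is proportional to Pr(C_j | O)."""
    scores = np.asarray(likelihoods) * np.asarray(priors)
    return int(np.argmax(scores))

# Toy numbers: class 1 has the larger likelihood, but the prior
# tips the MAP decision back to class 0.
cls = map_decode([0.2, 0.3], [0.8, 0.2])
```

In a recognizer, each likelihood Pr(O | C_j) would itself come from an HMM evaluation; only the final comparison is shown here.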
For instance, to obtain reliable estimates of Pr(C_l) and Pr(O | C_l), we generally need a training set of sufficient size (i.e., large enough to adequately sample all relevant class interactions). When the vocabulary is large, this is difficult if not impossible to achieve. For example, suppose the vocabulary has 10,000 words, each of which represents a class. If we assume that 10 occurrences each, on average, are needed for reliable estimates of both Pr(C_l) and Pr(O | C_l), this amounts to a total of 10,000 × 10 = 10^5 word utterances. (This is equivalent to about 14 hours of speech, assuming two words per second.) Therefore, some other strategy is required to estimate Pr(C_l) and Pr(O | C_l) reliably from a smaller training set. One possibility is to choose the set of classes to represent, instead of words, a reduced set of subword units, that is, phonemes. This greatly reduces the requirements on the amount of training data, since the number of subword classes essentially does not grow with the size of the vocabulary. A consequence of using subword unit classes to represent the basic set of speech sounds is that an estimate of the class probability, Pr(C_l), can be obtained directly from a lexical description of the words, independent of the spoken training set. Furthermore, we can estimate Pr(O | C_l) from each of the subword units in the lexical entries for the words.
This type of decomposition, namely breaking large sound classes like words and phrases into smaller ones like subword units, leads to one particular problem in HMM speech recognition: Given an independent (and possibly incorrect) estimate of the word probability, Pr*(C_i), how do we estimate Pr*(O | C_i) such that the Bayes minimum error rate is attained? (Although we have used the example of decomposing words into subword classes, the same concept applies at higher levels, that is, decomposing phrases into words.) Note that in this discussion the association between the training data and the (subword) class is assumed to be known a priori (often as a result of hand labeling). We call this case the complete-label case. Typical examples of complete-label systems include most isolated-word and connected-word tasks and the case of hand-segmented and hand-labeled continuous speech recognition. This case is illustrated in Figure 8a, which shows the speech waveform and energy contour for a sequence of isolated digits spoken in a stationary background. It is relatively easy to define (roughly) the points in time at which each spoken digit begins and ends.
The decomposition described previously (namely, from text to words or subword units), although circumventing some of the training problems, leads to other problems, for example, the need for a detailed segmentation and exact labeling of the speech. For large-vocabulary continuous speech recognition, in which the training data are extensive, this labor-intensive task cannot realistically be accomplished. Instead, we often have to rely on only partial knowledge of the data. For example, we generally know (or assume we know) the phoneme sequence of the words in the string, but not the direct correspondence between each phoneme and a segment of speech. We call this case the incomplete-label case; typical examples are the problems of estimating models of subword speech units from continuous speech and those of words in connected-word tasks without prior word segmentation. An illustration of this case is given in Figure 8b, which shows the speech waveform and an energy contour for the sentence "How many ships are there?"; the boundaries for either the individual words or the phonemes making up the words are not shown, nor are they known precisely.
Another problem with the estimation procedure arises when the distribution, particularly the conditional probability Pr(O | C_i), is postulated to be the same as Pr(O | λ), the HMM to be estimated. Since we generally choose the form of the observation distribution before we have any solid knowledge of the characteristics of the source in each HMM state, there is the risk of a serious mismatch between the chosen observation distribution model and the actual data source. A similar mismatch potential exists in the Markovian structure of the model; that is, speech signals need not be Markovian. We therefore need to provide a mechanism for compensating for such potential errors in modeling. We call this problem the model-mismatch case.

Figure 8. (a) The Waveform and Energy Contour of Three Digits Spoken in Isolation; (b) The Waveform and Energy Contour of a Naturally Spoken Sentence, "How Many Ships Are There?"
Note that if the chosen class model (the conditional as well as the a priori probability) is indeed the correct one, the ML method will lead to the asymptotically best recognition performance (in terms of highest correct classification rate) while allowing classes to be added to or removed from the system specification without the need for complete retraining of all class-reference models. These assumptions, however, are rarely true in practice. An alternative to the preceding distribution-estimation ideas, called corrective training, tries to minimize the recognition error rate directly by identifying, during training, sources that lead to recognition errors (or near errors) and applying some type of correction rule to the parameter estimation to reduce the probability of these errors or near misses. Unlike the preceding three cases, corrective training uses the HMM as a form of discriminant function and is concerned not with modeling accuracy per se but with minimizing the number of recognition errors that occur during training. A more complete discussion of these cases follows.
The Complete-Label Case. In the complete-label case, the association between the training data O and the class C_i is precisely known a priori during training (e.g., as shown in Fig. 8a). This is a typical situation in classical pattern-classification theory, and all the concerns about supervised learning (Duda and Hart 1973) apply. What is unique in the current speech-recognition problem, however, is the use of a prescribed class prior model Pr(C_i).

A detailed account of the issues involved in the complete-label case was given by Nadas, Nahamoo, and Picheny (1988). Consider the set of L HMM class models, Λ = {λ_i}_{i=1}^{L}, which are used to model the classes {C_i}_{i=1}^{L}, respectively. The complete set of models, Λ, defines a probability measure

Pr_Λ(O, C_i) = Pr(O | λ_i) Pr(C_i).   (24)

The notation of (24) is slightly different from that used earlier because we are explicitly using both classes and associated models simultaneously for classification purposes. Assume that the prior Pr(C_i) = Pr_0(C_i) is given or obtained independent of the spoken training set {O^(i)}, where O^(i) is the portion of the training data labeled as C_i. For this case, Nadas et al. (1988) proposed the use of a slightly different training measure, namely the conditional maximum likelihood estimator
(CMLE), obtained by

Λ_CMLE = arg max_Λ Π_i Pr_Λ(C_i | O^(i)).   (25)

The motivation for choosing this estimator is that the assumed model, Pr_Λ(O^(i) | C_i) or Pr(O^(i) | λ_i), may be inappropriate for the true Pr(O^(i) | C_i), so that the asymptotic optimality of the standard ML estimator is no longer guaranteed. Furthermore, the prior model Pr_0(C_i) could be incorrect (a poor estimate, etc.), so that the true MAP decoder of (22) is virtually impossible to implement.

When the class prior Pr_0(C_i) is obtained independent of {O^(i)} and is not part of the model Λ to be optimized, then

Λ_CMLE = arg max_Λ Π_i Pr_Λ(C_i | O^(i))
       = arg max_Λ Σ_i log [Pr_Λ(O^(i), C_i) / Pr_Λ(O^(i))],   (26)

which is the well-known maximum mutual information (MMI) estimator. [Note that the second equality comes from the fact that the logarithm is monotonic and the additive term log Pr_0(C_i) does not affect the maximization result.]
The effect of conditional ML estimation in terms of class-prior robustness, that is, an uncertain or incorrect Pr_0(C_i), can best be illustrated by the following example from Nadas et al. (1988). Suppose that in the training data there are N_i occurrences of the class C_i and that, among these occurrences, the (discrete) observation O occurs jointly with class C_i a total of N_{i,O} times. Then the ML estimates of the prior and the conditional probabilities are, respectively,

Pr_ML(C_i) = N_i / N,  where N = Σ_j N_j,   (27)

and

Pr_ML(O | C_i) = N_{i,O} / N_i.   (28)

The decoder that uses these estimates then decides that an observation O belongs to class C_i if

(N_{i,O} / N_i)(N_i / N) = max_j (N_{j,O} / N_j)(N_j / N),   (29)

which is simply

N_{i,O} = max_j N_{j,O}.   (30)

This is optimal when the total number of observations approaches infinity. When the prior Pr(C_i) is prescribed as Pr_0(C_i) rather than Pr_ML(C_i), however,
the dependence of the decision rule on the a priori probability becomes obvious if we continue to use Pr_ML(O | C_i); that is, (29) becomes

Pr_0(C_i) (N_{i,O} / N_i) = max_j Pr_0(C_j) (N_{j,O} / N_j).   (31)

The CMLE criterion of (25), on the other hand, leads to a set of equations

Pr_CMLE(O | C_i) ∝ N_{i,O} / [N_O Pr_0(C_i)],   (32)

where N_O = Σ_j N_{j,O}. This is obtained by the variational method, defining a Lagrangian for the maximization with multipliers μ_i (i = 1, 2, ..., L). Plugging (32) into the MAP rule of (22), we decide that an unknown O is from class C_i if

Pr_CMLE(O | C_i) Pr_0(C_i) = max_j Pr_CMLE(O | C_j) Pr_0(C_j),   (33)

which can be reduced to

N_{i,O} = max_j N_{j,O},   (34)

which is independent of the prior Pr_0(C_i). In this sense, it was concluded that if Pr_0(C_i) is not the true prior (because of bad assumptions or estimation errors), the MLE will implement a suboptimal decoder (31), whereas the CMLE of (25) will lead to the correct MAP decoding result (asymptotically) because of the compensation built into the estimate of Pr(O | C_i).
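The count-based example can be checked directly. The sketch below (with invented counts) shows that the plug-in ML decoder of (31) shifts with the prescribed prior, whereas the CMLE-style decision of (30) and (34) depends only on the joint counts:

```python
import numpy as np

# Invented counts: N[j] = occurrences of class j in training;
# N_jo[j] = joint occurrences of class j with one discrete observation o.
N = np.array([30.0, 10.0])
N_jo = np.array([6.0, 5.0])

def mle_decoder(prior):
    """Plug-in decoder using the ML estimate Pr_ML(o | C_j) = N_jo/N_j,
    as in (31): the decision moves with whatever prior is prescribed."""
    return int(np.argmax(prior * N_jo / N))

def cmle_decoder():
    """With the CMLE-compensated estimate of (32), the prior cancels and
    the decision reduces to arg max_j N_jo, as in (30) and (34)."""
    return int(np.argmax(N_jo))
```

With the true ML prior (0.75, 0.25) both decoders agree; substituting a wrong prior flips the ML plug-in decision but leaves the CMLE decision unchanged.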
Although the CML or MMI criterion (for training) is attractive in terms of the compensation problems associated with MAP decoding, some important concerns remain unanswered. The most immediate concern in using this criterion is the lack of a convenient and robust algorithm for obtaining the estimate Pr_CMLE(O | C_i). In many practical situations, the procedure for obtaining the solution may be far more complicated than (32) would imply, particularly when HMMs are involved and the observation distribution is not of a discrete type. Previous attempts (Bahl et al. 1986; Brown 1987) at using the MMI criterion have not produced an estimation procedure that is guaranteed to converge to an optimal solution either. Moreover, even though the preceding example demonstrates the robustness of CMLE against errors in the class prior, it is still not clear whether CMLE is more robust than MLE when the form of the model (i.e., the HMM) is incorrect for the speech source. Another problem is that the CMLE has a larger variance than the MLE, which undermines the potential gain in offsetting the sensitivity to an inaccurate Pr_0(C_i) when a decoder based on finite training data is used on test data not included in the training set. In the case of insufficient training data (as is almost always the situation), there are other problems with practical implementations of the procedure.
The Incomplete-Label Case. The case of incomplete labeling arises because of (a) practical difficulties in labeling and segmenting any large continuous-speech data base and (b) the inherent ambiguity among different sound classes in terms of both class definition (i.e., inherent similarities among sound classes) and time uncertainty as realized in speech signals (i.e., it is not clear that exact segmentation boundaries between adjacent sounds universally exist; see Fig. 8b for an example). For the case of decomposing an isolated word into a prescribed phoneme sequence, we usually have a lexical description of the word in terms of the phoneme sequence (as described in a dictionary) and the spoken version of the word without any explicit time segmentation into the corresponding phonemes. Under these conditions (which are typical for speech recognition), training of the prescribed subword unit models is rather difficult because of the lack of a definitive labeling relating subword classes to specific intervals of speech. After all, if we do not know for sure that a training token is in sound class C_i, the likelihood function Pr(O | C_i) cannot be defined, let alone optimized.
There are several ways to handle the problem of incomplete labeling based on the idea of embedded decoding. One way is to retain the constraints of the known class sequence (in the previous example, the phoneme sequence) and solve for the optimal set of class models sequentially. An alternative is to solve for the models of the sound classes simultaneously with the class decoding.

Consider first the case in which we have partial knowledge; that is, a given training token O is known to correspond to a sequence of u class labels h = h_1, h_2, ..., h_u (as determined from dictionary lookup of the words realized by O), where h_i ∈ {C_l}_{l=1}^{L}. The goal is to obtain the L models Λ = {λ_l}_{l=1}^{L} corresponding to the L sound classes C = {C_l}_{l=1}^{L}, using the number of segments u and the class labels h as hard constraints in the decoding. Figure 9 illustrates the decomposition of a word into u = 4, 5, or 6 segments; we see the varying segmentations associated with different numbers of base units within the
word. The procedure begins by assuming a uniform segmentation X = (x_1, x_2, ..., x_u) of O into u speech intervals, with the ith interval, x_i, corresponding to sound class h_i. (Note that each interval is a sequence of spectral vectors.) Based on this initial segmentation, the likelihood functions are defined and the individual class models are obtained by ML (via the forward-backward procedure). For example, if h_i = C_l, then the segment x_i is used to define a likelihood function Pr(x_i | C_l) for maximization. As a result, a set of sound-unit models is created. With the new set of unit models, we further refine the segmentation of O into X (again assuming exactly u segments) by optimally decoding O using the Viterbi algorithm. This leads to an improved segmentation of O that can then be used to give a refined set of sound models. This process is iterated until a reasonably stationary segmentation of O into intervals X is obtained.
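The initialization step of this procedure can be sketched as follows; the subsequent model-estimation and Viterbi-resegmentation steps, which would be iterated until the boundaries stabilize, are omitted for brevity:

```python
import numpy as np

def uniform_segments(T, u):
    """Initial uniform segmentation: split T frames into u contiguous
    intervals, interval i standing for sound class h_i. Later iterations
    refit the class models on these intervals and resegment O with the
    Viterbi algorithm, keeping exactly u segments each time."""
    bounds = np.linspace(0, T, u + 1).round().astype(int)
    return list(zip(bounds[:-1], bounds[1:]))

# e.g., a 100-frame token with u = 4 labels from the dictionary lookup
segs = uniform_segments(100, 4)
```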
This constrained-decoding approach to the incomplete-label case has been used in explicit acoustic modeling of phonemic classes (Lee et al. 1989; Lee, Rabiner, Pieraccini, and Wilpon 1990) with good success. There are, however, some theoretical shortcomings to the method. One problem is that the segmentation/decoding results will differ for different numbers of segments u in the given string (see Fig. 9). Thus even the simple expedient of having multiple dictionary definitions for a word can lead to inconsistent segmentation in terms of sound classes. Although the procedure is practical, there appears to be room for theoretical improvements, which in turn may prove beneficial in practical implementations.
An alternative and probably more thorough way of handling the incomplete-label problem is to combine the ideas of segmentation, decoding, and modeling and to solve a large network for both the class models and the segmentations simultaneously, without any prescribed label-sequence constraint. In this approach, the preceding iterative procedure (recursively and alternately improving the data segmentation and the model estimation via the segmental k-means algorithm) is again used. The key difference, in contrast to the constrained-decoding approach discussed previously, is that the label-sequence constraint is no longer retained in each iteration. This allows globally optimal decoding of O given a set of sound-unit models. This advantage, however, comes at the price of giving up the convenient and readily available lexical representation h from the dictionary.
As previously pointed out, the goal of signal modeling is to come up with a parsimonious, consistent representation of the source that displays a certain kind of variation in the observed output. Speech signals are known to have inherent time uncertainty due to variations in speaking style and speaking rate, as well as spectral uncertainty due to coarticulation effects, individual speaker characteristics, and so forth. This justifies the use of an HMM for an individual sound class. The reason is that the incomplete-data problem formulation (Dempster et al. 1977; Rabiner and Juang 1986), on which the fundamental HMM probability measure of (8) is based, is particularly appropriate when explicit knowledge of the exact position (labeling) of a particular sound in the middle of an utterance is lacking. On the other hand, speech is a linear code (Chao 1968) in the sense that decoded symbols come out one after another and no two symbols can appear at the same time. Therefore, the objective of optimizing the quantity defined by (7) will have to be accomplished during the recognition process, and alternate use of the Baum-Welch algorithm and the segmental k-means algorithm, as laid out by Lee et al. (1989), appears to be a reasonable approach to the problems associated with incomplete-label cases.

Figure 9. Decomposition of the Word Air Into Different Numbers of Segment Units.
Model-Mismatch Issues. In speech modeling, we often choose (assume) the form of the model before we actually know enough about the characteristics of the source. When we choose Pr(O | λ) to model Pr(O | C) for class C and perform ML estimation, the optimality in decoding is meaningful only when O is indeed generated by the source λ. When the actual source is inconsistent with λ, we need either to improve Pr(O | λ) based on O or to revise the decoding rule in some way. Here, we discuss the first possibility.

Consider two statistical populations with probability densities f_1 and f_2, respectively. The Kullback-Leibler number, cross entropy, I-divergence, or discrimination information (Good 1963; Hobson and Cheng 1973; Johnson 1979; Kullback 1958), defined by

I(f_1 : f_2) = ∫ f_1(O) log [f_1(O) / f_2(O)] dO,   (35)

is the mean information for discriminating f_1 against f_2. This discrimination information can be used in HMM modeling (Ephraim, Dembo, and Rabiner 1989).
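For discrete distributions the integral in (35) reduces to a sum; a minimal sketch with toy distributions:

```python
import numpy as np

def discrimination_info(f1, f2):
    """Discrete analog of (35): I(f1 : f2) = sum_x f1(x) log(f1(x)/f2(x)),
    the mean information per observation for discriminating f1 against f2.
    Assumes both distributions are strictly positive on the same support."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return float(np.sum(f1 * np.log(f1 / f2)))

d = discrimination_info([0.5, 0.5], [0.9, 0.1])
```

The quantity is zero only when the two distributions coincide and is positive otherwise, which is what makes it usable as the objective in the MDI formulation that follows.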
Let R = (R_1, R_2, ..., R_T) be a sequence of constraints associated with the observation sequence O = (O_1, O_2, ..., O_T) that we attempt to model. For example, R could be the autocorrelation of the actual speech data. Let Ω(R) be the set of all models (probability distributions) that satisfy the constraints R. The minimum discrimination information (MDI) approach to HMM modeling is to find the HMM parameter set λ which minimizes the MDI defined by

v(R, P) = inf_{Q ∈ Ω(R)} I(Q : P)
        = inf_{Q ∈ Ω(R)} ∫ q(O) log [q(O) / p(O)] dO.   (36)

The estimation criterion is thus the minimum over λ of the MDI of (36). Note that in (36) we have assumed that both distributions P and Q have generalized pdfs, denoted by p(· | λ) and q(·), respectively. Clearly, the idea of (minimum) MDI model design is to allow a search over a general set of models (distributions) Ω(R), with the HMM measure Pr(O | λ) as a contrasting reference, so that the form of the chosen model P(· | λ) and what is revealed in the given data O can be better matched under the MDI measure.

MDI modeling involves two steps. First, given an HMM p(· | λ), we find the solution, q, to the MDI problem of (36). Then, with q given, we optimize the parameter values λ such that v(R, P) is minimized.
The MDI modeling criterion carries some information-theoretic justification and can be related to the MMI approach (Ephraim and Rabiner 1988). There remain several unresolved issues with this method, however. The objective of improving the match between the observations O and the model λ is embedded in v(R, P), which is computed over all Q ∈ Ω(R) with respect to the chosen model P(· | λ). Since P(· | λ) is often not a good measure of how well the model matches the data (because of model mismatch), the goodness of the MDI pdf, q, which minimizes v(R, P), is difficult to measure in recognition tasks, not to mention its effects on classification errors. Furthermore, the nice computational structure of the HMM that makes possible the modeling of long training sequences is not preserved in the MDI pdf, q, resulting in an exceptionally high computational complexity. In practice, some approximations are made to reduce the complexity problem (Ephraim et al. 1989). To date, it has not been demonstrated that the MDI model-design procedure can bring about significant improvements in recognition performance (Ephraim et al. 1989). It is also not clear whether MDI (for model correctness) and CMLE/MMIE (for robust decoding) can easily be jointly formulated for an improved recognizer design.
Corrective Training. As explained earlier, the minimum Bayes risk or error rate is the theoretical recognizer performance bound conditioned on exact knowledge of both the prior and the conditional distributions. When both distributions are not known exactly and the classifier needs to be designed based on a finite training set, there are several ways to try to reduce the error rate. One method is based on the theoretical link between discriminant analysis and distribution estimation (Duda and Hart 1973). The idea here is to design a classifier (discriminant function) such that the minimum classification-error rate is attained on the training set. In particular, we wish to design a classifier that uses estimates of Pr(C_i) and Pr(O | C_i) and that achieves a minimum error rate for the training set in the same way a discriminant function is designed. The reason for using the HMM Pr(O | λ_i), as opposed to other discriminant functions, is to exploit the strengths of the HMM, namely, consistency, flexibility, and computational ease, as well as its ability to generalize classifier performance to independent (open) data sets. The generalization capability of HMM's, as discriminant
functions, is somewhat beyond the scope of this article and will not be discussed here. In the following, we focus on the issue of treating the estimation of the distributions Pr(C_i) and Pr(O | C_i) as a discriminant-function design to attain the minimum error rate.
Bahl, Brown, de Souza, and Mercer (1988a) were the first to propose an error-correction strategy, which they named corrective training, to deal specifically with the misclassification problem. Their training algorithm was motivated by analogy with an error-correction training procedure for linear classifiers (Nilsson 1965). In their proposed method, the observation distribution is of a discrete type, B = [b_ij], where b_ij is the probability of observing a vector quantization code index (acoustic label) j when the HMM source is in state i. Each b_ij is obtained via the forward-backward algorithm as the weighted frequency of occurrence of the code index (Rabiner et al. 1983). The corrective training algorithm of Bahl et al. (1988a) works as follows. First, use a labeled training set to estimate the parameters of the HMM's {λ_i} with the forward-backward algorithm. For each utterance O labeled as C_k, for example, evaluate Pr(O | λ_k) for the correct class C_k and Pr(O | λ_j) for each incorrect class C_j. (The evaluation of likelihood for the incorrect classes need not be exhaustive.) For every utterance in which log Pr(O | λ_k) - log Pr(O | λ_j) ≤ δ, where δ is a prescribed threshold, modify λ_k and λ_j according to the following mechanism: (a) apply the forward-backward algorithm to obtain estimates b'_ij and b''_ij, using the labeled utterance O only, for the correct class C_k and incorrect class C_j, respectively, and (b) modify the original b_ij in λ_k to b_ij + γb'_ij and the b_ij in λ_j to b_ij - γb''_ij. When the state labels are tied for certain models, the preceding procedure is equivalent to replacing the original b_ij with b_ij + γ(b'_ij - b''_ij). The prescribed adaptation parameter, γ, controls the rate of convergence, and the threshold, δ, defines the near-miss cases. This corrective training algorithm therefore focuses on those parts of the model that are most important for word discrimination, a clear difference from the ML principle.
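The near-miss update just described can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function name, the clipping floor, and the row renormalization are our assumptions, added so that each state's emission row remains a valid probability distribution after the additive update.

```python
import numpy as np

def corrective_step(B_correct, B_incorrect, B_prime, B_dblprime,
                    loglik_correct, loglik_incorrect, delta, gamma):
    """One corrective-training update on discrete emission matrices B = [b_ij].

    B_prime / B_dblprime are the single-utterance forward-backward estimates
    b'_ij and b''_ij for the correct and the incorrect class, respectively.
    A near miss is declared when the log-likelihood margin is at most delta.
    """
    if loglik_correct - loglik_incorrect > delta:
        return B_correct, B_incorrect          # confident win: no change
    # Push the correct model toward this utterance, the rival away from it.
    B_c = np.clip(B_correct + gamma * B_prime, 1e-12, None)
    B_i = np.clip(B_incorrect - gamma * B_dblprime, 1e-12, None)
    # Renormalize each state's (row's) emission distribution to sum to 1.
    return (B_c / B_c.sum(axis=1, keepdims=True),
            B_i / B_i.sum(axis=1, keepdims=True))
```

Only utterances inside the margin δ change any parameters, which is exactly what concentrates the updates on the near-miss, discrimination-critical parts of the models.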
Although Bahl et al. (1988a) reported that the corrective training procedure worked better (in isolated-word recognition tasks) than models obtained using the MMI or the CMLE criteria, the lack of a rigorous analytical foundation for the algorithm is one problem. Without a better theoretical understanding of the algorithm, the appeal of the method is primarily experimental. Other attempts to design HMM's for minimum error rate or some form of class separation include the work by Ljolje, Ephraim, and Rabiner (1990) and by Sondhi and Roe (1983).
Several other forms of discriminative training were also proposed by Katagiri, Lee, and Juang (1990), who combined the adaptive learning concept in learning vector quantizer design (Kohonen 1986) and the corrective training method described previously, leading to a framework for the analysis of related training/learning ideas. Although this work is still in progress, the key result is that these adaptive learning methods can be formulated as a general-risk minimization procedure based on a probabilistic descent algorithm, ensuring a stochastic convergence result. The generalized risks considered by Katagiri et al. (1990) include the regular classification error, the mean squared error, nonlinear functions (e.g., sigmoid) of the likelihood, and several other general measures. The corrective training algorithm of Bahl et al. (1988a) is just one possible choice for the minimization of a prescribed risk function.
4.2 Integration of Nonspectral and Spectral Features
The use of HMM's in speech recognition has been mostly limited to the modeling of short-time speech spectral information; that is, the observation O typically represents a smoothed representation of the speech spectrum at a given time. The spectral feature vector has proved extremely useful and has led to a wide variety of successful recognizer designs. This success can be attributed both to the range of spectral-analysis techniques developed in the past three decades and to our understanding of the perceptual importance of the speech spectrum to the recognition of sounds. The success of spectral parameters for characterizing speech was further augmented by the introduction of the so-called delta-cepstrum (Furui 1986), which attempts to model the differential speech spectrum.
Besides spectral parameters, there are other speech features that are believed to contribute to the recognition and understanding of speech by humans. One such category of nonspectral speech features is prosody, as it is manifested on both the segmental and the supra-segmental level (Lea 1980). Physical manifestations of prosody in the speech signal include signal energy (suitably normalized), differential energy, and signal pitch (fundamental frequency). There are at least two issues of concern in integrating nonspectral with spectral features in a statistical model of speech: Do such features contribute to the performance of the statistical model for actual recognition tasks, and how should the features be integrated so as to be consistent with their physical manifestations in the production and perception of speech? The first issue is relatively easy to resolve based on experimental results. Several HMM-based recognition systems have incorporated
log energy (and differential log energy) either directly into the feature vector or as an additional feature whose probability (or likelihood) is factored into the likelihood calculation (Bahl et al. 1983; Rabiner 1984; Shikano 1985) with moderate success (i.e., higher recognition performance). The level of performance improvement, however, is considerably smaller than one might anticipate based on the importance of prosody in speech.
The problem with combining spectral and nonspectral features in a statistical framework is one of temporal rate of change. To attain adequate time resolution for the spectral parameters that characterize the varying vocal tract, we need to sample the spectral observations at a rate on the order of 100 times per second (10-msec frame update). The prosodic features of speech characterize stress and intonation, and these occur at a syllabic rate of about 10 times per second. [Of course, we can always oversample the features associated with the prosodic parameters to keep the rate the same as that of the spectral parameters; this in fact is what is currently done in most systems, e.g., Rabiner (1984) and Shikano (1985).] Furthermore, the sequential characteristic change of different features may be too different to warrant a single, unified Markov chain. Thus one key question is: How do we combine two feature sets with fundamentally different time scales and possibly different sequential characteristics so as to be able to perform optimum modeling and decoding?
A second problem in combining fundamentally different (in nature) features concerns their statistical characterization. To be technically correct, we need to know the joint distribution of the two feature sets. For statistically independent (or often just uncorrelated) feature sets, we can represent the joint density as a product of the individual densities. In practice, however, there is usually some correlation between any pair of speech feature sets; hence some correction for the correlation is usually required. One proposed method of handling this problem is to perform a principal-component analysis on the joint feature set before hidden Markov modeling is performed (Bocchieri and Doddington 1986). Although this alleviates the difficulties somewhat, it is not a totally satisfying solution because the resulting feature set usually has no straightforward physical significance. Furthermore, the set of principal components is a function of the data and hence need not be optimal for unseen data (the open-set problem).
4.3 Duration Modeling in HMM's
One of the inherent limitations of the HMM approach is its treatment of temporal duration. Inherently, within a state of an HMM, the probability distribution of state duration is exponential; that is, the probability of staying in state i for exactly d observations is Pr_i(d) = (a_ii)^(d-1) (1 - a_ii). This exponential duration model is inappropriate for almost any speech event.
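The inadequacy is easy to verify numerically. The short sketch below (the function name is ours) computes the implied dwell-time distribution for a typical self-loop probability:

```python
def state_duration_pmf(a_ii, d_max):
    """Implicit HMM dwell-time distribution Pr_i(d) = a_ii^(d-1) * (1 - a_ii),
    truncated at d_max observations."""
    return [(a_ii ** (d - 1)) * (1.0 - a_ii) for d in range(1, d_max + 1)]

pmf = state_duration_pmf(0.9, 100)
# The pmf is strictly decreasing, so the single most likely dwell time is
# d = 1 frame, even though the mean dwell 1 / (1 - a_ii) is 10 frames here;
# real speech events (e.g., a sustained vowel) do not behave this way.
```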
Several alternatives for implementing different state duration models have been proposed. The most straightforward approach is the concept of a semi-Markov chain (Ferguson 1980; Russell and Moore 1985), in which state transitions do not occur at regular time intervals. More formally, we assume that a given state sequence q contains r - 1 state transitions, so that the states visited are q_1, q_2, . . . , q_r with associated state durations d_1, d_2, . . . , d_r (frames). The joint probability of the observation sequence and the state sequence q, given the model λ, becomes

    Pr(O, q | λ) = π_{q_1} Pr_{q_1}(d_1) Pr(O_1, . . . , O_{d_1} | q_1)
                   × ∏_{s=2}^{r} a_{q_{s-1} q_s} Pr_{q_s}(d_s) Pr(O_{t_{s-1}+1}, . . . , O_{t_s} | q_s),   (37)

where t_s = d_1 + · · · + d_s. The probability measure for such a Markov model is, accordingly, Pr(O | λ) = Σ_q Pr(O, q | λ). Based on these definitions, modeling of the source, including the duration model Pr_i(d_i), can be implemented using a hill-climbing reestimation procedure of the type used previously (Levinson 1986). Typically, Pr_i(d_i) is treated as a discrete distribution over the range 1 ≤ d_i ≤ D_max, where D_max represents the longest possible dwell in any state.
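The semi-Markov joint probability can be transcribed directly into code. This sketch assumes the model supplies an initial distribution, a self-loop-free transition matrix, an explicit per-state duration pmf, and a per-segment observation scorer; all names are illustrative.

```python
import math

def hsmm_log_joint(pi, A, dur_pmf, emit_ll, states, durations):
    """Log of Pr(O, q | lambda) for an explicit-duration (semi-)Markov model.

    pi        : initial state probabilities
    A         : transition matrix with no self-loops (dwell comes from dur_pmf)
    dur_pmf   : dur_pmf[i][d] = Pr_i(d), probability of dwelling d frames in state i
    emit_ll   : emit_ll(i, t0, t1) = log-likelihood of frames t0..t1-1 under state i
    states    : visited states q_1, ..., q_r
    durations : dwell times d_1, ..., d_r
    """
    ll = math.log(pi[states[0]])
    t = 0
    for s, (i, d) in enumerate(zip(states, durations)):
        if s > 0:
            ll += math.log(A[states[s - 1]][i])   # transition a_{q_{s-1} q_s}
        ll += math.log(dur_pmf[i][d])             # duration term Pr_{q_s}(d_s)
        ll += emit_ll(i, t, t + d)                # within-segment observation term
        t += d
    return ll
```

Summing exp of this quantity over all segmentations (states, durations) gives Pr(O | λ), which is precisely where the decoding lattice grows by the factor discussed next.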
Although the preceding formulation handles the duration model simultaneously with the Markovian transition and the local (state) distribution models and can lead to analytical solutions, there are drawbacks with such a duration model for speech recognition. One such drawback is the greatly increased computational complexity due to the loss of regularity in transition timing. In the traditional HMM, transitions are allowed to occur at every time frame. In the semi-Markov model, transitions do not occur at regular times [the return transition is part of the duration model Pr_i(d_i)], and this leads to a significantly more complicated lattice for decoding an input string. It was estimated by Rabiner (1989) that the semi-Markov model incurs a factor of 300 increase in computational complexity for a value of D_max = 25. The much increased complication in the decoding lattice often renders many search algorithms, such as the beam search (Lowerre and Reddy 1980) and the stack algorithm (Jelinek 1969) for handling large
problems, extremely difficult to implement. Another problem with the semi-Markov model is the large number of parameters (on the order of D_max) associated with each duration model that must be estimated in addition to the usual HMM parameters. Finally, it is not clear how accurate or how robust the estimated Pr_i(d_i) needs to be in order to be beneficial in speech recognition.
One proposal (Levinson 1986) to alleviate some of these problems is to use a parametric state-duration distribution model instead of the nonparametric ones used previously. Several parametric distributions have been considered, including the Gaussian family (with constraints to avoid negative durations) and the gamma family. Other less involved attempts include modeling the state duration with a uniform distribution, requiring only an estimate of the duration range, with minimum and maximum durations allowed in a particular state. This simple duration model has been applied with good success (Lowerre and Reddy 1980) for some tasks.
The major difficulty in modeling the durational information is that it is much more sparse than spectral information; that is, there is only one duration per state. Hence we either alter the structure of the HMM (e.g., to a semi-Markov model), thereby losing much of the regularity of the original HMM formulation, or we seek alternative implementation structures, as in the case of prosodic information. For durational information, a simple-minded approach is to treat the spectral modeling and the duration modeling as two separate, loosely connected problems. Hence the regular HMM estimation (spectral) is performed on the given observation sequence O. Then the best state sequence q̂, which maximizes Pr(O, q | λ), is found using the Viterbi algorithm. Finally, estimates of Pr_i(d_i) are obtained based on the optimal state sequence q̂, by either the ML method or from simple frequency-of-occurrence counts (Rabiner et al. 1986). Often the duration d_i for state i is normalized by the overall duration T to account for the inherent variation in speaking rate. This approach is usually called the postprocessor duration model because the standard decoding is performed first and the duration information is only available after the initial processing is finished. Although the postprocessor duration model has had some success (Rabiner et al. 1986), the questions of optimality of the estimate, robustness of the solution, and other criteria for the successful use of duration information, especially as applied to speech recognition, remain unanswered.
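The counting step of the postprocessor scheme is simple to sketch from a decoded state path. The helper name and the output format (a list of rate-normalized durations per state, to be histogrammed or fit by ML) are our choices for illustration:

```python
from collections import defaultdict
from itertools import groupby

def duration_counts(state_path):
    """Turn a Viterbi state path into per-state dwell times, each normalized
    by the utterance length T to compensate for speaking-rate variation."""
    T = len(state_path)
    counts = defaultdict(list)
    # groupby collapses each run of identical consecutive states into one dwell
    for state, run in groupby(state_path):
        d = sum(1 for _ in run)
        counts[state].append(d / T)   # normalized duration d_i / T
    return dict(counts)

# A 10-frame path dwelling 4 frames in state 0, then 6 frames in state 1:
hist = duration_counts([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
# → {0: [0.4], 1: [0.6]}
```

Pooling such normalized dwell times over a training set gives the frequency-of-occurrence estimates of Pr_i(d_i) mentioned above.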
4.4 Model Clustering and Splitting
One of the basic assumptions in statistical modeling is that the variability in the observations from an information source can be modeled by statistical distributions. For speech recognition, the source could be a single word, a subword unit like a phoneme, or a word sequence. Because of variability in the source production (e.g., accents, speed of talking) or the source processing (e.g., transmission distortion, noise), it is often expedient to consider using more than a single HMM to characterize the source. There are two motivations behind this multiple-HMM approach. First, lumping all of the variability together from inhomogeneous data sources leads to unnecessarily complex models, often yielding lower modeling accuracy. Second, some of the variability, or rather the inhomogeneity in the source data, may be known a priori, thus warranting separate modeling of the source data sets. Here, our main concern is the first case, that is, automatic modeling of an inhomogeneous source with multiple HMM's, because the latter (manual) case is basically straightforward.
Several generalized clustering algorithms exist, such as the k-means clustering algorithm, the generalized Lloyd algorithm widely used in vector quantizer designs (Linde, Buzo, and Gray 1980), or the greedy growing algorithm found in set-partition or decision-tree designs (Breiman 1984), all of which are suitable for the purpose of separating inconsistent training data so that each divided subgroup becomes more homogeneous and is therefore better modeled by a single HMM. The nearest neighbor rule required in these clustering algorithms is simply to assign an observation sequence O to cluster i if Pr(O | λ_i) = max_j Pr(O | λ_j), where the λ_j's denote the models of the clusters. Successful application of the model-clustering algorithms to the speech-recognition problem, using the straightforward ML criterion, has been reported (Rabiner, Lee, Juang, and Wilpon 1989). When other estimation criteria are used, however, the interaction between multiple HMM modeling and the Bayes minimum-error-classifier design remains an open question in need of further study.
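The nearest neighbor rule reads almost directly as code. In this sketch, log_lik is a stand-in for the HMM forward-pass score log Pr(O | λ_i), and the function name is ours:

```python
def assign_clusters(sequences, models, log_lik):
    """Assign each observation sequence O to the cluster i whose model
    lambda_i gives the largest log Pr(O | lambda_i)."""
    clusters = {i: [] for i in range(len(models))}
    for seq in sequences:
        scores = [log_lik(m, seq) for m in models]
        clusters[scores.index(max(scores))].append(seq)
    return clusters
```

In a full clustering procedure this assignment step alternates with retraining each λ_i on its own cluster, exactly as in k-means.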
An alternative to model clustering is to arbitrarily subdivide a given speech source into a large number of subclasses with specialized characteristics and then consider a generalized procedure for model merging based on source likelihood considerations. By way of example, for large-vocabulary speech recognition we often try to build specialized (context-sensitive) units for recognition. We could consider building units that are a function of the sound immediately preceding the unit (left context) and the sound immediately following the unit (right context). There are about 10,000 such units in English. Many of the units are functionally almost identical. The problem is how to determine which pairs of units should be merged (so that the number of model units is manageable and the variance of the parameter estimate
is reduced). To set ideas, consider two unit models, λ_a and λ_b, corresponding to training observation sets O_a and O_b, and the merged model λ_ab corresponding to the merged observation set {O_a, O_b}. We can then compute the change in entropy (i.e., the loss of information), ΔH_ab, resulting from the merged model (Lee 1989). Whenever ΔH_ab is small enough, the change in entropy resulting from merging the models will not affect system performance (at least on the training set), and the models can be merged. The question of how small is acceptable depends on the specific application. Other practical questions of a similar nature also remain.
4.5 Parameter Significance and Statistical Independence
Although in theory the importance of the state-transition coefficients a_ij is the same as that of the state observation density b_j(O) [in training, a_ij affects the parameter estimate of b_j(·), and vice versa], in the normal practice of speech recognition this is not usually the case. This paradox is due to the discrimination capability of the a_ij's relative to that of the b_j(O)'s. Since a_ij is the relative frequency of state transition from state i to state j, its dynamic range (especially in a left-to-right model) is severely constrained. The b_j(O) densities, nevertheless, often have an almost unlimited dynamic range, particularly when continuous density functions are used, as each of the state densities is highly localized in the acoustic parameter space. When we combine the a_ij's and the b_j(O)'s to give the probability distribution of the HMM, we find in practice that, for a left-to-right model, the a_ij's can be neglected entirely (i.e., set all a_ii = a_{i,i+1} = .5) with no effect on recognition performance.
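A toy calculation makes the dynamic-range imbalance concrete; every number below is invented purely for illustration.

```python
import math

# Transition probabilities are relative frequencies, so their logs span a
# tiny range in a left-to-right model:
a_terms = [0.5, 0.6, 0.4, 0.5]

# A continuous (Gaussian) state density, evaluated at a close spectral match
# and at a mismatch, spans an enormous range:
def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

b_match = gauss(0.0, 0.0, 0.1)       # a density value, and it can exceed 1
b_mismatch = gauss(3.0, 0.0, 0.1)    # astronomically small

log_range_a = math.log(max(a_terms)) - math.log(min(a_terms))   # ~0.4 nats
log_range_b = math.log(b_match) - math.log(b_mismatch)          # ~450 nats
# The b terms dominate the log-likelihood by orders of magnitude, which is
# why setting every a_ii = a_{i,i+1} = 0.5 barely changes recognition scores.
```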
The preceding analysis points out the unbalanced numerical significance of the a's and the b's in the likelihood calculation of the HMM as applied to speech recognition. This, however, should not be taken to totally discredit the usefulness of the Markov-chain contribution in terms of signal modeling. In fact, the transition probability still plays an important role in parameter estimation.
The introduction of semi-Markov models as discussed in Section 3.7 is one way to enhance the significance of the Markov-chain contribution. The probability of a Markovian sequence in (4) is revised to the general product form Pr_{q_1}(d_1) Pr_{q_2}(d_2) · · · as in (37). This more general form of Markov-chain dependence allows the introduction of a singularity in Pr_q(d), thereby increasing its numerical significance under certain conditions. [For example, if we specify Pr_q(d) = 0 for d < d_0, the HMM system has to remain at least d_0 frames in state q before it moves out of that state, regardless of the value of b_q(O_t) during that period of time.] Such a provision for introducing singularities could have a major impact on recognition performance (Lowerre and Reddy 1980).
A separate issue associated with the observation densities is the statistical independence assumption. If the state sequence that led to the production of O is known, the conditional probability of O as defined in (6) involves a product form that implies statistical independence (Sondhi, Levinson, and de la Noue 1988; Wellekens 1987). The partial HMM measure of (7) shows that the contribution of the underlying Markov chain [as expressed in (4)] is to multiply the result of (6), a result equivalent to assuming that the observations within a state are independent. This greatly simplified assumption may be an area for improvement. The argument is that within a state (especially one representing a stationary sound like a vowel) the observations are highly correlated. Thus assuming independence gives far too much emphasis to the match within such stationary states. The problem is one of form as well as substance. It is difficult to choose an appropriate form for Pr(O | q, λ), and it is even more difficult to estimate the parameters of the chosen form from any reasonably sized training set. The problems are similar to those of any reduced-rate measurement, namely, one measurement of Pr(· | q, λ) for each state rather than several measurements per state. Hence we are trying to estimate parameters of a much more complicated representation from far less data than in the simple, uncorrelated-observations case. As such, this problem is one for which there may never be a good solution.
4.6 Higher Order HMM's
Until now, almost all HMM formulations for speech recognition have been based on a simple first-order Markov chain. At the acoustic signal-processing level, this low-order modeling may be acceptable because the time scale of processing for each frame of signal is kept on the order of 10 msec. When an HMM is used in higher levels of a recognition system, such as syntactic or semantic processing, the first-order formulation often turns out to be inadequate. This is because the grammatical structure of the English language (and perhaps any other language) obviously
cannot be properly modeled by a first-order Markov chain. Although the structural simplicity of a first-order model makes the computation simple and straightforward, there may be a need to complete the analytical framework of higher order models. Moreover, for such higher order models to be practically useful, many of the implementational advantages of the first-order case may have to be formulated in an appropriate manner.
5. SUMMARY
In this article, we have reviewed the statistical method of HMM's. We showed that the strengths of the method lie in the consistent statistical framework, which is flexible and versatile, particularly for speech applications, and in the ease of implementation that makes the method practically attractive. We also pointed out some areas of the general HMM method that deserve more attention, with the hope that increased understanding will lead to performance improvements for many applications. These areas include the modeling criteria, particularly the problem of minimum classification error; the incorporation of new (nonspectral) features as well as prior linguistic knowledge; and the modeling of state durations and its use in speech recognition. With our current understanding, HMM systems have been shown capable of achieving recognition rates of more than 95% word accuracy in certain speaker-independent tasks with vocabularies on the order of 1,000 words. With further progress, it is not difficult to foresee HMM-based recognition systems that perform well enough to make the technology usable in everyday life.
ACKNOWLEDGMENTS
We acknowledge the excellent comments and criticisms provided by Yariv Ephraim and C. H. Lee on earlier drafts of this article.

[Received October 1990. Revised March 1991.]
REFERENCES
Allen, J. B., and Rabiner, L. R. (1977), "A Unified Theory of Short-Time Spectrum Analysis and Synthesis," Proceedings of the IEEE, 65, 1558-1564.
Atal, B. S., and Hanauer, S. L. (1971), "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," Journal of the Acoustical Society of America, 50, 637-655.
Averbuch, A., et al. (1987), "Experiments With the TANGORA 20,000 Word Speech Recognizer," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 701-704.
Bahl, L. R., Bakis, R., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L., and Mercer, R. L. (1980), "Further Results on the Recognition of a Continuously Read Natural Corpus," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 872-875.
Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L. (1986), "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 49-52.
(1988a), "A New Algorithm for the Estimation of Hidden Markov Model Parameters," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 493-496.
(1988b), "Speech Recognition With Continuous Parameter Hidden Markov Models," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 40-43.
Bahl, L. R., Jelinek, F., and Mercer, R. L. (1983), "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 179-190.
Baker, J. K. (1975), "The DRAGON System-An Overview," IEEE Transactions on Acoustics, Speech and Signal Processing, 23, 24-29.
Bakis, R. (1976), "Continuous Speech Word Recognition via Centisecond Acoustic States," unpublished paper presented at the meeting of the Acoustical Society of America, Washington, DC, April.
Baum, L. E. (1972), "An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes," Inequalities, 3, 1-8.
Baum, L. E., and Egon, J. A. (1967), "An Inequality With Applications to Statistical Estimation for Probabilistic Functions of a Markov Process and to a Model for Ecology," Bulletin of the American Meteorological Society, 73, 360-363.
Baum, L. E., and Petrie, T. (1966), "Statistical Inference for Probabilistic Functions of Finite State Markov Chains," The Annals of Mathematical Statistics, 37, 1554-1563.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970), "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," The Annals of Mathematical Statistics, 41, 164-171.
Baum, L. E., and Sell, G. R. (1968), "Growth Functions for Transformations on Manifolds," Pacific Journal of Mathematics, 27, 211-227.
Bocchieri, E. L., and Doddington, G. R. (1986), "Frame-Specific Statistical Features for Speaker Independent Speech Recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, 34, 755-764.
Bourlard, H., Kamp, Y., Ney, H., and Wellekens, C. J. (1985), "Speaker-Dependent Connected Speech Recognition via Dynamic Programming and Statistical Methods," in Speech and Speaker Recognition, ed. M. R. Schroeder, Basel, Switzerland: Karger, pp. 115-148.
Breiman, L. (1984), Classification and Regression Trees, Monterey, CA: Wadsworth.
Bridle, J. S. (1984), "Stochastic Models and Template Matching: Some Important Relationships Between Two Apparently Different Techniques for Automatic Speech Recognition," in Proceedings of the Institute of Acoustics Autumn Conference, pp. 1-8.
Brown, P. F. (1987), "The Acoustic-Modelling Problem in Automatic Speech Recognition," unpublished Ph.D. thesis, Carnegie Mellon University, Dept. of Computer Science.
Cadzow, J. A. (1982), "ARMA Modeling of Time Series," IEEE Transactions on Pattern Analysis and Machine Intelligence, 4, 124-128.
Chao, Y. R. (1968), Language and Symbolic Systems, Cambridge, U.K.: Cambridge University Press.
Chow, Y. L., et al. (1987), "BYBLOS: The BBN Continuous Speech Recognition System," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 89-92.
Cohen . J. (1985). Application of an Adaptive Auditory Model
to Speech Recognition. unpublished paper presente d at the
110th Meeting of Acoustical Society of America. Nashville.
Tennessee. November 4-8.
Daut rich. B. A , . Rabiner .
L .
R. . and Martin. T. B. (1983). On
the Effects of Varying Filter Bank Paramete rs on Isolated W ord
Recognition. IE EE Transaclions on Acoustics, Speech and Sig-
nal Processing. 3 1. 793-806.
Demp ster. A. P.. Laird. N. M.. and Rubin, D. B. (1977). Max-
imum Likelihood From Incomplete Data via the EM Algo-
rithm.
Journa l o f rhe Royal Statistical Society.
Ser. B. 39. l.
1-38.
Derouault .
A (1987). Context D epend ent Phonetic Markov
Models for Large Vocabulary Speech Recognition. in Pro-
ceedings of lhe ln~erna~ionalonference on Acoustics. Speech
and Signal Processing. New York: IE EE . pp. 360-363.
Duda. R. and Har t , P . E . (1973).
Pattern Classification and
Scene Analysis.
New York: John Wiley.
Ephraim, Y., Dembo, A., and Rabiner, L. R. (1989), "A Minimum Discrimination Information Approach for Hidden Markov Modeling," IEEE Transactions on Information Theory, 35, 1001-1013.
Ephraim, Y., and Rabiner, L. R. (1988), "On the Relations Between Modeling Approaches for Information Sources," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 24-27.
Ferguson, J. D. (1980), "Hidden Markov Analysis: An Introduction," in Hidden Markov Models for Speech, ed. J. D. Ferguson, Princeton, NJ: Institute for Defense Analyses, pp. 8-15.
Forney, G. D. (1973), "The Viterbi Algorithm," Proceedings of the IEEE, 61, 268-278.
Furui, S. (1986), "Speaker Independent Isolated Word Recognition Based on Dynamics Emphasized Cepstrum," Transactions of the Institute of Electronics and Communication Engineers, 69, 1310-1317.
Ghitza, O. (1986), "Auditory Nerve Representation as a Front-end for Speech Recognition in a Noisy Environment," Computer Speech and Language, 1, 109.
Good, I. J. (1963), "Maximum Entropy for Hypothesis Formulation, Especially for Multidimensional Contingency Tables," The Annals of Mathematical Statistics, 34, 911-934.
Gupta, V. N., Lennig, M., and Mermelstein, P. (1987), "Integration of Acoustic Information in a Large Vocabulary Word Recognizer," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 697-700.
Hobson, A., and Cheng, B. (1973), "A Comparison of the Shannon and Kullback Information Measures," Journal of Statistical Physics, 7, 301-310.
Huang, X., and Jack, M. A. (1989), "Unified Techniques for Vector Quantization and Hidden Markov Models Using Semicontinuous Models," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 639-642.
Jayant, N. S., and Noll, P. (1984), Digital Coding of Waveforms, Englewood Cliffs, NJ: Prentice-Hall.
Jelinek, F. (1969), "A Fast Sequential Decoding Algorithm Using a Stack," IBM Journal of Research and Development, 13, 675-685.
Jelinek, F. (1976), "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, 64, 532-556.
Jelinek, F., Bahl, L. R., and Mercer, R. L. (1975), "Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech," IEEE Transactions on Information Theory, 21, 250-256.
Jelinek, F., and Mercer, R. L. (1980), "Interpolated Estimation of Markov Source Parameters From Sparse Data," in Pattern Recognition in Practice, eds. E. S. Gelsema and L. N. Kanal, Amsterdam: North-Holland, pp. 381-397.
Johnson, R. W. (1979), "Determining Probability Distributions by Maximum Entropy and Minimum Cross-entropy," in Proceedings of APL79 (ACM 0-89791-005), pp. 24-29.
Juang, B. H. (1984), "On the Hidden Markov Model and Dynamic Time Warping for Speech Recognition-A Unified View," AT&T Technical Journal, 63, 1213-1243.
Juang, B. H. (1985), "Maximum Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains," AT&T Technical Journal, 64, 1235-1249.
Juang, B. H., and Rabiner, L. R. (1985), "Mixture Autoregressive Hidden Markov Models for Speech Signals," IEEE Transactions on Acoustics, Speech and Signal Processing, 33, 1404-1413.
Juang, B. H., and Rabiner, L. R. (1990), "The Segmental k-Means Algorithm for Estimating Parameters of Hidden Markov Models," IEEE Transactions on Acoustics, Speech and Signal Processing, 38, 1639-1641.
Juang, B. H., Rabiner, L. R., Levinson, S. E., and Sondhi, M. M. (1985), "Recent Developments in the Application of Hidden Markov Models to Speaker-Independent Isolated Word Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 9-12.
Juang, B. H., Rabiner, L. R., and Wilpon, J. G. (1987), "On the Use of Bandpass Liftering in Speech Recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, 35, 947-954.
Katagiri, S., Lee, C. H., and Juang, B. H. (1990), "A Theory on Adaptive Learning for Pattern Classification," unpublished manuscript.
Kohonen, T. (1986), "Learning Vector Quantization for Pattern Recognition," Report TKK-F-A601, Helsinki University of Technology.
Kullback, S. (1958), Information Theory and Statistics, New York: John Wiley.
Lea, W. A. (1980), "Prosodic Aids to Speech Recognition," in Trends in Speech Recognition, ed. W. A. Lea, Englewood Cliffs, NJ: Prentice-Hall, pp. 116-205.
Lee, C. H., Juang, B. H., Soong, F. K., and Rabiner, L. R. (1989), "Word Recognition Using Whole Word and Subword Models," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 683-686.
Lee, C. H., Lin, C. H., and Juang, B. H. (1991), "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Acoustics, Speech and Signal Processing, 39, 806-811.
Lee, C. H., Rabiner, L. R., Pieraccini, R., and Wilpon, J. G. (1990), "Acoustic Modeling for Large Vocabulary Speech Recognition," Computer Speech and Language, 4, 127-165.
Lee, C. H., Soong, F. K., and Juang, B. H. (1988), "A Segment Model Based Approach to Speech Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 501-504.
Lee, K. F. (1989), Automatic Speech Recognition: The Development of the SPHINX System, Boston: Kluwer.
Levinson, S. E. (1986), "Continuously Variable Duration Hidden Markov Models for Automatic Speech Recognition," Computer Speech and Language, 1, 29-45.
Levinson, S. E. (1987), "Continuous Speech Recognition by Means of Acoustic-Phonetic Classification Obtained From a Hidden Markov Model," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 93-96.
TECHNOMETRICS AUGUST 1991 VOL. 33 NO. 3
B. H. JUANG AND L. R. RABINER
Levinson, S. E., Rabiner, L. R., and Sondhi, M. M. (1983), "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," Bell System Technical Journal, 62, 1035-1074.
Linde, Y., Buzo, A., and Gray, R. M. (1980), "An Algorithm for Vector Quantizer Design," IEEE Transactions on Communications, 28, 84-95.
Liporace, L. A. (1982), "Maximum Likelihood Estimation for Multivariate Observations of Markov Sources," IEEE Transactions on Information Theory, 28, 729-734.
Lippmann, R. P., Martin, E. A., and Paul, D. B. (1987), "Multistyle Training for Robust Isolated Word Speech Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 705-708.
Ljolje, A., Ephraim, Y., and Rabiner, L. R. (1990), "Estimation of Hidden Markov Model Parameters by Minimizing Empirical Error Rate," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 709-712.
Lowerre, B., and Reddy, R. (1980), "The HARPY Speech Understanding System," in Trends in Speech Recognition, ed. W. Lea, Englewood Cliffs, NJ: Prentice-Hall, pp. 340-360.
Makhoul, J. (1975), "Linear Prediction: A Tutorial Review," Proceedings of the IEEE, 63, 561-580.
Markel, J. D., and Gray, A. H., Jr. (1976), Linear Prediction of Speech, New York: Springer-Verlag.
Merhav, N., and Ephraim, Y. (in press), "Maximum Likelihood Hidden Markov Modeling Using a Dominant Sequence of States," IEEE Transactions on Acoustics, Speech and Signal Processing.
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988), "On a Model-Robust Training Method for Speech Recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, 36, 1432-1436.
Nilsson, N. J. (1965), Learning Machines, New York: McGraw-Hill.
Paul, D. B. (1985), "Training of HMM Recognizers by Simulated Annealing," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 13-16.
Poritz, A. B., and Richter, A. G. (1986), "On Hidden Markov Models in Isolated Word Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 705-708.
Rabiner, L. R. (1984), "On the Application of Energy Contours to the Recognition of Connected Word Sequences," AT&T Bell Labs Technical Journal, 63, 1981-1995.
Rabiner, L. R. (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77, 257-286.
Rabiner, L. R., and Juang, B. H. (1986), "An Introduction to Hidden Markov Models," IEEE Acoustics, Speech and Signal Processing Magazine, 3, 4-16.
Rabiner, L. R., Juang, B. H., Levinson, S. E., and Sondhi, M. M. (1986), "Recognition of Isolated Digits Using Hidden Markov Models With Continuous Mixture Densities," AT&T Technical Journal, 64, 1211-1222.
Rabiner, L. R., Lee, C. H., Juang, B. H., and Wilpon, J. G. (1989), "HMM Clustering for Connected Word Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 405-408.
Rabiner, L. R., Levinson, S. E., and Sondhi, M. M. (1983), "On the Application of Vector Quantization and Hidden Markov Models to Speaker-Independent Isolated Word Recognition," Bell System Technical Journal, 62, 1075-1105.
Rabiner, L. R., Wilpon, J. G., and Juang, B. H. (1986), "A Segmental k-Means Training Procedure for Connected Word Recognition," AT&T Technical Journal, 65, 21-31.
Rabiner, L. R., Wilpon, J. G., and Soong, F. K. (1989), "High Performance Connected Digit Recognition Using Hidden Markov Models," IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 1214-1225.
Russell, M. J., and Moore, R. K. (1985), "Explicit Modeling of State Occupancy in Hidden Markov Models for Automatic Speech Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 5-8.
Schafer, R. W., and Rabiner, L. R. (1971), "Design of Digital Filter Banks for Speech Analysis," Bell System Technical Journal, 50, 3097-3115.
Shikano, K. (1985), Evaluation of LPC Spectral Matching Measures for Phonetic Unit Recognition, technical report, Carnegie Mellon University, Computer Science Dept.
Sondhi, M. M., Levinson, S. E., and De la Noue, P. (1988), unpublished report, AT&T Bell Labs.
Sondhi, M. M., and Roe, D. (1983), unpublished report, AT&T Bell Labs.
Wellekens, C. (1987), "Explicit Time Correlation in Hidden Markov Models for Speech Recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, pp. 384-386.