

    IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-28, NO. 4, AUGUST 1980 357

Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences

Abstract—Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.

I. INTRODUCTION

THE selection of the best parametric representation of acoustic data is an important task in the design of any speech recognition system. The usual objectives in selecting a representation are to compress the speech data by eliminating information not pertinent to the phonetic analysis of the data and to enhance those aspects of the signal that contribute significantly to the detection of phonetic differences. When a significant amount of reference information is stored, such as different speakers' productions of the vocabulary, compact storage of the information becomes an important practical consideration. The choice of a basic phonetic segment bears closely on the representation problem because the decision to identify an unknown segment with a reference category is based on the parameters within the entire segment. The number of different reference segments is generally smaller than the number of possible unknown segments, and therefore the step of identifying an unknown with a reference entails a significant loss of information. One can minimize the loss of useful information by examining different parametric representations in the framework of the specific recognition system under consideration. However, since the choice of a segment is so basic to the decision as to what acoustic information is useful, the result of such a comparative examination of different representations is directly applicable only to the specific recognition system, and generalization to differently organized systems may not be warranted.

Manuscript received June 11, 1979; revised December 18, 1979 and March 10, 1980. This material is based upon work supported by the National Science Foundation under Grant BNS 7682023 to Haskins Laboratories. S. B. Davis was with Haskins Laboratories, New Haven, CT 06510. He is now with Signal Technology, Inc., Santa Barbara, CA 93101. P. Mermelstein was with Haskins Laboratories, New Haven, CT 06510. He is now with Bell-Northern Research and INRS-Telecommunications, University of Quebec, Nuns' Island, Verdun, P.Q., Canada.

Fujimura [1] and Mermelstein [2] discussed in detail the rationale for use of syllable-sized segments in the recognition of continuous speech. The goal of the experiments reported here was to select an acoustic representation most appropriate for the recognition of such segments. The methods used to evaluate the representations were open testing, where the training data and test data were independently derived, and closed testing, where these data sets were identical. In each case, the same speaker produced both the reference and test data, which included the same words in a variety of different syntactic contexts. Although variation among speakers is an important problem in its own right, attention is focused here on speaker-dependent representations to restrict the different sources of variation in the acoustic data.

White and Neely [3] showed that the choice of parametric representations significantly affects the recognition results in an isolated word recognition system. Two of the best representations they explored were a 20-channel bandpass filtering approach using a Chebyshev norm on the logarithm of the filter energies as a similarity measure, and a linear prediction coding approach using a linear prediction residual [4] as a similarity measure.
From the similarity of the corresponding results, they concluded that bandpass filtering and linear prediction were essentially equivalent when used with a dynamic programming time-alignment method. However, that result may be due to the absence of phonetically similar words in the test vocabulary. Because of the known variation of the ear's critical bandwidths with frequency [5], [6], filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. Pols [7] showed that the first six eigenvectors of the covariance matrix for Dutch vowels of three speakers, expressed in terms of 17 such filter energies, accounted for 91.8 percent of the total variance. The direction cosines of these eigenvectors were very similar to a cosine series expansion on the filter energies. Additional eigenvectors showed an increasing number of oscillations of their direction cosines with respect to their original energies. This result suggested that a

0096-3518/80/0800-0357$00.75 © 1980 IEEE


compact representation would be provided by a set of mel-frequency cepstrum coefficients. These cepstrum coefficients are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale.

A preliminary experiment [9] showed that the cepstrum coefficients were useful for representing consonantal information as well. Four speakers produced 12 phonetically similar words, namely stick, sick, skit, spit, sit, slit, strip, scrip, skip, skid, spick, and slid. A representation using only two cepstrum coefficients resulted in 96 percent correct recognition of this vocabulary. Given these encouraging results, it became important to verify the power of the mel-frequency cepstrum representation by comparing it to a number of other commonly used representations in a recognition framework where the other variables, including vocabulary, are kept constant.

This paper compares the performance of different acoustic representations in a continuous speech recognition system based on syllabic units. The next section describes the organization of the recognition system, the selection of the speech data, and the different parametric representations. The following section describes the method for generating the acoustic templates for each word by use of a dynamic-warping time-alignment procedure. Finally, the results obtained with the various representations are listed and discussed from the point of view of completeness in representing the necessary acoustic information.
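Pols' observation above — the leading eigenvectors of a filter-energy covariance matrix resembling a cosine basis — can be illustrated with a toy computation. The smooth AR(1)-style covariance below is an illustrative assumption, not data from [7]; it only mimics the fact that neighbouring filter channels are strongly correlated:

```python
import numpy as np

# Toy stand-in for a filter-energy covariance: correlation between
# channels decays smoothly with channel distance.
n = 17                                   # number of filter channels, as in [7]
C = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Eigenvectors of the symmetric covariance, sorted by decreasing eigenvalue.
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# The top eigenvector of such a smooth covariance is "DC-like" (a single
# sign, guaranteed by Perron-Frobenius for a positive matrix), and each
# successive eigenvector oscillates faster -- qualitatively matching a
# cosine-series expansion, which is what motivates a cepstrum transform.
v1 = vecs[:, 0]
assert np.all(v1 > 0) or np.all(v1 < 0)
```

This is only a qualitative sketch of why a fixed cosine basis can stand in for data-dependent eigenvectors when the underlying covariance is smooth.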

II. EXPERIMENTAL FRAMEWORK

A rather simple speech recognition framework served as the testbed to evaluate the various acoustic representations. Lexical information was utilized in the form of a list of possible words and their corresponding acoustic templates, and these words were assumed to occur with equal likelihood. No syntactic or semantic information was utilized. If such information had been present, it could have been used to restrict the number of admissible lexical hypotheses or assign unequal probabilities to them. Thus, in practice, instead of matching hypotheses to the entire vocabulary, the number of lexical hypotheses that one evaluates may be reduced to a much smaller number. This reduction would cause many of the hypotheses phonetically similar to the target word to be eliminated from consideration. Thus the high phonetic confusability of the test data may have resulted in a test environment that is more rigorous than would be encountered in practice.

A. Selection of Corpus

The performance of continuous speech recognition systems is determined by a number of distinct sources of acoustic variability, including speaker characteristics, speaking rate, syntax, communication environment, and recording and/or transmission conditions. The focus of the current experiments

Fant [8] compares Beranek's mel-frequency scale, Koenig's scale, and Fant's approximation to the mel-frequency scale. Since the differences between these scales are not significant here, the mel-frequency scale should be understood as a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.

is acoustic recognition in the face of variability induced in words of the same speaker by variation of the surrounding words and by syntactic position. The use of a separate reference template for each different syntactic environment which a word might occupy would require exorbitant amounts of storage and training data. Thus an important practical requirement is to generate reference templates without regard to the syntactic position of the word. To avoid the problem of automatically segmenting complex consonantal clusters, the corpus was composed of monosyllabic target words that were semantically acceptable in a number of different positions in a given syntactic context. Since acoustic variation due to different speakers is a distinctly separate problem [10], it was considered advisable to restrict the scope of these initial experiments by using only speaker-dependent templates. That is, both reference and test data were produced by the same speaker.

The sentences were read clearly in a quiet environment and recorded using a high-quality microphone. These recording conditions were selected to establish the best performance level that one could expect the recognition system to attain. Environments with higher ambient noise, which may be encountered in a practical speech input situation, would undoubtedly detract from the clarity of the acoustic information and therefore result in lower performance.

The speech data comprised 52 different CVC words from two male speakers (DZ and LL), and a total of 169 tokens were collected from 57 distinct sentences (Appendix A). The sentences were read twice by each speaker in recording sessions separated in time by two months (denoted by DZ1, DZ2, LL1, and LL2). Thus the data consisted of a total of 676 syllables. To achieve the required variability, the selected words could be used as both nouns and verbs. For example, "Keep the hope at the bar" and "Bar the keep for the yell" are two sentences that allow syntactic variation but preserve the same overall intonation pattern. All the words examined carried some stress; the unstressed function words were not analyzed. The target words, all CVCs, included 12 distinct vowels, among them /i, e, ɛ, æ, ʌ, ʊ, u, ɔ, o/, some of which are normally diphthongized in English. Each vowel was represented in at least four different words, and these words manifested differences in both the prevocalic and postvocalic consonants. The consonants comprised simple consonants as well as affricates, but no consonantal clusters.

B. Segmentation

An automatic segmentation process [11] was initially considered as one way of delimiting syllable-sized units in continuously spoken text, but any such algorithm performs the segmentation task with a finite probability of error. In particular, weak unstressed function words sometimes appear appended to the adjacent words carrying stronger stress. Additionally, in this study, a boundary point located for an intervocalic consonant with high sonority may not consistently join that consonant to the word of interest. In order to avoid possible interaction between segmentation errors and poor parametric representations, manual segmentation and auditory evaluation were used to accurately delimit the signal


DAVIS AND MERMELSTEIN: MONOSYLLABIC WORD RECOGNITION 359

Fig. 1. Filters for generating mel-frequency cepstrum coefficients (frequency axis 0–4600 Hz).
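A sketch of how a filterbank like the one in Fig. 1 might be constructed. The 20 filters and the roughly 4.6 kHz upper limit are read off the figure; the particular log slope of the mel approximation, the uniform spacing of centres on the mel scale, and the unit-height triangles are illustrative assumptions rather than the paper's exact design:

```python
import numpy as np

def mel_approx(f):
    """Mel scale as in the footnote: linear below 1000 Hz, log above.
    The base-2 slope (one octave above 1 kHz = +1000 mel) is an assumption."""
    f = np.asarray(f, dtype=float)
    return np.where(f <= 1000.0, f,
                    1000.0 + 1000.0 * np.log2(np.maximum(f, 1.0) / 1000.0))

def mel_inv_approx(m):
    """Inverse of mel_approx."""
    m = np.asarray(m, dtype=float)
    return np.where(m <= 1000.0, m, 1000.0 * 2.0 ** ((m - 1000.0) / 1000.0))

def mfcc_filterbank(n_filters=20, n_fft=256, fs=10000.0, f_lo=0.0, f_hi=4600.0):
    """Triangular filters with centres uniformly spaced on the approximate
    mel scale; returns an (n_filters, n_fft//2 + 1) response matrix."""
    edges = mel_inv_approx(np.linspace(mel_approx(f_lo), mel_approx(f_hi),
                                       n_filters + 2))
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft   # one-sided DFT bin freqs
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        # rising slope up to the centre, falling slope after it
        fb[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                   (hi - freqs) / (hi - c)), 0.0, 1.0)
    return fb
```

Applying each row to a short-term power spectrum and taking the logarithm yields the per-filter log-energies X_k used in the cepstrum computation below.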

corresponding to the target words. The segmentation, as well as the subsequent analysis and recognition, was performed on a PDP-11/45 minicomputer with the Interactive Laboratory System [12]. In systems employing automatic segmentation, the actual recognition rates can be expected to be lower due to the generation of templates from imperfectly delimited words [13]. However, there is no reason to believe that segmentation errors would not detract equally from the recognition rates obtained for the various parametric representations.

C. Parametric Representations

The parametric representations evaluated in this study may be divided into two groups: those based on the Fourier spectrum and those based on the linear prediction spectrum. The first group comprises the mel-frequency cepstrum coefficients (MFCC) and the linear frequency cepstrum coefficients (LFCC). The second group includes the linear prediction coefficients (LPC), the reflection coefficients (RC), and the cepstrum coefficients derived from the linear prediction coefficients (LPCC). A Euclidean distance metric was used for all cepstrum parameters since cepstrum coefficients are derived from an orthogonal basis. This metric was also used for the RC, in view of the lack of an inherent associated distance metric. The LPC were evaluated using the minimum prediction residual distance metric [4].

Each acoustic signal was low-pass filtered at 5 kHz and sampled at 10 kHz. Fourier spectra or linear prediction spectra were computed for sequential frames 64 points (6.4 ms) or 128 points (12.8 ms) apart. In each case, a 256-point Hamming window was used to select the data points to be analyzed. (A window size of 128 points produced degraded results.)

For the MFCC computations, 20 triangular bandpass filters were simulated as shown in Fig. 1. The MFCC were computed as

\mathrm{MFCC}_i = \sum_{k=1}^{20} X_k \cos\!\left[ i \left(k - \tfrac{1}{2}\right) \frac{\pi}{20} \right], \qquad i = 1, 2, \ldots, M \tag{1}

where M is the number of cepstrum coefficients and X_k, k = 1, 2, \ldots, 20, represents the log-energy output of the kth filter.

The LFCC were computed from the log-magnitude discrete Fourier transform (DFT) directly as

\mathrm{LFCC}_i = \sum_{k=1}^{K} (\log Y_k) \cos\!\left( \frac{2\pi i k}{K} \right), \qquad i = 1, 2, \ldots, M \tag{2}

where K is the number of DFT magnitude coefficients Y_k.

The LPC were obtained from a 10th-order all-pole approximation to the spectrum of the windowed waveform. The autocorrelation method for evaluation of the linear prediction coefficients was used [14]. The RC were obtained by a transformation of the LPC which is equivalent to matching the inverse of the LPC spectrum with a transfer-function spectrum that corresponds to an acoustic tube consisting of ten sections of variable cross-sectional area [15]. The reflection coefficients determine the fraction of energy in a traveling wave that is reflected at each section boundary.

The LPCC were obtained from the LPC directly as [14]

\mathrm{LPCC}_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, \mathrm{LPCC}_k\, a_{n-k}, \qquad 1 \le n \le 10. \tag{3}
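A minimal sketch of the two cepstrum computations, (1) and (3), assuming NumPy and a precomputed matrix of triangular filter responses; the small floor added before the logarithm and the predictor sign convention x[t] ≈ Σ_k a_k x[t−k] are implementation assumptions, not details from the paper:

```python
import numpy as np

def mel_cepstrum(frame, filterbank, n_coeffs=10):
    """Eq. (1): cosine transform of filterbank log-energies of one
    Hamming-windowed frame.  `filterbank` is an (n_filters, n_bins)
    matrix of triangular responses over the one-sided DFT bins."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    X = np.log(filterbank @ spectrum + 1e-12)     # log-energy per filter
    n_filt = filterbank.shape[0]
    k = np.arange(1, n_filt + 1)
    return np.array([np.sum(X * np.cos(i * (k - 0.5) * np.pi / n_filt))
                     for i in range(1, n_coeffs + 1)])

def lpc_cepstrum(a, n_coeffs=10):
    """Eq. (3): recursion from LPC a_1..a_p to cepstrum coefficients,
    extended past n = p with the a_n term absent."""
    p = len(a)
    c = np.zeros(n_coeffs)
    for n in range(1, n_coeffs + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if 1 <= n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a one-pole predictor a = [r], the recursion reproduces the known cepstrum r^n/n of 1/(1 − r z^{-1}), which is a convenient sanity check.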

The Itakura metric represents the distance between two spectral frames, with optimal (reference) LPC \bar{a} and test LPC \hat{a}, as

d(\bar{a}, \hat{a}) = \log \frac{\bar{a}^{T} R\, \bar{a}}{\hat{a}^{T} R\, \hat{a}} \tag{4}

where R is the autocorrelation matrix (obtained from the test sample) corresponding to \hat{a}. The metric measures the residual error when the test sample is filtered by the optimal LPC. Because of its asymmetry, the Itakura metric requires specific identification of the reference coefficients (\bar{a}) and the test coefficients (\hat{a}). For computational efficiency, the denominator of (4) will be unity if \hat{a} is expressed in unnormalized form. Then if \hat{r}(n) denotes the unnormalized diagonal elements of R, r_{LP}(n) denotes the unnormalized autocorrelation coefficients from the LPC polynomial, and the logarithm is eliminated, the distance may be expressed as [16]

D[\hat{r}, r_{LP}] = \hat{r}(0)\, r_{LP}(0) + 2 \sum_{i=1}^{10} \hat{r}(i)\, r_{LP}(i). \tag{5}

III. GENERATION OF ACOUSTIC TEMPLATES
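Before turning to template generation, the unnormalized distance (5) above can be checked numerically. It rests on the quadratic-form identity b^T R b = r̂(0) r_LP(0) + 2 Σ_i r̂(i) r_LP(i) for a Toeplitz autocorrelation matrix R; the autocorrelation and filter values below are made-up illustrative numbers, not data from the paper:

```python
import numpy as np

def itakura_unnormalized(r_hat, b):
    """Eq. (5): unnormalized prediction-residual distance.  r_hat holds the
    test frame's autocorrelations r_hat(0..p); b is the reference LPC
    inverse-filter polynomial [1, -a_1, ..., -a_p]."""
    p = len(b) - 1
    # autocorrelation of the LPC polynomial coefficients
    r_lp = np.array([np.sum(b[: len(b) - i] * b[i:]) for i in range(p + 1)])
    return r_hat[0] * r_lp[0] + 2.0 * np.sum(r_hat[1 : p + 1] * r_lp[1:])

# Check against the explicit quadratic form b^T R b with Toeplitz R.
r_hat = np.array([2.0, 0.5, 0.1])        # made-up autocorrelations
b = np.array([1.0, -0.6, 0.2])           # made-up inverse filter
R = np.array([[r_hat[abs(i - j)] for j in range(3)] for i in range(3)])
assert np.isclose(itakura_unnormalized(r_hat, b), b @ R @ b)
```

The per-frame cost is thus a handful of multiply-adds once the two autocorrelation sequences are available, which is what makes the metric attractive for dynamic-programming search.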

The use of templates to represent the acoustic information in reference tokens allows a significant computation reduction compared to use of the reference tokens themselves. The design of a template generation process is governed by the goal of finding the point in acoustic space that simultaneously minimizes the distance to all given reference items. Where the appropriate distance is a linear function of the acoustic variables, this goal can be realized by the use of classic pattern recognition techniques. However, phonetic features are not uniformly distributed across the acoustic data, and therefore perceptually motivated distance measures are nonlinear functions of those data. To avoid the computationally exorbitant procedure of simultaneously minimizing the set of nonlinear distances, templates are incrementally generated by introducing additional acoustic information from each reference token to the partial template formed from the previously used reference tokens.

Fig. 2. Iterative algorithm for template generation.

Given a distance between two tokens, or between a token and a template, the new template can be located along the line whose extent measures that distance. Since only acoustically similar tokens are to be combined into individual templates, one may expect that this procedure will exploit whatever local linearization the space permits.

A. Template Generation Algorithms

In one algorithm [10], an initial template is chosen as the token whose duration is closest to the average duration of all tokens representing the same word (Fig. 2). Then all remaining tokens are warped to the initial template. The warping is achieved by first using dynamic programming to provide a mapping (or time registration) between any token and the reference template. Following the notation in [17], let Ti(m), 0



Fig. 3. Noniterative algorithm for template generation.

Fig. 4. Dynamic time alignment of speech samples: the template B(n) against the extended test token A(m), with the range of search for the time alignment about the diagonal over the time span of the template.

the segments aligned are monosyllabic words, one can take advantage of a number of well-defined acoustic features to guide the alignment procedure. For example, the release of a prevocalic voiced stop or the onset of frication of a postvocalic fricative manifest themselves by means of such acoustic features. The particular alignment procedure used meets these requirements without requiring explicit decisions concerning the nature of the acoustic events.

The alignment operation employed a modified form of the dynamic programming algorithm first applied to spoken words by Velichko and Zagoruyko [19] and subsequently modified by Bridle and Brown [20] and Itakura [4]. In view of the intent to use the same algorithm for template generation as for recognition of unknown tokens, a symmetric dynamic programming algorithm was utilized. Sakoe and Chiba [21] have recently shown that a symmetric dynamic programming algorithm yields better word recognition results than previously used asymmetric forms. Execution of the algorithm proceeded in two stages (Fig. 4). First, the pair of tokens to be compared was time aligned by

appending silence to the marked endpoints and linearly shifting the shorter of the pair, with respect to the longer, to achieve a preliminary distance minimum. Since monosyllabic words generally possess a prominent syllabic peak in energy, this operation ensured that the syllabic peaks were lined up before the nonlinear minimization process was started. Informal evaluation has shown that use of the preliminary alignment procedure yields better results than omitting the procedure or using a linear time warping procedure to equalize the time durations of the tokens. The two tokens, extended by silence where necessary, were then subjected to the dynamic programming search to find an improved distance minimum. The preliminary distance minimum, found as a result of the initial linear time alignment procedure, corresponded to the distance computed along the diagonal of the search space and represented in most cases a good starting point for the subsequent detailed search. Use of this preliminary time alignment, and the additional invocation of a penalty function when the point selected along the dynamic programming path implied unequal time increments along the measured data, generally forced the optimum warping path to be near the diagonal, unless prominent acoustic information was present to indicate the contrary. For efficiency in programming, zeros (representing silence) were never really appended to the data; rather, the time shift was retained and used to trigger a modified Euclidean or Itakura distance measure when appropriate.

The use of silence to extend the syllable tokens in the preliminary time alignment, instead of linear time expansion or contraction as implied by asymmetric formulations of the dynamic programming algorithm, requires some justification. The comparison here is among syllable-sized units which generally possess an energy peak near the center regions and lesser energy near the ends. Based on a perceptual model, extension of the tokens by silence is clearly appropriate. Linear time scale changes would obscure equally the more significant duration information in the consonantal regions and the less significant duration information in the vocalic regions. Discrimination between words like "pool" and "fool" depends critically on the duration of the prevocalic burst or fricative. The alignment ensures that the prominent vowel regions are lined up before time scale changes in the consonantal regions are examined.

C. Dynamic Warping Algorithm

    The dynamic warping algorithm serves to estimate the simi-larity between an unkno wn to ken a nd a reference template.Additionally, it serves to align a reference token with a partialtemplate to ensure that phonetically similar spectral framesare averaged in generating a composite tem plate. Through thepreliminary alignment procedure discussed above, the token ortem plate , whichever is shortest, is extended by silence frames onbo th sides. The resulting multidimensional acoustic representa-tions of th e pair of patterns com pared can be denoted by A ( m ) ,m = l , 2 ; . - , M a n d B ( n ) , n = l , 2 ; . . , M . For e a c hpa i r o fframes {A(m), B(n)}, local distance fun ction,D[A , B ] canbe defined for es-timating he similarity at point x(m, n). Achange of variables identifies x(m, n) as x ( p , 4) ,where p and4 are measured along and normal to th e diagonal illustrated inFig. 4. For each position along the diagonal {x@, 0), 1