Nonuniform speaker normalization using affine transformation



S. V. Bharath Kumar a)

Department of Electrical and Computer Engineering, University of California-San Diego, La Jolla, California 92093-0407

S. Umesh b)

Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208016, India

(Received 6 July 2006; revised 31 May 2008; accepted 2 June 2008)

In this paper, a well-motivated nonuniform speaker normalization model that affinely relates the formant frequencies of speakers enunciating the same sound is proposed. Using the proposed affine model, the corresponding universal-warping function that is required for normalization is shown to have the same parametric form as the mel scale formula. The parameters of this universal-warping function are estimated from the vowel formant data and are shown to be close to the commonly used formula for the mel scale. This shows an interesting connection between nonuniform speaker normalization and the psychoacoustics-based mel scale. In addition, the affine model fits the vowel formant data better than commonly used ad hoc normalization models. This work is motivated by a desire to improve the performance of speaker-independent speech recognition systems, where speaker normalization is conventionally done by assuming a linear-scaling relationship between spectra of speakers. The proposed affine relation is extended to describe the relationship between spectra of speakers enunciating the same sound. On a telephone-based connected digit recognition task, the proposed model provides improved recognition performance over the linear-scaling model. © 2008 Acoustical Society of America. (DOI: 10.1121/1.2951597)

PACS numbers: 43.72.Ar, 43.72.Ne [DOS]  Pages: 1727–1738

I. INTRODUCTION

Over the years, there has been much interest in trying to understand the fact that phonologically identical utterances show a great deal of acoustic variation, and yet listeners are able to recognize words spoken by different speakers despite these variations. Many approaches have been proposed to reduce the interspeaker variation in the acoustic data, especially for formants of vowels. These include the formant-ratio theory, which is based on the idea that vowels are relative patterns and not absolute formant frequencies. Some of these formulations include those by Syrdal and Gopal (1983) and Miller (1989),

Syrdal-Gopal: B(F_1) − B(F_0), B(F_2) − B(F_1), B(F_3) − B(F_2),

Miller: log(F_1/SR), log(F_2/F_1), log(F_3/F_2),

where B(F_i) is the "Bark" equivalent of the ith formant frequency F_i and SR is the sensory reference derived from the geometric mean of F_0 over an interval of time. Nearey (1978) used constant log-interval normalization given by

a) This work was done while the author was a graduate student at the Department of Electrical Engineering, Indian Institute of Technology, Kanpur 208016, India. URL: http://ieng9.ucsd.edu/~bsriperu. Electronic mail: [email protected]
b) URL: http://home.iitk.ac.in/~sumesh. Electronic mail: [email protected]

J. Acoust. Soc. Am. 124 (3), September 2008

Nearey: log(F_1) − L̄, log(F_2) − L̄, log(F_3) − L̄,

where L̄ is the mean log value of the speaker's F_1 and F_2. Bladon et al. (1983) proposed a normalization method based on the observation that the average frequencies of formants of vowels produced by men and those produced by women differ by approximately 1 Bark. This method uses the whole spectrum and not just the formant frequencies. Therefore, their method of normalization involved shifting down the auditory spectrum produced by women by about 1 Bark.

Another approach to the problem of vowel normalization is to use the idea of vocal-tract length normalization. Interspeaker variations are attributed to the physiological differences in the vocal tracts of the speakers. Nordström and Lindblom (1975) proposed a normalization procedure in which the formants are scaled by a constant scale factor based on an estimate of the speaker's average vocal-tract length in open vowels, as determined from measurement of F_3. This is usually referred to as uniform scaling. In this approach, all the formant frequencies of the subject to be normalized are simply divided by the factor F̂_3,s / F̂_3,r, where F̂_3,s and F̂_3,r are the average F_3 of open vowels of the subject s and the male reference speaker r, respectively. Fant (1975) argued that uniform scaling is a very crude approximation and proposed that the scale factor be made a function of both formant number and vowel category. With this approach, Fant (1974) claimed to reduce the female-male variance to one-half of that remaining after the simple uniform-scaling-based normalization suggested by Nordström and Lindblom (1975). Nearey (1978) extensively studied the validity of the linear-scaling model and concluded that there may be some


systematic speaker-dependent variation supporting some of Fant's (1974) observations. However, his efforts to find a better additive transform than the log transform (which corresponds to linear scaling) using generalized linear models did not yield any alternate scale (Nearey, 1992).
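The uniform-scaling procedure described above can be sketched in a few lines; this is only an illustration, with made-up formant values and a hypothetical helper name:

```python
# Uniform (linear) scaling: divide all of a subject's formant frequencies
# by the ratio of average F3 in open vowels between subject and reference.
# All numeric values below are illustrative only.

def uniform_normalize(formants_subject, avg_f3_subject, avg_f3_reference):
    """Scale every formant by the constant factor avg_f3_subject / avg_f3_reference."""
    k = avg_f3_subject / avg_f3_reference
    return [f / k for f in formants_subject]

# Hypothetical female subject vs. male reference speaker:
subject_formants = [730.0, 2058.0, 2979.0]   # F1, F2, F3 of one vowel (Hz)
normalized = uniform_normalize(subject_formants,
                               avg_f3_subject=2990.0,
                               avg_f3_reference=2550.0)
```

Note that the same constant divides every formant, which is exactly the first-order approximation the later sections argue is too crude at low frequencies.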

In speaker-independent automatic speech recognition (ASR), there is considerable interest in speaker normalization since one of the major factors affecting recognition performance is the acoustic variability in similar enunciations by different speakers. To improve the performance of these systems and to make them approach that of speaker-dependent systems, various speaker normalization procedures based on spectral warping (Kamm et al., 1995; Lee and Rose, 1998; Wakita, 1977; Wegmann et al., 1996) have been proposed in the speech recognition literature. Most of these approaches linearly scale the frequency axis of the spectra of the given utterance to compensate for differences in the formant positions between speakers. We refer to these normalization procedures as linear scaling or uniform normalization since all frequencies are scaled or normalized by a constant scale factor. The motivation for linear scaling comes from the fact that, to a first-order approximation, the vocal-tract shape can be assumed to be a tube of uniform cross section, and for this simplifying approximation, the formant frequencies are inversely proportional to the vocal-tract length (Wakita, 1977). Although we often refer to the relation between formants, in speech recognition the same relation is extended to the entire spectral envelope since the features in ASR are usually derived from the spectral envelope.

The simple linear-scaling model discussed above neglects both the location of constrictions and the vocal-tract shape. Motivated by Fant's (1975) work, there has been considerable interest in nonlinear scaling in the speech recognition literature. Since Fant's (1975) method requires knowledge of the phoneme and formant number, it cannot be directly used in ASR. Therefore, in ASR, different ad hoc nonlinear frequency-scaling functions have been proposed (Acero and Stern, 1991; Eide and Gish, 1996; McDonough et al., 1998; Zhan and Waibel, 1997; Zhan and Westphal, 1997), where the choice of a nonlinear frequency-scaling function is mostly driven by the ease of implementation and parsimony of parameters. However, in most of these ad hoc schemes, there is no specific physiological or acoustic motivation for the choice of a particular parametric form.

In this paper, we propose a well-motivated nonlinear frequency-scaling function that affinely relates the formant frequencies between speakers enunciating the same sound. The motivation for the proposed model comes from our previous experiments (Umesh et al., 2002b) on actual speech data, where we empirically obtained a piecewise approximation to a warping function that separates the speaker-dependent factor as a translation factor. In this paper, we derive the mathematical form of the universal-warping function associated with the proposed affine model and show that it is also similar in shape to the empirically obtained warping function. The proposed model is compared with other linear and nonlinear models commonly used in speech recognition and is shown to provide the best fit to formant data. The efficacy of the method in speech recognition is demonstrated on a telephone-data-based connected digit recognition task, where the affine method provides improved recognition performance over the conventional uniform scaling model. We conclude by pointing out the similarity of the universal-warping function associated with the affine model (based on speech data alone) and the mel-warp function (based on hearing experiments), thereby showing an interesting connection between psychoacoustics and nonuniform speaker normalization.

II. REVIEW OF UNIVERSAL-WARPING FUNCTION AND RELATION BETWEEN NORMALIZATION SCHEMES

In this section, we review our previous work (Umesh et al., 2002a, 2002b), where we have approached the problem of speaker normalization through the concept of a universal-warping function. The basic idea of the universal-warping function is to find a warping of the frequency axis that maps the physical frequency, f, to an alternate domain, λ, such that in the alternate domain the speaker-dependent parameter separates out as a pure translation factor. It is universal in the sense that the same mapping should be applicable to all speech data irrespective of the speaker. Throughout this paper, when we talk of spectra (i.e., functions of frequency), we use f to denote the frequency variable, and when we talk of formants (i.e., specific frequency values), we use F_i to denote the ith formant frequency. The universal-warping approach assumes that the speaker dependencies can be modeled through a single translation factor. For example, the commonly assumed linear-scaling model, where all formant frequencies F_i between speakers r and s are scaled by one constant α_rs, is given by

F_i,r = α_rs F_i,s. (1)

In this case, α_rs is the speaker-dependent parameter that relates speaker s with the reference speaker r. This can be equivalently expressed in the log-warped domain as

log(F_i,r) = log(α_rs) + log(F_i,s), (2)

where the speaker-dependent α_rs separates out as a translation factor, i.e., log(α_rs). For the uniform scaling model of Eq. (1), we see from Eq. (2) that the universal-warping function is the log warping, and in this case, the speaker-dependent scale factor separates out as a translation factor in the log-warped domain. Equivalently, if we take the ratio of formants (say, ith and jth) for the same speaker, we have

log(F_i,r / F_j,r) = log(F_i,r) − log(F_j,r)
    = log(α_rs F_i,s / (α_rs F_j,s)) = log(F_i,s) − log(F_j,s). (3)
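The scale-invariance expressed by Eq. (3) is easy to check numerically; the sketch below (with made-up formant values and scale factor) verifies that under linear scaling the log-warped formants of two speakers differ by a constant shift, independent of the formant number:

```python
import math

# Hypothetical formants of a subject speaker (Hz) and a reference speaker
# related by the linear-scaling model of Eq. (1) with alpha_rs = 1.2.
alpha_rs = 1.2
F_s = [500.0, 1500.0, 2500.0]
F_r = [alpha_rs * f for f in F_s]

# In the log-warped domain the speaker difference is a pure translation:
shifts = [math.log(fr) - math.log(fs) for fr, fs in zip(F_r, F_s)]
# every shift equals log(alpha_rs), independent of the formant index i
assert all(abs(s - math.log(alpha_rs)) < 1e-12 for s in shifts)

# Equivalently, formant ratios (Eq. (3)) are speaker independent:
assert abs(math.log(F_r[1] / F_r[0]) - math.log(F_s[1] / F_s[0])) < 1e-12
```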

Hence Miller's (1989) approach to normalization, which is similar to the above equation (except for F_1/SR), is equivalent to uniform or linear scaling. Nearey's approach is also a variation of the uniform scaling model, with L̄ corresponding to the speaker-dependent shift factor. We recently became aware of the work of Nearey (1978), where he introduced the use of the log-additive hypothesis. Our concept of universal-warping function is similar to this concept, except that we are looking in a generalized framework of translation for any transformation, and not necessarily the log transformation. Note that since the shift factor does not depend on any specific phoneme, but only on the speaker, it is often referred to as an extrinsic normalization factor. Extending the model in Eq. (1) to spectral envelopes, we can assume that the spectral envelopes of r and s are scaled versions of one another, i.e., P_r(f) = P_s(α_rs f). In the case of spectral envelopes, log-warping the frequency axis, i.e., λ = log(f), results in

P̃_r(λ) = P_r(e^λ) = P_s(e^{λ + log α_rs}) = P̃_s(λ + log α_rs), (4)

where P̃_r and P̃_s are the log-warped versions of P_r and P_s, respectively. Therefore, the frequency-warped spectral envelopes are shifted versions of each other if the model in Eq. (1) is indeed true.

From the experiments of Fant (1975) and some of our previous experiments in Umesh et al. (2002a, 2002b, 2002c), it has been observed that there are deviations from the uniform scaling model. Since there are deviations from the uniform scaling model, log warping is not the appropriate universal-warping function to separate the speaker-dependent parameter. In Umesh et al. (2002b, 2002c), a piecewise approximation to the universal-warping function was found empirically from speech data, such that in the universal-warped domain, the same sound enunciated by different speakers appeared as translated versions of one another. This empirically obtained universal-warping function is referred to as the speech scale. Interestingly, the speech scale was found to be "very similar" to the mel scale.

Note that since the Bark scale is similar to the mel scale (and hence the speech scale), the normalization method of Bladon et al. (1983) also uses a similar idea, which involves shifting down the auditory spectrum produced by women by about 1 Bark. While Bladon et al. (1983) used a gender-specific shift, every speaker has a speaker-specific relative shift in the speech scale in the approach of Umesh et al. (2002b, 2002c). Further, this normalization approach is very similar to that proposed by Syrdal and Gopal (1983). Note that the empirical speech scale, the mel scale, and the Bark scale are very similar, and hence we have the following normalization:

S(F_i,r) − S(F_i,s) ≈ mel(F_i,r) − mel(F_i,s) = c_rs, (5)

where S(·) is the speech scale and c_rs is a speaker-dependent constant that is independent of i, i.e., of the formant number.

In this paper, our goal is to find the relation between spectra of the same sound enunciated by different speakers, such that the corresponding universal-warping function is similar to our empirically obtained speech scale.

III. PROPOSED AFFINE TRANSFORMATION: RELATION BETWEEN FORMANTS OF SPEAKERS

We propose the following affine model to describe the relation between the formant frequencies of subject and reference speakers, i.e.,

(F_r + β) = α_rs (F_s + β), (6)

where F_r and F_s are the formant frequencies of the reference speaker r and subject speaker s, respectively, α_rs is a speaker-dependent parameter relating the reference and subject speakers, and β is a constant in the model that does not depend on the speakers. Equivalently, we can rewrite Eq. (6) as

F_r = α_rs F_s + β(α_rs − 1). (7)

We can clearly see the affine relation in the above equation, and hence we refer to the proposed model as the affine model. Note that unlike the conventional affine equation, the shift factor is also a function of the scaling factor α_rs. As before, the same model is also used to describe the relationship between spectral envelopes. Before studying the model, let us first discuss the motivation behind its proposal.
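The algebraic equivalence of the two forms of the model can be checked in a couple of lines; the values of α_rs, β, and F_s below are purely illustrative:

```python
# Equivalence of the two forms of the affine model (illustrative values).
alpha_rs = 1.2   # speaker-dependent scale factor
beta = 500.0     # speaker-independent constant of the model
F_s = 600.0      # a subject-speaker formant (Hz)

# Form (6): (F_r + beta) = alpha_rs * (F_s + beta), solved for F_r
F_r_eq6 = alpha_rs * (F_s + beta) - beta
# Form (7): F_r = alpha_rs * F_s + beta * (alpha_rs - 1)
F_r_eq7 = alpha_rs * F_s + beta * (alpha_rs - 1.0)

assert abs(F_r_eq6 - F_r_eq7) < 1e-9
```

Note how the shift term β(α_rs − 1) vanishes when α_rs = 1, so identical speakers map to themselves.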

A. Motivation

The proposed affine model relating the formant frequencies of speakers enunciating the same sound is broadly motivated by the following observations.

(1) We are interested in finding a parametric model relating the formants of speakers such that the corresponding universal-warping function has a parametric form similar to the mathematical formula for the mel scale, since the mel scale closely matches the empirically obtained piecewise-linear speech scale.

(2) It is well known that the higher-order formants are mostly affected by the vocal-tract length, and so the first-order approximation of linear scaling is valid at higher frequencies. Therefore, the choice of a parametric model should capture this behavior at higher frequencies, where the scale factor should be almost constant between any two speakers. From Eq. (6), we see that this is approximately true for F_r, F_s ≫ β, in which case F_r ≈ α_rs F_s.

(3) Finally, at low frequencies (lower formants), observations on Texas Instruments-Massachusetts Institute of Technology (TIMIT) data (Umesh et al., 2002a) have shown that there are significant deviations from uniform scaling, with the trend being increasing dilation/compression with decreasing frequencies (lower formants). A similar trend in the values of warp factors at low frequencies has been noticed by Potamianos and Narayanan (2003). Nearey also observes a weak indication of a roughly sinusoidal pattern of deviation for low-frequency formants in Sec. IV C of Nearey (1978). As we will show later, the proposed model reflects the above trends as well.

For the proposed affine model in Eq. (6), analogous to uniform scaling, our goal is to find a one-to-one transformation λ = λ(f) such that

P̃_r(λ) = P_r(λ^{−1}(λ))
    = P_s(g(α_rs, λ^{−1}(λ)) λ^{−1}(λ)) = P̃_s(λ + η_rs), (8)

where g(·) is both a frequency- and a speaker-dependent scaling function, and η_rs is only speaker dependent and is independent of frequency. In this case, λ = λ(f) is the universal-warping function for the affine model.


B. Universal frequency-warping function for affine model

The affine model that we have proposed in this paper in Eq. (6) can be rewritten as

log(1 + F_r/β) = log α_rs + log(1 + F_s/β), β > 0. (9)

With λ = log(1 + f/β), we have λ_r = λ_s + log α_rs, where λ_r and λ_s are the warped frequencies of f = F_r and f = F_s, respectively. Hence, the warped frequencies appear as shifted or translated versions in the λ-domain, and the translation factor is speaker dependent. Therefore, the universal-warping function for the affine model is

λ = λ(f) = log(1 + f/β). (10)
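The translation property of Eq. (9) can be verified numerically; the sketch below uses illustrative values for α_rs and β and checks that the warped-domain shift is the same constant, log α_rs, at every frequency:

```python
import math

def warp(f, beta):
    """Universal-warping function of Eq. (10): lambda = log(1 + f/beta)."""
    return math.log(1.0 + f / beta)

alpha_rs, beta = 1.25, 500.0           # illustrative parameter values
for F_s in (300.0, 1200.0, 2600.0):    # assorted subject formants (Hz)
    F_r = alpha_rs * (F_s + beta) - beta            # affine model, Eq. (6)
    shift = warp(F_r, beta) - warp(F_s, beta)
    # the shift is log(alpha_rs), independent of frequency
    assert abs(shift - math.log(alpha_rs)) < 1e-12
```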

We are interested in the relation between λ(f) and the mel scale (which from Umesh et al. (2002a) we know is close to the empirical speech scale).

Stevens and Volkman (1940) experimentally obtained a nonlinear mapping between the perceived and physical frequencies of a tone and referred to it as the mel scale. In their original work, Stevens and Volkman (1940) had experimentally obtained the mapping at a discrete set of points, to which various closed-form curves have been fitted by researchers. The widely accepted closed-form approximations to the mel scale have the functional form

mel = a log_10(1 + f/b), (11)

where f is in Hz and mel is in mels. Fant's technical mel formula is defined with a = 1000/log 2 and b = 1000, whereas in speech recognition the widely used formula is defined with a = 2595 and b = 700, i.e.,

mel(f) = 2595 log_10(1 + f/700). (12)

Although in speech recognition the mel scale is the most commonly used psychoacoustic scale, in many other areas of speech the Bark scale or the equivalent rectangular bandwidth (ERB) scale is usually used. Note that these two scales also have a functional form similar to Eq. (11). Therefore, for the purposes of this paper, we will only compare with the mel scale. We point out that Nearey (1978) also used a transformation of the type log(f + b), where b is a constant, which is equivalent to our proposed universal-warping function given in Eq. (10). The warping function λ(·) obtained from the affine model and the mel-warp function have a similar parametric form, but their behavioral similarity depends on the value of β (or, equivalently, of b). It can be seen that λ → mel/(2595 log_10 e) for "large" values of β and λ → log(f) for "small" values of β.
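The parametric similarity between λ(f) of Eq. (10) and the mel formula of Eq. (12) can be seen directly in code: when β is set equal to b = 700, the two functions are exactly proportional, differing only by the constant that converts between natural and base-10 logarithms (a sketch; the function names are ours):

```python
import math

def mel(f):
    """Widely used mel-scale formula, Eq. (12)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def affine_warp(f, beta):
    """Universal-warping function for the affine model, Eq. (10)."""
    return math.log(1.0 + f / beta)

# With beta = 700, mel(f) = 2595 * log10(e) * affine_warp(f, 700),
# since log10(x) = log10(e) * ln(x).
c = 2595.0 * math.log10(math.e)
for f in (100.0, 1000.0, 4000.0):
    assert abs(mel(f) - c * affine_warp(f, 700.0)) < 1e-9
```

With a different β (e.g., the values near 500 estimated in Sec. III C), the two curves are no longer proportional but remain similar in shape.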

As mentioned previously, we can extend the relationship for formants in Eq. (9) to spectral envelopes also. We can rewrite the affine model as f′ = α_rs f + β(α_rs − 1). It is easy to see that in the warped domain, λ = λ(f) = log(1 + f/β), the spectral envelopes are shifted versions of each other, i.e.,


P̃_r(λ) = P_r(f = β(e^λ − 1))
    = P_s(f′ = α_rs f + β(α_rs − 1))
    = P_s(α_rs β(e^λ − 1) + β(α_rs − 1))
    = P_s(β(e^{log α_rs + λ} − 1)) = P̃_s(λ + log α_rs). (13)

In Eqs. (10) and (13), we assume that β is a constant and also that it is a speaker-independent parameter. However, the value of this constant β is yet to be determined. In the next subsection, we describe in detail how we have estimated the value of β using actual formant data.

C. Estimation of α and β

The estimation of β is carried out using the formant data from the Peterson and Barney (1952) (PnB) and Hillenbrand et al. (1995) (HiL) vowel databases. The PnB database consists of 76 speakers (33 males, 28 females, and 15 children), with each speaker contributing two utterances for each of ten vowels (/aa/, /ae/, /ah/, /ao/, /eh/, /er/, /ih/, /iy/, /uh/, /uw/). Alternately, we can consider the PnB database as having 152 speakers (66 males, 56 females, and 30 children), with each speaker uttering ten vowels once. The HiL database effectively consists of 98 speakers (37 males, 33 females, 13 boys, and 15 girls), with each of them speaking only once for each of 12 vowels (/ae/, /ah/, /aw/, /eh/, /ei/, /er/, /ih/, /iy/, /oa/, /oo/, /uh/, /uw/). We have not considered other speakers in the HiL database since some of their formant estimates have been marked zero.

Every speaker in both of these databases is characterized by formant frequencies (F_1, F_2, F_3) for each vowel, and we create a "formant vector," f, by concatenating the formant frequencies of the different vowels spoken by that speaker. Therefore, each speaker is represented by a 30-dimensional formant vector for PnB (since there are ten vowels and three formants for each vowel) and by a 36-dimensional formant vector for HiL, which has 12 vowels. Let f_s,j be the formant vector of the jth subject speaker of a given database. Similarly, let f_r be the formant vector of the reference speaker. (We will discuss the selection of the reference speaker shortly.) f_r and f_s,j are defined as

f_r = [F_r,1 F_r,2 … F_r,n]^T,
f_s,j = [F_s,j1 F_s,j2 … F_s,jn]^T,  j = 1, 2, …, M,

where n = 30 and M = 152 for PnB, and n = 36 and M = 98 for HiL. We have adopted the following approach to estimate the value of β. Initially we assume β to be speaker dependent. From Eq. (7), the predicted formant frequency vector of the reference with respect to the jth subject speaker, i.e., f̂_r,j, is

f̂_r,j = α_j f_s,j + β_j(α_j − 1) 1, (14)

where α_j and β_j are the parameters of the jth speaker and 1 is the n-dimensional vector [1 1 … 1]^T. Note that β is also a function of j, the subject speaker; i.e., initially we assume β to be speaker dependent. The cost function to be minimized is


‖f_r − f̂_r,j‖² = Σ_{i=1}^{n} [F_r,i − α_j F_s,ji − β_j(α_j − 1)]², (15)

which is quadratic in α_j and β_j. α_j and β_j are estimated by minimizing Eq. (15) over S = {(α, β) ∈ R²_{++}} for each subject speaker, i.e.,

(α_j, β_j) = arg min_S ‖f_r − f̂_r,j‖². (16)

As one reviewer suggested, we could have tried a joint optimization of the α's and a fixed β over all speaker pairs, but there is no easy way to set non-negativity constraints for α and β. These constraints come from our empirical observations (discussed below) that β takes mostly positive values and α lies in the conventional range of 0.7–1.3. Another reviewer had suggested the use of generalized linear models and a variance-stabilizing transform, and had some doubts about the suitability of our approach for β estimation. One of the main reasons for the approach that we have followed below is the fact that the variance of the β estimates is affected by the value of the corresponding α in the affine model. We describe the approach below.

In Eqs. (14) and (15), we see that for the case of α = 1, β can take any value without affecting the minimization. In general, the variance of β becomes extremely large as α → 1. Therefore, to get reliable estimates of β, we should consider subject and reference speakers whose corresponding α are different from unity. Note that this constraint is necessary only to get reliable estimates of β. Once the speaker-independent β parameter has been estimated, we can take any pair of speakers, and their corresponding α can be unity. We have taken male-female and male-child combinations of speakers for the estimation of β. Figure 1(a) shows the estimates of β for the 37 male speakers of the Hillenbrand database with each of the 33 female speakers acting as a reference speaker. Similarly, Fig. 1(b) shows the estimates of β for the male speakers, with each child speaker acting as a reference. Figures 2(a) and 2(b) show the corresponding α estimates. In most of these cases, the α are quite different from unity, and therefore the variance of the β estimates is reasonably small. Further, as expected, we see that there is a higher variance for the β estimates when females are used as reference when compared to child reference speakers, since the α values are closer to unity. We also see in Fig. 1 that the β values are positive for most speaker pairs.

In the above experiment, we have taken each female (or child) speaker in the database as a reference speaker. For each subject speaker, we can average the estimates of α and β over the set of all female speakers, which would correspond to using an average female speaker. Instead of taking the sample mean, we use the following procedure. Let there be M (male) subject speakers and K female speakers. Let α_ij and β_ij be the parameters of the jth subject with respect to the ith female speaker as reference. From Eq. (6), we have

f̂_r,ij = α_ij f_s,j + β_ij(α_ij − 1) 1, (17)

where i = 1, 2, …, K and j = 1, 2, …, M. f̂_r,ij is the predicted formant frequency vector of the ith female speaker with respect to the jth subject speaker. α_ij and β_ij are estimated as (α_ij, β_ij) = arg min_S ‖f_r,i − f̂_r,ij‖². The predicted formant frequency vector of the average female (reference) speaker with respect to the jth subject speaker is given by

f̂_r,j = (1/K) Σ_{i=1}^{K} f̂_r,ij = (1/K) Σ_{i=1}^{K} [α_ij f_s,j + β_ij(α_ij − 1) 1]
    = [(1/K) Σ_{i=1}^{K} α_ij] f_s,j + [(1/K) Σ_{i=1}^{K} β_ij(α_ij − 1)] 1. (18)

It is clear from Eqs. (14) and (18) that

α_j = (1/K) Σ_{i=1..K} α_ij   and   β_j = [Σ_{i=1..K} β_ij (α_ij − 1)] / [Σ_{i=1..K} (α_ij − 1)].   (19)
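As a concrete illustration of the averaging in Eq. (19) (our own sketch, with synthetic α_ij and β_ij values rather than the paper's formant data), the computation can be written in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 33, 37                      # reference (female) and subject (male) speaker counts

# Synthetic pairwise estimates: alpha well away from unity, beta around 500
alpha = 1.15 + 0.05 * rng.standard_normal((K, M))    # alpha_ij
beta = 500.0 + 50.0 * rng.standard_normal((K, M))    # beta_ij

# Eq. (19): average over the K reference speakers for each subject j
alpha_j = alpha.mean(axis=0)                                        # (1/K) sum_i alpha_ij
beta_j = (beta * (alpha - 1.0)).sum(axis=0) / (alpha - 1.0).sum(axis=0)

# Database-level estimate, beta_mean = (1/M) sum_j beta_j
beta_mean = beta_j.mean()
print(alpha_j.shape, round(beta_mean, 1))
```

Because the weights (α_ij − 1) stay away from zero, the weighted average for β_j is stable, which is exactly the point of the pairing constraint discussed above.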

By this approach, the estimate of β_j has considerably less variance. The value of β_j for each male speaker with respect to the average female (reference) speaker is shown in Fig. 3(a). Similarly, the β_j of each male speaker with respect to the average child speaker is shown in Fig. 3(b). We can find the average β for the database by averaging over all M subject speakers, i.e., β_mean = (1/M) Σ_{j=1..M} β_j. The unconstrained mean value of β (for the Hillenbrand database) using a male-female combination is 575, and using a male-child combination it is 330. If one leaves out the two outliers in the male-female experiment, then the mean β is 370, which is in the same range as that obtained using the male-child combination.

FIG. 1. Estimates of β obtained for the 37 male speakers in the Hillenbrand data with each of the (a) female speakers and (b) child speakers in the database as reference. No constraint has been put on the range of β, which was searched over a very wide range (−20000, +20000).

Alternately, we have also computed the mean value of β using all (instead of only male) speakers from a given database, with the female (or child) speakers as reference. Note that in this case it is possible that for some pairs of speakers the α values are close to unity, with the corresponding β estimates having high variance. Therefore, we chose only those pairs of speakers whose β estimates were in the interval between 0 and 2000. This is reasonable since, from Figs. 3(a) and 3(b), we see that reliable estimates of β lie mostly in the region of 0–2000. Once again, we used individual female (or child) speakers as reference speakers and all the 98 speakers in the database as subject speakers. We then obtained averaged values of α_j and β_j over all the female (or child) reference speakers using Eq. (19). Having estimated β_j, its mean estimate β_mean is computed for a given database. The value of β_mean has been computed to be 508.04 for the PnB database and 495.67 for the HiL database. Similarly, the β_mean for child reference speakers was found to be 434.18 for the PnB database and 548.28 for the HiL database. As we will show, both in terms of warping-function behavior and speech recognition performance, these small variations in β do not have a significant effect.

FIG. 2. Estimates of α obtained for the 37 male speakers in the Hillenbrand data with each of the (a) female speakers and (b) child speakers in the database as reference. No constraint has been put on the range of α, which was searched over a very wide range (−20000, +20000).


The warping functions ν_PnB and ν_HiL for the PnB and HiL data, respectively, are given by

ν_PnB = log(1 + f/508.04),   (20a)

ν_HiL = log(1 + f/495.67).   (20b)

FIG. 3. Estimates of α and β obtained for the 37 male speakers in the Hillenbrand data with (a) the average female speaker and (b) the average child speaker in the database as reference. No constraint has been put on the range of α and β, which were searched over a very wide range (−20000, +20000).

As we mentioned previously, Nearey (1978) also investigated the use of a transformation of the type log(f+b), which is similar to our affine model. The main difference between our approach and that of Nearey (1978) is that the latter considers only the average formant values for males, females, and children for several languages, including the American English database of Peterson and Barney (PnB). On the other hand, we have considered pairwise all the speakers in the PnB and Hillenbrand (HiL) databases. We have then averaged the estimates using Eq. (19). Further, unlike Nearey (1978), we have considered all the formants and not just the first or second formant. One of the reasons Nearey was motivated to use a transform of the type log(f+b) was the systematic speaker-dependent variation in acoustic parameters. However, his analysis using first and


substantial improvement by using other frequency scales over the log scale. In this paper, however, we show that there is an improvement in speech recognition performance using our proposed affine model when compared to the linear-scaling model.

Equations (20a) and (20b) show that the universal-warping functions obtained for the affine model from the PnB and HiL databases are essentially the same and are functionally closer to the mel-warp function than to the log-warp function. Figure 4 shows the plot of the log-warp (λ), mel-warp (η_mel), and universal-warp (ν_PnB, ν_HiL) functions for the affine model. Since the value of β is almost the same for both the PnB and HiL databases, the universal-warp functions for these two databases appear the same. It is interesting to note that the affine-warp functions are almost the same as the mel-warp function, which is obtained by fitting a curve to the psychoacoustic experimental data of Stevens and Volkman (1940) and whose equation is given in Eq. (12). Therefore, our experiment on nonuniform speaker normalization, which is conducted on speech data alone, shows the required frequency warping to be close to the mel scale. This study thus justifies the use of the mel scale in speech recognition, not only from the psychoacoustic point of view but also from the viewpoint of nonuniform speaker normalization. We compare the normalization performance of the affine-warp functions with respect to the mel-warp and log-warp functions in terms of word error rate in Sec. IV.
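The closeness of the two warps can be checked numerically. The following sketch (ours, not part of the original study) compares the affine warp of Eq. (20b) with the common mel formula 2595 log10(1 + f/700), normalizing both curves to [0, 1] since the warps are compared only up to an overall gain:

```python
import numpy as np

f = np.linspace(0.0, 5000.0, 501)            # frequency axis in Hz

nu_hil = np.log(1.0 + f / 495.67)            # affine universal warp, Eq. (20b)
mel = 2595.0 * np.log10(1.0 + f / 700.0)     # common mel-scale formula

# Normalize each warp to [0, 1] so only the shapes are compared
nu_n = nu_hil / nu_hil[-1]
mel_n = mel / mel[-1]

max_dev = np.max(np.abs(nu_n - mel_n))
print(f"max shape deviation: {max_dev:.3f}")
```

The maximum deviation between the normalized curves stays below a few percent of the full range, consistent with the visual overlap in Fig. 4.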

FIG. 4. Comparison of different universal-warping functions (warped domain vs frequency f in Hz), including the log-warp (λ), the mel-warp (η_mel), and the affine-warp functions (ν_PnB, ν_HiL). The affine-warp functions for the Peterson and Barney (PnB) and Hillenbrand (HiL) databases almost overlap since they are functionally similar, as seen from Eqs. (20a) and (20b).

Apart from the nature of the warping function, we are also interested in studying the distribution of reestimated α values across genders. Using the fixed value of β_mean (which is assumed to be speaker independent), we reestimated α for each speaker. Figure 5 shows the histograms of α for the male, female, and child speakers of the PnB and HiL databases. The trend in the estimates of α across genders shows the existence of gender separability. Also, since the average female is considered as the reference subject, the female speakers are centered around a warping factor of α=1, while male and child subjects have α>1 and α<1, respectively. Although not shown here, a similar distribution of α estimates occurs


when we use the male reference speaker. As an illustration of the normalization process, in Fig. 6 we show the smoothed spectra of a male speaker appropriately normalized to match the female reference using the affine model. In the next subsection, we will compare the nature of frequency warping of the proposed affine model with other models commonly used in speech recognition.

D. Normalization models used in speech recognition

As we have mentioned previously, Fant's work (Fant, 1975) shows that there are significant deviations from the uniform scaling model of Eq. (1). He suggests that the scale factor be made a function of formant number and vowel category. Motivated by Fant's work, there has been an attempt in the speech recognition literature to make the scaling a function of both frequency and a speaker-dependent factor α_rs, i.e., g(α_rs, f). So, the relationship between the spectral envelopes of the reference and subject speakers is assumed to be

P_r(f) = P_s(g(α_rs, f) f).   (21)

FIG. 5. Histograms (number of subjects vs α) of the speaker-dependent parameter α in speaker normalization using the affine model for (a) the Peterson and Barney database (males, females, and children) and (b) the Hillenbrand database (males, females, and boys and girls).

FIG. 6. Illustration of affine warping on the vowel /eh/ (magnitude vs frequency in Hz): normalization of the spectral envelope of a male-subject speaker to a female-reference speaker using the affine model. The curves show the female reference, the male speaker, and the warped male speaker.

Note that we assume that there is only one speaker-dependent parameter α_rs for each speaker, and this factor is


the same for all phonemes. Some of the scaling functions include the one proposed by Eide and Gish (1996) and the power and bilinear transformations (Acero and Stern, 1991; McDonough et al., 1998). These are ad hoc models, in that they have no physiological or acoustic motivation. The main motivating factors for these models have been ease of implementation and parsimony of parameters. Therefore, until now no study has been done to understand the nature of frequency scaling in these models. In this subsection, we compare the nature of frequency scaling of these models with that of the proposed affine model.

Let g_u(α_rs, f), g_a(α_rs, f), g_e(α_rs, f), g_p(α_rs, f), and g_b(α_rs, f) be the frequency-dependent scale factors for the uniform, affine, Eide–Gish, power, and bilinear transformation models, respectively. The mathematical forms of these nonlinear frequency-scaling functions are given in Table I. F_N in these equations is the Nyquist frequency, which is 8 kHz for telephone speech.

We show the nature of the frequency-dependent scaling function for different models using the average female as the reference speaker and the average male as the subject speaker. In this subsection, we use formant data, and the average female formant vector is obtained by averaging the formant vectors of all female speakers in the database. This corresponds to the reference vector f_r. Similarly, the average male subject vector, f_s, is obtained by averaging all the male formant vectors in the database. The parameter α_rs for each of these scaling functions is estimated by fitting f_r = g(α_rs, f_s) f_s in the least-squares sense.

Table I shows the least-squares estimates of α_rs for each of the nonuniform scaling functions using the PnB and HiL data. Figure 7 shows the plot of g(α_rs, f) based on the α_rs estimates of Table I for the PnB and HiL databases. At low frequencies, many of the models (except Eide–Gish) show deviation from uniform scaling, with the nature of the deviation being approximately similar to the empirical observations in Umesh et al. (2002b) and Potamianos and Narayanan (2003). However, at higher frequencies, it is clear from Fig. 7 that only the affine model behaves similarly to uniform scaling, whereas the other

TABLE I. Frequency-dependent scale factor g(α_rs, f) between the average male and the average female for various speaker normalization schemes, using the vowel formant data from the Peterson and Barney (PnB) and Hillenbrand (HiL) databases. The α_rs are estimated using the least-squares criterion.

Method      Scale factor, g(α_rs, f)                                                              PnB    HiL
Uniform     g_u(f) = α_rs                                                                         1.17   1.15
Affine      g_a(f) = α_rs + β(α_rs − 1)/f                                                         1.14   1.12
Eide–Gish   g_e(f) = α_rs^(3f/F_N)                                                                1.20   1.16
Power       g_p(f) = (f/F_N)^(α_rs − 1)                                                           0.89   0.90
Bilinear    g_b(f) = (F_N/(2πf)) tan⁻¹[(1 − α_rs²) sin(2πf/F_N) / ((1 + α_rs²) cos(2πf/F_N) − 2α_rs)]   0.14   0.125

models exhibit completely different behaviors. Uniform scaling is more reasonable at higher frequencies since the higher formants are mostly affected by the length of the vocal tract. This study again confirms that the affine model is a better model for speaker normalization than the other ad hoc normalization models, and it may also explain the limited success of these ad hoc models, in terms of recognition performance, when compared to uniform scaling.
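The fitting procedure behind Table I can be sketched as follows (our own illustration with synthetic average formant vectors, not the PnB or HiL measurements; the functional forms follow Table I, with the bilinear case omitted for brevity):

```python
import numpy as np

FN = 8000.0                         # F_N as used in Table I

# Frequency-dependent scale factors g(a, f) from Table I
def g_uniform(a, f):   return np.full_like(f, a)
def g_affine(a, f, beta=500.0):  return a + beta * (a - 1.0) / f
def g_eide_gish(a, f): return a ** (3.0 * f / FN)
def g_power(a, f):     return (f / FN) ** (a - 1.0)

# Synthetic "average male" and "average female" formant vectors (Hz)
f_s = np.array([660.0, 1720.0, 2410.0])   # subject (male), hypothetical values
f_r = 1.15 * f_s                           # reference (female), toy uniform case

def fit(g):
    """Least-squares fit of a in f_r = g(a, f_s) * f_s via a fine grid search."""
    grid = np.linspace(0.5, 2.0, 3001)
    errs = [np.sum((f_r - g(a, f_s) * f_s) ** 2) for a in grid]
    return grid[int(np.argmin(errs))]

for name, g in [("uniform", g_uniform), ("affine", g_affine),
                ("Eide-Gish", g_eide_gish), ("power", g_power)]:
    print(f"{name:10s} alpha_rs = {fit(g):.3f}")
```

With the toy data constructed to satisfy exact uniform scaling, the uniform model recovers α_rs = 1.15, while the other models settle on whatever parameter best approximates that relationship under their own functional form.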

E. Comparison to piecewise constant scale factor model

In this subsection, we compare the universal-warping function obtained for the affine model with the empirical piecewise approximation of the universal-warping function estimated in Umesh et al. (2002b). For the affine model, the universal-warping function is given by Eqs. (20a) and (20b) when estimated using the PnB and HiL databases, respectively.

The following model is used in Umesh et al. (2002b) to describe the relationship between the formant frequencies of two speakers enunciating the same sound:

P_r(f) = P_s(α_rs^{λ_i} f),   L_i ≤ f ≤ U_i,   (22)

where α_rs depends on the pair of speakers r and s, and λ_i is purely a function of frequency (more precisely, of the chosen frequency band i). The frequency region of interest was divided into five logarithmically equispaced bands, where L_i and U_i are the lower and upper frequency boundaries of the ith frequency band. The above model can be thought of as modeling the deviation from uniform scaling by allowing λ_i the freedom to change from band to band. Note that if λ_i=1 for all i (or, more generally, if λ_i is the same for all i), then one gets back the uniform scaling model of Eq. (1). It has to be noted that for the model in Eq. (22), there is no exact closed-form relationship between formant frequencies, unlike the affine model, where the mathematical relation is given by Eq. (6). Further, the affine model has only two free parameters (α and β), while the piecewise approximation with five frequency bands has six parameters (λ_i, i=1,2,...,5, and α_rs), giving it more degrees of freedom.

FIG. 7. Frequency-dependent scale factor g(α_rs, f_s) as a function of subject formant frequency, f_s, for various speaker normalization schemes (uniform, affine, Eide–Gish, power, and bilinear) using Peterson and Barney and Hillenbrand vowel formant data (see Table I).

In Umesh et al. (2002b), the λ_i values are estimated using TIMIT vowel data, and the corresponding piecewise approximation to the universal-warping function was obtained. We refer to this scale as the speech scale. Since in this paper all the experiments have been done using PnB and HiL formant data, we have estimated the empirical universal-warping function for PnB and HiL using the procedure detailed in Umesh et al. (2002b). The empirically obtained curve is fitted to the parametric form in Eq. (11) using TABLECURVE2D. With a fitting accuracy of 97% for PnB and 99.6% for HiL, we obtain a continuous warping function ψ(f). The universal-warping functions ψ(f) for the PnB and HiL databases using the empirical method of Umesh et al. (2002b) are given as

ψ_PnB(f) = 2364.85 log(1 + f/588.05),   (23a)

ψ_HiL(f) = 2478.24 log(1 + f/641.94).   (23b)
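The parametric fit behind Eqs. (23a) and (23b) can be mimicked with a small least-squares search. As a sanity check, the sketch below (ours, standing in for the TABLECURVE2D tool) generates target values from Eq. (23b) itself and recovers its parameters:

```python
import numpy as np

f = np.linspace(100.0, 4000.0, 40)
psi = 2478.24 * np.log(1.0 + f / 641.94)      # synthetic target from Eq. (23b)

# Fit psi(f) = a * log(1 + f/b): for each candidate b, the best a is closed form
best = (np.inf, None, None)
for b in np.linspace(300.0, 1000.0, 1401):    # 0.5 Hz grid over b
    x = np.log(1.0 + f / b)
    a = (psi @ x) / (x @ x)                   # least-squares amplitude
    err = np.sum((psi - a * x) ** 2)
    if err < best[0]:
        best = (err, a, b)

_, a_hat, b_hat = best
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
```

The search recovers a and b close to the published values, confirming that the two-parameter form a·log(1 + f/b) is identifiable from samples of the warping curve.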

It is interesting to note from Eqs. (23a) and (23b) that the warping function ψ(f) is behaviorally closer to the affine-warp and mel-warp functions than to the log-warp function. The fact that we again get a "mel-like" curve is interesting since we have used a piecewise model to fit the formant data, which is quite different from the affine model. Since we obtain mel-like warping functions from two entirely different methods, we argue that the mel scale may be important in speaker normalization.

To summarize, the affine model provides a universal-warping function that is similar to the mel curve and the speech scale. In addition, at higher frequencies, the affine model behaves similarly to uniform scaling, which is a desired property. It also fits the formant data better than other standard nonuniform models. We therefore claim that the proposed affine model is a more appropriate model for nonuniform speaker normalization.

IV. NORMALIZATION PERFORMANCE IN ASR

In this section, we compare the normalization performance of the affine model of Eq. (6) with the commonly used linear model of Eq. (1) on a digit recognition task. We do not consider other models like the Eide–Gish, power, and bilinear transformations in this comparison because of their anomalous behavior, as seen in Fig. 7. We use the word error rate of the recognition system as a measure of normalization performance and study which of these models is best suited for speech recognition.

A. Task and database

The normalization experiments are evaluated on a telephone-based connected digit recognition task. The speech data for training the recognizer are derived from the Numbers v1.0cd corpus of the Oregon Graduate Institute. The training set consists of 6078 utterances from adult male and female speakers. The performance of the different normalization procedures is evaluated on two different test sets. The first test set, referred to as "adults," is derived from the Numbers corpus and consists of 2169 utterances from 790 adult male and 1379 female speakers other than those in training. The mismatched test set, referred to as "children," is derived from a corpus other than Numbers, which is not publicly available. It consists of 2798 utterances from speakers between 6 and 18 years of age. The children test set consists of 1225 boys and 1554 girls. All the utterances have variable digit string lengths and correspond to continuously spoken speech.

B. Front-end signal processing

Until now we have compared the models using formant data from PnB and HiL. In speech recognition, the features for recognition are derived from the smoothed spectral envelope. Cepstral coefficients are the most commonly used features in state-of-the-art automatic speech recognition systems. Most of these systems use mel-frequency cepstral coefficients (MFCCs), which are obtained from mel-filter-bank smoothed spectra.

In our implementation, the spectral smoothing is done using weighted overlap segment averaging (WOSA) (Nuttall and Carter, 1982), which is a variation of the averaged periodogram spectral estimation method. The details of the procedure are as follows. Speech signals are sectioned into 20 ms long frames (corresponding to 160 samples at 8 kHz sampling) with an overlap of 10 ms. A first-order backward difference signal is computed with a pre-emphasis factor of 0.97. A given pre-emphasized speech frame is segmented into a number of overlapping subframes of 64-sample width and 45-sample overlap, with each subframe being Hamming windowed. This choice of parameters is for telephone speech sampled at 8 kHz. An average autocorrelation estimate for a given frame is obtained by averaging the autocorrelation estimates of the subframes. The Fourier transform of the averaged autocorrelation estimate is essentially the smoothed spectral envelope. This method effectively suppresses the pitch since the duration of each subframe is less than the expected pitch period of an average adult male. In Umesh et al. (2004) and Sinha and Umesh (2008), we have compared this approach to the conventional MFCC front end for vocal-tract length normalization (VTLN) and have shown slightly better performance. We have tested and found that this smoothing also works well for female and child speakers.
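The WOSA smoothing described above can be sketched as follows (our own minimal implementation using the stated parameter values, not the authors' code; pre-emphasis is assumed to have been applied upstream):

```python
import numpy as np

def wosa_smoothed_spectrum(frame, sub_len=64, sub_overlap=45, nfft=512):
    """Average subframe autocorrelations, then FFT -> smoothed spectral envelope."""
    hop = sub_len - sub_overlap                      # 19-sample hop between subframes
    win = np.hamming(sub_len)
    acfs = []
    for start in range(0, len(frame) - sub_len + 1, hop):
        seg = frame[start:start + sub_len] * win
        acfs.append(np.correlate(seg, seg, mode="full"))   # subframe autocorrelation
    acf_mean = np.mean(acfs, axis=0)                 # averaged autocorrelation estimate
    return np.abs(np.fft.rfft(acf_mean, nfft))       # smoothed spectral envelope

# 20 ms frame at 8 kHz = 160 samples (white noise stands in for speech here)
frame = np.random.default_rng(1).standard_normal(160)
env = wosa_smoothed_spectrum(frame)
print(env.shape)
```

Because each subframe is only 64 samples (8 ms), the averaged autocorrelation cannot resolve typical adult pitch periods, which is how the method suppresses pitch structure.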

In this paper, we approach the problem of speaker normalization through the concept of the universal-warping function. In the universal-warped domain, the speaker-dependent factor separates out as a translation factor. Normalization is done by an appropriate shift of the universal-warped spectra. To compute the un-normalized features, the spectral features in the universal-warped domain are obtained by taking a nonuniform discrete Fourier transform of the averaged autocorrelation estimate obtained from WOSA. The choice of the number of points in the warped domain is critical, as it controls the resolution of the spectral shift. In our implementation, 64 points, spaced nonuniformly according to the universal-warping function, are considered between 270 and 3850 Hz. The universal-warped spectrum is then log compressed (in magnitude), and finally a discrete cosine transform (DCT) is applied to obtain the cepstral features.

We now discuss the computation of features during normalization. In conventional speaker normalization, the linear-scaling model of Eq. (1) is used. To obtain the normalized features, the speech signal is warped by resampling (Kamm et al., 1995), and then mel-filter-bank smoothing is applied to get the VTLN features. This is more efficiently implemented by appropriately VTLN-warping the mel-filter bank (Lee and Rose, 1998).

In our proposed approach to speaker normalization, normalized features are obtained by an appropriate shift in the universal-warped domain. If we use the log warping given in Eq. (4), then this corresponds to the linear-scaling model of Eq. (1). The only differences between conventional VTLN and this method are that we use WOSA for spectral smoothing and that normalization is done by appropriate shifts rather than by the frequency-scaling methods of Kamm et al. (1995) and Lee and Rose (1998). On the other hand, using the universal-warping functions of Eqs. (20a) and (20b) corresponds to using the affine model. The only difference between Eqs. (20a) and (20b) is the small difference in the β values obtained from the two databases. We also see that the universal-warping functions given in Eqs. (23a) and (23b) are very similar to the affine universal-warp functions, except that they have been empirically obtained using the model of Eq. (22). In all these cases, the universal-warped spectrum is appropriately shifted for normalization, then log compressed (in magnitude), and finally a DCT is applied to derive the normalized features.
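A minimal sketch of the warped-domain feature computation follows (ours, not the authors' code; the warp follows Eq. (20b), the shift is a translation on the warped grid, and the DCT is written out directly so the sketch stays self-contained):

```python
import numpy as np

def warped_log_spectrum(acf_mean, delta=0, npts=64, f_lo=270.0, f_hi=3850.0,
                        beta=495.67, fs=8000.0):
    """Nonuniform DFT of the averaged autocorrelation on a universal-warped grid,
    translated by `delta` warped-domain bins, then log compressed."""
    nu = lambda f: np.log(1.0 + f / beta)            # universal warp, Eq. (20b)
    nu_inv = lambda v: beta * (np.exp(v) - 1.0)
    v = np.linspace(nu(f_lo), nu(f_hi), npts)        # uniform grid in warped domain
    f = nu_inv(v + delta * (v[1] - v[0]))            # shift = translation in warped domain
    n = np.arange(len(acf_mean))
    dft = np.exp(-2j * np.pi * np.outer(f / fs, n)) @ acf_mean   # nonuniform DFT
    return np.log(np.abs(dft) + 1e-10)

def cepstra(logspec, ncep=13):
    """Type-II DCT of the warped log spectrum -> cepstral features."""
    npts = len(logspec)
    basis = np.cos(np.pi * np.outer(np.arange(ncep), (np.arange(npts) + 0.5) / npts))
    return basis @ logspec

acf = np.random.default_rng(2).standard_normal(128)  # toy autocorrelation estimate
feats = cepstra(warped_log_spectrum(acf, delta=1))   # one-bin shift normalization
print(feats.shape)
```

The key design point is that a speaker-dependent warp factor becomes a pure translation on the uniform warped grid, so normalization reduces to choosing a single integer shift.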

C. Base line speech recognizer

The digit recognizer is developed using the hidden Markov model toolkit (HTK). The recognition of digit strings is treated as a task without restriction on the string length. Eleven digit models are generated for 1–9, "zero," and "oh." The digits are modeled as whole-word hidden Markov models (HMMs) with the following parameters: 16 states per word, simple left-to-right models without skips over states, and a mixture of five Gaussians with diagonal covariance matrices per state. Silence is modeled separately using two models. The first, called "sil," consists of three states with a transition structure as suggested in Hirsch and Pearce (2000). This HMM models the silences before and after the utterance and uses a mixture of six Gaussians per state. The second silence model, called "sp," is used to model pauses between words. It consists of a single state, which is tied to the middle state of the first silence model. A 39-dimensional feature vector comprising normalized log energy, the C1–C12 (excluding C0) base cepstral coefficients, and their first- and second-order derivatives is used. Finally, the cepstral features are liftered, and cepstral mean subtraction is performed, as followed conventionally.


D. Estimation of normalization factor

In ASR, there is no concept of a reference or "golden" speaker. Therefore, the estimation of the normalization factor is conventionally done in a maximum likelihood (ML) framework by comparison against a statistical model built using data from all speakers. In conventional VTLN, the linear-scaling model of Eq. (1) is used, and the scale factor α_rs for every speaker is estimated using the maximum likelihood criterion. In the case of shift-based normalization, the use of the universal-warping function results in the spectral envelopes being translated versions of one another in the warped domain. Speaker normalization is done by an appropriate shift in the warped domain (Sinha and Umesh, 2002, 2008). We estimate this shift factor in a maximum likelihood framework using a method analogous to the ML estimation of the scale factor in conventional VTLN. We briefly describe the procedure.

Let X_i^δ = {x_{i,1}^δ, ..., x_{i,T}^δ} be the set of T feature vectors obtained after the δ-shift in the universal-warped domain for an utterance from speaker i. Note that each vector x_{i,t}^δ is the feature vector obtained for the tth frame of speech after applying a shift of δ_i in the universal-warped domain. Let W_i denote the transcription of the utterance from speaker i. The transcription W_i is necessary for aligning the models with the acoustic data. During testing, this transcription is obtained by a first-pass recognition using un-normalized features and the statistical model. If Λ denotes a set of given HMM models trained on a large population of speakers, then the optimal translation factor δ̂_i for speaker i is obtained by maximizing the likelihood of the shifted utterances with respect to the model and the transcription, i.e.,

δ̂_i = arg max_δ Pr(X_i^δ | Λ, W_i).   (24)

The optimal translation factor is obtained by a grid search over a range of values. In our experiments, we search over a range of seven shifts between −3 and +3. The method of Lee and Rose (1998) involves a grid search of the scale factor α over 0.88 ≤ α ≤ 1.12 in steps of 0.04.
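The grid search of Eq. (24) can be outlined as below; the log-likelihood here is a toy Gaussian stand-in, since a real system would score the shifted features against the HMM set Λ given the transcription W_i:

```python
import numpy as np

# Toy "utterance": the model expects features centered at 0, but this speaker's
# warped-domain features are offset by +2 bins, so the best shift should be -2.
model_mean = np.zeros(8)
utterance = np.full(8, 2.0)

def log_likelihood(delta):
    """Stand-in for log Pr(X_i^delta | Lambda, W_i): Gaussian score of the
    shifted features against the model mean (a real system would use the HMMs)."""
    shifted = utterance + delta                 # shift applied in the warped domain
    return -0.5 * np.sum((shifted - model_mean) ** 2)

# Eq. (24): grid search over the seven shifts between -3 and +3
delta_hat = max(range(-3, 4), key=log_likelihood)
print(delta_hat)
```

The same seven-point search is used for both training and test speakers, just as the α grid of Lee and Rose (1998) is used in conventional VTLN.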

E. Recognition performance

We now discuss the recognition performance using our proposed affine model and the corresponding universal-warping functions given in Eqs. (20a) and (20b). We will compare it with the linear model, whose universal-warping function is the log-warping function λ = log(f). We will also compare its performance to the empirically obtained universal-warping functions (discussed in Sec. III E) given by Eqs. (23a) and (23b). For all these models, we use shift-based normalization, and the performances are compared in a common framework. We also show the performance of conventional MFCC, which uses the linear model and filter-bank smoothing.

We now discuss the recognition setup. When there is no normalization, the features are computed without shifting in the universal-warped domain. For each such universal-warping function, we first build the HMM using the un-normalized cepstral features of the training speakers. This is


referred to as the base line system. To evaluate the performance during normalization, it is necessary to train the HMM with the normalized features of the training speakers. For each universal-warping function, we estimate the ML shift factor for the training speakers with respect to the base line model for that universal-warping function using Eq. (24). We then apply the appropriate shift in the universal-warped domain and compute the corresponding normalized features. The normalized HMM model is built from the normalized features of the training speakers. This is also the procedure followed in all conventional VTLN implementations, since there is no "reference" or golden speaker on whose speech the HMM model could be built. Using the first-iteration normalized HMM model, we again estimate the ML shift factor for the training speakers. Using the new estimates of the shift, we again compute normalized features, which are then used to build the second-iteration normalized HMM. This process is iterated three times to refine the normalized HMM model (Lee and Rose, 1998). During testing, the shift factor for each test speaker is again estimated using the maximum likelihood criterion. Similar to conventional VTLN, the transcription W_i for speaker-factor estimation during testing is obtained from a first recognition pass using un-normalized features and the base line model. After appropriate normalization using the estimated shift factor, the corresponding features of the test speaker are used for recognition.
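In outline, the iterative normalized-training procedure reads as follows (our paraphrase of the steps above, with toy scalar "features" and stub model functions purely to mark the control flow; a real system would train and score HMMs here):

```python
def train_hmm(utts):
    # Stub "model": the mean feature value of the training set
    return sum(utts) / len(utts)

def estimate_shift(u, hmm, shifts=range(-3, 4)):
    # ML shift of Eq. (24), with a toy Gaussian stand-in likelihood
    return max(shifts, key=lambda d: -((u + d) - hmm) ** 2)

def normalize(u, d):
    return u + d

def normalized_training(train_utts, n_iter=3):
    """Iterative training with shift-based normalization (cf. Lee and Rose, 1998)."""
    hmm = train_hmm(train_utts)                      # base line model, un-normalized
    for _ in range(n_iter):                          # three refinement passes
        normed = [normalize(u, estimate_shift(u, hmm)) for u in train_utts]
        hmm = train_hmm(normed)                      # re-train on normalized features
    return hmm

# Scalar "features" standing in for whole utterances
model = normalized_training([4.0, 5.0, 6.0, -1.0])
print(model)
```

The loop mirrors the text: estimate shifts against the current model, re-normalize, re-train, and repeat three times.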

Table II shows the word error rates before normalization (base line), Eb, and after normalization, En. The first row, marked "Conventional MFCC," is obtained using the conventional filter-bank front end, with normalization done using the linear-scaling model (see Lee and Rose, 1998). All the other results are obtained under the common framework of shift-based normalization with WOSA spectral smoothing. In the shift-based normalization scheme, we need to use an appropriate universal-warping function for each model. We compare the performance of the following models:

(i) The linear-scaling model, whose corresponding universal-warping function is the log-warping function λ = log(f).

�ii� The proposed affine model in Eq. �7� whose

TABLE II. Word error rate of various frequency-wrepresent the word error rates before normalization �bin brackets refers to the equation number of the univ

Warping function Scaling

ConventionalMFCC

Linear

� �Eq. �4�� Linear�PnB �Eq. �20a�� Nonlinear�HiL �Eq. �20b�� Nonlinear�PnB �Eq. �23a�� Nonlinear�HiL �Eq. �23b�� Nonlinear�=700 �Eq. �12�� Nonlinear

universal-warping function �PnB and �HiL is given by

J. Acoust. Soc. Am., Vol. 124, No. 3, September 2008 Bharath Kuma

tion subject to ASA license or copyright; see http://acousticalsociety.org/co

Eqs. �20a� and �20b� for PnB and HiL, respectively.Note that these two universal-warp functions are verysimilar with � in the neighborhood of 500. We haveshown recognition results for both to illustrate the factthat small differences in � values do not affect theperformance significantly.

�iii� The empirically determined universal-warp functions,�PnB and �HiL of Eqs. �23a� and �23b� obtained byfitting the piecewise approximation to the universal-warp function of Umesh et al. �2002b, 2002c�.

�iv� Finally, note that all the nonuniform scaling modelshave universal-warping functions that are parametri-cally similar to the mel scale. In speech recognitionthe mel scale is given by Eq. �12�, with b �or �� being700. Therefore, we have also compared the perfor-mance using this universal-warping function, �=700,with �=700. Note that this universal-warp function��=700� corresponds to the affine model of Eq. �7�with �=700.

Some important observations from Table II are as fol-lows:

�1� The shift-based normalization scheme along with WOSAfront-end performs significantly better than the conven-tional MFCC front-end with scale-based normalization.This can be seen from the top two rows where we com-pare the performance of � with conventional MFCC bothof which correspond to uniform/linear-scaling model. InUmesh et al. �2004� and Sinha and Umesh �2008�, wediscuss the reasons for this improvement.

�2� All nonlinear normalization models perform significantlybetter than conventional linear-scaling model. This canbe seen by comparing the performance of the �=log�f�warp with the nonlinear warp functions. Note that allthese methods have used WOSA front-end.

(3) All the nonlinear models have a similar functional form for the universal-warp function, with only b varying. Since they all have similar performances, it may be concluded that small changes in the b value do not significantly affect recognition performance.
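Observation (3) can be made concrete: universal warps of the form ν(f) = log(1 + f/b) for b between roughly 500 and 700 are nearly indistinguishable over the telephone speech band once each is rescaled to a common range. The comparison below is our own sketch (the values b = 641.94 and b = 700 come from the text; rescaling each warp to [0, 1] is an assumed way of factoring out the affine freedom):

```python
import math

def norm_warp(freqs, b):
    """nu(f) = log(1 + f/b), rescaled to [0, 1] over the sampled band."""
    v = [math.log(1.0 + f / b) for f in freqs]
    lo, hi = v[0], v[-1]
    return [(x - lo) / (hi - lo) for x in v]

freqs = [100.0 + 10.0 * i for i in range(391)]   # 100 Hz ... 4000 Hz

w700 = norm_warp(freqs, 700.0)
w642 = norm_warp(freqs, 641.94)   # b value reported in the text
w500 = norm_warp(freqs, 500.0)

# Maximum deviation from the b = 700 warp over the band:
max_dev_642 = max(abs(a - c) for a, c in zip(w642, w700))
max_dev_500 = max(abs(a - c) for a, c in zip(w500, w700))
print(max_dev_642, max_dev_500)   # both small; b = 641.94 is closer to b = 700
```

Even the b = 500 warp deviates from the b = 700 warp by only a few percent of the warped range, which is consistent with the similar recognition performances in Table II.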

TABLE II. Recognition performance of different universal-warping functions on a digit recognition task. E_b and E_n are the error rates before normalization (baseline) and after normalization, respectively. The number in parentheses refers to the equation of the warping function used.

                                       Adults           Children
Warping function                       E_b     E_n      E_b      E_n
Conventional MFCC                      3.48    2.98     14.38    9.35
ν = log(f) (Eq. (4)), linear           3.24    2.85     13.04    9.20
ν_PnB (Eq. (20a)), nonlinear           2.97    2.56     13.49    8.00
ν_HiL (Eq. (20b)), nonlinear           3.04    2.57     13.42    8.02
ν_PnB (Eq. (23a)), nonlinear           3.09    2.53     13.59    7.88
ν_HiL (Eq. (23b)), nonlinear           3.03    2.49     13.60    7.77
ν, b = 700 (Eq. (12)), nonlinear       3.02    2.52     13.73    7.96
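To quantify the gains in Table II, the relative error-rate reductions for the best nonlinear warp, ν_HiL of Eq. (23b), can be computed from its tabulated E_b and E_n values (the numbers are transcribed from the table; the script itself is ours):

```python
# Error rates for the nu_HiL (Eq. (23b)) warp, transcribed from Table II.
rates = {
    "adults":   {"before": 3.03,  "after": 2.49},
    "children": {"before": 13.60, "after": 7.77},
}

# Relative reduction: 100 * (E_b - E_n) / E_b for each speaker group.
rels = {g: 100.0 * (r["before"] - r["after"]) / r["before"]
        for g, r in rates.items()}

for group, rel in rels.items():
    print(f"{group}: {rel:.1f}% relative error-rate reduction")
# adults: ~17.8%, children: ~42.9%
```

The much larger relative gain for children than for adults is consistent with the point that normalization matters most when test speakers differ sharply from typical training speakers.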

We summarize from Table II that nonuniform normalization is more appropriate for speaker normalization than uniform normalization. More importantly, we have a parametric affine model to describe the nonuniform relationship between speakers.

This study also shows the normalization performance for different values of b for the warping function in Eq. (11). The value b = 641.94 seems to be marginally better than b = 700 in terms of recognition performance, and is interestingly closer to b = 657.6 as computed in Umesh et al. (1999) for the Stevens and Volkman (1940) data.

V. DISCUSSION AND CONCLUSION

We have proposed a parametric model that affinely relates the formant frequencies of any two speakers enunciating the same sound and have shown the corresponding universal-warping function to be similar to the mel scale. We also justified the validity of this model by showing a detailed comparison to other models in terms of formant-fitting error and recognition performance. Hence, we claim that the proposed affine model is a more appropriate model for nonuniform speaker normalization than the conventional uniform scaling and other ad hoc nonuniform models.

The other important aspect of the paper is in showing a possible connection between speaker normalization and psychoacoustics. We claim so because the proposed warping function, computed from speech data alone, behaves very similarly to the mel scale, which has been obtained from psychoacoustic studies. While the mel scale is used widely in speech recognition, we believe that this is the first time it has been shown that the mel scale is also important for speaker normalization.

ACKNOWLEDGMENTS

We would like to thank Professor Terrance Nearey for reading the manuscript carefully and giving us many useful comments and suggestions, especially with respect to the material in Sec. III. We would also like to thank the anonymous reviewers for their suggestions, which have helped to improve the manuscript. This work was supported in part by the Department of Science and Technology, Ministry of Science and Technology, India under SERC Project No. SR/S3/EECE/0008/2006.

Acero, A., and Stern, R. M. (1991). "Robust speech recognition by normalization of the acoustic space," in Proceedings of IEEE ICASSP, Toronto, Canada, pp. 893–896.

Bladon, R. A. W., Henton, C. G., and Pickering, J. B. (1983). "Towards an auditory theory of speaker normalization," Language and Communication 4, 59–69.

Eide, E., and Gish, H. (1996). "A parametric approach to vocal tract length normalization," in Proceedings of IEEE ICASSP '96, Atlanta, USA, pp. 346–348.

Fant, G. (1975). "A non-uniform vowel normalization," Technical report, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, Sweden.

Hillenbrand, J., Getty, L., Clark, M., and Wheeler, K. (1995). "Acoustic characteristics of American English vowels," J. Acoust. Soc. Am. 97, 3099–3111.

Hirsch, H. G., and Pearce, D. (2000). "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in ISCA ITRW ASRU 2000, Automatic Speech Recognition: Challenges for the New Millenium.

Kamm, T., Andreou, G., and Cohen, J. (1995). "Vocal tract normalization in speech recognition: Compensating for systematic speaker variability," in Proceedings of the 15th Annual Speech Research Symposium, Johns Hopkins University, Baltimore, pp. 175–178.

Lee, L., and Rose, R. C. (1998). "A frequency warping approach to speaker normalization," IEEE Trans. Speech Audio Process. 6, 49–59.

McDonough, J., Byrne, W., and Luo, X. (1998). "Speaker normalization with all-pass transforms," in Proceedings of ICSLP '98, Sydney, Australia.

Miller, J. D. (1989). "Auditory-perceptual interpretation of the vowel," J. Acoust. Soc. Am. 85, 2114–2134.

Nearey, T. M. (1978). "Phonetic feature systems for vowels," Technical report, Indiana University Linguistics Club.

Nearey, T. M. (1992). "Applications of generalized linear modeling to vowel data," in Proceedings of ICSLP '92, Canada.

Nordström, P. E., and Lindblom, B. (1975). "A normalization procedure for vowel formant data," in International Congress on Phonetic Sciences, Leeds, England.

Nuttall, A. H., and Carter, G. C. (1982). "Spectral estimation using combined time and lag weighting," Proc. IEEE 70, 1115–1125.

Peterson, G. E., and Barney, H. L. (1952). "Control methods used in a study of the vowels," J. Acoust. Soc. Am. 24, 175–184.

Potamianos, A., and Narayanan, S. (2003). "Robust recognition of children's speech," IEEE Trans. Speech Audio Process. 11, 603–616.

Sinha, R., and Umesh, S. (2002). "Non-uniform scaling based speaker normalization," in Proceedings of IEEE ICASSP '02, Orlando, pp. 589–592.

Sinha, R., and Umesh, S. (2008). "A shift-based approach to speaker normalization using non-linear frequency-scaling model," Speech Commun. 50, 191–202.

Stevens, S. S., and Volkman, J. (1940). "The relation of pitch to frequency: A revised scale," Am. J. Psychol. 53, 329–353.

Syrdal, A. K., and Gopal, H. S. (1983). "Perceived critical distances between F1–F0, F2–F1, F3–F2," J. Acoust. Soc. Am. 74, S88–S89.

Umesh, S., Cohen, L., and Nelson, D. (1999). "Fitting the mel scale," in Proceedings of IEEE ICASSP '99, pp. 217–220.

Umesh, S., Cohen, L., and Nelson, D. (2002a). "Frequency warping and the mel scale," IEEE Signal Process. Lett. 9, 104–107.

Umesh, S., Cohen, L., and Nelson, D. (2002b). "The speech scale," ARLO 3, 83–88.

Umesh, S., Kumar, S. V. B., Vinay, M. K., Sharma, R., and Sinha, R. (2002c). "A simple approach to non-uniform vowel normalization," in Proceedings of IEEE ICASSP '02, Orlando, USA, pp. 517–520.

Umesh, S., Sinha, R., and Kumar, S. V. B. (2004). "An investigation into front-end signal processing for speaker normalization," in Proceedings of IEEE ICASSP '04, Montreal, USA.

Wakita, H. (1977). "Normalization of vowels by vocal-tract length and its application to vowel identification," IEEE Trans. Acoust., Speech, Signal Process. ASSP-25, 183–192.

Wegmann, S., McAllaster, D., Orloff, J., and Peskin, B. (1996). "Speaker normalization on conversational telephone speech," in IEEE ICASSP '96, Atlanta, USA, pp. 339–341.

Zhan, P., and Waibel, A. (1997). "Vocal tract length normalization for large vocabulary continuous speech recognition," Technical report, School of Computer Science, CMU, Pittsburgh, USA.

Zhan, P., and Westphal, M. (1997). "Speaker normalization based on frequency warping," in Proceedings of IEEE ICASSP '97, Munich, Germany, pp. 1039–1042.