
Robust speech recognition using the modulation spectrogram

Brian E.D. Kingsbury a,b,*, Nelson Morgan a,b, Steven Greenberg a,b

a International Computer Science Institute, Suite 600, 1947 Center Street, Berkeley, CA 94704-1105, USA
b University of California at Berkeley, Berkeley, CA 94720, USA

Received 1 September 1997; received in revised form 1 January 1998; accepted 1 March 1998

Abstract

The performance of present-day automatic speech recognition (ASR) systems is seriously compromised by levels of acoustic interference (such as additive noise and room reverberation) representative of real-world speaking conditions. Studies on the perception of speech by human listeners suggest that recognizer robustness might be improved by focusing on temporal structure in the speech signal that appears as low-frequency (below 16 Hz) amplitude modulations in subband channels following critical-band frequency analysis. A speech representation that emphasizes this temporal structure, the "modulation spectrogram", has been developed. Visual displays of speech produced with the modulation spectrogram are relatively stable in the presence of high levels of background noise and reverberation. Using the modulation spectrogram as a front end for ASR provides a significant improvement in performance on highly reverberant speech. When the modulation spectrogram is used in combination with log-RASTA-PLP (log RelAtive SpecTrAl Perceptual Linear Predictive analysis), performance over a range of noisy and reverberant conditions is significantly improved, suggesting that the use of multiple representations is another promising method for improving the robustness of ASR systems. © 1998 Elsevier Science B.V. All rights reserved.


Speech Communication 25 (1998) 117–132

* Corresponding author. Tel.: +1 510 642 4274 ext. 143; fax: +1 510 643 7684; e-mail: [email protected].



Keywords: Robust speech recognition; Reverberation

1. Introduction

Automatic speech recognition (ASR), which was once little more than a laboratory curiosity, is now being deployed in a wide range of tasks, including telephony and desktop dictation. As ASR technology emerges from the laboratory, however, fundamental shortcomings in the most widely used recognition algorithms and speech representations are becoming apparent. While humans are capable of reliably understanding speech across a broad range of environmental conditions and speaking styles, automatic recognition systems are much more fragile. Recognition methods developed on corpora of carefully-enunciated speech collected under laboratory conditions yield impressive results on "clean" speech, but their results are much less encouraging on material that contains realistic levels of acoustic interference such as background noise, spectral shaping, and room reverberation. Although research on channel robustness in ASR has produced techniques that greatly mitigate the deleterious effects of acoustic distortions on recognizer accuracy, these techniques generally fail when different, unanticipated distortions are encountered.

While it is likely that ASR systems will always work best on data similar to that on which they have been trained, considerable progress can still be made towards reducing the sensitivity of ASR systems to mismatches between training data and speech data received during actual operation. We believe that a key to progress in this area is the development of speech representations and recognition algorithms that better use information based on specific temporal properties of speech. These representations and recognition methods should be more robust because the temporal characteristics of speech appear to be less variable than static characteristics in the presence of acoustic interference. In this paper, we focus on the development of a speech representation, the modulation spectrogram, 1 that emphasizes the temporal structure in speech. We begin with a review of current ASR technology, and next describe characteristics of human speech perception that point to the importance of temporal information. Then, we discuss the implementation of a temporally-oriented speech representation and its use as a stable visual display of speech.

1 It should be noted that, while the representation described here and the speech representation described in Kollmeier and Koch (1994) share the name "modulation spectrogram", they are quite distinct in the details of their signal processing and overall motivations.


Finally, we describe ASR experiments comparing its utility to that of the RASTA-PLP representations for recognizing clean, reverberant, and noisy speech.

2. Speech recognition by machines

Although the details of their implementations differ, most ASR systems employ the same basic approach to recognize speech. First, an incoming signal is analyzed to produce feature vectors that describe its short-time spectral envelope. Usually, the spectral analysis is based on a segment of roughly 20 ms, and the feature vectors are produced at a rate of around one every 10 ms. The feature vectors are then classified, with the output of the classifier being a set of distances between the feature vector and a set of phone classes. This classification is often made on the basis of features from a single frame, possibly augmented with differential features derived from three to five frames. Finally, a search based on dynamic programming or best-first search is used to find the most likely sequence of words, given the sequence of phone distances, a set of hidden Markov models characterizing the words in the recognizer's vocabulary, and a grammar that describes the structure of the word sequences that the recognizer is expected to process.
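
To make the first stage concrete, the following is a minimal sketch in Python/NumPy of slicing a waveform into roughly 20 ms analysis frames produced every 10 ms, as described above. It is illustrative only; the window shape and all names are our own, not code from any system discussed in this paper.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Slice a waveform into overlapping, windowed analysis frames."""
    frame_len = int(fs * frame_ms / 1000.0)
    hop_len = int(fs * hop_ms / 1000.0)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    # Index matrix: one row of sample indices per frame.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

fs = 8000                         # telephone-bandwidth sampling rate
x = np.random.randn(fs)           # one second of stand-in "speech"
print(frame_signal(x, fs).shape)  # (99, 160): ~100 vectors per second
```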

This framework is powerful and flexible, enabling the development of a broad array of recognizers, from digit recognizers that work over telephone lines to desktop dictation systems capable of recognizing tens of thousands of different words. However, recognizers based on this framework, including our own, still fall well short of human performance in natural conditions, including moderate to high levels of background noise, moderate or greater levels of room reverberation, and spontaneous speech. While this gap between human and machine performance has many causes, an especially important one is that the automatic systems do not take full advantage of the relatively invariant temporal encoding of phonetic information in speech because they focus almost exclusively on short segments of the acoustic signal.

3. Speech recognition by humans

A central result from the study of human speech perception is the importance of slow changes in the speech spectrum for speech intelligibility. These changes appear as low-frequency amplitude modulations with rates of below 16 Hz in subband signals following spectral analysis. The first evidence for this perspective emerged from the development of the channel vocoder in the early 1930s (Dudley, 1939), when Homer Dudley and his colleagues at Bell Labs found that they could synthesize high quality speech from spectral shape estimates that were lowpass filtered at 25 Hz. Since then, studies on speech intelligibility in noisy and reverberant rooms (Houtgast and Steeneken, 1973, 1985) and over communication channels that may impose nonlinear distortions (Steeneken and Houtgast, 1980) have demonstrated the link between intelligibility and the fidelity with which these slow modulations are transmitted. Direct perceptual experiments have shown that modulations at rates above 16 Hz are not required, and that significant intelligibility remains even if modulations at rates of 6 Hz and below are the only ones preserved (Drullman et al., 1994).

A second key to human speech recognition is the integration of phonetic information over relatively long intervals of time. In recognizing speech with inserted gaps of silence, listeners appear to be capable of associating sounds whose onsets are separated by as much as 200 ms (Huggins, 1975). When speech is temporally decorrelated by analyzing it into critical bands, randomly time-shifting the bands with respect to one another, and resynthesizing the speech from the shifted bands, it is found that listeners are remarkably tolerant of the distortion. While the quality of the speech is clearly affected, the intelligibility of the speech is 90% for shifts of 100 ms and nearly 70% for shifts of 160 ms (Arai and Greenberg, 1998; Greenberg and Arai, 1998).

This focus on long time segments in human speech recognition may explain much of its robustness to acoustic interference. Basing speech recognition on modulations (changes in the signal) reduces sensitivity to relatively stationary forms of interference such as spectral coloration and some forms of noise.


The use of relatively slow modulations and the integration of phonetic information over hundreds of milliseconds reduce the sensitivity of listeners to transient interference. It has also been shown that the slower modulations in speech are the least influenced by reverberation (Houtgast and Steeneken, 1973).

4. Incorporating temporal information into automatic speech recognition

It seems likely, then, that the robustness of ASR systems could be enhanced by using longer-time information, both at the level of the front-end speech representation and at the level of phonetic classification. Indeed, many improvements in recognizer robustness have already been attained by using temporal information. Biphones, triphones, and other context-dependent phonetic units provide one means for modeling coarticulation and integrating information over longer periods of time. Many robust front-end processing methods such as delta features (Furui, 1986), RASTA and other modulation filtering methods (Langhans and Strube, 1982; Hirsch, 1988; Hermansky and Morgan, 1994), cepstral-time matrices (Milner and Vaseghi, 1995), and articulatory front ends (Ramsay and Deng, 1995) provide an enhanced representation of the dynamics of the speech signal.

As direct precursors of the modulation spectrogram, the log-RASTA-PLP and J-RASTA-PLP front ends are of particular interest. The two RASTA-PLP algorithms are temporal processing extensions to perceptual linear prediction (PLP) (Hermansky et al., 1985; Hermansky, 1990), an algorithm for producing a spectral shape estimate that reflects properties of the human auditory system. In PLP, incoming speech is segmented into overlapping, fixed-length frames that are typically around 20 ms in duration with about a 10 ms overlap. A windowed FFT is performed on each frame, and the square of the magnitude of each component is computed to produce a power spectrum. Each power spectrum is convolved with a set of overlapping, trapezoidal filters that are equally spaced on the Bark scale, to approximate critical-band frequency resolution. The critical-band power spectra are mapped to an approximation of perceptual loudness via a static equal-loudness weighting and cube-root compression. The perceptually-warped power spectra are then approximated by an autoregressive all-pole model, and the resulting linear prediction coefficients are converted into cepstral coefficients. The first eight to twelve cepstral coefficients plus the energy term comprise the PLP output.
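
The perceptual warping at the heart of PLP can be sketched compactly. The example below pools a power spectrum into Bark-spaced bands and applies cube-root compression; it is deliberately partial, assuming Hermansky's Bark warping formula, substituting triangular weights for PLP's trapezoidal filters, and omitting the equal-loudness weighting, all-pole modeling, and cepstral conversion described above. The band count is arbitrary.

```python
import numpy as np

def bark(f):
    # Bark warping used by PLP (Hermansky, 1990): 6*asinh(f/600).
    return 6.0 * np.arcsinh(f / 600.0)

def warped_spectrum(power_spec, fs, n_bands=15):
    """Pool one frame's rfft power spectrum into Bark-spaced bands
    and apply cube-root (intensity-to-loudness) compression."""
    n_fft = (len(power_spec) - 1) * 2
    z = bark(np.arange(len(power_spec)) * fs / n_fft)
    centers = np.linspace(z[1], z[-1], n_bands)
    width = centers[1] - centers[0]
    bands = np.empty(n_bands)
    for i, c in enumerate(centers):
        # Triangular weight standing in for PLP's trapezoidal filter.
        w = np.clip(1.0 - np.abs(z - c) / width, 0.0, None)
        bands[i] = np.sum(w * power_spec)
    return np.cbrt(bands)

# Example: warp one 20 ms frame of noise sampled at 8 kHz.
frame = np.random.randn(160) * np.hamming(160)
spec = np.abs(np.fft.rfft(frame, 256)) ** 2
print(warped_spectrum(spec, 8000))
```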

The general RASTA-PLP algorithm (Hermansky and Morgan, 1994) is an extension to PLP that attempts to model the sensitivity of human speech perception to preceding context (Summerfield et al., 1984) and apparent insensitivity to absolute spectral shape (Lea and Summerfield, 1994). This is accomplished by interposing a compressive nonlinearity, a bandpass filter, and an expansive nonlinearity between the output of the critical-band filtering and the input to the loudness approximation in PLP. The bandpass filter is a differentiator followed by a leaky integrator, and passes modulations between 1 and 12 Hz. log-RASTA-PLP, which increases the robustness of ASR systems to spectral shaping of speech, uses a logarithm for the compressive nonlinearity and an exponential for the expansive nonlinearity. J-RASTA-PLP (Morgan and Hermansky, 1992), which increases the robustness of ASR systems to joint spectral shaping and additive noise, uses the function y = ln(1 + Jx) for the compressive nonlinearity and its inverse x = (e^y - 1)/J for the expansive nonlinearity. The compressive nonlinearity is approximately linear for small values of Jx and approximately logarithmic for large values of Jx. During recognition, the J parameter is varied in inverse proportion to an estimate of the noise power in the incoming speech signal, so that channels with low power relative to the estimated noise are processed to suppress the noise, while channels with high power relative to the estimated noise are processed to suppress spectral shaping.
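
A sketch of the two RASTA variants follows. The bandpass filter coefficients are the commonly cited ones from Hermansky and Morgan (1994); the small floors inside the logarithms, the fixed J argument, and the clamping of negative expanded values are our own simplifications (in a real system J tracks an on-line noise estimate).

```python
import numpy as np
from scipy.signal import lfilter

# RASTA modulation filter (Hermansky and Morgan, 1994): an FIR
# differentiator divided by a leaky integrator, passing roughly
# 1-12 Hz modulations at a 100 Hz frame rate.
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
A = np.array([1.0, -0.98])

def log_rasta(cb_power):
    """log-RASTA: log compression, bandpass filtering along time,
    exponential expansion. cb_power is (frames, bands)."""
    y = lfilter(B, A, np.log(cb_power + 1e-10), axis=0)
    return np.exp(y)

def j_rasta(cb_power, J=1e-6):
    """J-RASTA: y = ln(1 + J*x) compression, the same bandpass
    filter, and the inverse x = (exp(y) - 1)/J as expansion.
    Filtered values can expand to small negatives; clamp them."""
    y = lfilter(B, A, np.log1p(J * cb_power), axis=0)
    return np.maximum(np.expm1(y) / J, 0.0)
```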

Our research group is currently pursuing two approaches to making better use of temporal information in ASR systems. First, we are investigating the development of new front-end representations for speech recognition that attempt to capture and represent the temporal structure of speech using signal processing methods suggested by studies of human speech perception.


This work is being done in a flexible framework that is a generalization and extension of the RASTA-PLP algorithms. This framework uses an FIR filterbank instead of the short-time Fourier transform for frequency analysis, performs automatic gain control and modulation filtering separately, and uses different modulation filters than RASTA-PLP. We have used both visualization and ASR experiments to develop these representations (Greenberg and Kingsbury, 1997; Kingsbury and Morgan, 1997; Kingsbury et al., 1997). In parallel work described elsewhere (Wu et al., 1998; Greenberg, 1997), the use of syllables as a fundamental unit of speech recognition is being explored. The use of multiple streams of information, either from different front-end representations or from recognizers using different units of recognition, is central to both lines of inquiry.

5. Visualizing speech with the modulation spectrogram

We began our current study of speech representations that emphasize temporal structure with the development of a visual speech representation insensitive to room reverberation and noise. The result of this development, which we call the modulation spectrogram, displays the distribution of slow modulations in the speech signal across time and frequency. Although it is not a detailed model of auditory processing, a number of processing steps are incorporated into the modulation spectrogram that capture key aspects of the auditory cortical representation of speech (cf. Schreiner and Urbas, 1988). Namely, it displays amplitude modulations at rates of 0–8 Hz, with a peak sensitivity at 4 Hz, in roughly critical-band-wide channels, and includes automatic gain control and peak enhancement mechanisms.

Fig. 1 illustrates our implementation of the modulation spectrogram. A spectral analysis into roughly critical-band-wide channels is performed on an incoming speech signal, sampled at 8 kHz, using an eighteen-channel FIR filterbank. The filters have a roughly trapezoidal magnitude response, with minimal overlap between adjacent channels. The filter bandwidths are set according to a slightly modified version of Greenwood's cochlear place-to-frequency mapping function (Greenwood, 1961).

Fig. 1. Diagram of the signal processing used to produce modulation spectrographic displays of speech. The processing attempts to capture, in a simple manner, key aspects of the auditory cortical representation of speech. These aspects are critical-band frequency resolution, automatic gain control (adaptation), sensitivity to low-frequency amplitude modulations and enhancement of spectro-temporal peaks.


In each channel, an amplitude-envelope signal is computed by half-wave rectification and lowpass filtering with a cutoff frequency of 28 Hz. Each amplitude-envelope signal is then downsampled by a factor of 100 to reduce subsequent processing time and to match the typical data rate into the phonetic classification stage of an ASR system. Each envelope signal is normalized by computing the average level over an entire utterance and dividing by this magnitude. The slow modulations in each normalized envelope signal are then analyzed by filtering the signal through a complex bandpass filter and taking the log of the magnitude of the output. The filter response is a Hamming window modulated by a 4 Hz complex exponential, so the result of the filtering and magnitude calculation is equivalent to computing the FFT over the signal, windowed with a 250-ms Hamming window, and computing the log magnitude of the 4-Hz component. The filter has a passband of 1.25–6.72 Hz, and passes significant modulation energy in the 0–8 Hz range. The log magnitudes are then plotted in spectrographic format, with a normalization and thresholding applied such that the peak level over all channels is set to 0 dB, and levels more than 30 dB below this peak are all set to -30 dB. Bilinear smoothing is used to produce the final image.
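
The processing chain just described can be summarized in code. The sketch below follows the constants in the text (a 28 Hz envelope cutoff, downsampling by 100, a 250 ms Hamming window modulated by a 4 Hz complex exponential, 0 dB peak normalization, and a -30 dB floor), but the filterbank band edges are a crude log spacing standing in for the modified Greenwood map, and the filter lengths are our own choices; the bilinear display smoothing is omitted.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def modulation_spectrogram(x, fs=8000, n_bands=18, down=100):
    """Display-oriented modulation spectrogram, following Fig. 1."""
    # 1. Critical-band-like FIR analysis. Log-spaced edges stand in
    #    for the modified Greenwood place-to-frequency map.
    edges = np.geomspace(100.0, 0.95 * fs / 2.0, n_bands + 1)
    lp = firwin(1023, 28.0, fs=fs)           # 28 Hz envelope lowpass
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        h = firwin(255, [lo, hi], pass_zero=False, fs=fs)
        band = lfilter(h, 1.0, x)
        # 2. Envelope: half-wave rectify, lowpass, downsample by 100.
        env = lfilter(lp, 1.0, np.maximum(band, 0.0))[::down]
        # 3. Normalize by the average level over the utterance.
        envs.append(env / (env.mean() + 1e-10))
    E = np.array(envs)                       # (bands, frames) at 80 Hz
    # 4. Complex modulation filter: 250 ms Hamming window modulated
    #    by a 4 Hz complex exponential; log magnitude of the output.
    fr = fs / down
    t = np.arange(int(0.250 * fr)) / fr
    h4 = np.hamming(len(t)) * np.exp(2j * np.pi * 4.0 * t)
    M = 20.0 * np.log10(np.abs(lfilter(h4, 1.0, E, axis=1)) + 1e-10)
    # 5. Peak normalization to 0 dB and a -30 dB display floor.
    M -= M.max()
    return np.maximum(M, -30.0)
```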

Displays of clean and corrupted versions of the same utterance appear more alike in the modulation spectrographic format than in the traditional spectrographic format, and the syllabic structure of the speech is emphasized in the modulation spectrographic display. Fig. 2 illustrates these properties of the modulation spectrogram. It shows modulation spectrograms and wideband spectrograms of clean and noisy versions of the telephone-bandwidth utterance "seventy-three" produced by a male speaker. The noisy version of the utterance was produced by mixing the clean utterance with pink noise at an overall signal-to-noise ratio of 0 dB. The wideband spectrograms were produced by filtering the input signal, sampled at 8 kHz, with a pre-emphasis filter with a transfer function H(z) = 1 - 0.94z^-1, then computing 256-point FFTs over an 8 ms Hamming window, zero-padded to a length of 32 ms, once every 2 ms. To facilitate comparison with the modulation spectrogram, the peak value in each wideband spectrogram was normalized to 0 dB, and levels below -30 dB were set to -30 dB. Note that Fig. 2 is much clearer in color. A color version of this figure is available on the World-Wide Web at http://www.icsi.berkeley.edu/real/specom_fig2_color.gif.
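
For comparison, the wideband spectrogram computation just quoted is simple enough to state directly. This sketch uses exactly the stated parameters; only the FFT bookkeeping and a small log floor are added.

```python
import numpy as np
from scipy.signal import lfilter

def wideband_spectrogram(x, fs=8000):
    """Wideband spectrogram with the parameters quoted in the text:
    pre-emphasis H(z) = 1 - 0.94 z^-1, an 8 ms Hamming window
    zero-padded to 256 points (32 ms), computed every 2 ms."""
    x = lfilter([1.0, -0.94], [1.0], x)
    win, hop, nfft = int(0.008 * fs), int(0.002 * fs), 256
    w = np.hamming(win)
    cols = []
    for start in range(0, len(x) - win, hop):
        frame = x[start:start + win] * w
        spec = np.abs(np.fft.rfft(frame, nfft)) ** 2
        cols.append(10.0 * np.log10(spec + 1e-10))
    S = np.array(cols).T          # (frequency bins, frames)
    S -= S.max()                  # peak normalized to 0 dB
    return np.maximum(S, -30.0)   # clip 30 dB below the peak
```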

The wideband spectrogram of the clean speech portrays a significant amount of spectro-temporal detail. Sharp onsets and pitch pulses are clearly visible, as are formant trajectories. In the wideband spectrogram of the noisy speech, however, only a general indication of the energy distribution of the speech, including formant trajectories, is evident. The modulation spectrogram of the clean speech is much less detailed than the wideband spectrogram. It provides a rough picture of the energy distribution over time and frequency, but there is no representation of harmonic structure, and pitch pulses and onsets are smeared out by the modulation filtering. The gross distribution of energy over time and frequency, however, is the information that is best preserved in the presence of acoustic interference. As a result, the modulation spectrogram of speech is more stable than the wideband spectrogram.

The vertical black bars in the spectrograms in Fig. 2 indicate the onsets of the syllables [s eh], [v ih n dcl], [d iy], and [th r iy] that make up the utterance. Most of the energy that appears in the modulation spectrographic display falls between these onsets, corresponding to syllabic nuclei.

This enhancement of the segmental structure of speech and representational stability are produced by several processing steps. The filtering that emphasizes modulations at rates of 0 Hz to 8 Hz, with peak sensitivity at 4 Hz, acts as a matched filter for signals with temporal structure characteristic of speech. Modulations at rates around 4 Hz correspond to syllables (Houtgast et al., 1980; Greenberg et al., 1996). The critical-band-like frequency resolution expands the representation of the low-frequency, high-energy portion of the speech signal, and the thresholding emphasizes spectro-temporal peaks in the signal that rise above the noise floor.


Fig. 2. Modulation spectrograms and wideband spectrograms of clean and noisy (0 dB SNR pink noise) versions of the telephone-bandwidth utterance "seventy-three". The black vertical bars mark syllable onsets. A time-aligned phonetic labeling using ARPAbet symbols is provided at the top of each spectrogram. The symbol "dcl" marks a d-closure. This figure is much clearer in color. A color version of it is available on the World-Wide Web at http://www.icsi.berkeley.edu/real/specom_fig2_color.gif.


6. Automatic speech recognition with the modulation spectrogram

The stability of the modulation spectrogram is useful not only for producing relatively invariant displays of speech, but also for more robust ASR. Table 1 shows the results of a set of experiments comparing the performance of recognizers that use PLP, log-RASTA-PLP, J-RASTA-PLP, or the modulation spectrogram, either singly or in combination, on six different test sets: a clean set, a reverberant set produced by convolving all the clean test set utterances with a real room impulse response having a reverberation time of 0.5 s and a direct-to-reverberant energy ratio of 0 dB, and four noisy test sets generated by adding pink noise from the NOISEX database to the clean utterances at SNRs of 30, 20, 10 and 0 dB. Because the recognizers are trained only on clean speech, these results illustrate the relative invariance of the different front ends in the presence of acoustic interference. Note that all recognition results are reported in terms of word error rate, which is the sum of the number of word substitutions, deletions and insertions divided by the total number of words in the test set. Substitutions, deletions and insertions are found by aligning the reference and recognized word strings using a dynamic programming routine.
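
The scoring metric can be stated compactly. Below is a minimal sketch of word-error-rate computation via Levenshtein alignment; it stands in for, but is not identical to, the actual scoring program used in these experiments.

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    with the error counts found by dynamic-programming alignment
    of the reference and recognized word strings."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i            # deleting all reference words
    for j in range(m + 1):
        d[0][j] = j            # inserting all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

print(word_error_rate("thirty one thirty five".split(),
                      "thirty one thirteen five".split()))  # 0.25
```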

The experiments summarized in Table 1 were performed using a connected-word, hybrid hidden Markov model/multilayer perceptron (HMM/MLP) recognizer in which phone probabilities are estimated from acoustic features using an MLP and speech decoding is performed with a Viterbi search (Bourlard and Morgan, 1994). A variant of the modulation spectrographic features, described in more detail below, was compared against PLP, log-RASTA-PLP, and J-RASTA-PLP features supplemented with delta features. The PLP, log-RASTA-PLP and J-RASTA-PLP front ends used a 25 ms window for spectral analysis, and produced a vector of nine cepstral and nine delta-cepstral coefficients (including the energy term) at a rate of one vector every 10 ms. The delta-cepstral features were calculated from the cepstral features by regression over a nine-frame window centered on the current frame. The modulation-spectrographic front end produced an output vector of thirty features, produced by applying two different modulation filters to 15 channels, at a rate of one every 10 ms. The MLP phonetic classifier for the PLP, log-RASTA-PLP, and J-RASTA-PLP features had 162 input units, corresponding to the current frame, the four previous frames and the four next frames, 488 hidden units, and 56 output units, while the MLP phonetic classifier for the modulation spectrogram features had 270 input units, also corresponding to the current frame, the four previous frames and the four next frames, 328 hidden units, and 56 output units. A multiple-pronunciation lexicon with simple duration modeling and a backoff bigram grammar was used for recognition. To ensure a good match between the MLP phone probability estimator and the lexicon, an embedded Viterbi training procedure was used in which a trained recognizer was used to label the training set via a forced-alignment procedure and a new recognizer was trained on the new labels.
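
The regression over a nine-frame window used for the delta-cepstral features has a standard closed form; the sketch below implements it, with edge frames padded by repetition (a common convention, not necessarily the one used in these experiments).

```python
import numpy as np

def delta_features(c, K=4):
    """Delta-cepstra by linear regression over a (2K+1)-frame window
    (K = 4 gives the nine-frame window described in the text).
    c has shape (frames, coefficients)."""
    pad = np.pad(c, ((K, K), (0, 0)), mode='edge')
    num = sum(k * (pad[K + k:len(c) + K + k] - pad[K - k:len(c) + K - k])
              for k in range(1, K + 1))
    return num / (2.0 * sum(k * k for k in range(1, K + 1)))
```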

Table 1
Percent word error rates for clean, moderately reverberant (T60 = 0.5 s) and noisy speech using PLP, log-RASTA-PLP, J-RASTA-PLP and the modulation spectrogram. Statistically significant differences range from 0.9 on the clean test to 1.7 on the 0 dB SNR test. The criterion for statistical significance is p < 0.05 using a one-tailed significance test based on a normal approximation to a binomial distribution.

Experiment                  Features                Clean  Reverb  30 dB SNR  20 dB SNR  10 dB SNR  0 dB SNR
Baseline                    PLP                     6.4    37.6    28.3       43.5       60.7       78.8
                            Log-RASTA               6.4    26.0    11.4       16.3       27.8       51.6
                            J-RASTA                 6.6    27.9    15.6       23.5       35.7       54.4
                            Mod. spec.              8.5    27.3    14.6       22.9       38.7       61.5
Combined probabilities      PLP & Log-RASTA         5.7    26.9    15.9       26.6       43.7       67.3
                            PLP & Mod. spec.        6.1    29.1    20.5       36.1       53.5       71.3
                            Log-RASTA & Mod. spec.  5.5    20.1    10.4       14.7       23.2       44.7
Double num. of MLP params.  Log-RASTA               5.9    26.1    10.8       16.4       29.7       54.7
                            Mod. spec.              8.2    27.9    14.4       22.1       39.8       65.3



The modulation-spectrographic features used for these recognition experiments differ from those used to produce speech displays, such as those in Fig. 2. Some changes do not significantly alter the representation. For example, the features used for the recognition experiments employed a filterbank with quarter-octave frequency resolution and modulation filters based on a Kaiser window rather than a Hamming window. Other changes are more significant. The recognition features used no thresholding of modulation energy, while the features used for producing displays had a threshold of 30 dB below a global peak level. Also, the recognition features used cube-root compression of the output instead of log compression, and used two modulation filters corresponding to the real and imaginary components of the complex modulation filter used to produce the visual displays. These changes were inspired by a series of ASR experiments described in Section 7.

The experiments were performed on a subset of the Numbers corpus, a collection of speech data distributed by the Center for Speech and Language Understanding at the Oregon Graduate Institute. The Numbers corpus consists of spontaneous, continuous utterances collected over the telephone and digitized at an 8-kHz sampling rate with a 16-bit A/D converter. The subset used for these experiments does not contain any words cut off by the automatic segmentation used in the data collection, nor does it contain any out-of-vocabulary words. All of the material was phonetically transcribed by experienced transcribers. The vocabulary comprises thirty words and two filled pauses, and is restricted to numbers, including confusable sets such as "seven", "seventeen" and "seventy". A sample utterance from the corpus is "thirty-one thirty-five". For the recognition experiments, a training set containing 3590 utterances (a total of 13873 words) and a disjoint test set of 1206 utterances (a total of 4673 words) were used. Utterances in the subset range in length from one to ten words, with a mean of 3.9 words. The distribution of utterance durations is roughly Gaussian, with a mean of 1.7 s and a standard deviation of 0.7 s.

Two sets of experiments were performed. In the baseline experiments, a recognizer was trained on clean speech and its performance on the six different test sets was measured. In a second set of experiments, phone probability estimates from pairs of MLPs trained on PLP, log-RASTA-PLP, and the modulation spectrogram were combined and then used for recognition. The MLPs used as phonetic classifiers estimate posterior probabilities, that is, the probability of a phone given the acoustic data. These posterior probabilities are converted to scaled phone likelihoods by dividing by the phone priors, estimated from a labeling of the MLP training set. The combination of phone probabilities is accomplished by multiplying together the scaled phone likelihoods. This method works best when the MLPs tend to make independent errors and produce relatively flat output distributions for difficult-to-classify inputs. Because recognizers using these combined probabilities effectively have twice as many parameters in their phonetic classifiers, the results of these tests are compared with recognizers that use a single feature set and have twice as many weights in their MLP phonetic classifiers, to provide a fair reference.
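
A minimal sketch of this frame-by-frame combination: posteriors from the two MLPs are divided by the phone priors to give scaled likelihoods, and the streams are multiplied (summed in the log domain). The per-frame renormalization at the end is purely for numerical convenience; a constant per-frame scaling does not change the outcome of the Viterbi search.

```python
import numpy as np

def combine_posteriors(post_a, post_b, priors):
    """Combine two MLP phone posterior streams frame by frame.
    Arrays are (frames, phones); priors is (phones,)."""
    # Scaled likelihood of each stream: posterior / prior (in logs).
    log_sl = (np.log(post_a + 1e-10) + np.log(post_b + 1e-10)
              - 2.0 * np.log(priors + 1e-10))
    log_sl -= log_sl.max(axis=1, keepdims=True)   # avoid underflow
    sl = np.exp(log_sl)
    return sl / sl.sum(axis=1, keepdims=True)
```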

This combination of evidence occurs on a frame-by-frame basis. Other combination methods have also been used successfully in combining the RASTA-PLP and modulation spectrogram features, including combination on a whole-utterance basis via rescoring of N-best lists (Wu et al., 1998) and combination on a syllable-by-syllable basis using a two-level dynamic programming search (Wu, 1998). All three methods achieve similar performance on clean speech. On reverberant speech the frame-by-frame and whole-utterance methods achieve comparable levels of performance, and the syllable-by-syllable method performs significantly better, with a word error rate of 17.6% on the test set used in the experiments in Table 1.

In the baseline experiments the log-RASTA-PLP features give the best performance across all conditions, including reverberation.


The J-RASTA-PLP and modulation spectrographic features give roughly equal performance in the reverberant, 30-dB SNR, and 20-dB SNR cases, with J-RASTA-PLP outperforming the modulation spectrographic features in the clean, 10-dB SNR, and 0-dB SNR cases. Except on clean speech, where it matches the performance of log-RASTA-PLP, the simple PLP front end exhibits the worst performance. In this experiment, the front ends that incorporate some form of temporal processing emphasizing slow modulations (log-RASTA-PLP, J-RASTA-PLP, and the modulation spectrogram) are more robust to acoustic interference than PLP, the one front end that performs no such modulation filtering. We are not certain why log-RASTA-PLP outperforms the modulation spectrogram and J-RASTA-PLP matches it on these tasks, when earlier experiments have shown that J-RASTA-PLP has better performance than log-RASTA-PLP on noisy speech and that modulation spectrogram features are better than log-RASTA-PLP and J-RASTA-PLP on highly reverberant speech. In any case, these results should be interpreted with care because the more sophisticated, multiple-pronunciation lexicon with duration modeling used in the most recent tasks was trained on a recognition pass that used log-RASTA-PLP features with deltas, while the earlier experiments used a lexicon that was not specialized for any of the feature sets.

When phone probability estimates from pairs of MLPs are combined and used for recognition, the combination of the modulation spectrogram and log-RASTA-PLP is the only one to provide a significant improvement over the baseline log-RASTA recognition. This improvement is achieved on the reverberant, 20-dB SNR, 10-dB SNR, and 0-dB SNR tests. Performance on the clean and 30-dB SNR tests is also improved, but the difference is not statistically significant. The performance of the recognizers using log-RASTA-PLP or modulation spectrogram features and twice as many MLP weights is either the same as or slightly worse than that of their respective baseline recognizers.

The PLP, log-RASTA-PLP, and modulation spectrogram features have significantly different temporal properties. PLP does no temporal processing, while log-RASTA-PLP strongly enhances the onsets of sounds (cf. Fig. 8 in Hermansky and Morgan, 1994) and the modulation spectrogram enhances syllabic nuclei, as shown in Fig. 2. This may explain the success of the log-RASTA-PLP and modulation spectrogram combination. Both feature sets have enhanced robustness due to their focus on slow modulations in speech, but they tend to emphasize distinct temporal portions of the signal, and are thus somewhat independent.

7. Optimizing the modulation spectrogram for automatic speech recognition

The modulation spectrogram was originally developed as a visual representation of speech. In general, representations of speech intended for visual display do not give the best performance when they are used as representations for ASR. Therefore, before performing the experiments described in Section 6, the modulation spectrographic representation was optimized for use in ASR. A series of recognition experiments was performed in which different variants of the modulation spectrogram were tested on clean and reverberant speech. The best variant representation was then used for the recognition experiments described above.

To accomplish a faster turnaround, these preliminary recognition experiments used a smaller subset of the Numbers corpus, with a training set of 875 utterances (a total of 3315 words) and a test set of 657 utterances (a total of 2426 words). This subset used a slightly larger vocabulary of 36 words and no filled pauses. The vocabulary of this subset included all the words in the subset described in Section 6, plus the words "a", "and", "dash", "double", "hyphen" and "thousand". Two versions of the test set were used: a clean set and a reverberant set that was generated by digitally convolving all of the utterances in the clean set with an impulse response designed to match the gross acoustic characteristics of a highly reverberant hallway approximately 6.1 m long, 2.4 m high and 1.7 m wide with concrete walls, floor and ceiling.


Reverberation times in different frequency bands were estimated from a simultaneous recording of speech produced in the hallway using a head-mounted, close-talking microphone and an omnidirectional microphone located on the floor, roughly 2.5 m from the speaker. Table 2 summarizes the estimated reverberation times. A sample of Gaussian white noise was analyzed into identical subbands, and each noise band was modulated with a decaying exponential envelope matched to the estimated reverberation times. The modulated noise bands were added together to produce the reverberant tail of the impulse response, while the early reflections were estimated using a time-domain image expansion simulation. The ratio of direct to reverberant energy was set manually to match the recording from the omnidirectional microphone. The resulting impulse response has a gross reverberation time of about 2 s and a direct-to-reverberant energy ratio of -16 dB.
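
A sketch of the reverberant-tail construction follows, under stated assumptions: the subband filters, their lengths, and the final normalization are our own choices; only the band edges and T60 values (Table 2) and the exponential-decay rule come from the text. The early-reflection simulation and the manual direct-to-reverberant scaling are omitted.

```python
import numpy as np
from scipy.signal import firwin, lfilter

# Estimated subband reverberation times from Table 2.
BANDS = [(0, 250, 3.1), (250, 500, 2.6), (500, 1000, 2.2),
         (1000, 2000, 1.6), (2000, 4000, 1.4)]

def reverberant_tail(fs=8000, dur=4.0, seed=0):
    """White Gaussian noise split into the Table 2 subbands, each
    modulated by an exponential whose energy falls 60 dB over that
    band's T60, then summed into a reverberant tail."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(int(dur * fs))
    t = np.arange(len(noise)) / fs
    tail = np.zeros_like(noise)
    for lo, hi, t60 in BANDS:
        hi = min(hi, 0.499 * fs)        # keep the top edge below Nyquist
        if lo == 0:
            h = firwin(257, hi, fs=fs)  # lowest band is a lowpass
        else:
            h = firwin(257, [lo, hi], pass_zero=False, fs=fs)
        band = lfilter(h, 1.0, noise)
        # Amplitude envelope 10**(-3 t / T60) gives -60 dB at t = T60.
        tail += band * 10.0 ** (-3.0 * t / t60)
    return tail / np.max(np.abs(tail))
```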

These experiments also used a simpler recognition system than the earlier experiments. The PLP, log-RASTA-PLP, and J-RASTA-PLP representations were not supplemented with delta features, and the energy term (zeroth-order cepstral coefficient) was not included in the feature vector. Also, all four front ends produced one feature vector every 12.5 ms instead of one vector every 10 ms. To compensate for the lack of delta features, a longer acoustic context window of fifteen frames was used as input to the MLP phonetic classifiers. The MLPs were roughly 15% smaller, with the PLP, log-RASTA-PLP, and J-RASTA-PLP classifiers containing 120 input units, corresponding to the current frame, the previous seven frames and the next seven frames, 512 hidden units, and 56 output units, and the modulation spectrogram classifier containing 225 input units, corresponding to the current frame, the previous seven frames and the next seven frames, 320 hidden units, and 56 output units. For recognition, a single-pronunciation lexicon of word models with no duration modeling and a class bigram grammar were used. Embedded Viterbi training was used to ensure a good match between the MLP phone probability estimator and the lexicon.

7.1. Initial recognition experiments

In the first set of experiments, summarized in Table 3, three different types of experiment were carried out. In the baseline experiment, a recognizer for each of PLP, log-RASTA-PLP, J-RASTA-PLP, and the version of the modulation spectrogram used to produce the visual displays was trained on the clean training set, then tested on the clean and highly reverberant test sets. In the second type of experiment, phone probability estimates from MLPs trained in the baseline experiment on different feature sets were combined and used for recognition.

Table 2
Estimated subband reverberation times for a highly reverberant hallway

Freq. band (Hz)    Reverb. time (s)
0–250              3.1
250–500            2.6
500–1000           2.2
1000–2000          1.6
2000–4000          1.4

Table 3
Percent word error rates for clean and highly reverberant (specified in Table 2) speech using the PLP, log-RASTA-PLP, J-RASTA-PLP and modulation spectrogram features. Statistically significant differences range from 1.6 for the clean test with combined probabilities to 2.2 for the reverberant test with the baseline recognizers. The criterion for statistical significance is p < 0.05 using a one-tailed significance test based on a normal approximation to a binomial distribution.

Experiment              Features           Clean error  Reverb error
Baseline                PLP                15.8         70.1
                        Log-RASTA          14.5         72.7
                        J-RASTA            15.1         77.3
                        Mod. spec.         30.1         65.2
Combined probabilities  PLP & Log-RASTA    11.9         68.7
                        PLP & Mod. spec.   13.6         64.1
Train on reverb         PLP                72.5         48.5
                        Mod. spec.         45.4         43.5


In the third type of experiment, recognizers using the modulation spectrogram and PLP features were trained on a reverberant version of the training set, then tested on the clean and reverberant test sets.

On the baseline clean tests, there is no significant difference in performance between the PLP, log-RASTA-PLP and J-RASTA-PLP recognizers, while the modulation spectrogram recognizer performs significantly worse. On the reverberant test, however, there are significant differences between all scores, with the modulation spectrogram recognizer performing best. When phone probabilities from the PLP and log-RASTA-PLP MLPs are combined, the best recognition performance on the clean test set is attained, and performance on the reverberant test set is significantly improved. Combining probabilities from the PLP and modulation spectrogram MLPs gives the second-best performance on the clean test set and the best performance on the reverberant test set. The improvement in performance observed when the probabilities are combined is not the result of an increased number of parameters in the phonetic classifier: using only one feature set and doubling the number of weights in the MLP does not improve recognizer performance. The difficulty of the reverberant test set is illustrated by the results of the third experiment, in which reverberant speech was used for training. Even when the training and testing conditions for the recognizer are matched, the word error rates on the test set are still in the 44–50% range.

We also performed a human listening experiment in order to obtain a measure of human recognition performance under comparable conditions. For this experiment, three subjects who were native speakers of American English, had no known hearing impairments, and were experienced at phonetic transcription of speech were asked to perform word-level transcription of the reverberant test set. The subjects were given a list of the words present in the test set, to provide them with the same knowledge available to the automatic recognition systems. The order of presentation of the utterances was randomized to prevent the listeners from learning speaker characteristics. The subjects were allowed to listen to each utterance as many times as they wished, and were given initial training on ten utterances from the reverberant training set to familiarize them with the task. During transcription of the test set, the subjects had no feedback on their transcription accuracy. The utterances were produced by the 16-bit D/A converter in a SPARC-5 workstation at a sampling rate of 8 kHz, and were presented through headphones at a comfortable listening level in a quiet office. The listeners' transcriptions were scored by the program used for scoring machine recognition results.

The human listeners had considerably less difficulty with the reverberant test set than the automatic recognizers did. The human listeners had an average word error rate of 6.1%, roughly ten times less error than the best automatic system. Although transcription accuracy on the clean test set was not measured, we note that the word error rate for humans transcribing utterances from the TI DIGITS corpus was 0.105% (Leonard, 1984).

7.2. Finding the important processing steps

The results of these initial ASR experiments were encouraging, with the modulation spectrogram features giving the best performance on highly reverberant speech and, when combined with PLP, giving good performance on clean speech. However, it would be preferable for the modulation spectrogram features to give good performance on clean speech without combining them with other feature sets. We therefore performed a second series of experiments aimed at understanding which steps in the modulation spectrogram processing were important for robustness in reverberation and for improving the performance on clean speech.

First, different variants of the modulation spectrogram in which some processing steps were omitted were used as ASR front ends, and the performance of recognizers using these variant representations was measured on clean and reverberant speech. The steps that were optionally omitted in the variants were:


1. the normalization of the envelope signals by their long-term averages,
2. the complex filtering that measures modulation levels in the 0–8 Hz range,
3. the normalization of the global peak to 0 dB, and
4. the thresholding of levels more than 30 dB below the global peak.

When all of these steps are omitted, the resulting front end produces the log of the squared subband envelope signals, a simple filterbank-based power spectrum. Table 4 summarizes the results of these experiments.

The most significant result is that thresholding, which is vital for the production of stable visual displays of speech in low signal-to-noise ratio or highly reverberant conditions, has deleterious effects on the performance of automatic recognition systems. Eliminating the thresholding reduces the error rate on the clean test set by about half, while only slightly increasing the error rate on the reverberant test set. Because the thresholding is based on a global peak level, it is likely that it impairs the representation of low-energy segments of speech. It is also clear that the complex filtering operation is vital for good recognition performance on the reverberant test set. Comparing the recognition scores of all pairs of recognizers that differ only by the presence or absence of the complex modulation filter, the recognizers with the filter perform significantly better on the reverberant test set than the recognizers without the filter.

Second, the role of the complex modulation filter was examined. Although the complex filter has the advantage of producing a strictly positive output that may be converted to a decibel scale, it requires twice as much computation as a real filter of the same length. Also, the temporal resolution of the complex filter is lower than that of either its real or imaginary part, as illustrated in Fig. 3. We therefore investigated the use of the real part of the filter, the imaginary part of the filter, or both parts without the magnitude calculation. Because the individual parts of the filter could produce negative outputs, the log compression was replaced with a cube-root compression.

Fig. 3. The magnitude of the impulse response of the complex modulation filter is compared with the impulse responses of its real and imaginary components. Note that the temporal response of the complex filter is broader than that of either component.

Table 4
Percent word error rates for clean and highly reverberant speech using variants of the modulation spectrogram features. The presence of an 'X' in a cell indicates that the corresponding processing step was included in the feature calculation. Statistically significant differences range from 1.6 for the best clean tests to 2.2 for the reverberant tests. The criterion for statistical significance is p < 0.05 using a one-tailed significance test based on a normal approximation to a binomial distribution.

Env. norm  Cplx filt  Glob peak  Out. thrsh  Clean error  Reverb error
X          X          X          X           30.1         65.2
X                     X          X           30.6         67.8
X          X          X                      17.5         66.1
X          X                                 13.6         69.9
           X                                 18.3         68.8
                                             16.1         73.5


When the outputs of both the real and imaginary filters were used, the front end produced an output vector with thirty spectral features, so an MLP phonetic probability estimator with 450 input units, 176 hidden units, and 56 output units was used. This MLP had roughly the same number of weights as the MLPs used for the other recognizers. Table 5 summarizes the results of these experiments.

When the complex modulation filter was used with cube-root compression, the performance on the clean test set was unchanged, but the performance on the reverberant test set was significantly degraded. This may be a result of the small degree of compression at high signal levels afforded by the cube-root function relative to log-based compression. When cube-root compression is used, there is no significant difference in performance between the complex filter and its real part on either the clean or the reverberant test set. Using the imaginary part of the filter also has no significant effect on performance for the clean test set, but it does provide a significant improvement on the reverberant test set. This is probably due to the enhancement of changes in the amplitude envelope produced by the imaginary part of the filter. Finally, using the outputs of the real and imaginary parts of the filter together provides the best performance on the clean test set. Performance on the reverberant test set is equivalent to that of the complex filter with log compression. Using the outputs of both filters is similar to using both spectral and delta-spectral features: the real part of the filter is a smoothing filter, while the imaginary part is a differentiator.
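
Relative to the display-oriented pipeline sketched in Section 5, the best-performing variant in Table 5 amounts to a small change in the final stage, sketched below. It assumes the (bands, frames) array of normalized envelope signals from the earlier pipeline as input; the filter length and output layout are our own choices.

```python
import numpy as np
from scipy.signal import lfilter

def recognition_features(E, frame_rate=80.0):
    """Variant used for recognition (Table 5, last row): the real and
    imaginary parts of the complex 4 Hz filter are applied separately,
    no magnitude is taken, and cube-root compression replaces the log.
    No global peak normalization or thresholding is applied."""
    n = int(0.250 * frame_rate)
    t = np.arange(n) / frame_rate
    h = np.hamming(n) * np.exp(2j * np.pi * 4.0 * t)
    smooth = lfilter(h.real, 1.0, E, axis=1)   # real part: a smoother
    diff = lfilter(h.imag, 1.0, E, axis=1)     # imaginary: a differentiator
    # np.cbrt is odd-symmetric, so negative filter outputs are handled.
    feats = np.concatenate([np.cbrt(smooth), np.cbrt(diff)], axis=0)
    return feats.T    # (frames, 2 * bands) feature vectors
```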

8. Conclusions

Focusing on the temporal structure in speech is a promising direction for work on robust speech representations. Even a very simple set of processing steps that capture basic aspects of the auditory cortical representation of speech (critical-band frequency analysis, automatic gain control, and sensitivity to slow modulations) is sufficient to produce relatively robust visual displays of speech and to significantly improve the performance of an ASR system on highly reverberant speech. Although the modulation spectrogram is not as robust as log-RASTA-PLP in the tests on moderate reverberation or additive noise, the combination of the two representations, both emphasizing slow modulations in somewhat different ways, yields a significant improvement in performance over log-RASTA-PLP on its own.

Research towards improving the modulation spectrogram is continuing. The current emphasis is on the development of an entirely on-line version of the algorithm, the reduction of the spectral resolution of the final representation, and the integration of the front end with a syllable-based ASR system. An on-line gain control mechanism would enhance the representation of onsets, and thus may further improve the robustness of the representation, as suggested by the results of combining log-RASTA-PLP and the off-line spectrogram features. Preliminary results from experiments with an on-line modulation spectrogram support this hypothesis. On its own, an on-line modulation spectrogram front end achieves a word error rate of 19.1% on the reverberant test set used in the experiments summarized in Table 1, and in framewise combination with log-RASTA-PLP features a word error rate of 17.6% is achieved.

Table 5
Percent word error rates for clean and highly reverberant speech using the modulation spectrogram features computed with different compression methods and modulation filters. Statistically significant differences range from 1.8 percentage points for the best clean tests to 2.2 percentage points for the reverberant tests. The criterion for statistical significance is p < 0.05, using a one-tailed significance test based on a normal approximation to a binomial distribution

Filter           Compress.    Clean error   Reverb error
Complex          Log          17.8          63.8
Complex          Cube root    17.8          67.2
Real             Cube root    16.5          68.3
Imaginary        Cube root    17.3          64.3
Real and imag.   Cube root    14.7          63.5




Acknowledgements

We are grateful to Jim West and Gary Elko, from Bell Labs, and Carlos Avendaño, now at the University of California, Davis, for collecting a set of room impulse responses and making them available to us; to Hynek Hermansky of the Oregon Graduate Institute for many valuable conversations on the robust representation of speech; to Katrin Kirchhoff for translating the abstract of this paper into German; and to Holger Schwenk for translating the abstract of this paper into French. Support for this project came from a European Community Basic Research grant (Project Sprach) and from the International Computer Science Institute.

References

Arai, T., Greenberg, S., 1998. Speech intelligibility in the presence of cross-channel spectral asynchrony. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing.

Bourlard, H., Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Dordrecht, pp. 155–183.

Drullman, R., Festen, J.M., Plomp, R., 1994. Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Amer. 95 (2), 1053–1064.

Dudley, H., 1939. Remaking speech. J. Acoust. Soc. Amer. 11 (2), 169–177.

Furui, S., 1986. Speaker-independent isolated word recognition based on emphasized spectral dynamics. Proceedings of the 1986 IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing, pp. 1991–1994.

Greenberg, S., 1997. On the origins of speech intelligibility in the real world. Proceedings of the ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 23–32.

Greenberg, S., Arai, T., 1998. Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics.

Greenberg, S., Hollenback, J., Ellis, D., 1996. Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proceedings of the Fourth International Conference on Spoken Language Processing, pp. S24–27.

Greenberg, S., Kingsbury, B.E.D., 1997. The modulation spectrogram: In pursuit of an invariant representation of speech. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1647–1650.

Greenwood, D.D., 1961. Critical bandwidth and the frequency coordinates of the basilar membrane. J. Acoust. Soc. Amer. 33, 1344–1356.

Hermansky, H., 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Amer. 87 (4), 1738–1752.

Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. on Speech and Audio Processing 2 (4), 578–589.

Hermansky, H., Hanson, B.A., Wakita, H., 1985. Low-dimensional representation of vowels based on all-pole modeling in the psychophysical domain. Speech Communication 4, 181–187.

Hirsch, H.G., 1988. Automatic speech recognition in rooms. In: Lacoume, J.L., Chehikian, A., Martin, N., Malbos, J. (Eds.), Signal Processing IV: Theories and Applications, Proceedings of EUSIPCO-88, Fourth European Signal Processing Conference, Elsevier, Amsterdam, pp. 1177–1180.

Houtgast, T., Steeneken, H.J.M., 1973. The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28, 66–73.

Houtgast, T., Steeneken, H.J.M., 1985. A review of the MTF concept in room acoustics and its use for estimating speech intelligibility. J. Acoust. Soc. Amer. 77 (3), 1069–1077.

Houtgast, T., Steeneken, H.J.M., Plomp, R., 1980. Predicting speech intelligibility in rooms from the modulation transfer function I. General room acoustics. Acustica 46 (1), 60–72.

Huggins, A.W.F., 1975. Temporally segmented speech. Perception and Psychophysics 18 (2), 149–157.

Kingsbury, B.E.D., Morgan, N., 1997. Recognizing reverberant speech with RASTA-PLP. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1259–1262.

Kingsbury, B.E.D., Morgan, N., Greenberg, S., 1997. Improving ASR performance for reverberant speech. Proceedings of the ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 87–90.

Kollmeier, B., Koch, R., 1994. Speech enhancement based on physiological and psychoacoustical models of modulation perception and binaural interaction. J. Acoust. Soc. Amer. 95 (3), 1593–1602.

Langhans, T., Strube, H.W., 1982. Speech enhancement by nonlinear multiband envelope filtering. Proceedings of the 1982 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 156–159.

Lea, A.P., Summerfield, Q., 1994. Minimal spectral contrast of formant peaks for vowel recognition as a function of spectral slope. Perception and Psychophysics 56 (4), 379–391.


Leonard, R.G., 1984. A database for speaker-independent digit recognition. Proceedings of the 1984 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 42.11.1–42.11.4.

Milner, B.P., Vaseghi, S.V., 1995. An analysis of cepstral-time matrices for noise and channel robust speech recognition. EUROSPEECH 95, Proceedings of the Fourth European Conference on Speech Communication and Technology, pp. 519–522.

Morgan, N., Hermansky, H., 1992. RASTA extensions: Robustness to additive and convolutional noise. ESCA Workshop on Speech Processing in Adverse Conditions, pp. 115–118.

Ramsay, G., Deng, L., 1995. Maximum-likelihood estimation for articulatory speech recognition using a stochastic target model. EUROSPEECH 95, Proceedings of the Fourth European Conference on Speech Communication and Technology, pp. 1401–1404.

Schreiner, C.E., Urbas, J.V., 1988. Representation of amplitude modulation in the auditory cortex of the cat. II. Comparison between cortical fields. Hearing Research 32 (1), 49–63.

Steeneken, H.J.M., Houtgast, T., 1980. A physical method for measuring speech-transmission quality. J. Acoust. Soc. Amer. 67 (1), 318–326.

Summerfield, Q., Haggard, M., Foster, J., Gray, S., 1984. Perceiving vowels from uniform spectra: Phonetic exploration of an auditory aftereffect. Perception and Psychophysics 35 (3), 203–213.

Wu, S.-L., 1998. Incorporating information from syllable-length time scales into automatic speech recognition. Ph.D. Thesis, University of California, Berkeley.

Wu, S.-L., Kingsbury, B.E.D., Morgan, N., Greenberg, S., 1998. Incorporating information from syllable-length time scales into automatic speech recognition. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing.
