
Speech Communication 48 (2006) 1486–1501

www.elsevier.com/locate/specom

Binary and ratio time-frequency masks for robust speech recognition

Soundararajan Srinivasan a,*, Nicoleta Roman b, DeLiang Wang b

a Biomedical Engineering Department, The Ohio State University, 395 Dreese Laboratories, 2015 Neil Avenue, Columbus, OH 43210, USA
b Department of Computer Science and Engineering and Center for Cognitive Science, The Ohio State University, Columbus, OH 43210, USA

Received 14 June 2005; received in revised form 11 September 2006; accepted 14 September 2006

Abstract

A time-varying Wiener filter specifies the ratio of a target signal and a noisy mixture in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with a missing-data recognizer that operates in the spectral domain using the time-frequency units that are dominated by speech. To apply the missing-data recognizer, the same binaural processor is used to estimate an ideal binary time-frequency mask, which selects a local time-frequency unit if the speech signal within the unit is stronger than the interference. We find that the performance of the missing-data recognizer is better on a small vocabulary recognition task but the performance of the conventional recognizer is substantially better when the vocabulary size is increased.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Ideal binary mask; Ratio mask; Robust speech recognition; Missing-data recognizer; Binaural processing; Speech segregation

1. Introduction

The performance of automatic speech recognizers (ASRs) degrades rapidly in the presence of noise, microphone variations and room reverberation (Gong, 1995; Lippmann, 1997). Speech recognizers are typically trained on clean speech and face a problem of mismatch when used in conditions where speech occurs simultaneously with other sound sources. To mitigate the effect of this mismatch on recognition, noisy speech is typically preprocessed by speech enhancement algorithms, such as microphone arrays (Bradstein and Ward, 2001; Cardoso, 1998; Ehlers and Schuster, 1997; Hughes et al., 1999), computational auditory scene analysis (CASA) systems (Brown and Wang, 2005; Rosenthal and Okuno, 1998) or spectral subtraction techniques (Boll, 1979; Droppo et al., 2002). Microphone arrays require the number of sensors to increase as the number of interfering sources increases. Monaural CASA systems employ harmonicity as the primary

0167-6393/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2006.09.003

* Corresponding author. Tel.: +1 614 292 7402.
E-mail addresses: [email protected] (S. Srinivasan), [email protected] (N. Roman), [email protected] (D.L. Wang).


cue for grouping acoustic components corresponding to speech. These systems, however, do not perform well in time-frequency (T-F) regions that are dominated by unvoiced speech. Spectral subtraction systems typically assume stationary noise. Hence, in the presence of non-stationary noise sources, their performance is not adequate for recognition (Cooke et al., 2001). If samples of the corrupting noise source are available a priori, a model for the noise source can additionally be trained, and noisy speech may be jointly decoded using the trained models of speech and noise (Gales and Young, 1996; Varga and Moore, 1990) or enhanced using linear filtering methods (Ephraim, 1992). However, in many realistic applications, adequate amounts of noise samples are not available a priori and hence training a noise model is not feasible.

Recently, a missing-data approach to speech recognition in noisy environments has been proposed by Cooke et al. (2001). This method is based on distinguishing between reliable and unreliable data. When speech is contaminated by additive noise, some time-frequency units contain predominantly speech energy (reliable) and the rest are dominated by noise energy. The missing-data method treats the latter T-F units as missing or unreliable during recognition (see Section 4.2). Missing T-F units are identified by thresholding the T-F units based on local SNR. Spectral subtraction is typically used to estimate the local SNR. The performance of the missing-data recognizer is significantly better than that of a system using spectral subtraction for speech enhancement followed by recognition of the enhanced speech (Cooke et al., 2001).

A potential disadvantage of the missing-data recognizer is that recognition is performed in the spectral or T-F domain. It is well known that recognition using cepstral coefficients yields superior performance compared to recognition using spectral coefficients under clean speech conditions (Davis and Mermelstein, 1980). The superiority of the cepstral features stems from the ability of the cepstral transformation to separate vocal-tract filtering from the excitation source in speech production (Rabiner and Juang, 1993). Additionally, the cepstral transform approximately orthogonalizes the spectral features (Shire, 2000). Since missing-data recognition is based on marginalizing the unreliable T-F features during recognition, it is coupled with a spectral or T-F representation. Any global transformation of the spectral features (e.g. the cepstral transformation)

smears the information from the noisy T-F units across all the global features, preventing its effective marginalization. Attempts to adapt the missing-data method to the cepstral domain have centered around reconstruction or imputation of the missing values in the spectral domain followed by transformation to the cepstral domain (Cooke et al., 2001; Raj et al., 2004). Alternatively, van Hamme (2003) performs imputation directly in the cepstral domain. These reconstructions are typically based either on the speech recognizer itself or on other trained models of speech. The success of these model-based imputation techniques depends on the adequacy of the reliable data for identification of the correct speech model for imputation. In addition, errors in the imputation procedure affect the performance of the system even when the model is correctly identified.

Another potential drawback of the missing-data recognizer, which has not been well studied, is the problem of data paucity. The amount of "reliable" data available to the recognizer is a function of both the SNR and the frequency characteristics of the noise source. A decrease in SNR, as well as an increase in the bandwidth of the noise source, causes an increase in the amount of missing data. This leads to a deterioration in performance for a small vocabulary task (Cooke et al., 2001). The reduction in reliable data may pose an additional problem for recognition with larger vocabulary sizes. Paucity of reliable data constrains the missing-data recognizer to use only a small portion of the total T-F acoustic model space. This reduced space may be insufficient to differentiate between a large number of competing hypotheses during decoding. In this paper, we study this issue by comparing the performance of the missing-data recognizer on two tasks with different vocabulary sizes.

Binaural CASA systems that compute binary masks have been used successfully as front-ends for the missing-data recognizer on small vocabulary tasks (Palomaki et al., 2004; Roman et al., 2003). Such systems compare the acoustic signals at the two ears in order to extract the binaural cues of interaural time differences (ITD) and interaural intensity differences (IID). These binaural cues are correlated with the location of a sound source and hence provide powerful mechanisms for segregating sound sources from different locations. Moreover, binaural processing is independent of the signal content and hence can be used to segregate both voiced and unvoiced speech components from a noisy mixture. The computational goal of the binaural CASA systems is an ideal binary mask. A T-F unit in the ideal binary mask is labeled 1 or reliable if the corresponding T-F unit of the noisy speech contains more speech energy than interference energy; it is labeled 0 or unreliable otherwise.1 We employ a recent binaural speech segregation system (Roman et al., 2003) to estimate an ideal binary T-F mask. This mask is fed to the missing-data recognizer and recognition is performed in the spectral domain.

The minimum mean-square error (MMSE) based short-time spectral amplitude estimator, which utilizes the a priori SNR in a local T-F unit, has been used previously to effectively enhance noisy speech (Ephraim and Malah, 1984). The a priori SNR can be obtained if the premixing speech and noise signals are available. Roman et al. (2003) have shown that in a narrow frequency band, there exists a systematic relationship between the a priori SNR and the values of ITD and IID. Motivated by this observation, we estimate an ideal ratio T-F mask using statistics collected for ITD and IID at each individual frequency bin. A unit in the ratio mask is a measure of the ratio of speech energy to total energy (speech and noise) in the corresponding T-F unit of the noisy signal. The ratio mask is then used to enhance the speech, enabling recognition using Mel-frequency cepstral coefficients (MFCCs). We use "conventional recognizer" to refer to a continuous density hidden Markov model (HMM) based ASR using MFCCs as features.

We compare the performance of the conventional recognizer to that of the missing-data recognizer on a robust speech recognition task. In particular, we examine the effect of vocabulary size on the performance of the two recognizers. We find that on a small vocabulary task, the missing-data recognizer outperforms the conventional ASR. Our finding is consistent with a previous comparison using a binaural front-end made on a small vocabulary "cocktail-party" recognition task (Glotin et al., 1999; Tessier et al., 1999). The accuracy of results obtained using the missing-data method in the spectral domain was reported to be better than that obtained using the conventional ASR in the cepstral domain. With an increase in the vocabulary size, however, the conventional ASR performs substantially better. Results using missing value imputation methods have been reported on a larger vocabulary previously (Raj et al., 2004). Their method uses a binary mask and is therefore subject to the same limitations stated previously.

1 Similar masks have also been referred to as "a priori masks" and "oracle masks" in the literature.

The rest of the paper is organized as follows. Section 2 provides an overview of the proposed systems. We then describe the binaural front-end for both the conventional and missing-data recognizers in Section 3. The section additionally provides the estimation details of the ideal binary and ratio T-F masks. The conventional and missing-data recognition methods are reviewed in Section 4. The recognizers are tested on two different task domains with different vocabulary sizes. Section 5 discusses the two tasks and presents the evaluation results of the recognizers along with a comparison of their relative performance. Finally, conclusions and future work are given in Section 6.

2. System overview

In this study, we analyze two strategies for robust speech recognition: (1) missing-data recognition and (2) a system that combines speech enhancement with a conventional ASR. The performance is examined at various SNR conditions and for two vocabulary sizes. Fig. 1 shows the architecture of the two different processing strategies.

The input to both systems is a binaural mixture of speech and interference presented at different, but fixed, locations. Measurements of head-related transfer functions (HRTFs) are a standard method for binaural synthesis. An HRTF encompasses the filtering effects of the head, torso and pinna, and encodes information corresponding to the source location. In this paper, the HRTFs are provided by the measured left and right responses of the KEMAR manikin from a distance of 1.4 m in the horizontal plane, resulting in two 128-point impulse responses at a sampling rate of 44.1 kHz (Gardner and Martin, 1994). We generate the left and right ear signals by filtering the monaural signals with the left and right HRTFs. Note that different locations provide different HRTFs, resulting in different binaural signals. The responses to multiple sources are added at each ear. Due to the differential filtering effects between the two ears, the HRTFs provide location-dependent ITD and IID, which can be extracted independently in each T-F unit (see Section 3). The T-F resolution is 20 ms time frames with a 10 ms frame shift, and 512 DFT coefficients. Frames are extracted by applying a running Hamming window to the signal. The frequency bins uniformly cover the complete frequency range, up to the Nyquist frequency.

Fig. 1. Architecture of the two robust speech recognition strategies with binaural preprocessing: the missing-data recognizer and the conventional ASR. Left and right ear signals are obtained by filtering with HRTFs. A short-time Fourier analysis is applied to the signals, resulting in a time-frequency decomposition. ITD and IID are computed in each T-F unit. The missing-data recognizer works with a binary mask. A ratio mask is used as a speech enhancement strategy and is fed to the conventional recognizer.
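The binaural synthesis and time-frequency decomposition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the HRIR arrays are placeholders, and an 8 kHz sampling rate is assumed so that a 20 ms frame (160 samples) fits inside a zero-padded 512-point DFT (the HRIRs themselves are measured at 44.1 kHz; one evaluation corpus is band-limited to 4 kHz).

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_binaural(sources, hrirs_left, hrirs_right):
    """Filter each mono source with the left/right HRIRs for its location
    and sum the responses at each ear (Section 2)."""
    left = np.sum([fftconvolve(s, h) for s, h in zip(sources, hrirs_left)], axis=0)
    right = np.sum([fftconvolve(s, h) for s, h in zip(sources, hrirs_right)], axis=0)
    return left, right

def stft(x, fs=8000, frame_ms=20, shift_ms=10, n_dft=512):
    """Running Hamming window, 20 ms frames with a 10 ms shift, and a
    zero-padded 512-point DFT; bins uniformly cover the range up to Nyquist."""
    frame, shift = fs * frame_ms // 1000, fs * shift_ms // 1000
    win = np.hamming(frame)
    n_frames = 1 + (len(x) - frame) // shift
    return np.stack([np.fft.rfft(x[t * shift : t * shift + frame] * win, n=n_dft)
                     for t in range(n_frames)], axis=1)
```

Applying `stft` to each ear signal yields the complex spectrograms XL and XR from which the binaural cues are later computed per T-F unit.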

The missing-data speech recognizer operates in the log-spectral domain and requires a binary mask that informs the recognizer of which T-F units of the noisy input are dominated by speech energy. A 64-channel auditory filterbank was used previously as a front-end for the missing-data recognizer (Cooke et al., 2001). We have chosen a DFT representation for the missing-data recognizer in order to be consistent with the conventional recognizer. A comparison between the DFT representation and the auditory filterbank representation has shown that the difference in recognition performance is statistically insignificant. Statistics based on mixtures of multiple speech sources show that there exists a systematic correlation between the a priori energy ratio and the estimated ITD/IID values, resulting in a characteristic clustering across frequencies (Roman et al., 2003). To estimate the ideal binary mask, we extend the non-parametric classification method of Roman et al. (2003) in the joint ITD/IID feature space independently for each frequency bin. The frequency decomposition used by Roman et al. (2003) was generated by a gammatone filterbank. This classification results in binary Bayesian decision rules that determine whether speech is stronger than interference in individual T-F units (energy ratio greater than 0.5). The system of Roman et al. (2003) was chosen because of the excellent match between their estimated binary mask and the ideal binary mask.

The conventional approach to robust speech recognition involves preprocessing of the corrupted speech by speech enhancement algorithms. This allows for the subsequent use of decorrelating transformations (cepstral transformation, linear discriminant analysis), temporal processing methods (delta features, RASTA filtering) and normalization techniques (mean subtraction, variance normalization) on the enhanced spectral features (Chen et al., 2005; Shire, 2000). In this study, we use cepstral and delta features along with cepstral mean subtraction, which are known to provide improved recognition accuracy. To enhance noisy speech, we estimate a ratio T-F mask. The statistics described above show that the estimated ITD and IID have a functional relationship with the a priori energy ratio. We employ this relationship in a non-parametric fashion to estimate the ideal ratio mask. Finally, to decode using the conventional ASR, MFCCs are computed from the speech enhanced by masking the corrupted signal with the estimated ratio mask.

We evaluate the performance of the missing-data recognizer using both ideal and estimated binary masks. This is then compared to that of the conventional ASR using both ideal and estimated ratio masks.

3. A localization based front-end for ASR

When speech and additive noise are orthogonal, the linear MMSE filter is the Wiener filter (van Trees, 1968). With frame-based processing, the MMSE filter corresponds to the ratio of speech eigenvalues to the sum of the eigenvalues of speech and noise (van Trees, 1968). The eigenvalues can be computed from the auto-covariance functions prior to mixing by considering speech and noise to be two distinct random processes. Under asymptotic conditions, the MMSE filter corresponds to the frame-based Wiener filter (McAulay and Malpass, 1980; van Trees, 1968). Ephraim and Malah (1984) have additionally shown that the optimal MMSE estimate of the speech spectral amplitude in a local T-F unit is strongly related to the a priori SNR. To estimate the speech in a local T-F unit, we approximate the frame-based filter with an ideal ratio mask defined using the a priori energy ratio R(ω, t):

$$R(\omega, t) = \frac{|S(\omega, t)|^2}{|S(\omega, t)|^2 + |N(\omega, t)|^2}, \qquad (1)$$

where S(ω, t) and N(ω, t) are the target and noise spectral values at frequency ω and time t, computed from the signal at the "better ear" (the ear with the higher SNR). Our computational goal for front-end processing with the conventional ASR is to estimate R(ω, t) directly from the noisy mixture.
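When the premixing target and noise spectra are available, as they are when constructing the ideal mask, Eq. (1) reduces to a one-line array computation. A minimal numpy sketch (the `eps` guard for empty units is our addition):

```python
import numpy as np

def ideal_ratio_mask(S, N, eps=1e-12):
    """Eq. (1): per T-F unit, the ratio of target energy |S|^2 to the
    total energy |S|^2 + |N|^2 at the better ear."""
    Se, Ne = np.abs(S) ** 2, np.abs(N) ** 2
    return Se / (Se + Ne + eps)
```

Values near 1 mark target-dominated units and values near 0 mark interference-dominated ones.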

In addition, we define the ideal binary T-F mask as:

$$B(\omega, t) = \begin{cases} 1 & \text{if } R(\omega, t) \ge \theta, \\ 0 & \text{otherwise}, \end{cases} \qquad (2)$$

where θ is set to 0.5. Such masks have been shown to generate high-quality reconstruction for a variety of signals and also provide an effective front-end for missing-data recognition on a small vocabulary task (Cooke et al., 2001; Roman et al., 2003).
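Given the energy ratio of Eq. (1), the ideal binary mask of Eq. (2) is a simple threshold; a sketch, assuming `R` is an array of energy ratios:

```python
import numpy as np

def ideal_binary_mask(R, theta=0.5):
    """Eq. (2): label a unit 1 (reliable) where the energy ratio meets the
    threshold theta = 0.5, i.e. speech is at least as strong as the
    interference, and 0 (unreliable) otherwise."""
    return (np.asarray(R) >= theta).astype(np.uint8)
```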

The objective of our front-end processing is to develop effective mechanisms for estimating both the ideal binary and ratio masks. We propose an estimation method based on observed patterns for the binaural cues caused by the auditory interaction of multiple sources presented at different locations. Roman et al. (2003) have shown that ITD and IID undergo systematic shifts as the energy ratio between the target and the interference changes. In a particular frequency bin, the ITD and IID corresponding to the target source exhibit location-dependent characteristic values. As the SNR in this frequency bin decreases due to the presence of interference, the ITD and IID systematically shift away from the target values. Theoretical derivations for the case of two sinusoidal sources can be found in Roman et al. (2003). Moreover, statistics collected from real signals have shown similar patterns. In particular, the empirical mean approximates well the theoretical mean obtained in the case of two sinusoids (Roman et al., 2003). Note that training for each frequency bin is required since frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. We employ the same training corpus as used by Roman et al. (2003), consisting of 10 speech signals from the TIMIT database (Garofolo et al., 1993). Five sentences correspond to the target location set and the rest belong to the interference location set. Binaural signals are obtained by convolving with KEMAR HRTFs as described in Section 2. This dataset is different from the databases used in training the ASRs.

The ITD/IID estimates are computed independently in each T-F unit based on the spectral ratios at the left and right ears:

$$\left(\widehat{ITD}, \widehat{IID}\right)(\omega, t) = \left(\frac{1}{\omega}\,A\!\left(\frac{X_L(\omega, t)}{X_R(\omega, t)}\right),\ \frac{|X_L(\omega, t)|}{|X_R(\omega, t)|}\right), \qquad (3)$$

where X_L(ω, t) and X_R(ω, t) are the left and right ear spectral values of the noisy speech at frequency ω and time t, and A(re^{jφ}) = φ, −π < φ ≤ π. The function A(·) computes the phase angle, in radians, of a complex number with magnitude r and phase angle φ. Note that the phase is ambiguous up to integer multiples of 2π. To disambiguate, we identify the ITD in the range of 2π/ω centered at zero delay. By dividing the relative phase angle by the radian frequency, ÎTD estimates the time difference between the left and the right ear signals. ÎID calculates the relative magnitude between the left and the right ear spectral values and hence estimates the intensity difference.
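Eq. (3) can be computed in vectorized form over all T-F units; a sketch, assuming `XL` and `XR` are complex spectrogram arrays and `omega` holds the radian centre frequency of each bin:

```python
import numpy as np

def itd_iid(XL, XR, omega):
    """Eq. (3): ITD is the interaural phase difference divided by the radian
    frequency (np.angle returns values in (-pi, pi], which realizes the
    2*pi/omega delay range centred at zero); IID is the magnitude ratio."""
    itd = np.angle(XL / XR) / omega   # time difference in seconds
    iid = np.abs(XL) / np.abs(XR)     # intensity ratio
    return itd, iid
```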

3.1. Estimation of the binary mask

Fig. 2. Relationship between ITD/IID and the energy ratio R. Statistics are obtained with the target in the median plane and the interference on the right side at 30°. (a) The top panel shows the scatter plot for the distribution of R with respect to ITD for a frequency bin at 1 kHz. The solid white curve shows the mean curve fitted to the data. The vertical bars represent the standard deviation. The bottom panel shows the histogram of ITD samples. (b) Corresponding results for IID for a frequency bin at 3.4 kHz. (c) Histogram of ITD and IID samples for a frequency bin at 2 kHz.

Fig. 2 shows empirical results from the training corpus for a two-source configuration: target source in the median plane and interference at 30°. In the training corpus, we have access to the target and interference signals separately. Hence, we can compute R for each mixture signal. The scatter plot in Fig. 2(a) shows samples of ÎTD and R, as well as the mean, the standard deviation, and the histogram (the bottom panel), for a frequency bin at 1 kHz. Similarly, Fig. 2(b) shows the results that describe the variation of ÎID and R for a frequency bin at 3.4 kHz. The results are similar to those obtained by Roman et al. (2003), who use an auditory filterbank for frequency decomposition. Note that for the T-F units dominated by target (R ≈ 1), the binaural cues are clustered around the target values. Similarly, for the T-F units dominated by interference (R ≈ 0), the binaural cues are clustered around the interference values. Furthermore, the scatter plots exhibit a systematic shift of the estimated ITD and IID as R, as defined in (1), varies from 1 to 0. Moreover, a location-based clustering is observed in the joint ITD–IID space, as shown in Fig. 2(c). Each peak in the histogram corresponds to a distinct active source. Therefore, to estimate the binary mask B̂(ω, t), we employ non-parametric classification in the joint ITD–IID feature space, as used by Roman et al. (2003). There are two hypotheses for the binary decision: H1, target is stronger (R ≥ 0.5), and H2, interference is stronger (R < 0.5). The estimated binary mask, B̂(ω, t), is obtained using the maximum a posteriori (MAP) decision rule:

$$\widehat{B}(\omega, t) = \begin{cases} 1 & \text{if } p(H_1)\,p(\mathbf{x} \mid H_1) > p(H_2)\,p(\mathbf{x} \mid H_2), \\ 0 & \text{otherwise}, \end{cases} \qquad (4)$$

where x is the (ÎTD, ÎID)(ω, t) feature vector. The prior probabilities, p(Hi), are computed as the ratio of the number of samples in each class to the total number of samples. The conditional probabilities, p(x|Hi), are estimated from the training data using the kernel density estimation method (Roman et al., 2003).
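The MAP rule of Eq. (4) can be sketched with scipy's Gaussian kernel density estimator standing in for the paper's kernel method; the training arrays here are hypothetical 2 × n samples of (ÎTD, ÎID) for each hypothesis, collected per frequency bin:

```python
import numpy as np
from scipy.stats import gaussian_kde

def train_map_classifier(feats_H1, feats_H2):
    """Nonparametric MAP rule of Eq. (4). feats_Hi: 2 x n arrays of
    (ITD, IID) training samples under hypothesis Hi."""
    n1, n2 = feats_H1.shape[1], feats_H2.shape[1]
    p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)   # priors: class sample fractions
    kde1, kde2 = gaussian_kde(feats_H1), gaussian_kde(feats_H2)
    def decide(x):
        # B_hat = 1 iff p(H1) p(x|H1) > p(H2) p(x|H2)
        x = np.reshape(x, (2, 1))
        return int(p1 * kde1(x)[0] > p2 * kde2(x)[0])
    return decide
```

`gaussian_kde` uses a data-driven (Scott's rule) bandwidth, which differs from the kernel settings of Roman et al. (2003) but illustrates the same decision structure.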

3.2. Estimation of the ratio mask

In order to estimate the ideal ratio mask, we use the same training data. It is well known that ITD is salient at low frequencies while IID becomes more

prominent at higher frequencies (Blauert, 1997). In the high-frequency range, the wavelength of the acoustic signal is much shorter compared to the distance between the two ears. This results in a compressed range for the unambiguous values of ITD, which reduces the discriminative power of this cue. On the other hand, while the range of IID is very small at low frequencies, it can be as high as 30 dB at high frequencies. Hence, IID is a more reliable cue at high frequencies.

Fig. 3. Comparison between estimated and ideal binary and ratio T-F masks for a mixture of a clean speech utterance presented in the median plane and an interference signal presented at 30°. The SNR is 0 dB. (a) Spectrogram of the clean speech utterance. (b) Spectrogram of the mixture. (c) The ideal binary mask. (d) The estimated binary mask. (e) The ideal ratio mask. (f) The estimated ratio mask. The signals correspond to the left ear.


ITD exhibits different patterns across frequency bins, as seen in the number of modes that characterizes the distribution of the samples (Roman et al., 2003). Hence, there is no unique parametric curve for all frequencies. Moreover, in the absence of evidence that a parametric estimate provides better recognition results, a mean curve is fitted to the distribution of ITD. This gives our estimated ratio mask below 3 kHz. For higher frequencies, we utilize the information provided by the IID cues and use the same method to estimate the energy ratio. For improved results, we remove the outliers that lie more than a distance of 0.2 from the median. The resulting mean curves are shown in Figs. 2(a) (ITD) and (b) (IID). Thus, for given ÎTD(ω, t) and ÎID(ω, t), the estimated energy ratio R̂(ω, t) is the corresponding value on the mean curve.
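The per-bin lookup of R̂ on the fitted mean curve can be sketched with linear interpolation standing in for the fitted curve; `curve_cues` and `curve_ratios` are hypothetical sampled points of the mean curve for one frequency bin (the ITD curve below 3 kHz, the IID curve above):

```python
import numpy as np

def estimate_ratio(cue_value, curve_cues, curve_ratios):
    """Read the estimated energy ratio R off the mean curve fitted to the
    training scatter; interpolation over sampled (cue, R) pairs stands in
    for the fitted curve. The result is clipped to the valid [0, 1] range."""
    order = np.argsort(curve_cues)
    r = np.interp(cue_value, np.asarray(curve_cues, dtype=float)[order],
                  np.asarray(curve_ratios, dtype=float)[order])
    return float(np.clip(r, 0.0, 1.0))
```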

Fig. 3 shows the comparison between the ideal and estimated masks. Figs. 3(a) and (b) show the spectrograms of a clean speech utterance and the noisy mixture, respectively. The noisy mixture is generated by combining the clean speech utterance presented in the median plane with a factory noise signal presented on the right side at 30°. The SNR is 0 dB. The mask estimation algorithms described above are applied to the mixture and the results are presented in Fig. 3(c)–(f). Figs. 3(c) and (d) show the ideal binary mask and the estimated binary mask, respectively. Notice that the estimated binary mask approximates the ideal binary mask well (see also Roman et al., 2003). Figs. 3(e) and (f) show the ideal ratio mask and the estimated ratio mask, respectively. Observe that the estimated ratio mask is very similar to the ideal ratio mask, especially in the T-F units with high speech energy.

4. Recognition strategies

We evaluate the binaural segregation system described in Section 3 as the front-end for robust ASR using two different recognizers. The conventional ASR uses MFCCs as the parameterization of the observed speech. MFCCs are computed from the segregated speech obtained after applying the ratio mask to the noisy input signal (see Eq. (5)). The missing-data recognizer uses log-spectral energies as feature vectors in conjunction with the binary mask generated by the binaural system. An HMM toolkit, HTK (Young et al., 2000), is used for the training of both recognizers and for testing with the conventional ASR. During testing with the missing-data recognizer, the decoder is modified to incorporate the missing-data methods.

4.1. The conventional speech recognizer

We use a standard continuous-density HMM-based speech recognizer trained on clean speech to model each word in the vocabulary (Section 5). Observation densities are modeled as a mixture of Gaussians with diagonal covariance. The input to this ASR is the estimated speech spectral energy |S(ω, t)|², obtained by an element-wise multiplication of the estimated ratio mask and the spectral energy of the noisy speech:

|S(ω, t)|² = |X(ω, t)|² · R(ω, t),    (5)

where |X(ω, t)|² is the spectral energy of the noisy signal at the better ear (see Section 5). One of our evaluation corpora (see Section 5) is originally band-limited to 4 kHz. Hence, we apply a rectangular window to the estimated spectra and truncate them to 4 kHz. These truncated speech spectra are processed by a Mel-frequency filterbank (Rabiner and Juang, 1993) comprising 26 triangle-shaped filters. The low-frequency cut-off of the first filter is set to 105 Hz and the high-frequency cut-off of the last filter to 4300 Hz. A log compression is then applied to the resulting spectra. Finally, the spectral coefficients are converted to cepstral coefficients via the discrete cosine transform (Oppenheim et al., 1999). The performance of the conventional ASR with full-band spectra is shown in an earlier study (Srinivasan et al., 2004). We observe that the application of the rectangular window eliminates the effect of spurious and inaccurately estimated high-frequency components. This significantly improves accuracy compared to recognition using full-band spectra.
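A minimal numpy sketch of this front-end follows. The 26 filters, 105 Hz and 4300 Hz cut-offs, and 20 kHz sampling rate come from the text; the FFT size of 512 is an assumption, and the cepstral transform is an unnormalized type-II DCT:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr, f_lo, f_hi):
    """Triangular filters evenly spaced on the mel scale (Rabiner & Juang)."""
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):          # rising slope of the triangle
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling slope of the triangle
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def masked_cepstra(power_spec, ratio_mask, fb, n_ceps=13):
    """Eq. (5), |S|^2 = |X|^2 * R, followed by mel filtering, log
    compression, and a DCT to cepstra. power_spec and ratio_mask are
    (n_bins, n_frames) arrays; fb is the filterbank matrix."""
    s2 = power_spec * ratio_mask            # element-wise ratio masking
    logmel = np.log(fb @ s2 + 1e-10)        # 26 log mel-band energies
    n = logmel.shape[0]
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return basis @ logmel                   # DCT-II -> cepstral coefficients

fb = mel_filterbank(26, 512, 20000, 105.0, 4300.0)
```

The 4 kHz rectangular window corresponds to zeroing (or dropping) all DFT bins above 4 kHz before this pipeline.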

MFCCs are chosen as feature vectors because they are the most commonly used in state-of-the-art recognizers (Rabiner and Juang, 1993). Thirteen cepstral coefficients, along with delta and acceleration coefficients, are extracted for each frame, including the 0th-order cepstral coefficient C0 as the energy term. Frames are extracted as described in Section 2. A first-order preemphasis coefficient of 0.97 is applied to the signal. Cepstral mean normalization is additionally applied for improved ASR performance (Young et al., 2000).
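The delta computation and cepstral mean normalization can be sketched as follows; the regression form is the standard HTK one with a ±2-frame window, assumed here rather than quoted from the paper:

```python
import numpy as np

def deltas(feats, theta=2):
    """HTK-style regression deltas over a +/- theta frame window
    (Young et al., 2000); feats is (n_coeffs, n_frames). Applying the
    function twice yields the acceleration coefficients."""
    padded = np.pad(feats, ((0, 0), (theta, theta)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    out = np.zeros_like(feats, dtype=float)
    n = feats.shape[1]
    for t in range(1, theta + 1):
        out += t * (padded[:, theta + t:n + theta + t]
                    - padded[:, theta - t:n + theta - t])
    return out / denom

def cmn(feats):
    """Cepstral mean normalization: subtract each coefficient's
    per-utterance mean."""
    return feats - feats.mean(axis=1, keepdims=True)
```

On a constant feature track the deltas are zero; on a unit-slope ramp the interior deltas equal the slope.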

4.2. The missing-data speech recognizer

The missing-data recognizer (Cooke et al., 2001) is an HMM-based ASR that exploits spectro-temporal redundancy in speech to recognize noisy speech based on its speech-dominant T-F units. Given an observed speech vector Y, the problem of word recognition is to maximize the posterior P(Wi|Y), where Wi is a valid word sequence according to the grammar for the recognition task. When parts of Y are corrupted by additive noise, Y can be partitioned into its reliable and unreliable constituents, Yr and Yu. One can then seek the Bayesian decision given the reliable constituents. In the marginalization method, the posterior probability using only the reliable constituents is computed by integrating over the unreliable ones (Cooke et al., 2001). In missing-data methods, recognition is typically performed using spectral energy as feature vectors. If Y represents the observed spectrum and the sound sources are additive and uncorrelated, then the unreliable parts may be constrained as 0 ≤ Ỹu² ≤ Yu². This constraint states that the true value of the spectral energy in the unreliable parts, Ỹu², lies between 0 and the observed spectral energy. These bounds are used as limits on the integral involved in marginalizing the posterior probability over the unreliable features. This bounded marginalization method is shown by Cooke et al. (2001) to yield a better recognition score than simple marginalization, and is hence used in all our experiments employing the missing-data recognizer. We use mixtures of Gaussians with diagonal covariance to model the observed speech features, as suggested by Cooke et al. (2001). Feature vectors for the missing-data recognizer are derived from the DFT coefficients in each frame, extracted as described in Section 2. Log compression is applied to the resulting energy spectrum of the signal. To be consistent with the experiments on the conventional ASR, we apply a rectangular window to truncate the log-spectral energy to 4 kHz. Hence, 98 spectral coefficients along with delta coefficients in a two-frame delta window are extracted in each frame. Note that the delta window covers two preceding frames and two succeeding frames (Young et al., 2000). Additionally, during decoding of the noisy input, the missing-data recognizer uses the estimated binary mask to provide the reliability information for the static features. For the delta features, the binary mask is defined as follows: a unit in the mask is labeled 1 if all the static spectral coefficients used in the computation of the corresponding delta feature are reliable, and 0 otherwise (Barker et al., 2000).

5. Evaluation results

To compare the effect of vocabulary size on the two recognition approaches outlined above, we choose two task domains. The first task is speaker-independent recognition of connected digits, the same task used in the original study of Cooke et al. (2001). Thirteen word-level models (the digits 1–9, zero and oh, a silence, and a very short pause between words) are trained for both recognizers. All except the short pause model have 8 emitting states. The short pause model has a single emitting state, tied to state 4 of the silence model. The output distribution in each state is modeled as a mixture of 10 Gaussians, as suggested by Cooke et al. (2001). The grammar for this task allows for one or more repetitions of digits, all equally probable. The average number of digits in an utterance is 3.2, with a minimum of 1 and a maximum of 7. The male speaker data of the TIDigits database are used for both training and testing (Leonard, 1984). Specifically, the models are trained using 4235 utterances in the training set of this database. Testing is performed on a subset of the testing set consisting of 461 utterances from 6 speakers, comprising 1498 words. All test speakers are different from the speakers in the training set. The signals in this database are sampled at 20 kHz.

The second task is speaker-independent recognition of command and control type phrases. There are 40 phrase templates that range from a 1-word template such as ''STOP'' to an 8-word template such as ''CALL MY DAUGHTER AT ELEVEN PM ON <WEEKDAY>''. Note that some templates generate many utterances by using variables such as <WEEKDAY>. The grammar for this task assigns equal probability to all 40 phrase templates in the database. The average number of words in an utterance is 4.2. Two hundred and eight word-level models (206 words, a silence, and a short pause between words) are trained for both recognizers. This task allows us to increase the vocabulary size from 13 to 208, a natural progression in testing the effect of vocabulary size on the recognizers. All except the short pause model have 8 emitting states, whose output distribution is modeled as a mixture of 10 Gaussians. The short pause model has a single state. The digital data subset of the Apple Words and Phrases database is used for both training and testing (Cole et al., 1995). In particular, 1996 speakers with IDs 21–2604 are used for training. This corresponds to 63,835 utterances. Data from 14 speakers with IDs 4–19 are used for testing. This corresponds to 454 utterances, comprising 1823 words. The signals are sampled at 8 kHz.

Fig. 4. Performance of conventional and missing-data recognizers on the digits recognition task. Ideal RM refers to the performance of the conventional ASR using the ideal ratio mask. Estimated RM refers to its performance when using the ratio mask estimated by the binaural front-end. Ideal BM refers to the performance of the missing-data ASR using the ideal binary mask. Estimated BM refers to the performance of the same when using the binary mask estimated by the binaural front-end. For comparison, the performance of the conventional ASR without any front-end processing and with processing by the ETSI advanced feature extraction algorithm is also shown.

The two tasks also differ in perplexity. Perplexity, along with vocabulary size, is one indicator of the difficulty of a recognition task (Rabiner and Juang, 1993). Perplexity is a measure of the average number of different words that can follow any given word. For the digit recognition task, the perplexity is 11.0, as any digit can follow any other digit. For the command and control task, the perplexity is 3.05. For our tasks, we calculate the perplexity empirically from the word-level lattice (Young et al., 2000). The lower perplexity of the second task is due to its restrictive grammar (Cole et al., 1995). To test the robustness of the two recognizers on the aforementioned tasks, noise is added at a range of SNRs from −5 to 10 dB in steps of 5 dB. Higher positive SNR values are not explored, as one of the recognizers saturates to ceiling performance at 10 dB. The noise source for both recognition tasks is the factory noise from the NOISEX corpus (Varga et al., 1992), which is also used by Cooke et al. (2001). The factory noise is chosen as it has energy in the formant regions, posing challenging problems for recognition. It also contains contributions from impulsive sources, making it difficult to estimate its spectrum using spectral subtraction methods (Cooke et al., 2001). In all our experiments, the target speech source is in the median plane and the noise source on the right side at 30°, making the left ear the better ear in terms of SNR (see Section 3). The binaural mixture is created by convolving the monaural target and noise signals with the HRTFs corresponding to the respective source locations, as described in Section 2. Note that for both tasks, training and testing are performed using speech band-limited to 4 kHz.

Fig. 4 summarizes the performance of the two recognizers on the digit recognition task. Performance is measured in terms of word-level recognition accuracy under various SNR conditions. ''Unprocessed'' refers to the baseline performance of the conventional ASR, without any front-end speech enhancement processing. The figure shows the recognition accuracy of the conventional ASR with the use of ideal and estimated ratio T-F masks (''Ideal RM'' and ''Estimated RM'', respectively). This is compared to the accuracy of the missing-data recognizer, which uses ideal and estimated binary T-F masks (''Ideal BM'' and ''Estimated BM'', respectively). We also compare the performance to that obtained using an advanced front-end feature extraction algorithm (''ETSI AFE''), standardized by the European Telecommunication Standards Institute (ETSI) (STQ-AURORA, 2005). This is a state-of-the-art feature extraction algorithm for noisy environments (Macho et al., 2002). The algorithm achieves noise robustness by combining a two-stage Wiener filter with SNR-dependent waveform processing. Finally, power spectra extraction and blind equalization are used in the cepstrum calculation to further improve the robustness of the extracted features. The algorithm is used to generate a 39-dimensional feature vector, as suggested by Macho et al. (2002). These features are used to train and test an ASR system in a manner similar to that used by the conventional ASR.

Fig. 4 shows the robust performance of the ideal ratio mask when used as a front-end for conventional ASR. Only a minor performance degradation is observed even at −5 dB. The performance with the estimated ratio mask shows substantial improvement over that with no preprocessing across all SNR conditions. Additionally, the estimated ratio mask also outperforms the ETSI advanced front-end at SNR < 10 dB. The performance improvement is especially substantial at low SNRs. As reported by Cooke et al. (2001), the performance of the missing-data recognizer degrades very little with increasing amounts of added noise, indicating the adequacy of recognition using a binary mask for this task. Also, the performance with the estimated binary mask is close to that with the ideal binary mask, indicating the quality of the front-end's estimate of the ideal binary mask (see also Roman et al., 2003). Notice that, for this task, the performance of the missing-data recognizer is close to the performance of the conventional ASR with the ideal ratio mask and better than its performance with the estimated ratio mask.

Similarly, Fig. 5 summarizes the performance of the two recognizers on the task of recognizing command and control phrases. The relative performance of the two recognizers reverses with this increase in vocabulary size. As in the digits recognition task, the performance of the conventional ASR using the ideal ratio mask is close to the ceiling performance. Additionally, its performance using the estimated ratio mask is close to that with the ideal ratio mask, especially at SNR > 0 dB. Also notice that the estimated ratio mask outperforms the ETSI advanced feature extraction algorithm across all SNRs. As in the previous task, the performance improvement is substantial at low SNRs. The increased accuracy of the conventional ASR using the estimated ratio mask compared to its performance on the digits recognition task is due to the lower perplexity of this task. Its performance is now substantially better than that of the missing-data recognizer using both ideal and estimated binary masks, particularly at SNR ≥ 0 dB. Notice that the performance of the missing-data recognizer with the estimated binary mask is close to that with the ideal binary mask, as in the digits recognition task, confirming the ability of the front-end to estimate the ideal binary mask accurately.

Fig. 5. Performance of conventional and missing-data recognizers on the command and control task. See Fig. 4 caption for notations.

In the experiments with the missing-data ASR thus far, the threshold θ used in defining the ideal binary mask (Eq. (2)) is fixed at 0.5. Fig. 6 shows the performance of the missing-data ASR on both tasks as θ is varied in the range 0.2–0.8. The performance is measured in terms of the change in recognition accuracy relative to that obtained by setting θ = 0.5. The SNR is 0 dB. Note that for both tasks, the performance is relatively stable for θ in the range 0.4–0.7. While the highest accuracy on the digit recognition task is obtained when θ = 0.3, the improvement is small. Moreover, this choice of threshold results in a significant degradation in performance on the command and control task. For that task, θ = 0.5 provides the best performance. Hence, the selection of 0.5 as the threshold in the definition of the ideal binary mask is an appropriate choice for the tasks considered in the present study.
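Assuming Eq. (2) (not reproduced in this section) thresholds the Wiener-style energy ratio, so that θ = 0.5 selects exactly the units where target energy exceeds noise energy, the two ideal masks can be sketched as:

```python
import numpy as np

def ideal_masks(target_power, noise_power, theta=0.5):
    """Ideal ratio mask (local Wiener gain) and ideal binary mask obtained
    by thresholding that ratio at theta. The precise Eq. (2) criterion is
    an assumption here; with theta = 0.5 a T-F unit is selected exactly
    when the target energy exceeds the noise energy."""
    ratio = target_power / (target_power + noise_power + 1e-12)
    binary = (ratio > theta).astype(int)
    return ratio, binary
```

Both masks require the premixed target and noise signals, which is why they are "ideal"; the binaural front-end only estimates them from the mixture.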

Fig. 6. Effect of varying the threshold θ used in the definition of the ideal binary mask (see Eq. (2)) on the performance of the missing-data ASR on the two tasks. The performance is shown relative to that obtained with θ = 0.5. The SNR for both tasks is 0 dB.

Table 2
Effect of amplitude distortion in the reliable T-F regions on recognition accuracy (%) of the missing-data recognizer for the command and control task

                       SNR (dB)
Amplitude       −5       0       5       10
Distorted      66.54   71.20   76.51   80.88
Undistorted    69.17   74.38   79.10   82.67

The lower accuracy values of the missing-data recognizer using both binary masks in Fig. 5 may be attributed to a number of reasons. It is known that the use of mixtures of Gaussians with a diagonal covariance structure does not adequately represent the observed spectral vectors (Cooke et al., 2001), and this problem is exacerbated by an increase in vocabulary size. Thus, under clean speech conditions, the difference between the accuracy of the conventional and missing-data recognizers increases with vocabulary size (see also Raj et al., 2004). One could compute MFCCs from the speech resynthesized using the binary T-F masks by substituting the binary mask for the ratio mask in (5), followed by overlap-and-add resynthesis (Oppenheim et al., 1999). Under clean speech conditions, the missing-data recognizer would then have the same recognition accuracy as the conventional ASR. The performance, though, degrades rapidly with decreasing SNR (de Veth et al., 1999).

Table 1
Effect of amplitude distortion in the reliable T-F regions on recognition accuracy (%) of the missing-data recognizer for the digits recognition task

                       SNR (dB)
Amplitude       −5       0       5       10
Distorted      94.89   95.97   97.18   98.12
Undistorted    94.76   96.24   97.31   98.25

The use of binary masks does not compensate for amplitude distortions, because the mixture spectral values are used in recognition for those T-F units labeled 1. Could this be the reason for reduced performance in larger vocabulary recognition? To test the effect of this distortion, we replace the spectral vectors of the reliable T-F regions with their corresponding clean speech values, calculated a priori. The performance at various SNRs is summarized in Tables 1 and 2. ''Distorted'' refers to the performance of the missing-data recognizer using the mixture spectral values for all T-F units. ''Undistorted'' refers to its performance when the reliable T-F units contain clean speech values; in the unreliable units, we retain the spectral values of noisy speech. We use the ideal binary mask generated at each SNR to provide the reliability information for both conditions. Table 1 shows the effect of amplitude distortion on the digits recognition task. For this task, the effect of amplitude distortion is, as expected, minimal across all SNRs, since the recognition accuracy is already quite high. Table 2 shows the effect of amplitude distortion on the task of command and control phrases. Only a small improvement is observed by eliminating the noise energy from the reliable T-F units. Hence, the degradation to the overall performance of the missing-data recognizer caused by this amplitude distortion is statistically insignificant at the range of SNRs considered here. When using the ideal binary mask generated at each SNR directly on clean speech, we observe a degradation in performance. This may be attributed to the use of energy bounds for the unreliable units in the marginalization method. The bounded marginalization method averages the observation probability over all possible spectral energy values between 0 and the observed value. Hence, when the observed value is the clean spectral energy, the bounded marginalization method overestimates the true observation probability.
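The ''Undistorted'' condition amounts to a masked substitution, sketched below with hypothetical arrays:

```python
import numpy as np

def undistorted_features(noisy_spec, clean_spec, reliable_mask):
    """Replace the reliable T-F units of the noisy spectrogram with the
    a priori clean values, keeping noisy values in the unreliable units --
    the 'Undistorted' condition of Tables 1 and 2."""
    return np.where(reliable_mask.astype(bool), clean_spec, noisy_spec)
```

The ''Distorted'' condition simply passes `noisy_spec` through unchanged.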

Comparing Figs. 4 and 5, we can see that the performance curve for the missing-data recognizer is steeper on the second task than on the first. This may be caused by the inability of the missing-data recognizer to represent all the speech models adequately. The log-spectral representation may have limited expressibility in terms of the number of distinct words that can be uniquely represented. The TIDigits database has a small vocabulary. The Apple Words and Phrases database, with its larger vocabulary, creates many more competing models during decoding. Thus, within the same T-F grid, an increased number of words needs to be discriminated. With the use of a binary mask, only a small portion of the total T-F acoustic model space is utilized during recognition. This makes it difficult for the missing-data recognizer to differentiate between competing hypotheses. Fig. 7 shows the effect of using the same binary T-F mask on two signals. Fig. 7(a) shows the spectrogram of the word ''Billy'' and Fig. 7(b) shows the spectrogram of the word ''Delete'' in quiet. Fig. 7(c) shows an ideal binary T-F mask generated at low SNR. The reliable units in this mask are black and the unreliable white. This binary mask is applied to the spectrograms in Figs. 7(a) and (b), and the resulting spectrograms with only reliable T-F units are shown in Figs. 7(d) and (e), respectively. Notice that the reliable regions of the two spectrograms are very similar. In the absence of information in the unreliable regions, it is difficult for the recognizer to distinguish between the two words. Indeed, the recognizer frequently substitutes one word for the other. The bounded marginalization method treats the information in the unreliable regions only as counter-evidence for the recognition of certain models (Cunningham and Cooke, 1999). Hence, the missing-data recognizer faces increased acoustic complexity during decoding.

Fig. 7. An illustration of the similarity of reliable regions. (a) The spectrogram of the word ''Billy'' in quiet. (b) The spectrogram of the word ''Delete'' in quiet. (c) An ideal binary T-F mask; reliable T-F units are marked black and unreliable white. (d) The spectrogram obtained from (a) by applying the ideal mask in (c). (e) The spectrogram obtained from (b) by the same ideal masking as in (d).

6. Discussion

The advantage of the missing-data recognizer is that it imposes a lesser demand on the speech enhancement front-end than the conventional ASR. Only knowledge of the reliable T-F units of noisy speech, i.e., an ideal binary mask, is required from the front-end. Moreover, Roman et al. (2003) have shown that the performance of the missing-data recognizer degrades gradually with increasing deviation from the ideal binary mask. The binaural system employed here is able to estimate this mask accurately. Hence, we achieve performance close to the ceiling performance of missing-data recognition. Conventional ASR, on the other hand, requires speech enhancement across all T-F units through front-end processing. In this study, we have employed a ratio T-F mask as a front-end for the conventional ASR, estimated using statistics of ITD and IID. Estimation of the ideal ratio mask is less robust than estimation of the ideal binary mask. As a result, the performance of the missing-data recognizer on the small vocabulary task is better than that of the conventional ASR.

The marginalization method for missing-data recognition is the optimal spectral-domain recognition strategy provided that the missing T-F units can be ignored for classification (Little and Rubin, 1987). The missing-data recognizer assumes that the unreliable units carry redundant information for speech recognition. This, however, is not always true. For a small vocabulary task, the unreliable units may be safely marginalized with good recognition results. When the vocabulary size increases, the acoustic model space becomes densely populated. Under such conditions, good recognition results may not be obtained by completely ignoring the missing T-F units. This may be caused by the inability to represent all the acoustic models adequately using only a small number of reliable T-F units. The ratio T-F mask, on the other hand, attempts to recover the speech in the unreliable T-F units for use in recognition (see Eq. (5)). Additionally, under clean speech conditions, recognition accuracy using spectral features is inferior to that using cepstral features. The cepstral transformation retains the envelope of speech while removing its excitation source (Rabiner and Juang, 1993). The speech envelope contains the most relevant information for recognition. In addition, cepstral features are used for their quasi-orthogonal properties (Shire, 2000). Hence, the advantage of the conventional ASR shows when the vocabulary size increases.

Raj et al. (2004) have previously reported that conventional ASR with reconstructed missing T-F regions outperforms the missing-data recognizer when tested on the Resource Management database (Price et al., 1988). The missing or unreliable T-F units were reconstructed either using speech clusters or based on their correlations with reliable regions. The speech clusters and the knowledge of correlations between reliable and unreliable T-F units are obtained from the training portion of the Resource Management database. Unlike their system, our estimation of the ideal ratio mask is independent of the signals used in the training and testing of the speech recognizers. Hence, it is applicable even when samples of clean speech are unavailable. Additionally, the accuracy and computational complexity of our ratio-mask-based system do not depend on the nature and size of the vocabulary.

Note that our supervised training captures only the location information, and hence our system is not sensitive to the content of the sound sources. While we have shown results only for one configuration of target and interference source positions, similar results are expected for other spatial configurations, including those with interferences at multiple locations (see Roman et al., 2003). The study of Roman et al. (2003) also addresses the localization of various sound sources in a mixture. However, the estimation of the ratio and binary masks requires training for different configurations of sources and re-estimating the mean curves as described in Section 3. Although extrapolation from trained configurations to untrained ones may be possible, this is a limitation that needs to be addressed in future research.

Although our estimated T-F ratio mask provides promising results, other approaches to the estimation of this mask could also be explored; should a parametric curve be suspected, its parameters could be optimized to minimize recognition errors. Future work will also extend to large vocabulary tasks and explore the robustness of the binaural front-end to changes in the location and number of noise sources.

To summarize, we have proposed a ratio T-F mask, estimated using a binaural processor, as a front-end for conventional ASR. At two different vocabulary sizes, the use of this mask results in sizable improvements in recognition accuracy at various SNRs when compared to the baseline performance. Additionally, it significantly outperforms the ETSI advanced robust feature extraction algorithm at most SNRs. On the larger vocabulary task, the performance of the proposed ASR is substantially better than that of the missing-data recognizer. Our study indicates that the optimal preprocessing strategy for robust speech recognition may depend on the vocabulary size of the task. For small vocabulary applications, computation of the ideal binary T-F mask may be desirable, whereas a ratio mask may provide improved performance at increased vocabulary sizes.


Acknowledgments

This research was supported in part by an AFOSR Grant (FA9550-04-1-0117) and an NSF Grant (IIS-0081058). We thank M. Cooke for discussions and assistance in implementing the missing-data recognizer, and D. Pearce for helping us obtain the ETSI advanced front-end algorithm. We also thank the three anonymous reviewers for their suggestions and criticisms. A preliminary version of this work was presented at ICSLP 2004.

References

Barker, J., Josifovski, L., Cooke, M., Green, P., 2000. Softdecisions in missing data techniques for robust automaticspeech recognition. In: Proc. International Conference onSpoken Language Processing ’00, pp. 373–376.

Blauert, J., 1997. Spatial Hearing – The Psychophysics of HumanSound Localization. MIT Press, Cambridge, MA.

Boll, S.F., 1979. Suppression of acoustic noise in speech usingspectral subtraction. IEEE Trans. Acoust. Speech SignalProcessing ASSP-27 (2), 113–120.

Bradstein, M., Ward, D. (Eds.), 2001. Microphone Arrays: SignalProcessing Techniques and Applications. Springer, Berlin,Germany.

Brown, G.J., Wang, D.L., 2005. Separation of speech bycomputational auditory scene analysis. In: Benesty, J., Mak-ino, S., Chen, J. (Eds.), Speech Enhancement. Springer, NewYork, pp. 371–402.

Cardoso, J.F., 1998. Blind signal separation: statistical principles.Proc. IEEE 86 (10), 2009–2025.

Chen, C.-P., Bilmes, J., Ellis, D.P.W., 2005. Speech featuresmoothing for robust ASR. In: Proc. IEEE InternationalConference on Acoustics, Speech, and Signal Processing ’05,vol. 1, pp. 525–528.

Cole, R., Noel, M., Lander, T., Durham, T., 1995. Newtelephone speech corpora at CSLU. In: Proc. EuropeanConference on Speech Communication and Technology ’95,pp. 821–824.

Cooke, M., Green, P., Josifovski, L., Vizinho, A., 2001. Robustautomatic speech recognition with missing and unreliableacoustic data. Speech Commun. 34, 267–285.

Cunningham, S., Cooke, M., 1999. The role of evidence andcounter-evidence in speech perception. In: Proc. InternationalCongress on Phonetic Sciences ’99, pp. 215–218.

Davis, S.B., Mermelstein, P., 1980. Comparison of parametricrepresentations for monosyllabic word recognition in contin-uously spoken sentences. IEEE Trans. Acoust. Speech SignalProcessing ASSP-28 (4), 357–366.

de Veth, J., de Wet, F., Cranen, B., Boves, L., 1999. Missingfeature theory in ASR: make sure you miss the right typeof features. In: Proc. Workshop on Robust Methods forSpeech Recognition in Adverse Conditions ’99, pp. 231–234.

Droppo, J., Acero, A., Deng, L., 2002. A nonlinear observationmodel for removing noise from corrupted speech log mel-spectral energies. In: Proc. International Conference onSpoken Language Processing ’02, pp. 1569–1572.

Ehlers, F., Schuster, H.G., 1997. Blind separation of convolutivemixtures and an application in automatic speech recognitionin a noisy environment. IEEE Trans. Signal Processing 45(10), 2608–2612.

Ephraim, Y., 1992. A Bayesian estimation approach for speechenhancement using hidden Markov models. IEEE Trans.Signal Processing 40 (4), 725–735.

Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Processing ASSP-32 (6), 1109–1121.

Gales, M.J.F., Young, S.J., 1996. Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech Audio Processing 4, 352–359.

Gardner, W.G., Martin, K.D., 1994. HRTF measurements of a KEMAR dummy-head microphone. Technical Report #280, MIT Media Lab Perceptual Computing Group.

Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., 1993. DARPA TIMIT acoustic–phonetic continuous speech corpus. Technical Report NISTIR 4930, National Institute of Standards and Technology, Gaithersburg, MD.

Glotin, H., Berthommier, F., Tessier, E., 1999. A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition. In: Proc. European Conference on Speech Communication and Technology ’99, pp. 2351–2354.

Gong, Y., 1995. Speech recognition in noisy environments: a survey. Speech Commun. 16, 261–291.

Hughes, T.B., Kim, H.S., DiBiase, J.H., Silverman, H.F., 1999. Performance of an HMM speech recognizer using a real-time tracking microphone array as input. IEEE Trans. Speech Audio Processing 7 (3), 346–349.

Leonard, R.G., 1984. A database for speaker-independent digit recognition. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing ’84, pp. 111–114.

Lippmann, R.P., 1997. Speech recognition by machines and humans. Speech Commun. 22, 1–15.

Little, R.J.A., Rubin, D.B., 1987. Statistical Analysis with Missing Data. Wiley, New York, NY.

Macho, D., Mauuary, L., Noe, B., Cheng, Y.M., Ealey, D., Jouvet, D., Kelleher, H., Pearce, D., Saadoun, F., 2002. Evaluation of a noise-robust DSR front-end on Aurora databases. In: Proc. International Conference on Spoken Language Processing ’02, pp. 17–20.

McAulay, R., Malpass, M.L., 1980. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Processing ASSP-28 (2), 137–145.

Oppenheim, A.V., Schafer, R.W., Buck, J.R., 1999. Discrete-time Signal Processing, second ed. Prentice-Hall, Upper Saddle River, NJ.

Palomaki, K.J., Brown, G.J., Wang, D.L., 2004. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Commun. 43, 361–378.

Price, P., Fisher, W.M., Bernstein, J., Pallett, D.S., 1988. The DARPA 1000-word Resource Management database for continuous speech recognition. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing ’88, pp. 651–654.

Rabiner, L.R., Juang, B.H., 1993. Fundamentals of Speech Recognition, second ed. Prentice-Hall, Englewood Cliffs, NJ.


Raj, B., Seltzer, M.L., Stern, R.M., 2004. Reconstruction of missing features for robust speech recognition. Speech Commun. 43, 275–296.

Roman, N., Wang, D.L., Brown, G.J., 2003. Speech segregation based on sound localization. J. Acoust. Soc. Am. 114, 2236–2252.

Rosenthal, D.F., Okuno, H.G. (Eds.), 1998. Computational Auditory Scene Analysis. Lawrence Erlbaum Associates, Mahwah, NJ.

Shire, M.L., 2000. Discriminant training of front-end and acoustic modeling stages to heterogeneous acoustic environments for multi-stream automatic speech recognition. Ph.D. thesis, University of California, Berkeley.

Srinivasan, S., Roman, N., Wang, D.L., 2004. On binary and ratio time-frequency masks for robust speech recognition. In: Proc. International Conference on Spoken Language Processing ’04, pp. 2541–2544.

STQ-AURORA, 2005-11. Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms. In: ETSI ES 202 050 V1.1.4. European Telecommunications Standards Institute.

Tessier, E., Berthommier, F., Glotin, H., Choi, S., 1999. A CASA front-end using the localisation cue for segregation and then cocktail-party speech recognition. In: Proc. IEEE International Conference on Speech Processing, pp. 97–102.

van Hamme, H., 2003. Robust speech recognition using missing feature theory in the cepstral or LDA domain. In: Proc. European Conference on Speech Communication and Technology ’03, pp. 3089–3092.

van Trees, H.L., 1968. Detection, Estimation, and Modulation Theory, Part I. Wiley, New York, NY.

Varga, A.P., Moore, R.K., 1990. Hidden Markov model decomposition of speech and noise. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing ’90, pp. 845–848.

Varga, A.P., Steeneken, H.J.M., Tomlinson, M., Jones, D., 1992. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, Speech Research Unit, Defense Research Agency, Malvern, UK.

Young, S., Kershaw, D., Odell, J., Valtchev, V., Woodland, P., 2000. The HTK Book (for HTK Version 3.0). Microsoft Corporation.