Robust Speech Recognition

DICIT Consortium

(ITC-irst (Trento, Italy), University of Erlangen-Nuernberg (Erlangen, Germany), IBM (Praha, Czech Republic; T.J. Watson Research Center, USA), Fracarro Radioindustrie (Castelfranco Veneto, Italy), 3Soft (Erlangen, Germany), CitecVoice (Torino, Italy), Alpikom (Trento, Italy))

Copyright © Partners of the DICIT consortium. This paper, or a short extract of it, can be reproduced, republished, or distributed only if the DICIT Consortium is acknowledged and the authors give permission. For further information, please contact the reference person, Maurizio Omologo (ITC-irst, Italy), at the following e-mail address: [email protected].

1 Introduction

The field performance of an Automatic Speech Recognition (ASR) component can degrade significantly if it was not designed taking into account the variabilities found in the real world.

According to the early literature in the field, the term "robustness" was initially associated with the main variability factor in the input speech, namely environmental noise; this is now referred to as "environmental robustness". Besides this, other types of robustness are investigated today, for instance robustness "to distortion in the transmission channel", "to speaker variability", "to distant-talking interaction", and "of spoken dialogue systems".

As regards the latter, it is evident [42] that in a spoken dialogue system there is no obvious relationship between an increase or decrease in word error rate and an improvement or deterioration in the behaviour of the entire system. This makes the design and development of a conversational interface more difficult. In fact, when the ASR component is embedded in a spoken dialogue system, a wide range of techniques and parameters is involved in the process of converting the input speech sequence into the semantic concept and the related action.

Moreover, due to the proliferation of ASR applications in contexts in which the user cannot hold the microphone in an ideal way, the (generally non-stationary) environmental noise, the reverberation effects, and a time-varying distance between the user and the microphone introduce new dimensions along which robustness must be defined.

It is also worth noting that ASR technologies are primarily based on a diffuse use of statistical methods requiring large corpora of speech examples. Input signal degradation, inter/intra-speaker variabilities, environmental conditions, the transmission channel, and mismatches in vocabulary and language are some of the possible reasons leading to an unpredictable situation, that is, a situation never "seen" during training or foreseen by the designer during system development. Having large corpora to train the system with the "right" speech examples for every possible situation is unfeasible. So, in most of those unpredictable situations a robust system is expected to produce a satisfactory output, or behaviour, thanks to a smart and balanced adoption of effective solutions to each sub-problem tackled inside the system. Given the current state of the art, in fact, there is no recipe for developing an error-free component, not even for a task as simple as connected digit recognition.

The remainder of this document attempts to synthesize an up-to-date state of the art for most of the topics of interest related to robustness. Due to the very rich literature in the field and to the complexity of the problem, the document cannot cover all the most significant achievements; the authors apologize if some relevant references have been missed. Moreover, the main focus will often be on the development of a robust system for distant-talking ASR in a noisy and reverberant environment, possibly with multiple speakers (the so-called "cocktail party" context).

Due to this focus, some newer approaches will not be addressed here, such as those based on the fusion of audio and visual information [111, 113] or of audio and special-device information [121], although they represent other effective ways to increase the robustness of speech interfaces in some application contexts.

2 Surveys on ASR robustness

There is a vast literature on the robustness of ASR systems. For a general introduction to environmental robustness, one can refer to chapter 10 of the book by Huang, Acero and Hon (2001) [3]. Other important references available in the literature are:

• Junqua and Haton (1996). Robustness in Automatic Speech Recognition. Kluwer Academic Publishers.

• Juang, B.H., "Speech Recognition in Adverse Environments," Computer Speech and Language, 1991, 5, pp. 275-294.


• Rabiner and Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993 (in particular section 5.7).

• De Mori (Ed.), "Spoken Dialogues with Computers", 1998 (in particular, chapter 12).

• Y. Gong, "Speech recognition in noisy environments: A survey", Speech Communication, vol. 16, pp. 261-291, 1995.

• Acero, A., Acoustical and Environmental Robustness in Automatic Speech Recognition, Kluwer Academic Publishers, Boston, MA, 1993.

• C.-H. Lee, F.K. Soong, and K.K. Paliwal, "Automatic Speech and Speaker Recognition", Kluwer, 1996.

• Speech Communication, Special Issue on Noise Robust ASR, vol. 34, no. 1-2, April 2001.

As regards distant-talking speech recognition, an introduction to the problem can be found in [43, 41].

Several workshops have also focused on the problem, for instance:

• International Workshop on Hands-Free Speech Communication, Kyoto, Japan, April 9-11, 2001.

• Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, April 1997.

• COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, 2004 (in particular the paper written by R. Rose).

• Joint Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA), Rutgers University, Piscataway, March 2005.

3 Environmental noise and reverberation

The first source of mismatch between ideal conditions (for which an ASR system is designed to provide the best performance) and real-world conditions is the transformation of the acoustic wave from the moment it leaves the speaker's mouth to the digital representation resulting from microphone transduction and A/D conversion. The high quality of modern transducers and A/D hardware makes it even more evident today that the additive noise generated by various sources (e.g., machinery, electrical devices, competing speakers, background music, door slams, etc.) and the convolutive reverberation due to early and late reflections of waves in the environment still represent very tough problems to tackle. As regards reverberation effects, it is worth noting that a distance of a few decimeters between the speaker's mouth and the microphone can cause a deep alteration in the speech waveform.

Various studies reported in the literature during the last decade agree that an SNR of 5-10 dB, even with additive noise alone, leads to a dramatic decrease in performance [83]. The application of ASR in the car environment, mainly characterized by stationary noise and a small contribution in terms of reverberation, is a good example of this fact.

Other studies showed that in a reverberant environment the drop in ASR performance is significant even when the distance between speaker and microphone increases only slightly [41]. In the simple case of connected digit recognition, with a speaker at 1.5 meters from an omnidirectional microphone in typical noisy office conditions, a 56% WER was reported in [44], while a 0.9% WER was observed on the corresponding close-talk recordings.

As discussed in the following, many different methods and techniques have been explored to reduce the impact of noise and reverberation on ASR performance, aimed either at "cleaning" the input signal (see Section 4), at deriving robust acoustic features (see Section 6), or at training robust statistical models (see Section 7). In the two latter cases, a reduced mismatch between training and test conditions is generally pursued by applying techniques that clean the acoustic features (feature-based compensation) or update the statistical models (model-based compensation and adaptation).

4 Speech enhancement

There is a wide literature on speech enhancement aimed at cleaning the input speech from the effects of environmental noise and reverberation. Here, the basic assumption (not always confirmed by experimental evidence) is that feeding an ASR system with an enhanced signal, characterized by a better perceptual quality, will lead to an improvement in ASR performance. Enhancement can be accomplished either using a single microphone (i.e., a traditional or a noise-canceling microphone) or using a multi-microphone solution, in particular a microphone array.


4.1 Single Channel techniques

Comprehensive discussions of the state of the art in single-channel noise reduction can be found in [1] and [2]. Speech recognition results using single-channel noise reduction are given, e.g., in [3].
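To make the idea concrete, a minimal magnitude spectral subtraction sketch is given below. This is only one classical single-channel scheme among those covered in the cited surveys; the frame sizes assume 16 kHz input, and the parameter values (over-subtraction factor, spectral floor, number of leading noise-only frames) are illustrative, not tuned.

# Minimal sketch of magnitude spectral subtraction, a classic single-channel
# noise reduction method. Assumes the first `noise_frames` frames contain
# noise only; all parameter values are illustrative.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_frames=10, over_sub=2.0, floor=0.02):
    f, t, X = stft(x, fs=fs, nperseg=400, noverlap=240)   # 25 ms / 10 ms at 16 kHz
    mag, phase = np.abs(X), np.angle(X)
    noise_est = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - over_sub * noise_est, floor * mag)
    _, x_hat = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=400, noverlap=240)
    return x_hat

Note that such a scheme trades residual noise for "musical noise" artifacts, which is one reason why a better perceptual quality does not automatically translate into better ASR performance.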

As regards reverberation effects, filtering the microphone signal with the inverse of the transfer function of the path between speaker and microphone would be the ideal processing to dereverberate the speech signal. As typical room impulse responses are non-minimum-phase, a causal, exact inverse filter is not stable and can therefore only be approximated [11].
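As a quick numerical illustration of this point: a causal, stable exact inverse exists only if all zeros of the impulse response's z-transform lie inside the unit circle, a condition that measured room impulse responses typically violate. The impulse response h below is a toy example introduced here for illustration, not a real room response.

# Quick check of the minimum-phase condition for an impulse response h:
# H(z) has a causal, stable exact inverse only if all its zeros lie strictly
# inside the unit circle.
import numpy as np

h = np.array([1.0, 0.2, -0.5, 0.9])        # toy impulse response (not a real RIR)
zeros = np.roots(h)                         # zeros of H(z) = sum_n h[n] z^{-n}
print(np.abs(zeros))                        # any magnitude >= 1 means non-minimum-phase
print(np.all(np.abs(zeros) < 1.0))          # True only for a minimum-phase h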

Another dereverberation approach is proposed in [15], observing that the linear prediction residual of clean speech exhibits a single dominating peak within each glottal cycle, while several peaks can be found in the residual of reverberant speech. By attenuating the additional impulses due to reflections, a dereverberation effect can be obtained.

In [20] dereverberation is obtained by using the harmonic structure of speech: based on pitch estimation, the harmonic part of the speech is extracted by adaptive filtering. By averaging the ratio of the discrete Fourier transforms of the harmonic part of the speech and that of the reverberant signal, a dereverberation filter is calculated which reduces reverberation in both voiced and unvoiced speech segments. A MAP formulation is given in [21]. The technique is suitable when a sufficient number of training utterances is available and when the room impulse response does not change significantly.

4.2 Multi-Channel techniques

The use of a microphone array can mitigate the distortion by providing spa-tial filtering and focusing to the speaker position (i.e. beamforming). Onecritical issue in beamforming is Time Delay Estimation (TDE) between dif-ferent microphones. This operation is the first step in many speaker local-ization and tracking algorithms and turns out to have a significative impactin the next calculations. Having a reliable and accurate TDE technique, asGCC-PHAT (also called CSP [49]), a good estimate of the speaker position[50] can be obtained as well as an effective beamformed signal [5].
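A minimal sketch of GCC-PHAT delay estimation between two microphone signals follows; it is a rough illustration of the technique referred to above as GCC-PHAT/CSP, not the exact implementation used in the cited works. The optional maximum-delay bound is an assumption supplied by the caller (e.g., derived from the microphone spacing).

# Minimal GCC-PHAT time delay estimation sketch between two microphone channels.
import numpy as np

def gcc_phat_delay(x1, x2, fs, max_delay=None):
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = int(max_delay * fs) if max_delay else n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                          # delay in seconds, positive if x1 lags x2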

Many state-of-the-art techniques basically improve the quality of the desired speech, increasing the corresponding SNR. The simplest, and yet effective, method is delay-and-sum beamforming (DSB), based on the temporal re-alignment of the signals. By compensating for the different delays, a coherent addition of the signal originating from the desired direction is achieved, while the signals originating from other directions add incoherently and are therefore attenuated. Thus the interfering signals can be attenuated relative to the desired signal.
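The following sketch illustrates the basic delay-and-sum operation with integer-sample alignment; the per-channel steering delays are assumed to be known, e.g., from a TDE step such as the one sketched above. Practical implementations usually use fractional-delay (interpolation) filters rather than rounding to whole samples.

# Minimal delay-and-sum beamformer sketch: align each channel toward a known
# steering direction (integer-sample shifts) and average the aligned signals.
import numpy as np

def delay_and_sum(signals, delays_s, fs):
    """signals: list of equal-length channel waveforms; delays_s: steering delays in seconds."""
    shifts = np.round(np.asarray(delays_s) * fs).astype(int)
    shifts -= shifts.min()                     # make all shifts non-negative
    length = len(signals[0]) + shifts.max()
    out = np.zeros(length)
    for sig, s in zip(signals, shifts):
        out[s:s + len(sig)] += sig             # align each channel, then sum
    return out / len(signals)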

Because of its limited directionality, the DSB achieves only a moderate reverberation reduction at low frequencies. In [4] the reverberation reduction of DSBs is analyzed using tools from statistical room acoustics.

In [5], where the DSB is used as a preprocessing unit for speech recognition, a noticeable increase of the recognition rate is reported.

Another approach exploits the statistical properties of the acquired signals: each channel is filtered by an adaptive filter which is adjusted to optimize a certain cost function. For example, the minimum variance distortionless beamformer minimizes the output signal energy under the constraint of an undistorted response in the look direction. Many implementations have been proposed, starting from the original Generalized Sidelobe Canceler described in [7]. The application of this technique is most beneficial for interfering point sources, as spatial zeros can be placed in the directions of the interferers. In [8] it is shown that the recognition accuracy can be significantly increased by adaptive beamforming if interferers are active. To avoid the problem of desired-signal cancellation in reverberant environments, careful adaptation control is required. While adaptive beamforming is very successful at attenuating interferers, its reverberation reduction capability is only slightly better than that of the DSB.
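For reference, a standard frequency-domain expression of the minimum variance distortionless response (MVDR) weights, written here in generic notation not taken from the cited papers, is

\[
\mathbf{w}(\omega) = \frac{\boldsymbol{\Phi}_{nn}^{-1}(\omega)\,\mathbf{d}(\omega)}{\mathbf{d}^{H}(\omega)\,\boldsymbol{\Phi}_{nn}^{-1}(\omega)\,\mathbf{d}(\omega)},
\]

where \(\boldsymbol{\Phi}_{nn}\) is the noise-plus-interference spatial covariance matrix and \(\mathbf{d}\) is the steering vector toward the desired source; the constraint \(\mathbf{w}^{H}\mathbf{d} = 1\) enforces the distortionless response in the look direction mentioned above.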

In the approach suggested in [9] the processing is oriented toward directly improving the speech recognizer output, by adapting the coefficients of a filter-and-sum beamformer in order to maximize the probability of the correct transcription. Because of the nonlinear relationship between the filter coefficients and the cost function, the adaptation of the filter coefficients is very challenging.

In [12], Miyoshi and Kaneda extend the dereverberation approach to the multi-channel case, showing that under some specific conditions it is possible to perform exact inverse filtering, even though the involved impulse responses are non-minimum-phase. The main problem is that the room impulse responses need to be known exactly, and even small inaccuracies can lead to significant deviations from the optimum solution [13]. A concise review of this method can be found in [14].

Dereverberation based on linear prediction residuals can be generalized to multichannel setups [16, 17, 18, 19], with a significant reduction of reverberation at the cost of distorting the desired signal.

Methods for blind system identification are proposed in [24, 25, 26, 27]. If the order of the required room impulse responses is not known, the dereverberation performance is significantly reduced. To improve the performance for unknown lengths of the impulse responses, Hikichi proposed a post-processing scheme [28].

Buchner et al. propose in [29] a framework for multi-channel blind signalprocessing which can be used for blind dereverberation.

5 Speech Activity Detection

To activate speech signal capture one can either adopt a push-to-talk mode or a "continuously listening" mode. The latter mode, more flexible but also more challenging, requires a very precise algorithm to distinguish speech from non-speech sequences.

The impact of a good Speech Activity Detector (SAD) on ASR performance is twofold: it restricts the sequences fed into the recognizer to those actually containing speech, thus limiting possible insertions of non-speech events and consequently improving ASR robustness; and it reduces the computational load on the ASR component.

There is a wide literature in the field; early works focused on close-talking interaction and were based on the use of energy thresholding and zero-crossing features [45, 46, 47, 48], in some cases exploring the joint use of SAD and noise reduction [51]. More recent works have tackled the problem for more critical communication channels and environmental conditions, introducing other acoustic features, for instance long-term speech information, to better detect speech boundaries. In [52], more details are given on the problem, together with a good introductory survey of the SAD techniques explored more recently.
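A rough sketch in the spirit of those early energy/zero-crossing detectors is given below: frames with high energy are labeled speech, and frames with moderate energy but a high zero-crossing rate are also accepted, to keep unvoiced segments. The thresholds and frame sizes are illustrative assumptions and would need calibration to the recording conditions.

# Crude frame-level speech activity detection based on energy and zero-crossing
# rate; thresholds are illustrative, not tuned to any particular corpus.
import numpy as np

def simple_sad(x, fs, frame_ms=25, hop_ms=10, energy_thr_db=-35.0, zcr_thr=0.25):
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    peak = np.max(np.abs(x)) + 1e-12
    decisions = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame]
        energy_db = 10 * np.log10(np.mean(seg ** 2) / peak ** 2 + 1e-12)
        zcr = np.mean(np.abs(np.diff(np.sign(seg))) > 0)
        speech = energy_db > energy_thr_db or \
                 (energy_db > energy_thr_db - 10 and zcr > zcr_thr)
        decisions.append(speech)
    return np.array(decisions)       # one boolean per frame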

In a noisy and reverberant environment, with distant-talking interaction, detecting speech activity becomes a rather complex task. The use of multi-microphone input and of information typically used for speaker localization purposes was explored in [53]. Recent results obtained under the CHIL project on seminars with a speaker-microphone distance of 3-4 meters are reported in [54] and show the critical role of this component. In a multiple-speaker context [56, 57], and with an audio interference to be compensated by an acoustic echo cancellation technique, SAD represents a very challenging state-of-the-art research issue.

6 Acoustic features

Acoustic feature extraction aims at reducing the data rate, from the speech signal to a more compact representation that allows a better characterization at the statistical level for ASR purposes.

The most commonly used acoustic features for speech recognition are Mel-Frequency Cepstral Coefficients (MFCCs) [58]. They are usually combined with their first- and second-order time derivatives [38, 39] to obtain, in a rather simple way, an effective representation of a segment (of 10-30 ms) of speech input.
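As an illustration, the first-order (delta) coefficients are commonly computed with a regression formula over a window of ±N frames; the sketch below uses this generic formulation (not tied to any specific toolkit), and applying it a second time to the delta features yields the second-order coefficients.

# Delta (first-order time derivative) features via the standard regression
# formula over a +/- N frame window.
import numpy as np

def delta(features, N=2):
    """features: (num_frames, num_coeffs) array, e.g. MFCCs."""
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out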

In the last two decades, several attempts have been devoted to improving the performance offered by this set of features. For instance, LPC-based coefficients, PLP [40], and other auditory-model-based features represent good alternative sets, in some cases performing better in terms of robustness to environmental noise. Also time-frequency distributions [114], supra-segmental features, articulatory-based features, and other processing methods inspired by a detailed modeling of the speech production system seem to be innovative ways to improve ASR robustness [59].

6.1 Normalization, transformation and compensation techniques

Given a noisy speech input, or other variabilities introduced by the speaker or the channel, most of the above-mentioned features (in particular MFCCs) do not perform robustly, due to the mismatch that is introduced. However, different techniques have been explored that transform the given feature set into a more robust one.

Cepstral mean [60] and variance normalization [61] represent a simple but effective processing step that alleviates this mismatch. Recent attempts to map features to a standard normal distribution have led to more effective solutions, as described in [62, 63, 64].
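A minimal per-utterance sketch of this normalization follows (generic formulation, assuming the features are arranged as a frames-by-coefficients array): each coefficient track is shifted to zero mean and scaled to unit variance.

# Cepstral mean and variance normalization (CMVN), applied per utterance.
import numpy as np

def cmvn(features, eps=1e-8):
    """features: (num_frames, num_coeffs) array of cepstral coefficients."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)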

RASTA filtering [65] processes the time trajectories of the given features in order to compensate for the convolutional noise introduced by the channel. Vocal tract normalization is also very effective in dealing with inter-speaker variabilities [66].

Recent significant works on noise-robust features and on feature compensation can be found in [67, 68] and in [69], which proposes a feature-space compensation based on vector Taylor series. In [70], a feature enhancement technique is presented, based on a probabilistic and phase-sensitive environment model of acoustic distortion, i.e., taking into account the phase relationship between clean speech and corrupting noise. The technique is also proposed to tackle non-stationary noise.

Actually, most of the techniques mentioned so far do not perform well in the presence of non-stationary noise. On the other hand, most real-world applications are characterized by speech degradation due to non-stationary noise sources.

Some recent research activities on acoustic features and related transformations suitable for coping with non-stationary noise distortion are described in [71], where a particle filter-based sequential noise estimation method is outlined; for slowly time-varying mismatch due to non-stationary noise, sequential estimation with forgetting has also been proposed, as described in [72]; in [73], a feature extraction procedure based on kernel predictive coding cepstra is proposed, which turns out to be quite robust to time-varying noise characteristics.

Also missing-data [74] and multi-band approaches [75, 76] aim at improving ASR robustness in the case of real, non-stationary noise of unknown characteristics, by assuming to have tools and knowledge to extract from the spectral information the most reliable bands, which can eventually be processed and recognized as independent channels. These methods derive in principle from the theories of Computational Auditory Scene Analysis (CASA) and from perceptual models of human hearing [77], as well as from the evidence that, knowing exactly where the noise is in the spectrum, a relevant ASR improvement can be attained [78]. One of the main issues is how to develop such a "smart" automatic tool so that it is reliable under any unknown noisy condition.
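As a rough illustration of the masking idea underlying missing-data approaches, a binary reliability mask can be derived from a local SNR estimate. This is a simplified sketch under the assumption that a noise spectrum estimate is already available; in practice obtaining that estimate is the hard part, and the threshold is illustrative.

# Binary "missing data" reliability mask: time-frequency cells whose estimated
# local SNR exceeds a threshold are marked reliable.
import numpy as np

def reliability_mask(speech_power, noise_power, snr_thr_db=0.0):
    """speech_power, noise_power: (num_frames, num_bins) spectrogram estimates."""
    local_snr_db = 10 * np.log10((speech_power + 1e-12) / (noise_power + 1e-12))
    return local_snr_db > snr_thr_db   # True = reliable cell, False = missing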

7 Acoustic modeling

Hidden Markov Models (HMMs) [79] are the most commonly used technology for speech recognition today. Although some strong assumptions are made in the definition of the related framework, HMMs still represent a very effective way to model both the spectral and the time-varying characteristics of speech signals. Among the various attempts to introduce alternative frameworks or generalizations of the HMM framework, it is worth mentioning ANNs, hybrid ANN-HMM systems, Bayesian networks [80] and dynamical modeling.

7.1 Robust HMM training

In order to derive HMMs robust to the variabilities caused by noisy conditions, distorted channels, reverberant environments, etc., the best training approach relies on the availability of speech data that match the conditions encountered in the real world. This is not always possible, due to the lack of large amounts of data and to the cost of collecting a new, application-specific speech corpus. An effective way to partially circumvent the problem and ensure a reasonable ASR performance is to adopt multi-condition training [81], which implies having collected large corpora under very different conditions and speaking styles, and then using all the given data to train models "suitable", on average, for different real-world conditions.

Another effective training method to improve ASR robustness is based on artificially corrupting clean speech with noise at different SNR levels [82, 83]. This method, also called "contaminated speech training", was then generalized [44] in order to take into account the convolutional effects due to the acoustics of the environment in a distant-talking interaction. Its effectiveness was also confirmed in other works, for instance [31] and [85, 84]. For the purpose of improving ASR robustness, another significant approach is Parallel Model Combination (PMC) [35], which aims at deriving a statistical distribution of noisy speech, as a mixture of Gaussians, given the distributions of clean speech and noise. For non-stationary noise, it is also worth mentioning the approach described in [86], although its complexity, due to a three-dimensional Viterbi search, is very high.
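A rough illustration of contaminated-speech data generation is sketched below (a simplified sketch, not the exact procedure of [44]): clean speech is convolved with a room impulse response and then mixed with noise scaled to a target SNR. The impulse response, the noise signal and the SNR value are assumptions supplied by the caller.

# Generate "contaminated" training data: convolutional distortion via a room
# impulse response, plus additive noise at a target SNR.
import numpy as np
from scipy.signal import fftconvolve

def contaminate(clean, rir, noise, target_snr_db):
    reverberant = fftconvolve(clean, rir)[:len(clean)]   # convolutional distortion
    noise = np.resize(noise, len(reverberant))           # loop/trim noise to length
    speech_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (target_snr_db / 10)))
    return reverberant + gain * noise                     # additive distortion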

7.2 HMM adaptation and compensation

To reduce the mismatch between training and real-world conditions, another effective procedure is to modify the statistical models, given a small set of data representative of the testing conditions. Maximum A Posteriori (MAP) estimation [33] and Maximum Likelihood Linear Regression (MLLR) [34, 87, 88] are the most common techniques; they were originally conceived to adapt to speaker variabilities and then turned out to be useful to adapt to new environmental conditions as well. Good overviews of these approaches can be found in [89, 116, 90].
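In its basic form, MLLR adapts the Gaussian mean vectors of the HMMs through an affine transform, written here in generic notation for illustration:

\[
\hat{\boldsymbol{\mu}} = \mathbf{A}\,\boldsymbol{\mu} + \mathbf{b},
\]

where \(\boldsymbol{\mu}\) is the original mean, and the matrix \(\mathbf{A}\) and the bias \(\mathbf{b}\) are shared by a regression class of Gaussians and are estimated by maximizing the likelihood of the adaptation data.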

A disadvantage of these approaches is that they require an initial recognition pass or a certain amount of adaptation data from matching conditions to estimate the transformation parameters in a reliable way. If only short signal segments representing the current noise condition are available, the Jacobian adaptation approach [91, 92] can be used as an effective alternative.

Recent literature discusses variants of MLLR, for instance constrained model-space adaptation [117, 93, 94], and new approaches to noise-robust ASR such as uncertainty decoding [120, 118].

With regard to distant-talking interaction, in [44] the effectiveness of MLLR adaptation is also shown when applied to models previously trained on contaminated speech.

Model adaptation by state splitting has also been explored, as discussed in [37], for environments characterized by long reverberation.

Finally, it is worth noting that "adaptive training" [115, 116, 119], originally conceived for robustness to speaker variabilities, represents an effective way to train canonical models as well as to adapt them during recognition in a way that is robust against mismatches in environmental noise conditions.

8 Pronunciation modeling

Another important source of inter/intra-speaker variability, particularly evident in spontaneous speech interaction, is the different way of pronouncing the same word or word sequence due to various effects, among which coarticulation, stress, etc. As a matter of fact, the pronunciation of a given word can differ substantially from the baseform described to the ASR system in terms of a canonical phone sequence. Trivial solutions, such as those based on a list of alternative phone sequences, do not solve the problem in an effective way. Although several works have been conducted on this topic [95, 96, 97], pronunciation variation, as well as the intrinsic pronunciation ambiguity which is not easy to address (even by a human transcriber), represents a very complex research task to pursue in order to enhance system robustness, in particular where spontaneous speech phenomena are expected (e.g., in-car voice interaction). Another way to face the problem is to adopt a completely unsupervised statistical method, as described for instance in [98], or a mix of data-driven and knowledge-based methods [99].

9 Spoken Language Understanding and Dialogue

Robustness is also a key issue in spoken language understanding and dialogue systems. Ill-formed sentences (typically occurring in spontaneous speech interaction and with naive users), as well as recognition errors due to environmental noise or the other factors addressed previously, can cause a wrong interpretation of the input sentence and a misleading behaviour of the dialogue system.

Actually, research in the field shows that in a good spoken dialogue system the rate at which the end-to-end performance of the complete system degrades can be significantly slower than that of the ASR component (evaluated in terms of word error rate) [100, 101, 102, 103].

A first consequence is that language models and grammars, as well as vocabulary size and characteristics, should be designed and trained with the objective of the best understanding accuracy rather than the best word recognition rate.

Then, one of the key components in reducing the impact of ASR errors is the semantic parser: having a robust parser [104, 105, 106] gives the system the capability to handle incomplete information and ambiguities generated by the lower layers of the system (in particular the ASR component).

Confidence measures at this conceptual level (as well as at any other layer of the system) represent another tool to prevent the system from giving incorrect feedback or exhibiting incorrect behavior. Confidence measures can be derived at the concept level as well as at the word or sentence level, or through their combination. More details on this topic and the most significant references can be found in [107].

Dialogue strategies including effective solutions for recovery from errors are also of fundamental importance to cope with a lack of robustness in the lower layers of the system.

Towards the development of more robust, portable and easy-to-train SLU and dialogue systems, it is worth mentioning the statistically based approaches described in [108].

10 Corpora, System Evaluation, Benchmarking

To study ASR robustness, one fundamental aspect is the availability and shareability, across research centers, of corpora to be used for benchmarking different technologies. Sharing the same evaluation metrics is also of fundamental importance to allow a fair comparison between different technologies.

In the last two decades several attempts have been made along these directions. As the performance of ASR technology improved, ARPA periodically defined new, more ambitious benchmarks, from TIDIGITS to the Resource Management task, to TIMIT, ATIS, SWITCHBOARD, WSJ, etc. [109]. These tasks have progressively focused on dealing with noisy speech and ASR robustness issues.

In recent years, it is also worth mentioning the standardization activity called Aurora, initiated within the European Telecommunications Standards Institute (ETSI) and aimed at defining a standard front-end for Distributed Speech Recognition (DSR). The related activity led to the availability and distribution (through ELDA, see www.elda.org) of both simulated and real data corpora, very useful for comparison purposes in research on ASR robustness [67, 110].

In this regard, the NIST group (www.nist.gov/speech) also contributes to the advancement of the state of the art of spoken language processing (speech recognition and understanding). More recently, NIST has extended its activity to other technologies (e.g., speaker identification, speech activity detection, speaker localization, etc.) related to multi-modal interfaces and to tasks, such as the analysis of conferences and meetings, being investigated under the European projects AMI (see www.amiproject.org) and CHIL (see chil.server.de). Related to this, the automatic transcription of distant-talking spontaneous speech represents one of the most difficult tasks to pursue at the state-of-the-art level.

From 2006, a joint effort under the CLEAR evaluation program (see www.clear-evaluation.org) will be conducted to evaluate systems designed to analyze people, their identities, activities, interactions and relationships in human-human interaction scenarios. CLEAR is meant to bring together projects and researchers working on related technologies (both audio and video processing) in order to establish a common international evaluation campaign in this field. Currently, the CLEAR 2006 evaluation is supported by the European Integrated Project CHIL, the US ARDA VACE program and the US National Institute of Standards and Technology (NIST).


References

[1] R. Martin, "Statistical methods for the enhancement of noisy speech," Proc. IEEE Int. Workshop on Acoustic Echo and Noise Control (IWAENC), pp. 1–6, September 2003.

[2] J. Benesty, S. Makino, and J. Chen, Eds., Speech Enhancement, Springer, Berlin, Germany, 2005.

[3] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, Upper Saddle River, NJ, USA, 2001.

[4] N. D. Gaubitch and P. A. Naylor, "Analysis of the dereverberation performance of microphone arrays," Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), pp. 121–124, September 2005.

[5] M. Omologo, M. Matassoni, P. Svaizer, and D. Giuliani, "Microphone array based speech recognition with different talker-array positions," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 227–230, April 1997.

[6] B. van Veen and K. Buckley, "Beamforming: A versatile approach to spatial filtering," IEEE ASSP Magazine, pp. 4–24, April 1988.

[7] L. Griffiths and C. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, January 1982.

[8] W. Herbordt, Combination of Robust Adaptive Beamforming with Acoustic Echo Cancellation for Acoustic Human/Machine Interfaces, Ph.D. thesis, University of Erlangen-Nuremberg, Erlangen, Germany, December 2003.

[9] M. L. Seltzer, Microphone Array Processing for Robust Speech Recognition, Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, USA, July 2003.

[10] M. L. Seltzer, B. Raj, and R. M. Stern, "Likelihood-maximizing beamforming for robust hands-free speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 489–498, September 2004.


[11] S. Neely and J. Allen, "Invertibility of a room impulse response," Journal of the Acoustical Society of America, vol. 66, no. 1, pp. 165–169, July 1979.

[12] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 2, pp. 145–152, February 1988.

[13] W. Putnam, D. Rocchesso, and J. Smith, "A numerical investigation of the invertibility of room transfer functions," Proc. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 249–252, October 1995.

[14] P. A. Naylor and N. D. Gaubitch, "Speech dereverberation," Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), September 2005.

[15] B. Yegnanarayana and P. Satyanarayana Murthy, "Enhancement of reverberant speech using LP residual signal," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 267–281, May 2000.

[16] B. Yegnanarayana, S. R. Mahadeva Prasanna, and K. Sreenivasa Rao, "Speech enhancement using excitation source information," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 541–544, May 2002.

[17] S. M. Griebel and M. S. Brandstein, "Microphone array speech dereverberation using coarse channel modeling," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 201–204, May 2001.

[18] B. W. Gillespie and L. E. Atlas, "Strategies for improving audible quality and speech recognition accuracy of reverberant speech," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 676–679, April 2003.

[19] N. D. Gaubitch, P. A. Naylor, and D. B. Ward, "On the use of linear prediction for dereverberation of speech," Proc. IEEE Int. Workshop on Acoustic Echo and Noise Control (IWAENC), pp. 99–102, September 2003.

[20] T. Nakatani and M. Miyoshi, "Blind dereverberation of single channel speech signal based on harmonic structure," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 92–95, April 2003.

[21] T. Nakatani, B.-H. Juang, K. Kinoshita, and M. Miyoshi, "Harmonicity based dereverberation with maximum a posteriori estimation," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 94–97, October 2005.

[22] T. Nakatani, M. Miyoshi, and K. Kinoshita, "Implementation and effects of single channel dereverberation based on the harmonic structure of speech," Proc. IEEE Int. Workshop on Acoustic Echo and Noise Control (IWAENC), pp. 91–94, September 2003.

[23] S. Amari, S. C. Douglas, A. Cichocki, and H. H. Yang, "Multichannel blind deconvolution and equalization using the natural gradient," Proc. First IEEE Signal Processing Workshop on Signal Processing Advances in Wireless Communications, pp. 101–104, April 1997.

[24] Y. Sato, "A method of self-recovering equalization for multilevel amplitude-modulation," IEEE Transactions on Communications, vol. 6, pp. 679–682, June 1975.

[25] H. Liu, G. Xu, and L. Tong, "A deterministic approach to blind equalization," Proceedings of the 27th Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 751–755, April 1993.

[26] Y. A. Huang and J. Benesty, "Adaptive multi-channel least mean square and Newton algorithms for blind channel identification," Signal Processing, vol. 82, pp. 1127–1138, 2002.

[27] H. Buchner, R. Aichner, and W. Kellermann, "Relation between blind system identification and convolutive blind source separation," Conf. Rec. Joint Workshop for Hands-Free Speech Communication and Microphone Arrays (HSCMA), 2005.

[28] T. Hikichi, M. Delcroix, and M. Miyoshi, "Blind dereverberation based on estimates of signal transmission channels without precise information of channel order," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 1069–1072, 2005.

[29] H. Buchner, R. Aichner, and W. Kellermann, "Trinicon: A versatile framework for multichannel blind signal processing," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), May 2004.


[30] D. Giuliani, M. Matassoni, M. Omologo, and P. Svaizer, "Training of HMM with filtered speech material for hands-free recognition," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 449–452, March 1999.

[31] T. Haderlein, E. Noth, W. Herbordt, W. Kellermann, and H. Niemann, "Using artificially reverberated training data in distant-talking ASR," Proc. TSD, pp. 226–229, 2005.

[32] L. Couvreur and C. Couvreur, "Blind model selection for automatic speech recognition in reverberant environments," Journal of VLSI Signal Processing, vol. 36, no. 2-3, pp. 189–203, March 2001.

[33] C.-H. Lee, C.-H. Lin, and B.-H. Juang, "A study of speaker adaptation of continuous density HMM parameters," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 145–148, April 1990.

[34] C. J. Leggetter and P. C. Woodland, "Speaker adaptation of continuous density HMMs using multivariate linear regression," Proc. ICSLP, vol. 2, pp. 451–454, September 1994.

[35] M. J. F. Gales and S. J. Young, "Robust continuous speech recognition using parallel model combination," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, September 1996.

[36] P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment independent speech recognition," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 733–736, 1996.

[37] C. K. Raut, T. Nishimoto, and S. Sagayama, "Model adaptation by state splitting of HMM for long reverberation," Proc. Interspeech 2005, pp. 277–280, September 2005.

[38] S. Furui, "On the role of spectral transition for speech perception," Journal of the Acoustical Society of America, vol. 80, no. 4, pp. 1016–1025, 1986.

[39] B. Hanson and T. Applebaum, "Robust speaker-independent word recognition using static, dynamic and acceleration features: Experiments with Lombard and noisy speech," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 857–860, April 1990.


[40] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, October 1994.

[41] M. Brandstein and D. Ward (Eds.), Microphone Arrays, Springer, Berlin, Germany, 2001.

[42] Y.-Y. Wang, A. Acero, and C. Chelba, "Is word error rate a good indicator for spoken language understanding accuracy," Proc. of ASRU 2003, pp. 577–580, December 2003.

[43] M. Omologo, M. Matassoni, and P. Svaizer, "Environmental conditions and acoustic transduction in hands-free speech recognition," Speech Communication, vol. 25, pp. 75–95, 1998.

[44] M. Matassoni, M. Omologo, D. Giuliani, and P. Svaizer, "Hidden Markov model training with contaminated speech material for distant-talking speech recognition," Computer Speech and Language, vol. 16, pp. 205–223, 2002.

[45] L.R. Rabiner and M. Sambur, "An algorithm for determining the endpoints of isolated utterances," Bell System Technical Journal, vol. 54, n. 2, pp. 297–315, 1975.

[46] L.F. Lamel, L.R. Rabiner, A.E. Rosenberg, and J.G. Wilpon, "An improved endpoint detector for isolated word recognition," IEEE Trans. on ASSP, vol. 29, pp. 777–785, 1981.

[47] H. Ney, "An optimization algorithm for determining the endpoints of isolated utterances," Proc. of ICASSP, pp. 720–723, Atlanta, 1981.

[48] J.C. Junqua, B. Mak, and B. Reaves, "A robust algorithm for word boundary detection in the presence of noise," IEEE Trans. on SAP, vol. 2, n. 3, pp. 406–412, 1994.

[49] M. Omologo and P. Svaizer, "Acoustic event localization using a cross-power spectrum phase based technique."

[50] P. Svaizer, M. Matassoni, and M. Omologo, "Acoustic Source Location in a Three-dimensional Space using Cross-power Spectrum Phase," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Munich, Germany, April 1997.


[51] R.L. Bouquin-Jeannes and G. Faucon, "Study of a voice activity detector and its influence on a noise reduction system," Speech Communication, vol. 16, pp. 245–254, 1995.

[52] J. Ramirez, J.C. Segura, C. Benitez, A. de la Torre, and A. Rubio, "An effective subband OSF-based VAD with noise reduction for robust speech recognition," IEEE Trans. on SAP, vol. 13, n. 6, November 2005.

[53] L. Armani, M. Matassoni, M. Omologo, and P. Svaizer, "Use of a CSP-based voice activity detector for distant-talking ASR," Proc. of Eurospeech, pp. 501–504, Geneva, September 2003.

[54] D. Macho et al., "Automatic speech activity detection, source localization and speech recognition on the CHIL seminar corpus," Proc. of ICME, 2005.

[55] A. Temko, D. Macho, and C. Nadeu, "Selection of features and combination of classifiers using a fuzzy approach for acoustic event classification," Proc. of INTERSPEECH-2005, pp. 2989–2992.

[56] K. Laskowski, Q. Jin, and T. Schultz, "Crosscorrelation-based Multi-speaker Speech Activity Detection," Proc. of ICSLP, pp. –, Jeju Island, 2004.

[57] S.N. Wrigley, G.J. Brown, V. Wan, and S. Renals, "Speech and crosstalk detection in multichannel audio," IEEE Trans. on SAP, vol. 13, n. 1, January 2005.

[58] S.B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. on ASSP, vol. 28, n. 4, pp. 357–366, 1980.

[59] D. Dimitriadis, P. Maragos, V. Pitsikalis, and A. Potamianos, "Modulation and Chaotic Acoustic Features for Speech Recognition," Control and Intelligent Systems, vol. 30, no. 1, pp. 19–26, 2002.

[60] B.S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journ. of the Acoust. Society of America, vol. 55, n. 6, pp. 1304–1312, 1974.

[61] O. Viikki and K. Laurila, "Cepstral domain segmental feature vector normalization for noise robust speech recognition," Speech Communication, vol. 25, pp. 133–147, 1998.


[62] S. Dharanipragada and M. Padmanabhan, "A nonlinear unsupervised adaptation technique for speech recognition," Proc. of ICSLP, pp. 1269–1272, Beijing, 2000.

[63] F. Hilger and H. Ney, "Quantile based normalization in the acoustic feature space," Proc. of Eurospeech, pp. 1135–1138, Aalborg, 2001.

[64] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," Proc. of Speaker Odyssey conference, June 2001.

[65] H. Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech," J. Acoust. Soc. Am., vol. 87, n. 4, pp. 1738–1752, 1990.

[66] L. Lee and R.C. Rose, "Speaker Normalization Using Efficient Frequency Warping Procedures," Proc. of ICASSP, pp. 353–356, Atlanta, 1996.

[67] D. Macho et al., "Evaluation of a noise-robust DSR front-end on Aurora databases," Proc. of ICSLP, pp. 17–20, Denver, 2002.

[68] X. Cui and A. Alwan, "Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR," IEEE Trans. on SAP, vol. 13, n. 6, November 2005.

[69] P.J. Moreno, B. Raj, and R.M. Stern, "A Vector Taylor Series Approach for Environment Independent Speech Recognition," Proc. of ICASSP, pp. 733–736, Atlanta, 1996.

[70] L. Deng and A. Acero, "Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise," IEEE Trans. on SAP, vol. 12, n. 2, March 2004.

[71] M. Fujimoto and S. Nakamura, "Particle Filter based Non-Stationary Noise Tracking for Robust Speech Recognition," Proc. of ICASSP, vol. I, pp. 257–260, Philadelphia, 2005.

[72] M. Afify and O. Siohan, "Sequential estimation with optimal forgetting for robust speech recognition," IEEE Trans. on SAP, vol. 12, n. 1, January 2004.

[73] S. Chakrabartty, Y. Deng, and G. Cauwenberghs, "Robust speech feature extraction by growth transformation in reproducing kernel Hilbert space," Proc. of ICASSP, vol. 1, pp. 133–136, 2004.


[74] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 24, n. 3, pp. 267–285, 2001.

[75] A. Hagen, A. Morris, and H. Bourlard, "Different weighting schemes in the full combination subbands approach to noise robust ASR," Proc. of ISCA ITRW Workshop on ASR, pp. 175–180, 2000.

[76] A. Morris, A. Hagen, H. Glotin, and H. Bourlard, "Multi-stream adaptive evidence combination for noise robust ASR," Speech Communication, vol. 34, nos. 1-2, pp. 25–40, 2001.

[77] A.S. Bregman, "Auditory Scene Analysis," MIT Press, Cambridge, MA, 1990.

[78] H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on partially corrupted speech," Proc. of ICSLP, pp. 462–465, Philadelphia, 1996.

[79] L.R. Rabiner and B.H. Juang, "Fundamentals of Speech Recognition," Prentice-Hall, NJ, USA, 1993.

[80] K. Daoudi, D. Fohr, and C. Antoine, "Dynamic Bayesian Networks for Multi-Band Automatic Speech Recognition," Computer Speech and Language, vol. 17, pp. 263–285, 2003.

[81] R.P. Lippmann, E.A. Martin, and D.P. Paul, "Multi-style training for robust isolated word speech recognition," Proc. of ICASSP, pp. 709–712, Dallas, 1987.

[82] S. Morii, T. Morii, M. Hashimi, S. Hiraoka, T. Watanabe, and K. Niyada, "Noise robustness in speaker independent speech recognition," Proc. of ICSLP, pp. 1145–1148, Kobe, 1990.

[83] B.A. Dautrich, L.R. Rabiner, and T.B. Martin, "On the effects of varying filter bank parameters on isolated word recognition," IEEE Trans. on ASSP, vol. 31, pp. 793–897.

[84] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, "HMM adaptation using vector Taylor series for noisy speech recognition," Proc. of ICSLP, pp. 1165–1168, Beijing, 2000.

[85] L. Deng, A. Acero, M. Plumpe, and X. Huang, "Large-vocabulary speech recognition under adverse acoustic environments," Proc. of ICSLP, pp. 1203–1206, Beijing, 2000.


[86] A.P. Varga and R.K. Moore, "Hidden Markov Models Decomposition of Speech and Noise," Proc. of ICASSP, pp. 845–848, 1990.

[87] A. Surendran, C.-H. Lee, and M. Rahim, "Maximum likelihood stochastic matching approach to non-linear equalization for robust speech recognition," Proc. of ICSLP, Philadelphia, 1996.

[88] C. Chesta, O. Siohan, and C.-H. Lee, "Maximum a posteriori linear regression for hidden Markov model adaptation," Proceedings of the European Conference on Speech Communication and Technology, vol. 1, pp. 211–214, Budapest, Hungary, 1999.

[89] Q. Huo and C.H. Lee, "Robust speech recognition based on adaptive classification and decision strategies," Speech Communication, vol. 34, pp. 175–194, 2001.

[90] C.H. Lee, "On stochastic feature and model compensation approaches to robust speech recognition," Speech Communication, vol. 25, pp. 29–47, 1998.

[91] S. Sagayama, Y. Yamaguchi, S. Takahashi, and J. Takahashi, "Jacobian approach to fast acoustic model adaptation," Proc. of ICASSP, pp. 835–838, Munich, 1997.

[92] O. Yoshioka, K. Arai, N. Sugamura, and S. Sagayama, "An address data entry system with a multimodal interface including speech recognition," Systems and Computers in Japan, vol. 30, no. 9, pp. 64–73, 1999.

[93] P. Kenny, G. Boulianne, and P. Dumouchel, "Inter-Speaker Correlations, Intra-Speaker Correlations and Bayesian Adaptation," Proceedings of the ISCA ITR-Workshop 2001: Adaptation Methods for Speech Recognition, pp. 21–24, Institut Eurecom, Sophia-Antipolis, France, August 29-30, 2001.

[94] O. Siohan, T. Myrvoll, and C.-H. Lee, "Structural maximum a posteriori linear regression for fast HMM adaptation," Proc. of ASR-2000, pp. 120–127.

[95] M. Saraclar and S. Khudanpur, "Pronunciation Ambiguity vs Pronunciation Variability in Speech Recognition," Proc. of ICASSP, pp. 587–590, Istanbul, 2000.


[96] W.J. Byrne et al., "Pronunciation Modelling Using a Hand-Labelled Corpus for Conversational Speech Recognition," Proc. of ICASSP, pp. 313–316, Seattle, 1998.

[97] E. Fosler-Lussier, "Dynamic Pronunciation Models for Automatic Speech Recognition," PhD Thesis, International Computer Science Institute, Berkeley, 1999.

[98] R. Singh, B. Raj, and R.M. Stern, "Automatic Generation of Subword Units for Speech Recognition Systems," IEEE Trans. on SAP, vol. 10, n. 2, pp. 89–99, February 2002.

[99] H. Strik (Ed.), "Modelling pronunciation variation for automatic speech recognition," Speech Communication, vol. 29, issue 2-4, 1999.

[100] Y. He and S. Young, "Robustness issues in a data-driven spoken language understanding system," Proc. of NAACL Workshop, 2004.

[101] Y.-Y. Wang, A. Acero, and C. Chelba, "Is word error rate a good indicator for spoken language understanding accuracy," Proc. of ASRU, pp. 577–580, 2003.

[102] G. Riccardi and A.L. Gorin, "Stochastic Language Models for Speech Recognition and Understanding," Proc. of ICSLP, pp. 111–114, Sydney, 1998.

[103] Y. Esteve, C. Raymond, F. Bechet, and R. De Mori, "Conceptual Decoding for Spoken Dialog Systems," Proc. of Eurospeech, pp. 617–620, Geneva, 2003.

[104] W. Ward and S. Issar, "Recent improvements in the CMU spoken language understanding system," Proc. of ARPA HLT Workshop, pp. 213–216, Morgan Kaufmann Publishers, Inc.

[105] S. Seneff, "Robust parsing for spoken language systems," Proc. of ICASSP, San Francisco, 1992.

[106] J. Dowding, R. Moore, F. Andry, and D. Moran, "Interleaving syntax and semantics in an efficient bottom-up parser," Proc. of the 32nd Meeting of the Association for Computational Linguistics, pp. 110–116, New Mexico, 1994.

[107] R. Sarikaya, Y. Gao, M. Picheny, and H. Erdogan, "Semantic Confidence Measurement for Spoken Dialogue Systems," IEEE Trans. on SAP, vol. 13, n. 4, July 2005.


[108] S. Young, "Talking to machines (statistically speaking)," Proc. of ICSLP, pp. 9–16, Denver, 2002.

[109] R. De Mori (Ed.), "Spoken Dialogues with Computers," Academic Press, 1998.

[110] H. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," Proc. of ASR-2000, pp. 181–188.

[111] R. Pieraccini, K. Dayanidhi, J. Bloom, J.G. Dahan, M. Philips, B. Goodman, and K.V. Prasad, "Multi-modal conversational systems for automobiles," Communications of the ACM, vol. 47, n. 1, pp. 47–49, 2004.

[112] G. Potamianos and C. Neti, "Audio-visual speech recognition in challenging environments," Proc. of Eurospeech, pp. 1293–1296, Geneva, 2003.

[113] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-Visual Automatic Speech Recognition: An Overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier (Eds.), MIT Press, 2004.

[114] A. Potamianos and P. Maragos, "Time-Frequency Distributions for Automatic Speech Recognition," IEEE Trans. on SAP, vol. 9, pp. 196–200, March 2001.

[115] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker adaptive training," Proc. of ICSLP, pp. 1137–1140, Philadelphia, 1996.

[116] P.C. Woodland, "Speaker adaptation: techniques and challenges," Proc. of ASRU Workshop, pp. –, Keystone, 1999.

[117] M.J.F. Gales, "Maximum Likelihood Linear Transformation for HMM-based Speech Recognition," Computer Speech and Language, vol. 12, Jan. 1998.

[118] T.T. Kristjansson and B.J. Frey, "Accounting for uncertainty in observations: a new paradigm for robust automatic speech recognition," Proc. of ICASSP, pp. –, Orlando, 2002.


[119] M.J.F. Gales, "Cluster adaptive training for speech recognition," Proc. of ICSLP, pp. 1783–1786, Sydney, 1998.

[120] H. Liao and M.J.F. Gales, "Uncertainty decoding for noise robust speech recognition," Tech. Report CUED/FINFENG/TR499, Univ. of Cambridge, 2004.

[121] Z. Zhang et al., "Multi-Sensory Microphones for Robust Speech Detection, Enhancement and Recognition," Proc. of ICASSP, Montreal, 2004.
