
Speech Communication 48 (2006) 493–515

www.elsevier.com/locate/specom

Modeling spontaneous speech variability in professional dictation

Hauke Schramm a,*, Xavier Aubert a, Bart Bakker a, Carsten Meyer a, Hermann Ney b

a Philips Research Laboratories, Weisshausstrasse 2, D-52066 Aachen, Germany
b Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen, University of Technology, D-52056 Aachen, Germany

Received 8 October 2004; received in revised form 23 August 2005; accepted 25 August 2005

Abstract

In this work, we present a model combination approach at the word level that aims to improve the modeling of spontaneous speech variabilities on a highly spontaneous, real life medical transcription task. The technique (1) separates speech variabilities into pre-defined classes, (2) generates speech variability specific acoustic and pronunciation models and (3) properly combines these models later in the search procedure on a word level basis. For efficient integration of the specific acoustic and pronunciation models into the search procedure, a theoretical framework is provided. Our algorithm is a general approach that can be applied to model various speech variabilities. In our experiments, we focused on the variabilities related to filled pauses, rate of speech and speaker accent. Our best system combines six variability specific acoustic and pronunciation models on a word level and achieves a word error rate reduction of 13% relative compared to the baseline. In a number of contrast experiments we evaluated the importance of different components in our system and explored ways to reduce the system complexity.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Automatic speech recognition; Spontaneous speech modeling; Pronunciation modeling; Rate of speech modeling; Filled pause modeling; Model combination

1. Introduction

Spontaneous speech phenomena are an important issue for almost every real life application of automatic speech recognition. It is well known that spontaneous speech, as opposed to read or planned speech, is highly dynamic with respect to the speaking style (e.g. pronunciation variation, rate of speech) and contains various kinds of disfluencies and non speech sounds. In early speech recognition applications like dialog systems (e.g. Aust et al., 1995) spontaneous speech effects have typically not been handled explicitly but had to be captured somehow by the statistical models. Recent research, however, has focused more and more on explicit modeling of spontaneous speech variability on the acoustic, lexical and language model level.



Typically, the modeling of different speech variabilities has been addressed by specific, individually designed techniques (which are discussed in more detail in Section 2). A successful combination of these techniques is, however, in general not trivial and may complicate the system architecture substantially. As a consequence, the number of explicitly addressed speech variabilities in a single system has often been low. Thus, a technique is needed that allows for both individual modeling of various speech variabilities and efficient combination of these models into a single (one-pass) decoder. The technique we propose is a step in this direction. It applies speech variability dependent acoustic and pronunciation modeling and combines the specific models using a lexicon based word level model combination technique. This technique is in general applicable to arbitrary variabilities. In a first step, however, we focused on filled pauses, rate of speech and speaker accent related variability.

Many publications have studied spontaneous speech effects in dialog situations like human–machine or human–human interaction (e.g. Greenberg et al., 1996; Byrne et al., 1998; Rigoll, 2003), but much less has been published about spontaneous monologues like presentations or dictations (e.g. Furui, 2003). This work is concerned with spontaneous medical reports, coming from an interaction free dictation scenario, in which the speakers expect to be transcribed by human transcribers. This important application area for automatic speech recognition has hardly been addressed in recent years. Compared to conversational speech, the spontaneous dictation task exhibits a more professional (dictation) speaking style, larger rate of speech fluctuations and different disfluency characteristics (Peters, 2003; Schramm et al., 2003). Another important difference is the existence of repetitive passages in the dictations, which are known to both parties (speaker and transcriptionist) beforehand. In these parts the speaking style is often sloppy with an extremely high rate of speech.

The remainder of this article is organized as follows. In Section 2 we present an overview of related work. Section 3 describes the generation of specific acoustic and pronunciation models for a number of speech variabilities and their incorporation into the search procedure. Sections 4 and 5 describe the database and the baseline system, respectively. The experimental setup and results are provided in Section 6 and a final summary and conclusions are given in Section 7.

2. Related work

It is well known that a standard Hidden Markov Model framework with mixture densities, standard state tying (Bellegarda and Nahamoo, 1990; Young and Woodland, 1994) and context dependent phone models (Lee and Hon, 1989; Young and Woodland, 1994) may "implicitly" capture speech variability to some extent (Adda-Decker and Lamel, 1998; Jurafsky et al., 2001; Hain, 2002). This includes some level of contextual pronunciation effects, (sub) phone replacements and duration variability and has been shown to allow for recognition accuracies of more than 90% for large vocabulary domains with restricted variability, like read speech (e.g. Schwartz et al., 2004). In spontaneous speech, however, the variation with respect to pronunciation and duration is much stronger and therefore typically overstrains the compensation capabilities of a standard framework (e.g. Greenberg et al., 1996; McAllaster et al., 1998; Byrne et al., 1998). This is even emphasized in case of additional speaker accents or by disfluency phenomena which are especially prevalent in spontaneous speech. Therefore, in recent years, researchers have proposed various techniques for an explicit modeling of spontaneous speech variabilities. This section provides a review of the literature for major research topics in this field: pronunciation modeling (Section 2.1), rate of speech modeling (Section 2.2), disfluency modeling (Section 2.3) and speaker accent modeling (Section 2.4). Notably, most publications focus only on a single speech variability, although natural speech contains a blend of various types. The question of how to efficiently combine explicit modeling approaches for different variabilities has rarely been investigated. Section 2.5 discusses the hidden mode techniques which are related to this aspect.

2.1. Pronunciation modeling

The tremendous variability of pronunciations in spontaneous speech has been demonstrated in a number of studies (e.g. Greenberg et al., 1996). It has been argued that the large mismatch between the observed pronunciation variability and the mostly canonical phonetic transcriptions in the pronunciation lexicon is a major reason for the weak performance of spontaneous speech recognition systems (e.g. Byrne et al., 1998; McAllaster et al., 1998). Consequently, this field has been studied intensively in recent years (see Strik and Cucchiarini, 1999, for an overview).


Some researchers tried to treat pronunciation variability at the sub-lexical level by introducing sophisticated state tying mechanisms (Saraclar et al., 2000; Yu and Schultz, 2003). However, while these techniques improve the treatment of substitution effects they are not well suited to handle other important pronunciation phenomena like phoneme deletions and insertions (Fosler-Lussier and Morgan, 1998; Jurafsky et al., 2001). Thus, lexicon-based or "explicit" pronunciation modeling techniques (e.g. Riley, 1991; Aubert and Dugast, 1995; Lamel and Adda, 1996; Finke and Waibel, 1997, among many others) are more popular in the speech recognition community.

Explicit pronunciation modeling, as applied in the Philips spontaneous speech system, deals with the generation of alternative pronunciations and their efficient incorporation into the search procedure (Schramm and Aubert, 2000). Several techniques for generating alternative pronunciations from canonic baseforms have been proposed in recent years. These techniques may be classified into knowledge based methods (e.g. Cohen, 1989; Finke and Waibel, 1997; Kessens et al., 1999) which exploit prior phonetic or linguistic knowledge and data driven methods (e.g. Bahl et al., 1991; Sloboda and Waibel, 1996; Fosler-Lussier, 1999; Amdal, 2002), where pronunciation variation is learned automatically from the database.

A popular data driven technique for learning pronunciation variability is a free or slightly constrained (e.g. by using a bigram phone grammar) phone recognition. However, phonetic transcriptions obtained with this technique are typically quite noisy. One way to overcome this problem was proposed by Fosler-Lussier (1999) and Riley et al. (1999), who derived decision tree based rules for generating smoothed alternative pronunciations from baseforms.

Another way of automatically learning robust phonetic transcriptions from data is to utilize a standard Viterbi alignment with a free choice between alternative pronunciations. Increasing the number of pronunciation alternatives, for example by making each phoneme optional, allows to successively relax the alignment constraint in order to learn additional variability. This approach was applied by Kessens (2002) for a medium-size vocabulary task to derive phoneme deletion rules which were used to generate additional pronunciations. A similar technique is used in this work for the modeling of fast speech. However, instead of using rules, our approach directly employs the full set of generated variants.

Incorporating a large number of alternative pronunciations into the lexicon requires a sophisticated strategy for efficiently handling these variants during the search process. This includes search aspects as well as the weighting scheme for pronunciation alternatives and primarily aims at controlling the increased lexical confusability of words in consequence of the large pronunciation set. An early work in this field employed weighted pronunciation networks to incorporate pronunciation variability into the search (Riley, 1991). Another more popular pronunciation treatment strategy utilizes scaled pronunciation unigram prior probabilities (Peskin et al., 1997) in combination with a maximum approximation over the pronunciation sequences (e.g. Fukada et al., 1998). In (Schramm and Aubert, 2000) a search technique is proposed that approximates the sum over the search paths of linguistically equivalent alternative pronunciation sequences during a one-pass search procedure. This extends other related multi-pass summation approaches (e.g. Wessel et al., 1998; Schlueter et al., 2001) towards an explicit incorporation of alternative pronunciations and enables single-pass processing. The pronunciation summation is also applied in (Evermann and Woodland, 2000) where word posteriors are estimated which represent the sum over all pronunciation variants instead of the most likely alternative only.

In this article the one-pass pronunciation summation technique, proposed in (Schramm and Aubert, 2000), will be extended to class specific pronunciation alternatives.

2.2. Rate of speech modeling

Another speech variability that has often been analyzed in the literature is rate of speech. Since spontaneous speech, and especially the medical dictation task addressed in this work, is characterized by a very dynamic rate of speech, a careful modeling of this variation type appears to be essential for making progress in this field (e.g. Pallett et al., 1994; Pfau and Ruske, 1998; Nanjo et al., 2001).

A number of effects of this variability on the acoustic–phonetic properties of speech have been reported (see Wrede, 2002, for an overview). Rate of speech variation is not only reflected in the number and length of pauses and the number of (e.g. phone) units per second but also in the accuracy of articulation. Thus, rate of speech dynamics can induce significant changes in pronunciations and their probabilities (Fosler-Lussier and Morgan, 1998) and should therefore be explicitly addressed in acoustic and pronunciation modeling. Various attempts to improve rate of speech modeling have been published in recent years which may be divided into explicit and implicit modeling approaches. Explicit strategies use rate of speech measures (see Wrede, 2002, for an overview) to classify a complete utterance with respect to its rate of speech (e.g. "slow", "medium", "fast") prior to the search. After this step, rate specific acoustic models (e.g. Mirghafori et al., 1995; Martinez et al., 1998; Pfau and Ruske, 1998, among others), or rate specific acoustic features (e.g. Richardson et al., 1999) are used to compensate for rate of speech variability. Rate specific acoustic models have been studied by several authors and have been shown to improve the recognition performance (Mirghafori et al., 1995; Pfau and Ruske, 1998; Martinez et al., 1998). It was, however, also found that these modeling approaches require a reliable rate classifier to successfully combine the different models into a single system. Further drawbacks of explicit rate of speech modeling techniques are that (1) an additional rate classification step is necessary prior to the search and that (2) these techniques are not able to model local (e.g. word level) rate of speech fluctuations. The latter aspect is particularly important for long utterances.

The implicit rate of speech modeling strategy, proposed in (Zheng et al., 2000), overcomes these problems by using rate specific alternative pronunciations which are linked to rate specific acoustic models via rate specific phone inventories. This application of the so-called hidden mode concept (cf. Section 2.5) allows the decoder to track local (i.e. word level) speaking rate fluctuations. A previous rate estimation step is not necessary since the decoder simply decides for the most appropriate pronunciation alternative (and acoustic model) based on the overall likelihood.

In our work, the implicit rate modeling technique is incorporated into a more general framework with a higher number of unknown variables. This includes a search framework which applies a probabilistic modeling for sequences of rate specific class labels and a one-pass sum approximation technique for rate specific pronunciation alternatives.

2.3. Filled pause modeling

Disfluencies such as filled pauses (e.g. "uh", "uhm"), repeated words and self repairs are another important variability in spontaneous speech (cf. Shriberg, 1994). The most frequent disfluency type is the filled pause. In the spontaneous dictation database, used in this work, this event has a higher unigram count than any "regular" word. Therefore, filled pause events can be expected to be an important source of confusion for automatic speech recognition if not handled appropriately.

In (Gauvain et al., 1997) separate acoustic models for filled pauses and "regular" speech were applied for the Broadcast News task. In (Liu et al., 1998), it was proposed to use three particularly long lexical pronunciations for filled pause in order to reduce the lexical confusability with similar words. The impact of modeling filled pause events in the acoustic and language model was also investigated in (Rose and Riccardi, 1999). Here it was shown for telephone based natural language understanding tasks that explicit filled pause modeling is favourable especially for real-time conditions.

Our approach incorporates filled pause treatment into the more general speech variability dependent modeling framework. This includes a systematic data driven acoustic and pronunciation modeling for filled pause duration and pronunciation variability.

2.4. Speaker accent modeling

Like many other real life applications, the Philips medical dictation task contains native speech as well as a number of foreign accents. The existence of foreign accent speech further increases the variability of the spontaneous speech data and may degrade the recognition performance substantially if not handled appropriately. Various publications have addressed the problem of building appropriate acoustic and pronunciation models for non native speakers. However, less has been said about how to efficiently combine these models with native speech models in a one-pass search procedure.

A straightforward way of dealing with foreign accents is to simply train the acoustic model on this type of speech data. In (Wang et al., 2003) it was shown for German-accented English speech that this outperforms a regular native model (trained on 34 h of speech) even if only little non native training data (52 min) is available. Alternatively, a native speech acoustic model can be adapted to non native speakers by applying some additional forward–backward iterations with accented speech (Mayfield Tomokiyo and Waibel, 2001) or by using standard acoustic speaker adaptation techniques (e.g. Huang et al., 2000; Mayfield Tomokiyo and Waibel, 2001).

Another approach applies the hidden mode concept (cf. Sections 2.2 and 2.5) to the problem of speaker accent modeling (Sproat et al., 2004). Here, it was tested to use two types of accent specific models in parallel during the decoding for a Mandarin speech recognition task. However, it turned out that applying the more appropriate allophone set consistently throughout the sentence led to better results. Accent dependent allophone sets were also used by Humphries et al. (1996), He and Zhao (2003) and Lee et al. (2003). In the approach of Lee et al. (2003) the prior probabilities of accent specific pronunciation variants with corresponding allophone sets are adapted to a specific speaker profile. Accent specific pronunciations were also used by various other research groups (e.g. Humphries et al., 1997; Huang et al., 2000; Goronzy et al., 2001; Livescu and Glass, 2003). The techniques used for generating these pronunciations are usually similar to the standard approaches (cf. Section 2.1).

In our work, a very basic accent modeling approach, using two specific acoustic models for native and non native speech, will be incorporated into the more general speech variability modeling framework.

2.5. Hidden context variable dependent modeling

The major goal of our work is to develop and investigate a unified framework for efficient handling of multiple sources of variability. This shall be achieved by using extensions of the methods, described in Sections 2.1–2.4, to generate a number of speech variability specific acoustic and pronunciation models. These models, which depend on a set of hidden context variables, shall be incorporated "appropriately" into a one-pass search procedure.

The first applications with hidden context variable dependent acoustic modeling were gender dependent systems (e.g. Hwang and Huang, 1993). Here, the unknown gender variable has to be decoded along with the word sequence during the search. While some systems constrained this variable to be constant throughout the sentence, others allowed arbitrary changes from word to word or phone to phone. This approach has been generalized to an arbitrary set of unknown context variables in (Ostendorf et al., 1996). This technique, known as hidden speaking mode modeling, aims at treating pronunciation changes in dependence of hidden context variables. In (Bates and Ostendorf, 2002; Ostendorf et al., 2003) this approach is applied to prosody dependent modeling using prosodic variables to characterize pronunciation changes in a dynamic pronunciation model. In (Chen and Hasegawa-Johnson, 2004) a prosody dependent speech recognizer is described that applies a prosody dependent language, pronunciation and acoustic model. Here, every word in the dictionary has a set of prosody dependent pronunciations with prosody dependent allophones. The use of a prosody label bigram language model is proposed in (Chen and Hasegawa-Johnson, 2003). The hidden context variable dependent modeling approach has also been used for rate of speech modeling and speaker accent modeling. This is discussed in Sections 2.2 and 2.4, respectively.

Our work extends existing hidden mode approaches by using a novel summation technique for combining the mode dependent acoustic and pronunciation models into the one-pass search procedure. In addition to that, the number of hidden variables in our system is, to the best of our knowledge, larger than in other previously published systems.

3. Modeling spontaneous speech variabilities

The concept of speech variability class dependent acoustic modeling aims at separating acoustic observations and their trajectories at word level with respect to specific speech variability classes. However, this technique is not well suited for modeling pronunciation variation effects as well. Especially phoneme deletions and insertions cannot sufficiently be represented using the topology of a standard left to right HMM. Therefore, our framework for modeling spontaneous speech variabilities combines class dependent acoustic modeling with class dependent pronunciation modeling to also account for speech variability dependent pronunciation effects (Section 3.2). In the next sections, we describe the implementation of this approach with respect to data preparation, generation of alternative pronunciations, training and search.


3.1. Speech variability dependent acoustic modeling

3.1.1. Data preparation

A quantitative analysis of the data (database statistics are given in Table 2) has shown that the most important issues in modeling spontaneous speech phenomena on our database are rate of speech, filled pause and speaker accent related variability (Schramm et al., 2003). Therefore, we focused our work on these variabilities.

Based on a data analysis and on the availability of training data, we introduced up to three classes for rate of speech modeling ("slow", "medium", "fast"), two classes for speaker accent ("native", "non native") and two classes to distinguish between filled pauses and regular (i.e. non filled pause) speech ("filled pause", "regular speech"). Table 1 gives an overview of the considered speech variability types and the associated class labels. For the sake of completeness, a baseline class, covering all speech variabilities, has been added as well.

Table 1
Considered speech variability types and associated classes

Speech variability type    Associated class labels
All                        Baseline (BL)
Rate of speech             Slow (S), medium (M), fast (F)
Filled pause               Filled pause (FP), regular speech (RS)
Speaker accent             Native (N), non native (NN)


In order to train the respective models, the words in the training corpus have to be labeled with respect to the considered speech variability classes. In our case, information about speaker accent and filled pauses is contained in the data annotation while word level rate of speech information has to be determined automatically. We applied a word level rate of speech measure, which has been proposed by Zheng et al. (2000). It always has values in [0, 1], whereas a value close to 0 (1) indicates fast (slow) speech, respectively. Using this measure, each word in the training corpus was assigned an individual rate of speech value. The histogram of all observed values is presented in Fig. 1. It demonstrates that a typical rate of speech does not exist in this data and therefore indicates demand for a better representation of the large rate of speech variance in the models.

After assigning a rate of speech value to each word in the training corpus, we determined the class boundaries by computing the rate of speech histogram and dividing it into two or three parts with an equal number of entries. Finally, each word was labeled with respect to its class affiliation. In case of considering multiple speech variabilities (e.g. rate of speech and accent), each word in the training corpus has to be assigned a list of class affiliations. The usage of this class label information in training is discussed in Section 3.1.2.
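To make the equal-count class boundary computation concrete, the following sketch splits word level rate of speech values into classes using quantiles. It assumes the per-word rate values have already been computed with the measure of Zheng et al. (2000); the function name and class labels are illustrative only and do not reflect the original implementation.

import numpy as np

def assign_ros_classes(ros_values, n_classes=3):
    """Split word-level rate-of-speech values (in [0, 1], close to 0 = fast,
    close to 1 = slow) into classes with an equal number of entries."""
    ros = np.asarray(ros_values, dtype=float)
    # Class boundaries = quantiles dividing the histogram into equal-count parts.
    boundaries = np.quantile(ros, [i / n_classes for i in range(1, n_classes)])
    names = {2: ["F", "S"], 3: ["F", "M", "S"]}[n_classes]
    # np.searchsorted maps each value to the index of its class.
    return [names[idx] for idx in np.searchsorted(boundaries, ros)]

# Example: label a toy set of per-word rate values with three classes.
labels = assign_ros_classes([0.10, 0.40, 0.90, 0.30, 0.70, 0.05], n_classes=3)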

Fig. 1. Histogram of rate of speech values (see Section 3.1.1 for details on the rate of speech measure) on word level, as observed in the training corpus.

Fig. 2. Strategies for using class label information in training.



3.1.2. Strategies for training data usage

There are different ways to use the variability class label information in the training procedure. This is illustrated in Fig. 2 for a simple example with two rate of speech classes ("slow" and "fast") and two speaker accent classes ("native" and "non native"). In the first "label combination" approach the different acoustic models are trained on specific, non overlapping parts of the training data, which represent the possible combinations of the considered variability labels (see 1. in Fig. 2). This separates speech variabilities well but may introduce sparse data problems, especially for infrequent label combinations or when multiple variability types are addressed. With less specific variability classes, the sparse data problem can be avoided. Therefore, a more convenient strategy consists in treating the variability types individually, that is, independent of the other types. This "single label" approach (see 2. in Fig. 2) has been used in our experiments. Here, each variability (e.g. rate of speech) specific model set is trained on the whole training data set according to the corresponding labels (e.g. slow, medium, fast). This is equivalent to using a copy of the training data for each variability type as illustrated in Fig. 2. Note that the "single label" training strategy is much more expensive in terms of computation time, since the training data has to be processed several times.
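The difference between the two strategies can be illustrated with a small data-partitioning sketch. The tuple layout of the toy corpus is an assumption made purely for illustration; it does not reflect the actual Philips training pipeline.

from collections import defaultdict

# Toy training corpus: (word, accent label, rate-of-speech label).
corpus = [("report", "native", "fast"),
          ("report", "non_native", "slow"),
          ("the", "native", "slow")]

# 1. "Label combination": one disjoint data subset per label combination.
combination_subsets = defaultdict(list)
for word, accent, rate in corpus:
    combination_subsets[(accent, rate)].append(word)

# 2. "Single label": one copy of the full data per variability type,
#    each copy labeled only with the labels of that type.
accent_subsets = defaultdict(list)
rate_subsets = defaultdict(list)
for word, accent, rate in corpus:
    accent_subsets[accent].append(word)   # trains accent-specific models
    rate_subsets[rate].append(word)       # trains rate-specific models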

3.1.3. Training procedure

In this section we address the technical realization of the speech variability dependent acoustic model training. Basically, the technique exploits the system's capability for handling alternative pronunciations. In the traditional sense, the alternative pronunciations of a word are a set of phoneme sequences, specified by the lexicon, which represent the word's possible realizations. This concept can be extended by introducing speech variability class specific pronunciations. Different from an alternative pronunciation in the traditional sense, a class specific alternative pronunciation exhibits an (initially) unchanged phoneme sequence but addresses a different class specific acoustic model by using respective class specific phoneme symbols.

In the following example two class specific alternative pronunciations of "THE" are shown:

THE            dh uh
THE[class1]    dh_1 uh_1

The standard phoneme symbols in this example ("dh" and "uh") address the standard acoustic model while the new phonemes ("dh_1" and "uh_1") refer to the acoustic model of class 1.

Having a training corpus with speech variability class labels on word level, this concept can be used to train variability specific acoustic models. The training procedure, illustrated in Fig. 3, includes the following steps:

1. Introduce a specific set of phoneme symbols for each speech variability class.
2. For each baseform and alternative pronunciation in the lexicon, generate new class specific pronunciations by replacing the standard phonemes with the new symbols.
3. Map each training word with its pronunciation variant label and its class label to the corresponding phoneme sequence using class specific phoneme symbols.
4. Perform the standard training procedure.

Fig. 3. Introduction of class specific acoustic models.
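The following sketch illustrates steps 1–3 of the training procedure above for a toy lexicon. The suffix convention ("dh" becomes "dh_1" for class 1) follows the example given earlier; the data layout and all function names are hypothetical.

def make_class_lexicon(lexicon, class_ids):
    """Steps 1+2: derive class-specific pronunciations by suffixing each
    phoneme symbol with the class id, e.g. 'dh' -> 'dh_1' for class 1.
    lexicon: dict mapping word -> list of pronunciations (phoneme lists)."""
    class_lexicon = {}
    for word, prons in lexicon.items():
        for cid in class_ids:
            class_lexicon[f"{word}[class{cid}]"] = [
                [f"{ph}_{cid}" for ph in pron] for pron in prons
            ]
    return class_lexicon

def map_training_word(word, variant_idx, class_id, class_lexicon):
    """Step 3: map a labeled training word to its class-specific phoneme
    sequence; variant_idx selects the pronunciation variant."""
    return class_lexicon[f"{word}[class{class_id}]"][variant_idx]

# Example for the entry shown above.
lex = {"THE": [["dh", "uh"]]}
class_lex = make_class_lexicon(lex, class_ids=[1])
phonemes = map_training_word("THE", 0, 1, class_lex)   # -> ['dh_1', 'uh_1']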

Note that the different class specific acoustic models are completely independent from each other if no cross class acoustic model parameter tying is applied. This allows to use individual model structures (e.g. HMM topologies or time distortion penalties) for the different acoustic models.

3.2. Speech variability dependent pronunciation modeling

Due to the very different character of speech variabilities (e.g. rate of speech versus speaker accent) it can be expected that the corresponding pronunciation phenomena differ as well. Examples for this are phoneme deletion effects in fast speech (Bernstein et al., 1986) or systematic phoneme substitutions in non native speech (Goronzy et al., 2001). This calls for the introduction of class specific pronunciation models as a consequent next step following the class specific acoustic models. Very specific sets of pronunciations can be used for the different classes which allows to adjust the pronunciation modeling individually to the different variability types. Note that some pronunciation phenomena may also be treated by an appropriate acoustic model. An example for this are phonemes with a particularly long duration, like in filled pauses ("ah ah m"). This effect may be treated (1) in the pronunciation model by duplicating the respective phoneme in the lexicon or (2) in the acoustic model by using adequate time distortion penalties or longer HMM topologies. In this work, the former approach was tested since it can be applied without substantial changes in the acoustic model structure or system architecture.

Below, we explain the generation of specific alternative pronunciations for the filled pause class and the rate of speech classes (Sections 3.2.1 and 3.2.2, respectively). Furthermore, we discuss the combination of speech variability specific acoustic and pronunciation models and their incorporation into the training procedure (Sections 3.2.3 and 3.2.4, respectively). Section 3.2.4 also discusses how to estimate unigram prior probabilities for the speech variability dependent pronunciations.

3.2.1. Filled pause pronunciations

A careful pronunciation modeling of filled pause seems justified, as this event has the highest unigram count in our professional dictation task. We observed a substantial variability of filled pause occurrences with respect to both phonemic realization and duration. Therefore, we manually designed 64 alternative regular and explicitly lengthened pronunciations for filled pause as illustrated in this example:

FP      ah
FP/1    ah ah
...
FP/6    ah m
FP/7    ah ah m
...


As shown above, longer pronunciations have been generated by simply duplicating phonemes. This technique enforces a longer minimum length realization without changing the basic HMM structure.
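A mechanical way to obtain such lengthened variants is sketched below. Note that the 64 filled pause pronunciations used in this work were designed manually; the sketch only illustrates the phoneme duplication idea, and the base realizations and repetition counts are assumptions.

def lengthened_variants(pron, max_copies=3):
    """Lengthen a pronunciation by repeating its first phoneme, e.g.
    ['ah', 'm'] -> ['ah', 'ah', 'm'], ['ah', 'ah', 'ah', 'm']. This enforces
    a longer minimum-length realization without changing the HMM structure."""
    return [[pron[0]] * n + pron[1:] for n in range(2, max_copies + 1)]

# Hypothetical base realizations of the filled pause event.
base_realizations = [["ah"], ["ah", "m"]]
filled_pause_lexicon = []
for base in base_realizations:
    filled_pause_lexicon.append(base)                         # regular variant
    filled_pause_lexicon.extend(lengthened_variants(base))    # lengthened variants
# -> [['ah'], ['ah','ah'], ['ah','ah','ah'], ['ah','m'], ['ah','ah','m'], ['ah','ah','ah','m']]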

3.2.2. Rate of speech specific alternative pronunciations

A significant part of our professional dictation data consists of very fast speech. As fast speakers tend to phonetically reduce pronunciations (Bernstein et al., 1986), phoneme deletion effects are an important issue for this task. A simple way to account for deletion effects is to model them lexically by introducing respective pronunciations. We applied the following data driven technique to generate pronunciations with systematic phoneme deletions from the baseline lexicon (cf. Section 5):

1. Identify the most frequent pronunciation ("baseform") of each word by using Viterbi aligned training data.
2. Generate new, shorter pronunciations from the baseform pronunciation by systematically leaving out its phonemes, one at a time.

For example, the following pronunciations have been generated from the transcription of "adhesive" (baseform: "aa d h ee z ih v"): d h ee z ih v, aa h ee z ih v, ..., aa d h ee z ih.

The restriction to single phoneme deletions of course limits the potential of the proposed technique since it does not allow for a treatment of longer pronunciation reductions. Technically, the generation of pronunciations with multiple phoneme deletions could be easily achieved by simply iterating the described procedure several times. This has, however, not been investigated in order to keep the pronunciation set (and search space) manageable.
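A minimal sketch of step 2, generating single phoneme deletion variants from a baseform, is shown below; it reproduces the "adhesive" example from the text. The function name is hypothetical.

def single_deletion_variants(baseform):
    """Generate shorter pronunciations by leaving out one phoneme at a
    time from the baseform (single deletions only)."""
    return [baseform[:i] + baseform[i + 1:] for i in range(len(baseform))]

# Baseform of 'adhesive' as given in the text.
baseform = ["aa", "d", "h", "ee", "z", "ih", "v"]
variants = single_deletion_variants(baseform)
# First and last variants: 'd h ee z ih v' and 'aa d h ee z ih'.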

3.2.3. Lexical combination of speech variability specific acoustic and pronunciation models

The incorporation of variability specific acoustic and pronunciation models into a single system is simple, since both techniques are realized by an extension of the set of alternative pronunciations. Therefore, if consistent phoneme sets are used, the techniques can be combined by simply joining the respective pronunciation sets into one lexicon. This is illustrated in the following example where standard pronunciation models are combined with speech variability specific (standard) pronunciations and additional class specific variants:

THE              dh uh        % baseform pronunciation
THE/1            dh ee        % alternative pronunciation
THE[class1]      dh_1 uh_1    % fast speech baseform
THE[class1]/1    dh_1 ee_1    % fast speech alternative pronunciation
THE[class1]/2    dh_1         % additional fast speech alternative pronunciation
THE[class1]/3    uh_1         % additional fast speech alternative pronunciation

In the remainder of this article, we simply refer to the combined set of alternative pronunciations as "variability specific pronunciations".

3.2.4. Incorporating variability specific pronunciations into the training

In this section, we discuss how to perform a training with variability specific pronunciations. A first step in a standard training procedure is to phonetically transcribe the training text according to the lexicon. This is usually done with a Viterbi alignment that specifies which pronunciation from a given set is realized at a given word position. When applying variability specific pronunciations, however, the set of selectable pronunciations at each word position does no longer depend only on the current word identity, but also on the current speech variability class. This is illustrated in Fig. 4. Both the current pronunciation realization $v_i$ and the number of possible variants $m$ depend on the current word and variability class identity $w_n$ and $c_n$. Here, $n$ denotes the current word position in the sentence.

Fig. 4. Class dependent Viterbi alignment.


After generating class specific pronunciation labels for the training data, the transcription can be used to perform a standard training procedure. In addition to that, the script with the pronunciation and class label information is used to estimate prior probabilities for the combination of a pronunciation $v_i$ and a variability class $c_j$, given the word $w$: $p(v_i, c_j \mid w)$. This is achieved by applying the maximum likelihood technique which sets the probabilities to the relative frequencies of the pairs $(v_i, c_j)$ in the labeled training data. Note that the priors are normalized, i.e. $\sum_{v,c} p(v, c \mid w) = 1$ for all $w$.
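The maximum likelihood estimate amounts to counting (pronunciation, class) pairs per word in the class labeled alignment output and normalizing the counts. A small sketch under an assumed data layout (an iterable of word/pronunciation/class triples) is given below; it is not the original implementation.

from collections import Counter, defaultdict

def estimate_priors(aligned_corpus):
    """Estimate p(v, c | w) as relative frequencies of (pronunciation, class)
    pairs observed in the class-dependent Viterbi alignment output.
    aligned_corpus: iterable of (word, pronunciation_id, class_label).
    Returns dict: word -> {(pronunciation_id, class_label): probability},
    normalized so the probabilities sum to 1 for every word."""
    counts = defaultdict(Counter)
    for word, pron_id, class_label in aligned_corpus:
        counts[word][(pron_id, class_label)] += 1
    priors = {}
    for word, pair_counts in counts.items():
        total = sum(pair_counts.values())
        priors[word] = {pair: n / total for pair, n in pair_counts.items()}
    return priors

# Example: two observations of 'THE' in the aligned training data.
priors = estimate_priors([("THE", 0, "fast"), ("THE", 0, "slow")])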

3.3. Search

In the following, we discuss the question of how to efficiently incorporate various alternative, variability specific pronunciations into a one-pass search procedure. In order to control the lexical confusability, introduced by the potentially vast number of pronunciations, this integration should incorporate some prior knowledge about sequences of pronunciations and variability class labels. This may limit the number of unlikely word level changes between different models (e.g. from native to non native) and thus may help to focus and stabilize the beam search.

Let us now discuss the integration of class specific alternative pronunciations into Bayes' decision rule. Let $w_1^N = w_1, w_2, \ldots, w_N$ be a hypothesized word sequence and $x_1^T = x_1, x_2, \ldots, x_T$ denote the input feature vectors. We further use $v_1^N = v_1, v_2, \ldots, v_N$ to denote one possible sequence of pronunciations related to the words $w_1^N$, both sequences being linguistically equivalent. Each pronunciation $v_n$ in $v_1^N$ is assigned a speech variability class label $c_n$, taken from the set of training labels $A$. For example, one set of variability class labels used in our experiments is given by

$$A = \{\mathrm{FilledPause}, \mathrm{Slow}, \mathrm{Medium}, \mathrm{Fast}, \mathrm{Native}, \mathrm{NonNative}\}$$

The sequence of labels is denoted by $c_1^N = c_1, c_2, \ldots, c_N$. Starting out from Bayes' decision rule and summing over all $c_1^N$ and $v_1^N$ we obtain:

$$\hat{w}_1^N = \arg\max_{w_1^N} \Pr(w_1^N \mid x_1^T)$$
$$= \arg\max_{w_1^N} \sum_{c_1^N} \sum_{v_1^N} \Pr(w_1^N, v_1^N, c_1^N \mid x_1^T)$$
$$= \arg\max_{w_1^N} \sum_{c_1^N} \sum_{v_1^N} \Pr(w_1^N, v_1^N, c_1^N) \cdot \Pr(x_1^T \mid w_1^N, v_1^N, c_1^N)$$
$$= \arg\max_{w_1^N} \sum_{c_1^N} \sum_{v_1^N} \Pr(w_1^N) \cdot \Pr(v_1^N \mid w_1^N, c_1^N) \cdot \Pr(c_1^N \mid w_1^N) \cdot \Pr(x_1^T \mid w_1^N, v_1^N, c_1^N)$$

Using the following model assumptions

1. Trigram language model

$$\Pr(w_1^N) = \prod_{n=1}^{N} p(w_n \mid w_{n-2}, w_{n-1})$$

2. Unigram pronunciation model

$$\Pr(v_1^N \mid w_1^N, c_1^N) = \prod_{n=1}^{N} p(v_n \mid w_n, c_n)$$

3. Class label sequence model

$$\Pr(c_1^N \mid w_1^N) = \prod_{n=1}^{N} p(c_n \mid c_1^{n-1})$$

4. Acoustic model

$$\Pr(x_1^T \mid w_1^N, v_1^N, c_1^N) = \max_{t_1^N} \prod_{n=1}^{N} p\!\left(x_{t_{n-1}(v_1^N, c_1^N)+1}^{t_n(v_1^N, c_1^N)} \,\middle|\, v_n, c_n\right)$$

leads to

$$\hat{w}_1^N = \arg\max_{w_1^N} \sum_{c_1^N} \sum_{v_1^N} \max_{t_1^N} \prod_{n=1}^{N} \left\{ p(w_n \mid w_{n-2}, w_{n-1}) \cdot p(v_n \mid w_n, c_n) \cdot p(c_n \mid c_1^{n-1}) \cdot p\!\left(x_{t_{n-1}(v_1^N, c_1^N)+1}^{t_n(v_1^N, c_1^N)} \,\middle|\, v_n, c_n\right) \right\} \quad (1)$$

Here, the Viterbi approximation is carried out on state sequences and leads to a set of "optimal" word boundary points $t_1^N$ (with $t_0 \equiv 0$). Furthermore, it has been made explicit that, in general, the start and end point of each word, $t_{n-1}+1$ and $t_n$, depends on the whole sequences $v_1^N$ and $c_1^N$.

Given our model assumptions, the integration of multiple variability specific acoustic and pronunciation models into Bayes' decision rule introduces an additional knowledge source (apart from the well known pronunciation model): a sequence model for variability class label sequences $p(c_n \mid c_1^{n-1})$. This term can be used to incorporate prior knowledge about valid sequences of variability class labels in order to narrow the search space. By restricting the variability class label history to zero, one or two labels (uni-, bi-, or trigram), a trainable model structure can be achieved. In our experiments, however, we neglected the dependence on $c_1^{n-1}$ completely and worked with the joint probability $p(v_n, c_n \mid w_n)$ instead of $p(c_n \mid c_1^{n-1}) \, p(v_n \mid w_n, c_n)$. This prior can be robustly estimated on the training corpus using the standard maximum likelihood technique (Section 3.2.4).

The practical implementation of Eq. (1) in a standard left to right search framework is discussed in the remainder of this section. In recent years, two approaches have been shown to be viable for incorporating the pronunciation model into the search procedure: the Viterbi approximation (e.g. Fukada et al., 1998) and the sum approximation (Schramm and Aubert, 2000). These approaches are now extended towards the incorporation of the class label sequence model.

3.3.1. Maximum approximation

A way of realizing Eq. (1) in the left to right search framework is to apply the Viterbi approximation on the pronunciation and class label sequences $v_1^N$ and $c_1^N$ as well instead of summing up. Analogous to the language model factor $\alpha$, a scaling factor $\beta$ for the joint probability $p(v_n, c_n \mid w_n)$ is introduced as a heuristic to control the influence of the new term with respect to both acoustic and language modeling:

$$\hat{w}_1^N = \arg\max_{w_1^N} \max_{c_1^N} \max_{v_1^N} \max_{t_1^N} \prod_{n=1}^{N} \left\{ p(w_n \mid w_{n-2}, w_{n-1})^{\alpha} \cdot p(v_n, c_n \mid w_n)^{\beta} \cdot p\!\left(x_{t_{n-1}(v_1^N, c_1^N)+1}^{t_n(v_1^N, c_1^N)} \,\middle|\, v_n, c_n\right) \right\} \quad (2)$$

3.3.2. Sum approximation

When looking for a more exact approximation technique, preserving the summation, care must be taken of the possible dependence of the word boundaries $t_n$ on the whole pronunciation and class label sequences $v_1^N$ and $c_1^N$. An exact fulfilment of this constraint would probably require some kind of word graph rescoring approach. A relaxation, allowing for an approximation of the sum in the left to right Viterbi search framework, can however be achieved by (cf. Schramm and Aubert, 2000):

• Assuming that the $t_n$ depend only on the immediate neighboring pronunciations $v_{n-1}$ and $v_n$ and class labels $c_{n-1}$ and $c_n$. This is similar to the word pair approximation technique (Aubert and Ney, 1995). Or by
• assuming that the $t_n$ depend only on the linguistic word sequence $w_1^N$. This means that all $(v_1^N, c_1^N)$ equivalent to $w_1^N$ share the same word boundaries.

The second approach, described in Eq. (3), has been investigated in this work:

$$\hat{w}_1^N = \arg\max_{w_1^N} \sum_{c_1^N} \sum_{v_1^N} \max_{t_1^N} \prod_{n=1}^{N} \left\{ p(w_n \mid w_{n-2}, w_{n-1}) \cdot p(v_n, c_n \mid w_n) \cdot p\!\left(x_{t_{n-1}(w_1^N)+1}^{t_n(w_1^N)} \,\middle|\, v_n, c_n\right) \right\} \quad (3)$$

Here, the dependence of the boundaries $t_n$ on the $w_1^N$ has been made explicit. In the course of the left to right search, this simplified estimation of the sum can be realized by the following two steps:

• First, the probability contributions of all alternative class specific pronunciations (i.e. all pairs $(v, c)$) of a word $w$ that end at the same time in the same language model context are summed up.
• Second, the corresponding search paths are combined into a single remaining hypothesis which represents $w$ in this context and carries the sum computed in the first step.
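A schematic of these two steps is sketched below. The hypothesis representation is invented for illustration and is not the actual decoder data structure.

from collections import defaultdict

def recombine(hypotheses):
    """Sum approximation: hypotheses for the same word that end at the same
    time in the same language-model context are merged, and their probability
    contributions are summed (Eq. (3)).
    hypotheses: list of dicts with keys
        'word', 'end_time', 'lm_context', 'pron', 'class', 'prob'."""
    merged = defaultdict(float)
    for h in hypotheses:
        key = (h["word"], h["end_time"], h["lm_context"])
        merged[key] += h["prob"]          # sum over all (v, c) pairs
    # One remaining hypothesis per (word, end time, LM context).
    return [{"word": w, "end_time": t, "lm_context": ctx, "prob": p}
            for (w, t, ctx), p in merged.items()]

# Example: two class-specific pronunciations of 'AND' ending at t=42.
hyps = [
    {"word": "AND", "end_time": 42, "lm_context": ("SEE",),
     "pron": "ae n d", "class": "slow", "prob": 1e-12},
    {"word": "AND", "end_time": 42, "lm_context": ("SEE",),
     "pron": "ax n", "class": "fast", "prob": 4e-13},
]
merged = recombine(hyps)   # one hypothesis carrying the summed probability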

This solution may be suboptimal, if two or more paths are summed up, since the optimal word boundaries of each individual path may be different. For example, consider two pronunciation sequences, "AND THEN" and "AN THEN" (see Fig. 5; in this example, the class labels are ignored for the sake of simplicity). The word boundary point between both words may be different for the optimal alignments of the two pronunciation sequences. In Fig. 5 these optimal boundaries are denoted by $t_1$ and $t_2$, respectively (a). In the sum approximation, however, only paths are considered which end at the same time, e.g. $t_3$. This implies some kind of word boundary blurring, illustrated in (b).

Fig. 5. The effect of the sum approximation on the word boundaries for different pronunciation sequences.

Table 2
Statistics of the professional dictation training and test corpora

Corpus                 Training            DEV      EVAL
Language               English
Speaking style         spontaneous dictation
Bandwidth              telephone
Overall duration (h)   270                 5        3.3
# Speakers             436                 11       11
# Sentences            343674 (chopped)    273      154
# Running words        2385331             38023    26479
# Different words      32414               3830     3161



Another aspect concerns the handling of weighted probabilities in the sum over pronunciation and variability class sequences. Usually, when combining the log probabilities from a language model and an HMM based acoustic model, the log probabilities of the language model are scaled. This is done to compensate invalid modeling assumptions in the acoustic model, which underestimates the probability of acoustic feature sequences. In case of using the maximum approximation, this produces the same effect as scaling the log probabilities of the acoustic model with the reciprocal language model scaling factor. However, if the sum is applied this is no longer true. When using a language model scaling, we found that the sum is often dominated by only the best hypothesis. This is a consequence of the large dynamic range of acoustic probabilities appearing in $p(x_{t_{n-1}+1}^{t_n} \mid v_n, c_n)$ and has also been observed in the context of confidence measures (Wessel et al., 1998) and discriminative training (Woodland and Povey, 2000). Therefore, we applied a scaling of the acoustic model log probabilities with an acoustic scaling factor $\gamma$ (Schramm and Aubert, 2000), as shown in Eq. (4).

$$\hat{w}_1^N = \arg\max_{w_1^N} \sum_{c_1^N} \sum_{v_1^N} \max_{t_1^N} \prod_{n=1}^{N} \left\{ p(w_n \mid w_{n-2}, w_{n-1}) \cdot p(v_n, c_n \mid w_n) \cdot p\!\left(x_{t_{n-1}(w_1^N)+1}^{t_n(w_1^N)} \,\middle|\, v_n, c_n\right)^{\gamma} \right\} \quad (4)$$


4. Database

In order to study the modeling of speech variabilities in spontaneous dictations we had to use proprietary data as, to the best of our knowledge, no comprehensive public corpus for spontaneous dictation exists. Therefore, all experiments reported here have been performed on a Philips in-house US English database of highly spontaneous medical dictations. It contains real-life recordings of medical reports, spontaneously spoken over long distance telephone lines by various male physicians from all over the United States (due to the lack of sufficient female training data, it was decided to work with male data only). The speech variability in this data is tremendous with a variety of speaking styles, accents and speaking rates and a large degree of spontaneous speech effects like filled pauses, partial words, repetitions and restarts. This spontaneous speaking style differs strongly from the cooperative speech in conventional dictation applications of automatic speech recognition, where the speakers are aware of being transcribed automatically. The speakers in the spontaneous dictation task are much less cooperative, since they expect to be transcribed by human transcribers. Consequently, the speaking style in the dictations is typically sloppy with an often extremely high rate of speech. The existence of very fast speech parts is a peculiarity of this feedback free dictation task since (1) the speakers are usually very experienced in dictating and (2) several repetitive passages exist which are known to both parties (physician and transcriptionist) beforehand. Another characteristic of the task is the specific vocabulary due to the substantial number of specific medical terms in the reports. In the training corpus, filled pause and non speech events are annotated and sentences have been chopped at silence intervals of sufficient length. No chopping was applied, however, on the development and evaluation corpus. The statistics of the training, development (DEV) and evaluation (EVAL) corpus are presented in Table 2.


5. Baseline system description

The baseline system is the Philips Research LVCSR system, which has already been applied to numerous tasks including Broadcast News transcriptions (Beyerlein et al., 2002) and Switchboard (Beyerlein et al., 2001). All experimental results presented in this article were achieved by using a standard one-pass recognizer. This allows to separately evaluate the proposed technique in a basic recognition scenario but leaves open the question of how much could be gained from a combination with other techniques like acoustic model adaptation. Although a further refinement of the class specific models by utilizing appropriate adaptation techniques appears to be attractive, this lies beyond the scope of this article.

Best system performance has been achieved with an acoustic resolution of 857k mixture densities. Additional splitting did not further improve our results. The baseline lexicon consists of 60k words and, on average, 1.3 alternative pronunciations per word, which are weighted by unigram prior probabilities. The set of pronunciation alternatives represents frequent variants as well as systematic variations and was taken from available lexicons of other tasks. The 77k lexical entries (60k baseforms plus 17k alternative pronunciations) also contain a set of four filled pause pronunciations with equal priors. In the baseline system, the trigram LM treats filled pauses like any other word, i.e. filled pause is predicted probabilistically from trigram context and may also appear in the history itself. The search proceeds from left to right using a prefix tree lexicon (Aubert, 1999) and a maximum approximation over all pronunciation sequences of a word sequence is performed (cf. Section 3.3).

6. Experimental results

In this section we present the performance evaluation of the techniques discussed above. Our baseline experiments are described in Section 6.1. Experimental results with explicit handling of filled pause related speech variability in the acoustic, pronunciation and language model are presented in Section 6.2. Efficient integration of multiple pronunciations into the search procedure has been achieved by applying the sum approximation (Section 3.3.2). We present the results obtained with this technique in Section 6.3. Experiments with improved modeling of rate of speech and accent related variability are discussed in Sections 6.4 and 6.5, respectively. Some contrast experiments are presented in Section 6.6.

6.1. Baseline

The baseline system has been described in Section 5. This system achieves a word error rate of 22.3% for the DEV corpus and 28.3% for the EVAL corpus. The differing results on both corpora may be ascribed to substantial differences in the individual error rates of the speakers, which range from below 10% to more than 50%.

6.2. Modeling filled pause related variability

The lexicon of our baseline system contains four equally weighted, alternative pronunciations for filled pause and therefore already captures filled pause related pronunciation variability to some extent. Starting from this standard treatment of filled pause, our goal was to successively improve its representation in the acoustic, pronunciation and language model by implementing a number of algorithmic extensions. First, appropriate pronunciation prior probabilities for the 4 filled pause pronunciations have been applied (instead of using equal priors) and lead to a non significant improvement of 0.1% absolute on the DEV corpus (see Table 3). A more significant influence of filled pause pronunciation priors can however be expected when introducing additional pronunciations (see below). The next step was the introduction of specific acoustic models for the filled pause class and the regular speech (i.e. non filled pause) class. This technique improved the word error rate on the DEV corpus by 0.5% absolute. Another 0.5% absolute improvement was achieved by deliberately excluding the filled pause event from the LM history, but predicting filled pause conditioned on the previously spoken words (Peters, 2003; Schramm et al., 2003). Finally, we applied a class dependent pronunciation modeling for filled pause. The systematic phoneme insertion technique described in Section 3.2.1 was used to generate 64 alternative filled pause pronunciations of different lengths. In combination with the other described techniques, this approach further reduced the word error rate to 20.7% on the DEV corpus. Compared to the baseline, this is a 7% relative improvement. We also evaluated the final system on the EVAL corpus and achieved a word error rate of 27.0%, which is a relative improvement of 4.4%.

Table 3
Effect of explicit filled pause handling in acoustic, pronunciation, and language model. Experimental results have been achieved on the DEV and EVAL corpus. AM: acoustic model, PM: pronunciation model, BL: baseline, FP: filled pause, RS: regular speech

FP prior probabilities   Class dep. modeling in AM   Class dep. modeling in PM   LM history   # FP pronunc.   WER (%) DEV   WER (%) EVAL
No                       BL                          BL                          FP incl.     4               22.3          28.3
Yes                      BL                          BL                          FP incl.     4               22.2          –
Yes                      FP, RS                      BL                          FP incl.     4               21.7          –
Yes                      FP, RS                      BL                          FP excl.     4               21.1          –
Yes                      FP, RS                      FP                          FP excl.     64              20.7          27.0


Note that this gain has been achieved with an only marginally increased number of parameters.

6.3. Efficient incorporation of pronunciations into the search

The efficient incorporation of multiple pronunciations into the search procedure is an important issue. On HUB4 Broadcast News (Schramm and Aubert, 2000) and Switchboard (Beyerlein et al., 2001), we found that an approximation of the sum over the pronunciation variant sequences (Section 3.3.2) leads to consistent improvements of the word error rate and even reduced search costs. This motivated us to also use this technique for our medical report task. The results, shown in Table 4, are in line with our experimental results on Broadcast News and Switchboard. In addition to a small gain of the word error rate, a slightly improved realtime performance was observed.

6.4. Modeling rate of speech related variability

This section describes the development of rate of speech specific acoustic and pronunciation models. We present experimental results with two and three rate of speech specific acoustic models in Section 6.4.1 and discuss the results achieved with fast speech specific pronunciation modeling in Section 6.4.2.

Table 4
Experimental comparison of the maximum approximation and the sum approximation on the DEV and EVAL corpus

Pronunciation handling     WER (%) DEV   WER (%) EVAL
Maximum approximation      20.7          27.0
Sum approximation          20.4          26.6

6.4.1. Rate of speech dependent acoustic modeling

Based on the (new baseline) system with improved filled pause modeling and pronunciation summation, described in Section 6.3, additional rate of speech classes were introduced to further specify "regular" (i.e. non filled pause) speech. Two or three speech variability classes were applied for slow and fast, respectively slow, medium and fast speech. Each word in the training corpus was labeled with respect to its rate of speech class. Then, corresponding phoneme inventories and alternative pronunciations were introduced, according to Section 3.1. An acoustic model training was performed and two, respectively three rate of speech specific acoustic models were generated, directly linked to corresponding rate specific pronunciations. Table 5 compares the complexity of the three systems in terms of the number of mixture densities and pronunciations. Compared to the baseline, the number of mixture densities is little increased for the systems with two and three rate specific acoustic models (in the following referred to as ROS-2 and ROS-3). The number of lexical entries (baseform pronunciations plus alternative pronunciations) is, however, clearly increased from 86k to 172k (ROS-2) respectively 258k (ROS-3). In the search, contributions of simultaneously active pronunciations of the same word were summed up (Sections 3.3.2 and 6.3). Table 5 also presents the experimental results, achieved for the DEV and EVAL corpus. Especially for the ROS-3 system, the performance with rate of speech specific acoustic models is quite different for both corpora. While we observed a (marginally significant) improvement over the baseline for the DEV corpus, the WER was increased for the EVAL corpus.

³ Baseform pronunciations plus alternative pronunciations.

Table 5
Effect of rate of speech dependent acoustic modeling. Experimental results have been achieved on the DEV and EVAL corpus. FP: filled pause, RS: regular speech, S: slow speech, M: medium speech, F: fast speech (cf. Table 1)

Class dependent modeling in acoustic model | Lexical entries | Mixture densities | WER (%) DEV | WER (%) EVAL
FP, RS      |  86k | 857k  | 20.4 | 26.6
FP, S, F    | 172k | 866k  | 20.1 | 26.6
FP, S, M, F | 258k | 1034k | 19.9 | 27.0


We assume that the tuning of important search parameters (e.g. language model scaling and word penalty) on the DEV corpus did not generalize well to the unknown EVAL data. Another reason could be the different rate of speech characteristics of the speakers in the two corpora. Apart from the system complexity and word error rate, it is also interesting to compare the runtime performance. We measured a realtime factor of seven for the baseline system and 15 for the ROS-3 system using an Intel Xeon 2.8 GHz processor. This difference can be attributed to (1) higher general search costs as a result of the larger prefix tree lexicon (the number of active states after pruning, for example, is increased by 20%), (2) additional costs implied by the pronunciation summation process when the number of active terms increases and (3) additional costs for the distance computation due to the 20% higher number of mixture densities.

6.4.2. Rate of speech dependent acoustic and pronunciation modeling

The data driven, speech variability dependent pronunciation modeling technique, described in Section 3.2, was applied on top of speech variability dependent acoustic modeling to improve especially the recognition of fast speech. Thus, a number of alternative pronunciations with systematic phoneme deletions were introduced for the fast speech class, while the pronunciation modeling for the other rate of speech classes was not modified. A class dependent Viterbi alignment (Section 3.2.4) was performed and the alignment output text was used in a new acoustic model training to generate two or three speech variability specific acoustic models. In addition, new pronunciation and class label prior probabilities were estimated from the alignment text. A single pass search procedure was performed, using the new acoustic and pronunciation models. Again, contributions of alternative pronunciations of the same word were summed up (Sections 3.3.2 and 6.3) and important search parameters were tuned with respect to the DEV corpus. Table 6 gives an overview of the complexity and the performance of the two systems, now denoted by ROS-2-F and ROS-3-F. It is important to emphasize again that in these experiments a specific (non baseline) pronunciation modeling technique was applied only to the fast speech class (F) and the filled pause class (FP). The pronunciation modeling of the slow (S) and medium (M) speech classes does not differ from baseline modeling. Compared to the baseline, the system complexity has increased considerably. For the ROS-3-F system, the number of mixture densities has increased from 857k to 1.7M, while the number of lexical entries has more than quadrupled. Of course, the number of lexical entries would have been even larger if we had introduced additional pronunciations for all words in the recognition lexicon. However, we applied the speech variability dependent pronunciation modeling only to words which occur at least once in the training data and thereby restricted the number of lexical entries somewhat. The global word error rates show a 3% relative improvement for the DEV corpus, while the gain on the EVAL corpus is 7% relative. Compared to the system without rate of speech dependent pronunciation modeling (Table 5), a significant improvement was achieved especially for the EVAL corpus, which demonstrates the importance of performing a specific pronunciation modeling for fast speech. Due to the significantly increased complexity of the lexical tree, we measured a realtime factor of about 20 for the ROS-3-F system, when using an Intel Xeon 2.8 GHz processor. Thus, the runtime has nearly tripled compared to the baseline.

Table 6
Effect of rate of speech dependent acoustic and pronunciation modeling. Experimental results have been achieved on the DEV and EVAL corpus. FP: filled pause, RS: regular speech, S: slow speech, M: medium speech, F: fast speech (cf. Table 1)

Class dependent modeling in acoustic model | Class dependent modeling in pronunc. model | Lexical entries | Mixture densities | WER (%) DEV | WER (%) EVAL
FP, RS      | FP    |  86k | 857k  | 20.4 | 26.6
FP, S, F    | FP, F | 317k | 1448k | 19.9 | 25.3
FP, S, M, F | FP, F | 403k | 1716k | 19.7 | 24.7
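To illustrate the kind of variants added for the fast speech class, the sketch below enumerates single phoneme deletions for a baseform. It deliberately ignores the constraints of the actual systematic deletion procedure of Section 3.2.2 (which phonemes may be deleted); only the one-deletion-per-word restriction mentioned in Section 7 is reflected, and the example baseform is hypothetical.

    # Generate fast-speech pronunciation variants by deleting exactly one
    # phoneme from the baseform; the phonological constraints of the real
    # method are omitted in this sketch.
    def single_deletion_variants(baseform):
        """Return all distinct variants with exactly one phoneme removed."""
        variants = []
        for i in range(len(baseform)):
            variant = baseform[:i] + baseform[i + 1:]
            if variant and variant not in variants:
                variants.append(variant)
        return variants

    base = ["p", "ey", "sh", "ax", "n", "t"]  # hypothetical baseform
    for v in single_deletion_variants(base):
        print(" ".join(v))

Each surviving variant is then tagged with the fast speech class and added to the training lexicon before the class dependent alignment.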

Fig. 6. Speaker specific results achieved with speech variability dependent acoustic and pronunciation modeling on the DEV and EVAL corpus. The individual speakers are denoted in terms of their average rate of speech (see Section 3.1.1 for details on the rate of speech measure).

Table 7
Effect of modeling speaker accent related variability. Experimental results have been achieved on the DEV and EVAL corpus. FP: filled pause, RS: regular speech, S: slow speech, M: medium speech, F: fast speech, N: native speech, NN: non native speech (cf. Table 1)

Class dependent modeling in acoustic model | Lexical entries | Mixture densities | WER (%) DEV | WER (%) EVAL
FP, S, M, F        | 403k | 1716k | 19.7 | 24.7
FP, S, M, F, N, NN | 413k | 2516k | 19.6 | 24.5


In order to better understand the different gains on both corpora, we studied speaker specific results. These are given in Fig. 6. Considering the ROS-3-F system, 16 out of the 22 DEV and EVAL speakers are improved, one EVAL speaker by 18.7% relative. Only one speaker, the slowest EVAL speaker, is degraded significantly. Studying the 11 fastest speakers, we observe a similar gain in both corpora of around 6% relative. At the same time, hardly any improvement is observed when combining the results of the 11 remaining (slow) speakers. This indicates that the gain achieved with this technique may be mostly attributed to an improved acoustic and lexical modeling of fast speakers. This may also explain the different gains observed for both corpora, since the EVAL speakers are on average faster than the DEV speakers. We conclude that a specific pronunciation modeling for only the fast speech class helps to significantly improve the performance for fast speakers without degrading the results for slower speakers.

6.5. Modeling speaker accent related variability

In the previous sections we presented experimental results achieved with a combination of rate of speech specific models with filled pause models.

Now, we try to further enhance our best system by incorporating specific acoustic models for native and non native speakers. The two additional acoustic models have been trained on either native or non native speakers, which are labeled in our training data. About one fifth of the training data contained non native speech. Although studies have shown that pronunciation modeling of non native speech may improve the performance (e.g. Lee et al., 2003), we did not yet apply any specific pronunciation modeling for this speech variability. In order to restrict the complexity of the system, we incorporated accent specific pronunciations only for the most frequent 5000 lexical entries.


In Table 7, results with additional native and non native models are compared to the best performing system (Section 6.4.2). The overall gain achieved with this accent modeling is rather small. However, for the only non native speaker in the two corpora, a relative error rate improvement of about 3% is observed.
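The complexity restriction mentioned above (accent specific pronunciations only for the most frequent lexical entries) amounts to a simple frequency cut-off. The sketch below shows the idea; the count dictionary and the cut-off value are illustrative assumptions, not the actual lexicon files.

    # Select the words that receive native/non native pronunciation variants:
    # only the top_n most frequent lexical entries are extended.
    def select_accent_entries(word_counts, top_n=5000):
        ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
        return [word for word, _ in ranked[:top_n]]

    toy_counts = {"patient": 1200, "examination": 830, "unremarkable": 310}
    print(select_accent_entries(toy_counts, top_n=2))  # -> ['patient', 'examination']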

6.6. Contrast experiments

In this section, we present several contrast experiments. The aim of these investigations is to evaluate the importance of different components in our system and to explore ways to reduce the system complexity. The main aspects concern: a speech variability class independent modeling of phoneme deletions (Section 6.6.1), an oracle selection of rate of speech models (Section 6.6.2), a comparison of the maximum approximation with the sum approximation when using rate specific models (Section 6.6.3) and the number of rate specific pronunciations (Section 6.6.4).

6.6.1. Class independent pronunciation modeling of fast speech

Using specific (shorter) pronunciation models for the fast speech class has led to a significant performance improvement, especially for the fast speakers (cf. Section 6.4.2). It is, however, not yet clear if it is necessary to apply the phoneme deletion pronunciation modeling technique in combination with the rate of speech class dependent acoustic modeling. Therefore, in a contrast experiment, this specific kind of pronunciation modeling was applied in a rate of speech class independent framework. Starting from the baseline system with filled pause modeling and pronunciation summation (cf. Section 6.3), various additional short pronunciation variants were generated by using the systematic phoneme deletion technique described in Section 3.2.2. These new variants were incorporated into the system in the usual manner. That is, the variants were put into the training lexicon which in turn was applied for a standard (class independent) Viterbi alignment. The alignment text was then used for a standard acoustic model training and for an estimation of the pronunciation prior probabilities. All pronunciations observed in the alignment text were put into the recognition lexicon (if the corresponding word was contained in the recognition lexicon wordlist). This led to a total number of lexical entries of 232k, compared to 86k in the baseline lexicon. Table 8 compares this new system with three others presented formerly: the baseline system (Section 6.3), the ROS-2 system with standard pronunciations (Section 6.4.1) and the ROS-2-F system with specific pronunciations for the fast speech class (Section 6.4.2). Compared to the baseline, 146k additional pronunciations were introduced, while the size of the acoustic model was hardly increased. The word error rate, however, is 1.7% relative worse than the baseline. This result could not be improved by increasing the number of mixture densities with an additional split. An analysis of speaker specific results showed that indeed a gain was achieved for the three fastest speakers. However, this gain was overcompensated by the degradation on the slower speakers. This demonstrates that the modeling of phoneme deletions may improve the recognition performance on fast speech even without being applied specifically to rate of speech models. However, the experiments also indicate that degradations on slower speech can only be avoided if a class dependent pronunciation modeling is applied that allows for an individual treatment of fast and slow speech.

Table 8
Effect of class independent fast speech pronunciation modeling. Results of the class dependent framework are given for comparison. The experiments were done on the DEV corpus. FP: filled pause, RS: regular speech, S: slow speech, F: fast speech (cf. Table 1)

System   | Class dependent modeling in acoustic model | Class dependent modeling in pronunc. model | Lexical entries | Mixture densities | WER (%) DEV
Baseline | FP, RS   | FP     |  86k | 857k  | 20.4
Contrast | FP, RS   | FP, RS | 232k | 888k  | 20.7
ROS-2    | FP, S, F | FP     | 172k | 866k  | 20.1
ROS-2-F  | FP, S, F | FP, F  | 317k | 1448k | 19.9
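The pronunciation prior probabilities mentioned above follow directly from relative frequencies in the aligned training text. The sketch below shows this estimation step in its simplest form; the data layout is illustrative, and any smoothing or class conditioning used in the actual system is omitted.

    # Relative-frequency estimate of p(variant | word) from a Viterbi aligned
    # corpus, given as (word, chosen_variant_id) pairs.
    from collections import Counter, defaultdict

    def estimate_pronunciation_priors(alignment):
        counts = defaultdict(Counter)
        for word, variant_id in alignment:
            counts[word][variant_id] += 1
        return {word: {v: c / sum(vc.values()) for v, c in vc.items()}
                for word, vc in counts.items()}

    toy_alignment = [("report", 0), ("report", 0), ("report", 1), ("normal", 0)]
    print(estimate_pronunciation_priors(toy_alignment))
    # -> {'report': {0: 0.666..., 1: 0.333...}, 'normal': {0: 1.0}}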

6.6.2. Oracle selection of rate of speech specific model

In these contrast experiments we investigated whether the lexical combination of the rate of speech specific models is necessary to achieve the observed improvements.


Table 9
Experimental comparison of the maximum approximation and the sum approximation for the baseline system and the ROS-2-F system. Results have been achieved on the DEV corpus

Pronunciation handling | WER (%) Baseline | WER (%) ROS-2-F
Maximum approximation  | 20.7             | 20.9
Sum approximation      | 20.4             | 19.9


An alternative scenario is to determine the average rate of speech for each speaker beforehand and to use only the most appropriate rate specific model during the search (e.g. Mirghafori et al., 1995). We tested this isolated model scenario by performing two evaluations for each speaker, one exclusively using the fast models and the other using only the slow models. Technically, this has been achieved by simply excluding the fast respectively slow speech pronunciation models from the recognition lexicon of the ROS-2-F system (Section 6.4.2). Note that the number of lexical entries is larger for the fast speech models (231k fast speech pronunciations compared to 86k slow speech pronunciations), since additional shorter pronunciations have been used in this variability class. The overall performance with isolated rate of speech models is clearly worse than with the ROS-2-F system. On the DEV corpus, a global word error rate of 22.2% is achieved when using only the slow models.

The performance is degraded even more, to 25.1%, if the fast models are applied exclusively. This is not surprising since either system is specialized to only a part of the rate of speech range. A more interesting question is how the best of both systems for each speaker performs in comparison to the ROS-2-F system and to the baseline. These results are illustrated in Fig. 7. The word level lexical model combination with sum approximation (ROS-2-F) outperformed the oracle model selection approach for all but two speakers.

Fig. 7. Speaker specific performance with isolated rate of speech specific models, achieved on the DEV corpus. The individual speakers are denoted in terms of their average rate of speech (see Section 3.1.1 for details on the rate of speech measure).

We conclude that the availability of both models and their combination during the search is more helpful than the restriction to the oracle model. This is probably a consequence of the substantial intra speaker rate of speech fluctuations in our dictation data, which demand a flexible word level modeling.
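As a rough sketch of how the isolated model systems and the per speaker oracle were obtained: one class's pronunciations are removed from the ROS-2-F recognition lexicon, each speaker is decoded with both restricted systems, and the better of the two word error rates is taken as the oracle. The lexicon layout and the WER numbers below are purely illustrative.

    # Build "slow only" / "fast only" lexica and form the per-speaker oracle.
    def restrict_lexicon(lexicon, keep_classes):
        """lexicon: {word: [(variant_id, ros_class), ...]} -> filtered copy."""
        return {w: [v for v in variants if v[1] in keep_classes]
                for w, variants in lexicon.items()}

    def oracle_wer(wer_slow_only, wer_fast_only):
        """Per-speaker oracle: the better of the two isolated systems."""
        return {spk: min(wer_slow_only[spk], wer_fast_only[spk])
                for spk in wer_slow_only}

    wer_slow = {"spk01": 21.0, "spk02": 24.5}  # toy numbers
    wer_fast = {"spk01": 23.2, "spk02": 22.8}
    print(oracle_wer(wer_slow, wer_fast))      # -> {'spk01': 21.0, 'spk02': 22.8}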

6.6.3. Maximum approximation

In Section 6.3, it has been shown that the application of the sum approximation technique leads to a small but consistent improvement of the word error rate on the DEV and EVAL corpus. Consequently, this technique was applied in the subsequent experiments. It is, however, not yet clear what influence the sum approximation has in a system with a large number of variability specific pronunciation models. Therefore, we performed a contrast experiment in which we compared the performance of the ROS-2-F system (Section 6.4.2) when using the sum and maximum approximation. The global word error rates for the baseline and ROS-2-F system are presented in Table 9.



The performance of the ROS-2-F system with maximum approximation is worse than the baseline performance. However, when combining class specific acoustic and pronunciation models with the summation technique, a gain of nearly 4% relative is achieved compared to the baseline with maximum approximation. At the same time, the search effort is even slightly reduced with the sum, similar to the results presented in (Schramm and Aubert, 2000). These findings demonstrate that the summation is an important part of the presented technique, without which no gain at all would have been achieved.

6.6.4. Number of pronunciations

The phoneme deletion approach, described in Section 3.2.2, adds a vast number of additional pronunciations to the lexicon. This improves the modeling of fast speech but at the same time causes additional lexical confusability and larger memory requirements. Therefore, an interesting contrast experiment is to evaluate the performance of the ROS-2-F system as a function of the number of pronunciations in the lexicon. Based on frequency counting in the Viterbi aligned training data, we have excluded less frequent pronunciations (of all classes) and thereby generated three lexicons with 260k, 210k and 180k lexical entries. In Fig. 8, the experimental results with these lexicons are compared to the results with the full lexicon (317k lexical entries). Results are given for maximum and sum approximation.

Fig. 8. Influence of the number of lexical entries on the performance of the ROS-2-F system when using the maximum and sum approximation on the DEV corpus.

A reduction of the number of pronunciations by 100k slightly improves the word error rate in case of the maximum approximation and has a small negative influence when applying the sum. It is apparently possible to reduce the system complexity without losing too much in terms of performance. However, in case of 180k lexical entries clear degradations are observed for both techniques.

We conclude that even the less frequent rate specific pronunciations are helpful when applied in combination with the sum approximation. In the maximum approximation, however, these lexical entries are rather counterproductive and should be excluded from the lexicon.
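The pruning used for this contrast experiment is a plain frequency cut-off on the Viterbi alignment counts. The sketch below illustrates it; the count structure and the threshold are assumptions for illustration, not the values that produced the 260k/210k/180k lexica.

    # Drop pronunciations whose count in the aligned training data falls below
    # a threshold; the surviving entries form the reduced recognition lexicon.
    def prune_lexicon(variant_counts, min_count):
        """variant_counts: {(word, variant_id): count} -> surviving entries."""
        return {entry: c for entry, c in variant_counts.items() if c >= min_count}

    toy_counts = {("report", 0): 412, ("report", 1): 3, ("normal", 0): 97}
    print(sorted(prune_lexicon(toy_counts, min_count=5)))
    # -> [('normal', 0), ('report', 0)]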

7. Conclusions and outlook

In this work a word level model combination approach has been presented that aims to improve the modeling of multiple spontaneous speech variabilities on a highly spontaneous, real life medical transcription task. The technique (1) separates speech variability into previously defined classes, (2) generates speech variability specific acoustic and pronunciation models and (3) properly combines these models later in the search procedure on a word level basis. This allows for an individual acoustic and pronunciation modeling of various speech variability classes and is therefore better suited to the tremendous variety of natural speech. A theoretical framework for the efficient integration of the variability specific acoustic and pronunciation models into the search procedure has been provided, which incorporates alternative pronunciations and acoustic models in a weighted sum of acoustic probabilities.


Table 10
Overview of achieved improvements on the evaluation corpus EVAL

System improvement      | WER (relative gain)
Baseline                | 28.3% (reference)
Filled pause modeling   | 27.0% (−4.6%)
Pronunciation summation | 26.6% (−6.0%)
Rate of speech modeling | 24.7% (−12.7%)
Speaker accent modeling | 24.5% (−13.4%)


Since this technique can be applied in a single system's one-pass search, it is an appealing option especially for real life applications. Another strength of the lexicon based model combination technique is its capability to model speech variability on the word level instead of the utterance level. This is especially important for the modeling of variabilities that may change frequently within an utterance, like the rate of speech.

The approach was used for modeling speech variability related to filled pauses, rate of speech and speaker accent but may well be applied to other areas.

Our best system is based on a lexicon with 422k specific pronunciations and achieved a word error rate reduction of 13% relative compared to the baseline. This gain can be attributed mostly to improved rate of speech modeling (about 7% relative) and filled pause modeling (about 4% relative), while our efforts with respect to the modeling of speaker accent were less successful (less than 1% relative). Table 10 presents an overview of the achieved improvements. The pronunciation summation technique slightly reduces the search costs and therefore has no negative influence on the realtime factor. However, the lexical combination of several acoustic and pronunciation models increases the size of the lexical tree significantly and therefore causes additional costs. Consequently, the realtime factor has nearly tripled compared to the baseline when applying three rate of speech models and additional pronunciation models for fast speech and filled pauses. Most probably, this performance can be improved significantly, for example by a more sophisticated class label sequence model or a better pronunciation selection technique. However, this has not been the focus of this work.

In a number of contrast experiments the importance of different technical components of the multi variability system was evaluated and ways to reduce the system complexity were explored. It was found

that the sum approximation technique is an important part of the framework, necessary to successfully incorporate a large number of alternative class specific pronunciations into the system. Another contrast experiment demonstrated the need for combining the class specific modeling approaches for the acoustic and pronunciation model. Neither a separate class specific acoustic modeling nor a separate class specific pronunciation modeling approach could significantly reduce the word error rate. It was further shown that the proposed word level combination of rate specific models is superior to an approach that utilizes only the best performing single rate specific model for each speaker. This appears to be a consequence of the substantial rate of speech fluctuations in the medical dictations and underlines the fact that, contrary to the single model approach, the model combination is able to also model local (i.e. word level) variations. In another contrast experiment the complexity of the multi variability system was reduced by removing the least frequent lexical entries. Reducing the size of the lexicon by about one third led to only a small degradation of the best word error rate. The presented technique may be further extended and refined in many ways and we conclude by describing some of the directions which appear promising. The systematic phoneme deletion approach we applied is quite restricted, allowing only the deletion of one phoneme per word. Iterating this process could lead to even shorter new variants, which may be more adequate for very fast speech. Another promising direction is to perform experiments with an even larger number of speech variability classes to further refine the modeling. We integrated variability specific acoustic models and pronunciations into the search procedure by using the joint probability p(v_n, c_n | w_n). This coarse unigram model does not allow for an incorporation of any prior knowledge about likely sequences of acoustic models. Therefore, an interesting experiment is to replace this model by the more accurate N-gram model p(c_n | c_{n-1}, c_{n-2}, ...) p(v_n | w_n, c_n). This could also help to better focus the search and therefore improve the overall performance of the system.
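Written out, the suggested refinement replaces the joint unigram by a class label sequence model multiplied with a class conditioned pronunciation prior (a restatement of the preceding sentence in formula form, not a model that has been implemented here):

    p(v_n, c_n \mid w_n) \;\longrightarrow\; p(c_n \mid c_{n-1}, c_{n-2}, \ldots)\; p(v_n \mid w_n, c_n)

The first factor allows prior knowledge about likely sequences of class labels (and hence acoustic models) to be incorporated, while the second keeps the pronunciation prior conditioned on both the word and its class.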

Acknowledgements

The authors wish to thank Dr. Jochen Peters from Philips Research Laboratories, Aachen, Germany, for providing the language model which was used in the experiments and the two reviewers for their helpful comments.


References

Adda-Decker, M., Lamel, L., 1998. Pronunciation variants across systems, languages and speaking style. In: Proc. ESCA Workshop 'Modeling Pronunciation Variation for Automatic Speech Recognition', Rolduc, Kerkrade, The Netherlands, pp. 1–6.

Amdal, I., 2002. Learning pronunciation variation. Ph.D. dissertation, Norwegian University of Science and Technology, Norway.

Aubert, X., Dugast, C., 1995. Improved acoustic–phonetic modeling in Philips' dictation system by handling liaisons and multiple pronunciations. In: Proc. European Conf. on Speech Communication and Technology, Madrid, Spain, pp. 767–770.

Aubert, X.L., Ney, H., 1995. Large vocabulary continuous speech recognition using word graphs. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Detroit, USA, Vol. 1, pp. 49–52.

Aubert, X., 1999. One pass cross word decoding for large vocabularies based on a lexical tree search organization. In: Proc. European Conf. on Speech Communication and Technology, Budapest, Hungary, pp. 1559–1562.

Aust, H., Oerder, M., Seide, F., Steinbiss, V., 1995. The Philips automatic train timetable information system. Speech Commun. 17 (3–4), 249–262.

Bahl, L.R., Das, S., de Souza, P.V., Epstein, M., Mercer, R.L., Merialdo, B., Nahamoo, D., Picheny, M.A., Powell, J., 1991. Automatic phonetic baseform determination. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Toronto, Canada, pp. 173–176.

Bates, R., Ostendorf, M., 2002. Modeling pronunciation variation in conversational speech using prosody. In: Proc. ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language (PMLA-2002), Estes Park, Colorado, USA, pp. 42–47.

Bellegarda, J.R., Nahamoo, D., 1990. Tied mixture continuous parameter modeling for speech recognition. IEEE Trans. ASSP 38 (12), 2033–2045.

Bernstein, J., Baldwin, G., Cohen, M., Murveit, H., Weintraub, M., 1986. Phonological studies for speech recognition. In: Proc. DARPA Speech Recognition Workshop, Palo Alto, CA, USA, pp. 41–48.

Beyerlein, P., Aubert, X., Harris, M., Meyer, C., Schramm, H., 2001. Investigations on conversational speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Aalborg, Denmark, pp. 499–503.

Beyerlein, P., Aubert, X., Haeb-Umbach, R., Harris, M., Klakow, D., Wendemuth, A., Molau, S., Pitz, M., Sixtus, A., 2002. Large vocabulary continuous speech recognition of broadcast news—the Philips/RWTH approach. Speech Commun. 37 (1–2), 109–131.

Byrne, W., Finke, M., Khudanpur, S., McDonough, J., Nock, H., Riley, M., Saraclar, M., Wooters, C., Zavaliagkos, G., 1998. Pronunciation modelling using a hand-labelled corpus for conversational speech recognition. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Seattle, USA, pp. 313–316.

Chen, K., Hasegawa-Johnson, M., 2003. Improving the robustness of prosody dependent language modeling based on prosody syntax cross-correlation. In: Proc. IEEE Workshop on Speech Recognition and Understanding, St. Thomas, US Virgin Islands, pp. 435–440.

Chen, K., Hasegawa-Johnson, M., 2004. How prosody improves word recognition. In: ISCA International Conference on Speech Prosody, Nara, Japan, pp. 583–586.

Cohen, M.H., 1989. Phonological structures for speech recognition. Ph.D. dissertation, University of California, Berkeley, USA.

Evermann, G., Woodland, P.C., 2000. Posterior probability decoding, confidence estimation and system combination. In: Proc. NIST Speech Transcription Workshop, College Park, MD, USA.

Finke, M., Waibel, A., 1997. Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Rhodes, Greece, pp. 2379–2382.

Fukada, T., Yoshimura, T., Sagisaka, Y., 1998. Automatic generation of multiple pronunciations based on neural networks and language statistics. In: Proc. ESCA Workshop Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, The Netherlands, pp. 41–46.

Furui, S., 2003. Recent advances in spontaneous speech recognition and understanding. In: Proc. ISCA and IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, Japan, pp. 1–6.

Fosler-Lussier, E., Morgan, N., 1998. Effects of speaking rate and word frequency on conversational pronunciations. In: Proc. ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, The Netherlands, pp. 35–40.

Fosler-Lussier, J.E., 1999. Multi-level decision trees for static and dynamic pronunciation models. In: Proc. European Conf. on Speech Communication and Technology, Budapest, Hungary, pp. 463–466.

Gauvain, J.L., Adda, G., Lamel, L., Adda-Decker, M., 1997. Transcribing broadcast news: the LIMSI Nov96 Hub4 system. In: Proc. ARPA Spoken Language Technology Workshop, Chantilly, Virginia, pp. 56–63.

Goronzy, S., Kompe, R., Rapp, S., 2001. Generating non-native pronunciation variants for lexicon adaptation. In: Proc. ISCA ITRW Adaptation Methods for Speech Recognition, Sophia-Antipolis, France, pp. 143–146.

Greenberg, S., Hollenback, J., Ellis, D., 1996. Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. In: Proc. Internat. Conf. on Spoken Language Processing, Philadelphia, PA, pp. 24–27.

Hain, T., 2002. Implicit pronunciation modeling in ASR. In: Proc. ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language (PMLA-2002), Estes Park, Colorado, USA, pp. 129–134.

He, X., Zhao, Y., 2003. Fast model selection based speaker adaptation for non-native speech. IEEE Trans. Speech Audio Process. 11 (4), 298–307.

Huang, C., Chang, E., Zhou, J., Lee, K.F., 2003. Accent modeling based on pronunciation dictionary adaptation for large vocabulary Mandarin speech recognition. In: Proc. Internat. Conf. on Spoken Language Processing, Beijing, China, Vol. 3, pp. 818–821.

Humphries, J.J., Woodland, P.C., Pearce, D., 1996. Using accent-specific pronunciation modelling for robust speech recognition. In: Proc. Internat. Conf. on Spoken Language Processing, Philadelphia, PA, USA, pp. 2324–2327.

Humphries, J.J., Woodland, P.C., Pearce, D., 1997. Using accent-specific pronunciation modelling for improved large vocabulary continuous speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Rhodes, Greece, Vol. 5, pp. 2367–2370.

Hwang, M.-Y., Huang, X., 1993. Shared-distribution hidden Markov models for speech recognition. IEEE Trans. Speech Audio Process. 1 (4), 414–420.

Jurafsky, D., Ward, W., Jianping, Z., Herold, K., Xiuyang, Y., Sen, Z., 2001. What kind of pronunciation variation is hard for triphones to model? In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, Vol. 1, pp. 577–580.

Kessens, J.M., Wester, M., Strik, H., 1999. Improving the performance of a Dutch CSR by modeling within-word and cross-word pronunciation variation. Speech Commun. 29, 193–207.

Kessens, J., 2002. Making a difference on automatic transcription and modeling of Dutch pronunciation variation for automatic speech recognition. Ph.D. dissertation, University of Nijmegen, The Netherlands.

Lamel, L.F., Adda, G., 1996. On designing pronunciation lexica for large vocabulary, continuous speech recognition. In: Proc. Internat. Conf. on Spoken Language Processing, Philadelphia, PA, pp. 6–9.

Lee, K.-F., Hon, H.-W., 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Acoustics, Speech, Signal Process. 37, 1641–1648.

Lee, K.-T., Melnar, L., Talley, J., Wellekens, C.J., 2003. Symbolic speaker adaptation with phone inventory expansion. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Hong Kong, China, Vol. 1, pp. 296–299.

Liu, D., Nguyen, L., Matsoukas, S., Davenport, J., Kubala, F., Schwartz, R., 1998. Improvements in spontaneous speech recognition. In: Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, pp. 123–126.

Livescu, K., Glass, J., 2003. Lexical modeling of non-native speech for automatic speech recognition. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Istanbul, Turkey, pp. 1683–1686.

McAllaster, D., Gillick, L., Scattone, F., Newman, M., 1998. Fabricating conversational speech data with acoustic models: a program to examine model-data mismatch. In: Proc. Internat. Conf. on Spoken Language Processing, Sydney, Australia, pp. 1847–1850.

Martinez, F., Tapias, D., Alvarez, J., 1998. Towards speech rate independence in large vocabulary continuous speech recognition. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Seattle, USA, pp. 725–728.

Mayfield Tomokiyo, L., Waibel, A., 2001. Adaptation methods for non-native speech. In: Proc. Multilinguality in Spoken Language Processing, Aalborg, Denmark.

Mirghafori, N., Fosler, E., Morgan, N., 1995. Fast speakers in large vocabulary continuous speech recognition: analysis and antidotes. In: Proc. European Conf. on Speech Communication and Technology, Madrid, Spain, Vol. 1, pp. 491–494.

Nanjo, H., Kato, K., Kawahara, T., 2001. Speaking rate dependent acoustic modeling for spontaneous lecture speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Aalborg, Denmark, pp. 2531–2534.

Ostendorf, M., Byrne, W., Fink, M., Gunawardana, A., Ross, K., Roweis, S., Shriberg, E., Talkin, D., Waibel, A., Zeppenfield, T., 1996. Modeling systematic variations in pronunciation via a language-dependent hidden speaking mode. In: Proc. Internat. Conf. on Spoken Language Processing, Philadelphia, PA, USA.

Ostendorf, M., Shafran, I., Bates, R., 2003. Prosody models for conversational speech recognition. In: Symposium on Prosody and Speech Processing, Tokyo, Japan, pp. 147–154.

Pallett, D., Fiscus, J., Garofolo, J., Lund, B., Przybocki, M., 1994. 1993 benchmark tests for the ARPA spoken language program. In: Proc. ARPA Spoken Language Technology Workshop, Princeton, NJ, pp. 49–74.

Peskin, B., Newman, M., McAllaster, D., 1997. Improvements in recognition of conversational telephone speech. In: Proc. European Conf. on Speech Communication and Technology, Rhodes, Greece, pp. 22–25.

Peters, J., 2003. LM studies on filled pauses in spontaneous medical dictation. In: Proc. Human Language Technology Conf. (HLT-NAACL), Edmonton, Canada, pp. 82–84.

Pfau, T., Ruske, G., 1998. Creating hidden Markov models for fast speech. In: Proc. Internat. Conf. on Spoken Language Processing, Sydney, Australia, pp. 205–208.

Richardson, M., Hwang, M., Acero, A., Huang, X.D., 1999. Improvements on speech recognition for fast talkers. In: Proc. European Conf. on Speech Communication and Technology, Budapest, Hungary, Vol. 1, pp. 411–414.

Rigoll, G., 2003. An overview on European projects related to spontaneous speech recognition. In: Proc. ISCA and IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, Japan, pp. 131–134.

Riley, M.D., 1991. A statistical model for generating pronunciation networks. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Toronto, Canada, pp. 737–740.

Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., Zavaliagkos, G., 1999. Stochastic pronunciation modelling from hand-labelled phonetic corpora. Speech Commun. 29, 209–224.

Rose, R.C., Riccardi, G., 1993. Modeling disfluency and background events in ASR for a natural language understanding task. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Phoenix, USA, Vol. 1, pp. 341–344.

Saraclar, M., Nock, H., Khudanpur, S., 2000. Pronunciation modeling by sharing Gaussian densities across phonetic models. Comput. Speech Language 14 (2), 137–160.

Schlueter, R., Macherey, W., Müller, B., Ney, H., 2001. Comparison of discriminative training criteria and optimization methods for speech recognition. Speech Commun. 34, 287–310.

Schramm, H., Aubert, X., 2000. Efficient integration of multiple pronunciations in a large vocabulary decoder. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Istanbul, Turkey, pp. 1659–1662.

Schramm, H., Aubert, X., Meyer, C., Peters, J., 2003. Filled-pause modeling for medical transcriptions. In: Proc. ISCA and IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, Japan, pp. 143–146.

Schwartz, R., Colthurst, T., Duta, N., Gish, H., Iyer, R., Kao, C.-L., Liu, D., Kimball, O., Ma, J., Makhoul, J., Matsoukas, S., Nguyen, L., Noamany, M., Prasad, R., Xiang, B., Xu, D., Gauvain, J.-L., Lamel, L., Schwenk, H., Adda, G., Chen, L., 2004. Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Montreal, Canada, Vol. 3, pp. 17–21.

Shriberg, E.E., 1994. Preliminaries to a theory of speech disfluencies. Ph.D. thesis, University of California at Berkeley, USA.

Sloboda, T., Waibel, A., 1996. Dictionary learning for spontaneous speech recognition. In: Proc. Internat. Conf. on Spoken Language Processing, Philadelphia, USA, pp. 2328–2331.

Sproat, R., Zheng, F., Gu, L., Li, J., Zheng, Y., Su, Y., Bramsen, P., Kirsch, D., Shafran, I., Tsakalidis, S., Starr, R., Jurafsky, D., 2004. Dialectal Chinese speech recognition: final report. JHU CLSP Workshop.

Strik, H., Cucchiarini, C., 1999. Modeling pronunciation variation for ASR: a survey of the literature. Speech Commun. 29, 225–246.

Wang, Z., Schultz, T., Waibel, A., 2003. Comparison of acoustic model adaptation techniques on non-native speech. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Hong Kong, China, pp. 540–543.

Wessel, F., Macherey, K., Schlueter, R., 1998. Using word probabilities as confidence measures. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Seattle, USA, pp. 225–228.

Woodland, P.C., Povey, D., 2000. Large scale MMIE training for conversational telephone speech recognition. In: Proc. NIST Speech Transcription Workshop, College Park, MD.

Wrede, B., 2002. Modelling the effects of speech rate variation for automatic speech recognition. Ph.D. dissertation, University of Bielefeld, Germany.

Young, S.J., Woodland, P.C., 1994. State clustering in hidden Markov model-based continuous speech recognition. Comput. Speech Language 8, 369–383.

Yu, H., Schultz, T., 2003. Enhanced tree clustering with single pronunciation dictionary for conversational speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Geneva, Switzerland, pp. 1869–1872.

Zheng, J., Franco, H., Stolcke, A., 2000. Word-level rate of speech modeling using rate-specific phones and pronunciations. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Istanbul, Turkey, pp. 1775–1778.