
Knowing What You Don’t Know: Roles for Confidence Measures in Automatic Speech Recognition

David Arthur Gethin Williams

Department of Computer Science

Dissertation submitted to the University of Sheffield

for the degree of Doctor of Philosophy

May, 1999


Abstract

The development of reliable measures of confidence for the decoding of speech sounds by machine has the potential to greatly enhance the ‘state-of-the-art’ in the field of automatic speech recognition (ASR). This dissertation describes the derivation of several complementary confidence measures from a so-called acceptor hidden Markov model (HMM) based large vocabulary continuous speech recognition system, and their application to a variety of tasks pertaining to ASR in realistic environments. A key contribution of the thesis is the demonstration that if a rather general definition of what constitutes a confidence measure is adopted, a framework results within which it is possible to explore the utility of confidence measures throughout the recognition process. This general definition accrues additional benefits when used in conjunction with a set of more specific confidence measure categories.

The fundamental difference between an acceptor HMM and one which adheres to the more common generative formulation is the acceptor’s ability to directly estimate the posterior probability of a class of speech sound given some acoustic observations. Posterior class probabilities, unlike the class conditional likelihoods estimated by generative HMMs, provide measures of model match which are directly comparable across utterances and so constitute a good basis from which to derive measures of confidence.

In addition to a review of the literature resulting from the recent surge of interest concerning confidence measures for ASR, the dissertation includes results for the application of the described confidence measures to the tasks of utterance verification, pronunciation model evaluation and the filtering of ‘unrecognisable’ portions of acoustics from the input to a recogniser. The applications of training data selection, recogniser combination and gender spotting are also described with preliminary results.

The main conclusions of the thesis are as follows. First, the derived confidence measures were found to be useful for the tasks to which they were applied, facilitating overall improvements to the system. Second, in agreement with a number of other studies, it was found that, for the task of utterance verification, confidence measures which draw upon more sources of information may be preferred over those which use fewer. A consequence of creating a confidence measure using many information sources is that their individual contributions tend to become conglomerated, obscuring the cause of the low confidence. Confidence measures with simpler, more explicit links to the recognition models are therefore more informative with regard to the timely task of recogniser diagnostics. An example of this diagnostic function is the accumulation of evidence to support the notion that crude pronunciation models mask the relatively subtle reductions in confidence seen for large vocabulary continuous speech recognition (LVCSR) on clean, read speech, but not the gross model mismatches elicited by non-speech. Acceptor HMMs are capable of producing confidence measures derived from many information sources and are particularly well suited to producing those with simple and explicit links to the recognition models.


Preface

In the interests of brevity, a basic understanding of statistics and probability theory is assumed, together with some knowledge of current automatic speech recognition (ASR) technology. Useful textbooks include [147, 44, 26, 74], for ASR, and [18] for statistical pattern recognition, especially using artificial neural networks.

Publications

Portions of the work described in this dissertation have appeared in previous publications:

G. Williams and S. Renals. Confidence measures for hybrid HMM/ANN speech recognition. In Proceedings of EuroSpeech, pages 1955–1958. ESCA, 1997.

G. Williams. Study of the Use and Evaluation of Confidence Measures in Automatic Speech Recognition. Technical report CS-98-02, Department of Computer Science, University of Sheffield, 1998.

G. Williams and S. Renals. Confidence measures for evaluating pronunciation models. In Proceedings of the ESCA workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 151–155, 1998.

J. Barker, G. Williams and S. Renals. Confidence measures for segmenting broadcast news. In Proceedings of the International Conference on Spoken Language Processing, pages 2719–2722, 1998.

G. Williams and S. Renals. Confidence measures derived from an acceptor HMM. In Proceedings of the International Conference on Spoken Language Processing, pages 831–834, 1998.

G. Cook, J. Christie, D. Ellis, E. Fosler-Lussier, Y. Gotoh, B. Kingsbury, N. Morgan, S. Renals, T. Robinson and G. Williams. The SPRACH system for the transcription of broadcast news. In Proceedings of the DARPA Broadcast News workshop, 1999.

E. Fosler-Lussier and G. Williams. Not just what, but also when: Guided automatic pronunciation modeling for broadcast news. In Proceedings of the DARPA Broadcast News workshop, 1999.

G. Williams and D. P. W. Ellis. Speech/music discrimination based on posterior probability features. To appear in Proceedings of EuroSpeech, September 5–9, Budapest, Hungary. ESCA, 1999.

This dissertation can be downloaded from: http://www.dcs.shef.ac.uk/people/g.williams.


Acknowledgments

First of all, I would like to thank all the members of my research group, SPandH, for providing a friendly, supportive and no-nonsense environment in which to study. I would especially like to thank my supervisor Steve Renals for his friendship and his unfailingly good-humoured and knowledgeable guidance throughout my PhD project. My thanks also go to Martin Cooke, in his advisory role, and for proof-reading this thesis. Thanks to my other proof-readers, Stuart Cunnigham, Matthew Innes-Wilkin and Jane Williams. It is my pleasure to thank the residents of ICSI for providing such a warm welcome during both my visits to Berkeley. In addition to generously sharing their expertise, the Realization group introduced me to a wider community and gave me some insight into different ways in which to do research. I am hugely indebted to the members of the SPRACH and THISL partnerships, especially to Tony, Gary and James in Cambridge, for sharing their software, data and know-how. This foundation formed the substrate upon which my project was built. In addition, I would like to thank Søren Riis for many enlightening discussions regarding acceptor HMMs, both during his visit to Sheffield and since then. The EPSRC, the Royal Commission for the Exhibition of 1851 and the European Union, in the guise of ESPRIT Long Term Research Project THISL (23495), are acknowledged for their financial support.

These acknowledgments would not be complete without mention of those who help me to enjoy my time outside of the lab: My long-time friends and cycling companions, Glyn, Matt, Rich, Paul and Chris, and my newer friends Stu, Dave, Colin, Hilary, Hannah, Becky, Dan, Kevin and Sali. The camaraderie of the various jitsukas that have comprised the Sheffield University Jitsu club during the three years over which I have been a member is also gratefully acknowledged. I would like to extend a special thank you to Charl for being my sounding-board on ideas and philosophies, both great and small.

Finally, this thesis is dedicated to my family, Mum, Jane, and Ga, for their love, support and good humour, and to the memory of my late father, who will always be sadly missed.


Glossary

ABC American Broadcasting Companies Incorporated
ANN Artificial neural network
ASR Automatic speech recognition
ATIS Airline travel information systems (corpus)
BBC British Broadcasting Corporation
BN Broadcast news
CD Context dependent
CI Context independent
CML Conditional maximum likelihood
CNN Cable News Network
DARPA Defense Advanced Research Projects Agency
DET Detection error trade-off (plot)
EER Equal error rate
FA False alarm
FFT Fast Fourier transform
FM Frequency modulation
FOM Figure of merit
GD Gender dependent
GMM Gaussian mixture model
HDM Hidden dynamic model
HMM Hidden Markov model
HYP (Decoding) Hypothesis
ICSI International Computer Science Institute
LDC Linguistic Data Consortium
LIMSI Laboratoire d’Informatique pour la Mecanique et les Sciences de l’Ingenieur
LM Language model
LVCSR Large vocabulary continuous speech recognition
MAP Maximum a posteriori
ME Maximum entropy
ML Maximum likelihood
MLP Multi-layer perceptron
MSG Modulation-filtered spectrogram (features)
NAB North American business news (corpus)
NCH Number of competing hypotheses
NIST National Institute of Standards and Technology
NPR National Public Radio
OOV Out-of-vocabulary
PLP Perceptual linear prediction
REF Reference (transcription)
RHS Right hand side
RM (Naval) Resource management (corpus)
RNN Recurrent neural network
ROC Receiver operating characteristic (plot)
ROVER Recogniser Output Voting Error Reduction (NIST software package)


SD Speaker dependent
SI Speaker independent
SPRACH Speech Recognition Algorithms for Connectionist Hybrids (project)
SWB SwitchBoard (corpus)
UER Unconditional error rate
WER Word error rate
WSJ Wall Street Journal (corpus)
WTN Word transition network


Nomenclature

Probability
Sample probability estimate
Logarithm to base 10
Function
Entropy
Mutual information
The $i$th class
Distance
Order of
An error function
Kullback-Leibler distance
Symmetric Kullback-Leibler distance

Acoustic observation vector
Acoustic observation vector at time $t$
Sequence of acoustic observation vectors
Subsequence of acoustic observation vectors
Space of acoustic observations

Word sequence model
Space of all possible word sequences
Best word sequence model
Partial word sequence model
Set of words
Set of states or subword classes (e.g. phones)
Subword class posterior probability vector at time $t$

The $i$th context independent subword class
The $i$th context dependent subword class

(RNN) State vector at time $t$
Set of state sequence paths for the alignment of a model to some acoustics

Set of baseforms for a word
Best set of baseforms for a word
Space of all possible baseform sets

Complete model parameter set
Space of complete model parameters
Acoustic model parameter set
Space of acoustic model parameters
Language model parameter set
Space of language model parameters
Set of transcribed utterances


An hypothesis
The null hypothesis
The alternative hypothesis
(Statistical hypothesis) Test statistic
Operating point on a test statistic
Set of possible actions resulting from an hypothesis test

Start time (of a decoding hypothesis)
End time (of a decoding hypothesis)
Duration
Log posterior probability of a hypothesis
Log scaled likelihood of a hypothesis
Log online garbage score for a hypothesis
Per-frame entropy averaged over a hypothesis
Log $n$-gram probability for a word given its history

$n$-gram based log posterior probability of a hypothesis
Lattice density averaged over a hypothesis
Language model jitter score for a hypothesis


Contents

Abstract
Preface
Glossary
Nomenclature

1 Introduction
   1.1 Acoustic, Language and Pronunciation Models
       1.1.1 Factorisation and Pre-processing
       1.1.2 A Hierarchy of Acoustics
   1.2 “I’m Sorry, I Didn’t Quite Catch That”
   1.3 Overview

2 Two Types of Hidden Markov Model
   2.1 The Ubiquitous HMM
   2.2 Application to ASR
   2.3 Generative HMMs
       2.3.1 Probability Estimation
       2.3.2 Maximum Likelihood Training Criterion
       2.3.3 Decoding
   2.4 Acceptor HMMs
       2.4.1 Probability Estimation
       2.4.2 Discriminative Training
       2.4.3 Decoding
   2.5 n-gram Language Models
   2.6 The ABBOT/SPRACH System
       2.6.1 Feature Extraction
       2.6.2 Acoustic Model
       2.6.3 Decoding
   2.7 Summary

3 Hypothesis Testing
   3.1 Classical Hypothesis Testing
   3.2 Probability of Error
   3.3 ROC Curves
   3.4 DET Curves
   3.5 Mutual Information
   3.6 Scalar Summary Statistics
       3.6.1 Comments
   3.7 Summary

4 Approaches to Confidence Measures
   4.1 Introduction
   4.2 Likelihood ratios
       4.2.1 Explicit Alternate Models
       4.2.2 Likelihood of Competing Decodings
   4.3 Post-Classifiers
   4.4 Language Model Probabilities
   4.5 Other Measures
   4.6 Summary

5 Confidence Measures derived from an Acceptor HMM
   5.1 Acoustic Measures
   5.2 Grammatical and Combined Measures
   5.3 Word-Level Measures
   5.4 Comments
   5.5 Applications

6 Utterance Verification
   6.1 Introduction
   6.2 Corpora
       6.2.1 NAB Data
       6.2.2 BN Data
   6.3 Marking Scheme
   6.4 Preliminary Experiments
       6.4.1 Duration Normalisation
       6.4.2 Metric Comparison
   6.5 Baseline Experiments
   6.6 Differing Acoustic Conditions
   6.7 Lexicon Size
   6.8 OOV Spotting
   6.9 Summary

7 Confidence-Based Pronunciation Modelling
   7.1 Introduction
   7.2 Accommodating Variation
   7.3 Baseform Learning
       7.3.1 Potential Baseform Generation
       7.3.2 Baseform Evaluation
       7.3.3 Baseform Selection
       7.3.4 Multiwords
   7.4 Transformational Models
   7.5 Acoustic Model Retraining
   7.6 A First Baseform Learning Attempt
       7.6.1 Potential Baseform Generation
       7.6.2 Baseform Evaluation
       7.6.3 Baseform Selection
       7.6.4 Results
   7.7 Decision Tree Smoothing
       7.7.1 Potential Baseform Generation
       7.7.2 Baseform Evaluation and Selection
       7.7.3 Results
   7.8 Summary

8 Filtering Audio Streams
   8.1 Introduction
   8.2 Segmentation
       8.2.1 The Raw Entropy Profile
       8.2.2 Smoothing
       8.2.3 Comments
   8.3 Classification
       8.3.1 Entropy Based Segmentation
       8.3.2 An Alternative Segmentation
       8.3.3 Noisier Data
   8.4 Incorporation of Temporal Properties
   8.5 Summary

9 Discussion
   9.1 Recogniser Combination
   9.2 Gender Specific Acoustic Models
   9.3 Training Data Selection
   9.4 Computing Full Model Probabilities
   9.5 Improved Pronunciation Modelling
   9.6 Improved Language Modelling

10 Conclusion
   10.1 Summary of Experimental Results
   10.2 Novel Aspects

A The ICSI/LIMSI Phone Set


Chapter 1

Introduction

In order to recognise speech, the stream of sounds with which an utterance is realised must be mapped to the word sequence of its inception. This decoding process requires a classification mechanism since a transformation must be made between the continuous domain of the observed acoustics and the discrete sequence of words. The difficulty of this signal-to-symbol mapping is compounded by variability that is inherent in the speech signal:

Inter-speaker variation includes physiological differences, such as vocal tract length, which affects the formant frequencies of a speech sound, and socio-geographical differences, such as dialect.

Intra-speaker variability can be physiological, such as the effects of co-articulation,1 or psychological, where a speaker’s mood or intention can affect factors such as speaking rate, loudness and intonation. Environmental context and background are also important factors: For example, read speech is typically articulated more precisely than spontaneous speech and the presence of background noise can elicit the Lombard effect [125].

Extra-speaker variation covers the effect of the channel upon the observed acoustics. An utterance can be spoken, for example, in a quiet office environment, in the midst of traffic noise or over a (bandwidth limited) telephone line.

1.1 Acoustic, Language and Pronunciation Models

In order to build an automatic speech recognition (ASR) system,2 the mapping between the acoustics and the words of an utterance must be modelled using a mathematical function. To date, the most successful approaches to this task have employed statistical pattern recognition techniques. A statistical model has a number of adjustable parameters, the values of which are inferred from an (assumed representative) data sample via some training algorithm. As the goal of the inference process is to model the underlying distribution of the observed data points for a given class of speech sound, the approach is well suited to accommodating the variation in the speech signal.

1.1.1 Factorisation and Pre-processing

In order to recognise an utterance with the minimal risk of error, Bayes’ decision rule states that the word sequence model with the highest posterior probability should be chosen:

$$\hat{M} = \arg\max_{M \in \mathcal{M}} P(M \mid X, \Theta) \qquad (1.1)$$

1In common with all other physical objects, the components of the speech production system (tongue, teeth, lips, velum, glottis etc.) have mass and so exhibit inertia. This resistance to a change in motion creates a blurring of acoustic features, termed co-articulation, seen when one speech sound is fluidly sequenced with another. The compensatory phenomena of assimilation and elision, observed when speech sounds requiring radically different articulator configurations are juxtaposed, may also be included in a description of co-articulation.

2According to convention, the term recognition is limited to the transcription of an utterance and understanding is reserved for some higher level of abstraction.



where $\mathcal{M}$ is the space of all possible word sequence models, $\Theta$ is the set of all parameters employed in the space and the acoustics are represented by a sequence of acoustic observation vectors, also known as feature vectors or patterns, $X = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$.

Feature extraction is an important pre-processing of the acoustic waveform since, in principle, such a process may be used to reduce the data rate to a tractable level whilst also emphasising those aspects of the signal which are useful for discriminating between various classes of speech sound. A feature of the speech signal which is deliberately attenuated in current ASR systems is fundamental frequency. Although future speech understanding systems will undoubtedly utilise the intonation profile of an utterance, fundamental frequency is deemed to be unimportant for recognition and is suppressed to simplify the task. For the purposes of recognising speech using a digital computer, a high quality (broadband) representation of the acoustic signal is typically obtained using a sampling frequency of 16 kHz.3 Feature vectors are computed from the resulting discrete time-series using a local analysis window of the order of 20–40 ms, termed a frame, which is advanced using a step-size of the order of 10–20 ms. The speech signal is assumed to be stationary over the interval spanned by the step-size and is thus represented at the potentially coarser level of granularity of the frame rate. Useful reviews of feature extraction methodologies are given in [147, 74]. Although the feature extraction stage provides an opportunity to aid the task of classification through the application of prior knowledge regarding the nature of the speech signal, the assumption that the speech signal is stationary over intervals of 10–20 ms is a potentially limiting one. Many other potentially limiting assumptions are employed by current ASR technology for reasons of computational expediency, some of which are described in chapter 2.

Although Bayes’ decision rule (equation 1.1) provides a recognition strategy which is in principle optimal, direct estimation of $P(M \mid X, \Theta)$ is difficult in practice. Fortunately Bayes’ theorem (equation 1.2) provides a useful factorisation of $P(M \mid X, \Theta)$ which is almost universally adopted for ASR: $P(M \mid \Theta_L)$ is the prior probability of the word sequence and is typically estimated using a language model, leaving $p(X \mid M, \Theta_A)$ to be estimated by an acoustic model. The use of the symbols $\Theta_A$ and $\Theta_L$ indicates that the two models employ disjoint parameter sets.

$$P(M \mid X, \Theta) = \frac{p(X \mid M, \Theta_A)\, P(M \mid \Theta_L)}{p(X \mid \Theta)} \qquad (1.2)$$
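Returning to the frame-based analysis described above, the following minimal sketch shows how a sampled waveform is sliced into overlapping analysis frames. The 25 ms window and 10 ms step are illustrative values within the 20–40 ms and 10–20 ms ranges quoted; the function name and defaults are not drawn from any particular system.

```python
import numpy as np

def frame_signal(waveform, sample_rate=16000, window_ms=25, step_ms=10):
    """Slice a sampled waveform into overlapping analysis frames.

    Each frame would subsequently be mapped to a feature vector
    (e.g. by PLP or MSG analysis); only the framing step is shown here.
    """
    window = int(sample_rate * window_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)       # frame advance
    n_frames = 1 + max(0, (len(waveform) - window) // step)
    return np.stack([waveform[i * step : i * step + window]
                     for i in range(n_frames)])

# One second of (placeholder) audio yields roughly 100 frames.
frames = frame_signal(np.zeros(16000))
print(frames.shape)  # (98, 400)
```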

In general, the number of parameters that may be utilised by a model and the detail of its resolution is limited by the quantity of data that is available for training. Given a larger training set, it is possible to (1) train a model with more parameters; or (2) to train a more detailed model. An example of (1) is the expansion of the number of hidden units (HUs) in a multilayer perceptron (MLP), allowing the inference of a more general set of functions. If an insufficient amount of data is used to train such a model, overfitting of the training data will result, leading to poor generalisation to previously unseen data. An example of (2) is an increase in the number of classes to be distinguished by the model. In this case it must be ensured that the training data contains sufficient examples of these classes, otherwise unreliable parameter value estimates will be obtained. Although a more detailed model typically employs more parameters, the distinction is still useful. A recurring theme within the field of statistical modelling in general, and the application of statistical modelling techniques to ASR in particular, is the need to temper the desire to create a more ‘complex’ model by the amount of training data available.

3Sampling theory states that a signal must be sampled at the Nyquist rate of 16 kHz to unambiguously resolve frequency components up to 8 kHz, i.e. without aliasing [131].


An example where this complexity/data tradeoff influences the design of an ASR system is the use of disjoint training sets for the acoustic and language models: The acoustic model must be trained using a transcribed corpus of recorded speech data, whereas the language model probability distributions, which are conditionally independent of any acoustic observations, may be estimated using a separate, text-only, corpus. As several orders of magnitude more words may be collected and stored if their acoustic realisations are not required, the acoustic and language models are typically trained using two different corpora.

1.1.2 A Hierarchy of Acoustics

Since speech recognition is a classification process, the task of recognising an utterance decomposes into the task of identifying a sequence of examples drawn from a set of speech sound classes. An essential design question when building an ASR system is, therefore, what classes of speech sound should be chosen as the basic modelling unit? Any such class must satisfy two constraints:

Consistency Examples of the same class must exhibit consistently similar acoustic features, whereas examples of differing classes must possess acoustic features which are consistently distinct.

Trainability Sufficient examples of each class must exist in the training data to support robust estimation of model parameters.

One source of speech sound classes which may seem initially appealing are the words of a language. Whilst words might ostensibly satisfy the first constraint, they fail the second for large vocabularies due to the broad spectrum of relative word frequencies: the word gargoyle, for example, is widely known but infrequently used, whereas short function words, such as and, of and the, occur with extreme regularity. For all but the most constrained vocabularies, there will be some words with very few examples, no matter how large the training corpus. Data sparsity problems are therefore inevitable when trying to infer reliable parameter values for whole word models. Another problem which plagues the whole word modelling approach is that parameter storage demands increase rapidly if a separate model is created for each addition to the vocabulary.

A more practical and parsimonious solution for large vocabularies is to exploit the hierarchical nature of speech sounds: utterances are composed of a series of words, which are in turn created using an inventory of subword speech sounds. If the speech signal is modelled at the subword-level, training examples as well as parameters for the collection of subword models are shared across frequently and infrequently occurring words alike. The sharing of parameters across several word-level models, as described in [11], is an example of the important notion of parameter tying, reviewed in [215], which may be used to obtain robust parameter value estimates given limited data.

One candidate for an appropriate subword unit is the linguistically inspired phoneme, defined as one of the set of speech sounds in any given language that serve to distinguish one word from another [1]. Due to factors such as co-articulation, dialect and the extremely complex nature of what constitutes ambiguity between words, the realisation of two instances of the same phoneme can be acoustically distinct and yet perform the same linguistic function. As an example, consider the initial plosive sound in the word bicycle. If the word is spoken quickly, the plosive may be completely unvoiced;4 whereas a slower speaking rate is likely to yield a voiced version. The interchange of these two plosives does not cause a mutation to another word. The notion of an allophone is introduced to accommodate this fact and is defined as any of several speech sounds that are regarded as contextual or environmental variants of the same phoneme [1]. Given that allophonic variation exists, a phonetically motivated inventory of subword speech sounds may well be preferable for the task of acoustic model building, where a phone is defined as a single, uncomplicated speech sound [1]. Phones may be shared across languages whereas phonemes are by definition language specific. The use of a phone-based subword class inventory has been adopted by a large number of current ASR systems, including the ABBOT system [169] used to carry out the experiments described in this dissertation.

4Voicing, or its absence, refers to the state of the glottis. If the edges of the vocal folds are nearly touching, air expelled from the lungs will cause them to vibrate. If the folds are well separated, however, no vibration will occur. Sounds produced with vibrating vocal folds are said to be voiced as opposed to those produced with a wide glottis which are said to be unvoiced [119].


A table of the 54 members of the ICSI/LIMSI phone set used by the ABBOT system, and also for any examples of pronunciation given in the following chapters, is provided in appendix A. The ICSI/LIMSI phone set is a subset of the 61 phones that were used to transcribe the DARPA TIMIT5 corpus. Other possible subword units include diphones and syllables. Diphones are typically defined to extend from the steady state portion of one phone to that of its neighbour and so focus upon transitional portions of the speech signal. Diphones have proved popular for speech synthesis, e.g. [19], and may be motivated from a recognition perspective by evidence, such as that presented in [66], which suggests that periods of maximal spectral transition provide the most information with regard to phoneme and syllable perception. The importance of syllable length durations (i.e. of the order of approximately 200 ms) for speech recognition is described in [24, 213]. Despite their appeal, the large number of possible syllables contained in English creates a number of computational problems for large vocabularies, similar to those associated with the choice of words as the basic modelling unit.

Adopting subword classes as the basic modelling units necessitates the specification of a set of pronunciation models to decompose words into the sequence of subword sounds used to realise them. A popular pronunciation modelling strategy is to compile a lexicon of ‘static’ baseforms enumerating mappings between the orthography of in-vocabulary words and strings of subword units. Generating accurate pronunciation models for words is a difficult problem, however, as the pronunciation of a word rarely follows the ‘beads-on-a-string’ template of a static baseform, but is rather a complex function of many (high-level) contextual factors. The distinction between static and dynamic pronunciation models is described in chapter 7.
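To make the notion of a static baseform lexicon concrete, the following toy sketch maps orthography to phone strings in the style of the ICSI/LIMSI labels of appendix A. The entries are illustrative only, not taken from the actual ABBOT lexicon.

```python
# A toy static pronunciation lexicon: each word maps to one or more
# 'beads-on-a-string' baseforms (phone labels in ICSI/LIMSI style).
LEXICON = {
    "miniature": [["m", "ih", "n", "iy", "ax", "ch", "axr"]],
    "signature": [["s", "ih", "gcl", "n", "ax", "ch", "axr"]],
    "the":       [["dh", "ax"], ["dh", "iy"]],   # alternative baseforms
}

def expand(word_sequence):
    """Expand a word sequence into a phone string, taking the first baseform."""
    return [phone for w in word_sequence for phone in LEXICON[w][0]]

print(expand(["the", "signature"]))
```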

1.2 “I’m Sorry, I Didn’t Quite Catch That”

A crucial property for any speech recogniser, especially when supplied with a degraded speech signal or one which is overlaid upon acoustics from other sources (potentially including other speakers), is to be able to attribute some measure of confidence to a decoding it produces. This need has also been identified in several relatively recent reviews of the ‘state-of-the-art’ in the field of ASR [37, 24]. The benefit of assigning a confidence estimate to a decoding is succinctly summarised by the phrase, knowing what you don’t know. A popular definition for the term confidence measure [73, 205, 187, 181, 103] is:

The posterior probability of word correctness, given the values of some set of confidence indicators.

However, it will be argued that a much more useful definition is:

A function which quantifies how well a model matches some speech data; where the values of the function must be comparable across utterances.

This general definition, which is used throughout the rest of this text, is preferable as it allows confidence measures to be applied at levels other than the word-level, such as at the subword- or utterance-levels. The general definition accrues additional benefits when used in conjunction with three more specific categories:

An acoustic confidence measure; derived exclusively from the acoustic model.

A grammatical confidence measure; derived solely from the language model.

A combined confidence measure; derived from both the acoustic and the language models.

These specific categories are useful as they accommodate confidence measures based upon specific components of a recognition system and, as a consequence, encourage the design of confidence measures with simple and explicit links to the recognition models.
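As a minimal illustration of an acoustic confidence measure in this sense, the sketch below scores a decoding hypothesis by its duration-normalised log posterior, computed from per-frame subword class posteriors. The function and argument names are illustrative rather than any system's actual interface; the normalisation is what makes the values comparable across hypotheses of different lengths.

```python
import numpy as np

def acoustic_confidence(posteriors, aligned_classes, start, end):
    """Duration-normalised log posterior of an hypothesis.

    posteriors      -- (T, K) per-frame subword class posterior vectors
    aligned_classes -- length-T index of the decoded class at each frame
    start, end      -- frame span of the hypothesis (e.g. one word)
    """
    frames = range(start, end)
    log_post = sum(np.log(posteriors[t, aligned_classes[t]]) for t in frames)
    return log_post / (end - start)
```

Because the score is built solely from acoustic model outputs, it falls into the first of the three categories above; an analogous function over language model scores would yield a grammatical measure.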

5The acronym is derived from Texas Instruments and Massachusetts Institute of Technology, where the corpus was collected and transcribed respectively. The corpus is available from the Linguistic Data Consortium: http://www.ldc.upenn.edu/


The benefits of adopting this framework for investigating confidence measures are illustrated in the following example: Given that a word has been decoded with low confidence, it is desirable to be able to identify the cause of the lack of confidence. One potential source is the occurrence of an out-of-vocabulary (OOV) word. In this case, although the word-level confidence is low, the acoustic confidence of all its subword-level constituents should be high, since all words are composed from the same inventory of subword sounds. Further evidence for an OOV word would be provided by a locally low language model score, since the cause of such a low score could be the incorporation of an incorrect (in-vocabulary) word, with a similar acoustic realisation to the OOV input, into an otherwise coherent and correctly recognised word sequence. Another possible cause of low confidence is noisy acoustics. In this case, a function which provides a general measure of acoustic model match will be able to discriminate between clean, well modelled speech and poorly modelled speech, such as that transmitted over a low fidelity channel or overlaid on a high level of background noise or music (assuming that the acoustic model is trained using clean speech data).

This rather general framework being advocated is additionally attractive as it suggests a wide range of potential confidence measure applications throughout the recognition process:

Filtering Given a general measure of acoustic model match, ‘unrecognisable’ regions may be excised from an unconstrained stream of acoustics: Regions of clean speech will be well matched by an appropriately trained acoustic model, whereas regions of low fidelity speech or regions containing high levels of non-speech will be poorly matched. If this filtering stage is carried out prior to the decoding6 stage, the potential exists to reduce both the overall computational expense and the word error rate (WER) incurred by a recogniser operating in a realistic environment. As a refinement to this strategy, the values of a general measure of acoustic model match may also be used to predict the WER for a particular region of acoustics. Given gender-dependent acoustic models, a similar filtering process may be used to make a gender decision on portions of an incoming speech signal, again prior to the decoding stage. The appropriate acoustic model may then be used to recognise the gender-tagged regions with potential for further WER reductions. The scope of this filtering approach may be further extended to distinguishing between adult and child speech, accents etc.

The application of a confidence measure to a filtering task could potentially be usefully combined with some initial sound source separation, either through auditory scene analysis [27, 40] or hidden Markov model decomposition [199, 200], or some speech enhancement process, such as spectral subtraction [21] for example. Whilst the isolation of a single speech signal from a more complex auditory scene has clear benefits for recognition, a general measure of acoustic model match is still desirable as (1) complete source separation is not guaranteed; and (2) error free recognition, even given complete separation, is not assured. This potentially fruitful combination was not pursued in the work described by this dissertation. (A sketch of one such general measure of acoustic model match follows this item.)
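One general measure of acoustic model match suited to this filtering role is the per-frame entropy of the subword class posterior vector: a well matched acoustic model concentrates probability on few classes (low entropy), whereas non-speech or low fidelity input tends to flatten the posteriors (high entropy). A minimal sketch, in which the smoothing window and threshold are illustrative values to be tuned on held-out data:

```python
import numpy as np

def frame_entropy(posteriors):
    """Entropy (in bits) of each frame's subword class posterior vector."""
    p = np.clip(posteriors, 1e-12, 1.0)          # guard against log(0)
    return -(p * np.log2(p)).sum(axis=1)

def recognisable_mask(posteriors, threshold_bits=3.0, window=51):
    """Flag frames whose median-smoothed entropy falls below a threshold."""
    h = frame_entropy(posteriors)
    pad = window // 2
    padded = np.pad(h, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + window])
                         for i in range(len(h))])
    return smoothed < threshold_bits
```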

Search It is possible to order or prune the search over the space of alternative decoding hypotheses by ranking partial decoding hypotheses according to some measure of their confidence.

By using acoustic and grammatical confidence measures in tandem, the relative weightings of the acoustic and language model contributions to the probability of an hypothesis may be adjusted ‘on-the-fly’, according to the respective qualities of model match. For example, the language model component could be favoured in regions of noisy acoustics and vice versa.

Rejection This application encompasses three related tasks:

Utterance Verification The verification task is to identify and reject incorrect decoding hypotheses whilst retaining those which are correct. No distinction is made between causes of incorrect decodings and so it is implicitly assumed that all inputs to the recogniser fall within its vocabulary. If this assumption holds, causes of error include crude recognition models, non-speech sounds, disfluencies and search errors.

Keyword Spotting If only a small set of task-specific keywords are of interest for a given application, and the inputs to the recogniser are relatively unconstrained, i.e. possibly containing non-speech sounds and words drawn from a wide vocabulary, an efficient strategy is to search only for the keywords of interest and to reject all other inputs. Keyword spotting is common in many telephony based applications, such as voice dialing for example.

6The term decoding is conventionally reserved for search over $\mathcal{M}$ for $\hat{M}$, whereas recognition is used to describe the complete transcription process.

OOV Word Spotting If the vast majority of recogniser inputs are in-vocabulary words spoken in a benign acoustic environment, a large proportion of the remaining recognition errors can be eliminated if errors due to the occurrence of OOV words can be reliably spotted.

It will be seen in chapter 3 that these three tasks, amongst others, may be cast within a statistical hypothesis testing framework, where the value of a confidence measure determines the decision made by the hypothesis test. A complementary set of confidence measures, sensitive to different sources of recogniser error, will clearly be useful for a refined approach to the rejection task.
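Cast as an hypothesis test, each of these rejection tasks reduces to comparing a confidence score against an operating point: scores above the threshold lead to acceptance of a decoding, scores below to rejection. A minimal sketch, where the threshold would in practice be chosen to trade false alarms against false rejections (e.g. at the equal error rate):

```python
def verify(hypotheses, confidences, threshold):
    """Accept or reject decoded words by thresholding a confidence measure.

    hypotheses  -- list of decoded words
    confidences -- parallel list of confidence scores
    threshold   -- operating point on the test statistic
    """
    return [(word, score >= threshold)
            for word, score in zip(hypotheses, confidences)]

# e.g. verify(["the", "gargoyle"], [-0.2, -4.1], threshold=-1.0)
# -> [("the", True), ("gargoyle", False)]
```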

Model Selection If several recognisers are run in parallel, confidence measures may be used to select between the competing decodings, in an analogous way to the search application. A motivation for running several recognisers in parallel is that they may be tuned to different acoustic conditions, genders, ages or dialects.

Model Adaptation One source of difficulty for an ASR system is a mismatch between the acoustic conditions encountered during training and those seen when testing the recognition models. As a consequence, there is a growing interest in algorithms which adapt model parameters to more closely match the incoming speech during the testing phase. Ideally, model adaptation should be based only upon correct mappings between speech sound examples and their class. Confidence measures provide a means to refine model adaptation procedures carried out in an unsupervised fashion.

Data Selection The statistical pattern recognition approach to ASR requires that the recognition models are trained on data. Training subword-level acoustic class models, for example, requires an acoustic corpus transcribed at the subword-level. As obtaining such a transcription by hand is costly and time-consuming, an automatic decomposition of the reference orthography for some training data into a sequence of subword-level sound classes is often used. An incorrect subword-level transcription of an acoustic training set will clearly compromise the quality of an acoustic model trained upon it. Confidence measures derived from a partially trained model could be used in a ‘bootstrap’ fashion to identify and excise low confidence (errorful) portions of the training data transcription, with beneficial consequences for the quality of the final model [2]. Perhaps more excitingly, confidence measure based data selection may also be used in an unsupervised recogniser training regime, where a partially trained recogniser is used to decode cheap and plentiful untranscribed acoustics; high confidence portions of the automatic transcription are selected and trained to; and the process is iterated, as sketched below.
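Schematically, the unsupervised regime just described iterates decode, select, retrain. In the sketch below, decode and retrain are placeholder hooks for whatever recogniser and trainer are to hand; only the confidence-based selection logic is shown.

```python
def bootstrap_training(decode, retrain, model, untranscribed_audio,
                       threshold, iterations=3):
    """Iteratively grow a training set from high-confidence decodings.

    decode(model, utt)     -> list of (word, confidence) pairs
    retrain(model, pairs)  -> a re-estimated model
    """
    for _ in range(iterations):
        selected = []
        for utt in untranscribed_audio:
            hyps = decode(model, utt)
            # Keep only portions decoded with high confidence.
            selected.extend((utt, word) for word, conf in hyps
                            if conf >= threshold)
        model = retrain(model, selected)
    return model
```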

Diagnostics and Evaluation When ASR systems fail (which currently they all too readily do), finding the source of the problem will provide the feedback necessary to improve the offending component. This diagnostic approach to system improvement has only recently received interest since large performance improvements have previously been unfailingly available through the blind adoption of larger training corpora and model parameter sets [24]. As the body of ASR technology matures and portions of it are utilised in niche applications, diagnostic tools will become invaluable for system development. An important role for confidence measures is to satisfy this diagnostic need by identifying which components of a system are providing a good match to the data and which are not. Confidence measures may thus provide a useful complement to the standard method of system evaluation: the measurement of WER [217].

Subsequent Processing Confidence measures have also been shown to be useful for processes following the recognition of some acoustic data. For example, confidence measures may be incorporated into word and term weighting functions for the purposes of information retrieval and extraction [197] from spoken documents.

It should be noted at this point that the use of the term confidence measure as applied to some decoding hypothesis is distinct from the notion of a confidence interval, sometimes referred to as an error bar, provided by a probability distribution over a model prediction. The topic of confidence intervals in relation to the Bayesian approach to parameter estimation is returned to in section 2.3.2.


1.3 Overview

The vast majority of current ASR systems are based upon generative hidden Markov models (HMMs). The outputs of a generative HMM require further processing, however, before they may be used as a confidence measure. In contrast, acceptor HMMs are well suited to producing confidence measures, since their outputs may be used as such directly. The experimental results presented in this dissertation show that various confidence measures derived from an acceptor HMM are useful for the tasks of utterance verification, pronunciation model learning and the filtering of acoustics prior to decoding. Some evidence is also presented to support the argument that confidence measures with simple and explicit links to the recognition models can provide insight into the performance of various components of an ASR system and so constitute useful diagnostic tools. As confidence measures can be dependent upon various components of a recognition system, their diagnostic use highlights the need for a holistic approach to system development. In addition to the tasks investigated, the framework outlined in section 1.2 suggests a number of potential confidence measure applications throughout the recognition process.

The remainder of the dissertation has the following structure:

Chapter 2 compares and contrasts generative and acceptor HMMs, providing background theory, notation and terminology.

Chapter 3 describes the fundamentals of statistical hypothesis testing and a number of metrics for evaluating the performance of an hypothesis test. It is argued that the differing properties of the various evaluation metrics make them preferable under different circumstances. The chapter also explains how various confidence measure applications can be cast within an hypothesis testing framework.

Chapter 4 reviews the confidence measure literature.

Chapter 5 describes the confidence measures derived for the experimental component of this work.

Chapter 6 provides method and results for a series of utterance verification experiments designed to compare the performance of the set of derived confidence measures upon two quite different corpora.

Chapter 7 reviews the pronunciation modelling literature and describes two experiments carried out to investigate the potential of confidence measures as a means to assess pronunciation models. The hypothesis that improved pronunciation models lead to improved acoustic confidence measures is also tested.

Chapter 8 describes several experiments carried out to investigate the utility of a general measure of acoustic model match for discriminating between ‘recognisable’ and ‘unrecognisable’ acoustics.

Chapter 9 discusses a selection of potential applications of confidence measures and some related issues.

Chapter 10 provides the thesis conclusions.


Chapter 2

Two Types of Hidden Markov Model

2.1 The Ubiquitous HMM

HMMs have been found to be extremely useful for the task of acoustic modelling as they may be used to estimate a probability distribution over different word sequence models $M$ for an utterance $X$. Since examples of a given class of speech sound can differ in length, due to varying speaking rates for example, the probability estimation problem differs from that for a static classification task. For ASR, the class models must be aligned to varying length intervals of acoustics before classification can proceed. HMMs have surpassed nearly all other methods for ASR since their architecture inherently combines a time-alignment component with statistical pattern matching techniques in a computationally feasible manner. An agreeable result of the eclipse of the previous method of choice for ASR, template matching via dynamic programming (DP), comprehensively reviewed in [147], is the replacement of heuristic techniques with well founded statistics.

The theory of generative HMMs is well established and its application to ASR has been described in a number of sources [148, 145, 146, 142, 147, 44, 26]. The goal of this chapter is not to repeat the tutorial information contained in the above citations, but rather to compare and contrast the well established generative HMM to a less well known alternative: the acceptor HMM.

The common features of the two variants define them as examples of an HMM:

Both models are examples of stochastic finite state machines. As such, they are composed of a set of states, each state possessing a set of probabilistic transitions to other states, possibly including itself.

Both models implement a Markov chain. The Markov assumption is that the probability of transition from one state to the next is conditioned only upon a limited history of the previously visited states. The probability of transition to the next state is conditioned only upon the current state in a 1st order Markov chain;1 upon the current and previous state in a 2nd order chain; and so on. The transition probabilities of a Markov chain are time invariant.

Probability distributions over the values of some observed variable are associated with each state or each transition of the model, given a Moore or Mealy formulation respectively, as described in [44]. A consequence of these distributions is that in general it is uncertain which state corresponds to which observation vector and so the state sequence is effectively hidden from the observation sequence.

1Sometimes summarised by the statement that “the future is conditionally independent of the past.”
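These shared ingredients can be collected into a small data structure. The following is a minimal sketch of a first order, Moore-formulation HMM; the field names are illustrative rather than standard.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class HMM:
    """A first order, Moore-formulation HMM.

    transitions[i, j] -- P(next state j | current state i); rows sum to 1
                         and are time invariant (the Markov property).
    emit              -- emit(state_index, observation) evaluates the
                         observation distribution attached to a state.
    """
    transitions: np.ndarray
    initial: np.ndarray          # distribution over the first state
    emit: Callable

    @property
    def n_states(self):
        return self.transitions.shape[0]
```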


2.2 Application to ASR

HMMs may be applied to ASR by creating an HMM for each class of speech sound to be modelled (including a silence/background model) and by representing points in the acoustic observation space using values of the random (vector valued) variable associated with the observation distributions. Utterance-level HMMs are created by concatenating word-level HMMs, which are in turn constructed by concatenating subword-level HMMs. Class models are assigned a strictly left-to-right (as opposed to a fully connected ergodic) architecture [96], reflecting prior knowledge of the sequential nature of speech. If no state skipping transitions are allowed, the number of states in a model provides a minimal durational constraint2 (vowels are typically longer than consonants). The addition of self-loops on states accommodates temporal variation in the examples of a class. The observation distributions accommodate variation in the acoustic realisation of a class, which is assumed to be stationary in time. A sketch of such a left-to-right topology is given at the end of this section.

The next two sections cover the ‘three classic problems of HMMs’ [148, 146]3 for the generative and acceptor cases respectively:

Probability Estimation How can the probability of a word sequence model $M$ be calculated, given some acoustics $X$?

Training How can values for the model parameters be inferred from some training data?

Decoding Given a set of trained HMMs, how can the ‘best’ word sequence model $\hat{M}$ be found? Bayes’ decision rule (equation 1.1) states that the probability of misclassification4 is minimised if $\hat{M}$ is chosen such that $P(M \mid X, \Theta)$ is maximised.
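The following minimal sketch constructs the strictly left-to-right transition matrix described above, in which each state may only loop on itself or advance to its successor; the self-loop probability is a placeholder that would be re-estimated during training.

```python
import numpy as np

def left_to_right_transitions(n_states, self_loop=0.6):
    """Transition matrix for a strictly left-to-right HMM (no skips).

    Self-loops accommodate temporal variation; forcing a pass through
    all n_states imposes a minimal duration of n_states frames.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = self_loop            # stay in the current state
        A[i, i + 1] = 1.0 - self_loop  # advance to the next state
    A[-1, -1] = 1.0                    # final state
    return A

print(left_to_right_transitions(3))
```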

2.3 Generative HMMs

2.3.1 Probability Estimation

Generative HMMs take advantage of the observation that the denominator of equation 1.2 (Bayes’ theorem), $p(X \mid \Theta)$, is independent of $M$ and is therefore constant for all possible decodings of an utterance, given fixed $\Theta$. The equation may therefore be simplified (although this causes a problem from the perspective of producing confidence measures, as described in chapter 4):

$$\hat{M} = \arg\max_{M \in \mathcal{M}} p(X \mid M, \Theta_A)\, P(M \mid \Theta_L) \qquad (2.1)$$

If a language model is used to estimate $P(M \mid \Theta_L)$, then only $p(X \mid M, \Theta_A)$ remains to be estimated by the acoustic model. As the acoustic model term amounts to the probability of the acoustics conditioned upon a particular model $M$, the acoustic model may be interpreted as estimating the probability that $M$ generated the acoustics. This conditional probability is often referred to as the likelihood of the model and may be found by summing over the probabilities of the set of all legal state sequences, termed paths, of length $T$ through the model, $\mathcal{Q}$:

$$p(X \mid M) = \sum_{Q \in \mathcal{Q}_M} p(X, Q \mid M) \qquad (2.2)$$

2Explicit duration distributions may also be associated with a state, although they are not considered here.

3This formulation of HMM theory is attributed in [146, 147] to a series of lectures given by J. D. Ferguson of the Institute of Defence Analyses.

4One form of misclassification may have worse consequences than another. A commonly used example is from the domain of medical diagnosis, where the repercussions of erroneously diagnosing a patient as not having a life-threatening condition may be worse than vice versa. The relative weightings of different forms of misclassification may be encoded in a loss matrix and incorporated into the search for the minimum risk classification. If all weightings are equal, minimum risk becomes equivalent to the minimum probability of misclassification.

9

Page 22: Knowing What You Don’t Know: Roles for Confidence Measures in Automatic Speech ...homepages.inf.ed.ac.uk/srenals/pdf/gethin-thesis.pdf · 2013-11-23 · G. Williams and S. Renals

where $Q = (q_1, \ldots, q_T)$ and $q_t$ represents the occupancy of some state at time $t$; the probabilities may be summed as each path represents a mutually exclusive event.

The computational expense of computing the probability of a path may be reduced by making Markov assumptions. For a first order Markov model, the factorisation of equation 2.2 is given in equations 2.3 and 2.4 for a Mealy and Moore formulation respectively:

$$p(X \mid M) = \sum_{Q \in \mathcal{Q}_M} \prod_{t=1}^{T} p(x_t, q_t \mid q_{t-1}) \qquad (2.3)$$

$$p(X \mid M) = \sum_{Q \in \mathcal{Q}_M} \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, p(x_t \mid q_t) \qquad (2.4)$$

An additional assumption which is implicit in both 2.3 and 2.4 is the acoustic observation independence assumption, well described in [26], which states that the probability of the current acoustic observation vector is conditionally independent of any acoustic context given the current state (or transition). This amounts to the assumption that the current state (or transition) summarises all relevant contextual information. Written formally for the Moore formulation, this becomes:

$$p(x_t \mid x_1, \ldots, x_{t-1}, q_1, \ldots, q_t) = p(x_t \mid q_t) \qquad (2.5)$$

Given the high degree of context dependency that is present in the speech signal, due to co-articulation for example, this assumption can be a limiting one for a simple Markov model. As an example, consider the words miniature ([m ih n iy ax ch axr]) and signature ([s ih gcl n ax ch axr]). Through introspection, it can be witnessed that the articulation and hence the acoustic realisation of the [ih] in miniature, unlike that in signature, is influenced by the nasal properties of the surrounding [m] and [n]. If simple context-independent (CI) phone models are concatenated to form the respective word models, as implied by the phonetic transcriptions given in square brackets, no distinction will be made between the two different realisations of the phone [ih], resulting in a crude model. Although this example highlights problems for first order HMMs created using CI subword classes, it represents a simplification of the degree of context dependency in the speech signal, where the realisation of one class of speech sound may be dependent upon factors more distant and complex than just the identity of the class preceding it.

Steps taken to mitigate the acoustic observation independence assumption within the generative HMM framework include (1) a refinement in the granularity of the basic modelling units, through the use of context-dependent (CD) subword units, e.g. [121]; and (2) the incorporation of the first and second derivatives of the individual features (computed over a local analysis window), termed delta features, into the acoustic observation vector [67]. It is clear from the example that what is required is an increase in the context dependency of the probability distributions used to create the HMM. Justification for (1) is found in the fact that any $n$th order HMM can be represented using a first order model with an expanded set of states, i.e. a set of CD states. Modelling higher order Markov statistics requires a more complex model and so has associated problems in terms of computational expense and the amount of training data required, however, no matter whether it is done by lengthening the conditioning history or by emulating a higher order model through the use of more states. To model CD subword units, it must be ensured that the training set is sufficiently large so as to contain a sufficient number of examples of the relevant context classes. Although the incorporation of delta features into the acoustic observation vector, as described in (2), may be motivated by (a) the knowledge that transitional information is important [66]; and (b) their ability to characterise the acoustic context in a limited sense, a precise interpretation of their use is lacking. A point raised in [24] is that the use of delta features could make HMMs more sensitive to speaking rate.

Despite the computational savings offered by Markov assumptions, the direct computation of the full model probability given in equations 2.3 and 2.4 is computationally infeasible for anything other than simple examples, requiring $\mathcal{O}(N^T)$ operations [148, 26] for a model with $N$ states and an observation sequence of length $T$. Fortunately the required probability may be computed using the

10

Page 23: Knowing What You Don’t Know: Roles for Confidence Measures in Automatic Speech ...homepages.inf.ed.ac.uk/srenals/pdf/gethin-thesis.pdf · 2013-11-23 · G. Williams and S. Renals

highly efficient forward-backward algorithm [13, 12] using only $\mathcal{O}(N^2 T)$ operations [148, 26].5 The efficiency of this algorithm stems from its recursive nature and hence the elimination of repeated calculations. (In actual fact only the so-called forward recursion is required to compute this particular probability, although both the forward and backward recursions are called upon in a solution to the training problem.) Further reductions in computational expense can be made by replacing the summation over the probabilities given by all paths through a model with a DP search for only the most probable path through $M$. This is termed the Viterbi approximation [202, 59] and amounts to the assumption that only one path provides any significant contribution to the full model probability. For the Moore formulation:

$$p(X \mid M) \approx \max_{Q \in \mathcal{Q}_M} \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, p(x_t \mid q_t) \qquad (2.6)$$

An agreeable side-effect of the Viterbi approximation is an explicit segmentation for the time-alignment of the word sequence model $M$ to the acoustics $X$. This segmentation provides a state label for every frame of an utterance. By aligning the reference word transcription model against some acoustic training data, in a so-called forced Viterbi alignment, training targets can be derived for each frame of the utterance. (In the absence of a phone-level transcription, this automatic alignment procedure relies upon a set of pronunciation models to map the reference orthography to a sequence of subword classes. Crude pronunciation models will therefore result in erroneous training targets. A potential method for combating this problem is discussed in sections 4.3 and 9.3.)
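The search of equation 2.6 is a dynamic programming recursion. The following is a minimal sketch in Python of a Viterbi search, worked in the log domain to avoid numerical underflow; the three-state left-to-right model and random emission likelihoods are illustrative stand-ins, not the thesis models.

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """log_b: (T, N) per-frame log emission likelihoods log p(x_t | q_t = j);
    log_A: (N, N) log transition matrix; log_pi: (N,) log initial probabilities.
    Returns the most probable state path and its log probability."""
    T, N = log_b.shape
    delta = np.full((T, N), -np.inf)       # best partial path scores
    psi = np.zeros((T, N), dtype=int)      # back-pointers
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (N, N): from state i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):          # trace back the best path
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# Toy example: 3 states, left-to-right with self-loops, 6 frames.
rng = np.random.default_rng(0)
log_A = np.log(np.array([[0.5, 0.5, 0.0],
                         [0.0, 0.5, 0.5],
                         [0.0, 0.0, 1.0]]) + 1e-30)
log_pi = np.log(np.array([1.0, 1e-30, 1e-30]))
log_b = np.log(rng.uniform(0.1, 1.0, size=(6, 3)))
path, logp = viterbi(log_b, log_A, log_pi)
print(path, logp)   # e.g. [0, 0, 1, 1, 2, 2]: an explicit segmentation
```

The returned path is exactly the explicit segmentation described above: one state label per frame.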

Emission Probabilities

The state (or transition) observation distributions within the generative HMM framework are termed emission distributions as they estimate the probability that the state (or transition) emitted the observations. A discrete HMM arises if the random variable associated with the distributions is discrete; a continuous HMM has a continuous random variable.

Discrete HMMs are attractive due to their computational simplicity—emission probabilities can be accessed quickly, for example, through a process of table look-up. In order to apply a discrete HMM to ASR, however, the continuously varying properties of the speech signal must be represented using a discrete random variable. In order to achieve this, a vector quantisation (VQ) step is required [126], which replaces a continuously valued feature vector with its closest prototype (according to some distance metric). A codebook of a limited number of prototypes is typically created through the application of a clustering algorithm, such as the $k$-means algorithm, described in [18], to a representative data set. A useful side-effect of the VQ process is thus a reduction in the data rate. The appeal of the continuous HMMs does not stem from computational simplicity but rather from an improved match between the continuously valued observations of the speech signal and the representative random variable. A disadvantage of the continuous HMM is that a particular parametric form for the emission distribution must be explicitly assumed. Discrete HMMs do not encode any assumptions regarding the parametric form of the emission distribution, but the quantisation of the feature vectors introduces a degree of distortion, determined by the codebook size [146].

A popular strategy for ameliorating the problem of an assumed parametric form is to construct the emission distributions using a mixture of parametric distributions such as a mixture of Gaussians, e.g. [100].6 In principle, given enough components, a Gaussian mixture model is a universal function approximator and so may be used to model any smooth continuous function, to an arbitrary degree of accuracy [140]. In practice, however, as described in section 1.1.1, the number of parameters that may be employed by a model and hence the number of mixture components is limited by the amount of available training data. Too few components will yield a distribution with insufficient flexibility to model the complex, typically multi-modal, likelihood distributions observed for the speech

5These estimates of algorithmic complexity are upper bounds as they are based upon an ergodic HMM, as opposed to one with a left-to-right architecture, in which fewer state transitions are allowed.

6A mixture distribution attached to a single state was also shown in [100] to be equivalent to a set of parallel states, each with a single component distribution attached.


signal. Overfitting of the training set will result, however, if too many components are used, leading to poor generalisation. (Gaussian mixture models have interesting links to radial basis function networks [154].)

As a consequence of limited training data, a vast number of parameter saving permutations on the mixture model theme exist and many have been tried in practice. A commonly employed strategy is to equip Gaussian components with diagonal only covariance matrices (i.e. all components off the principal diagonal of the feature covariance matrix are set to zero), e.g. [9]. Implicit in the assignment of a diagonal covariance matrix to a single Gaussian distribution is the assumption that the feature vector elements are statistically independent, as described in [18], which will be false for spectrally derived features. This assumption is relaxed given a mixture of Gaussians, each component of which is equipped with a diagonal covariance matrix. Another parameter saving strategy is the use of tied mixture or semi-continuous HMMs [91, 14] which make use of a codebook of continuous distributions which are shared components for all emission distribution mixtures.

Although a great deal of effort and empirical research has gone into evaluating different parameter saving strategies, it is unclear how any particular strategy may be preferred a priori over any other. It is also debatable whether any of this fine-grain engineering effort will make any considerable impact on the performance of ASR systems in the long-term.
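To make the diagonal-covariance strategy concrete, the following is a minimal sketch of evaluating the log density of a frame under a diagonal-covariance Gaussian mixture emission distribution, using the log-sum-exp trick for numerical stability; all parameter values are illustrative.

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """x: (D,) frame; weights: (K,); means, variances: (K, D) diagonal."""
    # Per-component diagonal Gaussian log densities.
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)
    log_comp = log_norm - 0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    log_comp += np.log(weights)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())   # log-sum-exp over components

rng = np.random.default_rng(2)
K, D = 8, 13                                # 8 components, 13-dim features
weights = np.full(K, 1.0 / K)
means = rng.standard_normal((K, D))
variances = rng.uniform(0.5, 2.0, (K, D))   # diagonal covariance terms only
print(gmm_logpdf(rng.standard_normal(D), weights, means, variances))
```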

2.3.2 Maximum Likelihood Training Criterion

As a consequence of ignoring $p(X)$, the model parameters $\Theta$ may be optimised using the maximum likelihood (ML) criterion. Given a training set $\mathcal{T}$ of $N$ transcribed utterances:

$$\mathcal{T} = \left\{ \left(X^{(1)}, W^{(1)}\right), \ldots, \left(X^{(N)}, W^{(N)}\right) \right\} \qquad (2.7)$$

where $W^{(i)}$ is the reference transcription for the $i$th utterance and the goal is to optimise $\Theta$ such that:

$$\hat{\Theta} = \operatorname*{argmax}_{\Theta \in \mathcal{M}} \prod_{i=1}^{N} p\left(X^{(i)} \mid \Theta_{W^{(i)}}\right) \qquad (2.8)$$

where $\mathcal{M}$ is the space of all possible acoustic model parameter sets and $\Theta_{W^{(i)}}$ is the parameter set corresponding to the word sequence model $M_{W^{(i)}}$. The optimisation of a model's parameter set according to such a criterion is sometimes more conveniently expressed as the minimisation of an error function $E(\Theta)$, by taking the negative logarithm. For the ML criterion this becomes:

$$E_{\mathrm{ML}}(\Theta) = -\log \prod_{i=1}^{N} p\left(X^{(i)} \mid \Theta_{W^{(i)}}\right) \qquad (2.9)$$

$$= -\sum_{i=1}^{N} \log p\left(X^{(i)} \mid \Theta_{W^{(i)}}\right) \qquad (2.10)$$

It is important to distinguish between the optimisation of a model's parameter set according to the ML criterion and the ML approach to parameter estimation in general. The goal of the latter is to find the single best set of parameter values for a model through the process of minimising an error function, where the error defining criterion needn't be limited to the ML criterion (several other optimisation criteria are described in section 2.4.2). The ML approach to parameter estimation may be contrasted with (1) the maximum a posteriori (MAP) approach; and (2) the Bayesian approach. The MAP approach, as it is formulated in [72], makes use of a prior distribution over $\Theta$. This is combined with the likelihood $p(X \mid \Theta)$, using Bayes' theorem, to obtain $P(\Theta \mid X)$. An implicit assumption in the ML approach to parameter estimation is that sufficient training data is available so as to obtain reliable parameter value estimates. The use of a prior distribution over $\Theta$ is useful for smoothing and adaptation when training data is sparse. The Bayesian approach to parameter estimation also makes use of a prior distribution over $\Theta$. A single set of parameter values is not considered when adopting this approach,


however, and the prior is updated to form a posterior distribution over $\Theta$ following the observation of some data, again using Bayes' theorem. A probability distribution over the parameters of a model gives rise to a probability distribution over the outputs of the model given a particular input. In this case any predictions made by a model may also be assigned error bars, centered upon the most probable prediction. Although obtaining error bars on the probability estimates output by an ASR system is desirable, the Bayesian approach to parameter estimation suffers from considerable problems regarding computational feasibility. Only the ML approach to parameter estimation is considered in the remainder of this text.

Since the optimisation of the parameters of an HMM according to any error criterion requires the modification of state-specific parameter values and the actual state sequence associated with any particular observation sequence is in general unknown, training an HMM becomes a non-linear optimisation problem (the observations are in general a non-linear function of the state) with missing data.7 The EM (expectation-maximisation) algorithm [45] is well suited to this type of optimisation problem. Starting from some initial set of parameter values, the EM algorithm iterates (an iterative procedure is required as the optimisation lacks an analytical solution) over two steps:

E-step Given the current parameter values, the expectation of the joint probability distribution over the observed and missing data is computed.

M-step A set of new parameter values is obtained by maximising this expectation function.

The algorithm has a number of desirable properties:

At each iteration, it guarantees a monotonic increase in the likelihood of the data given the model parameters.

This monotonic increase in likelihood guarantees convergence to a local minimum on the error surface. This guarantee can be contrasted with gradient-based optimisation methods, for example, which will not converge if an overly large step-size is chosen.

The algorithm is relatively simple and does not require the specification of an arbitrary step-size, the calculation of local gradient information or the computation of a Hessian matrix, as are required by other non-linear optimisation algorithms.

Baum-Welch re-estimation [13, 12] is an efficient EM algorithm for generative HMMs based upon the forward-backward algorithm.

A disadvantage of the ML training criterion is that it is not discriminative: It can be seen from equation 2.8 that the parameter optimisation for the word sequence model $M_{W^{(i)}}$ is carried out in the space $\Theta_{W^{(i)}}$ and so the overall optimisation is carried out independently for each component of the continued product given in equation 2.8. A consequence of this is that the HMM is forced to model the within-class data distributions. In theory, if these distributions were modelled accurately, the optimal Bayes' classifier would result. Given the simplifying assumptions introduced above, for reasons of computational expediency and limited training data, it is highly unlikely that anything close to accurate within-class data distribution models can be obtained, however. This raises the question of whether a model with known deficiencies should be trained as a generator when the desire is to use it for recognition. The consequences of ML training in terms of classification error and parameter requirements are discussed with regard to discriminative training criteria in section 2.4.2.
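For illustration, the following is a minimal sketch of the forward recursion underlying Baum-Welch re-estimation, using the standard per-frame scaling in place of log arithmetic; each frame costs $\mathcal{O}(N^2)$ operations, in line with the complexity quoted earlier. The toy model is not taken from the thesis.

```python
import numpy as np

def forward(b, A, pi):
    """b: (T, N) emission likelihoods; A: (N, N) transitions; pi: (N,) initial.
    Returns scaled forward variables and the total log likelihood."""
    T, N = b.shape
    alpha = np.zeros((T, N))
    log_like = 0.0
    alpha[0] = pi * b[0]
    for t in range(T):
        if t > 0:
            alpha[t] = (alpha[t - 1] @ A) * b[t]   # O(N^2) per frame
        c = alpha[t].sum()                         # scaling constant
        alpha[t] /= c
        log_like += np.log(c)
    return alpha, log_like

rng = np.random.default_rng(3)
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
b = rng.uniform(0.1, 1.0, (10, 3))     # 10 frames, 3 states
alpha, ll = forward(b, A, pi)
print(ll)   # log p(X | M), summed over all paths
```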

2.3.3 Decoding

In terms of decoding a speech signal, Bayes' decision rule (equation 1.1) states that the word sequence model which maximises $P(W \mid X)$ will minimise the probability of misclassification. Given separate language and acoustic models and the independence of $p(X)$ from $W$, this may be simplified to:

7In addition to the state sequence, the component from a mixture-based observation distribution responsible for any given observation will also be unknown.


$$\hat{W} = \operatorname*{argmax}_{W \in \mathcal{W}} p(X \mid W)\,P(W) \qquad (2.11)$$

Despite this simplification, the computation required to evaluate the probability of all possible word sequence models in the space $\mathcal{W}$ is immense for large vocabulary continuous speech recognition (LVCSR) tasks. Even given the additional computational savings offered by the Viterbi approximation, an exhaustive search is still computationally intractable. A practical decoding algorithm must therefore use a principled method for narrowing the search and evaluating only a subset of $\mathcal{W}$. Solutions to this problem are described for Viterbi and stack based decoding algorithms below. (The descriptions are influenced to a large degree by those given in [153].)

Viterbi Decoding

In its exhaustive form, the Viterbi decoding algorithm [59] is guaranteed to find $\hat{W}$. The algorithm is also relatively efficient as its recursive nature avoids repeated calculations. For LVCSR tasks, however, an exhaustive search is not practical and some pruning of the search space is required. One solution to this problem is to implement a search beam, e.g. [135], of width $\delta$ relative to the logarithm of the most probable partial decoding hypothesis at time $t$, $\hat{p}_t$. Partial decoding hypotheses with log likelihoods falling below $\log \hat{p}_t - \delta$ are discarded. The logarithm ensures partial decoding hypotheses with likelihoods below some fraction of $\hat{p}_t$ are discarded as opposed to those falling outside a fixed difference relative to $\hat{p}_t$.
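A minimal sketch of this pruning rule, with illustrative scores; hypotheses more than a fixed log distance below the current best, i.e. below a fixed fraction of the best likelihood, are discarded:

```python
import numpy as np

beam = 8.0                                   # beam width delta, in log units
log_scores = np.array([-10.2, -12.7, -25.9, -14.1, -30.3])
best = log_scores.max()
survivors = np.where(log_scores >= best - beam)[0]
print(survivors)   # indices of hypotheses kept active at this frame
```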

Stack Decoding

Stack based decoding [98] is an example of the A* algorithm [136]. Using this approach, a partial decoding hypothesis is ranked according to the sum of its likelihood and some heuristic estimate of the cost of completing the remainder of the decoding path and a stack is used to store the ranked hypotheses. The highest ranked partial decoding hypothesis is (1) removed from the top of the stack; (2) extended; (3) re-evaluated; and (4) replaced at the appropriate level in the stack, at each iteration. Providing the path completion heuristic never over-estimates, the search will be admissible. In this case, as for an exhaustive Viterbi search, $\hat{W}$ is guaranteed to be found. Unlike Viterbi decoding, the search is time-asynchronous and so partial paths of different lengths are stored in the stack. Useful additional computational savings can be obtained by fixing the capacity of the stack and discarding any partial decoding hypotheses which 'fall out of the bottom'. A crucial factor for stack based search is the quality of the heuristic. Whilst a good heuristic can lead to a highly efficient search, a poor heuristic will widen the search space and increase the risk of repeated calculations.
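The following is a minimal sketch of this best-first loop using a priority queue; `extend`, `heuristic` and `is_complete` are hypothetical stand-ins for a real decoder's hypothesis extension, path completion estimate and end-of-utterance test, and the capacity truncation stands in for hypotheses 'falling out of the bottom'.

```python
import heapq
import itertools

def stack_decode(initial, extend, heuristic, is_complete, capacity=1000):
    counter = itertools.count()              # tie-breaker for equal ranks
    stack = [(-heuristic(initial), next(counter), initial)]
    while stack:
        _, _, hyp = heapq.heappop(stack)     # (1) remove best-ranked hypothesis
        if is_complete(hyp):
            return hyp                       # admissible heuristic => optimal
        for log_like, new_hyp in extend(hyp):            # (2) extend
            rank = -(log_like + heuristic(new_hyp))      # (3) re-evaluate
            heapq.heappush(stack, (rank, next(counter), new_hyp))  # (4) replace
        if len(stack) > capacity:            # fixed capacity: keep only the
            stack = heapq.nsmallest(capacity, stack)     # best-ranked paths
    return None
```

Ranks are negated because Python's heapq is a min-heap; the counter merely breaks ties between equally ranked hypotheses.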

N-best Decodings

A record of the N-best decoding paths, either in the form of a list or a lattice (ordered by decoding likelihoods and timings), has increasingly been found to be useful—the use of N-best decoding statistics as a confidence measure is described in chapters 4 and 5. Although stack based decoders are naturally suited to producing N-best statistics, the Viterbi algorithm can also be modified to produce N-best decodings with the addition of some computational expense [139].

2.4 Acceptor HMMs

2.4.1 Probability Estimation

The local posterior probability formulation of the acceptor HMM (another formulation will be described in section 2.4.2) arises from the observation that the posterior probability of a word sequence $W$ given the acoustics $X$ can be marginalised over all possible paths through the model and factorised as follows:


$$P(W \mid X) = \sum_{Q \in \mathcal{Q}_M} P(Q \mid X)\, P(W \mid Q, X) \qquad (2.12)$$

The second term on the RHS of equation 2.12 may be estimated using a language model and a set of pronunciation models by dropping the dependence upon the acoustics, since $W$ is conditionally independent of $X$ given the state sequence $Q$. The first term is left to be estimated by an acoustic model. Since the posterior probability of $W$ given $X$ is not simplified in 2.12, the model as a whole may be interpreted as estimating the probability of accepting the acoustics.

Connectionist Acoustic Models

The path probability for the acoustic model's contribution to equation 2.12 can be further factorised using familiar Markov assumptions. Equations 2.13 and 2.14 arise from first and zeroth order assumptions respectively:

$$P(Q \mid X) \approx \prod_{t=1}^{T} P(q_t \mid q_{t-1}, x_t) \qquad (2.13)$$

$$P(Q \mid X) \approx \prod_{t=1}^{T} P(q_t \mid x_t) \qquad (2.14)$$

Theory demonstrating that artificial neural networks (ANNs) can be trained to estimate class posterior probabilities given some input [157] is well established and described in e.g. [18]. Several different approximations to the state posterior probabilities contained in equations 2.13 and 2.14 can be made using various ANN architectures:

$P(q_t \mid x_t)$ can be estimated by an MLP supplied with a single frame of acoustics as input [26].

The acoustic observation independence assumption can be relaxed by supplying an MLP with several frames of acoustic context. For example if $c$ frames of context on either side of the current observation vector are supplied as input, the probability $P(q_t \mid x_{t-c}, \ldots, x_{t+c})$ can be estimated [26]. As MLPs model non-linear functions of many variables as a superposition of non-linear functions of a single variable ('hidden functions') as described in [18], the number of parameters need only grow linearly with the dimensionality of the input space. The MLP model formulation is thus particularly well suited to the incorporation of acoustic context at the input layer.

A similar relaxation of the acoustic independence assumption may be made through the use of a recurrent neural network (RNN) [167]. Since an RNN implements some form of memory for its previous inputs, only a single frame of acoustics needs to be input for the estimation of $P(q_t \mid x_1, \ldots, x_t)$. By making use of recurrent links, RNNs offer a method for incorporating acoustic context which is potentially more efficient than that offered by MLPs.

By supplying previous state information as well as acoustics at the input, conditional transition probabilities can be estimated, enabling the Mealy formulation of an acceptor HMM. For example, an MLP supplied with $2c+1$ frames of acoustic context and the identity of the previous state can be used to estimate $P(q_t \mid q_{t-1}, x_{t-c}, \ldots, x_{t+c})$ [26]. By comparing $P(q_t = k \mid q_{t-1} = l, x_{t-c}, \ldots, x_{t+c})$ and $P(q_t = k \mid q_{t-1} = k, x_{t-c}, \ldots, x_{t+c})$, i.e. the probability of observing $x_t$ during the transit between states $l$ and $k$ as opposed to the probability of observing $x_t$ whilst resident in state $k$, it can be seen that the estimation of conditional transition probabilities focuses modelling resources upon transitions between speech sounds as opposed to the steady state portions of the speech signal.


The probability estimates listed above are termed local posterior probability estimates [26] as the conditioning is only upon a local portion of acoustics. The above list is far from exhaustive and many different ANN architectures with various advantages and disadvantages, including radial basis function networks as described in [154] for example, could be trained to estimate the required probabilities.

The different forms of local posterior probabilities listed above highlight the different parameter investment strategies that may be adopted when designing an acoustic model. Three different strategies for accommodating the effect of context upon the realisation of a particular speech sound are (1) to adopt context-dependent basic modelling units and to accommodate them by expanding the number of states in the model; (2) to relax the acoustic observation independence assumption by constructing observation distributions over several frames' worth of observations; and (3) to condition the observation distributions upon states previous to the current, in addition to the acoustic observations. Although all three strategies require an increase in the number of parameters and strategy (1) requires an expansion in the number of states contained in the acoustic model for both generative and acceptor HMMs alike, acceptor HMMs are well placed to accommodate strategies (2) and (3) due to the efficient manner in which ANNs, such as MLPs, accommodate an increase in input dimensionality.

The Priors Component

A further factorisation, again through the use of Bayes' theorem (equation 1.2), may be applied to the 2nd term on the RHS of equation 2.12 (assuming that $W$ is conditionally independent of $X$, given the path $Q$):

$$P(W \mid Q, X) = P(W \mid Q) = \frac{P(Q \mid W)\,P(W)}{P(Q)} \qquad (2.15)$$

As usual, the path probability calculations may be simplified using Markov assumptions. Equations 2.16 and 2.17 arise due to first and zeroth order assumptions respectively:

$$P(W \mid Q) = P(W) \prod_{t=1}^{T} \left[ \frac{P(q_t \mid q_{t-1}, W)}{P(q_t \mid q_{t-1})} \right] \qquad (2.16)$$

$$P(W \mid Q) = P(W) \prod_{t=1}^{T} \left[ \frac{P(q_t \mid W)}{P(q_t)} \right] \qquad (2.17)$$

Several prior probability estimates are provided by this term. In equation 2.16 the numerator and the denominator enclosed in square brackets provide the prior probabilities for the transition from $q_{t-1}$ to $q_t$ conditioned upon the word sequence model $M$ and the space of all possible word sequence models $\mathcal{M}$, respectively. Equation 2.17 provides similar priors for the state residency $q_t$. $P(W)$ is the prior probability of the word sequence and may be provided by a standard language model.

Merging the Two Components

For an RNN-based acoustic model without previous state information, a 1st order Markov language model component and the Viterbi approximation, the result of combining the equations for the acoustic model and the priors term is:

$$P(W \mid X) \approx \max_{Q \in \mathcal{Q}_M} P(W) \prod_{t=1}^{T} \frac{P(q_t \mid x_1, \ldots, x_t)}{P(q_t)}\, P(q_t \mid q_{t-1}, W) \qquad (2.18)$$


The simplification of the denominator in equation 2.18 highlights an important notion termed the conflict of priors [26]: A consequence of training a class posterior probability estimator on the acoustic training data is that, from Bayes' theorem (equation 1.2), the prior class probabilities will be implicitly estimated from that data. It is clear, however, that equations 2.16 and 2.17 also provide two estimates of the class priors, where these are not derived from the acoustic data. The two sets of priors will thus differ and conflict. Fortunately, the theory prescribes that the acoustic class posterior probabilities should be normalised by one of the second set of priors, to produce scaled likelihoods:

$$\frac{P(q_t \mid x_1, \ldots, x_t)}{P(q_t)} = \frac{p(x_1, \ldots, x_t \mid q_t)}{p(x_1, \ldots, x_t)} \qquad (2.19)$$

which may be combined with the remaining terms without conflict.

Figure 2.1 schematically illustrates generative and acceptor HMMs, contrasting the form of their observation distributions and the interpretation that may be given to the flow of observations in each case.

Figure 2.1: Schematic illustrations of a three state, left-to-right, 1st order, Moore-formulation generative HMM (Left) and acceptor HMM (Right), including self-transitions. The large arrows schematically represent the generation and the acceptance of the observations respectively.
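As an illustration of equation 2.19 in use, the following minimal sketch converts a matrix of ANN phone posteriors into log scaled likelihoods by dividing through by the class priors; here the priors are relative frequencies computed from hypothetical training-set state occupancy counts, and the posterior matrix is synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
posteriors = rng.dirichlet(np.ones(40), size=100)   # (T=100, K=40 phone classes)
state_counts = rng.integers(1, 1000, size=40)       # training-set occupancies
priors = state_counts / state_counts.sum()          # P(q_k) from training data

scaled_likelihoods = posteriors / priors            # proportional to p(x|q)/p(x)
log_scaled = np.log(scaled_likelihoods)             # form used by the decoder
print(log_scaled.shape)
```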

2.4.2 Discriminative Training

In contrast to the ML training criterion, the goal of a discriminative training procedure is not only to increase the likelihood of the correct model but also to discriminate against all incorrect models by reducing their respective likelihoods. This criterion does not force an HMM to attempt to model the within-class data distributions, but rather to minimise the classification error rate. An example of a discriminative training criterion is the conditional maximum likelihood (CML) criterion [133] (sometimes referred to as the maximum a posteriori (MAP) criterion, e.g. in [26]). This criterion pushes the HMM toward modelling the between-class posterior probability distributions and so, from Bayes' decision rule (equation 1.1), to (indirectly) minimise the classification error rate. In order to apply the CML criterion in a global sense, i.e. at the utterance level, the following expression must be optimised with respect to the acoustic model parameters $\Theta$:

$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} \prod_{i=1}^{N} P\left(W^{(i)} \mid X^{(i)}, \Theta\right) = \operatorname*{argmax}_{\Theta} \prod_{i=1}^{N} \frac{p\left(X^{(i)} \mid \Theta_{W^{(i)}}\right) P\left(W^{(i)}\right)}{\sum_{W \in \mathcal{W}} p\left(X^{(i)} \mid \Theta_{W}\right) P(W)} \qquad (2.20)$$

where discrimination is ensured by the sum-to-one constraint on posterior probabilities:

$$\sum_{W \in \mathcal{W}} P(W \mid X, \Theta) = 1 \qquad (2.21)$$


As the computational expense of the summation over the space of all possible word sequence models $\mathcal{W}$ required for an exact estimation of the denominator of equation 2.20 is clearly intractable for LVCSR tasks, the global application of the CML criterion is problematic. Due to the factorisation of $P(W \mid X)$ adopted in the local posterior probability formulation of the acceptor HMM described in section 2.4.1, the acoustic model can be trained according to a discriminative criterion (typically the sum-of-squared-errors or cross-entropy criteria, described in [18]) in a local fashion, incurring far less computational expense. After training, the outputs of the acoustic model are estimates of local posterior probabilities and the HMM as a whole estimates global word sequence posterior probabilities.

A formulation of the acceptor HMM which is trained according to a global optimisation of the CML criterion is the hidden neural network (HNN) [116, 161, 159, 160]. In this formulation state specific neural networks are used to replace the transition and emission probability distributions of a generative HMM:

A match network is used to estimate a score for the match between the observed acoustics and the parent state.

A transition network is used to estimate a score for the transit from the parent state to any other states allowed by the HMM architecture conditioned upon the acoustic observations.

The term score is used to highlight the fact that the outputs of the state specific networks are not probabilities in general as the normalisation at the word sequence level ensures a posterior probabilistic interpretation for the output of the overall model. The optimisation of the HNN is carried out according to:

$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} \prod_{i=1}^{N} \frac{S\left(X^{(i)}, W^{(i)} \mid \Theta\right)}{S\left(X^{(i)} \mid \Theta\right)} \qquad (2.22)$$

where $S$ is used to denote a score rather than a probability. Whilst the numerator of equation 2.22 is relatively inexpensive to compute, some easing of the computational burden imposed by the denominator is required for LVCSR tasks. One method for ensuring a computationally feasible optimisation process is to run the HNN in two modes:

A clamped phase is used to estimate the numerator, $S(X^{(i)}, W^{(i)} \mid \Theta)$, through a forced alignment of the reference transcription to the training acoustics.

A free-running phase is used to provide a relatively simple estimate of the probability of the acoustics conditioned upon the overall model $M$.

The potential inclusion of a label network at each state, estimating a probability distribution over sound classes, highlights similarities between the HNN and the input-output HMM [15].

Another discriminative training criterion is the maximum mutual information (MMI) criterion [8], well described in [138]. The objective behind this criterion is to maximise the mutual information between a word sequence model and an acoustic observation sequence associated with it:

$$I(W; X) = \log \frac{P(W, X)}{P(W)\,P(X)}$$
$$= \log \frac{p(X \mid W)\,P(W)}{P(W)\,P(X)}$$
$$= \log \frac{p(X \mid W)}{\sum_{W' \in \mathcal{W}} p(X \mid W')\,P(W')} \qquad (2.23)$$


where the third line of equation 2.23 is obtained by dividing both the numerator and the denominator by $P(W)$ and expanding $P(X)$ as a sum over all word sequence models.

As with the CML criterion, the global calculation of the denominator is computationally infeasible and some approximations must be made for it to be implemented. The MMI criterion has been implemented for an LVCSR task through the use of N-best approximations to the summation required to compute $P(X)$ by Valtchev et al. [198].

Other than optimising the model parameters in the best way with regard to recognition, an additional advantage of discriminative training is that the resulting model will be efficient in terms of parameter usage. This perspective is highlighted by the relative complexities of the schematic illustrations of within- (likelihood) and between-class (posterior) distributions given in figure 2.2, with implications for the number of parameters required to model them. Evidence that acceptor HMMs are more efficient in terms of parameter usage than generative HMMs is given in [26, 155, 216, 161].

Figure 2.2: Schematic illustrations of within-class (likelihood) (Left) and between-class (posterior) (Right) distributions for two classes, after [161].

A disadvantage of adopting a discriminative training criterion, however, is that the desirable properties of Baum-Welch re-estimation must be traded for the complexities of another non-linear optimisation procedure, such as gradient descent.

Soft-Targets

It has been shown [157], also described in [18], that ANNs can be trained to estimate class posterior probabilities using either 1-from-K target vectors,8 or target vectors of real valued class posterior probabilities. The two schemes may be referred to as using hard- and soft-targets respectively. Since inertia prohibits instantaneous transitions between 'states' of the vocal apparatus, where a 'state' may represent some idealised articulator configuration, it seems appropriate to train an acoustic model using soft targets, representing soft distinctions between states of the vocal apparatus. An additional motivation for this strategy can be drawn from the fact that many LVCSR systems approximate diphthongs, using one or more steady state vowel categories.

Acceptor HMMs have historically been trained in an embedded Viterbi fashion [23, 64]: The first stage is to perform a forced Viterbi alignment of the reference transcription to the training acoustics using a partially trained acoustic model. This process makes an explicit segmentation of the training data and so provides a state label for each frame. 1-from-K target vectors may be derived from this segmentation and supplied to the ANN training algorithm. The alignment and training steps are iterated until no further increase in performance on a cross validation set is seen. (Periodic testing on a held-out cross validation set is a common practice which prevents over-fitting to the training data and so assists the model to generalise to previously unseen data.) Embedded Viterbi training is an example of a bootstrap process which can be likened to a generalised EM (GEM) procedure. The E-step amounts to the estimation of training targets via a forced Viterbi alignment with the existing acoustic model and the M-step becomes the retraining of the acoustic model in an attempt to maximise the (posterior) probability of the targets. This procedure falls into the GEM category as the M-step is not a guaranteed maximisation (non-linear optimisation techniques for ANNs, such as gradient descent,

8A 1-from-K target vector contains a 1 for the target class and 0s for all others, e.g. (0, 0, 1, 0, 0, 0), for an ANN with six output units.


are not guaranteed to converge to a local maximum). By using the forward-backward algorithm within the acceptor HMM framework, the posterior probability of state occupancy at a given point in time can be calculated. These values can then be used in a soft-target version of the GEM algorithm termed REMAP (Recursive estimation and maximisation of a posteriori probabilities) [25, 112]. It has been shown [25, 114, 113, 112, 83] that the REMAP procedure yields an increase in performance on small tasks over the embedded Viterbi regime. Evidence has also been provided [112] which suggests that the use of soft-targets is essential for estimation of conditional transition probabilities of the form $P(q_t \mid q_{t-1}, x_t)$ as there will otherwise be a mismatch between the training and the testing conditions of the acoustic model.
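The following minimal sketch contrasts the two target schemes: hard 1-from-K vectors read off a Viterbi segmentation against soft per-frame state occupancy probabilities of the kind the forward-backward algorithm supplies to REMAP. All values are illustrative.

```python
import numpy as np

K = 4
viterbi_labels = np.array([0, 0, 1, 1, 2, 3])   # one state label per frame

hard_targets = np.eye(K)[viterbi_labels]        # 1-from-K rows (hard targets)

# Hypothetical forward-backward occupancies gamma_t(k): each row sums to
# one and allows graded class membership around state boundaries.
soft_targets = np.array([[0.9, 0.1, 0.0, 0.0],
                         [0.6, 0.4, 0.0, 0.0],
                         [0.1, 0.8, 0.1, 0.0],
                         [0.0, 0.6, 0.4, 0.0],
                         [0.0, 0.1, 0.7, 0.2],
                         [0.0, 0.0, 0.3, 0.7]])

print(hard_targets[2], soft_targets[2])         # frame 3: crisp vs graded
```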

2.4.3 Decoding

Decoding for acceptor HMMs is essentially much like that for generative HMMs as the local posterior probabilities estimated by the acoustic model are converted to scaled likelihoods prior to combination with the language model. A distinct advantage which is provided through the availability of local posterior probability estimates, however, is phone deactivation pruning [152]. This process prunes (deactivates) cells from the $K \times T$ matrix of local phone posterior probabilities, within which potential decoding paths reside, if the posterior probability of the phone class falls below a threshold $\tau$ at frame $t$. For an LVCSR task, a phone deactivation level of 64% was found to provide a factor of six speed-up in decoding at the cost of only a 1% relative increase in search error [153]. Phone deactivation pruning can be thought of as an application of an acoustic confidence measure to the decoding task.
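A minimal sketch of phone deactivation pruning on an illustrative posterior matrix; the threshold value here is arbitrary and is not the operating point reported in [153].

```python
import numpy as np

rng = np.random.default_rng(5)
K, T = 40, 200
# Local phone posterior matrix: one column per frame, columns sum to one.
posteriors = rng.dirichlet(np.ones(K), size=T).T    # shape (K, T)

tau = 0.005
active = posteriors >= tau          # boolean mask: cells decoding may enter
print(f"{100 * (1 - active.mean()):.1f}% of cells deactivated")
```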

2.5 n-gram Language Models

Language models are used to exploit regularities that occur in the language to be recognised, and hence to constrain the search over the space of possible word sequence hypotheses. In principle a great many knowledge sources such as the syntactic, semantic and even pragmatic constraints of the recognition task may be incorporated into the language model. In practice, however, the use of such high-level constraints is limited to relatively constrained tasks and relatively simplistic stochastic models have proved hard to better for LVCSR. By far the most popular approach to large vocabulary language modelling has been the use of so-called n-gram models [11]. (The following account owes a great deal to the description given in [74].)

Using the product rule, the prior probability of a word sequence $W$, of length $L$, may be calculated as:

$$P(W) = P(w_1) \prod_{i=2}^{L} P(w_i \mid w_1, \ldots, w_{i-1}) \qquad (2.24)$$

It is computationally infeasible to reliably estimate $P(w_i \mid w_1, \ldots, w_{i-1})$ for long sequences, however, and so an approximation must be sought. n-gram models take the form of a Markov chain and estimate the probability of a word conditioned upon a limited history $h$ of the previous words. If $h$ is set to just the previous word then the n-gram is known as a bigram and estimates $P(w_i \mid w_{i-1})$. If $h$ is set to the previous two words, a trigram estimating $P(w_i \mid w_{i-2}, w_{i-1})$ results, and so on. The estimation of $P(W)$ using a trigram model thus becomes:

$$P(W) \approx P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^{L} P(w_i \mid w_{i-2}, w_{i-1}) \qquad (2.25)$$

Typically, the distribution $P(w_i \mid w_{i-2}, w_{i-1})$ is modelled as a histogram, which may be estimated by counting word occurrences in some text corpus. Although it is clear that more powerful constraints arise from


higher order n-grams, reliable estimates for probabilities conditioned on longer histories become harder to obtain due to data sparsity problems in data sets of limited size. Some form of smoothing, such as the use of backoff or deleted interpolation strategies, is therefore commonly used, as described in e.g. [74]. Up until the recent past, the computational expense of estimating anything more than a trigram has been prohibitive for large vocabularies. Currently, however, computational resources are such that the limit may be relaxed to higher order n-grams, such as quadgrams, e.g. [38], and training corpora containing billions of words of text9 may be exploited.

A clear weakness of a standard monolithic n-gram, such as that described above, is the inability to capture long-term dependencies which would be accommodated by more sophisticated models of syntax or semantics. For example, the word bicycle in the phrase he was pedalling a now old and rather dilapidated bicycle, may be predicted much more strongly from the word pedalling than from dilapidated. Despite their absurd simplicity as a grammar, n-grams (1) overcome the problems of 'brittleness' in the face of exceptions encountered by hand-crafted deterministic grammars; and (2) may be automatically derived from large corpora. These two properties have made n-grams frustratingly difficult to better [97]. Some recent developments in the field of language modelling are briefly reviewed in section 9.6.
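As an illustration of estimation by counting, the following minimal sketch computes unsmoothed maximum likelihood trigram probabilities from a toy corpus; a real language model would add the backoff or interpolation smoothing described above.

```python
from collections import Counter

corpus = ("he was pedalling a now old and rather dilapidated bicycle "
          "he was riding a bicycle").split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w1, w2, w3):
    """Unsmoothed MLE estimate of P(w3 | w1, w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

print(p_trigram("he", "was", "pedalling"))   # 0.5 on this toy corpus
```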

2.6 The ABBOT/SPRACH System

The experiments described in this dissertation were conducted using the ABBOT LVCSR system [169] and its recent incarnation as the SPRACH10 entry to the 1998 DARPA Broadcast News evaluation [38].

2.6.1 Feature Extraction

Several feature extraction processes have been investigated as the ABBOT system has evolved. The original version of ABBOT used a 20 channel mel-scaled filter bank front-end with additional voicing features (MEL+) [168, 87]. Previous experiments [39] have revealed, however, that 12th order perceptual linear prediction (PLP) [84] cepstral coefficients (plus energy) provide better recognition performance and are more robust to changes in microphone. Recent versions of ABBOT therefore employ PLP features computed from a 16 kHz sampled acoustic signal using 32 ms analysis frames and a frame-rate of 16 ms. For the recent SPRACH system, modulation-filtered spectrogram (MSG) features [109, 108] have additionally been employed, computed using the same window size and frame-rate.

PLP processing includes three steps, also described in [74], which reflect properties of human hearing. Following the application of the fast Fourier transform (FFT) to each frame of the utterance, the power spectrum is (1) filtered to produce approximately critical-band frequency resolution (higher frequencies have reduced resolution); (2) weighted according to the "equal-loudness curve" (high frequencies are perceived to be louder); and (3) compressed using a cube root to emulate the "power law of hearing". An autoregressive model is then used to approximate the perceptually warped spectra and the resulting linear prediction coefficients are converted to cepstral coefficients.

MSG features differ from those obtained through PLP processing in that they are designed to incorporate longer time aspects of the speech signal and in so doing to exploit the relatively stable temporal encoding of phonetic information. In common with PLP, the features follow an approximately critical-band analysis, but differ in that they focus upon slow (0-8 Hz) amplitude modulations in each channel. Also included in the MSG feature extraction procedure are automatic gain control and spectral peak enhancement mechanisms. Basing the features on channel amplitude modulations provides robustness to stationary forms of spectral distortion, to certain noise sources and to the effects of reverberation, whereas peak enhancement (achieved by thresholding at a level 30 dB below that of the global peak) emphasises those aspects of the signal that rise above the noise floor. Although MSG features lack the spectro-temporal detail of those derived via PLP and so provide reduced performance for the

9For context, a billion words is more than the average reader will see in a lifetime.

10ESPRIT long term research project 20077: Speech Recognition Algorithms for Connectionist Hybrids (SPRACH).


recognition of clean speech, they are more robust to degradations of the speech signal, such as reverberation [109]. MSG and PLP features thus provide complementary representations of the speech signal which are exploited by the ABBOT system.
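As an illustration of the analysis framing shared by the two front-ends, the following minimal sketch cuts a 16 kHz waveform into 32 ms windows at a 16 ms frame-rate; the synthetic waveform and the Hamming taper are illustrative, and the subsequent PLP or MSG processing is omitted.

```python
import numpy as np

sample_rate = 16000
win = int(0.032 * sample_rate)          # 512 samples per analysis frame
hop = int(0.016 * sample_rate)          # 256 samples between frame starts

waveform = np.random.default_rng(6).standard_normal(sample_rate)  # 1 s 'audio'
n_frames = 1 + (len(waveform) - win) // hop
frames = np.stack([waveform[i * hop : i * hop + win] for i in range(n_frames)])
frames *= np.hamming(win)               # taper each frame before the FFT stage

print(frames.shape)                     # (61, 512) for one second of signal
```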

2.6.2 Acoustic Model

The ABBOT acoustic model has traditionally been based upon a set of RNN phone classifiers [167]. The architecture of the RNNs used in the ABBOT system is schematically illustrated in the left panel of figure 2.3, where the acoustic observation vector $x_t$ is mapped to an output vector $y_t$ and a state vector $s_{t+1}$. The state vector is then returned to the input, via a layer of state units, to be combined with the acoustic observations for the estimation of $y_{t+1}$ and so on. Successive state vectors thus act as a form of memory and allow the network to exploit previous acoustic context in a computationally efficient manner. A consequence of the recurrent links is that the outputs of the network at time $t$ are dependent upon the outputs at time $t-1$ and so on. A parameter optimisation process for the network must therefore take these dependencies into account. An appropriate training scheme is the so called back propagation through time algorithm [179, 206], which essentially unfolds the network connections over time to create a feedforward structure akin to an MLP with as many layers as there are time steps in the training sequence (weights are tied across points in time), which can be optimised using gradient descent methods.

ABBOT adheres to a Moore formulation and the RNNs are trained, traditionally using an embedded Viterbi scheme, to estimate state local posterior probabilities for context-independent (CI) phone classes of the form $P(q_t \mid x_1, \ldots, x_t)$. Due to the time-asynchronous nature of the RNNs, it has been found to be beneficial to merge the outputs of a network trained forwards in time with those from a network trained backwards in time, by averaging them at the frame-level in the log domain [86, 169].
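A minimal sketch of this frame-level log-domain merging, i.e. a renormalised geometric mean of the forward and backward networks' posterior estimates; the posterior matrices here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(7)
fwd = rng.dirichlet(np.ones(40), size=100)    # (T, K) forward RNN posteriors
bwd = rng.dirichlet(np.ones(40), size=100)    # (T, K) backward RNN posteriors

merged = np.exp(0.5 * (np.log(fwd) + np.log(bwd)))   # average in log domain
merged /= merged.sum(axis=1, keepdims=True)          # restore sum-to-one
print(merged.shape)
```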


Figure 2.3: Schematic illustrations of two ANNs. An RNN (Left) and an MLP supplied with $c$ frames of acoustic context on either side of the current frame (Right).

Recent improvements in generative HMM performance have been found through the use of CD phone models, where the context classes are typically derived through a process of decision tree clustering. The structure of a regular feedforward ANN makes the naive estimation of CD phone class posterior probabilities computationally prohibitive, due to the large number of additional model parameters which would be required [26]. To circumvent this problem, a hierarchical approach can be adopted, where a CD phone class posterior probability is factored into simpler components which can be estimated using several ANNs. The particular factorisation adopted in the ABBOT system is [105, 106]:

$$P(c_j, q_k \mid x_t) = P(q_k \mid x_t)\, P(c_j \mid q_k, x_t) \qquad (2.26)$$


where the joint probability of the CD phone class $c_j$ and the CI phone class $q_k$ is factored into two components: The first term on the RHS of equation 2.26 may be estimated using an existing CI phone class posterior probability estimator and the second term may be estimated by a 'context expert' unit for the CI phone class. Each context expert unit is trained to discriminate between a set of CD phones given the CI phone class and the acoustic observations. In this case the merging of outputs from CD RNNs trained both forwards and backwards in time has also been found to be useful. A similar approach to the problem is described by Fritsch [65], where a more extensive hierarchy of ANNs is obtained through an agglomerative clustering algorithm based on information divergence (ACID). In this case, the joint probability of a CD phone class and an empirically derived subset of the CD phone set (representing a more general class), given the acoustics, is factored in a similar fashion to equation 2.26:

$$P(c_j, S_i \mid x_t) = P(S_i \mid x_t)\, P(c_j \mid S_i, x_t) \qquad (2.27)$$

where $S_i$ denotes the subset and the 2nd term on the RHS of equation 2.27 is similarly factored in a hierarchical fashion, according to the result of the agglomerative clustering stage.

An MLP classifier is combined with the RNN probability estimators in the SPRACH system acoustic model. The MLP, schematically illustrated in the right panel of figure 2.3, is trained to estimate CI phone class posterior probabilities from a sequence of observation vectors $x_{t-c}, \ldots, x_{t+c}$, providing $c$ frames of acoustic context on either side of the $t$th frame. MSG features are supplied to the network, computed from a waveform down-sampled to 8 kHz. As the RNNs are trained upon broadband acoustics using PLP features and the MLP is trained upon narrowband acoustics using MSG features, their probability estimates are more reliable at different points in the signal and are therefore complementary rather than competitive. Various methods for combining the outputs of the various classifiers at the frame- and hypothesis-level have been investigated [38], one of which is described further in section 9.1.
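A minimal sketch of the factorisation of equation 2.26 at a single frame: a CI posterior vector is combined with per-class 'context expert' outputs to give a joint CD posterior. Shapes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n_ci, n_ctx = 40, 8                      # CI classes; CD contexts per CI class

ci_post = rng.dirichlet(np.ones(n_ci))                  # P(q_k | x_t)
expert_post = rng.dirichlet(np.ones(n_ctx), size=n_ci)  # P(c_j | q_k, x_t)

cd_post = ci_post[:, None] * expert_post                # P(c_j, q_k | x_t)
print(cd_post.sum())    # sums to one over all (CI class, context) cells
```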

2.6.3 Decoding

For the decoding process, the (CI or CD) phone class posterior probabilities are converted to scaled likelihoods and tied across all states of a phone class HMM. The number of states in each basic HMM is set equal to half the average duration of the phone class in frames [87]. By setting all state transition probabilities in the class models equal to 0.5, an approximation to a Poisson duration distribution, with a minimum duration constraint, is made for each class [88]. The normalising priors occupying the denominator of equation 2.18 are approximated from the acoustic model training data, as their estimation from the space of all possible word sequence models is clearly infeasible. The search for the most probable word sequence over the phone posterior probability space can be performed using either the NOWAY [153] or CHRONOS [170] stack-based decoders and a standard monolithic n-gram language model trained using hundreds of millions of words of text. The pronunciation lexicon is typically based upon a 65 k word vocabulary and makes use of multiple baseforms per word. (Recent modifications to the language model for named entity identification—companies, people etc.—are described in [75].)

A schematic representation of the ABBOT system is given in figure 2.4.

2.7 Summary

This chapter has compared and contrasted generative and acceptor HMMs, raising the following key points:

Common to both Generative and Acceptor HMMs

– As the state sequence associated with a particular utterance is in general unknown (the state sequence is hidden from the observations) an analytical solution to the training problem is lacking and the HMM must be trained in an iterative fashion.


– Due to limited training data and computational resources, subword units must be used as the basic modelling units for LVCSR. This strategy exploits the hierarchical nature of language and allows subword-level training examples to be shared and parameters to be tied across frequent and infrequently occurring words alike.

– Although the Mealy formulation focuses modelling resources upon the transitional portions of the speech signal, both the Moore and Mealy machines model the speech signal as a sequence of discrete states with stationary properties, connected by instantaneous transitions, and so suffer from the assumption that the speech signal is piecewise-stationary in nature.

– A challenge for both generative and acceptor HMMs is to be able to predict the effect of context in the speech signal. Any attempt to model an increased amount of context requires an enlarged parameter set. Several different strategies for parameter investment exist in this regard:

  – The use of context-dependent basic modelling units.
  – The relaxation of the observation independence assumption.
  – The conditioning of the observation distribution upon some history of previously visited states.

Specific to Generative HMMs

– The ML training criterion enables the use of the relatively efficient Baum-Welch re-estimation algorithm which has a number of desirable properties for a non-linear optimisation problem with missing data. These properties include (1) its simplicity; and (2) a guaranteed monotonic increase in the likelihood of the data given the model, leading to a guaranteed convergence to a local maximum. A consequence of optimising the model according to the ML criterion is, however, that the model does not enjoy the benefits derived from a discriminative training criterion.

– The observation independence assumption is introduced for reasons of computational expediency. The limiting nature of this assumption is mitigated to some degree by the use of delta features and the modelling of context-dependent subword sound classes.

– Other than the use of subword modelling units, parameter tying can be introduced through the use of semi-continuous (typically Gaussian mixture) emission distributions.

Specific to Acceptor HMMs

– The HMM is trained to directly estimate the posterior probability of the model given the acoustics. Consequences of this are that the model is (indirectly) trained to minimise the probability of misclassification and is efficient in terms of use of parameters.

– The estimation of phone class posterior probabilities from local portions of acoustics reduces the computational cost of discriminative training.

– ANNs are well suited to accommodating a relaxation of the acoustic independence assumption and the conditioning of the observation distribution upon a history of previous states, as the inherently tied parameter architecture deals efficiently with high dimensional input spaces. ANNs may also be trained with soft-targets, relaxing the assumption of instantaneous transitions between states.

– A disadvantage of choosing an ANN acoustic model is that it must be trained using complex non-linear optimisation techniques, such as gradient descent for example. Gradient descent has a number of undesirable properties such as no guarantee of convergence to a local minimum, the required computation of local gradient information and the specification of an arbitrary step-size, which can greatly affect the efficiency of the algorithm.



Figure 2.4: A schematic representation of the ABBOT acceptor HMM based LVCSR system.


Chapter 3

Hypothesis Testing

3.1 Classical Hypothesis Testing

Two states of nature are relevant to the test of an hypothesis $H$—its truth or its falsity. An hypothesis test is a decision making process, the outcomes of which may include the acceptance or rejection of $H$, although the set of possible actions resultant from an hypothesis test may contain others, such as the deferral of the decision until further evidence is accrued.

As an example, $H$ may represent the hypothesis that a word $w$ decoded by an ASR system is correct. In this case, the acceptance of $H$ will lead to $w$ being tagged as correct, whereas its rejection will result in $w$ being tagged as incorrect. As the goal of an utterance verification process is to identify which recogniser outputs are correct and which are incorrect, it can be seen that the utterance verification task can be easily cast within an hypothesis testing framework. $H$ could alternatively be used to represent the hypothesis that a given recogniser output is the product of the input of an OOV word, or a non-speech input. OOV word spotting, keyword spotting and speech/non-speech distinction tasks can thus also be formulated as tests of appropriate hypotheses.

The decision to accept or reject an hypothesis in a statistical hypothesis test is based upon the value of a test statistic $T$. For a classical hypothesis test, $T$ is based upon frequentist statistics and so the use of prior probabilities is vetoed.1 In the absence of any prior probabilities, a classical hypothesis test requires the specification of a null hypothesis $H_0$ and an alternative hypothesis $H_1$. The default action of a classical test is to accept $H_0$ and so the null hypothesis constitutes an assumption regarding the world of discourse. As a consequence, the rejection of $H_0$ makes a much stronger statement than its acceptance (much like disproving a theory is more compelling than presenting evidence to support it): To reject $H_0$ there must be sufficient evidence to warrant the action, whereas to accept $H_0$ requires only that there be insufficient evidence to prefer $H_1$.

For a classical hypothesis test, possible values of $T$ fall into one of two regions; the acceptance region ($H_0$ is accepted) or the critical region ($H_0$ is rejected). The particular form of the two regions is determined by the focus of the test: If, in order to make the decision, the value of $T$ need only be compared to a single threshold, the test becomes a one-tailed test and the acceptance and critical regions are delineated by a single critical value or operating point. If the value of $T$ relative to some contiguous region is of interest, the acceptance and critical regions are delineated by two critical values and the test becomes a two-tailed test.

As the test statistic is required to be informative with regard to the truth or falsity of $H$, an appropriate test statistic for an utterance verification task, at the word-level say, is an objective measure of how well the model for a recogniser output $w$ matches the aligned portion of acoustics. The focus of the task dictates that the test is a one-tailed test: If the confidence is low, $w$ should be tagged as incorrect and vice versa. Confidence measures are also appropriate test statistics for other tasks, such as OOV word spotting or speech/non-speech distinction, although the particular confidence measure formulation will vary from task to task. For the speech/non-speech distinction task, for example, a

1 Frequentist statistics are calculated using only observed data, whereas Bayesian statistics incorporate prior probabilities.


general measure of acoustic model match is an appropriate test statistic as all possible word models, not just that for $w$, should provide a poor match for non-speech.

3.2 Probability of Error

Two types of error are possible for a classical hypothesis test:

A type I error occurs when $H_0$ is rejected when it is in fact true—a false negative.

A type II error occurs when $H_0$ is accepted when it is false—a false positive.

The power of a test, $P(\text{reject } H_0 \mid H_0\ \text{false})$, like its converse, $P(\text{type II error})$, quantifies the ability of the test to distinguish between the two states of nature. The positioning of the operating point for a test determines the size of the critical region (often just referred to as the size of the test) and so embodies a compromise between the power of the test and $P(\text{type I error})$.

The results for a series of applications of a classical hypothesis test equipped with two actions (accept or reject) may be recorded in a confusion matrix such as that shown in figure 3.1, where the number of particular actions observed for a given state of nature is recorded in the appropriate cell.

[Figure: a 2x2 table with axes labelled 'Actions' and 'States of Nature'.]

Figure 3.1: A confusion matrix recording actions against states of nature for a classical hypothesis test.

A number of statistics can be computed from a confusion matrix and used to evaluate the performance of the test statistic at a given operating point. The simplest of these is the unconditional error rate (UER) of the test, which involves the sum of the number of type I and type II errors. The sample estimate of the UER is given by:

$$\hat{P}(\text{error}) = \frac{\#(\text{type I errors}) + \#(\text{type II errors})}{n} \qquad (3.1)$$

where $\hat{P}$ denotes a sample probability estimate and $n$ is the total number of hypotheses tested, which is equal to the sum of the counts in all four cells of the matrix.

It is also possible to calculate sample estimates of error rates conditioned on a particular state of nature, corresponding to the probabilities of a type I or a type II error:

$$\hat{P}(\text{type I error}) = \frac{\#(\text{type I errors})}{\#(H_0\ \text{true})} \qquad (3.2)$$

$$\hat{P}(\text{type II error}) = \frac{\#(\text{type II errors})}{\#(H_0\ \text{false})} \qquad (3.3)$$

It is important to note that the UER of an hypothesis test will be influenced not only by the efficacy of the test statistic for distinguishing between the two states of nature, but also by the relative frequency of examples of the two states of nature. The relative frequency of the two states of nature is an important factor in determining the difficulty of the task. For example, consider an utterance verification task ($H_0$ is set to the hypothesis that any given recogniser output is correct) applied to two different data sources. The first source of data is the output of a connected digit recogniser. In this case the recogniser error rate will typically be very low, say 0.3%. Even with a completely uninformative test statistic, an extremely low UER may be obtained in this case by merely setting the operating point of the test such that $H_0$ is accepted on all occasions. The second source of data is the output of a large vocabulary spontaneous speech recogniser. In this case the recogniser error rate will be much higher, say 30–40% WER. Obtaining a low UER for a verification test on this data is much more challenging as merely accepting $H_0$ on all occasions will not yield a low UER. Hypothesis tests involving considerably imbalanced relative frequencies for the two states of nature are thus easier than those for which the relative frequencies are balanced.

The UER of an hypothesis test is therefore a task specific metric and so is useful for determining how the difficulty of the task influences the performance of the test statistic (this will be an important topic in chapter 6). On other occasions however, such as when developing and comparing different test statistics, it is much more desirable to be able to evaluate them independently from the difficulty of the task. This may be done through the use of conditional error rates.

Two other statistics which may also be computed from a confusion matrix are precision ('purity' of retrieval) and recall (completeness of retrieval). Precision and recall are popular for information retrieval tasks, but in the context of an hypothesis test become [103, 181]:

$$\text{precision} = \frac{\#(H_0\ \text{true} \wedge \text{accepted})}{\#(\text{accepted})} \qquad (3.4)$$

$$\text{recall} = \frac{\#(H_0\ \text{true} \wedge \text{accepted})}{\#(H_0\ \text{true})} \qquad (3.5)$$

As the precision of a test is conditioned upon the action of acceptance, and recall is conditioned upon $H_0$ being true, precision, but not recall, may be influenced by the difficulty of the task. For this reason, precision and recall are not as easily interpretable as the error rates of a test and so were not used in this work.
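To make the relationship between the four cells of the confusion matrix and these statistics concrete, the following sketch (Python; the counts are hypothetical) computes the sample estimates of equations 3.1–3.5:

```python
def confusion_stats(n_tt, n_tf, n_ft, n_ff):
    """Statistics from a 2x2 confusion matrix for a classical hypothesis test.

    n_tt: H0 true  and accepted (correct acceptance)
    n_tf: H0 true  but rejected (type I error)
    n_ft: H0 false but accepted (type II error)
    n_ff: H0 false and rejected (correct rejection)
    """
    n = n_tt + n_tf + n_ft + n_ff
    return {
        "UER": (n_tf + n_ft) / n,                 # equation 3.1
        "P(type I error)": n_tf / (n_tt + n_tf),  # equation 3.2
        "P(type II error)": n_ft / (n_ft + n_ff), # equation 3.3
        "precision": n_tt / (n_tt + n_ft),        # equation 3.4
        "recall": n_tt / (n_tt + n_tf),           # equation 3.5
    }

# Hypothetical counts from 1000 tests of "this decoded word is correct":
print(confusion_stats(n_tt=850, n_tf=50, n_ft=60, n_ff=40))
```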

3.3 ROC Curves

A Receiver Operating Characteristic (ROC) curve is a plot of the detection rate, on the y-axis, against the false alarm (FA) rate, on the x-axis [47]. An ROC curve can be used to plot the probability of performing a correct action conditioned on one state of nature against the probability of performing the same action (this time incorrect) conditioned on the other state of nature (a false alarm). An ROC curve may thus be created by plotting either:

$P(\text{accept} \mid H_0\ \text{true})$ on the y-axis against $P(\text{accept} \mid H_0\ \text{false})$ (i.e. $P(\text{type II error})$) on the x-axis,

or

$P(\text{reject} \mid H_0\ \text{false})$ (the power) on the y-axis against $P(\text{reject} \mid H_0\ \text{true})$ (i.e. $P(\text{type I error})$) on the x-axis.

The choice depends upon whether it is desired to plot the rate of detecting instances of $H_0$ true or of $H_0$ false against the respective FA rate. As:

$$P(\text{accept} \mid H_0\ \text{true}) = 1 - P(\text{type I error}) \qquad (3.6)$$

and

$$P(\text{reject} \mid H_0\ \text{false}) = 1 - P(\text{type II error}) \qquad (3.7)$$

the information contained in both possible ROC plots is, however, equivalent.

The left panel of figure 3.2 illustrates three such ROC curves. An ideal hypothesis test is one with an error rate of zero, i.e. a detection rate of 1.0 for all possible FA rates (most desirably for a FA rate of zero), and is characterised by an ROC curve which passes through the top left corner of the plot. A guessing hypothesis test is one with a maximum error rate,2 i.e. the detection rate is equal to the FA rate, and is characterised by an ROC curve which is a straight line passing through the origin and the top right corner of the plot. A typical ROC curve will fall between these two extremes.

[Figure: left panel shows ideal, typical and guessing ROC curves on detection rate vs. FA rate axes, both on [0, 1]; right panel shades the area corresponding to the FOM over a range of FA rates.]

Figure 3.2: Schematic illustrations of an ROC plot for an ideal, guessing and typical hypothesis test (Left) and for the figure-of-merit (FOM) for a range of FA rates (Right).

In a similar vein to an ROC curve, the probability of a type I error can be plotted against the probability of a type II error. The left panel of figure 3.4 provides a schematic illustration of such curves for an ideal, guessing and typical hypothesis test. From equations 3.6 and 3.7, it can be seen that such a curve provides equivalent information to that presented by an ROC curve.
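The construction of an empirical ROC curve from a set of test statistic values can be sketched as follows; the scores, labels and Gaussian score distributions are purely illustrative, and the plot convention follows the second option above (power against $P(\text{type I error})$):

```python
import numpy as np

def roc_points(scores, h0_true):
    """Detection rate vs. false alarm rate for detecting 'H0 false' items,
    sweeping the operating point over every observed test-statistic value.
    Low scores are taken to indicate low confidence (H0 false)."""
    scores = np.asarray(scores, dtype=float)
    h0_true = np.asarray(h0_true, dtype=bool)
    points = []
    for tau in np.sort(np.unique(scores)):
        reject = scores < tau
        detection = np.mean(reject[~h0_true])   # power: reject | H0 false
        false_alarm = np.mean(reject[h0_true])  # P(type I error): reject | H0 true
        points.append((false_alarm, detection))
    return points

# Hypothetical confidence scores for correct (H0 true) and incorrect words:
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.5, 500), rng.normal(0.0, 0.5, 100)])
labels = np.concatenate([np.ones(500, bool), np.zeros(100, bool)])
for fa, det in roc_points(scores, labels)[::100]:
    print(f"FA rate {fa:.3f}  detection rate {det:.3f}")
```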

3.4 DET Curves

It is common practice to plot several type I against type II error curves (or ROC curves) on the same axes to facilitate comparison between the performance of several different test statistics. If the test statistics all provide similar performance the curves will form a closely spaced cluster and the plot will become difficult to read. The use of a detection error tradeoff (DET) curve, e.g. [128], is motivated by this problem. For the purposes of this plot, it is assumed that the distributions of test statistic values $T$, conditioned on the two states of nature, are Gaussian. A schematic illustration of this case is given in figure 3.3, where the means of the two distributions are labelled $\mu_0$ and $\mu_1$, and the operating point of the test is marked $\tau$. The hatched regions indicate the areas corresponding to $P(\text{type I error})$ and $P(\text{type II error})$.

If the deviations from the Gaussian means corresponding to $P(\text{type I error})$ and $P(\text{type II error})$ are plotted for a range of operating points, curves such as those schematically illustrated in the right panel of figure 3.4 are obtained. The plot for a given test statistic will be linear if the two distributions are indeed Gaussian. The plots for a 'good' and guessing hypothesis test are shown. The warping of the axes has the effect of accentuating the differences between well performing test statistics, clustered in the lower left quadrant of the plot.

2 Excluding the case where the test becomes 'pathological' in the sense that more instances of $H_0$ true are rejected than are accepted and vice versa.
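A minimal sketch of the axis warping behind a DET plot is given below, assuming SciPy's inverse normal CDF (`norm.ppf`) is available to map error probabilities to Gaussian deviates; the score distributions are again illustrative:

```python
import numpy as np
from scipy.stats import norm

def det_points(scores, h0_true):
    """P(type I error) vs. P(type II error) on Gaussian-deviate (probit) axes.
    If the two conditional score distributions are Gaussian, the curve is linear."""
    scores = np.asarray(scores, float)
    h0_true = np.asarray(h0_true, bool)
    points = []
    for tau in np.sort(np.unique(scores))[1:-1]:
        p1 = np.mean(scores[h0_true] < tau)    # P(type I):  reject | H0 true
        p2 = np.mean(scores[~h0_true] >= tau)  # P(type II): accept | H0 false
        if 0.0 < p1 < 1.0 and 0.0 < p2 < 1.0:
            points.append((norm.ppf(p1), norm.ppf(p2)))  # warp both axes
    return points

# Hypothetical Gaussian-distributed scores for the two states of nature:
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.5, 500), rng.normal(0.0, 0.5, 100)])
labels = np.concatenate([np.ones(500, bool), np.zeros(100, bool)])
print(det_points(scores, labels)[:3])
```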


[Figure: two overlapping distributions over the test statistic $T$, with 'Accept' and 'Reject' regions either side of the operating point.]

Figure 3.3: A schematic illustration of distributions of values of the test statistic $T$ conditioned on the two states of nature, $H_0$ true and $H_0$ false. The operating point on $T$ is labelled $\tau$ and the means of the two distributions are labelled $\mu_0$ and $\mu_1$. The hatched regions indicate the areas corresponding to $P(\text{type I error})$ (horizontal lines) and $P(\text{type II error})$ (diagonal lines).

[Figure: left panel plots $P(\text{type II error})$ against $P(\text{type I error})$ on linear [0, 1] axes, marking the EER point; right panel plots the same quantities on Gaussian deviate axes with ticks from 0.005 to 0.995.]

Figure 3.4: Schematic illustrations of plots of $P(\text{type I error})$ against $P(\text{type II error})$. Left: Curves for guessing, typical and ideal hypothesis tests on linearly scaled axes. The point of equal error rate (EER) is also indicated for a typical test. Right: A schematic illustration of a detection error tradeoff (DET) curve for a good and guessing hypothesis test. In this case the axes have a Gaussian deviate scale.


3.5 Mutual Information

Another evaluation metric which may be computed from the values stored in a confusion matrix is the mutual information between a state of nature and a chosen action. In order to compute this metric, an hypothesis test is considered analogous to the transmission of a signal across a noisy binary channel. Specifically, the state of nature is considered to be equivalent to the source value of the signal, prior to transmission, and the action resulting from the test corresponds to the received value of the signal.

The state of nature may be encoded using a binary random variable $S$, and the action resulting from the test may be encoded using another binary random variable $A$.

The mutual information between the values of these two random variables is given by [41]:

$$I(S;A) = H(A) - H(A \mid S) = H(S) - H(S \mid A) \qquad (3.8)$$

where $H(S)$ and $H(A)$ denote the entropy of the respective random variables. Entropy is defined as the uncertainty in the value of a random variable. $H(A \mid S)$ and $H(S \mid A)$ denote the conditional entropy of the two random variables and represent the uncertainty in the action given a known state of nature and vice versa. $I(S;A)$ is the reduction in uncertainty in the action given knowledge of the state of nature, which is symmetrically equal to the reduction in uncertainty in the state of nature given the action. The value of $I(S;A)$ will be equal to the full entropy for an ideal hypothesis test (uncertainty is reduced to zero) and equal to zero for a guessing hypothesis test (no uncertainty is removed).

The left panel of figure 3.5 shows how the value of $I(S;A)$ varies over the range of possible values for $H(S)$ and the UER of the hypothesis test. $H(S)$ may be considered representative of the difficulty of the task as it is a measure of the uncertainty in the state of nature. The plot shows that for a fixed UER of an hypothesis test, the value of $I(S;A)$ is larger for a harder task than for an easier one.

Normalised mutual information, or efficiency [42], is illustrated in the right panel of figure 3.5 and defined as:

$$\text{efficiency} = \frac{I(S;A)}{H(A)} \qquad (3.9)$$

The normalisation ensures that efficiency values obtained for a test with some UER will be equal for all degrees of uncertainty in the action. Normalising by $H(S)$ would ensure equal efficiency values for all degrees of uncertainty in the state of nature (i.e. all difficulties of the task).


[Figure: two panels over $H(S)$ values from 0.69 down to 0.0, each showing an 'ideal' upper bound and a 'guessing' floor; the left panel's performance axis runs from 0 to 0.7, the right panel's from 0 to 1.]

Figure 3.5: Left: Values of $I(S;A)$ over the range of possible values of $H(S)$ and UER of an hypothesis test. Right: Values of efficiency plotted on similar axes.

A related confidence measure evaluation metric, termed reduction in cross-entropy, is described in [34, 181, 103, 205]. The metric, which is similar to that described in [73], has been adopted in recent years by the DARPA/NIST CSR evaluation community [217, 181, 103]. A clear disadvantage of either the reduction in cross-entropy, or the metric described in [73], is that the confidence measure must take the form of the probability of some attribute of a decoding hypothesis, such as correctness for example, as estimated by some form of post-classifier.

3.6 Scalar Summary Statistics

By Integration

The information conveyed in an ROC curve can be summarised using a more compact statistic, such as the area beneath an ROC curve3 or the Figure-of-Merit (FOM). The FOM is defined as the average detection rate calculated over a range of FA rates [122]. The right panel of figure 3.2 provides a schematic representation of a FOM for a given range of FA rates. The area beneath an ROC curve bounded by a lower FA rate $f_l$ and an upper FA rate $f_u$ is related to the FOM calculated over this range of FA rates by:

$$\text{Area} = (f_u - f_l) \times \text{FOM} \qquad (3.10)$$

Equal Error Rates

A method for summarising a plot of $P(\text{type I error})$ against $P(\text{type II error})$ using a single value is to record the equal error rate (EER) [128], for some operating point $\tau$:

$$\text{EER} = P(\text{type I error}; \tau) = P(\text{type II error}; \tau) \qquad (3.11)$$

Distributional Separability

Given a set of observation vectors, the distributions of test statistic values $T$ conditioned on a particular state of nature can be plotted, as schematically illustrated in

3 If the detection and FA rates cover the range [0,1], this area is equivalent to the value of the Mann-Whitney version of the non-parametric two-sample statistic [219]: it has a value equal to 1.0 for an ideal hypothesis test and a value equal to 0.5 for a guessing test. For a one-tailed test where a high test statistic value is indicative of $H_0$ being true, the area is equal to the probability that an hypothesis drawn at random from the set of $H_0$-true cases has a test statistic value that is larger than that for one drawn at random from the set of $H_0$-false cases.


figure 3.3. Such a plot shows how well $T$ distinguishes the two states of nature—an ideal test statistic will yield two narrow and completely separable distributions.

A number of inseparability metrics can be used to quantify how inseparable the two distributions are. As these metrics measure inseparability, a low value indicates that the distributions are separable, whereas a high value indicates that the distributions overlap to a large degree. Two such metrics are the Kolmogorov Variational Distance, $D_{KV}$, and the Bhattacharyya Distance, $D_B$ [81]. These may be calculated for a classical hypothesis test using:

$$D_{KV} = 1 - \tfrac{1}{2}\int \left| p(t \mid H_0\ \text{true}) - p(t \mid H_0\ \text{false}) \right| \mathrm{d}t \qquad (3.12)$$

$$D_{B} = \int \sqrt{ p(t \mid H_0\ \text{true})\; p(t \mid H_0\ \text{false}) }\; \mathrm{d}t \qquad (3.13)$$

Both quantities lie in $[0,1]$, taking low values for completely separable conditional distributions and high values for heavily overlapping ones.
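A histogram-based sketch of the two inseparability metrics is given below; the binning scheme and the precise forms (the complement of the total variation between the two conditional distributions, and the Bhattacharyya coefficient) are assumptions consistent with the definitions above rather than the exact implementation used in this work:

```python
import numpy as np

def inseparability(scores_h0_true, scores_h0_false, n_bins=50):
    """Histogram estimates of D_KV and D_B: values near 0 indicate separable
    conditional score distributions, values near 1 indicate heavy overlap."""
    s0 = np.asarray(scores_h0_true, float)
    s1 = np.asarray(scores_h0_false, float)
    bins = np.linspace(min(s0.min(), s1.min()), max(s0.max(), s1.max()), n_bins + 1)
    p0, _ = np.histogram(s0, bins=bins)
    p1, _ = np.histogram(s1, bins=bins)
    p0 = p0 / p0.sum()                       # conditional distribution, H0 true
    p1 = p1 / p1.sum()                       # conditional distribution, H0 false
    d_kv = 1.0 - 0.5 * np.abs(p0 - p1).sum() # complement of total variation
    d_b = np.sqrt(p0 * p1).sum()             # Bhattacharyya coefficient
    return d_kv, d_b

# Illustrative scores: well separated distributions give low values of both.
rng = np.random.default_rng(0)
print(inseparability(rng.normal(1.0, 0.3, 500), rng.normal(0.0, 0.3, 100)))
```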

3.6.1 Comments

Any summarisation of the data recorded in a confusion matrix results in a loss of information. For example, information is lost when an ROC plot is used to summarise a confusion matrix and further information is lost when an ROC curve is itself described using the area beneath it. The calculation of any probabilities from a confusion matrix also requires the assumption that each trial, i.e. each application of the hypothesis test, is independent. This assumption is not strictly warranted when applying an hypothesis test to successive outputs of a CSR system, since errors in a decoding tend to form contiguous blocks and are thus highly correlated.

3.7 Summary

This chapter has covered the basic theory of classical statistical hypothesis testing. It has been shown that tasks such as utterance verification and OOV word spotting can be cast within a statistical hypothesis testing framework, where different confidence measures are suited as the test statistic for different tasks. The results of a series of applications of an hypothesis test can be recorded in a confusion matrix, from which a number of evaluation metrics can be computed. Some evaluation metrics, such as the UER of a test, are task specific whereas others, such as those derived from conditional error rates, are task independent. Their varying properties make different evaluation metrics appropriate under different circumstances. For example, UER is useful for comparing the performance of a test statistic over tasks of varying difficulty, whereas conditional error rates are useful for test statistic development.


Chapter 4

Approaches to Confidence Measures

4.1 Introduction

The recent surge of interest in measures of confidence for ASR has generated an extensive literature which requires review to place each contribution into context. Most ASR systems are based upon generative HMMs. As the class conditional acoustic likelihoods $p(X \mid M)$ estimated by these models are relative to the probability of the acoustics $p(X)$ marginalised over all word sequence models $M$, they are not comparable across utterances. The outputs of a generative HMM therefore require further processing before they may be used as confidence measures. This is not so for an acceptor HMM as the estimated posterior probabilities $P(M \mid X)$ are implicitly scaled by $p(X)$.

The vast majority of confidence measure publications in the recent conference literature describe relatively subtle variations upon two general themes: (1) the scaling of within-vocabulary acoustic model likelihoods by the likelihood of an alternate model, to form a likelihood ratio; and (2) the training of an application specific post-classifier to estimate the posterior probability of some event given some set of confidence predictor variables. Three popular tasks for the application of these methods are utterance verification, keyword spotting and OOV word spotting.

4.2 Likelihood Ratios

The use of a likelihood ratio as a test statistic in a statistical hypothesis test is well motivated from both the frequentist and Bayesian standpoints. In the frequentist case, the Neyman-Pearson lemma states [70]:

Given the likelihood functions $p(x \mid H_0)$ and $p(x \mid H_1)$, the most powerful test for a given size of $H_0$ vs. $H_1$ has a critical region of the form:

$$\frac{p(x \mid H_1)}{p(x \mid H_0)} > \tau \qquad (4.1)$$

for some non-negative constant $\tau$—the operating point.

In the Bayesian case [16], hypothesis selection is based on posterior probabilities and so the likelihood of some hypothesis should be normalised by $p(x)$ (assuming equal priors).
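As an illustration of the lemma, the sketch below applies the likelihood ratio test of equation 4.1 under the purely illustrative assumption that $p(x \mid H_0)$ and $p(x \mid H_1)$ are univariate Gaussians; the means, variances and threshold are hypothetical:

```python
from scipy.stats import norm

def likelihood_ratio_test(x, tau=1.0):
    """One application of the Neyman-Pearson test of equation 4.1, assuming
    illustrative Gaussian likelihood functions for the two hypotheses."""
    lik_h0 = norm.pdf(x, loc=1.0, scale=0.5)  # hypothetical in-vocabulary model
    lik_h1 = norm.pdf(x, loc=0.0, scale=0.5)  # hypothetical alternate model
    return "reject H0" if lik_h1 / lik_h0 > tau else "accept H0"

print(likelihood_ratio_test(0.9))  # close to the H0 mean -> accept H0
print(likelihood_ratio_test(0.1))  # close to the H1 mean -> reject H0
```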

4.2.1 Explicit Alternate Models

The use of explicit alternate models to estimate the alternative likelihood $p(x \mid H_1)$, or the unconditional probability $p(x)$, has been popular for relatively constrained tasks such as digit or company name recognition over the telephone.


Utterance Verification

The use of an explicit alternate model of extraneous acoustics is reported in [149, 124, 42, 194, 185, 193, 150, 192, 176, 79]. Composite alternate models and discriminative training methods are additionally described in [149, 124, 194, 185, 192]. Composite alternate models are typically composed of general models of extraneous acoustics, sometimes termed background models, together with anti-keyword or imposter models which are trained using examples of in-vocabulary items other than a specific keyword. The discriminative training techniques reported involve optimising the parameters of the in-vocabulary models and/or alternate model using the likelihood ratio as the training criterion. From a frequentist viewpoint, this optimisation can be seen to be based directly upon the Neyman-Pearson lemma. Gupta & Soong [79] show that an adaptive threshold, based on utterance length, provides improvements over a static threshold for a digit recognition task. Rose et al. [176] take the step of conditioning the probability of a word estimated by the language model not only upon its $n$-gram history, but also upon the (quantised) acoustic confidences of the elements of that history through the use of a variable $n$-gram stochastic automaton [156]. In this case the language model estimates the probability of a word sequence model conditioned upon the acoustic confidence of that word sequence and the (generative HMM based) recognition system as a whole performs a search for $\hat{M}$ such that:

$$\hat{M} = \arg\max_{M} \; p(X \mid M) \, P(M \mid C(M)) \qquad (4.2)$$

where $C(M)$ denotes the acoustic confidences ascribed to the word sequence. Word sequences with high acoustic confidence are thus preferred. An issue with this approach is that the language model must be trained using a corpus of word tokens tagged with appropriate acoustic confidences as ascribed by the acoustic model. In order to do this the acoustic realisations of each word in the corpus must be recorded and stored, with obvious repercussions concerning data sparsity for training the language model.

Keyword Spotting

A common approach to the task of keyword spotting is to employ a set of filler, or garbage, models to compete with the keyword models during a decoding. Filler models must accommodate non-keywords as well as extraneous acoustics such as non-speech sounds and channel noise. The generation of keyword hypotheses is often followed by a verification stage. In the absence of a language model, competition between two generative HMMs in a decoding is equivalent to the formation of a likelihood ratio.

The use of explicit filler models is reported in [171, 210, 175, 172, 129, 123, 94, 122, 173, 174]. One of the earliest descriptions is provided by Rohlicek et al. [171], who report the use of whole word keyword models and a single filler model created by combining the Gaussian emission probability distributions from all the keyword models. Wilpon et al. [210] report one of the first uses of filler models trained on examples of extraneous acoustics. Lleida et al. [123] report an investigation into various forms of filler models, including phonetic-, syllabic- and word-level fillers. Syllabic fillers were found to provide the best performance in this study. An additional verification stage for hypothesised keywords was introduced by Rose & Paul [175], in which the likelihood of an alternate model, distinct from the filler model, was used to form a ratio with the likelihood of the keyword model. The additional verification stage was found to reduce the number of false alarms given by the keyword spotter. Discriminative training techniques for an alternate model, based on the Neyman-Pearson lemma, were introduced later by Rose [172]. This additional parameter optimisation step was found to provide an additional performance gain. Mazor & Feng [129] describe a somewhat ad hoc investigation into the application of a threshold to various quantities, including the likelihood of the keyword hypothesis and the difference between the likelihood of the best and second best keyword hypotheses, for the verification task. Chang et al. [31, 122] report the use of a FOM based discriminative training criterion to optimise the parameters of the keyword models.

Discriminative training according to the Neyman-Pearson lemma can be seen to increase the probability of rejecting $H_0$ given that it is false, i.e. the power of the test. The probability of accepting $H_0$ given that it is false—$P(\text{type II error})$—must be reduced, as $P(H_0\ \text{false})$ will remain constant for a given data set. Likewise it can be seen that the probability of accepting $H_0$ given that it is true will be increased and so the probability of rejecting $H_0$ given that it is true—$P(\text{type I error})$—will be reduced. Training according to a criterion based upon the FOM of a test will increase the detection rate for a given range of false alarm rates. For one of the two ROC curves that it is possible to plot, this will correspond to increasing the power for a given range of the probability of a type I error.

Comments

Since an explicit alternate model is required to model all extraneous acoustics with a limited number of parameters (typically this number is very small in comparison to the number of parameters used to model the in-vocabulary items), the approach is limited in practice to relatively constrained tasks which have a restricted range of acoustic conditions. It may also be argued that trying to explicitly model the distribution of all extraneous acoustics is an ill-posed problem due to the limited number of samples in the training data for this very high-dimensional and multimodal probability distribution. The use of explicit alternate models does, however, facilitate a relatively straightforward approach to obtaining a desirable discriminative training criterion. In this regard, it should be noted that the CML criterion is preferable to the Neyman-Pearson lemma criterion as the former, unlike the latter, does not assume equal priors.

4.2.2 Likelihood of Competing Decodings

The denominator of Bayes' theorem (equation 1.2) indicates that in principle an estimate of $p(X)$ can be made by summing over the likelihoods and priors of all competing models. In practice, however, the computational expense of this summation is prohibitive and one of three approximations is typically made. The likelihood of a decoding hypothesis may be normalised by:

1. The summation over the likelihoods of the $N$-best competing decodings.

2. The summation over the inventory of subword-level model likelihoods on a per-frame basis, averaged over some interval.

3. The likelihood of the Viterbi path through an unconstrained sequence of subword models.
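The sketch below illustrates this style of normalisation for a single decoding hypothesis, assuming per-frame likelihoods are available as arrays; it covers the per-frame sum over all subword models (approximation 2) and the per-frame average of the $N$ best likelihoods (the 'on-line garbage' variant discussed below). The interface is hypothetical:

```python
import numpy as np

def normalised_log_likelihood(lik_frames, all_lik_frames, n_best=None):
    """Duration-normalised log likelihood ratio for one decoding hypothesis.

    lik_frames:     per-frame likelihoods of the hypothesised model, shape (D,)
    all_lik_frames: per-frame likelihoods of every subword model, shape (D, K)
    n_best:         if given, normalise by the mean of the N best per-frame
                    likelihoods ('on-line garbage'); otherwise by the per-frame
                    sum over all K subword models (approximation 2).
    """
    lik = np.asarray(lik_frames, float)
    all_lik = np.asarray(all_lik_frames, float)
    if n_best is None:
        denom = all_lik.sum(axis=1)
    else:
        denom = np.sort(all_lik, axis=1)[:, -n_best:].mean(axis=1)
    return float(np.mean(np.log(lik) - np.log(denom)))

# e.g. a hypothetical 3-frame hypothesis over a 10-model inventory:
rng = np.random.default_rng(2)
all_lik = rng.random((3, 10))
print(normalised_log_likelihood(all_lik[:, 0], all_lik, n_best=5))
```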

Utterance Verification

Cox & Rose [42] compared strategies (1) and (3) and found a variant of (1), termed on-line garbage [20, 22], to perform better on SwitchBoard (SWB) data. Caminero et al. [29] provide results showing the performance of on-line garbage to be better than that for an explicit alternate model, for two small vocabulary tasks. In an extension to their previous approach, Caminero et al. also describe [30] the use of an explicit alternate model to spot noise and keywords followed by a subsequent verification step using on-line garbage and task-specific grammar constraints, for a long number recognition task. Willet et al. [209] compute estimates of $p(x_t)$ by summing over all phone model likelihoods on a per-frame basis.1 Wessel et al. [207] compute an estimate of the unconditional probability of the acoustics over an interval spanned by a decoding hypothesis by summing the likelihoods of all hypotheses spanning that interval contained in an $N$-best word lattice.

Keyword Spotting

Rivlin et al. [165] were the first to use approach (2) for computing $p(x_t)$ for the subsequent estimation of posterior phone probabilities on a per-frame basis. These posterior probabilities were averaged in the log domain over the interval spanned by a phone hypothesis to form a phone-level confidence measure. A word-level measure was created using the confidences of its constituent phones. The term on-line garbage was first introduced by Boite et al. [20, 22] to describe the formation of a local garbage model from the average of the $N$-best likelihoods at each frame of an utterance.

1 The effect of the approach to duration normalisation described in chapter 5 was incorrectly reported in [209].


The approach was found to perform favourably in comparison to explicit filler models on the Resource Management (RM) corpus and another medium sized task. Klemm et al. [111] report the use of fillers based upon word, syllable and phone models. In the case of word models, sometimes referred to as lexical fillers, the keyword spotter is equivalent to a large vocabulary recognition system. The performance of syllabic fillers was found to compare favourably to that of the large vocabulary recogniser based approach in this study. Similar results for lexical fillers are reported by El Meliani & O'Shaughnessy [49, 51, 50, 52]. The application of large vocabulary recognition techniques is also described by Weintraub [204]. Kawahara et al. [101] describe the use of speaking-style dependent lexical fillers.

Jeanrenaud et al. [94, 95] extend the task of keyword spotting to 'event spotting', where they define an event as "a collection of phrases or clauses, of one or more words, that relate to a single concept". They provide a time-of-the-day event example, which may be instantiated as one thirty or one thirty in the morning on Tuesday. They describe two approaches to event spotting: one approach based on keyword detection and event sub-grammars and another based on a large vocabulary recogniser. The sub-grammar based approach was found to perform favourably in comparison to the large vocabulary recogniser.

OOV Word Spotting

A looped-phone alternate model, as described by Asadi et al. [6, 7] and schematically illustrated in figure 4.1, is a useful tool for the task of OOV word spotting as it may be used to (1) spot a phone sequence with a higher likelihood than any specified in the pronunciation lexicon; and (2) simultaneously provide a putative phone-level transcription for such an (OOV) input. A similar approach using syllable models is reported by Kemp & Jusek [102]. The incorporation of looped-phone model generated pronunciation models into the pronunciation lexicon in [7] was found not to perform as well as those derived from spelling-to-sound rules (a caveat of the second approach is that the spelling of the new word must be provided to the system). Haeb-Umbach et al. [80] found that the 'average' phone-loop decoding, derived from several examples of a new word, compared favourably to pronunciations derived from text-to-speech rules, however.

[Figure: a set of parallel phone models with a loop-back transition.]

Figure 4.1: A schematic illustration of a looped phone HMM.

Search

Jitsuhiro et al. [99] define a confidence measure based upon the difference between the likelihood of a target phoneme and the sum over the likelihoods of all other phonemes, calculated on a per-frame basis. Low confidence is obtained for small differences and vice versa. Adding the confidence of each constituent phoneme to the acoustic likelihood of a search path thus penalises low confidence decodings.


Comments

A motivation for deriving the alternate model likelihood from competing decodings that is often cited in the literature is that explicit alternate models are task-specific. This claim is often countered through the use of explicit subword alternate models trained using a phonetically balanced corpus, such as TIMIT for example. Although the competing decoding approach does seem applicable to larger vocabulary, less constrained tasks, it still suffers from the computational expense of estimating $p(X)$ using generative HMMs.

4.3 Post-Classifiers

Many post-classifier architectures and sets of confidence predictors have been investigated, typicallyusing larger, more challenging corpora than those used to investigate the likelihood ratio approach.

Utterance Verification

Post-classifiers for utterance verification using confidence predictors such as the likelihood of the decoding hypothesis, the language model probability, speaking rate, the duration of the hypothesis and information from $N$-best decoding lists or lattices are reported in [48, 42, 205, 73, 103, 181, 34, 187, 201, 46]. Siu et al. [187] provide a useful categorisation of their confidence predictors, including those derived (1) from the complete recogniser, such as word sequence model likelihoods and $N$-best statistics; (2) from the language modelling information, such as $n$-gram probabilities and the amount of training data used to estimate such probabilities; and (3) directly from the acoustics, such as the speaking rate and signal-to-noise ratio. Similar categorisations of confidence predictors are also adopted in [201] and [205]. Weintraub et al. [205] describe the use of an ANN based post-classifier together with a useful algorithm for marking decoding hypotheses, based upon a comparison of their timings and those found in a forced Viterbi alignment of the reference transcription. Also described are three evaluation metrics—mean squared error, cross-entropy and UER—which were implemented using the posterior probabilities of word correctness and the state of nature encoded using a binary random variable. Gillick et al. [73] report using a Generalised Linear Model (GLM) as a post-classifier as well as a cross-entropy based evaluation metric. Schaaf & Kemp [181, 103] report the investigation of a number of different confidence indicators and post-classification models which they evaluated using UER and an entropy based evaluation metric. The confidence indicator which they found most effective was termed the A-Stabil measure. This is essentially the number of times a given word hypothesis remains in a fixed position within the decoded word sequence, under varying acoustic model and language model weighting factors. This statistic is also described in [34], where it is called LM-jitter. The properties of the LM-jitter/A-stabil measure are discussed further in section 5.4. The general consensus of the above studies is that confidence predictors derived from $N$-best decoding lists or lattices are the most valuable and that an ANN based post-classifier is the most effective for the predictor combination stage.

Search

Post-classifier based confidence measures have been shown to be useful for guiding the decoding search [134, 115]. Neti et al. [134] describe a method for de-weighting the language model component in regions of a decoding where the posterior probability of word correctness is high. This is an appropriate strategy as the language model should be favoured in regions of ambiguous acoustics, but should 'play second fiddle' when the acoustics are well matched. It was found that using this approach WER could be reduced for data from the Air Travel Information Systems (ATIS) corpus but that no performance increase could be obtained on the SWB corpus. This discrepancy may be explained by a reduced quality of language model match on the more challenging, conversational SWB data. In this case, the language model component of the overall decoding score would be less informative and so any variation in its weighting would have less effect on the overall best decoding. The algorithm described by Koo et al. [115] ranks competing decodings solely on the basis of a confidence score. The algorithm was found to outperform a standard likelihood based decoder for a medium sized, telephone speech task.

Diagnostics

In addition to training post-classifiers, Eide et al. [48] and later Chase [34, 32, 33], have made some progress in the very important area of recogniser diagnostics. Eide et al. [48] conclude that the presence of short words (cf. [42]) and an increase in speaking rate are correlated with an increase in WER. As they stand, however, these facts do not provide insight into how the errors may be avoided (other than requiring that longer words be spoken at a strictly isochronal rate). Chase provides a useful recipe for categorising potential causes of error [32], such as OOV errors, search errors, homophone substitutions and acoustic and language modelling problems. The proposed method for identifying these classes relies heavily upon acoustic model likelihoods, however, and would benefit from the use of objective measures of acoustic model match. Using an objective measure, questions such as "was the error caused by high levels of background noise?", or "is the reference transcription correct at this point?", or even "does this pronunciation model accurately reflect the realisations of this word?", could potentially be answered. It is also important to consider how informative confidence predictors are with regard to actually improving a system. For example, Chase lists the duration of a decoding hypothesis and also the number of phones in its pronunciation model as confidence predictors [34]. Although these statistics can be correlated with word correctness on average, they say nothing about why a particular decoding hypothesis is incorrect.

Training Data Selection

It has been noted for the SWB corpus, e.g. [57], that pronunciation variation is often ignored in the transcription of acoustic training data; so that "kinda" is transcribed as kind of, for example. In order to provide accurate training targets for an acoustic model, therefore, non-canonical pronunciations must be predicted from the orthographic context alone. Even with additional knowledge of speaking rate and dialect etc., this is an extremely difficult and so far unsurmounted problem. The incidence of errors in the reference transcript of the Broadcast News (BN) corpus has also been reported by Pitz et al. [144]. These problems arise as the hand transcription of large quantities of more challenging speech data, especially spontaneous speech, is difficult, time-consuming and costly. Given this scenario, it is desirable to have some measure of confidence in the acoustic model training targets, especially if the transcription was not painstakingly crafted. As $N$-best decoding statistics and language model match information are not appropriate for this task (distinct language model mismatch is unlikely at any point in the reference transcription due to weak $n$-gram constraints) a purely acoustic confidence measure is required. Difficulties in the assessment of a reference transcription using acoustic likelihood-based measures are described in [144], although the use of acoustic likelihoods to highlight potentially erroneous portions of the reference transcript followed by hand verification and recogniser retraining was found to facilitate small decreases in WER. It was also noted in [144] that hand verification of potentially erroneous transcripts is not a practical proposition for large corpora and that some automatic verification method is required.

The idea of filtering training data on the basis of confidence also conjures up the notion of training a recogniser in an unsupervised fashion, where:

1. A partially trained, bootstrap acoustic model is used to decode some data.

2. The decoding is assessed to identify any low confidence portions.

3. The recogniser is retrained using targets derived only from the high confidence portions of thedecoding.

4. The process is iterated.

The potential for training a recogniser using cheap, plentiful untranscribed data is extremely stimulating and pilot studies have recently been reported by Kemp & Waibel [104] and Zavaliagkos et al. [218]. Kemp & Waibel report that by bootstrapping a recogniser on approximately 2 hours of hand transcribed German broadcast news data, further relative improvements in WER of between 3.1% and 5.5% could be obtained from unsupervised training on further hours of data. Using a recogniser bootstrapped on approximately 30 minutes of data, they estimate that unsupervised training yields about one third of the gain available from supervised training. The confidence measure used in this work was the same as that reported in [103]. Zavaliagkos et al. describe the application of a confidence measure to unsupervised training using Spanish CallHome data. They present results which suggest, rather astonishingly, that recogniser performance still improves, albeit slowly, even with a confidence threshold set to allow decodings with 80% WER into the training transcription.
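A minimal sketch of the four-step loop above is given below; the recogniser interface (`decode()`, `retrain()`) and the word-level confidence function are hypothetical placeholders, not an interface from any of the cited systems:

```python
def unsupervised_training(model, untranscribed, confidence, threshold, n_iter=3):
    """Confidence-filtered unsupervised training, following steps 1-4 above.
    `model` is a hypothetical recogniser exposing decode() and retrain()."""
    for _ in range(n_iter):                               # step 4: iterate
        training_targets = []
        for utterance in untranscribed:
            hypothesis = model.decode(utterance)          # step 1: decode
            kept = [w for w in hypothesis
                    if confidence(w) >= threshold]        # step 2: keep high confidence
            if kept:
                training_targets.append((utterance, kept))
        model = model.retrain(training_targets)           # step 3: retrain on targets
    return model
```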

Comments

Although, in principle, bringing the largest number of information sources to bear through a post-classifier will provide a confidence measure with the best performance for some application, a side effect is that the causes of low confidence are conglomerated and hence obscured by the post-classifier. Also, adopting the definition of a confidence measure as the posterior probability of word correctness, as is typically done in the above papers, detrimentally narrows the focus of confidence measure applications.

4.4 Language Model Probabilities

Language model probabilities may be used as confidence measures directly as they are not conditioned upon the acoustics and so are comparable across utterances. The language model constitutes a rich source of information for OOV word spotting. One observation made by Chase [34] is that language model probabilities tend to be reduced around occurrences of OOV words as the recogniser struggles to incorporate an inappropriate word hypothesis into an otherwise coherent word sequence. Another observation reported by Suhm et al. [191] is that for an approximately 14k word vocabulary, 27% of OOV words drawn from Wall Street Journal data are proper nouns, 45% are inflections of in-vocabulary words and 6% are concatenations of in-vocabulary words. Due to the nature of proper nouns, it seems feasible that they may be predicted with some success by their surrounding word context. For example, both Mr. ⟨person name⟩ and ⟨company name⟩ Ltd. are likely to receive high bigram probabilities.

Asadi et al. [6] report the use of a statistical class grammar where the probability of encountering an OOV word is estimated for each open class in the grammar. Suhm et al. [191] describe a fairly simple approach where a standard $n$-gram language model is augmented with an OOV word class (an approach also adopted in the ABBOT system [75]). Fetter et al. [53, 55] describe a rather more complicated approach which they term iterative substitution: it is argued that the majority of the words in the language model training data are incorporated into the recogniser's vocabulary, and so a mismatch will arise between the frequency of OOV words in the training data and the frequency of those in the test data. (This assumption may be called into question if one considers that the training set should be truly representative of the data encountered during testing.) To address this potential problem, an initially small subset of the words contained within the training data is declared as within vocabulary and all other words are declared as OOV. $n$-gram statistics are computed for an OOV word class, the within vocabulary subset enlarged and the process iterated. The statistics derived from successive iterations are merged. It is claimed that this process exerts a smoothing effect upon the calculated $n$-gram statistics. Gallwitz et al. [69] report the use of a category based $n$-gram where the probability of an OOV word is calculated for each category, where categories may be created manually or some clustering algorithm may be used to derive categories automatically.

4.5 Other Measures

A quite different approach to the problem of OOV word spotting is described by Hetherington [85]. Given that a recogniser will attempt to match an OOV word using in-vocabulary items, an $N$-best word lattice is likely to contain many different matching attempts, each with a relatively low score. This is in contrast to a well modelled in-vocabulary word, where an $N$-best lattice is likely to contain repeated matches of the correct word model. Hetherington found the active word count, defined as the number of unique words competing with each other during the recognition process at each point in the utterance, useful for OOV word spotting using the ATIS corpus. The active word count has been incorporated into an on-line model adaptation algorithm with some success, described in [5]. Ideally an adaptation algorithm should only draw upon alignments between models and associated acoustics which are known to be correct. The use of confidence measures provides a means to improve adaptation performance in unsupervised mode. Willet et al. [209] found a lattice based confidence measure to outperform other acoustic, grammatical and combined confidence measures on Wall Street Journal and RM data, which is consistent with the overall consensus from the post-classifier literature that $N$-best statistics are the most useful confidence indicators.

Rivlin [164] and Fetter et al. [54] describe the formation of frequency distributions for the two classes of correct and incorrect decoding hypotheses over their respective acoustic model likelihood values. The rejection decision is then based upon the relative proportions of these distributions given the likelihood of a particular decoding hypothesis. This approach is somewhat ad hoc, as likelihoods are not comparable across utterances. If, however, the acoustic conditions are sufficiently constrained for $p(X)$ to remain reasonably constant for all utterances, the approach may be feasible in practice.

4.6 Summary

The general strands of the above review are:

The surge of interest in confidence measures underlines their potential.

Generative HMMs are not well suited to producing acoustic confidence measures as some additional processing step must be used to convert the acoustic model likelihoods into quantities that are comparable across utterances.

The majority of studies described above are based upon two general methods for this additional processing: the formation of likelihood ratios and the training of post-classifiers.

Some more specific points are:

The use of explicit alternate models seems limited to relatively constrained tasks as they are required to model all extraneous acoustics—a potentially ill-posed problem.

Explicit alternate models do, however, facilitate relatively computationally inexpensive discriminative training.

Alternate models based upon competing decodings seem better suited to larger vocabulary, less constrained tasks, but suffer from problems of computational expense.

A looped-phone alternate model (i.e. one which facilitates a decoding subject only to phone-level constraints) is useful for spotting and putatively transcribing (at the subword-level) OOV words.

Whilst post-classifiers allow many confidence indicators to be brought to bear, they have the undesirable side-effect of conglomerating the information sources and hence obscuring the causes of low confidence. Their use is also often accompanied by the definition of a confidence measure as the posterior probability of word correctness, given the values of some set of confidence indicators, which detrimentally narrows the focus for potential confidence measure applications.

The most useful confidence predictor was found to be information drawn from $N$-best decoding lists or lattices.


From a diagnostic perspective, objective measures of model match which have simple, explicit links to the recognition models are desirable.


Chapter 5

Confidence Measures derived from an Acceptor HMM

Unlike generative HMMs, acceptor HMMs are able to directly estimate the posterior probability $P(M \mid X)$ of a word- or phone-sequence model $M$ given the acoustics $X$. Since these posterior probabilities are implicitly scaled by the probability $p(X)$ of the acoustics conditioned upon the overall acoustic model, they are comparable across utterances and hence constitute useful measures of confidence. Acceptor HMMs are thus well suited to producing objective measures of model match with simple and explicit links to the recognition models. An additional benefit which arises from the estimation of local posterior probabilities, such as those produced by the ABBOT/SPRACH system described in section 2.6, is the availability of posterior probability estimates at the frame-, phone-, word- and utterance-levels. Accessing measures of confidence over these levels is useful for tasks such as OOV word spotting and for recogniser diagnostics.

This chapter describes a set of confidence measures derived from the ABBOT/SPRACH system which were used for experiments described in chapters 6, 7 and 8. Three of the word-level measures described below, namely 'online garbage' [20, 22], 'lattice density' [85] (where it was termed the 'active word count') and the 'A-stabil/LM-jitter' measure [181, 103, 34], are borrowed from the literature and were applied as faithfully as possible to the ABBOT/SPRACH system. These measures are also extended to the phone-level in this work. Another obvious confidence measure that has frequently been used, e.g. [34, 187, 73, 205], is the probability of a decoding hypothesis, as given by a standard $n$-gram language model. The remaining measures are new and exploit some of the desirable properties of the local posterior probability based acceptor HMM. In sections 5.1 and 5.2, the measures are described for a phone hypothesis $q$ with a duration $D = t_e - t_s + 1$, where $t_s$ and $t_e$ are the start and end frames of the decoding hypothesis respectively. It is assumed that the local posterior probabilities of the form $P(q \mid x_t)$ are estimated using an RNN. Methods for extending the confidence measures to the word-level are described in section 5.3. Duration normalised versions of the confidence measures are given by default, although it is straightforward to see how these measures would be defined in an un-normalised fashion. A consequence of using limited context is that the probability of a decoding hypothesis is always underestimated. Duration normalisation counteracts the bias towards shorter decoding hypotheses which is created by this underestimate.

5.1 Acoustic Measures

Three of the four acoustic confidence measures described below exploit the local phone class posterior probability estimates provided by the acceptor HMM used in the ABBOT/SPRACH system. Although the (scaled) likelihoods used to implement the online garbage measure are somewhat artificially derived from these local posterior probability estimates, the measure nevertheless serves as a useful control since it facilitates approximations to a posterior probability with varying degrees of accuracy. To obtain an acoustic confidence measure from a generative HMM, the estimated class likelihoods would have to be normalised through the formulation of some variety of likelihood ratio.


Posterior Probability The measure denoted here $C_{\text{post}}$ is computed by re-scoring the Viterbi state sequence, output following the search over $X$ for $\hat{M}$, using the local posterior probability estimates obtained directly from the acoustic model, and duration normalising:

$$C_{\text{post}}(q) = \frac{1}{D} \sum_{t=t_s}^{t_e} \log P(q \mid x_t) \qquad (5.1)$$

Scaled Likelihood The scaled likelihood of $q$ is obtained by dividing the local posterior probability estimate by the class prior, calculated from the labelling of the acoustic training data:

$$\frac{p(x_t \mid q)}{p(x_t)} = \frac{P(q \mid x_t)}{P(q)} \qquad (5.2)$$

$C_{\text{sl}}$ is obtained by re-scoring the Viterbi state sequence using the scaled likelihoods for each state and duration normalising:

$$C_{\text{sl}}(q) = \frac{1}{D} \sum_{t=t_s}^{t_e} \log \frac{P(q \mid x_t)}{P(q)} \qquad (5.3)$$

Online Garbage For $C_{\text{og}}$, the scaled likelihood is normalised by the average of the $N$-best scaled likelihoods, calculated on a per-frame basis, and duration normalised [20, 22]:

$$C_{\text{og}}(q) = \frac{1}{D} \sum_{t=t_s}^{t_e} \log \frac{P(q \mid x_t)/P(q)}{\frac{1}{N} \sum_{j=1}^{N} P(q_{(j),t} \mid x_t)/P(q_{(j),t})} \qquad (5.4)$$

where $q_{(j),t}$ denotes the phone class with the $j$th largest scaled likelihood at frame $t$.

Per-Frame Entropy $C_{\text{ent}}$ is the per-frame entropy of the local phone posterior probabilities, averaged over the interval $t_s$ to $t_e$:

$$C_{\text{ent}} = \frac{1}{D} \sum_{t=t_s}^{t_e} \left[ -\sum_{i=1}^{K} P(q_i \mid x_t) \log P(q_i \mid x_t) \right] \qquad (5.5)$$

where $K$ is the number of phone classes.

Since the calculation of $C_{\text{ent}}$ does not depend upon the Viterbi state sequence, it may be calculated prior to the search over $X$ for $\hat{M}$, with corresponding savings in computational expense. The independence of $C_{\text{ent}}$ from the Viterbi state sequence highlights the fact that it constitutes a general measure of acoustic model match and is not dependent upon any one particular decoding hypothesis. A schematic representation of $C_{\text{ent}}$ is given in the left panel of figure 5.1.
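Given a matrix of local posterior estimates, the four acoustic measures can be computed as in the following sketch; the measure names, the toy posterior matrix and the choice of $N$ are illustrative assumptions:

```python
import numpy as np

def acoustic_confidences(post, priors, q, t_s, t_e, n_best=5):
    """Duration-normalised acoustic confidence measures for a phone hypothesis
    q aligned to frames t_s..t_e inclusive, after equations 5.1-5.5.

    post:   (T, K) matrix of local posterior estimates P(q|x_t) (all > 0)
    priors: (K,) phone class priors from the training data labelling
    """
    seg = post[t_s:t_e + 1]                        # aligned posterior frames
    d = t_e - t_s + 1                              # hypothesis duration D
    c_post = np.log(seg[:, q]).sum() / d           # eq 5.1: posterior probability
    sl = seg / priors                              # eq 5.2: scaled likelihoods
    c_sl = np.log(sl[:, q]).sum() / d              # eq 5.3: scaled likelihood
    garbage = np.sort(sl, axis=1)[:, -n_best:].mean(axis=1)
    c_og = np.log(sl[:, q] / garbage).sum() / d    # eq 5.4: online garbage
    c_ent = (-(seg * np.log(seg)).sum(axis=1)).sum() / d  # eq 5.5: entropy
    return c_post, c_sl, c_og, c_ent

# Hypothetical 10-frame, 4-class posterior matrix and uniform priors:
rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(4), size=10)
priors = np.full(4, 0.25)
print(acoustic_confidences(post, priors, q=2, t_s=3, t_e=7))
```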

5.2 Grammatical and Combined Measures

The one grammatical confidence measure described below is obtained exclusively from the languagemodel, whereas the three combined measures are obtained using both the acoustic and the languagemodels.

N-gram Probability $C_{\text{ng}}$ is computed by re-scoring the Viterbi phone sequence using the probability of $q$ conditioned upon its $n$-gram history $h$, as estimated by the language model, and durationally normalising:

$$C_{\text{ng}}(q) = \frac{1}{D} \log P(q \mid h) \qquad (5.6)$$

N-gram based Posterior Probability $C_{\text{ngp}}$ results from replacing the acoustic class prior inclusive in $C_{\text{post}}$ with the appropriate language model $n$-gram probability of $q$ conditioned upon the local decoding hypothesis history:

$$C_{\text{ngp}}(q) = \frac{1}{D} \sum_{t=t_s}^{t_e} \log \left[ \frac{P(q \mid x_t)}{P(q)} \, P(q \mid h) \right] \qquad (5.7)$$

$C_{\text{ngp}}$ is thus the result of replacing the class prior as estimated by the acoustic model with that obtained from the language model.

Lattice Density $C_{\text{ld}}$ is a measure of the density of competitors in a lattice of $N$-best decoding hypotheses and is computed by counting the number of unique decoding hypotheses which pass through a frame and averaging the counts over the interval $t_s$ to $t_e$:

$$C_{\text{ld}}(q) = \frac{1}{D} \sum_{t=t_s}^{t_e} n_t \qquad (5.8)$$

where $n_t$ is the number of competing decoding hypotheses which pass through the $t$th frame of the lattice. If $C_{\text{ld}}$ is computed from an $N$-best lattice of word hypotheses, it is equivalent to the active word count [85]. A schematic representation of $C_{\text{ld}}$ is given in the right panel of figure 5.1.

is equivalent to the active word count [85]. A schematic representation ofis given in the right panel of figure 5.1.

LM-Jitter $C_{\text{jit}}$ is the number of times that $q$ remains at the same point in the decoded sequence over the set of decodings obtained by re-scoring the $N$-best decoding lattice using different language model weighting factors. The count is normalised by the number of re-scorings to scale the values of the measure into the range [0,1]. It should be noted that this implementation does not vary any acoustic model weightings or insertion penalties, as described in [181] and [34].

[Figure: the left panel shows per-frame phone class posterior probability estimates over time, with low entropy frames (one strongly matched class) and high entropy frames (several weakly matched classes); the right panel shows an $N$-best word lattice over the utterance 'HIS NAME IS MAGIC HERE', with competitors such as MISS, GAME, MARRIAGE, HAZE and TRAGIC, and the competitor count taken at one frame.]

Figure 5.1: Schematic illustrations of $C_{\text{ent}}$ (Left) and $C_{\text{ld}}$ (Right).

5.3 Word-Level Measures

$C_{\text{post}}$ and $C_{\text{sl}}$ may be extended to a word hypothesis $w$ in two ways: (1) by simply averaging the frame-level values of the measures obtained from the optimal state sequence path over the duration of $w$; and (2) by summing the duration normalised phone-level values of the measures over the phone hypotheses constituent to $w$ and normalising by the number of such phones, $n_w$ [17]. The second scheme, expressed in equation 5.9 for $C_{\text{post}}$, is intuitively more appealing since the confidence of each phone is processed individually, and has been empirically found to provide slightly improved performance over the alternative. Scheme (2) was thus adopted for all experiments described in this dissertation.

$$C_{\text{post}}(w) = \frac{1}{n_w} \sum_{j=1}^{n_w} C_{\text{post}}(q_j) \qquad (5.9)$$

where $q_1, \ldots, q_{n_w}$ are the phone hypotheses constituent to $w$.

$C_{\text{og}}$ is calculated using equation 5.9 and by computing the garbage likelihood over the duration of $w$. $C_{\text{ent}}$ and $C_{\text{ld}}$ may be derived at the word level by simply matching the period over which they are calculated to the duration of $w$. $C_{\text{ng}}$ and $C_{\text{ngp}}$ make use of word-level language model statistics. $C_{\text{jit}}$ is computed by re-scoring $N$-best word lattices as opposed to phone lattices.
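The two combination schemes can be contrasted with the following sketch; the confidence values are illustrative:

```python
import numpy as np

def word_confidence_scheme1(frame_values):
    """Scheme (1): average frame-level values over the whole word duration,
    so long phones dominate the word-level score."""
    return float(np.mean(frame_values))

def word_confidence_scheme2(phone_confidences):
    """Scheme (2), equation 5.9: average the duration-normalised phone-level
    confidences over the n_w phones constituent to the word hypothesis,
    so each phone contributes equally."""
    return float(np.mean(phone_confidences))

# A hypothetical three-phone word hypothesis with per-phone C_post values:
print(word_confidence_scheme2([-0.21, -0.35, -0.18]))
```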


5.4 Comments

Insight into the behaviour of the $C_{\text{ent}}$ measure may be gained by considering the nature of acoustic model training in the ABBOT acceptor HMM system. As the ANN-based acoustic model is trained to discriminate between acoustic classes within some domain (phones in clean speech, for example) modelling resources will be focussed upon placing decision boundaries in the acoustic observation space between class distributions. However, acoustics from outside the modelled domain, such as non-speech sounds, are quite likely to have distributions which cross these boundaries. As a result, tokens of clean speech will typically fall well within the boundaries of some class and so the entropy of the posterior probability distribution for a frame will be low, due to one—and only one—strongly matched class. In contrast, tokens from outside the modelled domain may well fall close to within-domain class boundaries, resulting in high per-frame entropies due to more than one weakly matched class. This scenario is schematically illustrated in figure 5.2.

Figure 5.2: A schematic illustration of the distributions of within- and out-of-domain tokens, overlaid on within-domain inferred class boundaries.

The LM-Jitter confidence measure is by far the most expensive to compute, as an N-best decoding lattice must first be created and then re-scored multiple times using different language model weighting factors. Lattice density is also expensive to compute, as again an N-best decoding lattice must be created, followed by the calculation of the density statistics. The computational expense of both these measures may be strongly contrasted with the expense of the purely acoustic confidence measures, such as the per-frame entropy, which may be computed prior even to the search for the best decoding hypothesis. The lack of computational expense coupled with its general acoustic model match measuring properties make the entropy measure very attractive for an acoustic filtering application. The application of the measure to examples of such tasks is discussed further in chapter 8 and section 9.2.

The LM-Jitter measure may be interpreted as a measure of the 'acoustic stability' of a decoding hypothesis (hence the name 'A-stabil'). The measure essentially tests the robustness of a decoding hypothesis to varying language model weightings. If the decoding hypothesis has strong acoustic evidence, then it will remain in the best decoding sequence. Given a lack of acoustic evidence, a decoding hypothesis will be fragile in the face of language model change and hence will be susceptible to falling out of the best decoding sequence under varying language model weighting factors. The measure gets the 'best of both worlds' as it may be interpreted as, in some sense, an acoustic confidence measure, yet is calculated with the benefit of language model input. It draws upon the most information sources, but pays for its privilege in computational expense.

5.5 Applications

The results of investigations into the application of the confidence measures described in this chapter to the tasks of utterance verification, pronunciation model evaluation and the filtering of audio streams are described in chapters 6, 7 and 8, respectively. These tasks represent a broad cross-section of the range of potential confidence measure applications described in section 1.2.


The utterance verification task tests the ability of a confidence measure to discriminate between correct and incorrect recogniser outputs and is possibly the most popular application for a measure of confidence. The popularity of investigations in this area stems from the fact that utterance verification is a desirable feature for any robust spoken user interface. Pronunciation model evaluation is an example of the use of a confidence measure for diagnostic purposes. The experiments described in chapter 7 use a confidence measure 'off-line' (i.e. outside of the recognition cycle) in an attempt to identify crude pronunciation models and also to isolate which, from a set of alternatives, constitute an improvement. The experiments described in chapter 8 describe the use of a general measure of acoustic model match to filter out 'unrecognisable' portions of acoustics from a relatively unconstrained stream of audio, prior to the decoding stage. These applications highlight the fact that confidence measures can be usefully employed throughout the recognition process: before and after decoding, and also in off-line mode for diagnostics.

The Broadcast News (BN) corpus is common to the experiments described in all three chapters. This challenging corpus has a number of features which make it suitable for all three tasks:

The corpus represents a departure from relatively constrained recognition tasks, such as the decoding of topic-specific read speech spoken in a quiet office environment, to the recognition of speech found in an everyday environment; in the case of BN, from television and radio news programmes. This much more challenging data yields higher error rates and so provides a good test-bed for the utterance verification task.

Found speech is likely to contain words with a wide variety of pronunciations. Non-canonical pronunciation is far more common in spontaneous speech, such as that produced during a television or radio interview for example, than in read or planned speech. The existence of a wide range of pronunciation variation within the BN corpus makes it well suited to an investigation into pronunciation model assessment and learning.

The BN corpus also contains periods of non-speech, such as musical interludes, speech in the presence of background noise and music, and speech under degraded acoustic conditions. The corpus is therefore also well suited to the investigation of mechanisms for filtering unrecognisable portions from a stream of audio input to a recogniser.

In addition to the BN corpus, two further corpora were utilised in chapters 6 and 8, respectively. In chapter 6, the performance of the various confidence measures for the utterance verification task on the BN corpus was compared to their performance for a read speech corpus. In chapter 8, the filtering performance of the general measure of acoustic model match was also assessed using a database of speech and music samples.


Chapter 6

Utterance Verification

6.1 Introduction

In this chapter, the set of confidence measures described in chapter 5 are applied to the task of utterance verification. The performance of the measures is compared over two quite different data sets, described in section 6.2. The preliminary experiments, described in section 6.4, investigate the effect of duration normalisation upon the relevant measures as well as the assessments provided by the different evaluation metrics, described in chapter 3. The baseline experiments, described in section 6.5, reveal some interesting trends in the UERs of the different measures on the two data sets. These trends are further investigated over varying acoustic conditions and vocabulary sizes for the utterance verification task, in sections 6.6 and 6.7, and for the OOV spotting task in section 6.8.

6.2 Corpora

The question addressed in this chapter is: how useful are the confidence measures described in chapter 5 for the task of utterance verification? To answer this question, the ABBOT LVCSR system was trained and tested using two different corpora:

The North American Business News (NAB) corpus.

The Broadcast News (BN) corpus.

Both these corpora are available from the Linguistic Data Consortium (LDC; http://www.ldc.upenn.edu/) and, like their predecessors RM, TIMIT and ATIS, were collected to support the DARPA CSR evaluation programme. The constant evolution of these evaluations reflects the desire of funding agencies to see not only 'progress against agreed metrics', but also 'increasing task complexity or diversity over time' [68]. A review of the evaluation programme is provided by Young & Chase [217]. In this survey, it is noted that shared corpora, such as NAB and BN, constitute a valuable resource since they allow objective comparison between different ASR technologies, and that the focus of the corpora provides a strong 'technology pull' for timely research problems. The considerable effort required to participate in annual evaluations is also noted, however.

6.2.1 NAB Data

The Wall Street Journal (WSJ) database, which subsequently evolved to become the NAB corpus, is described in [141]. The design goals were to create a database capable of supporting LVCSR with:

Variable sized vocabularies.


Variable perplexities.

Speaker-dependent (SD) and -independent (SI) acoustic modelling.

The use of verbalised and non-verbalised punctuation.

Speech collected using multiple microphones in a moderate noise environment (a quiet office with background noise of approximately 50 dB).

A balanced set of speakers in terms of gender and dialect.

In the interests of cost effectiveness, the majority of the corpus was composed of newspaper articles read aloud, although some spontaneous utterances were also included. By using prompting text, a complete transcription of the utterance does not have to be made after the fact. An added benefit of using newspaper prompting text is that large quantities of machine readable text exist, drawn from newspaper back issues, and can be used to train well matched, domain specific language models.

Recogniser Training

The experiments described below were carried out with speech collected using the primary (C0) microphone, a 'close-talking', noise-cancelling Sennheiser HMD-410. Broadband (16 kHz) data is supplied for this channel. Two RNNs using PLP features were trained, one forward and the other backward in time, on 7240 utterances collected from 84 different speakers (the SI-84 dataset), constituting approximately 15 hours of acoustic data. The CI phone class probabilities from the networks were combined at the frame-level by averaging in the log domain. The embedded Viterbi training was bootstrapped using an RNN trained on the TIMIT hand-labelled corpus. A trigram language model for 59999 words was used, trained on approximately 237 million words obtained from financial text data sources such as the Wall Street Journal, Associated Press, Dow Jones Information Service and the San Jose Mercury (the CSR-LM-1 text corpus [118]).2
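The frame-level merging step can be sketched as follows; this is a hedged illustration of averaging in the log domain (equivalently, a renormalised geometric mean), not the ABBOT code itself, and it assumes strictly positive posterior vectors of equal length:

    import math

    def merge_log_domain(p_fwd, p_bwd):
        # Average two posterior vectors in the log domain and renormalise
        # so the merged estimates again sum to one.
        merged = [math.exp(0.5 * (math.log(f) + math.log(b)))
                  for f, b in zip(p_fwd, p_bwd)]
        z = sum(merged)
        return [m / z for m in merged]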

Evaluation Test Set

The Hub-3 1995 evaluation test set3 (Hub-3E-95), used in the experiments, is of unlimited vocabulary and was collected from 20 speakers, each reading 15 sentences (300 in total). The utterances constitute approximately 45 minutes of acoustic data. The prompting texts were derived, in approximately equal proportions, from August 1995 editions of the New York Times, the Reuters North American Business Report, the Los Angeles Times, Dow Jones Information Services and the Washington Post. The test set was decoded using a baseline pronunciation lexicon for approximately 60 k words, derived primarily from baseforms supplied by LIMSI. The OOV rate for the reference transcription using this lexicon was 0.58%.

6.2.2 BN Data

In recent years, the focus of the DARPA/NIST evaluations has shifted from purposely collected read speech to that found in an everyday environment. The goal underlying this shift in emphasis is to promote research in "adaptation to changing conditions and robustness with respect to degradation" [178]. Broadcast news shows provide a convenient source of data which fits this mandate. Table 6.1 illustrates a number of ways in which BN data represents a relaxation of the constraints found in the NAB corpus.

The BN corpus comprises shows broadcast by the North American television networks ABC, CNN and C-SPAN, and the NPR radio network.

2. I am indebted to Gary Cook for supplying the acoustic and language models used for the experiments described in this chapter.

3. The 'Hub and Spoke' paradigm was adopted for the CSR evaluations in November 1993 [117]. The hub refers to a dataset designed to represent a fundamentally important ASR problem. The mandatory nature of the Hub facilitates comparison across all evaluation participants. The spokes are optional tests designed to explore other timely problems. All Spokes can be informatively compared to the Hub but are otherwise independent.


Condition            NAB Data                      BN Data
Acoustic conditions  Quiet office                  Studio & telephone quality speech; background noise & music
Speaking style       Read speech, native dialects  'Planned' & spontaneous speech; native & non-native dialects
Subject matter       Financial news                Current affairs

Table 6.1: A comparison of some of the constraints of the NAB and BN corpora.

In order to measure and compare ASR performance over a broad range of acoustic conditions, the corpus is annotated with seven focus conditions, based upon speaking mode, dialect, fidelity of the acoustic signal, and the presence or absence of background noise:

Mode The speaking mode may be either planned or spontaneous. Spontaneous speech is characterised by disfluencies such as re-starts and hesitations. Planned speech is largely free of such phenomena.

Dialect The dialect of the speaker may be either native or non-native. Native refers to native speakers of American English. All other speakers are classed as non-native (including native speakers of British English).

Fidelity The fidelity of the recording environment and transmission channel may take one of three values: high for speech recorded in a broadcasting studio, low for speech transmitted over low bandwidth channels, such as a telephone line, and medium otherwise.

Background The background refers to unwanted background noise, which may be due to music, speech or other sources.

Focus Condition          Mode         Dialect     Fidelity  Background
F0: Baseline broadcast   Planned      native      high      clean
F1: Spontaneous speech   Spontaneous  native      high      clean
F2: Telephone channels   Any mode     native      med./low  clean
F3: Background music     Any mode     native      high      music
F4: Degraded acoustics   Any mode     native      high      speech/other
F5: Non-native speakers  Planned      non-native  high      clean
FX: All other speech     –            –           –         –

Table 6.2: Focus conditions for BN data.
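Table 6.2 amounts to a small decision procedure. The following hypothetical sketch assigns a focus label from segment attributes, returning the first matching condition and falling through to FX; it deliberately ignores the corpus convention, noted below, that data satisfying several conditions at once is labelled FX, and the attribute strings are assumptions:

    def focus_condition(mode, dialect, fidelity, background):
        # Assign a BN focus label from lower-case attribute strings
        # (cf. table 6.2); first match wins, FX is the fall-through.
        key = (mode, dialect, fidelity, background)
        if key == ("planned", "native", "high", "clean"):
            return "F0"
        if key == ("spontaneous", "native", "high", "clean"):
            return "F1"
        if dialect == "native" and fidelity in ("medium", "low") and background == "clean":
            return "F2"
        if dialect == "native" and fidelity == "high" and background == "music":
            return "F3"
        if dialect == "native" and fidelity == "high" and background in ("speech", "other"):
            return "F4"
        if key == ("planned", "non-native", "high", "clean"):
            return "F5"
        return "FX"  # all other speech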

The attributes of the seven focus conditions are summarised in table 6.2. The catch-all FX condition can contain data which simultaneously satisfies the criteria for more than one of the other focus conditions. In addition to these focus conditions, intervals of BN test set data can be marked as examples of two further categories:

Inter-segment gaps This category covers portions of BN data such as sports reports, commercials and periods of non-speech sounds, such as musical interludes. Although inter-segment gaps are not transcribed, they are usually assigned the FX acoustic condition label.

Excluded regions These are regions which are excluded for the purposes of calculating WER statistics for evaluation test data. Excluded regions are often of a much shorter duration than inter-segment gaps and typically contain silence or 'double-talk' (two simultaneous speech signals of a comparable level). They are not supplied with an acoustic condition label.


Recogniser Training

Similarly to the NAB acoustic model, two 384 state unit RNNs were trained using PLP features, one forward, the other backward in time, on the BN training set released in October 1996, constituting approximately 50 hours of data (less untranscribed adverts, sports reports etc.). The CI phone class probabilities derived from these networks were merged in the log domain on a per-frame basis with those from a 4 k hidden unit (HU) MLP trained on the same data using MSG features. A trigram language model for 65529 words was used, trained on approximately 284 million words of BN and newswire text.

Evaluation Test Set

The Hub-4 1997 evaluation test set (Hub-4E-97) constitutes approximately 3 hours of data (minus inter-segment gaps and excluded regions) sampled in a 'channel-hopping' fashion from a pool of 10 hours of news stories broadcast in October and November of 1996. Given the approximately 65 k word lexicon used for decoding, the OOV rate relative to the reference transcription was 0.51%.

6.3 Marking Scheme

For the purposes of utterance verification, decodings were marked against a forced Viterbi alignment of the reference transcription using a scheme similar to that described in [205]. In addition to insertions and substitutions, decoding hypotheses were marked as incorrect if they were poorly time aligned. Time-alignment was deemed of crucial importance to the assessment of the confidence measures, as their values are highly dependent upon the quality of acoustic model match: incorrect start and end times for an otherwise correctly decoded hypothesis will adversely affect its confidence value. According to the marking scheme, a word hypothesis h may only be marked as correct if an identical word r exists in the time-aligned reference transcription such that greater than 50% of the interval spanned by h overlaps with that of r, and vice versa. This 50% overlap criterion is illustrated in figure 6.1.

Figure 6.1: A schematic illustration of the 50% overlap criterion used to assess the time alignment of a decoding hypothesis, after [205].
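A sketch of the 50% overlap test, assuming each hypothesis and reference word is represented by its label and its (start, end) frame times (the representation is illustrative):

    def correct_by_overlap(hyp, ref):
        # A hypothesis is correct only if the words match and each interval
        # covers more than 50% of the other (the criterion of figure 6.1).
        if hyp["word"] != ref["word"]:
            return False
        overlap = min(hyp["end"], ref["end"]) - max(hyp["start"], ref["start"])
        if overlap <= 0:
            return False
        return (overlap > 0.5 * (hyp["end"] - hyp["start"]) and
                overlap > 0.5 * (ref["end"] - ref["start"]))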

WER statistics for the decodings were calculated using NIST's SCLITE package (available from http://www.nist.gov/speech/software.htm) in dynamic programming string alignment mode.

6.4 Preliminary Experiments

Preliminary experiments were carried out to address two questions:

What is the effect of duration normalisation upon the confidence measures?

How do the assessments provided by the evaluation metrics (described in chapter 3) compare?

For the preliminary experiments, the Hub-3E-95 dataset was decoded under two conditions:


The word-level decoding constraints of a 60 k word pronunciation lexicon and trigram language model were used to recognise the data. Word- and phone-level decodings were available from the recogniser, together with an N-best decoding lattice (N was set equal to 127) at the word-level. The WER of the decoding in this case was 15.4%.

The data was also decoded using only the phone-level constraints of a phone bigram language model and no pronunciation lexicon. This configuration is equivalent to a 'looped phone' model with probabilities assigned to the transition from one phone to the next. A phone-level decoding together with an N-best decoding lattice (N was set equal to 127 as before) were available for this condition. A phone error rate of 28.5% was computed for the phone-constraint decoding from a comparison with the phone sequence obtained from the forced Viterbi alignment of the reference word transcription. Since this alignment relies upon (an augmented version of) the imperfect pronunciation lexicon used for recognition, the derived reference phone transcription will contain errors. Any associated issues must thus be borne in mind when considering this phone-level error rate.

6.4.1 Duration Normalisation

A comparison of the performance of the confidence measures was made in their durationally normalised and un-normalised forms. Graphs of the sum of the probability of type I and type II errors against the percentage of rejected hypotheses are given in figure 6.2. (Two of the measures are inherently durationally normalised, and duration normalisation is not applicable to a third; the results for a subset of the remaining confidence measures were plotted for clarity.) The results in the figure indicate that duration normalisation is beneficial for all measures under both sets of decoding constraints. On the basis of these results, durationally normalised versions of the applicable confidence measures were adopted for all subsequent experiments.

6.4.2 Metric Comparison

A comparison was also made between the assessments provided by the different evaluation metrics. In broad terms, the results presented in figure 6.3 indicate that whilst all the metrics agree in their ranking of the confidence measures (again, results for only a subset were plotted for clarity), the task-independent metrics provide clearer distinctions between the different measures than the task-specific UER metric on this particular data set.

A closer inspection of the plot of UER against percentage hypothesis rejection (top left panel) reveals that the lowest UER is obtained when no recogniser outputs are rejected (reflecting the low WER of the recogniser on Hub-3E-95) and that, for all measures, any increase in the percentage of rejected hypotheses leads to an increase in UER. Thus none of the measures facilitate any profitability in rejection for this particular dataset. This topic is investigated further in section 6.5. It can be seen that the plots of efficiency and mutual information against percentage hypothesis rejection (upper and lower right panels) are essentially identical, bar a scaling of the y-axis. This similarity may be predicted from equation 3.9, as the efficiency is equal to the mutual information divided by the entropy of the probability distribution over the different actions resulting from the test. The shapes of both these plots are also very similar to that for the sum of the probability of type I and type II errors. The curves on the DET plot do not fall sufficiently far into the lower left quadrant for the benefits of the axis-warping to take effect.

6.5 Baseline Experiments

Given the above preliminaries, the question addressed in the baseline experiments is: how useful are the various measures for the task of utterance verification, given a recogniser operating in a particular domain? Both the Hub-3E-95 and Hub-4E-97 datasets were used in these experiments. The same word- and phone-level decoding conditions used in the preliminary experiments were adopted for the Hub-3E-95 dataset, and similar constraints were used for the Hub-4E-97 dataset:



Figure 6.2: A comparison of the performance of a subset of the confidence measures in durationally normalised and un-normalised form. The sum of the probability of type I and type II errors is plotted against percentage hypothesis rejection for word-level hypotheses (Top Left), phone-level hypotheses decoded using word-level constraints (Top Right) and phone-level hypotheses decoded using phone-level constraints (Bottom).



Figure 6.3: A comparison of the assessments provided by the evaluation metrics for word-level decoding hypotheses obtained from Hub-3E-95. Top Left: UER vs. percentage hypothesis rejection. Top Right: Sum of the probabilities of type I and type II errors vs. percentage hypothesis rejection. Upper Left: An ROC curve. Upper Right: Efficiency vs. percentage hypothesis rejection. Lower Left: A DET curve. Lower Right: Mutual Information vs. percentage hypothesis rejection. Bottom: Table of scalar valued evaluation metrics.


For the word-level decoding constraint, a 65 k word vocabulary was used, yielding a 25.4% WER (inter-segment gaps and excluded regions were removed prior to decoding; an N-best of 127 was specified for lattice creation). The same phone bigram and N-best specifications used to decode the NAB data were used for the phone-level constraint decoding, yielding a phone error rate of 36.5% (cf. the related figure given in section 6.4 and the associated issues). LM-Jitter was computed by (dynamic string) aligning the original best decoding hypothesis sequence against the best decoding hypotheses obtained by re-scoring the N-best decoding lattice using different language model weighting factors, spanning a fixed range with a fixed step size, for both data sets and decoding conditions.


Figure 6.4: A comparison of confidence measure UERs for word-level hypotheses (Top) and phone-constraint decodings (Bottom), obtained from Hub-3E-95 (Left) and Hub-4E-97 (Right).

The results are presented in figure 6.4 (a subset of the measures covering the range of the performance is again selected for clarity). The task-specific nature of the UER evaluation metric is appropriate here as it highlights some interesting trends:

LM-Jitter provides the best utterance verification performance, i.e. the lowest UER, on both corpora for decodings made with word-level constraints. The measure may be expected to provide the best performance since it draws upon more information sources than any other: N-best decoding lattices created using information from the acoustic and language models and re-scored using a range of language model weighting factors. The large amount of post-processing required to compute the measure effectively obscures the cause of the low confidence, however. As a consequence, more informative trends can be discerned for the other confidence measures, which have simpler and more explicit links to the recognition models.

With the exception of LM-Jitter, profitable rejection is only available at the word-level for BN data and not for NAB data. (The y-axis intercept on a UER graph is indicative of the WER of the recogniser, as this point corresponds to the rejection of none of the recogniser outputs. The higher intercept for BN data indicates that decoding Hub-4E-97 is a harder recognition task than decoding Hub-3E-95.)

In broad terms, profitable rejection is provided by a greater number of the confidence measures at the phone-level than at the word-level. The difference between the y-axis intercept and the point of UER minima for the best performing measure is also greater at the phone-level than at the word-level.

Excepting LM-Jitter, the combined measure performs better than all the other measures at the word-level on NAB data, but is outdone by the purely acoustic confidence measure on BN data. The failure of the lattice-based measures to facilitate profitable rejection at the phone-level may be attributed to the relatively poor quality language model used to create and rescore the N-best decoding lattices at this level. The language model used in this case was an unsmoothed phone bigram estimated from the approximately 3 million phone tokens obtained from a forced Viterbi alignment of the reference transcription to the 1996 and 1997 BN acoustic training data sets.

As may be expected, the general measure of acoustic model match performs badly on both NAB and BN data at the word-level. Since the measure is not specific to any given decoding hypothesis, it may a priori be considered not well suited to the task of utterance verification. It does facilitate profitable rejection for both data sets at the phone-level, however. It may be inferred from this result that some reasonable fraction of incorrect decoding hypotheses at the phone-level are caused by noisy (or generally out-of-domain) acoustics, which is intuitive, as other potential sources of recognition error, such as OOV words or mismatching pronunciation models, do not exist at this level.

At the word-level, the purely grammatical confidence measure performs as well as many of the other confidence measures on NAB data, but provides essentially no information regarding the truth or falsity of a decoding hypothesis for BN data. The measure is also uninformative at the phone-level. Whilst the relatively poor quality of the language model may be used to simply explain these observations at the phone-level, an implication of the word-level results is that a standard trigram begins to fail for the shift from NAB to BN data.

The purely acoustic confidence measure and the combined measure have very similar performance for all conditions. This suggests that the majority of the predictive power of the combined measure stems from its acoustic model probability component.

Curves describing the UER of the acoustic measure over different rates of hypothesis rejection were omitted from figure 6.4 in an attempt to prevent the overcrowding of the plots. It was found, however, that the best performance of the measure was obtained by summing over a larger n for BN data than for NAB data. This implies that the posterior probability distributions obtained for BN data are less 'sharp' than those obtained for NAB data, reflecting the increased difficulty of recognising BN data.

The observation of these trends raised a number of questions, prompting a series of further investigations. One such question is: why does the combined measure not perform as well, at the word-level, on BN data as it does on NAB data? One potential explanation is that the proposed reduction in the quality of language model fit on BN data is responsible for the reduction in predictive power for the combined measure. Another explanation, however, is that the N-best decoding lattices were simply saturated for the BN data. Saturation occurs if the number of N-best lattice links ending at a node equals the maximum number of hypotheses specified in the decoding parameters. An analysis of the percentage of saturated nodes in the word lattices, shown in table 6.3, revealed that whilst the BN data lattices are 'bushier', they are far from saturated. This provides support for the hypothesis that trigram language models are capable of modelling the 'tight' grammar of read NAB data but are not so useful for modelling less constrained BN data. This topic is discussed further in section 9.6.


Data Type  Decoding Constraints  Node saturation (%)
NAB        word-level            0.0
NAB        phone-level           0.0
BN         word-level            0.6
BN         phone-level           0.0

Table 6.3: The percentage of saturated nodes in the N-best decoding lattices (N = 127 for all lattices).
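The saturation analysis itself reduces to a simple count; a minimal sketch, assuming each lattice node is represented by the number of N-best links ending at it (the representation is illustrative):

    def node_saturation(link_counts, n_max=127):
        # Percentage of lattice nodes at which the number of N-best links
        # ending at the node equals the decoder's specified maximum.
        saturated = sum(1 for c in link_counts if c == n_max)
        return 100.0 * saturated / len(link_counts)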

To explain the general trends in profitability of rejection over the two data sets and decoding conditions, consider the range of values which a confidence measure may take. Non-speech sounds cause gross model mismatches and so lead to large confidence reductions, whereas sources of error in clean speech, such as the occurrence of OOV words, will cause much more subtle model mismatches.

The effect of an OOV word upon the confidence of an associated (incorrect) decoding hypothesis is illustrated in figure 6.5. The solid lines in the figure plot the values of a subset of the outputs of the acoustic model evolving over the duration of an instance of the word bedouin (drawn from Hub-3E-95). The dashed lines overlaid in the left panel of the figure give the timings for the alignment of the word sequence model for better one (hypothesised by the recogniser in this instance, bedouin being OOV). The dashed lines in the right panel give the timings for the alignment of the correct model to the acoustics. The per-phone confidence values are appended to each phone alignment. The models differ by only a single phone ([bcl b eh dx axr w ah n]/[bcl b eh dx axr w ih n]) and although the confidence score for the correct phone ([ih] = -0.41) is much higher than the value for the incorrect phone ([ah] = -5.67), the majority of the phones are common and so the overall confidence scores are relatively similar (better one = -1.35, bedouin = -0.68).


Figure 6.5: A subset of the outputs of the acoustic model evolving over the duration of an instance of the word bedouin (solid lines). Overlaid (dashed lines) are timings for the alignments of the models for better one [bcl b eh dx axr w ah n] (Left) and bedouin [bcl b eh dx axr w ih n] (Right). Per-phone confidence values are appended to each phone alignment.

Given the above scenario, it might be expected that it would be a simple matter to reject an incorrect decoding hypothesis, due to the occurrence of an OOV word say, on the basis of an empirically determined confidence threshold. Unfortunately, however, the pattern of confidence reduction is complicated by the presence of modelling imperfections, a major source of which are crude pronunciation models. In the case where a crude pronunciation does not cause a recognition error, the confidence attributed to the associated decoding hypothesis will still be reduced. This effect is demonstrated in figure 6.6. The solid lines in the figure plot a subset of the acoustic model outputs evolving over the duration of an instance of the word funds (drawn from Hub-3E-95). Overlaid in dashed lines in the left panel are the timings of the alignment of the baseform [f ah n dcl d z] drawn from the recognition lexicon. The timings for the alignment of an alternative baseform [f ah n z] are plotted in the right panel. The per-phone confidence values are appended to each phone alignment. It can be seen from the figure that the acoustic model suggests the absence of the phones [dcl] and [d] in this particular realisation of the word.


Baseform          Average confidence value
[f ah n dcl d z]  -2.1433
[f ah n z]        -0.7439

Table 6.4: Average confidence values calculated for two baseforms from their alignment to 62 occurrences of the word funds.

The alignment of these two phone models in the left panel receives an appropriately low confidence. Overall, [f ah n z] receives a higher confidence value (-0.16) than [f ah n dcl d z] (-1.53). It can be seen, however, that the reduction in confidence due to a crude pronunciation model is of the same order as that seen for the utterance of an OOV word.


Figure 6.6: A subset of the outputs of the acoustic model evolving over the duration of an instance of the word funds (solid lines). Overlaid (dashed lines) are timings for the alignments of the baseforms [f ah n dcl d z] (Left) and [f ah n z] (Right). Per-phone confidence values are appended to each phone alignment.

Table 6.4 gives the average confidence values for the alignment of the two baseforms, [f ah n dcl d z] and [f ah n z], to over 62 realisations of the word funds (drawn from an approximately 63 hour acoustic data set, described in section 7.6). The table shows that the anecdotal acoustic model match improvement offered by [f ah n z], over [f ah n dcl d z], is maintained for a substantially sized data set. It may be inferred from this result that, on average, the realisation of funds is closer to [f ah n z] than it is to [f ah n dcl d z]. An obvious refinement of this analysis would be to investigate the contexts in which funds is realised as something closer to [f ah n dcl d z] than to [f ah n z], and vice versa. The topic of context dependent pronunciation modelling is discussed in more detail in chapter 7.

Crude pronunciation models thus mask similarly sized confidence reductions due to other sources. As the causes of error in clean, read speech, such as OOV words for example, elicit relatively subtle reductions in confidence, and similarly sized confidence reductions are seen for correctly decoded words with crude pronunciation models, setting a confidence threshold which facilitates profitable rejection on NAB data is difficult. In contrast, BN data is much more amenable to profitable rejection, as it contains non-speech sounds which cause gross reductions in confidence, beyond the range masked by crude pronunciation models. The concept of a spectrum of confidence, together with a region of subtle reductions in confidence that is masked by the effect of crude pronunciation models, is illustrated in figure 6.7.

The confidence measures perform better at the phone-level as the model mismatches are more distinct; there are no correlates to recognition errors due to OOV words or to crude pronunciation models at this level. (Improved confidence measure performance at the phone-level is also reported in [34].)


Figure 6.7: A schematic illustration of a spectrum of confidence values. Relatively subtle reductions in confidence, due to OOV words for example, may be masked by similarly sized confidence reductions caused by crude pronunciation models. Gross confidence reductions, caused by non-speech sounds, lie beyond the masked range.

Despite their absence during decoding, crude pronunciation models may still compromise the evaluation of confidence measures at the phone-level, however: as the reference phone transcript is obtained from a forced Viterbi alignment of the reference word models to the acoustics, the presence of crude pronunciation models will introduce noise into the marking of the phones; some high confidence phones may be erroneously marked as incorrect, and vice versa.

In order to explore this hypothesis further, experiments were conducted using differing acoustic conditions and varying lexicon size. The hypothesis developed thus far predicts that (1) utterance verification performance should be better for noisy rather than clean acoustics, due to the gross model mismatches caused by non-speech sounds; and (2) as fewer words with similar pronunciations are available to a recogniser with a reduced vocabulary, model mismatch for incorrect decoding hypotheses will on average be higher, and so utterance verification performance should be improved.

6.6 Differing Acoustic Conditions

The effect of differing acoustic conditions upon confidence measure performance was investigated using the BN data. The results shown in figure 6.8 demonstrate the increased profitability of rejection for noisy acoustics over clean speech. UERs for planned, studio quality speech (F0 condition) are plotted in the left panel. UERs for noisy acoustics (FX condition) are plotted in the right panel. As may be expected, the plot for the F0 condition closely resembles that for clean, read NAB data.


Figure 6.8: UERs of the confidence measures for word-level decoding hypotheses obtained from Hub-4E-97 for planned, studio quality speech (Left) and for noisy acoustics (Right).


6.7 Lexicon Size

The effect of lexicon size was investigated using the NAB data. Figure 6.9 shows that greater profitability of rejection is possible for decoding hypotheses made using a 5 k word lexicon (OOV rate = 8.59%, WER = 30.5%), shown in the right panel, than for those made using a 60 k word lexicon (OOV rate = 0.58%, WER = 15.5%), shown in the left panel. An explanation for this is that as the number of words in the vocabulary is reduced, fewer models for similar sounding words become available, on average, during decoding, and so any model mismatches become more distinct and hence easier to identify. The plots clearly indicate that increased utterance verification performance is possible for smaller vocabularies.


Figure 6.9: UERs of the confidence measures for word-level decoding hypotheses obtained from Hub-3E-95 using a 60 k word lexicon (Left) and a 5 k word lexicon (Right).

The combined measure is again 'usurped' by the purely acoustic measure for the 5 k word vocabulary decoding condition, as it was for BN data. This may be explained by a flattening of the N-best word lattices, as illustrated in figure 6.10, under the decoding constraint of a 5 k word lexicon, leading to a reduction in the dynamic range of the measure and hence in its predictive power.


Figure 6.10: The lattice flattening effect: the measure is plotted for each word-level hypothesis obtained from Hub-3E-95 using a 60 k word lexicon (Left) and a 5 k word lexicon (Right).

6.8 OOV Spotting

Given the hypothesis regarding the masked region of confidence, and that OOV words are one cause of subtle confidence reductions, it is predicted that OOV word spotting performance will be poor for both data sets. This prediction is confirmed by the results presented in figure 6.11. The figure plots UERs for a subset of the confidence measures for the task applied to Hub-3E-95 (left panel) and Hub-4E-97 (right panel).


The OOV rates for these data sets using 60 and 65 k word vocabularies respectively are too low to generate any meaningful statistics, and so the decodings used as the source for the graphs in figure 6.11 were created using a 5 k word vocabulary for both data sets. Even at these elevated OOV rates, profitable rejection is not possible for the task (cf. the right panel in figure 6.9).


Figure 6.11: UERs for the OOV word spotting task. The results are obtained from decodings of Hub-3E-95 (Left) and Hub-4E-97 (Right) using a 5 k word vocabulary in both cases. (OOV rates are 14.5% and 12.6% respectively.)

6.9 Summary

The performance of the confidence measures described in chapter 5 was investigated for the task of utterance verification. LM-Jitter was found to facilitate the lowest UER for the task at the word-level. This may be expected, as it draws upon the most sources of information. The large amount of post-processing required to compute it (LM-Jitter is by far the most computationally expensive measure investigated) unfortunately obscures the cause of the low confidence, however. As a consequence, more informative trends can be observed for the measures which have simpler, more explicit links to the recognition models. These observations led to the proposal of the following hypotheses:

Crude pronunciation models mask the relatively subtle reductions in confidence caused by OOV words, for example. Profitable rejection on clean speech is therefore difficult. Non-speech sounds, on the other hand, give rise to gross model mismatches and cause confidence reductions which fall beyond the masked range. Hence, profitable rejection on noisy acoustics is easier.

A standard trigram language model shows signs of a lack of modelling power for the shift from the relatively constrained grammar of read newspaper text to the overall more spontaneous speech found in broadcast news shows.

These hypotheses were supported by further investigations over varying acoustic conditions and lexicon sizes, and into the OOV spotting task.


Chapter 7

Confidence-Based Pronunciation Modelling

7.1 Introduction

Since it is proposed that crude pronunciation models can compromise confidence measure performance (for the utterance verification task), the question raised in this chapter is whether improved pronunciation models lead to improved confidence measures. This question is not straightforward to answer, as obtaining improved pronunciation models has proven to be a very difficult task. The chapter is split broadly into two halves: sections 7.2, 7.3, 7.4 and 7.5 provide a review of the pronunciation learning literature, including motivations, automatic baseform learning, transformation modelling and the potential for acoustic model retraining. This review material is required to place the experiments, described in sections 7.6 and 7.7, into context. The philosophy underlying the experiments is that the supposed sensitivity of acoustic confidence measures to crude pronunciation models can be 'turned on its head', and that the same acoustic confidence measures can be employed for pronunciation model evaluation. Two different sources of alternative pronunciation models are described in sections 7.6 and 7.7.

7.2 Accommodating Variation

A major challenge faced by current ASR technology is the accurate modelling of pronunciation variation. Factors which can cause the realisation of a word to differ from one instance to another include co-articulation, dialect, speaking rate and speaking style [62].

Whilst one might imagine that the first three of these factors are potentially explicable by some reasonably simple, systematic rules, the fourth is particularly pertinent as it hints at the complexity of the task at hand: the frequency of information bearing content words relative to the unladen function words which form the 'glue' binding an utterance together is much lower in spontaneous speech than for read speech. Adda-Decker & Lamel [3], for example, report that the 100 most frequent words account for 80% of approximately 35 hours of the MASK (multimodal-multimedia automated service kiosk) spontaneous speech corpus, compared to 50% for approximately 120 hours of the BREF (French) read speech corpus and approximately 21 hours of the WSJ0 (American English) read speech corpus. As function words do not contribute as much to a speaker's message and are predictable from grammar conventions, they tend to be more susceptible to co-articulation and vowel reduction, and in general are articulated in a less precise manner. Greenberg reports [76, 77], for example, that for an approximately 4 hour, hand-transcribed portion of the SWB corpus [78], the word and was observed with 80 pronunciation variants and that an average of 60 variants were observed for the 10 most frequent words.1


Fosler et al. [61] report that, in comparison to citation form2 pronunciations, the same hand-transcribed portion of the SWB corpus reveals that 12% of the citation form phones are deleted and that only 67% match those in the hand transcription. Thus, in order to predict some forms of pronunciation variation, it is important to be able to identify which words are central to a speaker's message and which are merely peripheral grammatical conveniences: a challenging problem.

The 'state-of-the-art' in pronunciation modelling is still rather primitive, and the reality is that the process of isolating the effects caused by any of the factors of co-articulation, dialect, speaking rate and speaking style is in its infancy. Progress in this area is crucial, however, if a bottleneck in ASR system development is to be avoided: in addition to directly compromising the decoding process, crude pronunciation models indirectly affect the quality of the acoustic model, through the production of training targets via a forced alignment of the reference transcription using a pronunciation lexicon. Crude pronunciation models also affect confidence measure performance, as described in chapter 6. The approaches to the task that have been reported in the literature to date may be summarised using two categories: baseform learning and transformation modelling.

7.3 Baseform Learning

The approach to pronunciation modelling adopted for the vast majority of ASR systems is to enumerate the pronunciations for all words in the system's vocabulary using a pronunciation lexicon. Each entry in the lexicon takes the form of a mapping between the orthography for a word and a string of subword units, termed a baseform, which represents the word's canonical pronunciation.

One strategy is to model each word in a system's vocabulary using a single baseform. A consequence of this is that all sources of variation must be absorbed by the acoustic model, which in turn has the undesirable effect of increasing the variance of the component class distributions. Given that some words, such as either, admit multiple canonical pronunciations, [iy dh axr] or [ay dh axr], it has become increasingly popular to model some fraction of the vocabulary using multiple baseforms [120]. This trend immediately suggests that perhaps all sources of variation could be accommodated through the use of multiple baseforms. If this approach were adopted, the research questions would be (1) how should a set of candidate baseforms be derived for a word? and (2) how many baseforms should be assigned to a particular word?

Although this approach is simple and straightforward, it suffers from a significant problem: since all baseforms for a word are available regardless of the contextual factors listed in section 7.2, a side-effect of increasing their number is an increase in the homophone rate, which in turn heightens the inherent confusability between similar sounding words, with obvious detrimental effects upon recognition performance. For example, the word to is often realised as [ax] when following, say, going, resulting in the realisation of going to as something closer to "gonna". Given that this realisation of to is homophonous with the typical realisation of the indefinite article a, the inclusion of to [ax] in the lexicon would create a good deal of acoustic confusion. Although some homophones may be easily distinguished through linguistic context, such as bare and bear for example, this is not the case for a and to, which are both equally plausible extensions of the word sequence I want, for example.

The 'double-edged sword' of maximising the coverage of pronunciation variants whilst minimising the detrimental effect of confusability was well illustrated in Saraclar's presentation of the work described in [180]. In this study, N-best decodings from the SWB corpus were re-scored using (1) a lexicon augmented on a per-utterance basis with the 'correct' pronunciations of the words in that utterance; and (2) a lexicon augmented with 'correct' pronunciations for all words drawn from all utterances making up the test set. The WER was reduced from 46% to 26% in (1), but suffered a relative increase to 38% in (2).

Given that there is scope for improvement in a pronunciation lexicon, as was indeed achieved by Saraclar, an ideal baseform learning process has the following steps:

1. This list is composed of I, and, the, you, that, a, to, know, of and it.
2. Citation form refers to the realisation of a word when spoken in isolation, as opposed to its pronunciation when sequenced with other words.


Generation of potential baseforms.

Evaluation of existing and potential baseforms.

Baseform selection.

7.3.1 Potential Baseform Generation

Approaches to deriving potential baseforms can be split into knowledge-based and data-driven categories.

An extreme example of the knowledge-based approach is to employ a linguist to compile baseforms by hand, e.g. [184]. A dictionary enumerating common baseforms for a wide vocabulary, derived solely from linguistic observations, is described in [166]. A more common semi-automatic, knowledge-based strategy is to expand an existing lexicon of canonical pronunciations through the application of a set of context dependent phonological rules [196, 57, 137, 208], such as the flapping rule of American English [196], for example:

[tcl dcl] [t d] → [dx] / 'V _ [ax ix axr]

This rule states that the stops and closures on the left side of the transformation may be replaced with the alveolar flap [dx] in the context of a preceding stressed vowel and a following unstressed vowel from the set [ax ix axr]. The word better, for example, is typically pronounced as [bcl b eh tcl t ax] in British English, but as [bcl b eh dx axr] in American English. Such transformation rules may be applied to canonical baseforms to derive new, potentially alternative baseforms; a sketch of such an application is given below. As phonological rules may be applied repeatedly, however, an empirical question is at what depth of application should a cutoff be placed? A rule based approach to obtaining baseforms for words not previously in the lexicon is the application of spelling-to-sound rules, e.g. [10], which map an orthographical character sequence to a sequence of subword speech sounds.
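As an illustration of rule application, the following hypothetical sketch applies the flapping rule to a phone-string baseform; the set of stressed vowels is a simplified stand-in for a proper stress marking, and all names are assumptions rather than part of any published rule compiler:

    STRESSED_VOWELS = {"eh", "iy", "ay", "ae", "aa"}  # illustrative subset
    FLAPPABLE = [["tcl", "t"], ["dcl", "d"]]
    RIGHT_CONTEXT = {"ax", "ix", "axr"}

    def apply_flapping(baseform):
        # Replace a closure-stop pair with the flap [dx] between a stressed
        # vowel and an unstressed vowel from [ax ix axr], e.g.
        # ['bcl','b','eh','tcl','t','ax'] -> ['bcl','b','eh','dx','ax'].
        out, i = [], 0
        while i < len(baseform):
            pair = baseform[i:i + 2]
            if (pair in FLAPPABLE and out and out[-1] in STRESSED_VOWELS
                    and i + 2 < len(baseform)
                    and baseform[i + 2] in RIGHT_CONTEXT):
                out.append("dx")
                i += 2
            else:
                out.append(baseform[i])
                i += 1
        return out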

Data-driven strategies to baseform generation follow the pattern of aligning two subword-level transcriptions of the data, as illustrated in figure 7.1:

A Canonical Transcription This is derived from a forced alignment of the reference transcription to the data, using the canonical baseforms which are present in the existing lexicon.

An Alternative Transcription This is a transcription of the same data which is derived, either manually or via some automatic means, without reference to the existing canonical baseforms.

From the figure, it can be seen that a word can be aligned with a subword sequence which matches its canonical pronunciation or with some other sequence constituting a potentially alternative baseform. (The figure also highlights the fact that the alignment process must accommodate slight differences between the start and end timings of the phone-level hypotheses contained within the two transcriptions.)

The manual approach to obtaining the alternative transcription is to employ a linguist to hand transcribe the data at the subword-level, e.g. [78, 184]. An example of the automatic approach is to run the data through a recogniser employing only subword-level decoding constraints, e.g. [184, 195, 188, 189, 151, 137]. Although it is intuitive that the manual approach will provide something closer to the 'truth' than the automatic approach, there are two related issues regarding the utility of the manual approach: (1) the effort required to hand transcribe large quantities of data at the subword-level; and (2) inter-transcriber agreement. It is estimated in [182] that manual transcription of spontaneous speech at the subword-level is approximately 800 times slower than real-time. At that rate, manually transcribing even a few hours of data becomes impractical for a lone transcriber. To produce a manual transcription in a reasonable length of time, therefore, the workload is typically spread over a team. This solution introduces the new problem of maintaining consistency between transcribers, however. An argument made by Schiel et al. [182] is that since large quantities of data are required for accurate pronunciation learning, subword-constraint decodings may be preferred as the source of the alternative transcription, as although the resulting transcriptions may be more errorful, they are certainly cheap, and hence plentiful, and perhaps more importantly, consistent.


Figure 7.1: A schematic illustration of the alignment of a canonical and an alternative transcription of an interval of acoustic data.

the alternative transcription since, although the resulting transcriptions may be more errorful, they are certainly cheap, and hence plentiful, and perhaps more importantly consistent. Riley et al. [162] and Schmid et al. [184] also report that better baseform learning performance, in terms of WER, was obtained using an automatically derived, rather than a manually derived, alternative transcription. This prompted Riley et al. to conclude that, "while a hand labelled corpus is useful as a bootstrapping device, estimates of pronunciation probabilities, context effects, etc., are best derived from larger amounts of automatic transcriptions, preferably done using the same set of acoustic models which will eventually be used for recognition."

Another possible method for automatically obtaining an alternative transcription is the application of a transformation model, described in section 7.4, to the canonical transcription.

7.3.2 Baseform Evaluation

Existing Baseform Evaluation

Ravishankar & Eskenazi [151] describe the evaluation of existing baseforms through the time-alignment of a canonical and an alternative transcription. A poor canonical baseform is signalled in this case by a consistent mismatch with the alternative transcription. The consistency constraint is an important one as it combats either inter-transcriber disagreement, for a manually derived alternative transcription, or the errorful nature of an automatic subword-constraint decoding (typically around a 30% error rate for phone classification).

Two radically different approaches are proposed by Nock & Young [137]. In the first approach, the duration of a phone constituent to a baseform drawn from the canonical transcription is compared to the mean duration of that phone over the entire canonical transcription. The philosophy underlying this comparison is that a phone which is matched to some acoustics with a radically abnormal duration signals a bad match, and so the parent baseform has potential for improvement. The second approach uses the likelihood ratio between the canonical baseform and the aligned portion of the alternative transcription, averaged over the data set. Baseforms with an average likelihood ratio falling below some empirical threshold are considered as candidates for improvement.

A fourth approach is proposed by Markey & Ward [127]. In this case the likelihood of each phone in a baseform drawn from the canonical transcription is compared to the mean likelihood of that phone over the entire canonical transcription. Baseforms containing a phone which has an average likelihood significantly less than the phone's overall mean are marked as candidates for improvement. A problem with this approach, however, is that likelihoods are relative to the unconditional probability of the acoustics. They are thus not comparable across utterances, and so placing a fixed threshold on likelihoods from different utterances is somewhat ad hoc.


Simultaneous Evaluation of Both Existing and Potential Baseforms

A popular method for evaluating both existing and potential baseforms is to augment the existing lexicon with all potential baseforms and to perform a forced alignment of the reference transcription to the training set using the expanded lexicon. The baseforms most frequently selected by the alignment process are favoured in this case [212, 196, 162, 208]. A useful feature of this evaluation method is that a prior probability estimate can be easily computed for a baseform given the relative frequencies of occurrence of a word's candidate baseforms in the forced alignment. The assignment of prior probabilities to baseforms in a lexicon has been found to reduce confusability and hence increase recognition performance, relative to the equal weighting of multiple baseforms for a word [196, 82]. A related approach is to compare the average likelihoods of the various candidate baseforms when aligned against all instances of a word [195]. This is implicitly done on a per-word basis in the previous approach.
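As a concrete illustration of this prior estimation, a minimal sketch follows. It assumes a hypothetical input format of (word, selected baseform) pairs produced by a forced alignment; it is not the implementation used in the studies cited above.

    from collections import Counter, defaultdict

    def baseform_priors(alignment):
        # `alignment` is a list of (word, selected_baseform) pairs, one per
        # word token selected by the forced alignment (hypothetical format).
        counts = defaultdict(Counter)
        for word, baseform in alignment:
            counts[word][baseform] += 1
        priors = {}
        for word, bf_counts in counts.items():
            total = sum(bf_counts.values())
            # Relative frequency of each candidate baseform for this word.
            priors[word] = {bf: n / total for bf, n in bf_counts.items()}
        return priors

    # Three aligned tokens of "better", with two competing baseforms.
    align = [("better", "b eh dx axr"), ("better", "b eh dx axr"),
             ("better", "b eh t axr")]
    print(baseform_priors(align)["better"])  # priors of roughly 2/3 and 1/3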

Comments

A criticism that can be levelled at all the above approaches is that the evaluation of a given baseform is relative to the evaluation of another. For example, the appraisal of a baseform through relative frequency in a forced alignment, or through a comparison of likelihoods, is entirely dependent upon the cohort being evaluated. Thus all the above approaches rely upon the quality and coverage of the set of potential baseforms. What is really desired is some objective measure of model match which can be used to evaluate both existing and potential baseforms alike, based solely on their own merit. This is especially important given corpora, such as the BN corpus, which contain portions of low fidelity speech and non-speech sounds. In this case, baseform match values for differing portions of a corpus should be assigned varying degrees of credence. Steps in this direction are reported by Humphries & Woodland [92] and by Schmid et al. [184]. Humphries & Woodland apply a confidence measure based on phone-level n-best lattice density, similar to the measure described by equation 5.8, to identify ambiguous portions of the alternative decoding, which were subsequently discarded for the purposes of baseform learning. Schmid et al. exploit the benefits of an ANN based phoneme probability estimator to derive a confidence measure with links to the per-frame entropy measure given in equation 5.5. Broad phoneme class labels were favoured in this study for regions of ambiguous acoustics.

7.3.3 Baseform Selection

A key consideration when representing a word is to select a set of complementary, rather than competing, baseforms. This point gives rise to the notion that the observed pronunciations for a word will fall into clusters, due to dialect, speaking rate or other contextual factors. Given this viewpoint, an ideal way to characterise the space of pronunciations, whilst adhering to the baseform modelling approach, is to assign a unique baseform to each cluster. This perspective was the motivation for studies reported by Holter [89, 90] and Mokbel & Jouvet [132]. In both of these studies, the goal was to find the set of baseforms for a word $w$, $B_w$, which maximises the likelihood of the set $X_w$ of all acoustic realisations of $w$:

$$\hat{B}_w = \operatorname*{argmax}_{B_w \in \mathcal{S}} p(X_w \mid B_w) \qquad (7.1)$$

where $\mathcal{S}$ is the space of all possible baseform sets.

Holter makes the observation that the most likely baseforms will be similar and so competitive. An additional constraint must therefore be added so that $B_w$ also contains a diverse set of baseforms. The number of baseforms per word in a lexicon is typically hand-crafted [120] or determined using heuristic rules which can be strongly influenced by the frequency of occurrence of a word in the training data [61, 162, 182] (where an argument could be that a better representation of more frequent words will reduce WERs). An agreeable property of both the studies described above, however, is


that the pronunciation lexicon is explicitly optimised according to a well-founded statistical criterion, in line with the other components of most ASR systems. In [89, 90], the total number of baseforms in the lexicon is fixed a priori and an iterative cluster splitting algorithm (based on cluster variance) ensures that additional baseforms are only assigned to words which exhibit the largest pronunciation variation. The algorithms described by Holter and by Mokbel & Jouvet are very computationally expensive, however, and were applied to only medium sized and small vocabularies, respectively. It remains an interesting challenge, therefore, to apply these or similar algorithms to large lexica and corpora.

7.3.4 Multiwords

Due to its per-word basis, a criticism of the standard baseform approach to pronunciation modelling is that cross-word pronunciation effects, such as co-articulation, are only implicitly modelled. One way to mitigate this weakness is to use multiword baseforms. In this case, lexical entries and their associated baseforms are created for word sequences, where the hope is that the strong cross-word pronunciation phenomena present in word sequences such as kind of ("kinda"), going to ("gonna") and did you ("didja") will be accommodated. In addition to the standard baseform learning issues described above, research questions specific to the multiword approach are:

How should word sequences be chosen for multiword modelling?

How many words should a multiword incorporate?

How many multiwords should be added to the lexicon?

How should multiwords be incorporated into the language model?

Investigations into multiword baseform modelling are reported by Nock & Young [137] and by Finke & Waibel [56, 57]. Finke & Waibel describe the choice of word pairs for multiword modelling according to the dual criteria of (1) the mutual information between the pronunciations of the two words; and (2) the reduction in bigram perplexity when the word pair is considered as a new language model token. In addition to several other system improvements, the incorporation of 205 multiwords into their lexicon was found to yield useful decreases in WER on the SWB and CallHome corpora. They also found that splitting multiwords into their components for the purposes of language model probability assignment, as opposed to using multiword language model tokens, provided marginally better performance, although no explanation was offered as to why this should be the case.

Nock & Young describe three methods for multiword (again, word pair) selection: (1) word pair frequency; (2) word pairs for which the duration of the constituent phones differed markedly from their expected values (see section 7.3.2); and (3) word pairs with a low average likelihood ratio (also see section 7.3.2). In contrast to Finke & Waibel, Nock & Young report no benefit in WER terms from the use of multiwords, despite an increase in the overall likelihood of a forced Viterbi alignment of the reference transcription using a multiword augmented lexicon.
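The sketch below illustrates one plausible association statistic for word pair selection: pointwise mutual information over adjacent word pairs. It is an illustrative stand-in for the criteria above, not the procedure of either study; the toy corpus and the minimum count are hypothetical.

    import math
    from collections import Counter

    def rank_word_pairs(tokens, min_count=5):
        # Rank adjacent word pairs by pointwise mutual information: pairs
        # that co-occur far more often than chance are multiword candidates.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        scores = {}
        for (w1, w2), c in bigrams.items():
            if c < min_count:
                continue  # too rare to support reliable multiword modelling
            pmi = math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
            scores[(w1, w2)] = pmi
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    corpus = "i am going to go going to stay going to".split() * 5
    print(rank_word_pairs(corpus)[:3])  # ("going", "to") ranks highly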

7.4 Transformational Models

The baseform learning approach suffers from two principal problems due to its inherently per-word nature:

When taking the data-driven approach, it is inevitable that data sparsity will be encountered for infrequent words. Indeed, this is precisely the reason for choosing to construct the acoustic model around subword speech sound classes in the first place. An unhappy consequence of this is that substantial fractions of large lexica will be ineligible for baseform learning. (It is shown in section 7.6 that only approximately 5 k words out of a 65 k word vocabulary occur sufficiently frequently, even in approximately 63 hours of training data, to be eligible for baseform learning.)


Figure 7.2: A schematic illustration of a decision tree predicting the transformation of a canonical subword unit into its surface realisation, in the context of the preceding and following units (after Riley et al. [162]).

Pronunciation phenomena due to word context are only implicitly modelled.

Whilst this second point is mitigated to some degree through the use of multiword baseforms, their appeal lies primarily in their ease of use, as opposed to any principled attempt to capture cross-word effects.

The goal of the transformational approach is to build a model which is capable of predicting the actual realisation of a word, given some contextual factors and its canonical pronunciation. A simple example of a transformational model is a phonological rule, such as the example given in section 7.3.1. The application of such a rule, or more typically a set of such rules, is capable of 'mutating' the canonical baseform for a word into something closer to its surface realisation. Phonological rules overcome the two limitations of baseforms listed above, as they may be applied to all members of a vocabulary and may span word boundaries. One method for deriving the rules is to infer them manually from observation. Another method, given a large corpus with a canonical and an alternative transcription, is to infer them statistically from data [43]. The data-driven approach is especially attractive as probabilities can be readily appended to the rules, based on their frequency of occurrence. In general, however, the exploitation of more sophisticated transformational models is far more exciting, as they offer the potential to model the effects of higher level contextual factors, such as those listed in section 7.2. Saraclar's observation that acoustic confusability introduced through the inclusion of a large number of baseforms in the lexicon is detrimental to recogniser performance [180], described in section 7.3, provides a strong motivation for the belief that a dynamic pronunciation model, able to take changeable contextual factors into account, will perform better than a set of static pronunciation alternatives.

A popular architecture for inferring transformations from data, with the potential to capture the effects of higher level factors upon pronunciation, is the statistical decision tree [28], schematically illustrated in figure 7.2. Decision trees can be used to predict the surface realisation of a canonical unit through the use of a number of input features. A question is asked regarding one of these features at each node of the tree and the potential answers sprout branches to other nodes. For example, a node can represent the value of some variable, such as speaking rate, and its associated branches can be used to represent observed values that are either less or greater than the value stored at the parent node. Probabilities for the transformations can be calculated from counts of tokens which collect at each leaf when data is 'poured' down a trained tree. A decision tree can thus be used to implement a probabilistic one-to-many mapping from a canonical unit at the root through to its surface realisations at the leaves.
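The following minimal sketch illustrates this mechanism, using scikit-learn's tree learner as a stand-in for the trees of [28]. The phone contexts, surface realisations and tree depth are all hypothetical; a real system would use richer features (e.g. stress and syllabic position) and far more data.

    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical aligned training pairs: (left context, canonical phone,
    # right context) -> observed surface realisation.
    X_raw = [("eh", "t", "axr"), ("eh", "t", "ax"), ("s", "t", "aa"),
             ("eh", "t", "ix"), ("n", "t", "ax")]
    y = ["dx", "dx", "t", "dx", "t"]

    enc = OrdinalEncoder()
    X = enc.fit_transform(X_raw)

    # A shallow tree keeps the learned questions easy to inspect.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Leaf counts yield transformation probabilities for a given context.
    probe = enc.transform([("eh", "t", "axr")])
    print(dict(zip(tree.classes_, tree.predict_proba(probe)[0])))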


Advantages of a binary tree (such as that illustrated in figure 7.2) are (1) that a trained tree can be easily analysed with regard to which features are important to the overall mapping; and (2) that the branching restriction limits the computation required during training. A disadvantage of a decision tree, which is a corollary of its ease of interpretation, is that the independent consideration of features, enforced by its sequential branching structure, limits the functional forms that can be represented. This limitation is accentuated in the binary case.

Early examples of the use of decision trees to predict surface realisations of elements from a canonical transcription are reported by Chen [35] and Riley [163]. Riley reported that the uncertainty of predicting the phonetic realisation of a canonical phoneme in the hand-labelled TIMIT database was 1.5 bits. By training a decision tree using phoneme context and knowledge of the preceding phone, the uncertainty was reduced to 0.8 bits. Riley et al. more recently describe the training of similar decision trees on the hand-labelled portion of the SWB corpus [162], although the reduction in uncertainty was not so striking in this case (0.72 → 0.50 bits).

A recently developed model of co-articulation that fits into the transformational category is the hidden dynamic model (HDM) [158, 143]. The model, although used for recognition, is rather similar to one which may be used for synthesis. A phone sequence is represented within the model as a trajectory through a 'hidden' space which is used to represent the space of articulator configurations. This trajectory is then mapped to the acoustic domain through the use of a non-linearity, such as that provided by an MLP. When used for recognition, the sequence of acoustic vectors predicted by the model which most closely matches those observed is used to recover the hypothesised phone sequence. The model of co-articulation provided by the HDM is encoded through the trajectory in hidden space. The shape of the trajectory is determined by a target vector and a set of 'pliancies' or 'time constants' for each phone. The pliancy for a phone specifies the degree to which the trajectory may miss the intended phone target due to the competitive influence of neighbouring articulator configurations.

As it stands, the HDM provides a useful model of what may be termed 'fine grain' co-articulation between adjacent phones; the accommodation of larger scale pronunciation effects, such as elision and assimilation, is assigned to a model, such as the decision tree models above, which maps the phonemic representation of a word onto some phonetic realisation. If separate pliancies were defined over each dimension of the hidden space (read articulator) and were made dependent upon higher-level factors, such as word frequency for example, the model would have the potential to accommodate much larger scale pronunciation effects. Given these modifications, the model may be able to preserve only those articulator targets (and the corresponding acoustic realisation) that are essential for disambiguating a word from its surrounding context, and may also be able to model the observation that low-information, frequently-occurring function words are less precisely articulated than their content word counterparts.

7.5 Acoustic Model Retraining

A knock-on effect of any approach which yields improved pronunciation models is the ability to derive improved acoustic model training targets in a re-alignment process. An optional additional step of a pronunciation learning process is, therefore, acoustic model retraining. A stronger argument is that acoustic model retraining should be mandatory, so as to maintain consistency between the acoustic model classes and their parent pronunciation models.

Finke & Waibel describe a flexible transcription alignment procedure [56, 57] which uses improved pronunciation models, together with some other techniques, to lessen the detrimental effects of crude transcriptions. The application of these techniques to the SWB and CallHome corpora was found to produce useful improvements in WER. Sloboda & Waibel [189] also report improvements after retraining using a dictionary augmented with baseforms obtained from a data-driven pronunciation learning process.

Although an experimentally derived set of pronunciation models may constitute an improvement, their application may not yield a decrease in WER due to a mismatch between the pronunciation (and indirectly the acoustic) models used for recognition and those used for training. Nock & Young [137]


report that a modified pronunciation lexicon that was unable to improve recognition performance with the existing acoustic class models was also found not to be beneficial after retraining.

Humphries & Woodland [93] considered two retraining scenarios: (1) the acoustic model was retrained on acoustic data from a new accent group using a forced alignment derived from the existing pronunciation model; and (2) the acoustic model was retrained on the new accent data using a forced alignment derived from a pronunciation lexicon that had been adapted to the new accent. Recognition results for data from the new accent group, using the respective pronunciation lexica, were statistically indistinguishable. This result starkly illustrates the fact that current acoustic model training algorithms are far more powerful than current pronunciation model learning techniques.

7.6 A First Baseform Learning Attempt

In order to gain a true picture of how well a baseform matches the acoustic realisations of a word, the evaluation must be made independently of any language model factors. Accordingly, if a confidence measure based approach to baseform evaluation is adopted, a purely acoustic measure is required. As the results given in chapter 6 indicated that the measure defined by equation 5.9 was the most reliable acoustic confidence measure for the task of utterance verification, this measure was selected to act as the baseform evaluation component for a set of baseform learning experiments. An attractive aspect of using this measure for the task of baseform evaluation is that it is capable of providing an objective evaluation, since it is based upon phone class posterior probability estimates.

The experiments described in this section were carried out using two different data sets drawn from the BN 1996 and 1997 training sets:

An approximately 7 hour (15 episode) training set and an approximately 1 hour (2 episode) test set were used for initial, exploratory experiments.

An approximately 63 hour (155 episode) training set and an approximately 3 hour (6 episode) test set were defined for further, larger scale experiments.

7.6.1 Potential Baseform Generation

Potential baseforms were derived from a phone-constraint decoding of the training set. The same phone-level bigram language model described in chapter 6 was used to make the decoding. This language model was trained using approximately 3 million phone hypotheses obtained from a forced Viterbi alignment of the reference transcription to the 1996 and 1997 BN training sets. The forced Viterbi alignment and the phone-constraint decoding of the training set were used as the canonical and alternative transcriptions respectively. In order to perform the forced alignment, the standard decoding lexicon was augmented with those words in the reference transcription which would otherwise be OOV. The baseforms for these additional words were hand-crafted. The use of a forced alignment as the canonical transcription avoids problems associated with decoding errors.

The canonical and alternative transcriptions were aligned using a DP procedure [60], in which phone identities from the canonical and alternative transcriptions are represented using a vector of binary feature values. The (Euclidean) phonetic distance between two phones may then be calculated using these vectors and used as the cost for the DP search. The alignment between phone sequences from the alternative transcription and words in the canonical transcription provided potential baseforms for all words in the training set. The mapping between words and phone sequences is schematically illustrated in figure 7.1. In order to combat the errorful nature of a phone-constraint decoding, only words which occurred sufficiently frequently in the training set were considered able to support reliable model inference and hence eligible for baseform learning. A threshold of 10 or more examples was arbitrarily set for eligibility in the experiments. A list of potential baseforms was compiled for each eligible word in the training set and ordered by frequency of occurrence. For reasons of computational expense, the number of potential baseforms evaluated for each word was limited to the five most frequent.
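A minimal sketch of such an alignment is given below. The binary feature vectors are hypothetical stand-ins for a real phonetic feature set, and the insertion/deletion penalty is an assumed constant; only the Euclidean distance computation and the DP recursion follow the description above.

    import numpy as np

    # Hypothetical binary phonetic feature vectors (e.g. voiced, nasal, ...).
    FEATURES = {"t": np.array([0, 0, 1, 0]), "d": np.array([1, 0, 1, 0]),
                "dx": np.array([1, 0, 1, 1]), "ax": np.array([1, 1, 0, 0])}

    def phone_cost(a, b):
        # Euclidean distance between the binary feature vectors.
        return np.linalg.norm(FEATURES[a] - FEATURES[b])

    def align(canonical, alternative, gap=1.0):
        # Standard DP (Levenshtein-style) alignment with a phonetic
        # substitution cost; `gap` is an assumed insertion/deletion penalty.
        m, n = len(canonical), len(alternative)
        D = np.zeros((m + 1, n + 1))
        D[:, 0] = np.arange(m + 1) * gap
        D[0, :] = np.arange(n + 1) * gap
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i, j] = min(
                    D[i - 1, j - 1] + phone_cost(canonical[i - 1], alternative[j - 1]),
                    D[i - 1, j] + gap,   # deletion
                    D[i, j - 1] + gap)   # insertion
        return D[m, n]

    print(align(["t", "ax"], ["dx", "ax"]))  # small cost: phonetically close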


7.6.2 Baseform Evaluation

The utility of the measure for baseform evaluation is illustrated in figure 6.6. In this example, low confidence values are obtained for the alignment of the [dcl] and [d] components of the baseform [f ah n dcl d z] to an instance of the word funds which has an acoustic realisation closer to the phone sequence [f ah n z]. Anecdotal evidence aside, a good pronunciation model should provide a consistently high confidence match to instances of the word it models. Accordingly, the baseform evaluation approach was:

1. A given baseform was aligned against all instances of the word in the training set and values of the confidence measure, as given by equation 5.9, were calculated for each alignment.

2. An overall confidence estimate for a given baseform was found by averaging the values obtained for each of its alignments.

7.6.3 Baseform Selection

For any baseform learning experiment, it is important to use a well refined baseline lexicon, as the use of a crude baseline lexicon, such as one created from spelling-to-sound rules, can give misleading results with regard to the efficacy of the baseform learning method. Accordingly, a lexicon derived from the hand-crafted LIMSI lexicon [120] was used as a baseline. A number of decision schemes for accepting or rejecting a potential baseform were investigated:

augment The control scheme was to augment the baseline pronunciation lexicon with the set of the n most frequently occurring potential baseforms, where n was set equal to the number of baseforms possessed by the word in the baseline lexicon.

CM-replace1 A list of the existing and potential baseforms was ordered according to their associated average confidence values. The baseforms for the word in the baseline pronunciation lexicon were replaced with the n-best baseforms drawn from the evaluation list, where n was set equal to the number of baseforms possessed by the word in the baseline lexicon. The rationale behind this scheme was to limit the potential confusability introduced into the pronunciation lexicon by the association of a large number of competing baseforms with each word.

CM-replace2 A similarly ordered list of existing and potential baseforms to that described for CM-replace1 was compiled, with the exception that short, frequently occurring function words were omitted from the list. The baseform replacement was carried out in exactly the same manner as for CM-replace1. The rationale behind this scheme is that, as function words are subject to increased co-articulation and vowel reduction, it may be supposed that it is harder to learn baseforms for such words. If the baseforms learnt for function words are unreliable, it may be better to merely retain the baseforms possessed by the word in the baseline lexicon.

CM-augment The baseline pronunciation lexicon was augmented with all potential baseforms which obtained an average confidence value exceeding the lowest value obtained for a member of the word's set of baseline baseforms. This scheme may be interpreted as augmenting the existing set of baseforms for a word with any potential baseforms that are 'better' than the least representative existing baseform.

Prior probabilities for all potential baseforms selected were obtained by rescaling their associated confidence or frequency of occurrence values into the range [0,1]. After the incorporation of any alternative baseforms into the lexicon, the prior probabilities for all baseforms possessed by a word were rescaled to ensure that they summed to unity. The complete baseform learning process is illustrated in figure 7.3.
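A minimal sketch of this prior assignment step follows. The text does not pin down the exact rescaling into [0,1]; division by the maximum value is assumed here, and the input values are hypothetical.

    def assign_priors(existing, selected):
        # `existing` maps baseform -> prior probability from the baseline
        # lexicon; `selected` maps baseform -> raw confidence (or frequency)
        # value.  Raw values are rescaled into [0, 1] -- here simply by
        # dividing by the maximum, one plausible reading of the text.
        peak = max(selected.values())
        rescaled = {bf: v / peak for bf, v in selected.items()}
        combined = {**existing, **rescaled}
        # Renormalise so that a word's baseform priors sum to unity.
        total = sum(combined.values())
        return {bf: p / total for bf, p in combined.items()}

    print(assign_priors({"f ah n dcl d z": 1.0}, {"f ah n z": 0.82}))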


Figure 7.3: A schematic illustration of the baseform learning process. A forced Viterbi alignment of the reference transcription is aligned against a phone-constraint decoding to provide a list of potential baseforms for each eligible word. The existing and potential baseforms for a word are evaluated using the acoustic confidence measure and the evaluations are input into a selection process for lexicon creation.

7.6.4 Results

The results from the initial set of experiments are given under the '7 hour training set' column of table 7.1. It can be seen from the table that the best performing baseform selection scheme was CM-augment, which achieved an approximately 1% absolute WER improvement over the baseline for an episode drawn from the training set,³ but only an approximately 0.5% absolute WER improvement on the test set. It was speculated that this lack of generalisation was the product of two factors:

Due to a flaw in the experimental design, the training and test sets may have been mismatched. The training set was made up of episodes of NPR's "Marketplace", whereas the test set was made up of an episode of ABC's "Nightline"⁴ and an episode of NPR's "All Things Considered".⁵

Using the 7 hour training set, only 868 words from the system's vocabulary of 65 k words were eligible for baseform learning (i.e. occurred on 10 or more occasions in the training set). The training set, and hence the eligible word list, may therefore have been too small to cover a sufficient number of words seen in the test set to make a significant difference to the test set WER.

The initial results were deemed encouraging enough, however, for a second set of experiments to be run using larger training and test sets. For the creation of these data sets, it was ensured that episodes were sampled in a representative manner from the different shows comprising the BN 1996 and 1997 training sets. The approximately 63 hour training set facilitated baseform learning for approximately 5 k out of the 65 k words in the baseline lexicon, using a 10 example eligibility threshold. From the appropriate column of table 7.1 it can again be seen that the best performing selection scheme was CM-augment. As in the previous experiment, an approximately 1% absolute WER improvement for an episode drawn from the training set and an approximately 0.5% absolute WER improvement over baseline on the test set was observed. These results show that harmful amounts of confusability were not introduced into the lexicon, providing the baseforms possessed by a word were appended with a prior probability.

³ NPR Marketplace: episode 23/05/96.
⁴ ABC Nightline: episode 24/06/96.
⁵ NPR All Things Considered: episode 20d/05/96.


                                    WER (%)
Pronunciation      7 hr training set        63 hr training set
Lexicon            Training    Test         Training    Test
baseline           18.5        23.5         25.0        34.0
augment            18.2        23.6         24.2        34.9
CM-replace1        19.0        24.8         24.6        35.2
CM-replace2        18.4        24.0         24.3        34.3
CM-augment         17.6        23.1         23.8        33.6

Table 7.1: WER results for the various selection schemes over the two training and test sets. The best performing selection scheme for all data sets was CM-augment.

Selection     Cond.       Test Set Episode WER (%)                 Av.
Scheme                    a960610   c9606523   g960516   k960607
baseline      all         29.6      32.0       24.3      34.0      30.1
              F0          8.0       15.8       15.0      26.7      18.9
              F1          34.8      -          23.5      38.8      33.3
CM-augment    all         29.1      31.2       24.0      33.6      29.6
              F0          8.1       14.8       14.3      26.0      18.2
              F1          34.3      -          22.3      39.4      32.9

Table 7.2: Episode-by-episode WER statistics for a selection of episodes drawn from the approximately 3 hour test set. The small improvement obtained using CM-augment over the baseline is reasonably consistent over episodes.

The consistency in performance over the different sized training and test sets prompted a closer look at the episode-by-episode WERs. Table 7.2 shows that the 0.5% absolute WER gain over the baseline is reasonably constant over a sample of episodes drawn from the approximately 3 hour test set. Although the gains using CM-augment were consistent, they were disappointingly small (none of the gains are statistically significant for the particular test set used).

7.7 Decision Tree Smoothing

A second experiment [63] was carried out in conjunction with Eric Fosler-Lussier at ICSI, Berkeley.⁶ The 1996 BN training set was used together with two test sets: (1) Hub-4E-97; and (2) an approximately 30 minute subset of Hub-4E-97 (Hub-4E-97-subset) composed of all utterances with a maximum of 100 words. The smaller test set was used as a development set for the SPRACH entry to the 1998 DARPA BN evaluation as it facilitated fast turnaround experiments. Experience gained during the preparation for the evaluation revealed the absolute WER results to be approximately 2% higher for Hub-4E-97-subset than for Hub-4E-97.

7.7.1 Potential Baseform Generation

Drawing upon insights provided by Riley et al. [162], the potential baseforms in this experiment were derived from a decision tree smoothed phone-constraint decoding. The decision trees were trained (one per phone) using a forced Viterbi alignment and a 'raw' phone-constraint decoding as the canonical transcription and surface realisations respectively. The same phone bigram as described in chapter 6 and section 7.6 was used to make the phone-constraint decoding. The features input to the tree were the identity, manner and place of articulation and syllabic position (onset, nucleus or coda) of a phone, and that of its immediate neighbours.

⁶ I am indebted to Eric for providing the alignment software used in this and the previous experiment, for devising the log-based lexicon pruning scheme, creating the lexica and making the decodings for this experiment.


Figure 7.4: The concatenation of pronunciation networks for individual words to create a finite state grammar (FSG) for the training data.

In order to perform a forced Viterbi alignment, a finite state grammar (FSG) was compiled for the training data by (1) merging the set of baseforms possessed by a word in the baseline lexicon to create a pronunciation network; and (2) concatenating the pronunciation networks for the words in the reference transcription, as illustrated in figure 7.4. The trained decision trees were then used to augment the canonical FSG. Each phone labelled link between a pair of nodes of the canonical FSG was considered in turn, and additional links between those nodes were created for the surface realisations of the canonical phone, as predicted by the decision tree (cf. [35]). The probability of the predicted realisation was appended to each additional link, and links with probabilities below a threshold (arbitrarily set to 0.1) were discarded. The augmented FSG was then re-aligned to the training data to obtain a smoothed alternative decoding, from which potential baseforms were derived as before.
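The link augmentation step may be sketched as follows. The link format and the predict callable (a stand-in for a trained decision tree, with phone context omitted for brevity) are hypothetical; the 0.1 probability threshold follows the text.

    def augment_fsg(links, predict, threshold=0.1):
        # `links`: (src, dst, canonical_phone) triples of the canonical FSG.
        # `predict(phone)` returns {surface_phone: probability}, standing in
        # for a trained decision tree.
        augmented = []
        for src, dst, phone in links:
            for surface, p in predict(phone).items():
                if p >= threshold:  # discard improbable realisations
                    augmented.append((src, dst, surface, p))
        return augmented

    # Toy prediction table: /t/ surfaces as [t] or the flap [dx].
    table = {"t": {"t": 0.6, "dx": 0.35, "q": 0.05}}
    print(augment_fsg([(3, 4, "t")], lambda ph: table.get(ph, {ph: 1.0})))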

7.7.2 Baseform Evaluation and Selection

For this experiment, the existing baseforms in the baseline lexicon were not evaluated. Potential baseforms were evaluated according to two criteria:

Frequency Evaluation Potential baseforms were ranked according to their frequency of occurrence in the alternative transcription.

Confidence Evaluation Potential baseforms were ranked according to their average confidence values, as in the previous experiment.

In accordance with the results presented in section 7.6, a lexicon augmentation strategy was adopted for baseform selection. Several criteria for deciding how many baseforms to add to the lexicon were investigated. Disregarding the introduction of confusability, a compromise was sought between the number of additional baseforms included in the lexicon and the increase in the computational expense of decoding which their inclusion brings. Setting the number of additional baseforms for a word to the value of a function of the logarithm of the number of occurrences of the word in the training data was empirically found to accommodate a useful amount of pronunciation variation without causing unacceptable increases in decoding expense:

$$n_w = \lfloor \alpha \log c_w \rfloor \qquad (7.2)$$

where $c_w$ is the number of occurrences of word $w$ in the training data, the best value for the factor $\alpha$ was found to be 1.2, and the floor function is denoted $\lfloor \cdot \rfloor$. Prior probabilities for the selected set of additional baseforms were found by rescaling their evaluation function values to the range [0,1]. The baseline lexicon was then augmented with the selected baseforms, rescaling the priors according to:

$$P'(b \mid w) = \begin{cases} \lambda \, P(b \mid w) & \text{if } b \text{ is an existing baseform} \\ (1 - \lambda) \, \tilde{P}(b \mid w) & \text{if } b \text{ is an added baseform} \end{cases} \qquad (7.3)$$

where $P(b \mid w)$ is the existing prior, $\tilde{P}(b \mid w)$ is the rescaled evaluation value of an added baseform and $\lambda$ was set equal to 0.5.
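A minimal sketch of equation 7.2 follows; the base of the logarithm is not recoverable from the text, so the natural logarithm is assumed here.

    import math

    def n_additional_baseforms(count, alpha=1.2):
        # Equation 7.2: the number of baseforms to add for a word, floored
        # from a scaled logarithm of its training-set count.  The log base
        # is an assumption (natural log used here).
        return math.floor(alpha * math.log(count))

    for c in (10, 100, 1000):
        print(c, n_additional_baseforms(c))  # 2, 5 and 8 respectively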


Lexicon                   WER (%)
baseline                  27.5
frequency evaluated       26.9
confidence evaluated      26.6

Table 7.3: WER performance of the 3 lexica on Hub-4E-97-subset.


7.7.3 Results

The results given in table 7.3 show that both experimental lexica provide reductions in WER over the baseline, and that the largest reduction is provided by the lexicon created using the confidence evaluated baseforms. The derivation of potential baseforms from the smoothed phone-constraint decoding, coupled with the confidence-based evaluation scheme, provided an approximately 1% absolute gain in WER. (A reduction in WER from 27.5% to 26.1% is significant for Hub-4E-97-subset.) This may be contrasted with the derivation of potential baseforms from the raw phone-constraint decoding used in the previous experiment, which yielded an approximately 0.5% absolute WER gain (albeit on a different test set) when coupled with a confidence-based evaluation and a lexicon augmentation scheme. The pronunciation lexicon obtained from this experiment was adopted for the SPRACH entry into the 1998 DARPA BN evaluation [63].

Given that we now have a set of baseforms which offer a modest reduction in WER over that obtained with the baseline lexicon, we must return to the question of whether an improvement in pronunciation modelling provides a corresponding improvement in confidence measure performance. To investigate this, decodings of the Hub-4E-97 test set were made⁷ using the baseline and frequency evaluated lexica and two different acoustic models:

Acoustic model A CI phone probabilities from two 384 state unit RNNs using PLP features, trained forwards and backwards in time, were merged at the frame-level, using log domain averaging, with those from a 4 k HU MLP using MSG features.

Acoustic model B The RNNs were merged in the same way with an 8 k HU MLP using MSG features.

The results given in table 7.4 show that WER reductions (statistically significant) were obtained using both the improved acoustic model (8 k MLP HUs) and the improved pronunciation models. The last column of the table shows the minimum UER (i.e. that obtained using the best threshold) for an utterance verification test. The utterance verification experiments were conducted as described in chapter 6, using the same confidence measure. The utterance verification improvements are only statistically significant for the improved pronunciation lexicon case. These results confirm that:

Small improvements in pronunciation modelling are reflected in small improvements in confidence measure performance (for the utterance verification task).

With regard to utterance verification, improvements in the pronunciation lexicon are more important than the improvements in the phone class posterior probability estimates which are supposed to be the cause of the reduced WER when only the acoustic model is modified (8 k versus 4 k MLP HUs).

⁷ Inter-segment gaps and excluded regions were removed prior to decoding.


Lexicon                 Acoustic Model       WER (%)    UER (%)
baseline                A (4 k MLP HUs)      25.4       20.4
frequency evaluated     A (4 k MLP HUs)      24.7       19.7
baseline                B (8 k MLP HUs)      24.6       20.0
frequency evaluated     B (8 k MLP HUs)      24.0       19.3

Table 7.4: WER and UER results for two of the 3 lexica, using two different acoustic models on Hub-4E-97.

7.8 Summary

The first automatic baseform learning experiment was found to yield very modest reductions in WER over those obtained using the baseline lexicon, if a lexicon augmentation strategy was adopted. Baseform replacement strategies were found to only degrade recognition performance on a test set. A confidence-based evaluation criterion was found to provide larger WER reductions than one based upon the frequency of occurrence of a potential baseform in the alternative transcription of the training set. The addition of baseforms to the lexicon was not found to introduce detrimental amounts of confusability, so long as the baseforms possessed by a word had associated prior probabilities.

The use of a decision tree smoothed phone-constraint decoding as the alternative transcription improved the result of the baseform learning procedure. When used in conjunction with the smoothed phone-constraint decoding, a confidence-based evaluation of potential baseforms was again found to be better than one based upon the frequency of baseform occurrence.

An improved pronunciation lexicon was found to provide the predicted improvement in confidence measure performance (for the utterance verification task). Pronunciation modelling improvements were found to be more important than improvements in phone class probability estimates in this regard.

Pronunciation modelling is still very much in its infancy. Current approaches to pronunciation modelling need to be vastly improved if the systematic effects upon pronunciation of dialect, speaking rate, co-articulation and speaking style are to be captured. Due to the interdependency of system components (acoustic model training targets are typically derived using the pronunciation lexicon, for example), progress in this area is vital to prevent it becoming a bottleneck to ASR development in general.


Chapter 8

Filtering Audio Streams

8.1 Introduction

The question addressed in this chapter is whether a confidence measure can be used to reliably filter the acoustic input to a recogniser, with the aim of improving the overall system performance in terms of both reduced WER and computational expenditure. The work described was carried out in collaboration with Jon Barker in Sheffield and Dan Ellis at ICSI, Berkeley.¹

Practical ASR systems cannot be expected to be supplied with neatly segmented packets of speech, within which aspects such as channel characteristics, speaker and speaking style remain constant, but must rather accommodate unconstrained streams of audio. A key problem may therefore ostensibly appear to be the segmentation of a signal into portions of speech and portions of non-speech, and the subsequent filtering out of those identified as non-speech. This strategy is not appropriate, however, when a speech signal is overlaid upon varying sources of background, such as noise, music or another speech signal. Given this scenario and the fallibility of current ASR technology, a more appropriate goal is to generate segments of acoustics that are classified as either recognisable or unrecognisable. Segments which fall into the second category will contain not only non-speech audio, such as music, but also speech which is insufficiently well matched by the recognition models, due to the prevailing acoustic conditions for example, to yield acceptable decoding accuracy. The classification of segments as either recognisable or unrecognisable is also useful following a sound source separation process, or some other strategy for accommodating speech produced in adverse environments, since it still may not be possible to decode the signals produced by such algorithms with acceptable recognition accuracy.

Previous approaches to similar segmentation and discrimination tasks [190, 183, 186] have been based upon building explicit models of differing conditions of interest, such as speech and music, or broadband and telephone quality speech. The approach adopted here (which is in some ways similar to that taken in [130]) appeals to the notion of parsimony and asks whether statistics obtained from the recognition models themselves can be used to classify segments: can a model trained exclusively in one domain be used to reject data from another? Specifically we ask, are there or are there not phones present in the signal? An additional benefit of the approach is the potential to predict the WER for a segment and hence to be able to constrain a recogniser to operate within some pre-determined error rate bound. Identifying regions of unrecognisable acoustics prior to the costly decoding stage provides the potential to substantially reduce the computational expense of the recognition process. If the decoding time for a segment can also be reliably predicted, the latency of the recogniser response may also be constrained to remain inside some pre-determined delay bound.

The remainder of this chapter falls into three sections. First, section 8.2 describes the use of the general, per-frame entropy based, acoustic confidence measure for directly partitioning an unconstrained stream of acoustics into segments labelled as recognisable and unrecognisable. Second, section 8.3 describes the use of the measure for the classification of previously generated segments,

¹ I am indebted to Jon for writing the smoothing software, making decodings and plotting graphs in the first batch of experiments, and to Dan for training acoustic models and making decodings in the second.


from a wide variety of data sources. Lastly, section 8.4 describes the addition of a function of the temporal properties of the speech signal, designed to complement the frame-limited perspective of the confidence measure.

8.2 Segmentation

8.2.1 The Raw Entropy Profile

As the per-frame entropy measure is designed to be sensitive to signals which are not well matched by the acoustic model in general, it seemed appropriate to investigate the use of the per-frame entropy profile over some interval of acoustics as a means to directly obtain the desired segmentation. After some initial experimentation, however, it was discovered that repercussions of the piecewise stationarity assumption caused the 'raw' per-frame entropy profile of the signal to be unsuitable as the basis for a segmentation: a consequence of modelling the speech signal as a sequence of steady states with instantaneous transitions between each is, unsurprisingly, that the transitional periods between steady state speech sounds are poorly modelled. An effect of this crude modelling approximation is the creation of spikes in the per-frame entropy profile at phone transitions, as illustrated in figure 8.1.

In one sense, the spikes in the per-frame entropy profile are gratifying, since they demonstrate that the acoustic model is performing well on the task for which it is trained—modelling the steady state portions of the speech signal—and as a consequence is providing a poor model match to the 'out-of-domain' phenomena of phone transitions. The presence of the spikes can be seen to support the notion that the outputs of a well trained model can be used to discriminate between within- and out-of-domain data. It is still the case, however, that the fine structure in the per-frame entropy profile obscures the underlying trends over longer periods of well and poorly matched acoustics.

Figure 8.1: A per-frame entropy profile for an example of the phrase "America in black and white". Although this is clean, studio quality speech, spikes in the profile occur at phone transitions.
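As a minimal sketch of the underlying quantity, the per-frame entropy of the phone posterior estimates (cf. equation 5.5), the following assumes a (frames × classes) array of posterior probabilities; the toy values are illustrative only.

    import numpy as np

    def per_frame_entropy(posteriors, eps=1e-12):
        # Entropy (in bits) of the phone posterior distribution at each
        # frame; `posteriors` is a (frames x classes) array whose rows are
        # assumed to sum to one.
        p = np.clip(posteriors, eps, 1.0)
        return -np.sum(p * np.log2(p), axis=1)

    frames = np.array([[0.97, 0.01, 0.01, 0.01],   # confident: low entropy
                       [0.25, 0.25, 0.25, 0.25]])  # confused: high entropy
    print(per_frame_entropy(frames))  # approximately [0.24, 2.0]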

A closer investigation of specific phone class outputs of the acoustic model revealed that some phones can provide a strong match to certain non-speech sounds (for example, background hiss can be mistaken for the sibilant [s]) and that other phone models can be poorly matched even by clean speech. For the particular classifier and phone-set investigated, these weak phones included [ix, dx, uh, axr] and [ax]. The existence of weak phones raises issues regarding subword unit category design and the derivation of training targets. It is possible to imagine that some categories of subword speech sounds may be inherently ill-defined and so will not be amenable to the creation of 'sharp' acoustic models. This notion is given some credence by results given in [147], which show, for a generative HMM, that some phone class categories can yield low average likelihoods on the training set despite having frequencies of occurrence close to the maximum, and vice versa. Given a fixed set of subword categories, it is also conceivable that some subword models may be corrupted by inappropriate training targets derived from systematically crude pronunciation models.


8.2.2 Smoothing

By applying a median filter² in a 50-80 ms window, many of the spikes in the per-frame entropy profile associated with phone transitions can be removed. Even after this first stage of smoothing, however, the entropy profile over a longer interval can still exhibit a good deal of fine structure. The upper panel of figure 8.2, for example, shows smoothed per-frame entropy values averaged over a 40 frame (≈600 ms) window, for a 10 minute portion of a radio broadcast. The lower panel shows the result of the application of a further median smoothing stage, this time over an approximately 10 second window.

After the final stage of smoothing, the underlying trends in the profile are exposed sufficiently for the segmentation process to proceed. Putative boundaries between segments were located by finding sufficiently sharp changes in the smoothed profile (absolute values of the difference function were compared to an empirically determined threshold). The vertical lines in the lower panel indicate segment boundaries hypothesised in this way.

Figure 8.2: A per-frame entropy profile after the first (Top) and second (Bottom) stages of smoothing, for a 10 minute portion of BN data.
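The two-stage smoothing and boundary hypothesis procedure may be sketched as follows. The frame rate, window sizes and difference threshold are assumptions chosen to approximate the values quoted above, not the tuned values used in the experiments.

    import numpy as np
    from scipy.signal import medfilt

    def boundary_hypotheses(entropy, frame_ms=16, threshold=0.3):
        # Stage 1: a short median filter (~80 ms) removes transition spikes.
        short = medfilt(entropy, kernel_size=5)
        # Stage 2: a much longer median filter (~10 s) exposes the trends.
        window = int(10_000 / frame_ms) | 1  # kernel size must be odd
        smooth = medfilt(short, kernel_size=window)
        # Sufficiently sharp changes in the smoothed profile mark putative
        # segment boundaries.
        jumps = np.abs(np.diff(smooth))
        return np.nonzero(jumps > threshold)[0]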

Although the smoothing process reveals putative segmentation points, its heavy degree compromises their temporal precision. To combat this, the positions of the hypothesised segment boundaries were 'fine-tuned' using the symmetric Kullback-Leibler (KL2) distance metric, as described in [186]. The Kullback-Leibler distance between two probability distributions $P$ and $Q$ is given by:

$$d_{KL}(P \parallel Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx \qquad (8.1)$$

Since $d_{KL}$ is not symmetric, $d_{KL2}$ is defined as:

$$d_{KL2}(P, Q) = d_{KL}(P \parallel Q) + d_{KL}(Q \parallel P) \qquad (8.2)$$

² A median filter replaces all values in some analysis window with the median of those values. The median is better suited than the mean to smoothing out the fine structure of a profile, whilst also preserving any distinct turning points, as (1) it is more robust to outliers; and (2) it also preserves 'edges' in the data (which is a reason for its popularity in the field of computer vision).


If $P$ and $Q$ are Gaussian with means $\mu_P$, $\mu_Q$ and variances $\sigma_P^2$, $\sigma_Q^2$ respectively, this becomes:

$$d_{KL2}(P, Q) = \frac{1}{2}\left(\frac{\sigma_P^2}{\sigma_Q^2} + \frac{\sigma_Q^2}{\sigma_P^2}\right) + \frac{(\mu_P - \mu_Q)^2}{2}\left(\frac{1}{\sigma_P^2} + \frac{1}{\sigma_Q^2}\right) - 1 \qquad (8.3)$$

Means and variances were calculated for Gaussian approximations to the distribution of the 'front-end' PLP features (i.e. prior to any entropy calculations) in two second windows on either side of a putative segmentation point. The segmentation point was then adjusted within a local window of a few seconds so as to maximise the distance between the distributions.
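A minimal sketch of this fine-tuning step is given below, for a one-dimensional feature stream; the window and search sizes are assumptions, and a real implementation would sum the KL2 distance over the feature dimensions.

    import numpy as np

    def kl2_gaussian(x, y):
        # Symmetric KL distance (equation 8.3) between single-Gaussian
        # approximations to two windows of front-end feature values.
        mu_p, var_p = np.mean(x), np.var(x)
        mu_q, var_q = np.mean(y), np.var(y)
        return (0.5 * (var_p / var_q + var_q / var_p)
                + 0.5 * (mu_p - mu_q) ** 2 * (1 / var_p + 1 / var_q) - 1)

    def refine_boundary(features, b, win=125, search=60):
        # Shift putative boundary `b` (in frames) within a local search
        # window so as to maximise the KL2 distance between the windows
        # either side; `win` approximates two seconds of frames and both
        # window sizes are assumptions.
        candidates = range(max(win, b - search),
                           min(len(features) - win, b + search))
        return max(candidates,
                   key=lambda t: kl2_gaussian(features[t - win:t],
                                              features[t:t + win]))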

8.2.3 Comments

Although a segmentation of a previously unpartitioned stream of audio can be obtained from its per-frame entropy profile, a considerable amount of post-processing is required to convert the 'raw' profile into one which is sufficiently smooth to reveal the underlying trends of well and poorly matched acoustics over useful lengths of time. Despite its somewhat ad hoc nature, it will be revealed in section 8.3.1 that the above procedure provides a segmentation which compares favourably with one generated by hand. Due to the extensive amount of smoothing required to convert the raw per-frame entropy profile into a useful form, it may be argued, however, that the per-frame entropy measure is better suited to the classification of segments that have been created by some other means. The application of the measure to classification is explored in the next section.

8.3 Classification

8.3.1 Entropy Based Segmentation

The experiment described in this section was carried out using segments derived from the application of the segmentation procedure described in section 8.2 to a half hour episode of ABC Nightline.³ The per-frame entropy values were computed using probability estimates obtained from 2 CD RNNs, using PLP features, trained forward and backward in time on the 1996 BN training set. The outputs of the two networks were merged at the frame-level by averaging in the log domain.

A single value of the measure was computed for each segment by setting the interval limits equal to the start- and end-points of the segment respectively. For classification, this value was compared to an empirically derived threshold. The WER was also calculated for each segment, to facilitate comparison against the computed confidence value. WER was computed from a time-alignment between the decoding and the reference transcription, using NIST's SCLITE scoring package. The time-alignment was considered important since segments containing few words can receive an artificially reduced WER if a purely DP based string alignment is used as a basis for the marking scheme.

Figure 8.3 provides plots of WER against average per-frame entropy for each of the 81 segments returned by the segmenter. The area of shading around each point is proportional to (a) the number of words in the segment (left panel); and (b) the time taken to decode the segment (right panel). The plots show (1) that there is a high correlation between the WER and the confidence value for a segment; and (2) that although the high WER segments contain few words, they account for a large proportion of the overall decoding time.

By setting the threshold on the confidence measure to an appropriate value, it is possible to exclude those segments which are expensive to decode but are nevertheless poorly recognised. In this way, decoding time may be reduced by up to 70% without greatly increasing the overall word error rate for the episode. This point is illustrated by the left panel in figure 8.4, which shows the overall WER (i.e. rejected words constitute deletions) as a function of computational cost.⁴ Each point represents

³ ABC Nightline: episode 05/23/96.
⁴ Computational cost is calculated as the total number of nodes in the tree-structured lexicon that were activated during the search, scaled to the range [0,2].


Figure 8.3: WER plotted against average per-frame entropy for 81 segments obtained from approximately 30 minutes of BN data. Points are weighted by the number of words (Left) and decoding time (Right).

a relaxation of the confidence threshold, allowing more segments to be decoded. The flattening of the graph clearly indicates the diminishing returns of decoding each successively lower confidence segment.⁵

The right panel of figure 8.4 plots the cumulative WER (i.e. the WER is computed relative only to the reference transcription of the accepted segments) as the number of segments accepted for decoding increases. The upper line plots the increasing WER for segments ordered by confidence value, whereas the lower line simulates the case where the classifier is a perfect predictor of WER (i.e. segments are ordered by their individual WER—the oracle condition). The two operating points marked on the plot are for (1) all the words in the NIST supplied transcription of the episode; and (2) those words assigned to the F0 category. Both lines pass very close to these two operating points. If the smoothed per-frame entropy profile had provided an inappropriate segmentation of the data, mixing poorly and well matched acoustics within individual segments, reaching the F0 operating point would not be possible.


Figure 8.4: Left: Overall WER as a function of computational cost. Right: Cumulative WER for accepted segments. Each point represents a relaxation of the confidence threshold, allowing more segments to be decoded.

5 The minimum WER of 63% is calculated relative to all 5186 words that occur in the half hour broadcast, not just the 3500 used for the Hub-4 evaluation. Any words remaining outside the NIST/LDC supplied transcription, such as those present in commercials and sports reports, were transcribed by the author.


Rejection scheme        Segmentation   WER (%)
Baseline                auto           27.1
Oracle                  hand           26.0
$S(n_s, n_e)$ based     hand           26.5
Oracle                  auto           26.5
$S(n_s, n_e)$ based     auto           26.8

Table 8.1: Rejection schemes for segmentations drawn from Hub-4E-97. The baseline scheme uses the automatic segmenter and all segments are decoded. A decrease in WER from 27.1% to 26.5% is statistically significant for this data set.

8.3.2 An Alternative Segmentation

Values of the confidence measure $S(n_s, n_e)$ were also investigated as the basis for the rejection of segments from Hub-4E-97. For the purposes of the DARPA/NIST evaluations, any decoding hypotheses output by the recogniser during intervals labelled as inter-segment gaps are treated as errors. As the timings of the inter-segment gaps are not supplied for the evaluation test, some method for their automatic detection and rejection has the potential to reduce the WER of the recogniser. The broader picture of rejecting low confidence segments given in section 8.3.1 indicates that reductions both in WER and decoding time should be possible.

Results given in table 8.1 compare WERs for different segmentations subject to an entropy-based classification. The hand segmented condition corresponds to the segmentation provided by the NIST/LDC hand transcription of the data. The automatic segmentation was not generated using the procedure described in section 8.2, but was created using the public domain segmenter developed by Matthew Siegler and colleagues at Carnegie Mellon University, described in [186]. The oracle rejection condition corresponds to discarding all inter-segment gaps, as marked in the hand transcription, and any automatically derived segments whose duration overlaps by more than 50% with any of the hand marked inter-segment gaps. The table shows (1) that Hub-4E-97 is a relatively 'clean' dataset, i.e. inter-segment gaps account for a relatively small proportion of the recognition errors, making an improvement of only 1.1% absolute error over the 27.1% WER baseline (obtained using the 1997 ABBOT BN evaluation entry) available; but (2) that reasonable fractions of what gain is available can be recovered by basing the acceptance or rejection of a segment upon its associated value of $S(n_s, n_e)$.

8.3.3 Noisier Data

In order to more rigorously test the classification of segments according to their $S(n_s, n_e)$ values, it was decided to apply the scheme to a data set containing more examples of non-speech sounds and speech in the presence of background music than is available from BN data. An appropriate data set, recorded primarily for the purposes of speech/music distinction experiments, was generously supplied by Eric Scheirer and Malcolm Slaney [183]. This data set is composed of 246 monophonic, pre-segmented samples of speech and music recorded from a variety of San Francisco Bay area radio shows, each 15 s in length. The music samples cover both the vocal and instrumental categories and span a broad range of styles, including jazz, pop, country, salsa, reggae and rock. The speech samples include examples of studio and telephone quality recordings, spoken with varying levels of background noise and music. Some examples of double-talk are also found amongst the speech samples. The acoustics were captured using a digital FM tuner, sampled at 22.05 kHz (downsampled to 16 kHz for these experiments) and stored as 16 bit values. The dataset as a whole constitutes just over an hour of particularly diverse audio data.

The data was passed through three CI phone classifiers: two 256 state RNNs, one trained forward and the other backward in time, using PLP features, on approximately 104 hours of BN data (the 1996 and 1997 training sets), and an 8 k HU MLP trained using MSG features on approximately 200 hours of BN data (all training sets). These networks formed the basis of the acoustic model for the SPRACH entry to the 1998 DARPA BN evaluations [38].


The posterior probabilities output by these classifiers were combined by averaging in the log domain in the usual way. Cumulative and overall WER results are shown in figure 8.5 for segments ordered according to (a) their oracle WER; and (b) their associated value of $S(n_s, n_e)$.6 The OOV rate for this data set using the 65 k word lexicon used for decoding was approximately 3%.


Figure 8.5: Cumulative and overall WER as a function of the number of 15 second segments decoded from the Scheirer/Slaney dataset. The segments are ordered by (1) their average per-frame entropy $S(n_s, n_e)$; and (2) their oracle WER.

A number of informative trends can be seen in the figure. For the oracle condition, a substantial drop in overall WER can be observed, with a minimum occurring when approximately 130 of the 246 segments have been decoded. A steady and substantial rise in overall WER is seen as more of the music segments become accepted for decoding, each contributing a large number of insertions. A similarly smooth cumulative WER curve, steadily rising up to the maximum, is observed for the oracle condition. The overall WER curve for the entropy condition closely tracks that for the oracle condition. For cumulative WER, the curve for the entropy condition reveals that some segments with a low value of $S(n_s, n_e)$ received an unexpectedly high WER. Upon closer inspection, these segments were found to take the form of interviews between a presenter and a studio guest. Although these segments contained clean speech, the discourse was typically highly disfluent. (Five segments in the data set contained exclusively Spanish utterances and were therefore decoded with very high WERs. The prevailing acoustic conditions in these segments, three of which contained speech in the presence of background music while the other two were adverts, meant that they also received high values of $S(n_s, n_e)$, however.) With the exception of the initial 'blip' in the cumulative WER curve for the entropy condition, the plot provides further evidence that the value of $S(n_s, n_e)$ for a segment is a good predictor of its WER.

8.4 Incorporation of Temporal Properties

Several observations were made whilst analysing the output of the acoustic model given music as input:

- As noted in section 8.2.1, some non-speech sounds, such as that from a trumpet, can be well matched by phone class models. In general, however, non-speech sounds elicit far more indistinct patterns of activation at the output of the acoustic model than fluent speech.

- Non-speech is often characterised by long steady state periods of phone model activation. This is in contrast to the relatively rapid and abrupt transitions from phone to phone that are seen for fluent speech.

6 A reference transcription of the data required to compute the WER statistics was created jointly by the author and Dan Ellis, at ICSI.


These points are exemplified by the plots given in figure 8.6. The left panel plots the outputs of a 54 CI phone class acoustic model evolving over the duration of a trumpet sound. The right panel plots the output of the acoustic model over a similar duration of clean, fluent speech. The posterior probability estimates for the CI phone classes are represented by the intensity of shading; black = 1, white = 0.

[Figure: phone class posterior probability estimates plotted against time (frames). Left panel: a trumpet sound. Right panel: the utterance "Smoking deaths are going to triple".]

Figure 8.6: The output of the acoustic model evolving over the duration of a trumpet sound (Left) and a similar length interval of clean, fluent speech (Right).

The observations suggest that in addition to a per-frame measure of general acoustic model match, features based on the temporal properties of the signal are also required for a reliable classification of speech and non-speech. A 'dynamism' measure was therefore designed to complement the average per-frame entropy for a segment with a summary of the segment's temporal pattern of phone activation:

The dynamism of a segment of an acoustic signal is the squared difference between the phone class posterior probability vectors for adjacent frames, averaged over the segment:

$$D(n_s, n_e) = \frac{1}{n_e - n_s} \sum_{n = n_s + 1}^{n_e} \left\| \mathbf{p}(n) - \mathbf{p}(n-1) \right\|^2 \qquad (8.4)$$

where $\mathbf{p}(n)$ denotes the vector of phone class posterior probability estimates at frame $n$.

Both the number and abruptness of phone transitions contribute to a high dynamism value.
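A small sketch of this computation, under the same assumptions about the posterior matrix as in the earlier entropy sketch, might look as follows.

```python
import numpy as np

def dynamism(posteriors, n_s, n_e):
    """Dynamism of a segment (equation 8.4): the squared difference
    between phone class posterior vectors of adjacent frames, averaged
    over the segment."""
    segment = posteriors[n_s:n_e]
    diffs = np.diff(segment, axis=0)  # adjacent-frame differences
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```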

The scatter plots given in figures 8.7 and 8.8 show the relationship between dynamism and $S(n_s, n_e)$ values for segments drawn from the (hand segmented) Hub-4E-97 and Scheirer/Slaney data sets. The 'music' segments plotted in figure 8.7 are 101 examples of vocal and instrumental music, the 'speech' segments are 80 examples of speech spoken with and without background noise, and the 'speech plus music' segments are 60 examples of speech in the presence of background music. The values plotted in figure 8.7 were computed from the outputs of 2 RNNs using PLP features, merged in the usual fashion. The values plotted in figure 8.8, on the other hand, were computed from the outputs of an 8 k HU MLP using MSG features. Some interesting trends which can be discerned from these plots are:

- Given PLP features, a clear trend exists for higher dynamism values and lower $S(n_s, n_e)$ values for clean speech than for non-speech and noisy speech, and vice versa.

- Given MSG features, a similar trend for $S(n_s, n_e)$ values exists, but the predictive power of the dynamism measure is severely impaired. This phenomenon may be explained by the 'sluggish' nature of the MSG features, which emphasise the slowly varying properties of an acoustic signal.


- The F0 and inter-segment gap distributions and the speech and music distributions are well separated.

- Although the plots for all BN focus conditions and either inter-segment gaps or music form more of a continuum, the distributions for the two data types do not overlap to a large degree and a useful decision boundary between the two distributions would not be hard to place.

- Due to the orientation of the distributions in the $S(n_s, n_e)$-dynamism plane, the best partition boundary will not be vertical, showing the addition of the dynamism measure to be beneficial (a minimal sketch of such a boundary is given after this list).
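The following sketch illustrates a linear decision function of the kind suggested by the last point; the weights, bias and sign convention are illustrative assumptions, and in practice they would be fitted to labelled data.

```python
def non_speech_score(entropy, dyn, w_entropy=1.0, w_dyn=-4.0, bias=-1.5):
    """Linear decision function in the S(n_s, n_e)-dynamism plane.

    A positive score indicates non-speech (high entropy, low dynamism).
    A non-zero weight on dynamism tilts the decision boundary away from
    the vertical, reflecting the benefit of the second feature.
    """
    return w_entropy * entropy + w_dyn * dyn + bias
```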

[Figure: four scatter panels of $S(n_s, n_e)$ against dynamism. Top left: inter-segment gaps versus all focus conditions. Top right: inter-segment gaps versus the F0 condition. Bottom left: music versus focus conditions. Bottom right: music, speech plus music and speech.]

Figure 8.7: Scatter plots of $S(n_s, n_e)$ values against 'dynamism' for various conditions drawn from the Hub-4E-97 and Scheirer/Slaney data sets. PLP features.

[Figure: two scatter panels of $S(n_s, n_e)$ against dynamism. Left: inter-segment gaps versus all focus conditions. Right: inter-segment gaps versus the F0 condition.]

Figure 8.8: Scatter plots of $S(n_s, n_e)$ values against 'dynamism' for various conditions drawn from the Hub-4E-97 and Scheirer/Slaney data sets. MSG features.


8.5 Summary

This chapter has focused upon the segmentation of (relatively) unconstrained acoustics input to a recogniser and the classification of segments as examples of two pragmatic categories: recognisable and unrecognisable acoustics. The second category comprises not only segments containing non-speech sounds but also speech which is not sufficiently well matched by the acoustic model to yield an acceptable WER. As the confidence measure $S(n_s, n_e)$ quantifies the general quality of the acoustic model match, it seems well suited to the filtering task.

One method for applying $S(n_s, n_e)$ to the task is to base a segmentation directly upon the per-frame entropy profile of the signal. Due to the piece-wise stationary modelling assumption, however, spikes occur in the 'raw' profile at phone transitions. The fine structure of the profile can be removed to reveal the underlying trends by smoothing, but the heavy degree required compromises the temporal accuracy of any putative segmentation points. Although the segments obtained after 'fine-tuning' the boundaries are of a high quality, i.e. have homogeneous acoustic properties, it may be argued that $S(n_s, n_e)$ is better suited to classifying segments that are obtained via some other means.

The value of $S(n_s, n_e)$ calculated over the duration of a segment was found to be highly correlated with the segment's WER. As the phone class posterior probability estimates required to compute $S(n_s, n_e)$ are available prior to the computationally expensive decoding stage, rejecting unrecognisable segments on the basis of their associated $S(n_s, n_e)$ value offers the potential to reduce the computational expense of recognition. The rejection of segments on this basis was found to yield substantial savings in computational expense with little increase in overall WER.

It was observed that some non-speech sounds can occasionally elicit strong matches to subword speech sound classes. Background hiss can be mistaken for the sibilant [s], for example. It may be conjectured that this phenomenon is a product of the global nature of class boundaries inferred from purely speech data during the training of the acoustic model. It was also observed, however, that non-speech sounds tend to yield temporal patterns of acoustic model activation that are distinct from those obtained for speech. The dynamism measure is designed to contrast the relatively rapid transition from phone to phone seen in fluent speech with the longer periods of steady state phone class activation typically seen for non-speech. The per-frame entropy and dynamism measures are complementary and their combination may well be useful for the filtering task.


Chapter 9

Discussion

This chapter explores some other applications suggested by the general framework for thinking about confidence measures adopted in this dissertation, together with some issues raised by the results of the experiments.

9.1 Recogniser Combination

One of the features of the ABBOT/SPRACH system has been the combination of multiple phone class posterior probability estimators to form the acoustic model. The motivations for this are twofold. Firstly, the combination provides the opportunity to combine different probability estimators, such as RNNs trained both forward and backward in time and MLPs, for example. Secondly, multiple acoustic representations, such as the use of PLP, MEL+ and MSG, may be exploited. The motivation underlying these combinations is that, given a diverse membership, a committee may be able to benefit from the strengths of the different members under differing conditions, mitigating their individual weaknesses and providing a robust acoustic model.

If the networks incorporated into the acoustic model all estimate posterior probabilities for the same set of classes, it is possible to combine the probability estimates at the frame-level. Several strategies for combination at this level have previously been investigated within the ABBOT system [86, 87], including simple linear averaging, log domain averaging and mixture of experts techniques. The conclusions of these experiments were (1) that such combinations were useful in terms of reducing WER; and (2) that averaging probabilities in the log domain provided consistently improved performance over the other combination techniques. The acoustic model used for the SPRACH entry into the 1998 DARPA BN evaluations employed the log domain average of CI phone class posterior probabilities estimated by two RNNs, trained forward and backward in time using PLP features, and an MLP trained using MSG features [38].

Another strategy is to combine the outputs of multiple recognition streams at the hypothesis-level. This approach is suitable for combining the outputs of several classifiers providing probability estimates for different sets of classes. A mechanism for performing a hypothesis-level combination is provided by the ROVER (recogniser output voting error reduction) system developed by NIST [58].1 The system is schematically illustrated in figure 9.1. From the figure, it can be seen that the system is composed of two modules, an alignment module and a voting module. The alignment module grows a single word transition network (WTN) from the input recognition streams. Each successive stream is aligned (via DP techniques) to the existing base WTN, which was in turn created through the alignment of the previous recognition streams. (Although this sequential procedure is suboptimal, it does facilitate considerable computational savings.) The voting module then evaluates the alternatives at each branch point in the completed WTN, by means of a vote or confidence weighted vote. Following the voting stage, the output of the system is a single, composite recognition stream.

1 Available from: http://www.nist.gov/speech/software.htm


[Figure: recognition streams A, B and C feed the alignment module, whose output passes to the voting module, which produces the output stream.]

Figure 9.1: A schematic illustration of the two module architecture of the ROVER package.

                     WER (%)
Recogniser           Hub-4E-98-1   Hub-4E-98-2
Stream-A             23.0          21.5
Stream-B             27.1          26.0
Stream-C             23.8          21.0
ROVER (vote)         22.3          20.4
ROVER (confidence)   21.6          19.7

Table 9.1: WERs on Hub-4E-98 for three recognition streams used in isolation and in ROVER combination.

Confidence measures may be used to weight the contributions from an ensemble of recognition streams to the final decoding hypothesis, where the combination may be made at the frame- or hypothesis-level. Two recent studies investigating the use of confidence measures for the frame-level combination of different phone class posterior probability estimators are reported in [110] and [4]. In common with a number of other studies, e.g. [86], both investigations find that combining the outputs of phone classifiers, especially those employing different acoustic features, provides reductions in WER over any of the component classifiers used in isolation. Both studies also find basing the classifier weighting on confidence estimates computed over intervals longer than a single frame to be beneficial. In [4], it was found that weighting classifiers according to per-frame entropy values computed from their output probability distributions, averaged over a local analysis window (e.g. 300 frames), was equally as effective as averaging the phone class probability estimates in the log domain. Small additional reductions in WER were observed when the classifiers were weighted according to the difference between the highest and next highest per-frame posterior probabilities (the delta value), again averaged over a local analysis window.

A purely acoustic confidence measure was used in conjunction with the ROVER system to merge three recognition streams for the SPRACH entry to the 1998 DARPA BN evaluations [38]. Each of these recognition streams shared the same language model, vocabulary and decoder, but differed in their acoustic model:

- Stream-A employed CI phone class posterior probability estimates derived from two 256 state RNNs, trained forwards and backwards in time, on PLP features obtained from approximately 104 hours of BN data (the 1996 and 1997 training sets), and an 8 k HU MLP trained on MSG features obtained from approximately 200 hours of BN data (all training sets). The outputs of the three networks were merged at the frame-level by averaging in the log domain.

- Stream-B employed the MLP alone.

- Stream-C employed CD phone class probabilities derived from two RNNs trained forwards and backwards in time. The outputs of the two networks were again merged at the frame-level by averaging in the log domain.

Table 9.1 provides WER results for each of these streams in isolation and for their ROVER combination, on both halves of the 1998 BN evaluation test data (Hub-4E-98). The results show that combination yields lower WERs than any stream used in isolation and that the use of a confidence weighted vote provides a statistically significant improvement over an unweighted vote.
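The voting stage can be illustrated with a small sketch. The scoring rule below, which interpolates a word's frequency of occurrence with its mean confidence via a trade-off parameter, follows the general form of the ROVER voting schemes; the alpha value and the data structures are illustrative assumptions.

```python
from collections import defaultdict

def vote_at_branch_point(candidates, alpha=0.5):
    """Select a word at one WTN branch point.

    candidates: a list of (word, confidence) pairs, one per aligned
    recognition stream. With alpha = 1.0 this reduces to an unweighted
    (frequency only) vote; alpha = 0.5 is an illustrative trade-off.
    """
    counts = defaultdict(int)
    confs = defaultdict(list)
    for word, conf in candidates:
        counts[word] += 1
        confs[word].append(conf)

    def score(word):
        freq = counts[word] / len(candidates)
        mean_conf = sum(confs[word]) / len(confs[word])
        return alpha * freq + (1.0 - alpha) * mean_conf

    return max(counts, key=score)

# Three aligned streams proposing words for the same slot:
print(vote_at_branch_point([("too", 0.4), ("two", 0.9), ("too", 0.3)]))
```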


                                     WER (%)
Decoding condition                   Female   Male   Overall
1: 8 k HU MLP/MSG (142 hrs mixed)    30.9     31.7   31.4
2: 8 k HU MLP/MSG (43 hrs female)    27.3     52.8   43.3
3: 8 k HU MLP/MSG (99 hrs male)      43.2     30.9   35.5
likelihood, 2 plus 3                 -        -      29.2
likelihood, 1 plus 2 plus 3          -        -      28.8
oracle, 1 plus 2 plus 3              -        -      27.5

Table 9.2: A comparison of WERs obtained using gender-dependent acoustic models and a gender-independent acoustic model, either in isolation or combination (selection performed at the utterance-level).

9.2 Gender Specific Acoustic Models

A number of 'state-of-the-art' LVCSR systems obtain small but useful reductions in WER through the use of gender-dependent (GD) acoustic modelling, e.g. [71, 203, 211]. One strategy for making use of these specialised acoustic models is to make a gender decision for some segment of acoustics prior to decoding, using two gender models distinct from the recognition models, and to subsequently decode the segment using the appropriate GD acoustic model, e.g. [71]. Another strategy is merely to perform two parallel decodings of a segment using the GD acoustic models and to select the decoding with the highest likelihood, e.g. [211]. (It is interesting to note that the model with the highest likelihood needn't match the gender of the speaker in this case.) The study described below appeals to the notion of parsimony and asks whether a reliable gender decision can be made using the recognition models, but without recourse to a computationally wasteful parallel decoding.

Dan Ellis, at ICSI, has obtained some preliminary results for making a gender decision prior to the decoding stage based upon the relative values of the general acoustic confidence measure $S(n_s, n_e)$ for two GD MLPs estimating CI phone class posterior probabilities. Table 9.2 provides some results on Hub-4E-97-subset, using a narrow decoding beam, which illustrate the benefit of decoding a segment of acoustics using the appropriate GD acoustic model.

The results given in the first three rows of the table give the performance of three different acoustic models used in isolation:

1. An 8 k HU MLP trained to estimate CI phone probabilities using MSG features obtained for approximately 142 hours of mixed (male and female) BN data.

2. A similar MLP trained using approximately 43 hours of female BN data.

3. A similar MLP trained using approximately 99 hours of male BN data.

Unsurprisingly, the GD networks perform better on the appropriate gender-specific data, but worse than the gender-independent network on the mixed data. The last three rows of the table give the performance of two selection schemes: (1) the decoding of an utterance is selected on the basis of its overall (scaled) likelihood (cf. [211]); and (2) the decoding of an utterance is selected on the basis of its known WER (the oracle condition). The results in the table (1) confirm that subdividing the acoustic modelling task by gender is beneficial in WER terms; and (2) show that a lower WER can be obtained if a gender-independent model is incorporated into the selection scheme, in addition to the GD models. This second result indicates that some speakers (or acoustic conditions) at the 'tail-ends' of the gender model distributions are better matched by a gender-independent model.

Although table 9.2 shows that WER reductions are possible, the two selection schemes described above are computationally wasteful since multiple decodings of the data must be made. A much more desirable situation is to be able to select the appropriate acoustic model prior to the decoding stage. Table 9.3 provides a WER comparison between various acoustic model combination and selection schemes.


                                    WER (%)
Decoding condition                  Female   Male   Overall
1: 4 k HU MLP/PLP (50 hrs mixed)    29.6     33.6   32.1
2: 2 k HU MLP/PLP (25 hrs female)   28.3     53.1   43.8
3: 2 k HU MLP/PLP (25 hrs male)     42.1     33.3   36.6
4: log domain merge, 1 plus 2       27.6     38.5   34.4
5: log domain merge, 1 plus 3       31.7     32.3   32.1
likelihood, 1 plus 2 plus 3         27.5     32.3   30.5
likelihood, 1 plus 4 plus 5         27.5     32.4   30.6
$S(n_s, n_e)$, 2 plus 3             27.8     33.3   31.2
$S(n_s, n_e)$, 4 plus 5             27.6     32.3   30.6

Table 9.3: A comparison of WERs obtained using gender-dependent acoustic models and a gender-independent acoustic model, either in isolation or combination (either through merging probabilities at the frame-level or through acoustic model selection performed at the utterance-level).

The first three rows of the table, as before, provide WERs obtained using the GD and gender-independent acoustic models used in isolation. The results in the 4th and 5th rows of the table correspond to two of the three possible permutations for merging the probability estimates output by the models at the frame-level. The last four rows of the table give the results for permutations of two schemes for selecting between the recognition streams corresponding to the first five rows. The two selection criteria are: (1) the decoding of an utterance is selected on the basis of its overall (scaled) likelihood; and (2) an acoustic model is selected on the basis of its associated value of $S(n_s, n_e)$, averaged over the utterance. (Note a change in features, acoustic model configurations and quantities of training data for this table.)

Table 9.3 shows that comparable results can be obtained for a selection scheme based upon $S(n_s, n_e)$ values, available prior to decoding, and for a scheme based upon the selection of utterance decodings from their overall (scaled) likelihoods, available after decoding. This result indicates that $S(n_s, n_e)$ is useful for a filtering task that is more refined than the distinction between 'recognisable' and 'unrecognisable' portions of acoustics. The result also shows, therefore, that $S(n_s, n_e)$ has potential for a number of filtering tasks, such as the distinction between adult and child speech or between accents, for example, where the motivations for such distinctions may not be limited to just reductions in WER. The key advantage of using $S(n_s, n_e)$ for such a filtering task is its low computational overhead.
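A sketch of this pre-decoding selection is given below; the model names and the inclusion of a gender-independent network are illustrative assumptions.

```python
import numpy as np

def average_per_frame_entropy(posteriors, n_s, n_e, eps=1e-12):
    """Average per-frame entropy S(n_s, n_e), as in section 8.3.1."""
    seg = posteriors[n_s:n_e]
    return float(np.mean(-np.sum(seg * np.log(seg + eps), axis=1)))

def select_acoustic_model(posteriors_by_model, n_s, n_e):
    """Select the acoustic model whose posteriors yield the lowest
    average per-frame entropy over the utterance.

    posteriors_by_model: dict mapping a model name (e.g. 'female',
    'male', 'independent') to its (frames, classes) posterior matrix;
    the names are illustrative.
    """
    return min(
        posteriors_by_model,
        key=lambda name: average_per_frame_entropy(
            posteriors_by_model[name], n_s, n_e),
    )
```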

9.3 Training Data Selection

The work described in this section was carried out by Tony Robinson, in Cambridge. The idea of a training data selection procedure can be envisaged in both a supervised and, perhaps more excitingly, an unsupervised recogniser training regime. In supervised mode, confidence measures can be used to filter out incorrect training targets obtained from the forced alignment of the reference transcription to some acoustic training data. As $N$-best decoding statistics are not applicable to this task, and language model mismatch to the reference transcription is highly unlikely (due to relatively weak $n$-gram constraints), a purely acoustic confidence measure is desirable. Since acceptor HMMs are particularly well suited to producing purely acoustic confidence measures, a corollary is that acceptor HMMs are well suited to the training data selection task.

-gram constraints), a purely acoustic confidence measure is desirable. Since acceptor HMMs areparticularly well suited to producing purely acoustic confidence measures, a corollary is that acceptorHMMs are well suited to the training data selection task.The histogram given in figure 9.2 was created using frame-level values of , derived from abootstrapped acoustic model, averaged over the duration of (automatically generated) acoustic seg-ments from approximately 50 hours BBC news and current affairs recordings.The segments making up the low-confidence tail of the distribution could be excluded from the train-ing set (by setting an appropriate threshold on the averaged value of ), based on the hypo-thesis that they contain low quality training targets. A clear danger which must be avoided in thiscase, however, is the tendency to only retain portions of training data which are already well matchedby the acoustic model, to the detriment of the final model.



Figure 9.2: A histogram of the values of nPP averaged over segment durations for approximately 50 hours of BBC news and current affairs recordings.

Training Regime                           WER (%)
All data (baseline)                       37.1
Confidence thresholding (experimental)    36.6

Table 9.4: WERs obtained for acoustic models resulting from two different training regimes.

In a preliminary study, reported in [2], two acoustic models were trained for the ABBOT system: one using all 50 hours of the available BBC training data and another trained using 75% of the data, where the other 25% was rejected as having insufficient confidence. The WERs given in table 9.4 were obtained by using each of the two acoustic models to decode (in real time) a test set composed of two BBC news broadcasts. Although the reduction in WER for the experimental acoustic model over the baseline is not statistically significant, it should be noted that the experimental model was trained using less data. The reduction in WER in this study, coupled with the encouraging results reported for other training data selection schemes described in section 4.3, motivates further investigation in this area. The nPP and $S(n_s, n_e)$ measures could prove to be complementary for this task, since the latter could be used to identify low-confidence training targets due to noisy or degraded acoustics.

9.4 Computing Full Model Probabilities

The acceptor HMM formalism described in chapter 2 is typically trained and used for recognition in Viterbi mode. Section 2.4.2 described how, by computing the full model probability of the reference transcription (using the forward-backward algorithm), an acceptor HMM can be trained using soft targets, which better reflect the non-instantaneous changes in the 'state' of the vocal apparatus. By computing the full model probability in decoding, it may be possible to derive improved posterior probability estimates of the model given the data at all levels, which will in turn improve the quality of any confidence measures derived from these estimates.
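For concreteness, a minimal sketch of the forward pass that yields a full model probability, summing over all state paths rather than taking the Viterbi maximum, is given below; the generic transition and observation structures are assumptions, not the specific acceptor HMM used here.

```python
import numpy as np

def full_model_log_probability(log_obs, log_trans, log_init):
    """Log probability of an observation sequence under the full model.

    log_obs:   (T, S) per-frame log scores for S states.
    log_trans: (S, S) log transition matrix, log_trans[i, j] = log P(j | i).
    log_init:  (S,) log initial state distribution.

    Summing over paths (logaddexp) rather than maximising is what
    distinguishes this from a Viterbi pass.
    """
    T, _ = log_obs.shape
    alpha = log_init + log_obs[0]
    for t in range(1, T):
        # alpha_t(j) = logsum_i [alpha_{t-1}(i) + log a_ij] + log b_j(o_t)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_obs[t]
    return float(np.logaddexp.reduce(alpha))
```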

9.5 Improved Pronunciation Modelling

Discovering the systematic effects of factors such as speaking rate, dialect and co-articulation is vital for accurately predicting the pronunciation of a word. Accurate pronunciation models are important not only for recognition, but also for acoustic model training, since subword-level targets are often derived from a forced alignment of the reference transcription, which relies upon the pronunciation lexicon.


The quality of the acoustic model is thus indirectly dependent upon the quality of the pronunciation models. This fact highlights the need for a holistic approach to ASR system improvement.

Although incorporating the factors listed above into a model of pronunciation has so far proven to be difficult, confidence measures can be used to highlight areas where existing pronunciation models match the data least well. The application of the confidence measure to the task of baseform evaluation within a data-driven pronunciation learning process was described in chapter 7. Examples of confidence-based evaluations of three phenomena affecting pronunciation are given in figure 9.3.

The top two panels of the figure illustrate the effect of stress upon pronunciation. The top left panel shows that following the word going, a stressed version of the word too is well modelled by the baseform [tcl t uw]. The top right panel shows that the same baseform does not provide nearly as good a match to the unstressed homophone to, however. In this case the word pair is realised as something closer to "gonna" and the initial [tcl] and [t] of [tcl t uw], and to some extent the final [ng] of [gcl g ow ix ng], are assigned appropriately low confidence. The centre panels illustrate that with an increased speaking rate, the leading [hh] in have is dropped and [ae] is substituted by [ax]. The baseform [w el ax v] will receive higher confidence than [w el hh ae v] in this case. The bottom panels illustrate the effect of co-articulation. The word found is well modelled by the baseform [f aw n dcl] when followed by the word in. When found is followed by something, however, the [dcl] is dropped (in anticipation of the articulator configuration required for the [s]) and its hypothesis is hence assigned low confidence. This pattern of articulation illustrates the relatively common phonological rule, [n dcl s] → [n s].

9.6 Improved Language Modelling

As the recognition tasks tackled by 'state-of-the-art' LVCSR systems migrate from read financial newspaper articles to more challenging domains, such as the decoding of television and radio news broadcasts or conversational telephone speech, the variety of speaking styles and subject matter mushrooms considerably. The relatively poor performance of the grammatical confidence measure and the lattice density based combined confidence measure for utterance verification on BN data, described in chapter 6, makes a case for improved language modelling for more challenging tasks. Given this situation, perhaps a model more sophisticated than the standard monolithic $n$-gram will provide better language modelling and hence confidence measure performance?

An attractive alternative to a monolithic $n$-gram is to condition the probability of a word not only upon its local word history, but also upon the long-distance constraints provided by the current topic. An example of the dependence of word combinations upon topic is given by the words Sheffield and Wednesday. The pairing of these two words is much more likely when the current topic is soccer as opposed to any other.

Two popular approaches to adapting a standard monolithic $n$-gram to the current topic are cache-based and mixture-based $n$-grams. Cache-based models boost the probability of recently observed words relative to their static probabilities, according to a weighting coefficient $\lambda$:

$$P(w_i \mid h_i) = (1 - \lambda) P_{static}(w_i \mid h_i) + \lambda P_{cache}(w_i) \qquad (9.1)$$

where $h_i$ denotes the local word history and $P_{cache}$ is estimated from the recently observed words.

The goal underlying a mixture-based $n$-gram is to capture the relative differences in word frequencies seen for different topics by training individual $n$-grams for each topic. A problem with this approach, however, is that only a fraction of the original training text is available for training each topic specific $n$-gram. This fragmentation of the training data compromises the reliability of the individual $n$-gram estimates. The typical remedy for this is to interpolate a set of topic-specific models with a topic-independent model ($P_0$) which is trained on all the available data:

$$P(w_i \mid h_i) = \lambda_0 P_0(w_i \mid h_i) + \sum_{t=1}^{T} \lambda_t P_t(w_i \mid h_i) \qquad (9.2)$$

where $\sum_{t=0}^{T} \lambda_t = 1$.
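Both interpolation schemes can be sketched in a few lines; the component probability functions below are placeholders standing in for whatever static, cache and topic-specific n-gram models are available.

```python
def cache_interpolated_prob(p_static, p_cache, word, history, lam=0.2):
    """Equation 9.1: interpolate a static n-gram with a cache model.
    lam = 0.2 is an illustrative weighting coefficient."""
    return (1.0 - lam) * p_static(word, history) + lam * p_cache(word)

def topic_mixture_prob(p_models, weights, word, history):
    """Equation 9.2: p_models[0] is the topic-independent model P0 and
    the rest are topic-specific n-grams; weights must sum to one."""
    return sum(w * p(word, history) for w, p in zip(weights, p_models))
```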


[Figure: six panels of phone class posterior probability estimates plotted against time (frames), each annotated with per-phone log confidence values. Top: instances of the words "Going Too" and "Going To". Centre: two instances of the words "Will Have", spanning 15 and 8 frames respectively. Bottom: instances of the words "Found In" and "Found Something".]

Figure 9.3: Confidence-based evaluations of three pronunciation phenomena: the effect of stress (Top), speaking rate (Centre) and co-articulation (Bottom).


Although both cache-based and mixture-based $n$-grams have been shown to substantially reduce perplexity,2 they have unfortunately not been found to reduce WERs [36]. A third approach to exploiting topic information whilst also enjoying the robust nature of an $n$-gram model integrates the two using the maximum entropy (ME) principle [177]. In this case, rather than building a number of topic-specific $n$-grams, a single model is required to satisfy a number of topic-specific constraints whilst also maximising the entropy (i.e. minimising the bias) of all probability distributions in the model. An attractive property of the ME approach is that the constraints needn't be limited to conditioning $n$-gram probabilities upon topic and may include part-of-speech or syntactic constraints, for example. Small reductions in perplexity and WER for a topic constrained ME based language model using substantially fewer parameters than a mixture-based $n$-gram are presented for the SWB corpus in [107].

2 If the words of a language are considered to be emitted from a source, the average degree of uncertainty in the symbol emitted by that source is estimated by $\hat{H} = -\frac{1}{N} \log_2 \hat{P}(w_1, \ldots, w_N)$, where $w_1, \ldots, w_N$ is assumed to be a sufficiently large sample and $\hat{P}$ is the probability of the word sequence as given by the language model of interest. Since $\hat{P}$ will always be an underestimate (due to independence assumptions), $\hat{H}$ will be lower bounded by the true value of $H$. The (test set) perplexity of a language model is given by $2^{\hat{H}}$ and is roughly defined, e.g. [44], as the average branching factor of a grammar. Any reduction in perplexity on a given test set therefore represents an improved characterisation of the language 'source'.
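As a worked illustration of the footnote's definitions, with a made-up log probability value:

```python
def perplexity(log2_prob, num_words):
    """Test set perplexity 2^H, with H = -(1/N) log2 P(w_1 .. w_N)."""
    H = -log2_prob / num_words
    return 2.0 ** H

# A 1000-word test set assigned log2 probability -7000 by the LM
# gives H = 7 bits/word and a perplexity of 2^7 = 128.
print(perplexity(-7000.0, 1000))
```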


Chapter 10

Conclusion

10.1 Summary of Experimental Results

This dissertation has described the derivation of several complementary confidence measures for an acceptor HMM based LVCSR system, and their application to the tasks of utterance verification, pronunciation model evaluation and the filtering of 'unrecognisable' portions of acoustics from the input to a recogniser. The utterance verification experiments revealed several informative trends:

- The combined confidence measure which draws upon the most sources of information provided the best performance on both the NAB and BN corpora at the word-level. The calculation of this measure requires a large amount of post-processing, however, making it by far the most computationally expensive measure to compute, followed by the other combined measure. The purely acoustic confidence measures required the least processing and were the least computationally expensive to compute.

- In broad terms, rejection was more profitable for BN data than for the clean, read speech of the NAB corpus. For the acoustic conditions within the BN corpus, rejection was more profitable for the noisy FX condition than for the planned, studio quality speech of the F0 condition. Rejection was also more profitable at the phone- than at the word-level.

- With one exception, and given a large recognition vocabulary, profitable rejection was only possible at the word-level for BN data. As the size of the vocabulary was reduced (from 60 k to 5 k words), profitable rejection also became available for NAB data.

- At the word-level, the best combined confidence measure was bettered on NAB data by only one other measure. On BN data, however, it was usurped by a purely acoustic confidence measure. An explanation for this may be that whilst a trigram language model may provide a relatively good match to the constrained grammar of read newspaper text, the same limited model is less able to capture any regularities in the relatively unconstrained utterances which form portions of broadcast news shows, and the corresponding reduction in the predictive power of the language model is reflected in the performance of the combined measure. The poor performance of both combined measures at the phone-level may be explained by the reduced quality of the language model (a phone bigram) used to create and rescore the $N$-best lattices at this level.

- As the value of the general measure of acoustic model match, $S(n_s, n_e)$, is independent of the actual decoding hypothesis, its poor utterance verification performance at the word-level is unsurprising. The measure was found to be considerably more informative with regard to the truth or falsity of decoding hypotheses at the phone-level, however, suggesting that a sizeable fraction of the decoding errors at this level are caused by noisy (or generally out-of-domain) acoustics. This may be explained by the absence of correlates for OOV words or crude pronunciation models at this level.


The overall trends in profitability of rejection for the utterance verification experiments may be explained by the existence of crude pronunciation models which mask the relatively subtle reductions in confidence that occur in clean speech, but not the gross model mismatches elicited by non-speech sounds.

The task of accommodating pronunciation variation in ASR systems is a crucial, but difficult, problem. In order to reliably predict the acoustic realisation of a given word, factors such as dialect, co-articulation, speaking style and rate must be teased apart, understood and modelled. Although an acoustic confidence measure was found to be more effective than other methods for evaluating candidates proposed by a data driven pronunciation model learning process, the WER gains associated with the particular procedure investigated were modest. The lack of improvement in pronunciation modelling is doubly frustrating as there is a clear synergy between pronunciation modelling and confidence measures.

The general measure of acoustic model match $S(n_s, n_e)$ was found to be a good predictor of WER for segments drawn from both the BN corpus and a corpus of speech and music samples. As the value of $S(n_s, n_e)$ for an interval of acoustics is independent from the decoding hypothesis associated with the interval, segments destined to obtain a high WER could be rejected prior to the computationally expensive decoding stage. This facilitated a saving in decoding time with little increase in overall WER for an episode drawn from the BN corpus. The same strategy yielded a marked reduction in overall WER for the corpus of speech and music samples, as valuable computational resources were not squandered attempting to decode speech over periods of music. As segments yielding high WERs were found to be particularly expensive to decode, the savings in computational effort were considerable. By setting an appropriate threshold on the value of the measure, it would be possible to constrain a recogniser to operate within pre-determined WER and response latency bounds. $S(n_s, n_e)$ was also found to be useful for the 'finer grain' filtering task of identifying the gender of a speaker prior to decoding. As the relative values of $S(n_s, n_e)$ for two gender-dependent acoustic models provided a gender decision prior to the decoding stage, the appropriate acoustic model could be used for decoding, providing WER reductions at the cost of a negligible increase in computational expense. These results suggest that the $S(n_s, n_e)$ measure may be useful for a variety of filtering tasks, providing system improvements not limited to the reduction of WER.

Some useful observations were made with regard to the pattern of acoustic model activation for speech and non-speech sounds. Specifically, it was seen that although non-speech sounds typically yield high per-frame entropy values over the distribution of output values for an ANN based phone classifier, some non-speech sounds can provide strong phone model matches and hence low per-frame entropies. It is conjectured that this phenomenon may be a product of inferring global (in feature space) acoustic class boundaries whilst training the classifier exclusively on speech data. In addition to these frame-limited observations, it was also seen that whilst fluent speech data yields a pattern of relatively rapid and abrupt changes in phone activation, non-speech tends to provide longer periods of steady-state activation with more gradual changes. On the basis of these temporal observations, a 'dynamism' function was designed to assist in the distinction between speech and non-speech sounds. Preliminary scatter plots show dynamism and $S(n_s, n_e)$ to be complementary, and so their combination may well yield an improvement in speech/non-speech discrimination over either of the measures used individually.

10.2 Novel Aspects

A key contribution of the thesis has been the demonstration that a framework within which the utility of confidence measures can be explored throughout the recognition process results from the definition of a confidence measure as:

A function which quantifies how well a model matches some speech data; where thevalues of the function must be comparable across utterances,

as opposed to:


The posterior probability of word correctness, given the values of some set of confidenceindicators.

This rather general definition accrues additional benefits when used in conjunction with the more specific categories of acoustic, grammatical and combined confidence measures, as they encourage the design of confidence measures which have simple and explicit links to the recognition models. In contrast to generative HMMs, acceptor HMMs are well suited to producing acoustic confidence measures, as they are trained to directly estimate posterior probabilities for classes of speech sound. Since posterior probabilities are implicitly scaled by the unconditional probability of the acoustics, they are comparable across utterances. The class conditional likelihoods estimated by a generative HMM are relative to the unconditional probability of the acoustics and hence are not comparable across utterances. An additional processing step is therefore required to obtain acoustic confidence measures from a generative HMM.

Some of the described confidence measures were borrowed from the literature, extended to the subword-level, and applied to the ABBOT/SPRACH LVCSR system. Others were specifically designed to exploit the phone class posterior probability estimates available from the particular acceptor HMM investigated.

The main conclusions of the thesis are as follows. First, the derived confidence measures were found to be useful for the tasks to which they were applied, facilitating overall improvements to the system. Second, in agreement with a number of other studies, it was found that confidence measures which draw upon more sources of information may be preferred over those which use less, for the task of utterance verification. A consequence of creating a confidence measure using many information sources is that their individual contributions tend to become conglomerated, obscuring the cause of the low confidence. Confidence measures with simpler, more explicit links to the recognition models are therefore more informative with regard to the timely task of recogniser diagnostics. An example of this diagnostic function is the accumulation of evidence to support the notion that crude pronunciation models mask the relatively subtle reductions in confidence seen for LVCSR on clean, read speech, but not the gross model mismatches elicited by non-speech. Acceptor HMMs are capable of producing confidence measures derived from many information sources and are particularly well suited to producing those with simple and explicit links to the recognition models.


Appendix A

The ICSI/LIMSI Phone Set

ASCII  IPA   Example        ASCII  IPA   Example
pcl    -     (p closure)    bcl    -     (b closure)
tcl    -     (t closure)    dcl    -     (d closure)
kcl    -     (k closure)    gcl    -     (g closure)
p      p     pea            b      b     bee
t      t     tea            d      d     day
k      k     key            g      g     gay
dx     ɾ     dirty
ch     tʃ    choke          jh     dʒ    joke
f      f     fish           v      v     vote
th     θ     thin           dh     ð     then
s      s     sound          z      z     zoo
sh     ʃ     shout          zh     ʒ     azure
m      m     moon           n      n     noon
em     m̩     bottom         en     n̩     button
ng     ŋ     sing
el     l̩     bottle         l      l     like
r      r     right          w      w     wire
y      j     yes            hh     h     hay
hv     ɦ     ahead          er     ɝ     bird
axr    ɚ     butter         iy     i     beet
ih     ɪ     bit            ey     e     bait
eh     ɛ     bet            ae     æ     bat
aa     ɑ     father         ao     ɔ     bought
ah     ʌ     but            ow     o     boat
uh     ʊ     book           uw     u     boot
aw     aʊ    about          ay     aɪ    bite
oy     ɔɪ    boy            ax     ə     about
ix     ɨ     debit          h#     -     (silence)


Bibliography

[1] The Collins English Dictionary, 1979.

[2] D. Abberley, D. Kirby, S. Renals, and T. Robinson. The THISL broadcast news retrieval system. In Proceedings of the ESCA Tutorial and Research Workshop on Accessing Information in Spoken Audio, pages 14–19, Cambridge, UK, April 1999.

[3] M. Adda-Decker and L. Lamel. Pronunciation variants across systems, languages and speaking styles. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 1–6, Rolduc, Netherlands, May 1998.

[4] K. Al-Ghoneim and S. Renals. Confidence-based model combination for connectionist/HMM speech recognition. Submitted to IEEE Signal Processing Letters, 1999.

[5] T. Anastasakos and S. Balakrishnan. The use of confidence measures in unsupervised adaptation of speech recognisers. In Proceedings of the International Conference on Spoken Language Processing, pages 2303–2306, 1998.

[6] A. Asadi, R. Schwartz, and J. Makhoul. Automatic detection of new words in a large vocabulary continuous speech recognition system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 125–128. IEEE, 1990.

[7] A. Asadi, R. Schwartz, and J. Makhoul. Automatic modeling for adding new words to a large-vocabulary continuous speech recognition system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 305–308. IEEE, 1991.

[8] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 49–52. IEEE, 1986.

[9] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Speech recognition with continuous parameter hidden Markov models. Computer Speech and Language, 2:219–234, 1987.

[10] L. R. Bahl, S. Das, P. V. de Souza, M. Epstein, R. L. Mercer, B. Merialdo, D. Nahamoo, M. A. Picheny, and J. Powell. Automatic phonetic baseform determination. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 173–176. IEEE, 1991.

[11] L. R. Bahl, F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:179–190, 1983.

[12] L. E. Baum. An inequality and associated maximisation techniques in statistical estimation of probabilistic functions of Markov processes. Inequalities, 3:1–8, 1972.

[13] L. E. Baum and J. A. Eagon. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73:360–363, 1967.


[14] J. R. Bellegarda and D. Nahamoo. Tied mixture continuous parameter modeling for speechrecognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(12):2033–2045, 1990.

[15] Y. Bengio and P. Frasconi. Input/output HMMs for sequence processing. IEEE Transactionson Neural Networks, 7(5):1231–1249, 1996.

[16] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985.

[17] G. Bernardis and H. Bourlard. Improving posterior confidence measures in hybrid HMM/ANNspeech recognition systems. In Proceedings of the International Conference on Spoken Lan-guage Processing, pages 775–778, 1998.

[18] C. M. Bishop. Neural Networks for Pattern recognition. Oxford University Press, 1995.

[19] A. W. Black and P. A. Taylor. Automatically clustering similar units for unit selection inspeech synthesis. In Proceedings of EuroSpeech, pages 605–608. ESCA, 1997.

[20] J-M. Boite, H. Bourlard, B. D’hoore, and M. Haesen. A new approach towards keywordspotting. In Proceedings of EuroSpeech, pages 1273–1276. ESCA, 1993.

[21] S. F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transac-tions on Acoustics, Speech, and Signal Processing, 27(2):113–120, 1979.

[22] H. Bourlard, B. D’hoore, and J-M. Boite. Optimizing recognition and rejection performancein wordspotting systems. In Proceedings of the International Conference on Acoustics, Speechand Signal Processing, pages 373–376. IEEE, 1994.

[23] H. Bourlard and N. Morgan. A continuous speech recognition system embedding MLP intoHMM. Advances in Neural Information Processing Systems, 2:413–416, 1990.

[24] H. A. Bourlard, H. Hermansky, and N. Morgan. Towards increasing speech recognition errorrates. Speech Communication, 18(3):205–231, 1996.

[25] H. A. Bourlard, Y. Konig, and N. Morgan. REMAP: Recursive estimation and maximisation ofa posteriori probabilities in connectionist speech recognition. In Proceedings of EuroSpeech.ESCA, 1995.

[26] H. A. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid Approach.Kluwer, 1994.

[27] A. S. Bregman. Auditory Scene Analysis. MIT press, 1990.

[28] L. Brieman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and RegressionTrees. Wadsworth, 1984.

[29] J. Caminero, C. de la Torre, L. Villarrubia, C. Matin, and L. Hernandez. On-line garbage mod-eling with discriminant analysis for utterance verification. In Proceedings of the InternationalConference on Spoken Language Processing, 1996.

[30] J. Caminero, E. Lopez, and L. Hernandez. Two-pass utterance verification algorithm for long natural numbers recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 779–782, 1998.

[31] E. I. Chang and R. P. Lippmann. High-performance low-complexity wordspotting using neural networks. IEEE Transactions on Signal Processing, 45(11):2864–2870, 1997.

[32] L. Chase. Blame assignment for errors made by large vocabulary speech recognisers. In Proceedings of EuroSpeech, pages 1563–1566. ESCA, 1997.

[33] L. Chase. Word and acoustic confidence annotation for large vocabulary speech recognition. In Proceedings of EuroSpeech, pages 815–818. ESCA, 1997.

[34] L. L. Chase. Error-Responsive Feedback Mechanisms for Speech Recognizers. PhD thesis, Carnegie Mellon University, 1997.

[35] F. R. Chen. Identification of contextual factors for pronunciation networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 753–756. IEEE, 1990.

[36] P. Clarkson and T. Robinson. The applicability of adaptive language modelling for the Broadcast News task. In Proceedings of the International Conference on Spoken Language Processing, pages 1699–1702, 1998.

[37] R. Cole, L. Hirschman, L. Atlas, M. Beckman, A. Biermann, M. Bush, M. Clements, J. Cohen, O. Garcia, B. Hanson, H. Hermansky, S. Levinson, K. McKeown, N. Morgan, D. G. Novick, M. Ostendorf, S. Oviatt, P. Price, H. Silverman, J. Spitz, A. Waibel, C. Weinstein, S. Zahorian, and V. Zue. The challenge of spoken language systems: Research directions for the nineties. IEEE Transactions on Speech and Audio Processing, 3(1):1–21, 1995.

[38] G. Cook, J. Christie, D. Ellis, E. Fosler-Lussier, Y. Gotoh, B. Kingsbury, N. Morgan, S. Renals, T. Robinson, and G. Williams. The SPRACH system for the transcription of broadcast news. In Proceedings of the DARPA Broadcast News Workshop, pages 161–165, March 1999.

[39] G. D. Cook, J. D. Christie, P. R. Clarkson, M. M. Hochberg, B. T. Logan, and A. J. Robinson. Real-time recognition of broadcast radio speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 141–144. IEEE, 1996.

[40] M. Cooke. Modelling Auditory Processing and Organisation. PhD thesis, University of Sheffield, 1991.

[41] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

[42] S. Cox and R. Rose. Confidence measures for the SwitchBoard database. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 511–515. IEEE, 1996.

[43] N. Cremelie and J-P. Martens. In search of pronunciation rules. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 23–27, Rolduc, Netherlands, May 1998.

[44] J. R. Deller, J. G. Proakis, and J. H. L. Hansen. Discrete-Time Processing of Speech Signals. Macmillan, 1993.

[45] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.

[46] J. G. A. Dolfing and A. Wendemuth. Combination of confidence measures in isolated word recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 3237–3240, 1998.

[47] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.

[48] E. Eide, H. Gish, P. Jeanrenaud, and A. Mielke. Understanding and improving speech recognition performance through the use of diagnostic tools. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 221–224. IEEE, 1995.

[49] R. El Meliani and D. O’Shaughnessy. Lexical fillers for task-independent-training based keyword spotting and detection of new words. In Proceedings of EuroSpeech, pages 2129–2132. ESCA, 1995.

[50] R. El Meliani and D. O’Shaughnessy. New efficient fillers for unlimited word recognition and keyword spotting. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[51] R. El Meliani and D. O’Shaughnessy. Accurate keyword spotting using strictly lexical fillers. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 907–910. IEEE, 1997.

[52] R. El Meliani and D. O’Shaughnessy. Powerful syllabic fillers for general-task keyword-spotting and unlimited-vocabulary continuous-speech recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 811–814, 1998.

[53] P. Fetter, F. Class, U. Haiber, A. Kaltenmeier, U. Kilian, and P. Regel-Brietzmann. Detection of unknown words in spontaneous speech. In Proceedings of EuroSpeech, pages 1637–1640. ESCA, 1995.

[54] P. Fetter, F. Dandurand, and P. Regel-Brietzmann. Word graph rescoring using confidence measures. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[55] P. Fetter, A. Kaltenmeier, T. Kuhn, and P. Regel-Brietzmann. Improved modeling of OOV words in spontaneous speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 534–538. IEEE, 1996.

[56] M. Finke and A. Waibel. Flexible transcription alignment. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pages 34–40, 1997.

[57] M. Finke and A. Waibel. Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition. In Proceedings of EuroSpeech, pages 2379–2382. ESCA, 1997.

[58] J. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 1997.

[59] G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61:268–278, 1973.

[60] E. Fosler, M. Weintraub, S. Wegmann, Y-H. Kao, S. Khudanpur, C. Galles, and M. Saraclar. Automatic learning of word pronunciation from data. In JHU/CLSP Workshop Pronunciation Group, 1996.

[61] E. Fosler, M. Weintraub, S. Wegmann, Y-H. Kao, S. Khudanpur, C. Galles, and M. Saraclar. Automatic learning of word pronunciations from data. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[62] E. Fosler-Lussier and N. Morgan. Effects of speaking rate and word frequency on conversational pronunciations. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 35–40, Rolduc, Netherlands, May 1998.

[63] E. Fosler-Lussier and G. Williams. Not just what, but also when: Guided automatic pronunciation modeling for broadcast news. In Proceedings of the DARPA Broadcast News Workshop, pages 171–174, March 1999.

[64] M. A. Franzini, K-F. Lee, and A. Waibel. Connectionist Viterbi training: A new hybrid method for continuous speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 425–428. IEEE, 1990.

[65] J. Fritsch and M. Finke. ACID/HNN: Clustering hierarchies of neural networks for context-dependent connectionist acoustic modelling. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 505–508. IEEE, 1998.

[66] S. Furui. On the role of spectral transition for speech perception. Journal of the Acoustical Society of America, 80(4):1016–1025, 1986.

[67] S. Furui. Speaker independent isolated word recognizer using dynamic features of the speech spectrum. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1):52–59, 1986.

[68] R. Gaizauskas. Evaluation in language and speech technology. Computer Speech and Language, 12:249–262, 1998.

[69] F. Gallwitz, E. Nöth, and H. Niemann. A category based approach for recognition of out-of-vocabulary words. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[70] P. H. Garthwaite, I. T. Jolliffe, and B. Jones. Statistical Inference. Prentice-Hall, 1995.

[71] J-L. Gauvain, L. Lamel, G. Adda, and M. Jardino. The LIMSI 1998 Hub-4E transcription system. In Proceedings of the DARPA Broadcast News Workshop, March 1999.

[72] J-L. Gauvain and C-H. Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298, 1994.

[73] L. Gillick, Y. Ito, and J. Young. A probabilistic approach to confidence estimation and evaluation. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 879–883. IEEE, 1997.

[74] B. Gold and N. Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley, 1999.

[75] Y. Gotoh, S. Renals, and G. Williams. Named entity tagged language models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 513–516. IEEE, 1999.

[76] S. Greenberg. On the origins of speech intelligibility in the real world. In Proceedings of the ESCA Tutorial and Research Workshop on Robust Speech Communication for Unknown Communication Channels, Pont-à-Mousson, France, 1997.

[77] S. Greenberg. Speaking in shorthand - a syllable-centric perspective for understanding pronunciation variation. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 47–56, Rolduc, Netherlands, May 1998.

[78] S. Greenberg, J. Hollenback, and D. Ellis. Insights into spoken language gleaned from phonetic transcription of the SwitchBoard corpus. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[79] S. K. Gupta and F. K. Soong. Improved utterance rejection using length dependent thresholds. In Proceedings of the International Conference on Spoken Language Processing, pages 795–798, 1998.

[80] R. Haeb-Umbach, P. Beyerlein, and E. Thelen. Automatic transcription of unknown words in a speech recognition system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 840–843. IEEE, 1995.

[81] D. J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons Ltd., 1997.

[82] H. Heine, G. Evermann, and U. Jost. An HMM-based probabilistic lexicon. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 57–62, Rolduc, Netherlands, May 1998.

[83] J. Hennebert, C. Ris, H. Bourlard, S. Renals, and N. Morgan. Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems. In Proceedings of EuroSpeech, pages 1951–1954. ESCA, 1997.

[84] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738–1752, 1990.

[85] L. Hetherington. New words: Effect on recognition performance and incorporation issues. In Proceedings of EuroSpeech, pages 1645–1648. ESCA, 1995.

[86] M. M. Hochberg, G. D. Cook, S. J. Renals, and A. J. Robinson. Connectionist model combination for large vocabulary speech recognition. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, 1994.

[87] M. M. Hochberg, S. J. Renals, A. J. Robinson, and D. J. Kershaw. Large vocabulary continuous speech recognition using a hybrid connectionist-HMM system. In Proceedings of the International Conference on Spoken Language Processing, volume 3, pages 1499–1502, 1994.

[88] M. M. Hochberg, S. J. Renals, A. J. Robinson, and G. D. Cook. Recent improvements to the ABBOT large vocabulary CSR system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE, 1995.

[89] T. Holter. Maximum Likelihood Modelling of Pronunciation in Automatic Speech Recognition. PhD thesis, Norwegian University of Science and Technology, 1997.

[90] T. Holter and T. Svendsen. Maximum likelihood modelling of pronunciation variation. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 63–66, Rolduc, Netherlands, May 1998.

[91] X. D. Huang and M. A. Jack. Semi-continuous hidden Markov models for speech signals. Computer Speech and Language, 3(3):239–252, 1989.

[92] J. J. Humphries and P. C. Woodland. Using accent-specific pronunciation modelling for improved large vocabulary continuous speech recognition. In Proceedings of EuroSpeech, pages 2367–2370. ESCA, 1997.

[93] J. J. Humphries and P. C. Woodland. The use of accent-specific pronunciation dictionaries in acoustic model training. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 317–320. IEEE, 1998.

[94] P. Jeanrenaud, K. Ng, M. Siu, J. R. Rohlicek, and H. Gish. Phonetic-based word spotter: Various configurations and application to event spotting. In Proceedings of EuroSpeech, pages 1057–1060. ESCA, 1993.

[95] P. Jeanrenaud, M. Siu, J. R. Rohlicek, M. Meteer, and H. Gish. Spotting events in continuous speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 381–384. IEEE, 1994.

[96] F. Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64:532–556, 1976.

[97] F. Jelinek. Up from trigrams! The struggle for improved language models. In Proceedings of EuroSpeech, pages 1037–1040. ESCA, 1991.

[98] F. Jelinek, L. R. Bahl, and R. L. Mercer. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory, 21:250–256, 1975.

[99] T. Jitsuhiro, S. Takahashi, and K. Aikawa. Rejection of out-of-vocabulary words using phoneme confidence likelihood. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 217–220. IEEE, 1998.

[100] B. H. Juang, S. E. Levinson, and M. M. Sondhi. Maximum likelihood estimation for multivariate mixture observations of Markov chains. IEEE Transactions on Information Theory, 32(2):307–309, 1986.

[101] T. Kawahara, K. Ishizuka, S. Doshita, and C-H. Lee. Speaking-style dependent lexicalized filler model for key-phrase detection and verification. In Proceedings of the International Conference on Spoken Language Processing, pages 3253–3256, 1998.

[102] T. Kemp and A. Jusek. Modelling unknown words in spontaneous speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 530–534. IEEE, 1996.

[103] T. Kemp and T. Schaaf. Estimating confidence using word lattices. In Proceedings of EuroSpeech, pages 827–830. ESCA, 1997.

[104] T. Kemp and A. Waibel. Unsupervised training of a speech recognizer using TV broadcasts. In Proceedings of the International Conference on Spoken Language Processing, pages 2207–2210, 1998.

[105] D. Kershaw, T. Robinson, and M. Hochberg. Context-dependent classes in a hybrid recurrent network-HMM speech recognition system. In Advances in Neural Information Processing Systems, 1995.

[106] D. J. Kershaw. Phonetic Context-Dependency in a Hybrid ANN/HMM Speech Recognition System. PhD thesis, University of Cambridge, 1997.

[107] S. Khudanpur and J. Wu. A maximum entropy language model integrating n-grams and topic dependencies for conversational speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 553–556. IEEE, 1999.

[108] B. E. D. Kingsbury. Perceptually-inspired Signal Processing Strategies for Robust Speech Recognition in Reverberant Environments. PhD thesis, University of California at Berkeley, 1998.

[109] B. E. D. Kingsbury, N. Morgan, and S. Greenberg. Robust speech recognition using the modulation spectrogram. Speech Communication, 25(1–3):117–132, 1998.

[110] K. Kirchhoff and J. Bilmes. Dynamic classifier combination in hybrid speech recognition systems using utterance-level confidence values. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 693–696. IEEE, 1999.

[111] H. Klemm, F. Class, and U. Kilian. Word- and phrase spotting with syllable-based garbage modelling. In Proceedings of EuroSpeech, pages 2157–2160. ESCA, 1995.

[112] Y. Konig. REMAP: Recursive Estimation and Maximisation of A Posteriori Probabilities in Transition-based Speech Recognition. PhD thesis, University of California at Berkeley, 1996.

[113] Y. Konig, H. Bourlard, and N. Morgan. REMAP - experiments with speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 3351–3355. IEEE, 1996.

[114] Y. Konig, H. A. Bourlard, and N. Morgan. Recursive estimation and maximisation of a posteriori probabilities - application to transition-based connectionist speech recognition. In Advances in Neural Information Processing Systems, 1995.

[115] M-W. Koo, C-H. Lee, and B-H. Juang. A new decoder based on a generalized confidence score. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 213–216. IEEE, 1998.

[116] A. Krogh and S. Riis. Hidden neural networks. Neural Computation, submitted 1998.

[117] F. Kubala, J. Bellegarda, J. Cohen, D. Pallett, D. Paul, M. Phillips, R. Rajasekaran, F. Richardson, M. Riley, R. Rosenfeld, B. Roth, and M. Weintraub. The hub and spoke paradigm for CSR evaluation. In Proceedings of the DARPA Spoken Language Technology Workshop, pages 9–14. Morgan Kaufmann, March 1994.

[118] F. Kubala. Design of the 1994 CSR benchmark tests. In Proceedings of the DARPA Spoken Language Technology Workshop, January 1995.

[119] P. Ladefoged. A Course in Phonetics. Harcourt Brace Jovanovich, Inc., 1975.

[120] L. Lamel and G. Adda. On designing lexicons for large vocabulary, continuous speech recognition. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[121] K. F. Lee. Context-dependent phonetic hidden Markov models for speaker independent continuous speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599–609, 1990.

[122] R. P. Lippmann, E. I. Chang, and C. R. Jankowski. Wordspotter training using figure-of-merit back propagation. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 389–392. IEEE, 1994.

[123] E. Lleida, J. B. Marino, J. Salavedra, A. Bonafonte, E. Monte, and A. Martinez. Out-of-vocabulary word modelling and rejection for keyword spotting. In Proceedings of EuroSpeech, pages 1265–1268. ESCA, 1993.

[124] E. Lleida and R. C. Rose. Efficient decoding and training procedures for utterance verification in continuous speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE, 1996.

[125] E. Lombard. Le signe de l’élévation de la voix. Annales des Maladies de l’Oreille, du Larynx, du Nez et du Pharynx, 37:101–119, 1911. Described in [214].

[126] J. Makhoul, S. Roucos, and H. Gish. Vector quantization in speech coding. Proceedings of the IEEE, 73(11):1551–1588, 1985.

[127] K. L. Markey and W. Ward. Lexical tuning based on triphone confidence estimation. In Proceedings of EuroSpeech, volume 5, pages 2479–2482. ESCA, 1997.

[128] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The DET curve in assessment of detection task performance. In Proceedings of EuroSpeech, pages 1895–1898. ESCA, 1997.

[129] B. Mazor and M-W. Feng. Improved a-posteriori processing for keyword spotting. In Proceedings of EuroSpeech, pages 2231–2234. ESCA, 1993.

[130] B. L. McKinley and G. H. Whipple. Model based speech pause detection. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 1179–1182. IEEE, 1997.

[131] S. K. Mitra. Digital Signal Processing: A Computer-Based Approach. McGraw-Hill, 1998.

[132] H. Mokbel and D. Jouvet. Derivation of the optimal phonetic transcription set for a word from its acoustic realisations. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 73–78, Rolduc, Netherlands, May 1998.

[133] A. Nádas. A decision-theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31(4):814–817, 1983.

[134] C. Neti, S. Roukos, and E. Eide. Word-based confidence measures as a guide for stack search in speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 883–887. IEEE, 1997.

[135] H. Ney, D. Mergel, A. Noll, and A. Paesler. Data-driven search organisation for continuous speech recognition. IEEE Transactions on Signal Processing, 40:272–281, 1992.

[136] N. J. Nilsson. Problem-Solving Methods in Artificial Intelligence. McGraw-Hill, New York, 1971.

[137] H. J. Nock and S. J. Young. Detecting and correcting poor pronunciations for multiword units. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 85–90, 1998.

[138] Y. Normandin. Maximum mutual information estimation of hidden Markov models. In C-H. Lee, F. K. Soong, and K. K. Paliwal, editors, Automatic Speech and Speaker Recognition: Advanced Topics, pages 57–82. Kluwer, 1996.

[139] S. Ortmanns, H. Ney, and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. Computer Speech and Language, 11:43–72, 1997.

[140] J. Park and I. W. Sandberg. Approximation and radial-basis-function networks. Neural Computation, 5(2):305–316, 1993.

[141] D. B. Paul and J. Baker. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the International Conference on Spoken Language Processing, pages 899–902, Banff, 1992.

[142] J. Picone. Continuous speech recognition using hidden Markov models. IEEE ASSP Magazine, pages 29–41, 1990.

[143] J. Picone, S. Pike, R. Reagan, T. Kamm, J. Bridle, L. Deng, J. Ma, H. Richards, and M. Schuster. Initial evaluation of hidden dynamic models on conversational speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 109–112. IEEE, 1999.

[144] M. Pitz, S. Molau, R. Schlüter, and H. Ney. Automatic transcription verification of broadcast news and similar speech corpora. In Proceedings of the DARPA Broadcast News Workshop, March 1999.

[145] A. B. Poritz. Hidden Markov models: A guided tour. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 7–13. IEEE, 1988.

[146] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[147] L. R. Rabiner and B-H. Juang. Fundamentals of Speech Recognition. Prentice-Hall, 1993.

[148] L. R. Rabiner and B-H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.

[149] M. G. Rahim, C-H. Lee, and B-H. Juang. Robust utterance verification for connected digits recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 285–288. IEEE, 1995.

[150] P. Ramesh, C-H. Lee, and B-H. Juang. Context dependent anti subword modeling for utterance verification. In Proceedings of the International Conference on Spoken Language Processing, pages 3233–3236, 1998.

[151] M. Ravishankar and M. Eskenazi. Automatic generation of context-dependent pronunciations. In Proceedings of EuroSpeech, volume 5, pages 2471–2474. ESCA, 1997.

[152] S. Renals. Phone deactivation pruning in large vocabulary continuous speech recognition. IEEE Signal Processing Letters, 3:4–6, 1996.

[153] S. Renals and M. Hochberg. Start-synchronous search for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, to appear July 1999.

[154] S. Renals, N. Morgan, and H. Bourlard. Probability estimation by feed-forward networks in continuous speech recognition. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, pages 309–318, 1991.

[155] S. Renals, N. Morgan, M. Cohen, and H. Franco. Connectionist probability estimation in the DECIPHER speech recognition system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 601–604. IEEE, 1992.

[156] G. Riccardi, R. Pieraccini, and E. Bocchieri. Stochastic automata for language modelling. Computer Speech and Language, 10:265–293, 1996.

[157] M. D. Richard and R. P. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3(4):461–483, 1991.

[158] H. B. Richards and J. Bridle. The HDM: A segmental hidden dynamic model of coarticulation. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 357–360. IEEE, 1999.

[159] S. Riis. Hidden neural networks: Application to speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 1117–1120. IEEE, 1998.

[160] S. Riis and A. Krogh. Hidden neural networks: A framework for HMM/NN hybrids. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 3233–3236. IEEE, 1997.

[161] S. K. Riis. Hidden Markov Models and Neural Networks for Speech Recognition. PhD thesis, Technical University of Denmark, 1998.

[162] M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, and G. Zavaliagkos. Stochastic pronunciation modelling from hand-labelled phonetic corpora. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 109–116, Rolduc, Netherlands, May 1998.

[163] M. D. Riley. A statistical model for generating pronunciation networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 737–740. IEEE, 1991.

[164] Z. Rivlin. A confidence measure for acoustic likelihood scores. In Proceedings of EuroSpeech, pages 523–526. ESCA, 1995.

[165] Z. Rivlin, M. Cohen, V. Abrash, and T. Chung. A phone-dependent confidence measure for utterance rejection. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 515–518. IEEE, 1996.

[166] P. Roach and S. Arnfield. Variation information in pronunciation dictionaries. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 121–124, Rolduc, Netherlands, May 1998.

[167] A. J. Robinson. An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5(2):298–305, 1994.

[168] A. J. Robinson, L. Almeida, J. M. Boite, H. Bourlard, F. Fallside, D. Kershaw, P. Kohn, Y. Konig, N. Morgan, J. P. Neto, S. Renals, M. Saerens, and C. Wooters. A neural network based, speaker independent, large vocabulary, continuous speech recognition system: The WERNICKE project. In Proceedings of EuroSpeech, pages 1941–1944. ESCA, 1993.

[169] A. J. Robinson, M. M. Hochberg, and S. J. Renals. The use of recurrent networks in continuous speech recognition. In C-H. Lee, F. K. Soong, and K. K. Paliwal, editors, Automatic Speech and Speaker Recognition, pages 233–258. Kluwer, 1996.

[170] T. Robinson and J. Christie. Time-first search for large vocabulary speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 829–832. IEEE, 1998.

[171] J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish. Continuous hidden Markov modeling for speaker-independent word spotting. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 627–630. IEEE, 1989.

[172] R. C. Rose. Discriminant wordspotting techniques for rejecting non-vocabulary utterances in unconstrained speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 105–108. IEEE, 1992.

[173] R. C. Rose. Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition. Computer Speech and Language, 9:309–333, 1995.

[174] R. C. Rose. Word spotting from continuous speech utterances. In C-H. Lee, F. K. Soong, and K. K. Paliwal, editors, Automatic Speech and Speaker Recognition, pages 303–329. Kluwer, 1996.

[175] R. C. Rose and D. B. Paul. A hidden Markov model based keyword recognition system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 129–132. IEEE, 1990.

[176] R. C. Rose, H. Yao, G. Riccardi, and J. Wright. Integration of utterance verification with statistical language modelling and spoken language understanding. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 237–240. IEEE, 1998.

[177] R. Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech and Language, 10:187–228, 1996.

[178] A. I. Rudnicky. Hub 4: Business broadcast news. In Proceedings of the DARPA Speech Recognition Workshop, pages 8–11. Morgan Kaufmann, February 1996.

[179] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. Bradford Books/MIT Press, 1986.

[180] M. Saraclar. Automatic learning of a model for word pronunciations: Status report. In Proceedings of the DARPA Workshop on Conversational Speech Recognition, May 1997.

[181] T. Schaaf and T. Kemp. Confidence measures for spontaneous speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 875–878. IEEE, 1997.

[182] F. Schiel, A. Kipp, and H. G. Tillmann. Statistical modelling of pronunciation: It’s not the model, it’s the data. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 131–136, Rolduc, Netherlands, May 1998.

[183] E. Scheirer and M. Slaney. Construction and evaluation of a robust multifeature speech/music discriminator. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 1331–1334. IEEE, 1997.

[184] P. Schmid, R. Cole, and M. Fanty. Automatically generated word pronunciations from phoneme classifier output. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 223–226. IEEE, 1993.

[185] A. R. Setlur, R. A. Sukkar, and J. Jacob. Correcting recognition errors via discriminative utterance verification. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[186] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern. Automatic segmentation, classification and clustering of broadcast news. In Proceedings of the DARPA Speech Recognition Workshop, pages 97–99. Morgan Kaufmann, 1997.

[187] M-H. Siu, H. Gish, and F. Richardson. Improved estimation, evaluation and applications of confidence measures for speech recognition. In Proceedings of EuroSpeech, pages 831–834. ESCA, 1997.

[188] T. Sloboda. Dictionary learning: Performance through consistency. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 453–456. IEEE, 1995.

[189] T. Sloboda and A. Waibel. Dictionary learning for spontaneous speech recognition. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[190] M. S. Spina and V. W. Zue. Automatic transcription of general audio data: Preliminary analysis. In Proceedings of the International Conference on Spoken Language Processing, 1996.

[191] B. Suhm, M. Woszczyna, and A. Waibel. Detection and transcription of new words. In Proceedings of EuroSpeech, pages 2179–2182. ESCA, 1993.

[192] R. Sukkar. Subword-based minimum verification error (SB-MVE) training for task independent utterance verification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 229–232. IEEE, 1998.

[193] R. A. Sukkar and C-H. Lee. Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition. IEEE Transactions on Speech and Audio Processing, 4(6):420–429, 1996.

[194] R. A. Sukkar, A. R. Setlur, M. G. Rahim, and C-H. Lee. Utterance verification of keyword strings using word-based minimum verification error (WB-MVE) training. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 518–521. IEEE, 1996.

[195] T. Svendsen, F. K. Soong, and H. Purnhagen. Optimizing baseforms for HMM-based speech recognition. In Proceedings of EuroSpeech, pages 783–786. ESCA, 1995.

[196] G. Tajchman, E. Fosler, and D. Jurafsky. Building multiple pronunciation models for novel words using exploratory computational phonology. In Proceedings of EuroSpeech, pages 2247–2250. ESCA, 1995.

[197] R. Valenza, T. Robinson, M. Hickey, and R. Tucker. Summarisation of spoken audio through information extraction. In Proceedings of the ESCA Tutorial and Research Workshop on Accessing Information in Spoken Audio, pages 111–116, 1999.

[198] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young. Lattice-based discriminative training for large vocabulary speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 605–609. IEEE, 1996.

[199] A. P. Varga and R. K. Moore. Hidden Markov model decomposition of speech and noise. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 845–848. IEEE, 1990.

[200] A. P. Varga and R. K. Moore. Simultaneous recognition of concurrent speech signals using hidden Markov model decomposition. In Proceedings of EuroSpeech, pages 1175–1178. ESCA, 1991.

[201] J. G. Vaver. Experiments in confidence scoring using Spanish CallHome data. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 209–212. IEEE, 1998.

[202] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.

[203] S. Wegmann, P. Zhan, I. Carp, M. Newman, J. Yamron, and L. Gillick. Dragon Systems’ 1998 broadcast news transcription system. In Proceedings of the DARPA Broadcast News Workshop, March 1999.

[204] M. Weintraub. Keyword-spotting using SRI’s DECIPHER large-vocabulary speech-recognition system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 463–466. IEEE, 1993.

[205] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, and A. Stolcke. Neural-network based measures of confidence for word recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 887–890. IEEE, 1997.

[206] P. J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[207] F. Wessel, K. Macherey, and R. Schlüter. Using word probabilities as confidence measures. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 225–228. IEEE, 1998.

[208] M. Wester, J. M. Kessens, and H. Strik. Improving the performance of a Dutch CSR by modeling pronunciation variation. In Proceedings of the ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, pages 145–150, Rolduc, Netherlands, May 1998.

[209] D. Willett, A. Worm, C. Neukirchen, and G. Rigoll. Confidence measures for HMM-based speech recognition. In Proceedings of the International Conference on Spoken Language Processing, pages 3241–3244, 1998.

[210] J. G. Wilpon, L. R. Rabiner, C-H. Lee, and E. R. Goldman. Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(11):1870–1878, 1990.

[211] P. C. Woodland, T. Hain, G. L. Moore, T. R. Niesler, D. Povey, A. Tuerk, and E. W. D. Whittaker. The 1998 HTK broadcast news transcription system: Development and results. In Proceedings of the DARPA Broadcast News Workshop, March 1999.

[212] C. Wooters and A. Stolcke. Multiple-pronunciation lexical modeling in a speaker independent speech understanding system. In Proceedings of the International Conference on Spoken Language Processing, pages 1363–1366, 1994.

[213] S-L. Wu, B. E. D. Kingsbury, N. Morgan, and S. Greenberg. Incorporating information from syllable-length time scales into automatic speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 721–724. IEEE, 1998.

[214] K. Young, S. Sackin, and P. Howell. The effects of noise on connected speech: a consideration for automatic speech processing. In M. Cooke, S. Beet, and M. Crawford, editors, Visual Representations of Speech Signals, chapter 41, pages 371–378. John Wiley & Sons, 1993.

[215] S. J. Young. The general use of tying in phoneme-based HMM speech recognisers. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 569–572. IEEE, 1992.

[216] S. J. Young, M. Adda-Decker, X. Aubert, C. Dugast, J.-L. Gauvain, D. J. Kershaw, L. Lamel, D. A. van Leeuwen, D. Pye, A. J. Robinson, H. J. M. Steeneken, and P. C. Woodland. Multilingual large vocabulary speech recognition: the European SQALE project. Computer Speech and Language, 11:73–89, 1997.

[217] S. J. Young and L. L. Chase. Speech recognition and evaluation: a review of the U.S. CSR and LVCSR programmes. Computer Speech and Language, 12:263–279, 1998.

[218] G. Zavaliagkos, M. Siu, T. Colthurst, and J. Billa. Using untranscribed training data to improve performance. In Proceedings of the International Conference on Spoken Language Processing, pages 2551–2554, 1998.

[219] M. H. Zweig and G. Campbell. Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39(4):551–577, 1993.
