
Diploma Thesis

Automatic Speech Recognition and Identification of African Portuguese

submitted by
Oscar Tobias Anatol Koller
[email protected]
Matric. No: 220859

Thesis Supervisor: Prof. Dr.-Ing. R. Orglmeister, TU Berlin
Thesis Advisor: Dr.-Ing. D. Kolossa, TU Berlin
Thesis Advisor: Prof. Dr. I. M. Trancoso, INESC-ID Lisboa / IST Lisboa
Thesis Advisor: Dr. A. Abad, INESC-ID Lisboa

Electronics and Medical Signal Processing
Berlin University of Technology, Department of Energy and Automation Technology

in cooperation with

L2F – Spoken Language Systems Laboratory
INESC-ID Lisboa, Rua Alves Redol 9, 1000-029 Lisboa, Portugal

Lisbon/Berlin, June 2010



Abstract

This thesis deals with accented speech recognition. The performance of a hybrid large vocabulary continuous speech recognizer, combining multi-layer perceptrons and hidden Markov models, degrades heavily in the presence of African Portuguese varieties in broadcast news. Adapted and newly trained variety-specific acoustic and language models are shown to improve recognition significantly, by up to 21.1%.

Further, this thesis proposes a novel and efficient approach to automatically distinguish African from European Portuguese. The phonotactic variety identification system, based on phone recognition and language modeling, relies on a single tokenizer that combines distinctive knowledge about the differences between the target varieties. This knowledge is introduced into a multi-layer perceptron phone recognizer by training variety-dependent phone models for the two varieties as contrasting classes. Significant improvements were achieved, lowering the computational cost considerably and reducing the equal error rate by more than 60% compared to the state-of-the-art baseline.



Zusammenfassung (Summary)

With around 178 million native speakers, Portuguese ranks seventh among the world's most spoken first languages [Lew09]. However, only about five percent of all Portuguese speakers live in Portugal and therefore have a European Portuguese accent. Automatic captioning of broadcast news – one of the main fields of work of the Spoken Language Systems Laboratory at the Portuguese research center INESC-ID – suffers a strong degradation of its recognition rate for varieties that deviate from European Portuguese. The word error rate of the baseline recognizer stays below 20% for European Portuguese, but speakers of African Portuguese varieties cause rates of over 30%.

Within the scope of this diploma thesis, the existing European Portuguese speech recognizer was to be adapted so as to improve the recognition of African Portuguese. The existing baseline recognizer is a hybrid, speaker-independent automatic speech recognizer for large-vocabulary continuous speech, which combines the temporal modeling capabilities of hidden Markov models with the classification characteristics of artificial neural networks.

The recognition performance of the European Portuguese recognizer was improved by 21.1% through variety-specific acoustic and language models, lowering the word error rate from 30.1% to 23.7%. About nine tenths of this improvement were achieved by adapting the European Portuguese acoustic models with just under seven and a half hours of manually transcribed African Portuguese data. The variety-specific language model, created by interpolating an African and a European language model, accounted for only about one tenth of the recognition improvement. The African Portuguese language model was trained on a written corpus of about 1.6 million words collected within the scope of this work.

A further achievement of this diploma thesis is a novel and very efficient automatic method for distinguishing African from European Portuguese. The phonotactic variety identification approach is based on the principle of phone recognition followed by language modeling (PRLM). It uses a single phone recognizer that is able to detect variety-specific mono-phones. Compared to a state-of-the-art reference system, the equal error rate (EER) was reduced by more than 60% [KAT10].



Declaration (Erklärung)

I, Oscar Tobias Anatol Koller, hereby declare that I have written this diploma thesis without outside help and using only the cited aids.

Berlin, 15 June 2010



Acknowledgements

This thesis would not have been possible without the motivation, support and encouragement that the author received from many people. He would like to thank and acknowledge all who have contributed to his work.

First and foremost, special thanks go to Dr. Alberto Abad and Prof. Dr. Isabel Trancoso, who were generously available at any moment to collaborate and offer their help. Both supported this work with extraordinary and continuous engagement and contributed many helpful comments in person, on the telephone and through email.

Further, all colleagues, particularly Hugo Meinedo, Helena Moniz and Céu Viana, provided important help. Thanks also go to all his office companions, principally Tiago Luís and Miguel Bugalho.

This thesis would not exist without Dr. Thomas Pellegrini, who kindly put the author in contact with INESC-ID's Spoken Language Systems Laboratory.

Last but not least, thanks go to Dr.-Ing. Kolossa and Prof. Dr.-Ing. Orglmeister, who made this work possible at TU Berlin.

This work was funded by FCT project PTDC/PLP/72404/2006.



Contents

1. Introduction
   1.1. Objectives
   1.2. Motivation
   1.3. Structure
   1.4. State of the Art
        1.4.1. Variety Identification
        1.4.2. Automatic Speech Recognition with Data Constraints

2. The AUDIMUS Automatic Speech Recognition System
   2.1. Feature Extraction
   2.2. Acoustic Modeling
        2.2.1. Training of the Acoustic Models
   2.3. Vocabulary and Language Model
        2.3.1. Evaluating Language Models
   2.4. Pronunciation Model
   2.5. Decoder
   2.6. Confidence Measures
   2.7. Evaluation and Statistical Significance

3. Corpora
   3.1. Acoustic Corpora
        3.1.1. Corpora for Phonetic Classification
               3.1.1.1. African Portuguese
               3.1.1.2. Brazilian Portuguese
               3.1.1.3. European Portuguese
        3.1.2. Corpora for Variety Identification
   3.2. Text Corpora

4. African Portuguese Baseline
   4.1. Revision of Manual Transcriptions
   4.2. Manual Classification of Speakers
   4.3. Baseline Results and Discussion

5. Automatic Identification of African and European Portuguese
   5.1. Baseline Variety Identification
        5.1.1. Baseline Phonotactic Systems
        5.1.2. Baseline Acoustic System
        5.1.3. Calibration and Fusion
   5.2. Single Mono-Phonetic PRLM using a Specialized Phonetic Tokenizer
        5.2.1. Determining Mono-Phones
        5.2.2. Linguistic Interpretation of Chosen Mono-Phones
        5.2.3. Phonotactic Core
   5.3. Variety Identification Results
   5.4. Application to Other Varieties
        5.4.1. Identification of European Portuguese versus Brazilian Portuguese
        5.4.2. Identification of African Portuguese versus Brazilian Portuguese
        5.4.3. Identification of African Portuguese versus Brazilian Portuguese versus European Portuguese
   5.5. Discussion and Future Work

6. Improvement of Acoustic and Language Models
   6.1. African Portuguese Acoustic Models
        6.1.1. Adapted Acoustic Models with EP Initialization
        6.1.2. Completely Retrained Acoustic Models with Random Initialization
        6.1.3. Unsupervised Training for Acoustic Models with Random Initialization
   6.2. African Portuguese Language Model
        6.2.1. Data Collection, Extraction and Normalization
        6.2.2. Training of an AP-specific Language Model
        6.2.3. Results
   6.3. Discussion and Future Work

7. Conclusion and Future Work

A. Corpora Sources for Speech Recognition

B. Corpora Sources for Variety Identification

List of Figures

List of Tables

Bibliography


Acronyms

ANN        Artificial Neural Network
AP         African Portuguese
ASR        Automatic Speech Recognition
BN         Broadcast News
BP         Brazilian Portuguese
DET        Detection Error Tradeoff
EER        Equal Error Rate
EM         Expectation-Maximization
EP         European Portuguese
FA         False Alarm
FST        Finite State Transducer
G2P        Grapheme-to-Phone
GMM        Gaussian Mixture Model
GMM-SVM    Gaussian Mixture Model pushed back SVM
GSV        Gaussian SuperVector
HMM        Hidden Markov Model
LID        Language IDentification
LLR        Linear Logistic Regression
LM         Language Model
LONLM      Low-Order N-gram Language Model
LRE        Language Recognition Evaluation
LVCSR      Large Vocabulary Continuous Speech Recognition
MAP        Maximum-a-Posteriori
MAPSSWE    MAtched-Pair Sentence-Segment Word Error
ML         Maximum Likelihood
MLP        MultiLayer Perceptron
MMI        Maximum Mutual Information
MSG        Modulation SpectroGram
NAP        Nuisance Attribute Projection
NIST       National Institute of Standards and Technology
OOV        Out-Of-Vocabulary
PALOP      Portuguese-speaking African countries
PLP        Perceptual Linear Prediction
POS        Part-Of-Speech
PPRLM      Parallel PRLM
PR         Phone Recognizer
PRLM       Phone Recognition followed by Language Modeling
RASTA      log-RelAtive SpecTrAl
RASTA-PLP  log-RelAtive SpecTrAl PLP
RTP        Rádio e Televisão de Portugal
SAMPA      Speech Assessment Methods Phonetic Alphabet
SCTK       SCoring ToolKit
SRILM      SRI Language Modeling
SVM        Support Vector Machine
TALM       Target-Aware Language Modeling
TPA        Televisão Pública de Angola
UBM        Universal Background Model
UNLM       Universal N-gram Language Model
VID        Variety IDentification
WER        Word Error Rate
WFST       Weighted Finite State Transducer


Symbols

f_ij(w_i)       Specific confidence feature for word w_i
F               Total number of confidence features
φ               Grapheme
G               Language model
H               Acoustic model topology
J               Number of words
κ               Cohen's kappa
L               Pronunciation lexicon
λ               Model parameter
M               Number of MLPs
μ_l             Left context of a grapheme
μ_r             Right context of a grapheme
P_MSG(ρ)        Prior probability for phone ρ with MSG features
P_PLP(ρ)        Prior probability for phone ρ with PLP features
P_RASTA-PLP(ρ)  Prior probability for phone ρ with RASTA-PLP features
P(x)            General probability density of the feature vector x
P(ρ|x)          Combined posterior probability for phone ρ
P(ρ)            Combined prior probability for phone ρ
p_o             Proportion of agreement
p_c             Proportion of agreement expected by chance
P(x|ρ)          Combined scaled likelihood
Per_G           Perplexity of language model G
α               Level of significance
N               Number of speakers
ρ               Phone
σ_κ             Standard error
T               Word sequences of transcriptions
w_i             Recognized word
x               Acoustic vector
Z(w_i)          Normalization factor for a certain word



1. Introduction

With around 178 million native speakers, Portuguese is the seventh most spoken language in the world [Lew09]. However, only about five percent of Portuguese speakers live in Portugal and consequently speak Portuguese with a European accent. Automatic captioning of Broadcast News (BN) – one of the core technologies of INESC-ID's Spoken Language Systems Laboratory (L2F) – faces severe difficulties in the presence of different accents. The Word Error Rate (WER) of L2F's baseline European Portuguese (EP) recognizer degrades from under 20% on EP speech to around 30% on African Portuguese (AP) varieties, and to above 50% on Brazilian Portuguese (BP). To overcome the challenges imposed by the presence of multiple varieties of Portuguese in BN data, variety-dependent recognition systems and efficient variety identification modules are needed.

The PostPORT project (Porting Speech Technologies to other varieties of PORTuguese) addresses these needs. At L2F, scientists have been working for several years on Large Vocabulary Continuous Speech Recognition (LVCSR) using hybrid recognizers that combine Artificial Neural Networks and Hidden Markov Models (ANN/HMM), the so-called connectionist paradigm. L2F's first LVCSR system was initially developed for EP and recently ported to BP [ATNV09]. This thesis reports on current advances in porting it to the AP variety within the scope of the PostPORT project.

The motivation to treat AP as a broad class, instead of training a specific system for every African variety, is twofold. First, a human benchmark [RTVA08] revealed that identifying African varieties in BN is much more difficult than identifying the accents of ordinary people on the street, possibly due to the higher level of education and closer contact with EP of a large part of the speakers appearing in BN. Second, the available amount of training data for each variety is very limited. Hence, within this thesis the African varieties of Portuguese are merged into a single, generalized category encompassing the varieties spoken in the five Portuguese-speaking African countries (PALOP): Angola, Cape Verde, Guinea-Bissau, Mozambique and São Tomé and Príncipe.

1.1. Objectives

The goal of this thesis is to improve the recognition accuracy of Automatic Speech Recognition (ASR) systems transcribing AP BN. Specialized, variety-specific ASR modules are employed to yield the expected performance gain.



The thesis will focus on variety-specific acoustic models. The limited size of the manually transcribed corpus may be a major obstacle on this track. Automatically transcribed data constitutes a possible solution. Thus, a reliable variety identification system is needed to perform automatic selection of data containing speakers with various linguistic profiles.

Further, the impact of a variety-specific language model shall be briefly analyzed, implying the need for a collection of AP written corpora.

1.2. Motivation

ASR for AP is a largely unexplored area. However, there is great interest in changing that, both from a scientific and an economic point of view, given the rising economies of some of the PALOP countries.

ASR is crucial for many applications in daily life, easing communication in situations where text is preferred over voice. ASR is a necessary step towards automatic translation between different languages, and thus essential for worldwide exchange. Further, BN transcriptions enable deaf people to access information from spoken news, provide assistance in classroom situations, and facilitate the use of the telephone for hard-of-hearing people. Moreover, speech recognition may improve communication for people with language disorders. Last but not least, ASR adds a new dimension to information retrieval by archiving speech as text and making it searchable.

However, big challenges for current ASR systems are regional differences, dialects and accents ofthe speech to be recognized. This thesis addresses this issue with respect to African varieties ofthe Portuguese language.

1.3. Structure

The rest of this introduction focuses on state-of-the-art technologies. The following Section 2 describes L2F's ASR system in depth. Section 3 introduces all corpora used throughout this thesis, Section 4 establishes the AP baseline, and Section 5 is dedicated to presenting an approach to automatically identify varieties. The improvement of recognition results through AP-specific acoustic and language models is shown in Section 6, and a final conclusion is drawn in Section 7.

1.4. State of the Art

This section gives a brief overview of state-of-the-art techniques that are useful within the scope of this thesis. The first subsection presents approaches that may help to differentiate AP from EP. The second subsection presents common strategies to deal with the scarcity of manual transcriptions when training acoustic models.


1.4.1. Variety Identification

Zissman observed that differences between two languages or varieties may be characterized on four distinct levels [ZB01]:

• Phonology: phone sets may differ, but often a common subset exists; the rules (phonotactics) for joining phones into words may also differ.

• Morphology: word roots and lexica differ.

• Syntax: rules to construct sentences differ.

• Prosody: rhythm and intonation patterns are different.

Automatic Variety IDentification (VID) may exploit any of the previously mentioned cues to distinguish two varieties. VID, also known as native accent identification or dialect identification, is a relatively recent research area. The closely related field of foreign accent identification has received more attention from the research community, possibly due to its more apparent need in a continuously globalizing world. The two tasks differ in that native speakers often proudly mark the particularities of their speech and do not attempt to conform to the standard variety [TTS96]. Non-native speakers, on the other hand, vary significantly in their competence at pronouncing sounds that are not part of their native sound inventory, and usually attempt to reproduce a native speaker's intonation.

Another related field with a much longer scientific tradition is automatic Language IDentification (LID). It is very similar to VID; however, two varieties usually use very close phone sets, and their lexica and grammar are highly related. Moreover, in several cases the boundary between language and variety is very narrow, and the distinction of two languages may be politically rather than linguistically motivated (for instance Serbian and Croatian). Research in LID has been going on for more than two decades now, and significant advances have been achieved. Since 1996, the National Institute of Standards and Technology (NIST)¹ has periodically organized the Language Recognition Evaluation (LRE), a formal and comparable evaluation of LID systems, which has contributed significantly to the continuous progress in LID. The performance progress over the past LREs, which is quite representative of the whole field, is displayed in Figure 1.1.

Since LRE 2005 the evaluation has also included tests on neighboring language pairs, which in most cases can be considered varieties. LRE 2009 included eight pairs of languages to be distinguished in separate tests. Five of them, namely Russian-Ukrainian, Hindi-Urdu, Dari-Farsi, Bosnian-Croatian and American English-Indian English, are generally considered to be mutually intelligible and therefore produce testing conditions similar to those of the Portuguese varieties worked on in this thesis. The data used during LRE 2009 to train and test the systems for the five language pairs differs partly from the data used in this thesis (compare Section 3 for details on the corpora). NIST provided conversational telephone speech for American English

1see: http://www.nist.gov


[Figure 1.1: Progress in LID during the NIST LREs since 1996 [NIS09], given as the average detection cost C_avg [%] for the 3-, 10- and 30-second test conditions.]

and Indian English. For Ukrainian, Dari, Bosnian and Croatian, telephone speech extracted from recorded radio and broadcast news data was given. The other languages could be trained with a mix of both sources. Figure 1.2 shows the best results (averaged over 3, 10 and 30 seconds) achieved on each language pair².

All systems that scored best on the five presented language pairs in NIST LRE 2009 made use of techniques that exploit both acoustic and phonotactic information. Prosodic and other approaches are not represented among the leaders of variety differentiation. The systems' improving performance can be attributed to innovations across the whole range of employed techniques, including feature and model compensation, the adopted models and classifiers, as well as final calibration and fusion. The following subsections introduce the technologies employed by the best scoring systems of NIST LRE 2009.

Feature Compensation

Compensating for differing training and test conditions during feature processing has been shown to be critical for good performance in many speech applications. Mismatch between training and test settings usually arises from variability in speaker, channel, gender or environment conditions. Common methods for compensation include the use of log-RelAtive SpecTrAl (RASTA) [HMBK92] features, mean and variance normalization, vocal tract length normalization, and feature warping. It has further been shown that feature compensation not only improves the performance of purely acoustic approaches, but also enhances phonotactic LID systems [SR07].
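Of these methods, mean and variance normalization is the simplest to illustrate. The following is a minimal NumPy sketch of per-utterance cepstral mean and variance normalization, not the implementation used in any of the cited systems; the feature dimensions and data are invented:

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization.

    features: (num_frames, num_coeffs) array, e.g. PLP coefficients.
    Removes the per-utterance mean and scales each coefficient to unit
    variance, which compensates for stationary channel effects.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-10  # avoid division by zero
    return (features - mean) / std

# Example: 100 frames of 13 coefficients with a simulated channel offset.
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(100, 13))
norm = cmvn(feats)
```

After normalization, each coefficient track has zero mean and unit variance regardless of the original channel offset and gain.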

2Results are published at http://www.itl.nist.gov/iad/mig/tests/lre/2009/lre09_eval_results.


[Figure 1.2: EER [%] of the best participants in the NIST LRE 2009 language-pair tests (American English-Indian English, Russian-Ukrainian, Dari-Farsi, Bosnian-Croatian, Hindi-Urdu), averaged over the 3-, 10- and 30-second conditions.]

Model Compensation

Recently, new techniques based on subspace methods for model compensation in Gaussian Mixture Models (GMMs) have become popular in language recognition [CSTR08]. Some examples are Factor Analysis, proposed by Kenny [KD04], eigenchannel adaptation, a simplified version of Factor Analysis by Brümmer [Brü04], and Nuisance Attribute Projection (NAP) [CSRS06]. Both eigenchannel adaptation and NAP have recently been approximated in the feature domain by Castaldo et al. [CCD+07] and Campbell [CSTR08]. The latter reports comparable performance gains for all previously mentioned model compensation techniques, yielding around 0.7% absolute gain on the NIST LRE 2007 14-language closed-set task on 30 and 10 second utterances.

Models and Classifier

GMMs are used to represent the distributions of the features derived from the individual languages and are thus considered purely acoustic approaches. The parameters of GMMs can be trained discriminatively using Maximum Mutual Information (MMI) estimation. Matějka et al. [MBSČ06] achieved a 50% relative improvement over state-of-the-art Maximum Likelihood (ML) trained GMMs on the NIST LRE 2003 evaluation set for 30 second segments.
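As a toy illustration of the purely acoustic approach, the sketch below trains one ML GMM per variety with scikit-learn's EM implementation and classifies an utterance by its average log-likelihood. The data is synthetic and the setup is far simpler than any LRE system; it only shows the decision principle:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic stand-ins for acoustic feature frames of two varieties.
train = {
    "EP": rng.normal(loc=0.0, scale=1.0, size=(500, 4)),
    "AP": rng.normal(loc=2.0, scale=1.0, size=(500, 4)),
}

# One ML-trained GMM per variety (EM under the hood).
models = {
    lang: GaussianMixture(n_components=4, random_state=0).fit(frames)
    for lang, frames in train.items()
}

def classify(utterance):
    """Score an utterance (frames x dims) against every variety model
    and return the variety with the highest average log-likelihood."""
    scores = {lang: gmm.score(utterance) for lang, gmm in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(loc=2.0, scale=1.0, size=(50, 4))  # AP-like frames
```

MMI training, by contrast, would adjust the GMM parameters to maximize the margin between the correct model's score and the competitors' scores rather than fitting each model to its own data in isolation.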

Combining GMMs with Support Vector Machines (SVMs) as a discriminative classifier [CCR+06], the so-called GMM-SVM approach, is a common state-of-the-art technique. Usually, a Universal Background Model (UBM) is trained using the ML criterion with the Expectation-Maximization (EM) algorithm. The UBM should be based on the maximum quantity of data from diverse sources, containing different languages, speakers and channels. To obtain language-specific distribution models, the UBM is adapted to the characteristics of each language using the EM algorithm with a Maximum-a-Posteriori (MAP) criterion. The means and variances of these language-specific


models are stacked into a vector that is trained against the vectors of all other languages in an SVM with an appropriate kernel. The trained SVMs are used to perform the scoring at test time.

The Gaussian SuperVector approach, presented as GMM-GSV in [Cam08], additionally uses, in contrast to the simpler GMM-SVM, the means and variances of the UBM in each SVM training vector. It further follows an alternative approach for scoring: the SVM model is pushed back to a GMM, which is then used to calculate the scores. In certain situations, especially on short utterances, this approach has shown good accuracy.
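The mean-only relevance-MAP adaptation at the heart of these supervector systems can be sketched as follows. This is an illustrative NumPy version with invented toy data; real systems adapt hundreds of mixture components over high-dimensional features and typically adapt variances as well:

```python
import numpy as np

def map_adapt_means(ubm_means, frames, responsibilities, r=16.0):
    """Relevance-MAP adaptation of UBM component means (means only).

    ubm_means:        (C, D) component means of the UBM.
    frames:           (N, D) feature frames of one language/utterance.
    responsibilities: (N, C) posteriors P(c | x_n) under the UBM.
    r:                relevance factor controlling adaptation strength.
    Components that saw little data stay close to the UBM; well-observed
    components move toward the data statistics.
    """
    n_c = responsibilities.sum(axis=0)                 # soft counts, (C,)
    first_order = responsibilities.T @ frames          # (C, D)
    ml_means = first_order / np.maximum(n_c, 1e-10)[:, None]
    alpha = (n_c / (n_c + r))[:, None]                 # data-dependent weight
    return alpha * ml_means + (1.0 - alpha) * ubm_means

def supervector(adapted_means):
    """Stack the adapted means into one GSV-style vector for the SVM."""
    return adapted_means.reshape(-1)

# Toy example: a 2-component, 2-dimensional UBM; all frames are
# attributed to component 0, so only its mean gets adapted.
ubm = np.array([[0.0, 0.0], [5.0, 5.0]])
frames = np.array([[1.0, 1.0], [1.0, 1.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0]])
adapted = map_adapt_means(ubm, frames, resp)
sv = supervector(adapted)
```

In the GMM-SVM setup, one such supervector is computed per training utterance or language and fed to the SVM as a fixed-length representation.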

Phonotactic approaches consist of a frontend followed by a backend. The former commonly extracts phones based on spectral features with the help of one or several trained Phone Recognizers (PRs), whereas the latter statistically models the phone sequences, serving as the classifier. Generally, there are two approaches for the classification of the token sequences: N-gram language models (PR-LM) and SVMs (PR-SVM). N-gram language models are often used with 3-grams, but pruned n-grams of higher order also seem to carry useful information by capturing regularities on a bigger scale [CCC+10]. In PR-SVM, feature vectors are built with a bag-of-n-grams [CCR+04]. However, Bing et al. [BYD08] state that the statistical representation of such a vector is sparse and inaccurate. They propose two adaptation schemes: adaptation from the Universal N-gram Language Model (UNLM) trained on all languages, comparable to the UBM used in the GMM-SVM acoustic approach, and adaptation from the Low-Order N-gram Language Model (LONLM). A relative improvement of around 18% compared to a non-adapted baseline is achieved when distinguishing 14 languages on 3, 10 and 30 second test trials from NIST LRE 2007.
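The PR-LM backend can be illustrated with a toy bigram model over decoded phone strings. The sketch below uses add-one smoothing and invented phone sequences that are far shorter than real tokenizer output; it is only meant to show how the language-model scores drive the decision:

```python
from collections import Counter
import math

def train_bigram_lm(phone_strings):
    """Estimate an add-one-smoothed bigram model over phone tokens."""
    bigrams, histories, vocab = Counter(), Counter(), set()
    for s in phone_strings:
        toks = s.split()
        vocab.update(toks)
        histories.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return bigrams, histories, vocab

def logprob(lm, phone_string):
    """Add-one-smoothed bigram log-likelihood of a phone string."""
    bigrams, histories, vocab = lm
    v = len(vocab)
    toks = phone_string.split()
    return sum(
        math.log((bigrams[(a, b)] + 1) / (histories[a] + v))
        for a, b in zip(toks, toks[1:])
    )

# Toy phone-recognizer outputs for two varieties (invented symbols).
ep_lm = train_bigram_lm(["a t u a t u", "t u a t u a"])
ap_lm = train_bigram_lm(["a k i a k i", "k i a k i a"])

def identify(phone_string):
    """PRLM decision: pick the variety whose LM scores higher."""
    return "EP" if logprob(ep_lm, phone_string) > logprob(ap_lm, phone_string) else "AP"
```

In a real PRLM system the phone strings come from a trained phone recognizer, the models are typically 3-grams (or pruned higher-order n-grams), and the raw LM scores are calibrated rather than compared directly.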

Tong et al. present a new method to generate the phone sequences in the frontend. They propose Target-Aware Language Modeling (TALM) [TML+09], a method that supplies each phone recognizer with a priori knowledge of the target languages during generation of the sequences. They report lowering the Equal Error Rate (EER) by about 1.7% absolute in the task of distinguishing 14 languages with 30-second test trials from NIST LRE 2007.

The approach of using phone lattices instead of 1-best phone outputs from the recognizer was first used by Gauvain [GMS04]. In recent tests with a PR-SVM system on distinguishing LRE 2007's 14 languages, Bing et al. [BYD08] achieve absolute improvements of 3.0%, 5.9% and 5.5% for the 30-second, 10-second and 3-second utterances, respectively.

Fusion

The calibration and discriminative fusion of the scores from each subsystem may be successfully performed with linear backends, whose coefficients are determined with Linear Logistic Regression (LLR). Common approaches calculate independent backends for each language and for each duration of the files to be processed (3, 10 or 30 seconds). In some cases Gaussian backends are employed [CCC+10]. A well-known toolkit to perform calibration and fusion is Brümmer's FoCal Multi-class Toolkit [Brü07].
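A linear fusion backend trained by logistic regression can be sketched as follows; this is a simplified two-class stand-in for the multi-class calibration a tool such as FoCal performs, and the function names and toy scores are invented for the example:

```python
import math

def train_llr_fusion(scores, labels, lr=0.5, iters=500):
    """Learn per-subsystem fusion weights and a bias by logistic regression.

    scores: list of per-trial score vectors (one entry per subsystem);
    labels: 1 for target-language trials, 0 otherwise.
    """
    n_sys = len(scores[0])
    w, b = [0.0] * n_sys, 0.0
    for _ in range(iters):
        for x, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for k in range(n_sys):          # gradient step on each weight
                w[k] -= lr * (p - y) * x[k]
            b -= lr * (p - y)
    return w, b

def fuse(x, w, b):
    """Fused, calibrated posterior for one trial."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# toy scores: subsystem 0 separates the classes, subsystem 1 is noise
X = [[2.0, 0.1], [1.5, -0.3], [-1.8, 0.2], [-2.2, -0.1]]
y = [1, 1, 0, 0]
w, b = train_llr_fusion(X, y)
```

The learned weights reflect how much each subsystem contributes; an uninformative subsystem receives a weight near zero.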




1.4.2. Automatic Speech Recognition with Data Constraints

Large vocabulary continuous speech recognizers require large amounts of transcribed data to attain robustly trained acoustic models. The same applies when recognition systems face different conditions, such as speakers having a different linguistic profile and thus speaking a different variety. However, manual transcriptions are expensive and require tedious, time-consuming work. Moreover, publicly available sources of manually transcribed data are particularly scarce in the case of less studied varieties. Hence, for several years, research has focused on automatic procedures capable of training speech recognizers on untranscribed data [KW99]. Radio and broadcast news can supply practically unlimited data for automatic training. This data is sometimes accompanied by closed-captions, a close, but not exact, transcription of what is being spoken. Most of the data, however, comes without any transcripts. For both cases useful strategies have been developed to improve final recognition performance [LGA02].

The basic idea is to use an existing speech recognizer, which has at least been bootstrapped with a small amount of manually transcribed data, to automatically transcribe raw audio data collected from any source. Ma and Schwartz [MS08] state that the amount of manual data for bootstrapping has no significant influence on the recognition error if a large amount of untranscribed data is available. However, Wang et al. [WGW07] conclude differently: the performance gain seems to be limited when there is little initial manually transcribed data for a particular, mismatched data type.

After generating automatically transcribed data, it may either be used directly for training and refinement of the speech recognizer, or a filtering procedure based on confidence scores is first used to select those parts of the automatic transcript that are likely to be correct. Kemp and Waibel [KW99] compared unsupervised training filtered with real confidence scores, discarding about fifty percent of the wrong words at a baseline recognition of 38.8% WER, to a system using ideal confidence scores that detect every recognition mistake. They claim that ideal accuracy of confidence scoring does not significantly raise the recognition performance of the final system, which improved from 28.5% WER with real confidence scores to 27.4% using the ideal measure. Moreover, Kemp and Waibel state that constraining the filtering with a high confidence threshold on the one hand helps to sort out most errors; on the other hand, it only keeps words that the system already recognizes robustly. Hence, a high confidence threshold does not add substantial new information. Contrarily, Wessel and Ney [WN01] report an improvement from 44.0% WER using real confidence measures to 40.0% WER with ideal measures, concluding that accurate confidence filtering is highly advantageous. The authors' baseline system used for automatic transcription achieves 47.5% WER and is combined with a high filtering threshold of 0.9. Unfortunately, no equivalent rate of correct filtering is mentioned, which makes it difficult to compare both findings.
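The threshold-based selection described above can be sketched as a simple filter over (word, confidence) pairs; the function name and toy transcript are invented for illustration:

```python
def select_confident(transcript, threshold=0.9):
    """transcript: list of (word, confidence) pairs from automatic recognition.

    Returns the words kept for acoustic-model retraining; everything below
    the confidence threshold is discarded, mirroring confidence filtering
    for unsupervised training.
    """
    return [w for w, c in transcript if c >= threshold]

kept = select_confident([("ola", 0.95), ("mundo", 0.4), ("bom", 0.92)],
                        threshold=0.9)
```

As noted in the text, raising the threshold trades fewer transcription errors against less new information in the retained data.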

A further approach aims at enhancing the filtering of the automatically transcribed data by combining transcriptions and confidence measures from different systems [KW99]. Kemp and Waibel align the output of both systems and select the word with the higher confidence measure in case of disagreeing transcriptions. This leads to an absolute gain of 0.9% in recognition accuracy.
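For word-aligned hypotheses, this selection rule is a one-liner; the pairing format below is an assumption made for the sketch, as the actual alignment step is omitted:

```python
def combine_systems(hyp_a, hyp_b):
    """hyp_a, hyp_b: word-aligned hypotheses from two recognizers, given as
    (word, confidence) pairs. On disagreement, keep the word with the higher
    confidence measure (a simplified version of Kemp and Waibel's scheme)."""
    return [wa if ca >= cb else wb
            for (wa, ca), (wb, cb) in zip(hyp_a, hyp_b)]

merged = combine_systems([("bom", 0.9), ("dia", 0.4)],
                         [("bom", 0.8), ("tia", 0.7)])
```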

Ma and Schwartz compare different strategies for conducting unsupervised training [MS08]. They present an incremental strategy that consists of doubling the amount of data used in each iteration of training and automatic transcription. After five iterations, starting with one hour of manual bootstrapping, they used the complete corpus of 31 hours of available data and reached a WER of 41.6%. The authors' second approach uses the whole 31 hours of data for transcription in each iteration and reaches a WER of 39.9% after the second iteration.
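The doubling schedule of the incremental strategy can be made explicit with a short helper; the function name is invented for the sketch:

```python
def incremental_schedule(total_hours, seed_hours=1.0):
    """Amount of data used per iteration when doubling each time, until the
    whole corpus is covered (Ma and Schwartz's incremental strategy)."""
    hours, schedule = seed_hours, []
    while hours < total_hours:
        hours = min(hours * 2, total_hours)  # cap at the corpus size
        schedule.append(hours)
    return schedule

# 1 h of manual bootstrapping, 31 h of available data -> five iterations
plan = incremental_schedule(31, seed_hours=1.0)
```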

At L2F, unsupervised training also helped to improve the EP recognizer [MVN08]. 199 hours of automatically transcribed training data lowered the WER of a system initially trained with 46 hours of manually transcribed data from 23.5% to 21.5%. This relative gain of 8.5% was achieved by selecting the automatic data based on confidence measures and using an incremental strategy, as the data was recorded continuously and was thus not available all at once.



2. The AUDIMUS Automatic Speech Recognition System

At L2F, research on speaker-independent LVCSR has been going on for several years. Starting with recognition systems mainly for American English, in 1997 AUDIMUS was developed, the first system for European Portuguese [NMA97]. AUDIMUS is a hybrid recognizer combining the discriminative pattern classification characteristics of Artificial Neural Networks (ANNs) with the temporal modeling capacity of Hidden Markov Models (HMMs). Continuous improvement led to the current version of AUDIMUS, which is trained with over 378 hours of BN (46 h manually transcribed), uses a 100k word vocabulary and achieves a WER of 21.5% on a six-hour EP BN evaluation corpus [MVN08].

AUDIMUS is flexible and module-based. Fig. 2.1 shows its schematic block diagram, which will be described step by step in this chapter. Starting with the features used for recognition, the process of acoustic modeling is clarified, details about the pronunciation and language models are given, and finally the employed decoder is described.

2.1. Feature Extraction

A basic AUDIMUS system uses the output streams of three distinct feature sets: Perceptual Linear Prediction (PLP) features [Her90], log-RelAtive SpecTrAl PLP (RASTA-PLP) features [HMBK92] and Modulation SpectroGram (MSG) features [KMG98]. To extract the features from the incoming audio signal, a sliding window of 20 ms length is used, which is updated every 10 ms. PLP and RASTA-PLP features consist of 26 parameters per frame, including 12th-order coefficients, the logarithmic spectral energy and first-order derivatives. The MSG features consist of 28 static features per frame. Temporal context frames are further used to improve the

Figure 2.1.: AUDIMUS schematic overview [Mei08].




acoustic classification. In the case of the PLP and RASTA-PLP features, six preceding and six following context frames are added, whereas for the MSG features seven are used. This leads to a total number of 26 · 13 = 338 and 28 · 15 = 420 inputs to each MultiLayer Perceptron (MLP) classifier. The combination of all three feature sets proved to be more efficient and robust than using any one of the feature sets separately [MN00]. This seems reasonable, as the three features incorporate the advantages of human-like speech perception (PLP) [Her90], robustness to influences of the communication channel (RASTA-PLP) [HMBK92] and stability in the presence of noisy and reverberant conditions (MSG) [KMG98].
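The stacking of context frames can be sketched as follows; edge frames are replicated at the utterance boundaries, which is an assumption of this sketch, as the text does not specify the boundary handling:

```python
def stack_context(frames, left, right):
    """Concatenate each frame with `left` preceding and `right` following
    frames, replicating the first/last frame at the boundaries. The result
    is the input vector fed to one MLP classifier."""
    n = len(frames)
    stacked = []
    for i in range(n):
        window = [frames[min(max(i + d, 0), n - 1)]
                  for d in range(-left, right + 1)]
        stacked.append([x for f in window for x in f])  # flatten the window
    return stacked

# PLP-like case: 26 parameters per frame, 6 + 6 context frames -> 26 * 13 = 338
frames = [[0.0] * 26 for _ in range(50)]
out = stack_context(frames, 6, 6)
```

With 28 MSG parameters and seven context frames on each side, the same function yields 28 · 15 = 420 inputs.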

2.2. Acoustic Modeling

AUDIMUS' acoustic modeling follows the connectionist approach [BM94], combining HMMs and ANNs to overcome the latter's difficulty of handling the time-sequential nature of speech. As can be seen in Fig. 2.1, each of the three distinct feature sets (compare Section 2.1) is followed by a dedicated MLP, and the MLPs share a similar architecture. They are pure feedforward networks with one input layer, two fully connected non-linear hidden layers with 2000 sigmoid units and one output layer with 39 softmax outputs. Thus, each output provides the context-independent posterior phone probability of one of the 38 phones present in EP plus silence. To merge the three probabilities per phone P1, P2 and P3 resulting from the PLP, RASTA-PLP and MSG features, the corresponding values are combined using the average in the log-probability domain [MN00]. The combined probability P(ρ|x⃗) for a phone ρ given the acoustic vector x⃗ is shown in Equation 2.1, where M denotes the number of MLPs used.

P(ρ|x⃗) = ∏_{k=1}^{M} P_k(ρ|x⃗)   (2.1)

Following the connectionist approach, the combined network outputs are further used as estimates of the observation probabilities within the HMM framework. Hence, the observations constitute a stochastic process on a non-observable first-order Markov chain. However, the MLP classifiers provide conditional posterior probabilities, which thus need to be converted to scaled likelihoods P(x⃗|ρ) using Bayes' law.

P(ρ|x⃗) = P(x⃗|ρ) P(ρ) / P(x⃗)   ⇔   P(x⃗|ρ) = P(ρ|x⃗) P(x⃗) / P(ρ)   (2.2)

P(ρ) is the prior probability for phone ρ, which is derived as its relative frequency in the training set. This value is equal for all MLPs (P_PLP(ρ), P_RASTA-PLP(ρ) and P_MSG(ρ)), as they process the same data.

P_PLP(ρ) = P_RASTA-PLP(ρ) = P_MSG(ρ)   (2.3)




Due to the combination of posterior probabilities in Equation 2.1, the combined prior also changes correspondingly:

P(ρ) = ∏_{k=1}^{M} P_k(ρ) = [P_PLP(ρ)]^M   (2.4)

The calculation of the general probability density of the feature vectors P(x⃗) is not needed, as absolute values of the HMM state emission probabilities are not required for recognition [Sta06]. The scaled likelihoods from Equation 2.2 can thus be simplified to:

P(x⃗|ρ) ∝ P(ρ|x⃗) / P(ρ)   (2.5)

2.2.1. Training of the Acoustic Models

Training of the acoustic models requires transcribed training data. Usually this kind of data contains no information for synchronizing with the corresponding audio data. Thus, a frame-based alignment that specifies the exact target phone for each audio frame is needed. To achieve this, AUDIMUS with previously trained (or at least bootstrapped) acoustic models can be used. In alignment mode, the decoder (compare Section 2.5) allows performing a recognition that matches the transcription to the recognized phones.

The acoustic models' MLPs are trained separately for PLP, RASTA-PLP and MSG features using an online backpropagation algorithm. To prevent overtraining, the current training improvement is estimated by cross-validation after each training epoch. In case of a degradation higher than a limiting threshold ('total error tolerance'), the training is stopped. For this purpose the transcribed training data is split into a training and a development set for cross-validation, comprising about 9/10 and 1/10 of the original data, respectively. A method of gradient step-size (learning rate) adaptation, called 'ramp mode', is further employed. If the cross-validation reveals a degradation smaller than the 'total error tolerance', the training algorithm goes back to the best performing epoch and multiplies the step-size in every succeeding epoch by a reduction factor. In case of a recurring degradation, the training is stopped.

L2F's EP baseline acoustic models are trained with 378 hours of training data, of which 332 hours were automatically transcribed using word confidence measures [MVN08].

2.3. Vocabulary and Language Model

In its current version AUDIMUS uses a vocabulary of 100k words, which is adapted on a daily basis for the purpose of BN transcription. This vocabulary size was recently adopted; earlier versions used a static 57k word vocabulary. Changing the size of the vocabulary is always a trade-off: on one hand a bigger vocabulary reduces the Out-Of-Vocabulary (OOV) rate, on the other hand it increases the average acoustic confusability of words, resulting in new recognition errors [Ros95]. The dynamic 100k word vocabulary reduced the OOV rate from the former 1.4% in the static 57k case to 0.49% [Mar08], when tested at the time of first implementation. The adaptive algorithm copes with topic changes in BN over time and the frequent appearance of new words. This is achieved by maintaining a vocabulary derived from the preceding seven days of collected online newspapers. The algorithm performs a selection based on statistical information related to the distribution of Part-Of-Speech (POS) word classes to dynamically select words from two sources, the 57k baseline and the recent vocabulary.

As for statistical language modeling, it was shown to be important not to rely exclusively on newspaper text for building the model, since sentence structure differs significantly between BN and written newspapers [MCNT03]. Using the same data as during the selection of the vocabulary, the EP language model is built by interpolating a 4-gram, absolute discounted language model built from over 604M words of newspaper text, a 3-gram model based on the available manual transcriptions with 532k words using Kneser-Ney discounting, and a 3-gram, Kneser-Ney discounted model based on about 560k words from recent BN emissions as stated above. The mixture coefficients are estimated using the last three weeks of most accurately recognized automatic BN transcripts, as measured by high recognition confidences. Details can be found in [Mar08].

2.3.1. Evaluating Language Models

A common metric for evaluating a language model G is the perplexity Per_G. The perplexity, as shown in Equation 2.6, measures the ability of the language model to predict an arbitrary sequence of J words w_1, w_2, ..., w_J that has not been used to train the model.

Per_G = e^{-(1/J) ∑_{i=1}^{J} ln P_G(w_i|w_1,...,w_{i-1})}   (2.6)

The goal of statistical language modeling can be seen as minimizing the perplexity. Since the perplexity is a function of both the model and the text, two perplexity values may only be compared with respect to the same text and vocabulary.
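Equation 2.6 can be sketched with a toy unigram model; a real language model would condition each word on its history, which is omitted here for brevity:

```python
import math

def perplexity(model, words):
    """Eq. 2.6: exponential of the negative mean log-probability of the word
    sequence under the model. `model` maps each word to its probability; a
    unigram dict stands in for the full conditional P(w_i | w_1..w_{i-1})."""
    logp = sum(math.log(model[w]) for w in words)
    return math.exp(-logp / len(words))

# a uniform model over 4 words has perplexity 4: it is maximally uncertain
uniform = {w: 0.25 for w in ("a", "b", "c", "d")}
ppl = perplexity(uniform, ["a", "b", "c", "d"])
```

A lower perplexity means the model spreads less probability mass over the possible continuations, i.e. it predicts the text better.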

2.4. Pronunciation Model

Based on the vocabulary, a pronunciation lexicon is built, ideally consisting of all possible pronunciations for the words listed. The lexicon restricts the continuously estimated phone probabilities from the acoustic model to meaningful sequences constituting recognized words. To generate the pronunciation lexicon, a Grapheme-to-Phone (G2P) conversion model based on a Finite State Transducer (FST) is used. The G2P is mainly rule-based, but makes further use of an exception lexicon and a multiple pronunciation lexicon. The 100k vocabulary leads to a final lexicon comprising a total of 115,526 different pronunciations. To generate these pronunciations, the G2P model passes through the following procedures (compare [CTVB03] for details):

1. Introduce-phones
2. Exception-lexicon
3. Stress
4. Prefix-lexicon
5. gr2ph
6. sandhi
7. Remove-graphemes

Figure 2.2.: Process flow to generate the pronunciation lexicon starting from the vocabulary.

First ('introduce-phones'), empty placeholders are introduced between the graphemes of a word. This serves to avoid rule dependencies in the following steps. Second ('exception-lexicon'), words not covered by the rules are converted based on additional lexica. During the third phase ('stress'), the stressed vowel of the word is marked; in Portuguese the stressed syllable significantly influences the pronunciation. The fourth phase ('prefix-lexicon') consists of 92 pronunciation rules for compound words with roots of Greek or Latin origin, such as "tele" or "aero". The process 'gr2ph' forms the bulk of the system: around 340 rules convert the graphemes to phones. The underlying rules follow the structure φ → ρ/µl_µr, where φ, ρ, µl and µr are regular expressions. When grapheme φ is found in a context with µl on the left and µr on the right, it is replaced by its phonetic representation ρ. The sixth phase ('sandhi') implements co-articulation rules across word boundaries. In the last step ('remove-graphemes'), all graphemes are removed, yielding a sequence of phones.
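The context-dependent rewrite rules of the form φ → ρ/µl_µr can be sketched with regular-expression lookaround; the two toy rules below are invented for illustration and are not among the actual ~340 EP rules:

```python
import re

def apply_g2p_rules(word, rules):
    """Apply rewrite rules (phi, rho, mu_l, mu_r) in order: grapheme phi
    becomes phone rho when preceded by mu_l and followed by mu_r."""
    for phi, rho, mu_l, mu_r in rules:
        # build the pattern; empty contexts impose no constraint
        pat = (f"(?<={mu_l})" if mu_l else "") + phi + \
              (f"(?={mu_r})" if mu_r else "")
        word = re.sub(pat, rho, word)
    return word

# toy rules: 'c' -> /s/ before front vowels, otherwise -> /k/
rules = [("c", "s", "", "[ei]"), ("c", "k", "", "")]
```

Rule ordering matters: the more specific context rule must precede the default rule, exactly as in ordered rewrite-rule systems.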

2.5. Decoder

The AUDIMUS decoder is based on the Weighted Finite State Transducer (WFST) approach to large vocabulary speech recognition [MPR02]. In this approach, the posterior phone probabilities from the acoustic model (compare Section 2.2) used in the HMMs are mapped to words. For this purpose the acoustic model topology H, the pronunciation lexicon L and the language model G are represented as transducers. The composition of the various transducers forms the final WFST, which leads to a search space of H ∘ L ∘ G.

Traditionally, the composition and further optimization of the transducers happen outside the decoder in a static off-line compilation. However, the decoder used in AUDIMUS is able to perform this on the fly, at run-time. For that purpose a specialized algorithm was developed [Cas03] that composes and optimizes the transducers of the pronunciation lexicon and the language model in a single step. Due to on-demand (lazy) implementations, it only needs to compute the part of the search space actually required at run-time. This significantly improves memory efficiency and flexibility. The run-time memory efficiency allows the decoder to use 4-gram language models in a single pass, whereas other approaches first have to run a smaller language model and recompute preliminary results with a larger model in a second pass. Flexibility arises from all components being computed at run-time, which allows easy and fast reconfiguration by adapting or replacing individual parts.
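The idea of on-demand composition can be sketched for unweighted, epsilon-free transducers; this is a strong simplification of the actual algorithm, and the data structures and toy machines below are invented for the example:

```python
def compose(t1, t2):
    """Lazy composition of two epsilon-free transducers, each given as
    state -> list of (input_label, output_label, next_state). States of the
    result are pairs and their outgoing arcs are generated only on demand,
    mirroring how the decoder expands only the explored search space."""
    def arcs(state):
        s1, s2 = state
        for i1, o1, n1 in t1.get(s1, []):
            for i2, o2, n2 in t2.get(s2, []):
                if o1 == i2:  # output of the first feeds the input of the second
                    yield i1, o2, (n1, n2)
    return arcs

# toy phone-level machine and a machine emitting a word on its last phone
L = {0: [("b", "b", 1)], 1: [("o", "o", 2)], 2: [("m", "m", 3)]}
G = {0: [("b", "", 1)], 1: [("o", "", 2)], 2: [("m", "bom", 3)]}
arcs = compose(L, G)
first = list(arcs((0, 0)))
```

Only the pair states actually reached during decoding are ever expanded, which is the source of the memory savings described above.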

The AUDIMUS decoder further allows performing alignments between the audio and a given transcription. In this alignment mode, the decoder's search space changes the composition of the transducers from H ∘ L ∘ G to H ∘ L ∘ T, where T is a transducer restricted to the word sequences of the manual transcription, permitting additional silence between words. Moreover, the decoder keeps track of the time boundaries between the phones and outputs this information.

2.6. Confidence Measures

During recognition, AUDIMUS allows the generation of confidence measures, which estimate the accuracy of each recognized word. Word confidence measures play a key role in the automatic generation of subtitles for BN, as one wants to display only subtitles that are likely to be recognized correctly. More generally, associating word confidences with the recognition output is essential for evaluating the impact of potential recognition errors, and it is crucial for selecting new acoustic training data for unsupervised training from automatically transcribed material.

The calculation of confidence measures is based on the classification of several confidence features. These features include a recognition score, an acoustic score, a word posterior probability, several search space statistics (number of competing hypotheses and number of competing phones) and the phone log-likelihood ratios between the hypothesized and the best competing phone. After recognizing a sentence, confidence features for each recognized phone are extracted, combined to word level and scaled to the interval [0,1]. A maximum entropy classifier [BPP96] combines the features according to

P(correct|w_i) = (1 / Z(w_i)) exp( ∑_{j=1}^{F} λ_ij f_ij(w_i) )   (2.7)

where w_i is word i, Z(w_i) is a normalization factor, F is the total number of features, λ are the model parameters and f_ij(w_i) is a single word's specific confidence feature. The detector needs to be trained with transcribed acoustic data and sensitively matches the recognizer's configuration with reference to the acoustic, pronunciation and language models used for training.
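For the two-class case (correct vs. wrong), Equation 2.7 reduces to a logistic function, since Z(w_i) normalizes over both classes; the weights below are illustrative, not trained values:

```python
import math

def word_confidence(features, weights):
    """Eq. 2.7 specialized to two classes: the maximum entropy combination of
    the scaled confidence features f_ij in [0, 1], with Z(w_i) normalizing
    over {correct, wrong} (the wrong-class score is taken as 0 here)."""
    score = sum(lam * f for lam, f in zip(weights, features))
    return math.exp(score) / (math.exp(score) + 1.0)

# three toy features (e.g. posterior, acoustic score, competitor ratio)
conf = word_confidence([0.9, 0.8, 0.3], [2.0, 1.0, -0.5])
```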




2.7. Evaluation and Statistical Significance

For the purpose of evaluating recognition results, the WER is used as a measure of the percentage of word errors. To calculate the WER, an alignment between the recognized automatic transcription and the (manual) reference transcription is generated, yielding the number of substitutions (misspelled or wrong words), deletions (words missing in the recognition output), insertions (words that are wrongly added) and correct words. The ratio of errors to the total words counted in the reference constitutes the WER.

WER = (substitutions + deletions + insertions) / (total words in reference)   (2.8)

Further, the differences in performance of modified systems are tested for statistical significance. Gillick and Cox [GC89] have suggested the use of the MAtched-Pair Sentence-Segment Word Error (MAPSSWE) test, which is able to identify significant differences that are not revealed by other tests, for instance the McNemar test, a sentence-level test. An implementation [PFF90] of the test available through the NIST SCoring ToolKit (SCTK) is used for this purpose.

The implementation of the two-tailed MAPSSWE test uses knowledge from the aligned reference and hypothesized sentence strings to locate segments that contain recognition errors. The selection of these segments needs to ensure that the errors in one segment are statistically independent of the errors in any other segment. To achieve this, the algorithm first identifies segments without errors. Subsequently, the correct segments form the boundaries of segments containing errors. A minimum of two correctly recognized words separating erroneous segments is assumed to ensure statistical independence while simultaneously allowing the sentence strings to be subdivided into a large number of segments, which facilitates verifying statistical significance.
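The segmentation step can be sketched as follows; this is a simplified illustration of the segment-splitting idea, not the actual SCTK implementation:

```python
def error_segments(marks, gap=2):
    """Split a per-word correctness sequence (True = correct) into error
    segments separated by at least `gap` consecutive correct words, so that
    errors in different segments can be treated as independent."""
    segments, current, run_ok = [], [], 0
    for i, ok in enumerate(marks):
        if ok:
            run_ok += 1
            if current and run_ok >= gap:   # enough correct words: close segment
                segments.append(current)
                current = []
        else:
            run_ok = 0
            current.append(i)               # record the error position
    if current:
        segments.append(current)
    return segments

segs = error_segments([True, False, False, True, True, False,
                       True, False, True, True, True, False])
```

A single correct word between two errors does not split them, since fewer than `gap` correct words separate the error positions.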

A 95% confidence level is used for rejecting the null hypothesis that the two systems do not differ significantly (significance level α = 5%).



3. Corpora

This work targets the recognition of BN, since BN corpora are easy to collect in different varieties and have a broad range of applications. BN corpora are very diverse. They normally include speech from different speakers with different backgrounds and accents. Moreover, they contain different types of speech, for instance read or spontaneous speech. In the case of AP, however, this diversity becomes a problem. Most speakers have Portuguese as a second language and their degree of proficiency often shows great variability. Thus, while some speakers show only small differences relative to the European variety, others deviate strongly from it and also from each other. In neither case may these speakers be considered good representatives of the local variety. The most noticeable cases are some African politicians and reporters who have studied or spent a significant part of their lives in Portugal and show little or no local accent at all.

This chapter presents all corpora used throughout this thesis. It first introduces the corpora based on audio material and later those based on text. The acoustic corpora are further organized by function: those used for phonetic classification and those serving to build a variety identification system. Further details specifying the exact BN programs used for recording and the corresponding emission dates can be found in Appendices A and B.

3.1. Acoustic Corpora

3.1.1. Corpora for Phonetic Classification

This section introduces the corpora used to train phonetic classifiers, namely AUDIMUS' acoustic models (see Section 2.2). It is organized by variety, presenting the AP, BP and EP corpora.

3.1.1.1. African Portuguese

The AP BN corpus was recorded from RTP-África, a television channel owned by the Portuguese public broadcasting corporation Rádio e Televisão de Portugal (RTP) that targets the Portuguese-speaking African countries. The corpus consists of several emissions of 'Reporter Africa', a BN newscast with a focus on Africa. The corpus is divided into training ('AP train'), development ('AP dev') and testing ('AP test') subsets (for details about the exact composition see Tables A.1, A.2 and A.3, respectively) and is manually orthographically transcribed. The shows have an average length of 25 minutes. However, the actual amount of AP speech is much lower, as the anchor, who usually introduces all reports, speaks the EP variety. Moreover, all emissions are of the same program, frequently including the same reporters from different PALOP countries. To overcome this lack of variability, an AP extra corpus (see Table A.4 for details) was collected, consisting of other types of TV emissions including soaps and BN shows from the national channel of Angola, Televisão Pública de Angola (TPA).

Due to the shortage of manually transcribed corpora, unsupervised training (compare Section 6.1.3) will further help to overcome this limitation; hence an automatically recorded and transcribed AP corpus ('AP automatic') is used (see Table A.5 for details about the exact composition).

Table 3.1 displays all AP corpora and their corresponding durations. It has to be noted that the selection of AP-accented speech within the AP automatic corpus was done automatically with the variety identification system introduced in Section 5. The corpus' duration is calculated based on that classification. Refer to Table A.5 for the total raw length of the recorded files.

Corpus Name     Total Duration [min.]
AP train        456.9
AP dev          49.8
AP test         47.9
AP extra        87.5
AP automatic    1883.7

Table 3.1.: Overview of the AP corpora.

3.1.1.2. Brazilian Portuguese

The Brazilian TV corpus contains recordings of several broadcast news shows and debates. The corpus is divided into training ('BP train') and development ('BP dev') sets (for details about the exact composition see Tables A.7 and A.6). Most of the recorded shows, including news shows, were interrupted by several commercial breaks. This caused some programs to be split into several short parts. Table 3.2 displays all BP corpora and their corresponding durations.

Corpus Name     Total Duration [min.]
BP train        456.9
BP dev          44.9

Table 3.2.: Overview of the BP corpora.




3.1.1.3. European Portuguese

The EP corpus consists of a selection of about eight hours from the 'ALERT Speech Recognition Training' corpus. The corpus is divided into a training ('EP train') and a development ('EP dev') set (see Tables A.8 and A.9 for details about the exact composition). Table 3.3 gives an overview of both sets. The original ALERT corpus contains around 61 hours of manually transcribed BN recordings collected in the year 2000. The complete corpus comprises national and regional news shows from morning till late evening, including normal broadcasts and specific ones dedicated to sports and financial news.

Corpus Name     Total Duration [min.]
EP train        454.4
EP dev          23.0

Table 3.3.: Overview of the EP corpora.

3.1.2. Corpora for Variety Identification

For the task of variety identification, a training ('VID train') and a testing corpus ('VID test') are introduced, each consisting of data from all three varieties AP, BP and EP. The corpora contain different amounts of hand-labeled audio segments with information about what is represented. Each segment contains speech of a single speaker. The same speaker may appear in several segments, but the data is sorted to ensure that no speaker appears in train and test simultaneously. The data originates from BN recordings and, due to limited resources, comes partly from the same recordings as the corpora used for phonetic classification in Section 3.1.1. Refer to Appendix B for details on the composition and origin of the corpora. Tables 3.4 and 3.5 present numerical details about the VID training and VID testing corpora.

Train Data           AP      BP      EP      Σ
duration [min.]      238.8   256.1   279.1   774.0
segments             1424    1434    1283    4141
∅ dur./segm. [s]     10.1    10.7    13.1    11.2
<3s [%]              16.9    12.8    0.1     10.3
3-10s [%]            42.3    44.4    49.6    45.3
10-30s [%]           38.7    38.7    44.1    40.4
>30s [%]             2.2     4.1     6.3     4.1

Table 3.4.: Details about the 'VID train' corpus.

3.2. Text Corpora

As part of this work, a text corpus was collected consisting of data from different African newspapers. The data was gathered from the newspapers' on-line archives. The aim



3. Corpora

Test Data           AP     BP     EP     Σ

duration [min.]     88.8   80.2   99.0   268.0
segments            610    462    412    1484
∅ dur./segm. [s]    8.7    10.4   14.4   10.8
<3s [%]             23.3   18.0   0.2    15.2
3-10s [%]           43.1   40.0   42.5   42.0
10-30s [%]          32.8   38.1   50.0   39.2
>30s [%]            1.0    4.1    7.5    3.8

Table 3.5.: Details about the ‘VID test’ corpus.

of equally balancing the collected corpus sizes could not be achieved, as the number of articles published in Angola and Mozambique is much higher than in Guinea-Bissau and São Tomé and Príncipe. A total of 1682k words has been collected. Table 3.6 gives details of the accumulated data.

Country              Words  Start   Newspaper           URL

Angola               604k   Feb 04  AngoNoticias        angonoticias.com
                            May 08  Jornal de Angola    jornaldeangola.sapo.ao

Cape Verde           320k   Sep 07  Expresso das Ilhas  expressodasilhas.sapo.cv

Guinea-Bissau        196k   Sep 05  Bissau Digital      bissaudigital.com
                            Jan 03  Guine-Bissau        guine-bissau.com
                            Sep 09  Jornal Nô Pintcha   jornalnopintcha.com

Mozambique           391k   Jan 06  CanalMoz            canalmoz.com
                            Apr 06  notícias            jornalnoticias.co.mz

São Tomé e Príncipe  171k   Jan 04  Jornal de S. Tomé   jornal.st
                            Jul 08  Téla nón            telanon.info
                            Aug 09  Jornal Vitrina      vitrina.st

Table 3.6.: Overview of sources composing the text corpus collected until December 2009 (Total words: 1682k).


4. Baseline Speech Recognition of African Portuguese

This chapter presents the baseline results that will serve as a reference when evaluating the enhanced recognition systems in Chapter 6. The creation of baseline results was preceded by a revision of the available manual transcriptions, and by a manual classification of the speakers present in the testing corpus characterizing the strength of their AP accent. The further organization of this chapter closely follows this chronological order.

4.1. Revision of Manual Transcriptions

Intensive work with the AP corpora (see Section 3.1.1.1) revealed that most of the existing transcriptions were inaccurate. A correction of most errors had existed, but was destroyed due to encoding problems. Correcting the errors requires manual work, which is reduced by implementing a semi-automatic process based on the Levenshtein algorithm [Lev66]. This algorithm allows the automatic correction of most errors by calculating a word distance measure over possible corrections. Manual intervention is only required when several candidate corrections share the same distance measure, which occurs in roughly one out of ten errors. Aside from the encoding errors, wrongly set segment lengths and a few linguistic transcriptions need to be manually corrected.
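The selection step of such a semi-automatic process can be sketched as follows; `pick_correction` and its candidate lists are hypothetical names, and the tie handling mirrors the rule that equidistant candidates are deferred to manual review:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance [Lev66]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pick_correction(error, candidates):
    """Return the unique nearest candidate, or None when several
    candidates tie (the case deferred to manual correction)."""
    scored = sorted((levenshtein(error, c), c) for c in candidates)
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]
```

Under this scheme only ties between candidates require a human decision, matching the roughly one-in-ten rate reported above.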

The revision’s impact can be measured in improved automatic alignment of the audio source and the manual transcriptions. The number of aligned frames increases, as the transcription corresponds more accurately to the actual audio data. 14% more alignment in training, 5% more in development and 4% more in testing could be achieved. Table 4.1 shows the duration of aligned frames in minutes available for training, development and testing before and after the revision.

alignment [min]   old    revised
AP train          365.8  416.1
AP dev            43.8   45.8
AP test           42.0   43.5

Table 4.1.: Alignment of AP corpora before and after manual corrections.



4.2. Manual Classification of Speakers

A detailed analysis of the baseline’s recognition performance may benefit from a classification of the testing corpus in terms of the strength of each speaker’s AP accent. Such a classification allows assessing whether certain categories suffer a higher degradation in recognition than others.

The classification is subjective and based on an expert’s opinion. It is compared to a second expert’s judgments. Both experts were asked to label each of the 75 speakers appearing in the ‘AP test’ corpus using one of the three labels no AP accent, slight AP accent and strong AP accent, corresponding to the following characteristics.

• No accent: Pronunciation common among natives from Portugal.

• Slight accent: AP characteristics are present, although not very frequent and less intense. No syntactic differences are observed.

• Strong accent: Characteristics of AP are present, frequently noticeable and subjectively strong, particularly in contexts where in EP /R/, /r/, /l~/, /l/, /L/, /J/ and the vowels would be realized. Further, omitted articles, substituted prepositions and lacking number agreement are observed.

The experts reported that speakers without any accent were easy to classify. However, the coherent distinction between slight and strong accents was claimed to be less trivial. This can be confirmed by comparing the experts’ labels using the Cohen Kappa [Coh60], a statistical measure assessing the reliability and agreement of two evaluators. Following Equations 4.1 and 4.2, an overall Cohen Kappa of κ = 0.58 and a standard error of σκ = 0.09 are achieved, where po is the proportion of agreement between the experts, pc is the proportion of agreement expected by chance and N is the number of speakers being classified.

κ = (po − pc) / (1 − pc)                    (4.1)

σκ = √( po (1 − po) / ( N (1 − pc)² ) )     (4.2)
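As a minimal sketch, Equations 4.1 and 4.2 translate into the following function; the label lists are hypothetical, and the chance agreement pc is computed from both raters’ marginal label frequencies:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa (Eq. 4.1) and its standard error (Eq. 4.2)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: summed product of both raters' marginal frequencies
    p_c = sum(freq_a[k] * freq_b[k] for k in set(labels_a) | set(labels_b)) / n ** 2
    kappa = (p_o - p_c) / (1 - p_c)
    sigma = (p_o * (1 - p_o) / (n * (1 - p_c) ** 2)) ** 0.5
    return kappa, sigma
```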

According to Landis and Koch’s [LK77] interpretation of Kappa values, this corresponds to a ‘moderate agreement’ between both experts. The Cohen Kappa values for the single classes are further broken down in Table 4.2, reflecting an ‘almost perfect agreement’ when classifying no accent and a ‘moderate agreement’ on slight and strong accents.

Moreover, the speaking style and the acoustic conditions of each speaker’s turn are also manually classified. The speaking style classification only distinguishes between spontaneous and non-spontaneous speech, taking into account the presence of disfluencies (e.g., filled pauses, repetitions, hesitations, etc.) and colloquial syntax. The acoustic classification refers only to the presence


      no accent  slight accent  strong accent
κ     0.86       0.46           0.57
σκ    0.10       0.11           0.10

Table 4.2.: Cohen Kappa and standard error for each individual class.

[Figure 4.1: bar chart of the amount of words [%] for each accent class (no, slight, strong), with bars for the four combinations of spontaneous (0/1) and noise (0/1).]

Figure 4.1.: Distribution of accent strength, spontaneous speech and noise in the ‘AP test’ corpus.

of noise in the utterance. Figure 4.1 shows the distribution of accent strength coinciding with spontaneous speech and noise in the ‘AP test’ corpus. ‘Zeros’ and ‘ones’ refer to ‘not present’ and ‘present’, respectively. The existence of speakers without any AP accent in the testing corpus was surprising to the author, as in previous work the testing corpus had already been manually filtered to contain only AP speech. Consequently, the relative amount of EP accented speech within the ‘AP test’ corpus is small. It will be discarded from the testing corpus for all further evaluations within this thesis, as the performance of the developed systems only needs to be assessed on AP varieties. Further, due to their size and quality, the EP accented parts are not representative of performance on EP speech.

It is worth noticing that the worst acoustic conditions (spontaneous speech and noise present) correlate with the strength of the accent. This is probably because stronger accents are mostly encountered outside of studio situations, such as in street interviews with loud background noise.

After the exclusion of any EP speakers, the testing corpus contains 6806 spoken words, 46.4% with a slight AP accent and 53.6% with a strong AP accent. To further differentiate results, speakers in clean acoustic conditions (planned speech and no noise) are additionally assessed. Slight accents in clean acoustic conditions account for 34.4% of word utterances, whereas strong accents in clean acoustic conditions constitute 23.3% of the words spoken in the testing corpus.


4.3. Baseline Results and Discussion

L2F’s standard EP automatic speech recognition system, as described in detail in Chapter 2, was used to produce baseline results. It uses a vocabulary of 100k words. The language model results from the interpolation of a 4-gram language model built from over 604M words of newspaper text, a 3-gram model based on manual BN transcriptions with 532k words, and a 3-gram based on about 560k words coming from automatic BN transcriptions. The EP acoustic models have been trained on 378 hours of training data, including 46 hours of manually transcribed data.

The baseline results for slight and strong AP accents, overall and in clean acoustic conditions, are shown in Table 4.3. For slight accents, the results are about 20% worse than results reported for EP, whereas strong accents degrade the recognition by nearly 60% compared to EP results [MVN08]. The performance on strong accents is about 33% worse than on slight accents. In clean acoustic conditions, however, this degradation reduces to about 12%. The correlation between noisy conditions and accent strength is surely a factor that partly accounts for this reduction.

                accent characterization
WER [%]         slight  strong  ∅

any condition   25.4    34.0    30.1
clean           20.5    23.0    21.5

Table 4.3.: Baseline WER with L2F’s standard EP system on the ‘AP test’ corpus.
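The relative degradations quoted in the discussion above follow directly from Table 4.3; a quick check (the function name is illustrative):

```python
def rel_degradation(wer_strong, wer_slight):
    """Relative WER increase of strong over slight accents, in percent."""
    return (wer_strong - wer_slight) / wer_slight * 100

any_cond = rel_degradation(34.0, 25.4)  # ~33.9%, i.e. "about 33% worse"
clean = rel_degradation(23.0, 20.5)     # ~12.2%, i.e. "about 12%"
```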


5. Automatic Identification of African and European Portuguese

Variety-specific acoustic models, used to improve the automatic recognition of AP within this thesis, are trained with transcribed AP BN recordings. Nevertheless, manual transcriptions are costly and time-consuming and therefore always limited. As reported in Section 1.4.2, a common approach to cope with limited training data is unsupervised training with automatically transcribed speech, which has also significantly lowered the WER in L2F’s EP recognizer.

However, as opposed to distant languages that are not mutually intelligible, different varieties often appear side-by-side in public media. Thus, adapting an ASR system to a certain variety necessarily involves the need for separating the data prior to training. In the case of AP-specific speech recognition within this thesis, the available BN recordings for automatic transcription mainly contain speech with AP and EP accents. Unsupervised training cannot be realized without a VID module to select solely AP accented speech.

This chapter deals with a promising implementation of such a VID module. A specialized language identifier, based on Phone Recognition followed by Language Modeling (PRLM), is used to differentiate the Portuguese varieties spoken in the PALOP countries from EP in a highly accurate and efficient manner. Moreover, to underline the effectiveness of the proposed approach, it is extended to cover the BP variety, distinguishing BP vs. EP, AP vs. BP and all three varieties from each other.

In the following, Section 5.1 introduces a conventional VID system serving as the baseline. The proposed approach is presented in Section 5.2. Experimental results are shown in Section 5.3. In Section 5.4 the approach is applied to cover the third Portuguese variety. A discussion of all results can be found in Section 5.5, which also specifies the final VID system used for automatic transcription within the rest of this thesis.

5.1. Baseline Variety Identification

There are several approaches to tackle the problem of automatic language (or, more specifically, variety) identification, as presented in detail in Section 1.4.1. The most common approaches include acoustic-, phonotactic- or even prosody-based methods [CCC+10][ZB01][NLLM10]. A combination of phonotactic and acoustic methods is usually considered among the best performing approaches.


Thus, in order to serve as a strong reference, the following subsections present the baseline VID system, a combination of phonotactic and acoustic systems.

5.1.1. Baseline Phonotactic Systems

The key aspect of PRLM systems is a robust phonetic classifier that generally needs to be trained with word-level or phonetic-level transcriptions. The tokenization of the input speech data is done with the neural networks used for acoustic modeling in L2F’s hybrid recognition system (AUDIMUS), as introduced in depth in Chapter 2. This type of recognizer is composed of three phonetic classification networks, particularly MLPs, working with PLP, RASTA-PLP and MSG features. The same classifier fed with speech of different varieties produces different sequences of tokens for each of them. In training mode, variety-specific statistical models are created. In test mode, the posterior probability of belonging to a certain variety may be estimated by comparing the token sequences for a given speech frame (and its context) with a certain variety’s trained statistical model. Thus, phone recognizers from any arbitrary language may be used, independent of the VID system’s target varieties.
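The train/test logic of a PRLM backend can be sketched as a toy reduction under stated assumptions: add-alpha smoothed bigrams stand in for the Witten-Bell smoothed 3-grams used later, and `train_bigram`/`classify` are hypothetical names:

```python
import math
from collections import Counter

def train_bigram(token_seqs, vocab_size, alpha=1.0):
    """Train a per-variety phonotactic model on phone-token sequences.
    Add-alpha smoothing is a stand-in for Witten-Bell discounting."""
    bi, uni = Counter(), Counter()
    for seq in token_seqs:
        uni.update(seq[:-1])
        bi.update(zip(seq, seq[1:]))
    return lambda a, b: math.log((bi[(a, b)] + alpha) /
                                 (uni[a] + alpha * vocab_size))

def classify(tokens, models):
    """Score a test token sequence under every variety model and
    return the highest scoring variety."""
    def loglik(logprob):
        return sum(logprob(a, b) for a, b in zip(tokens, tokens[1:]))
    return max(models, key=lambda v: loglik(models[v]))
```

A dictionary such as {"AP": train_bigram(ap_seqs, 39), "EP": train_bigram(ep_seqs, 39)} would then classify unseen tokenizer output by phonotactics alone.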

Three of L2F’s phone recognizers are used in parallel within the phonotactic VID system. This common method is referred to as Parallel PRLM (PPRLM) [Zis96]. The EP classifier was trained with 378 hours of training data, of which 332 were automatically transcribed, using word confidence measures [MVN08]. It used the same training data as the baseline ASR system introduced in Chapter 4. The BP classifier was trained with more than 46 hours of BN data from the Record channel transmitted by cable TV in Portugal, out of which 13 had been manually transcribed (being consistent with the ‘BP train’ corpus, see [ATNV09] for details). The AP classifier is a preliminary version of those developed in the following Chapter 6. It is an adaptation of L2F’s current EP recognizer trained for a few epochs with about seven hours of manually transcribed AP BN training data (‘AP train’ corpus). It is hence the only baseline classifier that has been trained with data from two different varieties.

The size of the neural networks of each tokenizer differs due to the different amounts of training data. According to Chapter 2, the context window of the MLP networks trained with PLP and RASTA-PLP features is fixed to 13 frames, while a context of 15 frames was chosen for MSG features. The EP networks have two hidden layers of 500 units and an output layer of 39 units. The BP networks also have two hidden layers of 500 units and an output layer of 40 units. The AP networks consist of two hidden layers with 2000 units each and have 39 output units. Note that the size of the output layer corresponds to the number of phonetic units of each variety, plus silence (no additional sub-phonetic or context-dependent units have been considered [AN08]). The phonetic classifiers used in the baseline VID system are summarized in Table 5.1.


            Train Data [hours]   Layer Size
Classifier  total    manual      Hidden  Output

EP          378      46          500     39
BP          46       13          500     40
AP          378+7    46+7        2000    39

Table 5.1.: Overview of employed baseline classifiers.

Phonotactic Modeling

For every phonetic tokenizer, the phonotactics of each target language are modeled with a 3-gram back-off model that is smoothed using Witten-Bell discounting. For that purpose the SRI Language Modeling (SRILM) toolkit is used [Sto02]. The token sequences are modeled separately for each target variety and every tokenizer. To model the EP variety, all 1283 segments (279 minutes) of the ‘VID train’ corpus are used, according to Section 3.1.2. The AP model is trained with 1424 segments, adding up to 240 minutes. The ‘VID test’ corpus is used for testing.

In both training and testing, the raw phonotactic sequence obtained by each tokenizer is filtered in order to avoid spurious phone recognitions. Concretely, phones that appear only once in the middle of long sequences of identical phones are deleted, and only transitions between phones are considered in the language model.
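The two filtering steps can be sketched as follows; this is a minimal reading of the rule, and `filter_tokens` is a hypothetical name:

```python
def filter_tokens(seq):
    """Drop a phone that occurs exactly once inside a run of another,
    identical phone, then collapse runs so that only transitions
    between phones remain."""
    kept = []
    for i, p in enumerate(seq):
        lone = 0 < i < len(seq) - 1 and seq[i - 1] == seq[i + 1] != p
        if not lone:
            kept.append(p)
    # keep only transitions: collapse consecutive repetitions
    return [p for i, p in enumerate(kept) if i == 0 or p != kept[i - 1]]
```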

5.1.2. Baseline Acoustic System

As described in Section 1.4.1, but not further addressed in this thesis, a Gaussian SuperVector (GSV) [CCR+06] system based on mean supervectors (MAP adaptation of the Gaussian means) with an alternative scoring approach [Cam08] has been chosen to support the baseline VID system. In contrast to the conventional GSV, each language SVM model is pushed back to a positive and a negative language-dependent GMM model, which are then used to calculate log-likelihood ratio scores. In certain situations, especially on short utterances, this approach (henceforth referred to as GMM-SVM) has shown improved accuracy.

5.1.3. Calibration and Fusion

Calibration of each individual system and the right weighting for the fusion of all four parallel systems are needed. Linear logistic regression fusion and calibration is done with the FoCal Multiclass Toolkit [Brü07]. A single cross-fold calibration is performed for all segments of different lengths. The final calibration and fusion weights correspond to the mean of five independent calibrations, each using a random 20% of the ‘VID test’ set. The data has been selected in order to ensure that the same speakers do not appear in train and test simultaneously. The EP testing data contains 412 segments (99 minutes), whereas the AP testing data contains 610 segments (89 minutes). Details can be found in Section 3.1.2.


5.2. Single Mono-Phonetic PRLM using a Specialized Phonetic Tokenizer

The proposed system follows the PRLM approach. It employs a single but highly specialized tokenizer. Since the two varieties realize acoustically very similar phone sets, a tokenizer incorporating the phonetic differences between AP and EP may improve recognition ability. To better characterize these differences, all phones occurring in the varieties are divided into the following two groups [BAB94]:

1. mono-phones: phones in a variety/language that overlap little or not at all with those in another variety/language (e.g. the English /r/ vs. the German /r/).

2. poly-phones: phones that are similar enough across the varieties/languages to be equated (e.g. /sh/ in English and German).

Phonotactic approaches are able to benefit from both types of phones in order to classify speech. With the help of statistical modeling, differently occurring poly-phones carry important information through the sequences they appear in. However, if mono-phones are found in speech, they could, at least in theory, instantly help to differentiate the varieties.

To incorporate this idea, the proposed phonetic classifier needs to be able to identify the mono-phones of each variety. Therefore, the classifier network is trained with data from both varieties, AP and EP, simultaneously. Each of the varieties contains a set of 38 phones plus silence. If every phone out of these sets were considered to be a mono-phone, a phonetic tokenizer would generate 76 different output tokens plus silence, corresponding to the phonetic realization in each variety. However, since most of the phones are poly-phones, this is not convenient, as it would prevent the training process of the phonetic classifier from converging. The resulting system would be poorly trained. It is rather important to find out which phones actually have mono-phonetic properties. Only those should be given distinct outputs, instead of training a phone recognizer with 76 output phones.

5.2.1. Determining Mono-Phones

Determining the set of phones that is unique to a certain variety, given its neighboring variety, is not straightforward. Linguistic knowledge about the varieties’ phonological characteristics is crucial, but often not available, not sufficiently detailed or controversial. Within this thesis a computational method is used instead to find variety-dependent unique phones: binary MLPs are trained to discriminate pairs of aligned phone classes from both varieties. It is worth noting that the phones selected by this technique are not mono-phones from a strict linguistic point of view, but rather mono-phone-like units. For the sake of simplicity, the term ‘mono-phones’ is nevertheless used in the scope of this thesis to refer to these units.

L2F’s EP ASR system [MCNT03] is used to align the training and development data of both varieties (‘AP train’, ‘AP dev’, ‘EP train’ and ‘EP dev’). To train a binary classifier that allows


to see if the AP representation of a certain phone differs from its EP counterpart, solely those two phones are kept in the training data. All other phones are removed from the training corpus. The chosen two phones are given distinct output classes. For the sake of simplicity, MSG features are chosen for training. After training, the successful separation of both classes can be verified using the development data. Further, the training is limited to seven epochs after the step-size reduction (see Section 2.2) has started. As a consequence, it is a fast process, which enables binary classifiers to be trained for all 38 pairs of phones. In this way it may be determined which phone classes are different enough to be successfully distinguished and hence contain mono-phonetic characteristics. The performance of all binary classifiers is shown in Fig. 5.1.

Figure 5.2 shows the Detection Error Tradeoff (DET) performance of systems using the highest scoring seven, eight and nine mono-phones. Further, the performance of two VID systems trained with the lowest eight and the middle eight mono-phones is given for comparison.

It may be concluded that a selection of eight mono-phones seems to be the best choice, probably due to a good trade-off between network complexity, training data and classification performance. Moreover, the quality of mono-phones in terms of supporting the VID performance seems to correspond to the ranking displayed in Figure 5.1. The eight best performing phone classes are thus chosen as mono-phone units, namely /L/, /O/, /l/, /e~/, /J/, /a/, /e/ and /Z/ using the Portuguese Speech Assessment Methods Phonetic Alphabet (SAMPA). This leads to a phonetic recognizer with 30 poly-phones and two times eight mono-phones plus silence, thus 47 outputs, which seems reasonable considering the 14 hours of available training data.
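The selection rule reduces to ranking phone classes by binary-classifier separation and keeping the top eight; the scores below are illustrative values only, not the measured ones from Figure 5.1:

```python
# illustrative separability scores per SAMPA phone (hypothetical values)
separability = {"L": 0.93, "O": 0.90, "l": 0.88, "e~": 0.87, "J": 0.86,
                "a": 0.85, "e": 0.84, "Z": 0.83, "E": 0.80, "o~": 0.78,
                "b": 0.75, "j": 0.72}

def pick_mono_phones(scores, k=8):
    """Keep the k phone classes that a binary classifier separates best."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# 30 poly-phones + two variants of each mono-phone + silence
n_outputs = 30 + 2 * len(pick_mono_phones(separability)) + 1
```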

5.2.2. Linguistic Interpretation of chosen Mono-Phones

Although it is possible to argue that, at an underlying phonological level, all Portuguese varieties share a common segmental inventory, some important differences may be found in the way those underlying segments are realized phonetically in different contexts. Some of these contextual variants are unique in the sense that they do not belong to the phone set common to all varieties and, if correctly identified by language specific phone models, they may be used as important cues for accent identification.

In the experiments described above, /L/ and /J/ are among the best candidates for mono-phones, and indeed, /L/ and /J/ are frequently not pronounced as such by most AP speakers, but as a slightly palatalized lateral or nasal consonant followed by /j/. This pronunciation is identical to the one found when a /l/ is followed by a /i/, and that may partly explain why /l/ also appears among the best candidates. Note, however, that a mono-phone for this consonant probably also accounts for lateral flaps. Otherwise, one could expect it to be ranked closer to other coronal consonants, which are most often apico-alveolar in all AP varieties. This feature, particularly noticeable in /r/, /l/, /t/ and /d/, is often the only hint for the listener to identify AP speakers that otherwise do not differ from EP ones. It is surprising that /t/ and /d/ are not ranked higher.


[Figure 5.1: bar chart of correct separation [%] per SAMPA phone, with phones ranked in ascending order of separability.]

Figure 5.1.: Correct separation of AP and EP by each SAMPA phone using a binary classifier.

[Figure 5.2: DET curves (miss probability vs. false alarm probability, both in %) for the systems ‘top-7’, ‘top-8’, ‘top-9’, ‘lowest-8’ and ‘middle-8’.]

Figure 5.2.: DET of VID performance of mono-phonetic systems using the best seven, eight and nine mono-phones and systems with a selection of the lowest and middle scoring mono-phones for training.


[Figure 5.3: Signal → Feature Extraction → Mono-Phonetic Phone Recognizer → AP/EP statistical models → Decision.]

Figure 5.3.: Block diagram of the mono-phonetic PRLM system.

Concerning vowel differences, besides the fact that in AP open-mid and close-mid vowels may present an intermediate quality, the constraints that regulate vowel contextual realizations are also different. Thus, for instance, while the presence of an /l/ in final syllable position blocks vowel reduction in EP, the same does not always happen in AP: e.g. /a/ may be realized as /6/ or even /@/ in this context (e.g. Almeida - [al~m"6jd6] in EP, but [6l~m"ejd6] in AP). Similarly, whereas in EP neither the secondary stressed vowels nor the linking vowel of morphological compounds may be reduced, in AP those compounds are often realized as single words and these vowels may surface either as mid or high (e.g. rodoviária - [R%OdOvj"arj6] in EP, but [ruduvj"arj6] or [rodovj"arj6] in AP, depending on the speaker’s linguistic profile).

5.2.3. Phonotactic Core

Like the phonotactic baseline systems in Section 5.1.1, the mono-phonetic phone recognizer combines three MLP outputs trained with PLP, RASTA-PLP and MSG features. The mono-phonetic classifier already achieves high performance using just a single phone recognizer with subsequent statistical modeling of the phone occurrences, as shown in Fig. 5.3. To train the mono-phonetic neural network, balanced AP and EP data (‘AP train’ and ‘EP train’) is used.

5.3. Variety Identification Results

Figure 5.4 shows the DET curves of differently combined baseline systems (named using capital letters) and of two mono-phonetic approaches (named using lower case letters). Baseline results are given for the acoustic system (‘GMM-GSV-Baseline’), the fusion of the AP and EP phonotactic systems (‘AP+EP-Baseline’), the fusion of the AP, BP and EP phonotactic systems (‘AP+BP+EP-Baseline’) and the fusion of all baseline systems (‘AP+BP+EP+GMM-SVM-Baseline’). The DET performance of the mono-phonetic system trained with the previously determined selection of mono-phones is named ‘apep’. To allow assessing the quality of the chosen mono-phones, a system using a different selection has also been trained. This second mono-phonetic system (‘apep-tuned’) uses a selection of eight mono-phonetic vowels, namely /a/, /6/, /e~/, /i/, /o/, /O/, /o~/ and /u/.

The best baseline system, the fusion of four independent systems including phonotactic and acoustic approaches, reaches an EER of 11.4%. The mono-phonetic system ‘apep’ achieves a relative reduction of more than 60%, having an EER of 4.1%. The tuned mono-phonetic system


[Figure 5.4: DET curves (miss probability vs. false alarm probability, both in %) for ‘GMM-GSV-Baseline’, ‘AP+EP-Baseline’, ‘AP+BP+EP-Baseline’, ‘AP+BP+EP+GMM-GSV-Baseline’, ‘apep’ and ‘apep-tuned’.]

Figure 5.4.: DET curve identifying AP vs EP.

‘apep-tuned’ further lowers the EER to 3.7%. Moreover, as the proposed mono-phonetic approach is a single PRLM, the processing cost is reduced drastically compared to the parallel baseline systems.
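The EER figures above can be reproduced from raw detection scores with a small helper; `eer` is a hypothetical name, and the threshold scan is a simple sketch, not the FoCal implementation:

```python
def eer(target_scores, nontarget_scores):
    """Equal error rate: scan candidate thresholds for the point where
    the miss rate and the false-alarm rate coincide."""
    best_gap, best_eer = float("inf"), None
    for t in sorted(target_scores + nontarget_scores):
        miss = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(miss - fa) < best_gap:
            best_gap, best_eer = abs(miss - fa), (miss + fa) / 2
    return best_eer
```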

5.4. Application to other Varieties

In order to confirm that the proposed mono-phonetic approach is relevant for variety recognition in general, it needs to be tested with other varieties. The approach is hence extended to the BP variety. Systems recognizing EP vs. BP and AP vs. BP are trained and compared to the performance achieved with the previously introduced baseline. Further, experiments are conducted applying the procedure to all three varieties in a single mono-phonetic classifier at once, in order to differentiate EP vs. BP vs. AP.

The BP corpora ‘BP train’ and ‘BP dev’ presented in Chapter 3 are used to extend the mono-phonetic classifier to cover BP. First, binary classifiers are trained to determine the phones with the strongest mono-phonetic characteristics. Then the phonetic classifier and the statistical models are trained, and finally calibration and, if necessary, fusion are performed.

5.4.1. Identification of European Portuguese versus Brazilian Portuguese

To align the BP corpora, L2F’s ported BP ASR system [ATNV09] is used, which produces more accurate results. It incorporates a slightly different phone set than the EP system, with /tS/, /dZ/


Classifier                   EER [%]
Baseline AP                  20.1
Baseline BP                  19.9
Baseline EP                  17.5
Baseline GMM-GSV             16.7
Baseline AP+EP               14.5
Baseline AP+BP+EP            13.2
Baseline AP+BP+EP+GMM-GSV    11.4
mono-ph. apep                4.1
mono-ph. apep-tuned          3.7

Table 5.2.: EERs of all competing VID systems identifying AP vs. EP (‘+’ denotes a fusion).

[Figure 5.5: bar chart of correct separation [%] per SAMPA phone for BP vs. EP, with phones ranked in ascending order of separability.]

Figure 5.5.: Correct separation of BP and EP by each SAMPA phone using a binary classifier.

and /x/ as phones that appear uniquely in BP, and without /@/ and /l~/, which are EP-specific. Thus, to keep the same recognizer layout as the AP/EP mono-phonetic system (see Section 5.2.1) with 47 outputs, five additional mono-phones are chosen through binary classification, namely /o~/, /j/, /R/, /u~/ and /6~/. The results of the binary classifiers for all phones are shown in Figure 5.5.

The DET performance of the relevant baseline systems and of the ‘bpep’ mono-phonetic classifier is shown in Figure 5.6. Further, the EERs for all competing systems are given in Table 5.3. The mono-phonetic classifier ‘bpep’ achieves an EER of 5.9%. This is about 1.8% (absolute) worse than the ‘apep’ mono-phonetic classifier identifying AP vs. EP. Nevertheless, it achieves clearly better results (between 20% and 54% improvement) than any other single phonotactic or acoustic system. The mono-phonetic classifier also outperforms the fused baseline system ‘BP+EP’. However, it has to be noted that on BP vs. EP the single baseline systems achieve between 35% (‘BP-Baseline’) and 63% (‘AP-Baseline’) improvement compared to their

Oscar Tobias Anatol Koller 33


5. Automatic Identification of African and European Portuguese

[DET plot: miss probability vs. false alarm probability, both 1–60%; curves: BP-Baseline, GMM-GSV-Baseline, EP-Baseline, BP+EP-Baseline, bpep, AP+BP+EP-Baseline, AP+BP+EP+GMM-GSV-Baseline]

Figure 5.6.: DET curve identifying BP vs. EP.

Classifier                      EER [%]
Baseline BP                        12.8
Baseline GMM-GSV                   10.0
Baseline EP                         7.9
Baseline AP                         7.4
Baseline BP+EP                      6.3
mono-ph. bpep                       5.9
Baseline AP+BP+EP                   5.4
Baseline AP+BP+EP+GMM-GSV           5.2

Table 5.3.: EERs of all competing VID systems identifying BP vs. EP (‘+’ denotes a fusion).

counterparts on AP vs. EP. This finding confirms that the latter are significantly more difficult to differentiate. The improved performance of the single baseline systems explains why the fusion of more than two baseline systems slightly outperforms the mono-phonetic approach.
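The EER used throughout these comparisons can be computed from target and non-target scores by a simple threshold sweep; the sketch below and its toy scores are illustrative, not the evaluation tooling actually used.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER: the error rate at the threshold where the miss
    probability (targets rejected) equals the false-alarm probability
    (non-targets accepted)."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        miss = np.mean(target_scores < t)       # targets rejected at t
        fa = np.mean(nontarget_scores >= t)     # non-targets accepted at t
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return float(eer)

# Perfectly separated scores give an EER of 0.0:
print(equal_error_rate([2.0, 3.0, 4.0], [0.0, 1.0]))  # → 0.0
```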

5.4.2. Identification of African Portuguese versus Brazilian Portuguese

The AP corpora are aligned using L2F’s EP ASR system, whereas the ported BP version is used for the BP corpora. Five additional mono-phones, namely /o~/, /u~/, /R/, /j/ and /6~/, achieve higher separability than the others, as shown in Figure 5.7. This leads to a recognizer layout with 47 outputs, as in the AP/EP and EP/BP mono-phonetic systems.


[Bar chart: correct separation [%] (y-axis, 20–100) for each SAMPA phone, ordered by increasing separation: k u f j~ s a 6 i o e w S b p E m v n Z z i~ g w~ l d t L O J e~ r 6~ j R u~ o~]

Figure 5.7.: Correct separation of AP and BP by each SAMPA phone using a binary classifier.

For the identification task AP vs. BP, the mono-phonetic system reaches an EER of 7.6%, as can be seen in Table 5.4 and in the DET plot in Figure 5.8. It outperforms the single baseline systems

[DET plot: miss probability vs. false alarm probability, both 1–60%; curves: BP-Baseline, AP-Baseline, AP+BP-Baseline, GMM-GSV-Baseline, apbp, AP+BP+EP+GMM-GSV-Baseline]

Figure 5.8.: DET curve identifying AP vs. BP.

by up to 63% (‘BP-Baseline’) and the best fused phonotactic baseline (‘AP+BP+EP’) still by 10%. However, the acoustic ‘GMM-GSV’ system achieves a good performance and lowers its EER to 50% of the value achieved identifying AP vs. EP. As the acoustic system contains highly complementary information, its fusion with the phonotactic baseline (‘AP+BP+EP+GMM-GSV’) outperforms the mono-phonetic approach.


Classifier                      EER [%]
Baseline BP                        20.5
Baseline EP                        13.8
Baseline AP                        10.6
Baseline AP+BP                      8.6
Baseline AP+BP+EP                   8.4
Baseline GMM-GSV                    8.3
mono-ph. apbp                       7.6
Baseline AP+BP+EP+GMM-GSV           5.2

Table 5.4.: EERs of all competing VID systems identifying AP vs. BP (‘+’ denotes a fusion).

5.4.3. Identification of African Portuguese versus Brazilian Portuguese versus European Portuguese

A phonetic tokenizer is trained with AP, BP and EP data, in order to evaluate the performance achieved by extending the mono-phonetic approach to distinguishing three varieties from each other. Hence, the ‘AP train’, ‘AP dev’, ‘BP train’, ‘BP dev’, ‘EP train’ and ‘EP dev’ corpora are used simultaneously. The system is set up to produce 57 different output tokens. The mono-phones are not chosen using a three-way classification comparable to the previously employed binary classifiers; instead, the mono-phones determined in the previous sections are reused. Hence, the selection is probably not the most efficient one.

Figure 5.9 shows the DET performance of all relevant baseline systems, of the triple mono-phonetic PRLM ‘apbpep’ and of a fusion of the three previous mono-phonetic classifiers trained with two varieties each (‘apep+apbp+bpep’). The EERs of all systems are given in Table 5.5.

Classifier                      EER [%]
Baseline BP                        19.8
Baseline EP                        16.5
Baseline AP                        15.1
mono-ph. bpep                      14.5
Baseline GMM-GSV                   14.1
mono-ph. apbp                      12.2
Baseline AP+BP+EP                  10.9
mono-ph. apep-tuned                 9.8
mono-ph. apep                       9.1
Baseline AP+BP+EP+GMM-GSV           8.4
mono-ph. apbpep                     8.4
mono-ph. apep+apbp+bpep             5.7

Table 5.5.: EERs of all competing VID systems identifying AP vs. BP vs. EP (‘+’ denotes a fusion).
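The score-level fusion denoted by ‘+’ can be sketched as a weighted combination of the classifiers' per-trial scores. In practice the weights are trained (e.g. with linear logistic regression); the weights and score values below are purely illustrative.

```python
# Minimal sketch of score-level fusion: per-trial scores of several
# classifiers are combined linearly. The three score lists stand for
# hypothetical outputs of the 'apep', 'apbp' and 'bpep' classifiers on
# two trials; all numbers are invented for illustration.

def fuse_scores(score_lists, weights):
    """Combine parallel per-trial score lists with a weighted sum."""
    return [sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*score_lists)]

apep = [1.2, -0.3]
apbp = [0.8, -0.9]
bpep = [0.5, -0.1]
print(fuse_scores([apep, apbp, bpep], [0.5, 0.3, 0.2]))
```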


[DET plot: miss probability vs. false alarm probability, both 1–60%; curves: AP-Baseline, GMM-GSV-Baseline, AP+BP+EP-Baseline, AP+BP+EP+GMM-GSV-Baseline, apbpep, apep+apbp+bpep]

Figure 5.9.: DET curve identifying AP vs. BP vs. EP.

Table 5.5 shows that the fusion of all three mono-phonetic classifiers ‘apep+apbp+bpep’ outperforms all other approaches. However, the single mono-phonetic ‘apbpep’ classifier achieves between 40% (‘GMM-GSV’) and 58% (‘BP-Baseline’) better performance than any other single classifier and is hence the best single classifier. It outperforms the fusion of all phonotactic baseline systems (‘AP+BP+EP’) and reaches the same EER as the fusion of four phonotactic and acoustic baseline systems (‘AP+BP+EP+GMM-GSV’). Two of the mono-phonetic classifiers trained with two varieties (‘apbp’ and ‘bpep’) do not perform significantly better than the baseline, which is due to a high error rate on the corresponding third variety they have not been trained with. The reason for the ‘apep’ mono-phonetic classifier performing much better than these two is probably its strong performance identifying AP and EP, which are more easily confusable than the other varieties. As all DET curves in Figure 5.9 are close to straight lines, their underlying likelihood distributions are approximately normal.

5.5. Discussion and Future Work

In this chapter, an approach for the identification of AP and EP exploiting variety-dependent mono-phones has been presented. The EER achieved by a single mono-phonetic classifier could be reduced by more than 60%, from 11.4% to 4.1%, compared to the best performing VID baseline, a fusion of four independent phonotactic and acoustic systems. The approach proves to be very efficient, employing just a single mono-phonetic tokenizer instead of the four parallel systems used before.

It has also been shown that the proposed approach produces good results with other combinations of varieties, reaching an EER of 5.9% on EP vs. BP and 7.6% identifying AP vs. BP. However,


these varieties proved to be more easily distinguishable, and hence the gain compared to the baseline systems is smaller. Even for the detection of all three varieties, the use of a single mono-phonetic classifier is justified, reaching an EER of 8.4%. However, to achieve the best results with three varieties, mono-phonetic systems should be trained for pairs of varieties and then employed in parallel.

As AP and EP are the most similar of the three varieties, and hence the most difficult to distinguish, it can be concluded that mono-phonetic approaches produce particularly good results for very close varieties, where common systems fail. Moreover, with more distinct varieties there is not as much relative improvement possible, as the baseline already classifies well. Nevertheless, future research with more varieties is needed to confirm this finding.

Future work also includes experiments with more data for training, calibration and testing of the statistical models. Moreover, a better understanding is needed of how to choose the mono-phones, as further improvement can be achieved with different combinations and different numbers of chosen phones.

Finally, applying the mono-phonetic approach to varieties without available transcribed data could be an interesting future investigation. This could be attempted using automatic transcriptions from an existing speech recognition system, analogous to the “root phonetic recognizer approach” presented in [MTG+06].

Some of the results presented in this chapter were accepted for publication in the peer-reviewed conference ‘IEEE Odyssey 2010: The Speaker and Language Recognition Workshop’ [KAT10].

In the next chapter, the presented VID approach is used for the automatic selection of AP data. As computational load is not an important factor, the best performing single VID systems will be employed in parallel to ensure highest performance. The mono-phonetic classifiers ‘apep’, ‘apep-tuned’ and ‘apbpep’ are the three best performing single systems. Joining a fourth phonotactic system does not contribute significantly to lowering the identification error. Thus, the acoustic system ‘GMM-GSV’, containing complementary information, has been additionally chosen. The working point of all four fused systems is adjusted to achieve overall 1.7% False Alarm (FA) probability for EP, 2.8% FA for BP and 10% miss probability for AP segments. A higher miss probability and a lower FA probability are advantageous, as AP segments should rather be left out than other varieties’ data included.
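Adjusting the working point amounts to picking a decision threshold from the score distribution of non-target trials; a minimal sketch under the assumption that higher scores mean ‘more AP-like’, with an invented non-target score set:

```python
import numpy as np

def threshold_for_fa(nontarget_scores, fa_target):
    """Choose the decision threshold so that roughly fa_target of the
    non-target scores are (wrongly) accepted, i.e. lie at or above it."""
    return float(np.quantile(nontarget_scores, 1.0 - fa_target))

# Hypothetical non-target (e.g. EP) scores; aiming at 1.7% false alarms:
scores = np.arange(1000.0)
thr = threshold_for_fa(scores, 0.017)
print(np.mean(scores >= thr))  # close to 0.017
```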


6. Improvement of Acoustic and Language Models

In order to improve the baseline recognition results on AP speech reported in Section 4.3, a variety-specific ASR system is built, comprising AP-specific acoustic modeling and language modeling. Advances are based on manually and automatically collected, labeled and transcribed variety-specific data.

The first part of this chapter is dedicated to lowering the recognition WER through acoustic modeling, including the adaptation of existing EP models and the retraining of acoustic models using the VID system introduced in Chapter 5 for unsupervised training. In the last part of this chapter, variety-specific language models are evaluated.

6.1. African Portuguese Acoustic Models

Two distinct approaches are followed to train AP-specific acoustic models in the following sections:

• Adaptation of the baseline acoustic models.

• Complete retraining of new variety-specific acoustic models.

During all experiments on acoustic modeling, the baseline system’s language model and 100k word vocabulary are kept, as described in Section 2.3.

6.1.1. Adapted Acoustic Models with EP Initialization

The baseline acoustic models, being trained for EP, achieve reasonable recognition results on the ‘AP test’ corpus (see Section 4.3). The accent classification reveals that the EP system shows significantly less degradation on speech classified as ‘slight accent’ (compare Table 4.3). Strong AP accents, often correlated with spontaneous speech in less controlled recording conditions, cause more severe problems. AP-specific acoustic models could thus exploit the advantages of the EP models, which have been developed for several years and incorporate large amounts of manually and automatically transcribed data (see Section 2.2 for details). The approach followed involves adapting the previously trained EP acoustic models with the limited, but precisely transcribed data from the ‘AP train’ corpus in a few training epochs. The ‘AP dev’ corpus is used for cross-validation.


First Iteration: Alignment with Baseline

For the first alignment, and thus the first adaptation, neural nets are used with the same layout as the original EP models: two fully connected non-linear hidden layers with 2000 sigmoid units each and one output layer with 39 softmax outputs, as described in Section 2.2. The weights of these ANNs are initialized with the EP models’ values. The number of training epochs is limited to five, in order not to ‘unlearn’ the nets’ previous knowledge with the limited new training data, and a common learning rate is used. Cross-validation is performed after each epoch to start the learning rate reduction or to stop the training in case of a slight or strong degradation, respectively. The baseline ASR system is used to align manual transcriptions and audio recordings frame by frame. Table 6.1 shows details on the training parameters, including the total number of training epochs of the neural nets for MSG, PLP and RASTA-PLP features, the epoch at which reduction of the learning rate started (see Section 2.2 for details), the learning rate applied before reduction, the layout of the ANN, the ratio of training frames per weight given the hidden layer size, and the duration of the training data. Further, details on the training data and the decoder settings used for alignment of the data are given.
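The epoch schedule described above (cross-validate after each epoch, start reducing the learning rate on a slight degradation, stop on a strong one) can be simulated as follows; the error sequence, the degradation thresholds and the halving factor are illustrative assumptions, not the trainer's actual interface.

```python
# Sketch of a 'newbob'-style schedule: after every epoch the dev set is
# scored; a slight degradation starts halving the learning rate, a strong
# degradation stops training. All numeric values here are illustrative.

def adapt(dev_errors, lr=0.002, slight=0.0, strong=0.01, max_epochs=5):
    """Simulate the schedule over a given sequence of dev-set errors.

    dev_errors: dev error after each epoch (lower is better).
    Returns (epochs_run, final_lr, epoch_where_reduction_started).
    """
    best = float("inf")
    reducing, started = False, None
    for epoch, err in enumerate(dev_errors[:max_epochs], start=1):
        if err > best + strong:              # strong degradation: stop
            return epoch, lr, started
        if reducing or err > best + slight:  # slight degradation: halve lr
            if not reducing:
                reducing, started = True, epoch
            lr *= 0.5
        best = min(best, err)
    return min(len(dev_errors), max_epochs), lr, started

print(adapt([0.30, 0.27, 0.28, 0.26, 0.26]))
```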

                       MSG    PLP    RST
Neural Net
  total epochs           5      3      4
  start η reduction      3      3      3
  learning rate η            0.002
  hidden layer size           2000
  frames/weight               0.50
Data
  training data          ‘AP train’
  aligned minutes        416
  development data       ‘AP dev’
  alignment              EP Baseline
Decoder
  maximum states         3500
  beam size              6

Table 6.1.: Characteristics of adaptation, 1st iteration.

acoustic condition               any                   clean
accent                   slight  strong     ∅   slight  strong     ∅

WER on ‘AP test’           20.8    27.9   24.6    16.6    16.2   16.5
gain relative to baseline  18.3    18.1   18.3    18.8    29.6   23.5

Table 6.2.: WER performance and relative error rate reduction of adapted AP-specific acoustic models (1st iteration) in [%].

The adaptation produces considerably improved results, reducing the baseline WER on average by around 18% and reaching nearly 30% improvement on clean conditions with strong accents (see


Table 6.2 for details). The MAPSSWE test finds significant differences to the baseline results (p<0.001).

Decoder Tuning to Enhance Alignment

Further performance gain has been observed when tuning the decoder during recognition, which may be equally applicable for alignment. As described in detail in Section 2.5, a dynamic composition of the WFSTs is used to perform decoding in the AUDIMUS speech recognizer. The dynamic algorithm limits the number of active states, and a beam search strategy that only keeps the best partial results limits its processing needs via a predefined beam size. Increasing the maximum active states and the beam size improves recognition results at the cost of higher processing load.
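The two pruning mechanisms, a beam over hypothesis scores and a cap on the number of active states, can be sketched as follows; the cost values and state labels are invented, and lower cost is assumed to be better.

```python
import heapq

# Sketch of decoder pruning: hypotheses scoring worse than
# (best cost + beam size) are dropped, and at most max_states of the
# cheapest survivors are kept per frame.

def prune(hypotheses, beam_size, max_states):
    """hypotheses: list of (cost, state) tuples. Returns the survivors,
    cheapest first."""
    best = min(cost for cost, _ in hypotheses)
    in_beam = [h for h in hypotheses if h[0] <= best + beam_size]
    return heapq.nsmallest(max_states, in_beam)

hyps = [(1.0, "a"), (3.5, "b"), (9.0, "c"), (2.0, "d")]
print(prune(hyps, beam_size=6, max_states=3))
# → [(1.0, 'a'), (2.0, 'd'), (3.5, 'b')]
```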

[Four bar charts (all/clean acoustic conditions × slight/strong accent): relative WER improvement [%] (-2 to 6) for the decoder settings 3500-bs6, 3500-bs15, 3500-bs40, 100000-bs15 and 100000-bs40]

Figure 6.1.: Relative WER in [%] of systems with different decoder settings compared to the adapted AP-specific acoustic models (1st iteration) with a standard of maximum 3500 active states and the beam size parameter set to 6.

Figure 6.1 displays the gain in recognition performance with an increased number of active states and a larger beam size. The WER is shown relative to the recognition results of the adapted AP-specific acoustic models (1st iteration) with a standard of 3500 maximum active states and a beam size of 6. It can be seen that the decoder tuned to 100k maximum states and a beam size of 40 improves the recognition on average by 3% to 4%. However, processing time also multiplies by a factor of 15, which is not bearable in common recognition applications. In this case, however, the tuning is


applied to improve the alignment, which is performed only once; hence, the computational load is not significant.

A secondary effect of tuning the decoder occurs during alignment. As the search space greatly increases, the decoder is now able to align a higher percentage of the transcriptions: with the tuned and the previous decoder settings, 456 and 416 minutes, respectively, were aligned in the ‘AP train’ corpus, which counts a total of 456.9 minutes. Nevertheless, this may be a drawback, as wrong transcriptions and inconsistencies between the pronunciation lexicon and the actual acoustic realization also get aligned and thus produce errors. In the case of accurate manual transcriptions this might be negligible. However, this process does not seem appropriate for automatic transcriptions, as many errors may get inserted.

Second Iteration: Alignment with Previous AP system

Due to the high gain in recognition accuracy of around 18% with the first alignment of the adapted AP-specific acoustic models, a second alignment is pursued. The training starts with the initial weights from the EP baseline models, just as in the first alignment. However, the data used for training and cross-validation is processed differently. For the first adaptation, the baseline was used to align the data frame by frame with the manual transcriptions. Employing the adapted models for alignment may improve its accuracy, which is pursued in this step. Further, the previously introduced decoder tuning is used to enhance the alignment, with 100k active states and a beam size of 40.

                       MSG    PLP    RST
Neural Net
  total epochs           5      3      4
  start η reduction      3      3      3
  learning rate η            0.002
  hidden layer size           2000
  frames/weight               0.55
Data
  training data          ‘AP train’
  aligned minutes        456
  development data       ‘AP dev’
  alignment              AP Adapted 1st
Decoder
  maximum states         100000
  beam size              40

Table 6.3.: Characteristics of adaptation, 2nd iteration.

Compare Table 6.3 for details on the training settings. Results achieved with the second alignment are shown in Table 6.4. There is a slight overall improvement of 1.3% compared to the first iteration, amounting to 19.4% compared to the EP baseline. However, on clean conditions the results degraded by 1.4%. The MAPSSWE test reveals statistical significance compared to the baseline results


acoustic condition               any                   clean
accent                   slight  strong     ∅   slight  strong     ∅

WER on ‘AP test’           20.4    27.6   24.3    16.9    16.4   16.7
gain relative to baseline  19.9    18.8   19.4    17.5    28.8   22.4

Table 6.4.: WER performance and relative error rate reduction of adapted AP-specific acoustic models (2nd iteration) in [%].

(p<0.001); however, the improvement relative to the first adapted iteration is not statistically significant (p=0.674). As recognition did not improve significantly, further iterations are not expected to be advantageous.

6.1.2. Completely Retrained Acoustic Models with Random Initialization

The results achieved with an adaptation of the EP acoustic models (see Section 6.1.1) are strong. However, the improvement stagnates after the second alignment. Completely retrained acoustic models are the intuitive approach to obtaining AP-specific acoustic models, and this method may overcome the stagnation experienced with the adaptation. However, with around seven and a half hours of available transcribed data in the ‘AP train’ corpus and about one and a half hours in the ‘AP extra’ corpus, training data is very scarce, particularly compared to the over 378 hours the EP baseline models have been trained with [MVN08].

The layout of the neural nets is chosen in relation to the available training data. The ratio of available training samples per network weight determines the size of the hidden layers. Ideally, there are more than ten training samples per weight [Mei08]; in this case of very limited data, reducing this ratio by increasing the network complexity proved advantageous up to a certain limit. Figure 6.2 shows the influence of sample/weight ratios ranging from 1.5 to 8.9 compared to a system trained with 15.5 samples/weight. The best performing systems are based on a ratio of 2.5 for strong AP accents and 3.6 for slight AP accents. A lower ratio implies a higher network complexity, as more neurons, and consequently more weights, are trained with the same number of samples. Speech classified as ‘slight accent’ shows less variability in terms of accent and acoustic conditions than speech labeled as ‘strong accent’ (compare Section 4.2). This probably accounts for the better results achieved with the higher ratio (3.6 samples/weight), and thus lower network complexity, for slight AP accents, compared to the lower ratio (2.5 samples/weight) and higher network complexity for strong accents.
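Choosing the hidden layer size from the amount of training data can be sketched as below. The input dimensionality, the frame rate and the weight-count formula (two equal hidden layers, biases ignored) are all assumptions made for illustration, so the resulting size will not match the thesis's actual 770 units.

```python
# Sketch: for a net input -> h -> h -> output with fully connected layers,
# the weight count is roughly n_in*h + h*h + h*n_out; h is chosen as the
# largest size that still meets the desired frames/weight ratio.
# n_in=26 and the 100 frames/s rate are assumptions, not thesis values.

def hidden_size(frames, n_in, n_out, ratio):
    """Largest hidden layer size h with frames/weights >= ratio."""
    h = 1
    while frames / (n_in * h + h * h + h * n_out) >= ratio:
        h += 1
    return h - 1

frames = 504 * 60 * 100   # ~504 minutes of data at an assumed 100 frames/s
print(hidden_size(frames, n_in=26, n_out=39, ratio=3.0))
```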

First Iteration: Alignment with Baseline

A ratio of training samples to weights of about 3.0 is expected to be advantageous for both slight and strong accents. This leads to 770 units for both hidden layers in the neural net layout, using the ‘AP train’ and ‘AP extra’ corpora for training. The common learning rate of 0.002 is used, and a


[Four bar charts (all/clean acoustic conditions × slight/strong accent): relative WER improvement [%] (0 to 15) for sample/weight ratios 1.5, 2.5, 3.6 and 8.9]

Figure 6.2.: Influence of sample/weight ratio on the relative WER improvement compared to a system using 15.5 samples/weight.

step-size adaptation of 0.5 is applied after the first degradation of the system. A maximum of 50 epochs is set; however, the algorithm terminates much earlier. The neural nets are initialized randomly. Compare Table 6.5 for details on the algorithm characteristics.

Table 6.6 summarizes the results achieved with the completely retrained AP-specific acoustic models with random initialization. Overall, the complete retraining leads to more than three percent degradation compared to the baseline. Certainly, the results indicate a great lack of data. However, on strong accents the system performs better than on slight accents. This is probably due to more similarity between EP and slight AP accents, and consequently a better performing baseline (which has been trained with EP) on these accents, rather than on strong accents. Noise and spontaneous speech represent more diverse acoustic situations, which also arise in EP, but which are not sufficiently covered by the few hours of data available for training the AP-specific models. Nevertheless, on clean conditions and strong accents the AP-specific models even achieve a slight gain compared to the baseline.

The MAPSSWE test reveals statistical significance compared to the baseline results (p=0.032).

The achieved results suggest the use of more data to overcome the restrictions imposed by the lacking diversity of the limited training data. Particularly the gain of nearly 2% on strong accents in clean acoustic conditions underlines this need.


                       MSG    PLP    RST
Neural Net
  total epochs          19     17     24
  start η reduction      9      8     10
  learning rate η            0.002
  hidden layer size           770
  frames/weight               3.06
Data
  training data          ‘AP train’+‘AP extra’
  aligned minutes        416+88
  development data       ‘AP dev’
  alignment              EP Baseline
Decoder
  maximum states         3500
  beam size              6

Table 6.5.: Characteristics of retrained AP models randomly initialized (1st iteration).

acoustic condition               any                   clean
accent                   slight  strong     ∅   slight  strong     ∅

WER on ‘AP test’           26.8    34.5   31.1    21.9    22.5   22.2
gain relative to baseline  -5.4    -1.5   -3.3    -7.1     1.9   -3.2

Table 6.6.: WER performance and relative error rate reduction of retrained AP-specific acoustic models with random initialization (1st iteration) in [%].

6.1.3. Unsupervised Training for Acoustic Models with Random Initialization

In the previous section, the need for more training data arises. However, the process of creating manually transcribed data is costly and time consuming. To overcome the shortage of transcribed training data, an approach of unsupervised training using the ‘AP automatic’ corpus is suggested. The process of unsupervised training involves the following steps:

1. Selection of AP speech within the mixed EP/AP corpus.

2. Training of confidence measures.

3. Recognition of AP parts from automatic corpus.

4. Using a confidence threshold to exclude wrongly recognized words for retraining the acoustic model.
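The final filtering step can be sketched as a simple filter over the recognizer output; the words and confidence values below are invented, while the real confidences come from the maximum entropy classifier of Section 2.6.

```python
# Sketch of step 4: keep only recognized words whose confidence reaches
# the chosen threshold, and use those for retraining. All values here
# are invented for illustration.

def filter_by_confidence(recognized, threshold):
    """recognized: list of (word, confidence). Keep confident words."""
    return [word for word, conf in recognized if conf >= threshold]

hyp = [("bom", 0.97), ("dia", 0.92), ("lisboa", 0.61), ("hoje", 0.88)]
print(filter_by_confidence(hyp, threshold=0.85))  # → ['bom', 'dia', 'hoje']
```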

The VID system developed in the scope of this thesis and presented in Chapter 5 is used for the selection of AP speech. Three mono-phonetic and one acoustic VID systems are used in parallel to ensure highest performance. The system achieves 1.7% FA probability for EP, 2.8% FA for BP and 10% miss probability for AP segments. It determines 31 out of a total of 67 hours to be African accented speech.

Confidence measures based on a maximum entropy classifier are used to estimate the accuracy of each recognized word and to reject those with a confidence below a certain threshold (see


Section 2.6 for details). Figure 6.3 shows the DET curve for the trained classifier. The percentage of words wrongly classified as correct is displayed on the x-axis (FA probability), whereas the y-axis shows the percentage of correct words undesirably excluded (miss probability). Ideally, FA probability and miss probability are as low as possible. Figure 6.4 shows the relation between a chosen confidence threshold and the FA probability.

[DET plot: miss probability vs. false alarm probability, both 0.1–95%]

Figure 6.3.: DET curve for confidence measures.

[Plot: confidence value threshold (0.7–1.0) over false alarm probability [%] (0.1–100, logarithmic)]

Figure 6.4.: Relation between confidence threshold and false alarm probability.

The influence of different confidence thresholds (and hence different percentages of FA and miss probability) on unsupervised training has been studied. The relative WER improvements of systems whose acoustic models have been trained using different choices of confidence thresholds


relative to a completely retrained system without any unsupervised data are given in Figure 6.5. It can be seen that slight accents perform better with 15% to 50% FA probability (corresponding

[Four bar charts (all/clean acoustic conditions × slight/strong accent): relative WER improvement [%] (0 to 15) for confidence thresholds producing FA1, FA9, FA15, FA25, FA50 and FA100]

Figure 6.5.: Threshold influence on relative WER improvement compared to a system trained without any unsupervised data.

to between 50% and 18% miss probability, referring to Figure 6.3). For strong accents, the unsupervised system achieves more accurate recognition results in noisy acoustic conditions when the FA probability is equal to 25% or higher and the miss probability is around 30% or lower. Strong accents in clean conditions are also able to benefit from a confidence threshold producing these FA and miss probabilities. However, the highest gain in clean conditions is achieved with 9% FA and hence nearly 60% miss probability.

The reasons for the tendency to perform better with less restrictive thresholds are in line with observations made by Kemp and Waibel (refer to Section 1.4.2), who state that a high and restrictive confidence threshold, and thus a low FA probability, only keeps words for training that the system already recognizes robustly. The baseline, used to recognize the ‘AP automatic’ corpus, achieves a much more robust performance on slight accents in any acoustic condition and on strong accents in clean conditions (see Table 4.3). Hence, a more restrictive confidence threshold (9% to 15% FA probability) may primarily add information enhancing these better performing conditions. It is worth noting that no compensation for different amounts of training data, and thus different sample/weight ratios, has been done. Effects described in Section 6.1.2


may also influence the results shown in Figure 6.5, as a higher FA probability provides more data for training and hence a higher sample/weight ratio.

Impact of Variety Identification on Unsupervised Training

Based on the previous observations, the impact of the VID has been evaluated using thresholds producing 9% and 25% FA probability. Figure 6.6 shows the impact of VID by displaying the relative WER gain of systems using the 31-hour VID-selected corpus for unsupervised training compared to their counterparts trained using all unselected 67 hours of the automatic corpus. The sequence of processing follows the previously mentioned steps. Thus, after recognition of the AP or mixed (in case of not using VID) parts of the corpus, the corresponding confidence threshold is applied.

The impact of VID on slight accents is stronger in the case of 25% FA probability, reaching 4.4% relative WER reduction. However, on strong accents the system trained with a threshold producing 9% FA probability shows a stronger impact. The WERs, given on top of each bar in Figure 6.6, reveal that the recognition performance of the two systems employing VID does not differ as much as their impact compared to the corresponding systems without variety selection. Particularly on strong accents, both VID systems achieve close results (all conditions: 32.2% (VID FA9) vs. 31.8% (VID FA25); clean conditions: 20.1% (VID FA9) vs. 20.0% (VID FA25)), and the degradation in relative WER reduction arises from the significantly better performance of the system without VID at 25% FA probability (WER 32.4%) compared to its 9% counterpart (WER 33.3%). It can be concluded that the impact of VID differs with regard to the applied confidence threshold. This seems to be caused by the unsteady performance of systems without VID with respect to slight and strong AP accents. However, the VID system with a confidence threshold producing 25% FA probability outperforms all other configurations, which may also be seen in Table 6.7 showing the averaged WERs of VID and noVID systems for all accents.

WER [%]              FA9              FA25
                 all    clean     all    clean
VID             29.2     20.7    28.1     19.7
noVID           29.7     20.9    29.1     20.1
relative gain    1.9      1.1     3.2      1.9

Table 6.7.: Absolute WER performance of VID/noVID FA9 and VID/noVID FA25 on all accents and relative gain of VID systems.
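The relative gains in Table 6.7 follow the usual definition of relative WER reduction. Applied to the rounded WERs it gives values close to, but due to rounding not identical with, the published ones:

```python
def relative_gain(wer_base, wer_new):
    """Relative WER reduction in percent."""
    return 100.0 * (wer_base - wer_new) / wer_base

# noVID vs. VID at FA25, all acoustic conditions (rounded WERs from Table 6.7)
print(round(relative_gain(29.1, 28.1), 1))  # 3.4
```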

The reason for the variable performance of systems without VID can be found by analyzing the changes during all epochs of unsupervised training, given in Figure 6.7. The figure shows the recognition performance relative to the baseline for each epoch during training. It can be seen that the final result of the system employing VID is not only better, but the training process also proves to be more stable than its counterpart using no VID. The back propagation algorithm only converges


[Figure 6.6 shows four bar charts of the relative WER improvement (0–15%) of VID over noVID systems at FA9 and FA25. Absolute WERs (VID/noVID): all conditions, slight accent: 25.6%/25.4% (FA9) and 23.9%/25.0% (FA25); all conditions, strong accent: 32.2%/33.3% (FA9) and 31.8%/32.4% (FA25); clean conditions, slight accent: 21.0%/20.7% (FA9) and 19.6%/20.0% (FA25); clean conditions, strong accent: 20.1%/21.2% (FA9) and 20.0%/20.2% (FA25).]

Figure 6.6.: WER performance of VID systems relative to the corresponding noVID systems.

[Figure 6.7 shows the relative WER improvement (−2% to 10%) over the last training epochs (epoch 10 to the final epoch) for the VID FA25 and noVID FA25 systems, plotted for all, slight and strong accents.]

Figure 6.7.: WER performance relative to the baseline of VID and noVID systems during the last epochs of training.


as foreseen when the data has been limited using VID. The VID system undergoes a degradation from the best point during training (epoch 23) to the final epoch of less than 0.32%. Contrarily, this degradation is up to 2.82% without VID. It can hence be concluded that the VID module provides more homogeneous data that contains less unwanted speech.

Second Iteration: Alignment with Previous AP System

Based on previous experience, new acoustic models are trained using the additional recordings in the ‘AP automatic’ corpus for unsupervised training. The VID system as presented in Section 5 is used to select solely AP-accented speech. Confidence measures are used to estimate the accuracy of each recognized word and to reject those with a confidence below a certain threshold. The threshold applied produces 25% of wrong choices and has around 35% miss probability. Following Ma and Schwartz [MS08], presented in Section 1.4.2, the training strategy uses all data for automatic transcription at once. The alignment of audio signal and transcription is done with the baseline EP recognizer. Around 25.6 hours of aligned data are available for training. A neural net layout with 1500 units in each hidden layer is chosen, as it produces a sample/weight ratio of about 3. A detailed view of all training parameters is given in Table 6.8.

Neural Net            MSG    PLP    RST
total epochs           22     22     21
start η reduction      12     12      9
learning rate η            0.002
hidden layer size          1500
frames/weight              3.05

Data
training data        ‘AP train’ + ‘AP extra’ + ‘AP automatic’
aligned minutes      416 + 88 + 1032
development data     ‘AP dev’
alignment            EP Baseline

Decoder
maximum states       3500
beam size            6

Table 6.8.: Characteristics of completely retrained AP models with random initialization (2nd iteration).
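The frames/weight ratio in Table 6.8 relates the number of training frames to the number of trainable parameters of the net. A sketch of the bookkeeping; the layer sizes below are hypothetical, since the thesis nets use three feature streams and their exact input/output dimensions are not restated here:

```python
def frames_per_weight(n_frames, layer_sizes):
    """Training frames per trainable weight of a feed-forward net.
    layer_sizes lists every layer, input to output; biases included."""
    n_weights = sum((a + 1) * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return n_frames / n_weights

# hypothetical layout: 117 inputs, one 1500-unit hidden layer, 40 outputs
minutes = 416 + 88 + 1032        # aligned minutes from Table 6.8
frames = minutes * 60 * 100      # at an assumed 100 frames per second
ratio = frames_per_weight(frames, [117, 1500, 40])
```

More aligned data allows wider hidden layers at the same ratio, which is why the FA25 threshold supports larger nets.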

Recognition performance of these AP-specific acoustic models trained with significantly more data is presented in Table 6.9. An overall improvement of 4.4% relative to the baseline could be achieved, which equals more than 7.4% compared to the first alignment without unsupervised training. Strong accents in clean acoustic conditions perform 14.2% better than the baseline and still 12.6% better than with the previous alignment of acoustic models. Nevertheless, the adapted acoustic models from Section 6.1.1 still reach around 15% better performance.


acoustic condition                any                  clean
accent                    slight  strong    ∅    slight  strong    ∅
WER on ‘AP test’            24.9    32.1  28.8     20.6    19.7  20.2
gain relative to baseline    2.2     5.6   4.4     -0.6    14.2   5.8

Table 6.9.: WER performance and relative error rate reduction of completely retrained AP-specific acoustic models with random initialization (2nd iteration) in [%].

The MAPSSWE test reveals no statistical significance compared to the baseline results (p=0.091); however, the improvement relative to the first completely retrained iteration is statistically significant (p<0.001).

Third Iteration: Alignment with Adapted AP System

The last iteration did not significantly improve recognition, but the adapted AP-specific acoustic models (compare Section 6.1.1) achieve significantly better WERs on the testing corpus. Their impact on performance should therefore be analyzed when they are used to align the data available for completely retrained acoustic models with random initialization.

The training settings match those used during the second alignment; see Table 6.10 for details. The AP-specific acoustic models adapted from the EP models produce a small gain in aligned frames (0.9%) and align around 3.4% of frames differently. The hidden layer size is kept at 1500 units. Table 6.11 shows the performance on the testing corpus, which could be raised on average by an absolute 2.1%, from 4.4% to 6.5% compared to the baseline, and by an absolute 3.7% in clean acoustic conditions, from 5.8% to 9.5%. Nevertheless, the adapted models still reach around 13% better performance.

The MAPSSWE test reveals statistical significance compared to the baseline results (p=0.008). The improvement compared to the first iteration of the retrained acoustic models with random initialization is also statistically significant (p<0.001). However, the performance gain relative to the second completely retrained iteration is not statistically significant (p=0.144). Thus, further iterations do not seem advantageous.
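The MAPSSWE test itself is part of the NIST sclite tooling. Its core idea, a matched-pairs test on per-segment error counts of two systems, can be sketched with a normal approximation (the error counts below are toy values, not thesis data):

```python
import math

def matched_pairs_z(err_a, err_b):
    """z statistic of a matched-pairs test on per-segment error counts
    (the idea behind MAPSSWE); |z| > 1.96 suggests significance at
    alpha = 5% under the normal approximation."""
    diffs = [a - b for a, b in zip(err_a, err_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# toy per-segment error counts of two systems on six segments
z = matched_pairs_z([3, 2, 4, 3, 5, 2], [2, 1, 3, 3, 4, 1])
```

This is only a sketch; the official test additionally defines the segmentation of the hypothesis streams.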

6.2. African Portuguese Language Model

Within this section the impact of an AP-specific language model is evaluated. The objective is, due to time constraints, to create a simple, static language model based on the written AP corpora that have been collected for this purpose. In the following sections, the process of generating such a model, including data collection, extraction, normalization, building of n-grams and interpolation of different models, is described. Finally, results with the AP-specific language model are compared to the baseline EP language model (see Section 2.3 for details) and results of an interpolation of both are given.


Neural Net            MSG    PLP    RST
total epochs           25     24     21
start η reduction      15     14      9
learning rate η            0.002
hidden layer size          1500
frames/weight              3.15

Data
training data        ‘AP train’ + ‘AP extra’ + ‘AP automatic’
aligned minutes      420 + 88 + 1081
development data     ‘AP dev’
alignment            AP adapt 1st iteration

Decoder
maximum states       3500
beam size            6

Table 6.10.: Characteristics of completely retrained AP acoustic models with random initialization (3rd iteration).

acoustic condition                any                  clean
accent                    slight  strong    ∅    slight  strong    ∅
WER on ‘AP test’            24.5    31.2  28.1     19.9    18.8  19.5
gain relative to baseline    3.5     8.4   6.5      2.9    18.1   9.5

Table 6.11.: WER performance and relative error rate reduction of completely retrained AP-specific acoustic models with random initialization (3rd iteration) in [%].

6.2.1. Data Collection, Extraction and Normalization

As a part of this thesis, a small text corpus has been collected, described in Section 3.2. The data for this corpus comes from eleven different newspapers with on-line archives of old articles. The multiplicity of sources required additional work developing journal-specific scripts to download the archived articles and extract the relevant data from each saved website. Nevertheless, this additional work is necessary in order to gather more data from underrepresented countries like Guinea-Bissau and São Tomé and Príncipe, whose corpora still amount to just about one third of the data collected from Angolan newspapers.

Locating the relevant data on each extracted website consists of a simple search based on individually defined starting and ending patterns characteristic of each on-line source. This method, though easy to implement, introduces a small number of errors caused by partly inconsistent layouts over the whole archiving period. Most of the introduced errors were fixed manually; some, however, may not have been accounted for.
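Such pattern-based extraction amounts to a plain substring search; the marker strings below are invented, each real newspaper site had its own:

```python
def extract_article(html, start_pat, end_pat):
    """Extract article text between source-specific start/end markers,
    as done per newspaper site; the patterns here are made up."""
    start = html.find(start_pat)
    end = html.find(end_pat, start)
    if start == -1 or end == -1:
        return None  # layout changed; skip this page
    return html[start + len(start_pat):end].strip()

page = '<div id="story">Luanda acolheu a cimeira.</div><div id="foot">'
print(extract_article(page, '<div id="story">', '</div>'))
# Luanda acolheu a cimeira.
```

Returning None on missing markers is what makes layout changes show up as skipped pages rather than garbled text.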

The following normalization performs an expansion of abbreviations, conversion of numbers, units and currency signs to words, and deletes unnecessary punctuation marks.
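A toy version of this normalization step; the dictionary entries are illustrative only, and the real tables plus a full Portuguese number converter are far larger:

```python
import re

ABBREV = {"sr.": "senhor", "dr.": "doutor"}   # illustrative entries
NUMBERS = {"2": "dois", "10": "dez"}          # full converter omitted

def normalize(text):
    """Sketch of the normalization: expand abbreviations, spell out
    numbers and delete punctuation marks."""
    words = []
    for tok in text.lower().split():
        if tok in ABBREV:
            words.append(ABBREV[tok])
        elif tok in NUMBERS:
            words.append(NUMBERS[tok])
        else:
            words.append(re.sub(r"[^\w]", "", tok))  # strip punctuation
    return " ".join(w for w in words if w)

print(normalize("O Sr. Santos comprou 2 jornais."))
# o senhor santos comprou dois jornais
```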


6.2.2. Training of an AP-specific Language Model

The baseline vocabulary contains 100k words and produces an OOV rate of 0.96% on the ‘AP test’ corpus. Given the limited scope of this thesis and the low OOV rate, the selection of a different vocabulary does not seem advantageous. As a consequence, the pronunciation lexicon can also be kept the same. Nevertheless, the impact of an OOV rate artificially lowered to 0.00% is analyzed.
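The OOV rate quoted here is simply the share of running test words missing from the vocabulary:

```python
def oov_rate(vocab, test_words):
    """OOV rate in percent: running test words not covered by the vocabulary."""
    return 100.0 * sum(1 for w in test_words if w not in vocab) / len(test_words)

vocab = {"bom", "dia", "angola"}                              # toy vocabulary
print(oov_rate(vocab, ["bom", "dia", "moçambique", "bom"]))   # 25.0
```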

The targeted AP-specific language model generalizes differences between all five PALOP countries: Angola, Cape Verde, Guinea-Bissau, Mozambique and São Tomé and Príncipe. Nevertheless, equal a-priori occurrence probabilities are assumed. Thus, using balanced corpus sizes from each country for training is a prerequisite. As equal balancing could not be assured, modeling each variety on its own and interpolating the resulting models to a general AP-specific model seems most promising.

With the SRILM toolkit [Sto02], 4-grams of each country's written corpora of 1.6M words and the manual transcriptions from the ‘AP train’ and ‘AP extra’ corpora with 86.2k words are independently built and smoothed using modified Kneser-Ney discounting [CG98], as this generally proved to produce good results. Based on the perplexities measured on the ‘AP dev’ transcriptions, optimal interpolation weights are computed iteratively and a joint 3-gram model is interpolated.
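The iterative computation of interpolation weights is an EM procedure over per-word probabilities on the held-out text (SRILM's compute-best-mix follows the same idea). A sketch with made-up probability streams:

```python
def em_weights(streams, iters=50):
    """EM estimation of linear-interpolation weights for several LMs.
    streams[m][t] is model m's probability of the t-th held-out word
    (the numbers used below are invented, not real LM output)."""
    k, n = len(streams), len(streams[0])
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: each model's posterior responsibility per word
        resp = [0.0] * k
        for t in range(n):
            total = sum(w[m] * streams[m][t] for m in range(k))
            for m in range(k):
                resp[m] += w[m] * streams[m][t] / total
        # M-step: new weight = average responsibility
        w = [r / n for r in resp]
    return w

# model 0 consistently assigns higher probability, so it wins the weight
weights = em_weights([[0.5, 0.4, 0.5], [0.1, 0.1, 0.2]])
```

Each iteration provably does not decrease the held-out likelihood of the mixture, so the weights converge to a local optimum.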

In a last step, an interpolation of the AP-specific 3-gram language model with the EP baseline 4-gram language model is created, based on the perplexities measured on the ‘AP dev’ transcriptions.

6.2.3. Results

The perplexities measured on the ‘AP test’ corpus for the EP baseline Language Model (LM) (‘EP baseline LM’), the AP LM (‘AP LM’) and the interpolation of both (‘AP+EP LM’) are given in Table 6.12. The AP LM has a significantly higher perplexity than the EP LM. This is due to the highly different corpora sizes available to train the LMs (1.6M compared to 604M; refer to Section 2.3 for details on the baseline LM). Nevertheless, the interpolation of both LMs achieves a reduction of the baseline perplexity.

                 Perplexity
EP baseline LM        165.4
AP LM                 244.0
EP+AP LM              150.6

Table 6.12.: Perplexities of different language models on the ‘AP test’ corpus (‘+’ denotes an interpolation).
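The perplexities in Table 6.12 are the usual geometric-mean inverse word probability; a minimal sketch with made-up probabilities:

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the negative mean log word probability."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

print(round(perplexity([0.1, 0.01, 0.001]), 1))  # 100.0
```

A lower perplexity means the model is, on average, less surprised by the test text.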


[Figure 6.8 shows four bar charts of the relative WER improvement (−25% to 15%) of the ‘AP LM’, the interpolated ‘AP+EP LM’ and the ‘AP+EP-noOOV LM’ over the baseline EP LM, for all/clean acoustic conditions and slight/strong accents.]

Figure 6.8.: Relative WER in [%] of AP, interpolated AP+EP and interpolated language model with artificially lowered 0% OOV rate compared to the baseline EP language model.

Figure 6.8 displays the WER performance relative to the baseline LM achieved by the ‘AP LM’, the interpolated ‘AP+EP LM’ and the interpolated LM with an updated vocabulary including the OOV words (‘AP+EP-noOOV’) and thus having an OOV rate of 0.0%. It can be seen that the ‘AP LM’ degrades the baseline recognition performance on average by 11.5%. It is worth noticing that the degradation is stronger for clean acoustic conditions, up to almost 25%. This is a positive effect, as it shows that the ‘AP LM’ is able to cope better with strong accents in noisy conditions, including spontaneous speech and grammatically incorrect sentences (see Section 4.2 for details on speech classification). The interpolated LM improves recognition overall by 2.5%. However, on strong accents in clean conditions, the model shows a slight degradation compared to the baseline.

The OOV rate proves to have an overall impact of an absolute 1% WER degradation. This corresponds to other findings claiming that the gain in WER is roughly 1.2 times the reduction in OOV rate [Gau95]. Thus, a dynamic implementation of an AP language model may, at most, improve recognition by this value. However, dynamic LM generation may make it possible to keep the OOV rates of different testing corpora as low as on the ‘AP test’ corpus within this thesis.
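The cited rule of thumb translates into a simple estimate of the attainable gain:

```python
def expected_wer_gain(oov_reduction, factor=1.2):
    """Rough absolute WER gain from an absolute OOV-rate reduction,
    using the factor of about 1.2 reported in [Gau95]."""
    return factor * oov_reduction

# closing the full 0.96% OOV gap of the baseline vocabulary
gain = expected_wer_gain(0.96)   # roughly 1.15 absolute WER
```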

Final recognition performance in terms of WER with the AP-specific adapted acoustic models (2nd alignment, see Section 6.1.1) and the completely retrained acoustic models with random


initialization (3rd alignment, see Section 6.1.3) is given in Table 6.13. An overall WER of

acoustic condition                any                  clean
accent                    slight  strong    ∅    slight  strong    ∅
AP adapted 2nd              20.2    26.8  23.7     16.8    16.4  16.6
gain relative to baseline   20.5    21.3  21.1     18.0    28.8  22.6
AP retrained 3rd            24.0    31.2  27.9     19.0    19.5  19.2
gain relative to baseline    5.7     8.1   7.2      7.1    15.1  10.5

Table 6.13.: WER performance in [%] of AP-specific acoustic models combined with the ‘AP+EP LM’ interpolated language model.

23.7% is achieved with the adapted acoustic models, an improvement of 21.1% relative to the baseline and around 2.2% relative performance gain due to the AP-specific LM. The completely retrained acoustic models with random initialization could only advance on average by a relative 0.7% with the new LM; on strong accents in clean conditions the new LM even degraded performance by 3.7%.

The MAPSSWE test reveals that the AP LM achieves a statistically significant improvement compared to the baseline ASR system (p=0.021, WER result shown in Figure 6.8). However, the performance gain of the adapted system is not statistically significant (p=0.057), even though the significance level of α = 5% is close. The retrained acoustic models with random initialization also do not achieve a significant improvement (p=0.904).

6.3. Discussion and Future Work

Within this chapter it has been shown that adapting the baseline EP acoustic models with around seven and a half hours of manually transcribed data improves the overall recognition performance on AP speech by 19.4%. Limited to clean acoustic conditions and strong AP accents, improvements even reach up to 29.6%. Similar results may also be achieved with less training data if phonetic and acoustic diversity of the supplied data is guaranteed. In contrast, additional data proved to have no further significant effect when adapting the acoustic models.

However, in order to completely retrain acoustic models with random initialization, the manually transcribed data proved to be too limited. Unsupervised training with an additional 31 hours of automatically selected AP speech using a VID system has been shown to improve performance on average by 6.5% compared to the baseline and reached 18.1% improvement on clean strong accents. The fact that the adapted models outperform these results by nearly 13% may be attributed to the large amount of training hours (378 hours) the EP baseline has been trained with. Besides collecting more data for unsupervised training, promising strategies to improve data efficiency during automatic transcription have been presented by Kemp and Waibel [KW99] (see Section 1.4.2). An algorithm comparing the confidence measures of different recognition systems may combine their strengths and permit more accurate alignment.


The fact that AP and EP are very similar in BN may account for the moderate impact of VID during unsupervised training, as shown in Section 6.1.3. Tuning the VID system to select only the strongest accents for unsupervised training, which degrade the baseline performance the most, needs to be investigated.

An AP-specific LM was further able to improve recognition rates by 0.7% to 2.2% (though not statistically significant in all cases) compared to using the baseline LM, yielding an average WER performance gain relative to the baseline of 21.1% for the adapted models and 7.2% for the acoustic models with random initialization.

On slight AP accents, the best performing AP-specific recognition system achieves a performance (WER=20.2%) that seems at least as competitive as the EP results reported in [MVN08]. Strong AP accents (WER=26.8%) achieve recognition results that are about 20% worse than those reported for EP.


7. Conclusion and Future Work

The goal of this thesis was to improve the recognition accuracy of automatic speech recognition systems transcribing African Portuguese broadcast news. A reliable variety identification system was needed to perform an automatic selection of African Portuguese speech from untranscribed data containing speakers with various linguistic profiles. The development of specialized, variety-specific acoustic and language models was hypothesized to yield recognition performance gains on African Portuguese broadcast news.

Concerning corpora, the main contributions were the following: Existing transcriptions were manually corrected and thus significantly improved, which will be helpful for future research using the African Portuguese corpora. Further, the African Portuguese testing corpus was manually classified to describe the accent strength of all speakers. Besides the manual correction and classification of the African Portuguese corpora, this work also resulted in the collection of a significant amount of written African Portuguese corpora that can be used for further linguistic studies of Portuguese.

One of the main achievements of this thesis was the proposal of a novel and efficient approach to automatically distinguish African Portuguese and European Portuguese, lowering the computational cost significantly and reducing the equal error rate by more than 60% compared to a strong state-of-the-art variety identification baseline [KAT10].

Within this thesis, the recognition accuracy on African Portuguese was significantly improved by 21.1% compared to the European Portuguese baseline, from 30.1% Word Error Rate (WER) to 23.7% WER, using AP-specific acoustic and language models. On slight African Portuguese accents, a WER of 20.2% is as competitive as the European Portuguese results reported in [MVN08]. Strong AP accents achieve a WER of 26.8%. However, the correlation of noise and spontaneous speech with strong African Portuguese accents in the testing corpus may partly account for this difference. These results have also been described in the paper “Exploiting variety-dependent Phones in Portuguese Variety Identification applied to Broadcast News Transcription” [KATV10].

Future work in terms of automatic speech recognition for African Portuguese may involve the use of more data for automatic training, acoustic models specifically trained for strong African Portuguese accents, and dynamically generated vocabulary and language models. Moreover, the impact of an AP-specific pronunciation lexicon on recognition accuracy needs to be evaluated.


Many speakers of African Portuguese broadcast news do not speak Portuguese as their first language. However, their native language is not generally known. Corpora collected in more controlled conditions, with knowledge of each speaker's language background, would offer the possibility to relate the most distinguishable mono-phones to the characteristics of the native languages.

The extension of the variety identification module to other languages, particularly very close and mutually intelligible varieties, such as those tested in the NIST LRE, would be very interesting. Finally, an application to varieties without available transcribed data could be a useful future investigation. It would allow a comparison of the most distinguishable mono-phones across various languages. By making the approach more generally applicable, it may prove to have great relevance to the whole language recognition research community.


A. Corpora Sources for Speech Recognition

Program          Date    Dur.
Reporter-Africa  061107  14.3
                 061108  15.4
                 061114  17.5
                 061120  16.8
                 061129  17.1
                 061130  17.8
                 061206  15.7
                 061212  17.6
                 061219  17.9
                 061220  17.0
                 061221  16.8
                 061222  17.3
                 061226  14.1
                 061227  14.9
                 061228  14.4
                 061229  17.8
                 070716  14.8
                 070720  12.9
                 070724  18.2
                 070725  15.8
                 070726  17.5
                 070727  19.3
                 070730  19.1
                 070731  17.5
                 070803  16.2
                 070809  17.2
                 070810  17.4
                 070823   8.7

Table A.1.: The ‘AP train’ corpus coming originally from the ‘PostPORT AP train’ corpus (Total duration = 456.9 min.).


Program          Date    Dur.
Reporter-Africa  071107  17.4
                 071108  14.6
                 071115  17.8

Table A.2.: The ‘AP dev’ corpus originally from the ‘PostPORT AP development’ corpus (Total duration = 49.8 min.).

Program          Date    Dur.
Reporter-Africa  071116  16.6
Reporter-Africa  071119  12.3
Reporter-Africa  071129  19.1

Table A.3.: The ‘AP test’ corpus originally from the ‘PostPORT AP test’ corpus (Total duration = 47.9 min.).

Program         Date    Dur.
Mo-Nweti        080116  24.1
Soap-Angola     070806  16.9
TPA-Telejornal  081030  57.1

Table A.4.: The ‘AP extra’ corpus originally from the PostPORT ‘AP extra’ corpus (Total duration = 87.5 min.).


Table A.5.: The ‘AP automatic’ corpus (total raw recorded duration = 4028.7 min., automatically detected AP duration = 1883.7 min.).

Program          Date    raw Dur.  auto. AP Dur.
TPA Telejornal   080318      57.8           37.7
                 080321      54.1           33.5
                 080322      53.4           22.6
                 080323      62.2           33.7
                 080324      54.9           37.1
                 080417      54.6           36.8
                 080418      59.9           41.8
                 080419      54.7           38.3
                 080426      62.3           41.0
                 080501      59.8           38.6
                 081030      62.1            8.2
Todos Iguais     080128      30.7           10.0
Reporter Africa  071130      24.3           14.5
                 071203      27.3           15.4
                 071210      30.0           14.1
                 071211      26.7           13.2
                 080904      36.0           13.1
                 080905      36.0           15.0
                 080906      36.0           13.7
                 080907      36.0            2.3
                 080908      36.0            3.3
                 080909      36.0           10.5
                 080910      36.0           15.7
                 080911      36.0           15.0
                 080912      36.0           14.5
                 080913      36.0           15.6
                 080914      36.0            1.5
                 080915      36.0            9.6
                 080916      36.0           11.2
                 080917      36.0           15.1
                 080917      36.0           19.1
                 080918      36.0           17.8
                 080919      36.0           17.0
                 080922      36.0            9.0
                 080923      36.0           18.0
                 080924      66.0           20.3
                 080925      36.0           11.3
                 080926      36.0           12.0
                 080929      36.0           19.3
                 080930      36.0           19.9
                 081001      66.0           21.5
                 081002      36.0           17.4
                 081003      41.0           16.6
                 081006      31.0           14.6
                 081007      36.0           15.0
                 081008      36.0           11.4
                 081009      36.0           15.6
                 081010      36.0           12.8
                 081013      30.0           14.5
                 081014      36.0           17.9
                 081015      36.0           15.5
                 081016      36.0           13.9
                 081020      36.0           11.9
                 081021      36.0           15.8
                 081022      36.0           12.5
                 081023      36.0           14.2
                 081027      31.0           11.6
                 081028      36.0            4.6
                 081029      81.0           22.6
                 081030      36.0            7.7
                 081105      36.0            7.0
                 081106      36.0           12.1
                 081110      36.0            8.2
                 081113      36.0           13.4
                 081117      34.0           14.8
                 081119      36.0           21.3
                 081120      36.0           17.8
                 081124      34.0           18.8
                 081126      96.0           52.5
                 081127      36.0           19.6
                 081203      66.0           36.7
                 081204      36.0           22.0
                 081208      36.0           18.4
                 081210      51.0           22.9
                 081211      36.0           19.5
                 081215      36.0           16.2
                 081217      36.0            7.1
                 081219      36.0           19.6
                 081222      36.0           17.3
                 081224      36.0           21.8
                 090105      36.0           27.0
                 090107      36.0           22.8
                 090108      36.0           22.7
                 090112      36.0           22.9
                 090114      36.0           29.5
                 090115      36.0            1.7
                 090119      36.0           25.0
                 090121      66.0           45.6
                 090126      21.0            4.1
                 090202      35.0           22.2
                 090204      66.0           37.7
                 090209      38.0           23.7
                 090211      66.0           42.8
                 090213      36.0           18.7
                 090216      31.0           21.2
                 090218      36.0           25.3
                 090225      36.0           23.1
                 090302      36.0           25.1
                 090304      36.0           11.7


Program  Part  Date    Dur.
Record      3  071106   5.1
            4  071106   5.1
Record      2  071108   2.5
            3  071108   3.4
            4  071108  10.7
Record      2  071112   6.7
            3  071112   6.3
            4  071112   5.2

Table A.6.: The ‘BP dev’ corpus originally from the ‘PostPORT BP development’ corpus (Total duration = 44.9 min.).

Program         Part  Date    Origin             Dur.
Debate-Bola        1  070804  PostPORT BP Train  15.1
                   2  070804                     10.0
                   1  070809                     17.3
                   2  070809                      7.4
                   1  070815                     16.1
                   2  070815                      5.6
                   1  070817                     14.4
                   1  071004                      9.3
                   3  071004                      7.7
                   1  071005                     10.6
                   2  071005                      6.1
                   3  071005                      9.3
                   1  071105                      8.6
                   2  071105                     10.6
Debate-Publico        071004  PostPORT BP Train  21.5
                      071005                     14.4
Fala-Escuto           070817  PostPORT BP Train  44.1
                      070823                     50.7
                      071005                     50.1
Hoje-em-Dia        1  070718  PostPORT BP Train   8.7
Record                070103  PostPORT BP Train  29.9
                      070105                      7.6
                      070118                     36.7
                   1  071106  PostPORT BP Dev    14.7
                   2  071106                      7.5
                   1  071108                      9.6
                   1  071112                     13.5

Table A.7.: The ‘BP train’ corpus originally from the ‘PostPORT BP training’ corpus (Total duration = 456.9 min.).


Program          Date    Dur.
24 Horas         001010  16.9
Acontece         001009  24.2
                 001012  22.3
Jornal 2         001011  31.3
Jornal da Tarde  001009  38.4
                 001013  47.3
Noticias         001010   3.0
Pais Regioes     001010  24.0
                 001013  19.5
Pais Regioes Lx  001012  17.8
RTP Economia     001010   6.2
Telejornal       001009  51.1
                 001010  49.5
                 001011  46.8
                 001012  56.1

Table A.8.: The ‘EP train’ corpus originally from the ‘ALERT SR Train’ corpus (Total duration = 454.4 min.).

Program          Date    Dur.
Noticias         001012   3.1
Pais Regioes Lx  001013  14.5
RTP Economia     001011   5.4

Table A.9.: The ‘EP dev’ corpus originally from the ‘ALERT SR Train’ corpus (Total duration = 23.0 min.).


B. Corpora Sources for Variety Identification

Program          Date
Reporter-Africa  061107
                 061108
                 061114
                 061120
                 061129
                 061130
                 061206
                 061212
                 061219
                 061220
                 061221
                 061222
                 061226
                 061227
                 061228
                 061229
                 070716
                 070720
                 070724
                 070725
                 070726
                 070727
                 070803
                 070809

Table B.1.: The ‘VID training (AP)’ and the ‘VID testing (AP)’ corpus originally from the ‘PostPORT AP training’ corpus.


Program      Part  Date
Debate-Bola        070803
                1  070804
                2  070804
Fala-Brasil        070719
                   070723
Fala-Escuto        070817
Hoje-em-Dia     1  070718
                2  070718
                3  070718
Record             070103
                   070104
                   070105
                   070108
                   070117
                   070118

Table B.2.: The ‘VID testing (BP)’ corpus originally from the ‘PostPORT BP training’ corpus.

Program          Part  Date
Debate-Bola            070803
                    1  070804
                    2  070804
Fala-Brasil            070719
                       070723
Fala-Escuto            070817
Record                 070103
                       070105
                       070108
                       070117
                       070118
Reporter-Africa        061107
                       061114
                       061130
                       070724

Table B.3.: The ‘VID training (BP)’ corpus originally from the ‘PostPORT BP training’ corpus.


Table B.4.: The ‘VID testing (EP)’ corpus.

Program          Date    Origin
24 Horas         000719  Alert SR pilot
                 001010  Alert SR train
                 001017
                 001025
                 001106
                 001207  Alert SR devel
                 001209
Acontece         001016  Alert SR train
                 001019
                 001206  Alert SR devel
Jornal 2         000418  Alert SR pilot
                 001011  Alert SR train
                 001015
                 001019
                 001023
                 001028
                 001031
                 001104
                 001206  Alert SR devel
Jornal da Tarde  000420  Alert SR pilot
                 001009  Alert SR train
                 001013
                 001017
                 001022
                 001025
                 001029
                 001102
                 001106
                 001210  Alert SR devel
Noticias         000414  Alert SR pilot
                 001026  Alert SR train
                 001109
Pais Regioes     000418  Alert SR pilot
                 001010  Alert SR train
                 001013
                 001017
                 001018
                 001020
                 001023
                 001025
                 001101
                 001102
                 001106
                 001108
                 001208  Alert SR devel
Pais Regioes Lx  000418  Alert SR pilot
                 001012  Alert SR train
                 001013
                 001019
                 001023
                 001106
                 001207  Alert SR devel
Telejornal       000406  Alert SR pilot
                 001009  Alert SR train
                 001010
                 001011
                 001012
                 001013
                 001014
                 001015
                 001016
                 001017
                 001018
                 001019
                 001020
                 001021
                 001022
                 001023
                 001024
                 001025
                 001026
                 001027
                 001028
                 001029
                 001030
                 001031
                 001101
                 001102
                 001103
                 001104
                 001105
                 001106
                 001107
                 001205  Alert SR devel
                 001207
                 001208

Table B.5.: The ‘VID training (EP)’ corpus.

Program          Date    Origin
24 Horas         000719  Alert SR pilot
                 001010  Alert SR train
                 001017
                 001025
                 001207  Alert SR devel
                 001209
Acontece         000418  Alert SR pilot
                 001009  Alert SR train
                 001012
                 001016
                 001019
                 001023
                 001026
                 001102
                 001106
Jornal           000418  Alert SR pilot
                 001011  Alert SR train
                 001015
                 001019
                 001023
                 001028
                 001031
                 001104
                 001206  Alert SR devel
Jornal da Tarde  000420  Alert SR pilot
                 001009  Alert SR train
                 001013
                 001017
                 001022
                 001025
                 001029
                 001102
                 001106
                 001210  Alert SR devel
Noticias         000414  Alert SR pilot
                 001016  Alert SR train
                 001107
Pais Regioes     000418  Alert SR pilot
                 001010  Alert SR train
                 001013
                 001016
                 001017
                 001018
                 001020
                 001023
                 001025
                 001101
                 001102
                 001106
                 001107
                 001108
                 001208  Alert SR devel
Pais Regioes Lx  000418  Alert SR pilot
                 001012  Alert SR train
                 001013
                 001019
                 001023
                 001027
                 001031
                 001106
                 001207  Alert SR devel
Remate           000417  Alert SR pilot
RTP Economia     001106  Alert SR train
Telejornal       000406  Alert SR pilot
                 001009  Alert SR train
                 001010
                 001011
                 001012
                 001013
                 001014
                 001015
                 001016
                 001017
                 001018
                 001019
                 001020
                 001021
                 001022
                 001023
                 001024
                 001025
                 001026
                 001027
                 001028
                 001029
                 001030
                 001031
                 001101
                 001102
                 001103
                 001104
                 001105
                 001106
                 001107
                 001205  Alert SR devel
                 001207
                 001208


List of Figures

1.1. Progress in LID during NIST LRE since 1996 [NIS09] given the average detection cost.
1.2. Averaged results of best participants in the language-pair test.
2.1. AUDIMUS schematic overview [Mei08].
2.2. Process flow to generate the pronunciation lexicon starting from the vocabulary.
4.1. Distribution of accent strength, spontaneous speech and noise in ‘AP test’ corpus.
5.1. Correct separation of AP and EP by each SAMPA phone using a binary classifier.
5.2. DET of VID performance of mono-phonetic systems using the best seven, eight and nine mono-phones and systems with a selection of the lowest and middle scoring mono-phones for training.
5.3. Block diagram of the mono-phonetic PRLM system.
5.4. DET curve identifying AP vs. EP.
5.5. Correct separation of BP and EP by each SAMPA phone using a binary classifier.
5.6. DET curve identifying BP vs. EP.
5.7. Correct separation of AP and BP by each SAMPA phone using a binary classifier.
5.8. DET curve identifying AP vs. BP.
5.9. DET curve identifying AP vs. BP vs. EP.
6.1. Relative WER in [%] of systems with different decoder settings compared to the adapted AP-specific acoustic models (1st iteration) with a standard of maximum 3500 active states and the beam size parameter set to 6.
6.2. Influence of sample/weight ratio on the relative WER improvement compared to a system using 15.5 samples/weight.
6.3. DET curve for confidence measures.
6.4. Relation between confidence threshold and false alarm probability.
6.5. Threshold influence on relative WER improvement compared to a system trained without any unsupervised data.
6.6. WER performance of VID systems relative to the corresponding noVID systems.
6.7. WER performance relative to the baseline of VID and noVID systems during the last epochs of training.
6.8. Relative WER in [%] of AP, interpolated AP+EP and interpolated language model with artificially lowered 0% OOV rate compared to baseline EP language model.


List of Tables

3.1. Overview of the AP corpora.
3.2. Overview of the BP corpora.
3.3. Overview of the EP corpora.
3.4. Details about the ‘VID train’ corpus.
3.5. Details about the ‘VID test’ corpus.
3.6. Overview of sources composing the text corpus collected until December 2009 (Total words: 1682k).
4.1. Alignment of AP corpora before and after manual corrections.
4.2. Cohen Kappa and standard error for each individual class.
4.3. Baseline WER with L2F’s standard EP system on the ‘AP test’ corpus.
5.1. Overview of employed baseline classifiers.
5.2. EERs of all competing VID systems identifying AP vs. EP (‘+’ denotes a fusion).
5.3. EERs of all competing VID systems identifying BP vs. EP (‘+’ denotes a fusion).
5.4. EERs of all competing VID systems identifying AP vs. BP (‘+’ denotes a fusion).
5.5. EERs of all competing VID systems identifying AP vs. BP vs. EP (‘+’ denotes a fusion).
6.1. Characteristics of adaptation, 1st iteration.
6.2. WER performance and relative error rate reduction of adapted AP-specific acoustic models (1st iteration) in [%].
6.3. Characteristics of adaptation, 2nd iteration.
6.4. WER performance and relative error rate reduction of adapted AP-specific acoustic models (2nd iteration) in [%].
6.5. Characteristics of retrained AP models randomly initialized (1st iteration).
6.6. WER performance and relative error rate reduction of retrained AP-specific acoustic models with random initialization (1st iteration) in [%].
6.7. Absolute WER performance of VID/noVID FA9 and VID/noVID FA25 on all accents and relative gain of VID systems.
6.8. Characteristics of completely retrained AP models with random initialization (2nd iteration).
6.9. WER performance and relative error rate reduction of completely retrained AP-specific acoustic models with random initialization (2nd iteration) in [%].
6.10. Characteristics of completely retrained AP acoustic models with random initialization (3rd iteration).
6.11. WER performance and relative error rate reduction of completely retrained AP-specific acoustic models with random initialization (3rd iteration) in [%].
6.12. Perplexities of different language models on the ‘AP test’ corpus (‘+’ denotes an interpolation).
6.13. WER performance in [%] of AP-specific acoustic models combined with the ‘AP+EP LM’ interpolated language model.
A.1. The ‘AP train’ corpus coming originally from the ‘PostPORT AP train’ corpus (Total duration = 456.9 min.).
A.2. The ‘AP dev’ corpus originally from the ‘PostPORT AP development’ corpus (Total duration = 49.8 min.).
A.3. The ‘AP test’ corpus originally from the ‘PostPORT AP test’ corpus (Total duration = 47.9 min.).
A.4. The ‘AP extra’ corpus originally from the PostPORT ‘AP extra’ corpus (Total duration = 87.5 min.).
A.5. The ‘AP automatic’ corpus (total raw recorded duration = 4028.7 min., automatically detected AP duration = 1883.7 min.).
A.6. The ‘BP dev’ corpus originally from the ‘PostPORT BP development’ corpus (Total duration = 44.9 min.).
A.7. The ‘BP train’ corpus originally from the ‘PostPORT BP training’ corpus (Total duration = 456.9 min.).
A.8. The ‘EP train’ corpus originally from the ‘ALERT SR Train’ corpus (Total duration = 454.4 min.).
A.9. The ‘EP dev’ corpus originally from the ‘ALERT SR Train’ corpus (Total duration = 23.0 min.).
B.1. The ‘VID training (AP)’ and the ‘VID testing (AP)’ corpus originally from the ‘PostPORT AP training’ corpus.
B.2. The ‘VID testing (BP)’ corpus originally from the ‘PostPORT BP training’ corpus.
B.3. The ‘VID training (BP)’ corpus originally from the ‘PostPORT BP training’ corpus.
B.4. The ‘VID testing (EP)’ corpus.
B.5. The ‘VID training (EP)’ corpus.


Bibliography

[AN08] Abad, A.; Neto, J.: Incorporating Acoustical Modelling of Phone Transitions in an Hybrid ANN/HMM Speech Recognizer. In: Proc. Interspeech. Brisbane, Australia, 2008

[ATNV09] Abad, A.; Trancoso, I.; Neto, N.; Viana, M. C.: Porting an European Portuguese Broadcast News Recognition System to Brazilian Portuguese. In: Proc. Interspeech. Brighton, UK, September 2009

[BAB94] Berkling, K.; Arai, T.; Barnard, E.: Analysis of Phoneme-Based Features for Language Identification. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) vol. 1, 1994, pp. 289–292

[BM94] Bourlard, H.; Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach. Springer, 1994

[BPP96] Berger, A. L.; Della Pietra, V. J.; Della Pietra, S. A.: A Maximum Entropy Approach to Natural Language Processing. In: Computational Linguistics 22 (1996), no. 1, pp. 39–71

[Brü04] Brümmer, N.: Spescom DataVoice NIST 2004 System Description. In: Proc. NIST Speaker Recognition Evaluation. Toledo, Spain, 2004, pp. 1–8

[Brü07] Brümmer, N.: FoCal Multi-class: Toolkit for Evaluation, Fusion and Calibration of Multi-class Recognition Scores. Tutorial and User Manual, 2007

[BYD08] Bing, X.; Yan, S.; Dai, L.: The Adaptation Schemes In PR-SVM Based Language Recognition. In: Proc. International Symposium on Chinese Spoken Language Processing (ISCSLP). Kunming, China, 2008, pp. 1–4

[Cam08] Campbell, W. M.: A Covariance Kernel for SVM Language Recognition. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Las Vegas, USA, 2008, pp. 4141–4144

[Cas03] Caseiro, D.: Finite-State Methods in Automatic Speech Recognition. Lisbon, Portugal, Instituto Superior Técnico, Universidade Técnica de Lisboa, Diss., 2003

[CCC+10] Castaldo, F.; Colibro, D.; Cumani, S.; Dalmasso, E.; Laface, P.; Vair, C.: Loquendo-Politecnico Di Torino System for the 2009 NIST Language Recognition Evaluation. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010


[CCD+07] Castaldo, F.; Colibro, D.; Dalmasso, E.; Laface, P.; Vair, C.: Compensation of Nuisance Factors for Speaker and Language Recognition. In: IEEE Transactions on Audio, Speech, and Language Processing 15 (2007), no. 7, pp. 1969–1978

[CCR+04] Campbell, W. M.; Campbell, J. P.; Reynolds, D. A.; Jones, D. A.; Leek, T. R.: High-Level Speaker Verification with Support Vector Machines. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) vol. 4, 2004

[CCR+06] Campbell, W. M.; Campbell, J. P.; Reynolds, D. A.; Singer, E.; Torres-Carrasquillo, P. A.: Support Vector Machines for Speaker and Language Recognition. In: Computer Speech and Language 20 (2006), no. 2-3, pp. 210–229

[CG98] Chen, S.; Goodman, J.: An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, 1998

[Coh60] Cohen, J.: A Coefficient of Agreement for Nominal Scales. In: Educational and Psychological Measurement 20 (1960), no. 1, pp. 37–46

[CSRS06] Campbell, W. M.; Sturim, D. E.; Reynolds, D. A.; Solomonoff, A.: SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) vol. 1, 2006

[CSTR08] Campbell, W. M.; Sturim, D. E.; Torres-Carrasquillo, P.; Reynolds, D. A.: A Comparison of Subspace Feature-Domain Methods for Language Recognition. In: Proc. Interspeech. Brisbane, Australia, 2008

[CTVB03] Caseiro, D.; Trancoso, I.; Viana, C.; Barros, M.: A Comparative Description of GtoP Modules for Portuguese and Mirandese Using Finite State Transducers. In: Proc. International Congress of Phonetic Sciences (ICPhS). Barcelona, Spain, 2003

[Gau95] Gauvain, J. L.: Developments in Continuous Speech Dictation using the DARPA WSJ Task. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Detroit, USA, 1995

[GC89] Gillick, L.; Cox, S.: Some Statistical Issues in the Comparison of Speech Recognition Algorithms. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Glasgow, Scotland, 1989, pp. 532–535

[GMS04] Gauvain, J. L.; Messaoudi, A.; Schwenk, H.: Language Recognition Using Phone Lattices. In: Proc. International Conference on Spoken Language Processing (ICSLP), 2004, pp. 1215–1218

[Her90] Hermansky, H.: Perceptual Linear Predictive (PLP) Analysis of Speech. In: The Journal of the Acoustical Society of America 87 (1990), no. 4, pp. 1738–1752


[HMBK92] Hermansky, H.; Morgan, N.; Bayya, A.; Kohn, P.: RASTA-PLP Speech Analysis Technique. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1992, pp. I.121–I.124

[KAT10] Koller, O.; Abad, A.; Trancoso, I.: Exploiting variety-dependent Phones in Portuguese Variety Identification. In: IEEE Odyssey 2010: The Speaker and Language Recognition Workshop, 2010

[KATV10] Koller, O.; Abad, A.; Trancoso, I.; Viana, C.: Exploiting variety-dependent Phones in Portuguese Variety Identification applied to Broadcast News Transcription. In: submitted to Proc. Interspeech. Makuhari, Japan, 2010

[KD04] Kenny, P.; Dumouchel, P.: Experiments in Speaker Verification Using Factor Analysis Likelihood Ratios. In: IEEE Odyssey 2004: The Speaker and Language Recognition Workshop, 2004

[KMG98] Kingsbury, B. E. D.; Morgan, N.; Greenberg, S.: Robust Speech Recognition Using the Modulation Spectrogram. In: Speech Communication 25 (1998), no. 1-3, pp. 117–132

[KW99] Kemp, T.; Waibel, A.: Unsupervised Training of a Speech Recognizer: Recent Experiments. In: Proc. European Conference on Speech Communication and Technology vol. 6. Budapest, Hungary, 1999, pp. 2725–2728

[Lev66] Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In: Soviet Physics-Doklady vol. 10, 1966

[Lew09] Lewis, M. P.: Ethnologue: Languages of the World, 16th Edition. SIL International, 2009. http://www.ethnologue.com/. ISBN 1556712162

[LGA02] Lamel, L.; Gauvain, J. L.; Adda, G.: Lightly Supervised and Unsupervised Acoustic Model Training. In: Computer Speech and Language 16 (2002), no. 1, pp. 115–129

[LK77] Landis, J. R.; Koch, G. G.: The Measurement of Observer Agreement for Categorical Data. In: Biometrics 33 (1977), no. 1, pp. 159–174

[Mar08] Martins, C. A.: Dynamic Language Modeling for European Portuguese. University of Aveiro, Portugal, Diss., 2008

[MBSČ06] Matějka, P.; Burget, L.; Schwarz, P.; Černocký, J.: Brno University of Technology System for NIST 2005 Language Recognition Evaluation. In: IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, 2006, pp. 1–7

[MCNT03] Meinedo, H.; Caseiro, D.; Neto, J.; Trancoso, I.: AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language. In: International Workshop on Computational Processing of the Portuguese Language. Faro, Portugal: Springer, 2003, p. 196


[Mei08] Meinedo, H.: Audio Pre-Processing and Speech Recognition for Broadcast News. Universidade Técnica de Lisboa, Diss., 2008

[MN00] Meinedo, H.; Neto, J. P.: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems. In: Proc. International Conference on Spoken Language Processing (ICSLP) vol. 2. Beijing, China, 2000, pp. 931–934

[MPR02] Mohri, M.; Pereira, F.; Riley, M.: Weighted Finite-State Transducers in Speech Recognition. In: Computer Speech and Language 16 (2002), no. 1, pp. 69–88

[MS08] Ma, J.; Schwartz, R.: Unsupervised Versus Supervised Training of Acoustic Models. In: Proc. Interspeech. Brisbane, Australia, 2008

[MTG+06] Montero-Asenjo, A.; Toledano, D. T.; Gonzalez-Dominguez, J.; Gonzalez-Rodriguez, J.; Ortega-Garcia, J.: Exploring PPRLM Performance for NIST 2005 Language Recognition Evaluation. In: IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, 2006, pp. 1–6

[MVN08] Meinedo, H.; Viveiros, M.; Neto, J.: Evaluation of a Live Broadcast News Subtitling System for Portuguese. In: Proc. Interspeech. Brisbane, Australia, 2008

[NIS09] NIST: LRE progress in LID performance. http://www.itl.nist.gov/iad/mig/tests/lre/2009/lre09_eval_results/index.html. Version: April 2009

[NLLM10] Ng, R. W. M.; Leung, C.-C.; Lee, T.; Ma, B.: Prosodic Attribute Model for Spoken Language Identification. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010

[NMA97] Neto, J. P.; Martins, C. A.; Almeida, L. B.: The Development of a Speaker Independent Continuous Speech Recognizer for Portuguese. In: Proc. European Conference on Speech Communication and Technology. Rhodes, Greece, 1997

[PFF90] Pallet, D. S.; Fisher, W. M.; Fiscus, J. G.: Tools for the Analysis of Benchmark Speech Recognition Tests. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) vol. 1, 1990, pp. 97–100

[Ros95] Rosenfeld, R.: Optimizing Lexical and N-gram Coverage Via Judicious Use of Linguistic Data. In: Proc. European Conference on Speech Communication and Technology. Madrid, Spain, 1995, pp. 1763–1766

[RTVA08] Rouas, J.; Trancoso, I.; Viana, C.; Abreu, M.: Language and Variety Verification on Broadcast News for Portuguese. In: Speech Communication 50 (2008), no. 11-12, pp. 965–979

[SR07] Shen, W.; Reynolds, D.: Improving Phonotactic Language Recognition with Acoustic Adaptation. In: Proc. Interspeech. Antwerp, Belgium, 2007

[Sta06] Stadermann, J.: Automatische Spracherkennung mit hybriden akustischen Modellen. Technische Universität München, Diss., 2006


[Sto02] Stolcke, A.: SRILM – An Extensible Language Modeling Toolkit. In: Proc. International Conference on Spoken Language Processing (ICSLP). Denver, Colorado, 2002

[TML+09] Tong, R.; Ma, B.; Li, H.; Chng, E. S.; Lee, K.-A.: Target-Aware Language Models for Spoken Language Recognition. In: Proc. Interspeech. Brighton, UK, September 2009, pp. 200–203

[TTS96] Teixeira, C.; Trancoso, I.; Serralheiro, A.: Accent Identification. In: Proc. International Conference on Spoken Language Processing (ICSLP) vol. 3, 1996, pp. 1784–1787

[WGW07] Wang, L.; Gales, M. J. F.; Woodland, P. C.: Unsupervised Training for Mandarin Broadcast News and Conversation Transcription. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) vol. 4, 2007, pp. 353–356

[WN01] Wessel, F.; Ney, H.: Unsupervised Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition. In: Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, 2001, pp. 307–310

[ZB01] Zissman, M. A.; Berkling, K. M.: Automatic Language Identification. In: Speech Communication 35 (2001), no. 1-2, pp. 115–124

[Zis96] Zissman, M. A.: Comparison of Four Approaches to Automatic Language Identification of Telephone Speech. In: IEEE Transactions on Speech and Audio Processing 4 (1996), no. 1, pp. 31–44
