DATA-DRIVEN NEURAL NETWORK BASED FEATURE
FRONT-ENDS FOR AUTOMATIC SPEECH RECOGNITION
by
Samuel Thomas
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.
Baltimore, Maryland
December, 2012
© Samuel Thomas 2012
All rights reserved
Abstract
Speech contains information about at least three constituent elements - (1)
the message that is being communicated, (2) the speakers who are communicating and
(3) the environment in which the communication occurs. Depending on the final goal,
information about each of these elements is processed by a feature extraction front-end
before being used for subsequent pattern recognition applications. Feature extraction
front-ends for automatic speech recognition (ASR) are designed to derive features that
characterize underlying speech sounds in the signal that are useful in recognizing the
spoken message. Irrelevant variability from speakers and the environment should also
be alleviated to the extent possible.
In this thesis, we improve conventional feature extraction techniques by developing a data-driven feature extraction approach. The key element in this approach is a feed-forward neural network trained on large amounts of data to recognize phonemes, the basic units of speech, occurring at intervals of 5-10 milliseconds. We show that these data-driven features benefit significantly from combining information from multiple acoustic features derived using novel signal processing techniques. In experiments on a variety of ASR tasks, from a small vocabulary continuous digit recognition task to a large vocabulary continuous speech recognition (LVCSR) task, the proposed features provide about a 14% relative reduction in word error rate (WER).
The other problem we address in this thesis relates to the development of LVCSR systems with only a few hours of training data. In conventional systems, performance degrades considerably when the amount of training data is reduced. We propose several techniques to deal with these low-resource scenarios by using features from data-driven feature extractors trained on data from different languages and domains. The proposed techniques allow the feature front-ends to be trained on multilingual data transcribed using different phoneme sets. Our approaches show that with this kind of prior training at the feature extraction level, data-driven features can compensate significantly for the lack of large amounts of training data in downstream speech applications. We demonstrate an absolute WER reduction of about 15% on a low-resource task with only 1 hour of transcribed training data for acoustic modeling.
Apart from being used to generate features, we also show how outputs from the proposed data-driven front-ends can be used for a host of other speech applications. In noisy environments, we show how data-driven features can be used for speech activity detection on acoustic data from multiple languages transmitted over noisy radio communication channels. In a novel speaker recognition model using neural networks, posteriors of speech classes are used to model parts of each speaker's acoustic space, via a training objective function based on posterior probabilities of broad phonetic classes. In zero resource settings, tasks such as spoken term discovery attempt to automatically discover repeated words and phrases in speech without any transcriptions. With no transcripts to guide the process, results of the search depend largely on the quality of the underlying speech representation. Our experiments show that in these settings significant improvements can be obtained using phoneme posterior outputs derived using the proposed front-ends. We also explore a different application of these posteriors - as phonetic event detectors for speech recognition. These event detectors are used along with Segmental Conditional Random Fields (SCRFs) to improve the performance of speech recognition systems.
Thesis Committee
Prof. Mounya Elhilali, Prof. Aren Jansen (Reader) and Prof. Hynek Hermansky
(Reader and Advisor)
Acknowledgments
This thesis would never have been in place without so many great people
around me. I would like to thank my advisor, Prof. Hynek Hermansky for his
guidance and support. He always allowed me to learn, contribute and collaborate on
different projects and work with other research groups. Thank you very much for all
the mentoring!
I owe much to Sriram for always being there for me as a great friend and
collaborator. We have worked together on many interesting ideas and projects, several
of which form the core of this thesis. He has always been around to help - many thanks
for also reading this thesis! My sincere thanks to my colleagues - Sivaram, Harish,
Keith, Vijay, Feipeng, Ehsan, Janu, Kailash, Sridhar, Mike, Bala, Deepu, Joel, Fabio,
Petr, Mathew, John, Tamara, Lakshmi, Hari, Weifeng and Phil. Graduate school
would never have been as it was, without all of you! Thank you very much for the
collaborations and help!
I was fortunate to work with several researchers at UMD (Shihab, Nima, Xinhui, Daniel, Dmitry and Ramani), BBN (Spyros, Stavros, Tim, Long and Bing) and BUT (Lukas, Petr, Pavel, Martin, Ondrej and Honza) on the IARPA BEST, BABEL and DARPA RATS programs. Thank you very much!
I spent three different summers working with CLSP summer workshop teams
led by Dan, Nagendra, Lukas and Richard in 2009, Geoff and Patrick in 2010 and
Aren, Mike and Ken in 2012. These were great workshops! Thanks for having me
on your teams! My sincere thanks to Sanjeev for organizing these workshops and the
CLSP, HLTCOE and ECE for supporting me on various grants.
My sincere thanks to Aren and Mounya for being on my committees from my GBO to the final defense. Thank you very much Aren for the great collaboration, advice and comments on my thesis!
I pulled through all of this because of the love and prayers of my family - my son Joshua, wife Jamie and our parents. No words can express my thanks for the part you have in all this!
Dedication
This thesis is dedicated to my family.
Contents
Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
1.1 Overview of Automatic Speech Recognition
1.2 Conventional Feature Extraction Techniques for ASR
1.3 Integrating Training Data with Feature Extraction
1.4 Review of data-driven feature transforms for ASR
1.4.1 Front-end feature transforms
1.4.2 Back-end feature transforms
1.5 Focus of the thesis
1.6 Outline of Contributions
1.7 Thesis Organization

2 From Acoustic Features to Data-driven Features
2.1 Time-Frequency Representations of Speech
2.1.1 Processing across the frequency axis
2.1.2 Processing across the time axis
2.1.3 Integrating speaker and channel invariance
2.2 Towards Improved Features for ASR
2.2.1 Long-term acoustic features
2.2.2 Parametric models of temporal envelopes
2.2.3 Neural network features
2.2.4 Combination of information from multiple streams
2.3 Novel Short-term and Long-term Features for Speech Recognition
2.3.1 FDLP based time-frequency representation
2.3.2 Short-term Features
2.3.3 Long-term Features
2.3.4 Data-driven Features
2.4 Speech Recognition Experiments and Results
2.4.1 Phoneme Recognition
2.4.2 Small Vocabulary Digit Recognition
2.4.3 Large Vocabulary Continuous Speech Recognition
2.5 Conclusions

3 Data-driven Features for Low-resource Scenarios
3.1 Overview
3.2 Training Using a Combined Phone set
3.3 Training Using Multiple Output Layers
3.4 Speech Recognition Experiments and Results
3.4.1 Data sets
3.4.2 Low-resource LVCSR System
3.4.3 Building Data-driven Front-ends using a Common Phoneme Set
3.4.4 Data-driven Front-ends with MLPs Adapted using Multiple Output Layers
   Training with 2 languages
   Training with 3 languages
3.5 Conclusions

4 Wide and Deep MLP Architectures in Low-resource Settings
4.1 Overview
4.2 Wide Network Topologies
4.2.1 Building the Data-driven Front-ends
4.2.2 Experiments and Evaluations
4.3 Deep Network Topologies
4.3.1 DNN Pretraining and Initialization
4.3.2 DNN Adaptation with task specific data
4.3.3 Experiments and Evaluations
   DNN pretraining with cross-lingual data
   DNN adaptation to low-resource settings
   ASR Experiments using DNN features
4.4 Semi-supervised training in Low-resource Settings
4.4.1 Overview
4.4.2 Selecting Reliable Data
4.4.3 Experiments and Results
   Data selection
   Semi-supervised training of DNNs
   Semi-supervised training of Acoustic Models
4.5 Conclusions

5 Applications of Data-driven Front-end Outputs
5.1 Application 1 - Speech Activity Detection
5.1.1 Overview
5.1.2 Data-driven Features for SAD
5.1.3 Experiments and Results
5.2 Application 2 - Neural Network based Speaker Verification
5.2.1 Overview
5.2.2 AANN Models for Speaker Verification
   Modeling Speaker Data
   Mixture of AANNs
5.2.3 Experiments and Results
5.3 Application 3 - Zero Resource Settings
5.4 Application 4 - Event detectors for Speech Recognition
5.4.1 Building Phoneme Detectors
5.4.2 Integrating Detectors with SCARF
5.5 Conclusions

6 Conclusions
6.1 Contributions
6.2 Summary

Bibliography
Vita
List of Tables
1.1 LDA with different representations of speech.
2.1 FDLP model parameters that improve robustness of short-term spectral features.
2.2 FDLP model parameters that improve performance of long-term modulation features.
2.3 Phoneme Recognition Accuracies (%) for different feature extraction techniques on the TIMIT database
2.4 Word Recognition Accuracies (%) on the OGI Digits database for different feature extraction techniques
2.5 Word Recognition Accuracies (%) on RT05 Meeting data, for different feature extraction techniques. TOT - total word recognition accuracy (%) for all test sets; AMI, CMU, ICSI, NIST, VT - word recognition accuracies (%) on individual test sets
2.6 Recognition Accuracies (%) of broad phonetic classes obtained from confusion matrix analysis
3.1 Word Recognition Accuracies (%) using different Tandem features derived using only 1 hour of English data
3.2 Word Recognition Accuracies (%) using Tandem features enhanced using cross-lingual posterior features
3.3 Word Recognition Accuracies (%) using multi-stream cross-lingual posterior features
3.4 Word Recognition Accuracies (%) using two languages - Spanish and English
3.5 Word Recognition Accuracies (%) using three languages - Spanish, German and English
4.1 Word Recognition Accuracies (%) using different amounts of Callhome data to train the LVCSR system with conventional acoustic features
4.2 Word Recognition Accuracies (%) with semi-supervised pre-training
4.3 Word Recognition Accuracies (%) at different word confidence thresholds
4.4 Word Recognition Accuracies (%) with semi-supervised pre-training
4.5 Word Recognition Accuracies (%) with semi-supervised acoustic model training
5.1 Equal Error Rate (%) on different channels using different acoustic features and combinations
5.2 Performance in terms of Min DCF (×10³) and EER (%) in parentheses on different NIST-08 conditions
5.3 Integrating MLP based event detectors with ASR
6.1 Performances in a low-resource setting using different data-driven front-ends proposed in the thesis.
List of Figures
1.1 Broad Classification of Feature Transforms for ASR.
1.2 Spectral basis functions derived using PCA on the bark-spectrum of speech from the OGI stories database - Eigenvalues of the KLT basis, total covariance matrix projected on the first 8 KLT vectors, first 6 KL spectral basis functions derived by PCA analysis.
1.3 LDA-derived spectral basis functions of the critical band spectral space derived from the OGI Numbers corpus.
1.4 (a) Frequency and impulse responses of the first three discriminant vectors derived by applying LDA on trajectories of critical-band energies from the clean Switchboard database, (b) frequency and impulse responses of the RASTA filter and the RASTA filter combined with the delta and double-delta filters.
1.5 Thesis contributions to developing better data-driven neural network features for the ASR pipeline.
2.1 Illustration of the all-pole modeling property of FDLP. (a) a portion of the speech signal, (b) its Hilbert envelope, (c) all-pole model obtained using FDLP.
2.2 PLP (b) and FDLP (c) spectrograms for a portion of speech (a).
2.3 Schematic of the joint spectral envelope, modulation features for posterior based ASR
3.1 Schematic of the proposed training technique with multiple output layers
3.2 Deriving cross-lingual and multi-stream posterior features for low resource LVCSR systems
3.3 Tandem and bottleneck features for low-resource LVCSR systems.
4.1 (a) Wide and (b) Deep neural network topologies for data-driven features
4.2 Data-driven front-end built using data from the same language but from a different genre.
4.3 A cross-lingual front-end built with data from the same language and with large amounts of additional data from a different language but with the same acoustic conditions.
4.4 LVCSR word recognition accuracies (%) with 1 hour of task specific training data using the proposed front-ends
4.5 MLP posteriogram based phoneme occurrence count
5.1 Schematic of (a) features and (b) the processing pipeline for speech activity detection.
5.2 Average precision for different configurations of the wide topology front-ends
Chapter 1
Introduction
This chapter introduces the automatic speech recognition problem and its machinery. The theme of the thesis, developing data-driven feature extractors for speech recognition, is motivated, along with a discussion of techniques that have been developed in the past. The chapter also outlines the thesis and its contributions.
1.1 Overview of Automatic Speech Recognition
Automatic speech recognition is the process of transcribing speech into text. Current speech recognition systems solve this task in a probabilistic setting using four key components: a feature extraction module, an acoustic model, a pronunciation dictionary and a language model. In a word recognition task, given an acoustic signal corresponding to a sequence of words X = x_1 x_2 ... x_n, the feature extraction module first generates a compact representation of the input as a sequence of feature vectors Y = y_1 y_2 ... y_t. The acoustic model, pronunciation dictionary and language model are then used to find the most probable word sequence X given these feature vectors. This is done by expressing the desired probability p(X|Y) using Bayes theorem as

X̂ = argmax_X p(X|Y) = argmax_X [ p(Y|X) p(X) / p(Y) ]    (1.1)

p(X) is the a priori probability of observing a sequence of words in the language, independent of any acoustic evidence, and is modeled using the language model component. p(Y|X) corresponds to the likelihood of the acoustic features Y being generated given the word sequence X.
In current ASR systems, both the language model and the acoustic model are stochastic models trained using large amounts of training data [1, 2]. Hidden Markov Models (HMMs), or a hybrid combination of neural networks and HMMs [3], are typically used as acoustic models.
For large vocabulary speech recognition, not all words have an adequate number of acoustic examples in the training data, and the acoustic data covers only a limited vocabulary of words. Instead of modeling entire words or utterances with poorly estimated probability distributions from limited examples, acoustic models are built for basic speech sounds. By using these basic units, recognizers can also recognize words that have no acoustic training examples.
To compute the likelihood p(Y|X), each word in the hypothesized word sequence X is first broken down into its constituent phones using the pronunciation dictionary. A single composite model for the hypothesis is then constructed by combining individual phone HMMs. In practice, to account for the large variability of basic speech sounds, HMMs of context dependent speech units with continuous density output distributions are used. There exist efficient algorithms, like the Baum-Welch algorithm, to learn the parameters of these acoustic models from training data [4].
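Given a composite HMM for a hypothesis, the likelihood p(Y|X) is computed efficiently with the forward algorithm, summing over all state paths. The sketch below is a minimal log-domain implementation; the two-state left-to-right topology, transition values and emission scores are hypothetical toy numbers for illustration only.

```python
import numpy as np

def forward_log_likelihood(log_b, log_A, log_pi):
    """Forward algorithm in the log domain.

    log_b:  (T, S) log emission scores log p(y_t | state s)
    log_A:  (S, S) log transition matrix, log_A[i, j] = log p(j | i)
    log_pi: (S,)   log initial state distribution
    Returns log p(Y), summed over all state paths.
    """
    T, S = log_b.shape
    alpha = log_pi + log_b[0]                       # initialization
    for t in range(1, T):
        # alpha_t(j) = log_b[t, j] + logsumexp_i(alpha_{t-1}(i) + log_A[i, j])
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

# Hypothetical 2-state left-to-right "phone" HMM over 3 frames.
log_A = np.array([[np.log(0.6), np.log(0.4)],
                  [-np.inf,     0.0        ]])      # state 1 is absorbing
log_pi = np.array([0.0, -np.inf])                   # always start in state 0
log_b = np.log(np.array([[0.9, 0.1],
                         [0.5, 0.5],
                         [0.2, 0.8]]))
ll = forward_log_likelihood(log_b, log_A, log_pi)
```

The same dynamic-programming structure underlies Baum-Welch training, where the forward pass is paired with a backward pass to accumulate expected state occupancies.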
N-grams, typically bi-grams or tri-grams, are used as language models to generate the a priori probability p(X) [2]. Although p(X) is the probability of a sequence of words, N-grams model this probability assuming that the probability of any word x_i depends on only the N-1 preceding words. These probability distributions are estimated from simple frequency counts that can be obtained directly from large amounts of text. To account for the inability to estimate counts for all possible N-gram sequences, techniques like discounting and back-off are used [5].
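The count-based estimation above can be sketched as follows. Add-one smoothing is used here as a deliberately simple stand-in for the discounting and back-off schemes cited in the text, and the toy corpus is invented.

```python
from collections import Counter

def train_bigram(corpus):
    """Accumulate unigram and bigram frequency counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]            # sentence boundary markers
        unigrams.update(toks)
        bigrams.update(zip(toks[:-1], toks[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """P(w | w_prev) with add-one smoothing, a simple stand-in for the
    discounting/back-off schemes used in real language models."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
uni, bi = train_bigram(corpus)
V = len(uni)
p = bigram_prob("the", "cat", uni, bi, V)   # seen bigram gets high probability
```

Unseen bigrams still receive non-zero probability mass, which is exactly the problem discounting and back-off address in a more principled way.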
1.2 Conventional Feature Extraction Techniques for ASR
Front-ends for ASR, which have traditionally evolved from coding techniques like linear predictive coding (LPC) [6], start by performing a short-term analysis of the speech signal. Based on the assumption that speech is stationary in sufficiently short time intervals, the power spectrum (squared magnitude of the short-time Fourier spectrum) of the signal is computed every 10 ms in overlapping Hamming analysis windows of 25 ms duration [7, 8]. This spectral representation of speech is then transformed into an auditory-like representation by warping the frequency axis to the Mel or Bark scale and applying a non-linear cube-root or logarithmic compression. Mel-frequency Cepstral Coefficients (MFCC) [9] and Perceptual Linear Prediction (PLP) [10] features for speech recognition are cepstral coefficients derived by projecting the auditory-like representation onto a set of discrete cosine transform (DCT) basis functions. Since these techniques analyze the speech signal only in short analysis windows, information about local dynamics of the underlying speech signal is often provided by augmenting these features with derivatives of the cepstral trajectories at each instant [11]. In speech recognition applications, the first 13 cepstral coefficients along with their delta and double-delta derivatives are typically used.
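The pipeline above - short-term Hamming windowing, power spectrum, Mel warping, log compression and DCT projection - can be sketched as a minimal MFCC front-end. The window and hop lengths follow the text; the FFT size, filter count and sampling rate are illustrative choices, not values prescribed by the thesis.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, win=0.025, hop=0.010, n_fft=512, n_filt=20, n_ceps=13):
    """Minimal MFCC sketch: 25 ms Hamming windows every 10 ms, power
    spectrum, Mel filter-bank warping, log compression, DCT projection."""
    n_win, n_hop = int(sr * win), int(sr * hop)
    frames = [signal[i:i + n_win] * np.hamming(n_win)
              for i in range(0, len(signal) - n_win + 1, n_hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # squared magnitude

    # Triangular filters spaced uniformly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for j in range(1, n_filt + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(power @ fbank.T + 1e-10)              # log compression

    # DCT-II basis projects the auditory-like spectrum to cepstral coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return logmel @ dct.T

rng = np.random.default_rng(0)
feats = mfcc(rng.standard_normal(8000))                   # 1 s of noise at 8 kHz
```

Delta and double-delta trajectories would then be appended frame-by-frame to capture the local dynamics mentioned above.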
1.3 Integrating Training Data with Feature Extraction
In practical classification settings, the goal of a classifier is to assign one of J class labels to an entity given an N-dimensional feature vector x. One approach to this problem involves inferring the posterior probability p(C_j|x) of each class given the features. The entity is then assigned to the class with the highest posterior probability [12].

The posterior probability p(C_j|x) of each class can be estimated in multiple ways. In a Bayesian formulation, p(C_j|x) can be expanded as p(x|C_j) p(C_j) / p(x). The quantities p(x|C_j) and p(C_j) are then separately computed from generative models trained to capture these distributions from data. The probability p(C_j|x) can also be estimated directly from a parametric model whose parameters have been optimized using the training data.
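The Bayesian route can be sketched concretely with univariate Gaussians as the generative class-conditional models; the class means, variances and priors below are hypothetical toy numbers.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian likelihood p(x | C_j)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def class_posteriors(x, means, variances, priors):
    """Bayes rule: p(C_j | x) = p(x | C_j) p(C_j) / p(x)."""
    likes = np.array([gaussian_pdf(x, m, v) for m, v in zip(means, variances)])
    joint = likes * np.array(priors)
    return joint / joint.sum()          # dividing by p(x) normalizes

# Two hypothetical classes with equal priors; x = 0.9 lies closer to class 1.
post = class_posteriors(0.9, means=[0.0, 1.0], variances=[1.0, 1.0],
                        priors=[0.5, 0.5])
label = int(np.argmax(post))            # assign to the most probable class
```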
A non-probabilistic approach to the classification problem involves discriminant functions that predict the class label of the input [13]. In this framework, classification is viewed as partitioning the input feature space into different classes using decision boundaries or surfaces. For a simple two-class problem, a linear discriminant function can be constructed as the linear combination of the input feature vector with a weight vector w as

f(x, w) = w^T x + w_0.    (1.2)

In the N-dimensional input space, the function f(x, w) = w^T x + w_0 forms an (N-1)-dimensional hyperplane that assigns x to class C_1 if f(x, w) ≥ 0 and to class C_2 otherwise.
Discriminant functions can be further extended as generalized linear discriminant functions of the form

f(x, w) = w^T φ(x) + w_0,    (1.3)

where φ(.) is a fixed linear or non-linear vector function of the original input vector x. Using these functions, for the J-class problem we can design, for example, a J-class discriminant with J linear functions of the form

f_j(x, w_j) = w_j^T φ(x) + w_0.    (1.4)

x is assigned to class C_k if f_k > f_j for all j ≠ k.
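Operationally, a J-class discriminant of this form is a matrix-vector product followed by an argmax. In the sketch below the three weight vectors are hypothetical, and φ is taken to be the identity.

```python
import numpy as np

def classify(x, W, w0, phi=lambda v: v):
    """J-class generalized linear discriminant:
    f_j(x) = w_j^T phi(x) + w_0; assign x to argmax_j f_j."""
    scores = W @ phi(x) + w0        # one discriminant score per class
    return int(np.argmax(scores)), scores

# Hypothetical weight vectors for a 3-class problem in 2-D
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.zeros(3)
label, scores = classify(np.array([0.2, 0.9]), W, w0)
```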
From a feature extraction perspective, discriminant functions provide an interesting avenue for integrating information from the data through data dependent transformations of the input features. An example of a linear discriminant function is Fisher's linear discriminant. In this method, instead of using the linear combination of the input vector to form a hyperplane for class assignment, the linear combination is used as a dimensionality reduction technique. The weight vector w is designed as a set of basis functions that projects the feature vector x to a lower dimension such that there is maximal separation between class means while the variance within each class is minimized. A common criterion used for this objective is defined as

F(w) = trace(S_w^{-1} S_b),    (1.5)

where S_w and S_b are the within-class and between-class covariance matrices of the data. If the dimensionality of the new projection space is M, the weight vectors can be shown to be the set of basis functions corresponding to the M eigenvectors of S_w^{-1} S_b with the largest eigenvalues [12].
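The eigenvalue formulation of (1.5) can be sketched on synthetic two-class data; the data, dimensions and class separation below are invented for illustration.

```python
import numpy as np
from numpy.linalg import inv, eig

def lda_basis(X, y, M):
    """Basis maximizing trace(Sw^{-1} Sb): top-M eigenvectors of Sw^{-1} Sb."""
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                   # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)      # between-class scatter
    vals, vecs = eig(inv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]                 # largest eigenvalues first
    return vecs.real[:, order[:M]]                      # (N, M) projection basis

# Hypothetical 2-class data in 2-D, separated along the first axis
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.1, (50, 2)),
               rng.normal([3.0, 0.0], 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = lda_basis(X, y, M=1)[:, 0]   # recovered direction lies along the first axis
```

For two classes, S_b has rank one, so a single discriminant direction carries all of the between-class separation.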
More powerful discriminant functions can be designed by using non-linear basis functions. In feed-forward neural networks, which are classic examples of these models, the generalized linear discriminant function is modified as

f(x, w) = g( Σ_{k=1}^{K} w_k φ_k(x) ),    (1.6)

where g(.) is a non-linear activation function and φ_k is now a non-linear basis function. During the training phase, both the basis functions and the weights are adjusted using the training data [13].
In a two-layer neural network, for example, processing starts by creating linear combinations of the N-dimensional feature vector at each of the K hidden layer units. With each of the hidden nodes connected to every input node through a set of weights, an activation input of the form

a_k = Σ_{n=1}^{N} w_{nk} x_n + w_{k0},    (1.7)

is first produced at each node. Each node activation then passes through a differentiable, non-linear activation function ψ(.) to produce output activations b_k = ψ(a_k). Commonly used activation functions are non-linear sigmoidal functions like the logistic sigmoid or the 'tanh' function. The weight w_{nk} is a trainable parameter connecting input node n and hidden node k; w_{k0} is the fixed bias term of the hidden node. Activation outputs of the hidden layer are then linearly combined again to form output unit activations. Each of the M output nodes receives an activation input

a_m = Σ_{k=1}^{K} w_{km} b_k + w_{m0}    (1.8)

to produce an output of the form c_m = σ(a_m), where σ(.) is the 'softmax' activation function defined as

σ(a_m) = exp(a_m) / Σ_{m'} exp(a_{m'}),    (1.9)

for multi-class classification problems. Using (1.7)-(1.9), the overall network function can be written as

h_m(x, w) = σ( Σ_{k=0}^{K} w_{km} ψ( Σ_{n=0}^{N} w_{nk} x_n ) ).    (1.10)

Comparing (1.6) with (1.10) shows how the non-linear basis functions ψ(.) are now also learnt, like the weight parameters. There are different training algorithms to learn these parameters. In commonly used training methods, model parameters are optimized using a cross-entropy error criterion and techniques like error back-propagation. For speech applications, multilayer perceptrons (MLPs) can be used to estimate posterior probabilities of speech classes like phonemes, conditioned on the input features [3, 14].
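Equations (1.7)-(1.10) amount to the following forward pass. The dimensions and random weights below are hypothetical, and tanh stands in for ψ(.).

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """Two-layer MLP forward pass, mirroring Eqs. (1.7)-(1.10):
    hidden activations through tanh, outputs through softmax."""
    a = W1 @ x + b1                  # Eq. (1.7): hidden activation inputs
    b = np.tanh(a)                   # hidden outputs b_k = psi(a_k)
    o = W2 @ b + b2                  # Eq. (1.8): output activation inputs
    e = np.exp(o - o.max())          # Eq. (1.9): softmax (numerically stable)
    return e / e.sum()               # estimated class posteriors

# Hypothetical dimensions: 4-D input, 5 hidden units, 3 output classes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)
post = mlp_posteriors(rng.standard_normal(4), W1, b1, W2, b2)
```

Training would adjust W1, b1, W2, b2 by back-propagating the cross-entropy error, which is omitted here.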
1.4 Review of data-driven feature transforms for ASR
Both the transforms reviewed above - transforms with linear basis functions and transforms with non-linear basis functions - form starting points for the development of more complex data-driven feature transforms and acoustic model back-ends in speech recognition. Although transforms like the discrete Fourier transform and the discrete cosine transform have been used, neither of these transforms is data-driven. There has hence been considerable interest in improving these front-ends with more powerful data-driven techniques.

Figure 1.1: Broad Classification of Feature Transforms for ASR.

Figure 1.1 is a schematic of how data-driven feature extraction or transformation techniques for ASR can be broadly classified. There are clearly two distinct sets of transformation classes - one set of transforms is strongly tied to the feature extraction module, while the second is strongly coupled with the acoustic model and its training criteria. We call the first class front-end feature transforms and the second class back-end feature transforms.
1.4.1 Front-end feature transforms
Data-driven feature extractors at the front-end operate directly on time-frequency representations of speech. As shown in Figure 1.1, these transforms can be further categorized into two broad groups - data independent projections and data-driven projections. Examples of data independent projections are the DCT transforms discussed earlier. Although these are a set of fixed cosine basis functions, they are very similar to basis functions that can be derived from a direct principal component analysis (PCA) [15] on the auditory spectrum of speech. Principal component analysis, or the Karhunen-Loeve transform (KLT), is a mathematical procedure that transforms a set of observations of possibly correlated variables into a new set of values corresponding to linearly uncorrelated variables or principal components. Figure 1.2 (reproduced from [16]) shows a set of spectral basis functions derived using the data-dependent KLT on filter bank outputs from 2 hours of speech from the OGI Stories database [17]. The basis functions are very similar to the cosine functions used in conventional features. The flatness of the first basis function shows that variation in the average energy is what contributes most to the variance of auditory representations.
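A minimal sketch of deriving such a KLT/PCA spectral basis follows. The synthetic frames are invented: each frame carries a shared energy offset across all bands, which reproduces the flat first basis function noted above.

```python
import numpy as np

def pca_basis(S, k):
    """KLT/PCA basis of spectral frames S (frames x bands):
    eigenvectors of the band covariance matrix, sorted by eigenvalue."""
    C = np.cov(S, rowvar=False)                 # band-by-band covariance
    vals, vecs = np.linalg.eigh(C)              # symmetric matrix: use eigh
    order = np.argsort(vals)[::-1]              # largest variance first
    return vals[order], vecs[:, order[:k]]      # top-k spectral basis functions

# Hypothetical "auditory spectrum" frames: 1000 frames x 15 bands, where a
# common per-frame energy offset dominates the variance.
rng = np.random.default_rng(0)
frames = rng.standard_normal((1000, 15)) + 5.0 * rng.standard_normal((1000, 1))
vals, basis = pca_basis(frames, k=6)
first = basis[:, 0]     # nearly constant across bands, i.e. "flat"
```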
LDA using the Fisher discriminant criteria described earlier has been used as a
useful tool in the development of many techniques in the second class of projections - data-
dependent projections. This class is sub-divided further into two groups - a set of transforms
that use linear basis derived by solving a generalized eigenvalue decomposition problem
Figure 1.2: Spectral basis functions derived using PCA on the bark-spectrum of speech from the OGI stories database - eigenvalues of the KLT basis, total covariance matrix projected on the first 8 KLT vectors, and the first 6 KL spectral basis functions derived by PCA analysis.
and those which use neural network based techniques with non-linear basis functions. In
early work, Brown [18] and Hunt [19] used LDA on features in speech recognition.
Hunt and his colleagues integrated LDA with Mel-auditory representations of speech in
a framework they called IMELDA - the integrated Mel-scale representation with LDA [19,20]. A
host of techniques have since been developed based on using LDA with HMM based speech
recognizers to improve recognition performances. These techniques have focused on the use
of different types of output classes like phones, subphones or HMM states, and on addressing
the limitations of LDA - class-conditional distributions are assumed to be normal with equal
covariance matrices. Apart from improving recognition performances, a series of work
by Malayath, van Vuuren, Valente and Hermansky [21–23] have analyzed the usefulness
of LDA with phonemes as output classes. Table 1.1 summarizes their key observations from
using LDA with different time-frequency representations of speech. All these techniques,
while decorrelating the input feature vectors, also maximize the class separability of the
desired output classes, leading to improvements in the recognition performances of ASR
systems.
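As a concrete illustration of the Fisher criterion underlying these techniques, the following numpy sketch builds within- and between-class scatter matrices and solves the generalized eigenvalue problem for the discriminant directions. The two-class synthetic data is purely illustrative and not tied to any of the cited systems:

```python
import numpy as np

def lda_basis(features, labels, n_dims):
    """Fisher LDA: maximize between-class over within-class scatter."""
    mean = features.mean(axis=0)
    d = features.shape[1]
    Sw = np.zeros((d, d))    # within-class scatter
    Sb = np.zeros((d, d))    # between-class scatter
    for c in np.unique(labels):
        Xc = features[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenvalue problem: Sb v = lambda Sw v
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:n_dims]].real.T

# Two synthetic "phoneme" classes separated along the first dimension
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (500, 8)),
               rng.normal(0.0, 1.0, (500, 8)) + np.eye(8)[0] * 3.0])
y = np.repeat([0, 1], 500)
W = lda_basis(X, y, 1)
proj = X @ W.T    # 1-D projection with well-separated class means
```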
Time-Frequency Representation: Short-time Fourier spectrum - LDA is applied to the log-spectra of speech.
Observations: Discriminant vectors have a non-uniform analysis resolution with frequency - low frequency parts of the spectrum are analyzed with higher resolution than high frequency parts. This is consistent with the properties of Mel/Bark filter-bank analysis used in conventional feature extraction techniques. Consistent with the properties of hearing, the sensitivity of features derived using these functions is inversely related to formant frequencies.

Time-Frequency Representation: Critical-band spectrum - LDA is applied to critical band spectral features.
Observations: Unlike the first cosine function, the total energy of the spectrum is not used. The second and third discriminants capture spectral ripples in the central portion of the critical-band spectrum. The fourth basis uses information above 5 Bark. Figure 1.3 (reproduced from [22]) shows the important basis functions.

Time-Frequency Representation: Trajectories of critical-band energies - LDA is applied to long segments of time trajectories.
Observations: Discriminant vectors form a set of FIR filters. The frequency responses of the first three discriminant vectors are consistent with the RASTA, delta and double-delta features used in ASR. Figure 1.4 (reproduced from [21]) compares the basis functions with the RASTA, delta and double-delta filters proposed by Furui [24].

Table 1.1: LDA with different representations of speech.
While the PCA and LDA techniques described above are useful in describing
transforms in the Euclidean space, manifold based techniques characterize data as being
embedded in a manifold space [25–27]. Several generic manifold learning techniques have
been adapted for speech data. While learning the manifold structure, several
of these techniques also model both global and local relationships between data points in
the manifold space as constraints. These learning problems are usually solved as optimization
Figure 1.3: LDA-derived spectral basis functions of the critical band spectral space derived from the OGI Numbers corpus.
problems or as generalized eigenvector problems.
The second important class of front-end transforms uses neural networks. For acoustic
modeling, multilayer perceptrons (MLP) based systems are trained on different kinds of
feature representations of speech to estimate posterior probabilities of output classes like
phonemes, conditioned on the input features [14]. Neural network based acoustic models
provide several key advantages -
Training criteria - Neural networks are trained to discriminate between output classes
using non-linear basis functions, with a cross-entropy training criterion. This training
can also be scaled efficiently to large amounts of training data.
Figure 1.4: (a) Frequency and impulse responses of the first three discriminant vectors derived by applying LDA on trajectories of critical-band energies from the clean Switchboard database, (b) frequency and impulse responses of the RASTA filter and the RASTA filter combined with the delta and double-delta filters.
Input feature assumptions - These networks can model high dimensional input features
without any strong assumptions about the probability distribution of these features.
Several different kinds of correlated feature streams can also be integrated together
since there are also no strong assumptions on statistical independence.
Output representations - MLPs trained on large amounts of data from a diverse collection of speakers and environments can achieve invariance to these unwanted variabilities. Since posterior probabilities are produced by these networks, outputs from
several networks trained on different feature representations can be combined in a
multi-stream fashion to improve the final posterior estimations.
In hybrid HMM/MLP systems [3], these posterior probabilities are used directly
as the scaled likelihoods of sound classes in HMM states instead of conventional state-
emission probabilities from GMM models (discussed in detail in Chapter 2). Alternatively,
these posteriors can be converted to features that replace conventional acoustic features
in HMM/GMM based systems via the Tandem technique [28] (also discussed in detail in
Chapter 2). Features from intermediate layers of neural networks have also been shown to
be useful for speech recognition [29,30].
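A minimal sketch of the Tandem conversion mentioned above (log transform followed by a KLT/PCA decorrelation) might look as follows; random stand-in posteriors are used here in place of real MLP outputs:

```python
import numpy as np

def tandem_features(posteriors, n_dims):
    """Tandem post-processing: log transform (Gaussianization),
    then a PCA/KLT projection for decorrelation and reduction."""
    logp = np.log(posteriors + 1e-10)        # floor to avoid log(0)
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    proj = vecs[:, ::-1][:, :n_dims]         # top-variance KLT directions
    return centered @ proj

# Stand-in posteriors over 5 classes for 1000 frames (not real MLP output)
rng = np.random.default_rng(2)
raw = rng.gamma(1.0, size=(1000, 5))
post = raw / raw.sum(axis=1, keepdims=True)
feats = tandem_features(post, 4)
```

The resulting feature covariance is diagonal by construction, which suits the diagonal-covariance GMMs used in most HMM/GMM back-ends.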
Pinto et al. [31, 32] use a Volterra series based analysis to understand the behavior of
the non-linear transforms that are learned by MLPs trained to estimate phoneme posterior
probabilities. The linear Volterra kernels used to analyze MLPs trained on Mel-filter bank
features reveal interesting spectro-temporal patterns learnt by the trained system for each
phoneme class. An extended study on a hierarchy of MLPs using the same framework
shows that when a second MLP classifier is trained on posteriors estimated by an initial
MLP, it learns phonetic temporal patterns in the posterior features. These patterns include
phonetic confusions at the output of the first MLP as well as phonotactics of the language
learnt from the training data.
1.4.2 Back-end feature transforms
As shown in Figure 1.1, acoustic features after front-end level transforms are used
to train acoustic models. The distribution of each basic speech sound, like a phone, is typically
represented by a Hidden Markov Model (HMM). Phone HMMs are constructed as finite state
machines with typically five states - a start state, three emitting states and an end state,
connected in a simple left-to-right topology. In each of the emitting states, multivariate
continuous density Gaussian mixture models are used to model the emission probability
distribution of feature vectors. To cover the large phonetic variability, separate HMMs
are trained for every basic speech unit, typically a phone, in context with a left and right
neighboring phone. Individual Gaussian parameters along with the mixing coefficients of the
Gaussian mixture models are estimated in a maximum likelihood framework [2]. However,
since the number of trainable tri-phone parameters is huge, additional techniques like
state-tying with phonetic decision trees are used. In a second stage of training, the acoustic
models are then discriminatively trained using objective functions such as maximum mutual
information (MMI) [33, 34], minimum phone error (MPE) [35] or minimum classification
error (MCE) [35]. To improve the performance in each of these two passes of acoustic
model training, separate feature transforms which adapt features to each of the training
phases have been proposed. This set of transforms forms the second major class of feature
transforms, called back-end feature transforms.
In the past, linear discriminant analysis has been investigated in several different
settings - to process feature vectors [18], as a transform to improve the discrimination
between HMM states [36], and also as a feature rotation and reduction technique in a maximum
likelihood setting [37]. Kumar and Andreou generalized LDA with Heteroscedastic linear
discriminant analysis (HLDA) [38] by relaxing the assumption of sharing the same covari-
ance matrix among all output classes. Also developed in a maximum likelihood setting, the
Maximum Likelihood Linear Transform (MLLT) [39] has been shown to be a special case
of HLDA when there is no dimensionality reduction.
Feature space transforms like fMMI [40] and fMPE [41], on the other hand, are
linear transforms applied on feature vectors in a discriminative framework to optimize
the MMI/MPE objective functions. Similar to the early work in [42], region dependent
linear transforms (RDLT) [43] extend fMMI/fMPE by first partitioning the feature space
into different regions using a GMM. Each feature vector is then transformed by a linear
transform corresponding to the region that the vector belongs to, via posterior probabilities
from the pre-trained GMM.
State-of-the-art systems use a combination of both the front-end and back-end
transforms. Studies like [44] have shown that although these transforms are separately
applied at the feature and model level, they can be combined to significantly improve ASR
performances.
1.5 Focus of the thesis
The feature extraction module plays a crucial “gate-keeper” role in any pattern
recognition task. If a poorly designed feature extractor discards information from the signal
that is useful for classification, it cannot be recovered and the classification task suffers.
On the other hand, if the feature extraction module allows irrelevant and redundant detail
to remain in the features, the classification module has to be additionally developed to cope
with this. In speech recognition, a similar setting exists - a feature
extraction front-end first produces features for a pattern recognition back-end to recognize
words. To improve the performances in this setting, this thesis focuses on developing better
features for ASR through an efficiently designed front-end.
The review presented above describes one avenue of improvement for current
speech recognition feature front-ends - the development of better data-driven features.
Figure 1.5 illustrates this. The primary goal of speech recognition is to extract the message
that the human communicator produced using an inventory of basic speech units. However,
the message is embedded among several other constituent components of the speech signal
as it passes through a communication channel influenced by the human speaker, the
transmission mechanism and the environment before it is captured by a machine using a microphone.
It is the goal of the feature extractor module to remove these irrelevant variabilities while
extracting useful features for the speech recognition back-end to recover the message.
Current speech recognition front-ends largely rely on information in the short-term spectrum
of speech. This representation is however very fragile and easily corruptible by channel
artifacts. It is hence necessary to extend the scope of information extraction to other
sources of knowledge. The best source of information is the data itself. This thesis therefore
focuses on data-driven techniques to improve features for ASR.
In earlier sections, several techniques that allow data integration into feature ex-
traction were reviewed. Neural networks provide very interesting mechanisms of integrating
information not only because they are discriminatively trained and use non-linear basis func-
tions to transform the data but also because they have been shown to have several other
Figure 1.5: Thesis contributions to developing better data-driven neural network features for the ASR pipeline.
key advantages. For example, they can accommodate large feature dimensions and do not
place strong assumptions on the distributions of these features. A very significant advantage is that they can also directly produce posterior probabilities of speech classes, making
the posteriogram representation of speech - the evolution of the posteriors of speech classes
like phonemes over time - a useful source of information for speech recognition (see Figure
1.5). As can be seen, this representation is devoid of speaker and channel variabilities and is
linked more closely to the underlying speech message encoded using basic speech units like
phonemes.
The performance of these data-driven feature extractors is however linked to several factors. The MLP estimates posterior probabilities of phoneme classes ci conditioned
on the input acoustic features x and the model parameters w as p(ci|x,w). The factors
that hence determine the goodness of the posteriogram representation are -
(a) The input acoustic features: Robust acoustic features which capture information from
the rich spectro-temporal modulations of speech need to be designed.
(b) The amount of training data: Significant amounts of task dependent data need to be
used to train the parameters of neural network models.
(c) Network architectures: Suitable network architectures have to be used to learn the
data-driven transforms.
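The posterior computation p(ci|x,w) above can be sketched as a forward pass through a small MLP. The dimensions (39 input features, 100 hidden units, 40 phoneme classes) and the random weights are purely illustrative:

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """Forward pass of a single-hidden-layer MLP estimating phoneme
    posteriors p(c_i | x, w): sigmoid hidden layer, softmax output."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # non-linear hidden basis
    z = h @ W2 + b2
    z -= z.max(axis=-1, keepdims=True)          # softmax numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative random weights: 39-dim acoustic frame, 40 phoneme classes
rng = np.random.default_rng(3)
W1, b1 = rng.normal(0, 0.1, (39, 100)), np.zeros(100)
W2, b2 = rng.normal(0, 0.1, (100, 40)), np.zeros(40)
p = mlp_posteriors(rng.normal(size=(5, 39)), W1, b1, W2, b2)
print(p.sum(axis=1))   # each frame's posteriors sum to 1
```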
1.6 Outline of Contributions
The thesis contributes to improvements in each of the above mentioned factors in
developing better data-driven feature front-ends (Figure 1.5).
(a) Exploiting temporal dynamics of speech: We adopt a novel signal processing tech-
nique based on Frequency Domain Linear Prediction (FDLP) to better model sub-band
temporal envelopes of speech. Features from these representations are used to build
data-driven feature front-ends (Chapter 2). In experiments on a variety of ASR tasks -
from a small vocabulary continuous digit recognition task to a large vocabulary continuous speech recognition (LVCSR) task, the proposed data-driven features provide about
14% relative reduction of word error rate (WER).
(b) Working with limited amounts of training data: With significant amounts of training
data the proposed data-driven features perform well (Chapter 2). However, in
several real-world scenarios this is not always the case. In the development of ASR
technologies for new languages and domains, for example, very few hours of transcribed
data are available initially. We hence focus on data-driven features in low resource
scenarios where only up to 1 hour of transcribed task dependent data is available to train
acoustic models. As with every data-driven technique, the performance of these feature
extractors also diminishes in such conditions.
In Chapters 3 and 4 we propose techniques to alleviate these effects. Our proposed
techniques are based on the use of task independent data. In many cases, these sources
of data cannot be used directly. For example, if data from different languages is used to
build ASR systems for a new language, differences in the phone sets used to transcribe each
language come into play. We propose techniques to deal with these kinds of issues in training
data-driven front-ends.
(c) Neural network architectures for data-driven front-ends: We demonstrate the
use of several neural network architectures to allow task independent data to be used.
Using data transcribed with different phone sets from different languages, these im-
provements allow better neural network models to be built. Our contributions lead to
an absolute WER reduction of about 15% on a low-resource task with only 1 hour of
transcribed training data for acoustic modeling.
(d) Applications of data-driven features: Several new applications of the proposed
data-driven front-ends are presented, apart from using them to generate features for
ASR (Chapter 5). These applications include speech activity detection in noisy environments, speaker verification using neural networks, term discovery in zero-resource
settings and event detectors for speech recognition.
1.7 Thesis Organization
This thesis is organized as follows:
1. Chapter 2 is an overview of different feature extraction techniques for ASR and
introduces a set of new features using Frequency Domain Linear Prediction. The
usefulness of these features is demonstrated in a series of ASR experiments.
2. In Chapter 3, we discuss a key weakness of data-driven ASR acoustic modeling
techniques - the performance of these systems drops significantly with only a few hours
of transcribed training data. We show how this can be compensated for using the
proposed data-driven front-ends, which are themselves affected in these scenarios. The
proposed approaches are based on the use of multilingual task independent data.
3. Chapter 4 extends the training approaches introduced in Chapter 3 by employing
wider and deeper neural network architectures in low resource settings. Typically
these kinds of networks cannot be trained well with only few hours of transcribed
training data. We however show how task independent data can be used in these
settings as well.
4. Chapter 5 discusses four different applications of data-driven front-end outputs - as
features for speech activity detection, probabilities of broad phonetic classes to model
parts of each speaker's acoustic space in a neural network based speaker verification
system, feature representations for zero-resource applications and event detectors for
speech recognition.
5. Chapter 6 summarizes the thesis.
Chapter 2
From Acoustic Features to
Data-driven Features
This chapter introduces a novel acoustic feature extraction technique for ASR. Data-driven
front-ends are developed using these features and evaluated on different ASR tasks. Significant
improvements are demonstrated by using the proposed features with neural network based front-ends.
2.1 Time-Frequency Representations of Speech
Conventional feature extraction techniques start with the short-term spectrum of
speech - a representation derived by applying the Fourier transform on short segments
of speech. Typically the short-term analysis is performed using a 25 ms Hamming window
every 10 ms. Although speech is a non-stationary signal, over sufficiently short time inter-
vals, the signal can be considered stationary. In each of these analysis windows, the power
spectrum - the squared magnitude of the short-term Fourier spectrum - is then computed, before
being processed along two dimensions - across frequency and across time. The processing
across frequency attempts to model the gross shape of the spectrum. Temporal dynamics
of the spectrum are, on the other hand, captured by the processing across time.
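The short-term analysis described above (25 ms Hamming windows every 10 ms, squared-magnitude Fourier transform) can be sketched as follows; the FFT size of 512 is an illustrative choice:

```python
import numpy as np

def power_spectrum(signal, fs, win_ms=25, hop_ms=10, nfft=512):
    """Short-term power spectrum: windowed frames, squared |FFT|."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=nfft, axis=1)
    return np.abs(spec) ** 2    # squared magnitude per frame

# 1 second of a 1 kHz tone at 16 kHz: 98 frames x 257 frequency bins
fs = 16000
t = np.arange(fs)
frames = power_spectrum(np.sin(2 * np.pi * 1000 * t / fs), fs)
print(frames.shape)
```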
2.1.1 Processing across the frequency axis
Processing across the frequency axis has two primary objectives. Through a se-
quence of steps the resolution of the spectrum is first modified to be non-uniform instead
of the inherent uniform Fourier transform resolution. The non-uniform resolution has been
shown to be useful for discriminating between basic speech sounds [22]. The spectrum is also
smoothened to capture only its gross shape and remove any rapidly varying fluctuations.
In Perceptual Linear Prediction (PLP) [10], the first objective is achieved through
a set of operations motivated by human auditory perception which convert the power spec-
trum to an auditory-like spectrum. These steps include -
• Using a filter-bank of trapezoidal filters, to warp the power spectrum to a Bark fre-
quency scale. Outputs of these integrators are consistent with the notion of integration
of signal energy in critical bands in the human ear [45].
• Emphasizing each sub-band frequency signal using a scaling function based on the
equal-loudness curve of hearing. This operation has an equivalent effect of pre-
emphasis in the time domain. Pre-emphasis is performed to remove the overall spectral
slope of the spectrum and DC component of the speech signal.
• Compressing the sub-band signals using the cubic root function. This step is moti-
vated by the power law of hearing that relates intensity and perceived loudness.
The gross shape of the auditory spectrum of speech is finally approximated using an auto-
regressive model. The prediction coefficients of this model are obtained via a recursive auto-
correlation based method on the inverse Fourier transform of the auditory spectrum [10]. As
described in the previous chapter, the features for ASR are cepstral coefficients obtained by
projecting the smoothened auditory-like representation onto a set of discrete cosine trans-
form (DCT) basis functions. Based on the source-filter interpretation of LPC, smoothing
the spectrum using LPC allows the features to capture vocal tract filter properties which
are useful in characterizing speech sounds. Apart from its decorrelation and dimensionality
reduction properties, truncating the DCT coefficients also removes higher order coefficients
that capture speaker specifics in the spectrum. This further smooths the spectrum.
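The final steps above can be sketched as follows; this simplified illustration keeps only the cube-root compression and the truncated DCT projection, and omits the auto-regressive (LPC) smoothing stage of true PLP:

```python
import numpy as np

def cepstra_from_auditory(aud_spec, n_ceps=13):
    """Simplified PLP-style finish: cube-root compression of an
    auditory spectrum, then projection onto DCT bases, keeping only
    the lowest-order (smoothest) coefficients."""
    compressed = np.cbrt(aud_spec)                 # power law of hearing
    n = len(compressed)
    k = np.arange(n)
    # DCT-II basis functions: fixed cosine bases over the bands
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n))
    return basis @ compressed                      # truncation smooths

# A toy 20-band auditory spectrum (positive values)
aud = np.abs(np.random.default_rng(4).normal(size=20)) + 1.0
c = cepstra_from_auditory(aud)
```

Note that the zeroth coefficient is simply the sum of the compressed bands, i.e. an overall energy term, consistent with the flat first basis function discussed in Chapter 1.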
2.1.2 Processing across the time axis
Just as the gross shape of the spectrum is useful in characterizing speech sounds,
temporal dynamics of the spectrum are also key to classification. Several important observations [46] useful in capturing these dynamics include the facts that -
• speech is produced at a typical rate by vocal tract movements, and the rate of change
of non-linguistic components is usually outside this range, and
• human perception is more sensitive to relative changes than absolute quantities.
Traditionally, temporal dynamics have been captured through first and second order
time derivatives of cepstral coefficients [11]. These operations can also be interpreted as
filtering operations that enhance components around 10 Hz of the modulation spectrum of
speech while suppressing other higher and lower components. In the RASTA processing of
speech, the above mentioned observations are explicitly integrated into the PLP pipeline by
filtering the temporal trajectories of the spectrum to suppress constant factors while pre-
serving components of the modulation spectrum between 1 and 12 Hz [46]. In an extension
to the RASTA technique, a bank of bandpass filters with varying resolutions has also been
developed in [47] to process the modulation spectrum.
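The first-order time derivatives mentioned above are conventionally computed with a regression over a few neighboring frames; a numpy sketch using the standard +/-2 frame window:

```python
import numpy as np

def delta(cepstra, N=2):
    """First-order time derivatives of cepstral coefficients via the
    standard regression formula over +/- N neighboring frames."""
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n:len(cepstra) + N + n] -
                    padded[N - n:len(cepstra) + N - n])
               for n in range(1, N + 1)) / denom

# A linear ramp in every coefficient has constant slope 1 in the interior
ramp = np.arange(10.0)[:, None] * np.ones((1, 3))
d = delta(ramp)
```

Second-order (double-delta) features are obtained by applying the same operation to the first-order output, matching the filtering interpretation given above.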
2.1.3 Integrating speaker and channel invariance
As illustrated in Figure 1.5, the speech signal is often modified by channel and
speaker characteristics before it is processed by the feature extraction module. It is hence
necessary to compensate for these artifacts as well.
Differences in vocal tract anatomy lead to significant variability in the spectrum
between speakers and genders. Other extrinsic characteristics that produce speaker
variabilities are the socio-linguistic background and emotional state of the speakers. While
some of these artifacts are compensated for by techniques like PLP, techniques such as vocal
tract length normalization (VTLN) [48] are often used by state-of-the-art feature extraction
techniques.
Effects from the channel or environment are usually modeled as additive or convo-
lutive distortions. When the speech signal is corrupted by additive noise, the recorded
speech signal is expressed as
ns[m] = cs[m] + n[m], (2.1)
where ns[m], cs[m], n[m] are discrete representations of the noisy speech, clean speech and
the corrupting noise respectively. If the speech and noise are assumed to be uncorrelated,
in the power spectral domain we can write
Pns(m,ωk) = Pcs(m,ωk) + Pn(m,ωk), (2.2)
where Pns(m,ωk), Pcs(m,ωk), Pn(m,ωk) are the short-term power spectral densities at frequency ωk of the noisy speech, clean speech and noise respectively. Conventional feature
extraction techniques for ASR estimate the short-term (10-30 ms) power spectral density
(PSD) of speech on a Bark or Mel scale. Hence, most of the recently proposed noise robust
feature extraction techniques apply some kind of spectral subtraction in which an estimate
of the noise PSD is subtracted from the noisy speech PSD. The estimate of noise PSD
is usually computed using a speech activity detector from regions likely to contain only
noise (for example the ETSI front-end [49]). A survey of many other common techniques
is available in [50].
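Following Eq. 2.2, a minimal power-spectral-subtraction sketch looks like this; the noise PSD is assumed to have been estimated already (e.g. by a speech activity detector), and a spectral floor keeps the estimate non-negative:

```python
import numpy as np

def spectral_subtraction(noisy_psd, noise_psd, floor=0.01):
    """Subtract an estimated noise PSD from the noisy-speech PSD
    (Eq. 2.2 rearranged), flooring the result to avoid negative power."""
    clean_est = noisy_psd - noise_psd
    return np.maximum(clean_est, floor * noisy_psd)

# Toy frame: clean speech power plus stationary noise power
clean = np.array([4.0, 9.0, 1.0, 0.2])
noise = np.array([0.5, 0.5, 0.5, 0.5])
est = spectral_subtraction(clean + noise, noise)
# With a perfect noise estimate, est recovers the clean PSD exactly;
# in practice estimation errors make the floor essential.
```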
The second class of distortions are convolutive distortions introduced by room
reverberations when speech is recorded using a distant microphone or by telephone com-
munication channels. If the channel effect or room reverberation can be characterized as a
channel impulse response or room impulse response, the noisy speech can be written as
ns[m] = cs[m] ∗ r[m], (2.3)
where ns[m], cs[m], r[m] are the noisy speech, clean speech and the corrupting room or
channel impulse response respectively. These kinds of convolutive distortions are multiplicative
in the spectral domain and additive in the log-spectral domain. However these assumptions
hold true only in appropriate analysis windows.
In cepstral mean subtraction (CMS), the channel impulse response is assumed to
be shorter than the short-term Fourier transform (STFT) analysis window. If the artifact
is assumed to be constant in the short analysis window, its effect appears as an offset term
in the final cepstral representation. If the channel remains the same for each recorded ut-
terance, the artifact can then be removed by subtracting the mean of the cepstral sequences
corresponding to the utterance. State-of-the-art systems, in addition to cepstral mean
normalization (CMN), also perform a variance normalization to improve robustness to these
distortions [51]. Usually this is done on a per speaker basis as well, to achieve additional
speaker normalization. In the log-DFT mean normalization technique [52], mean subtrac-
tion is done on a linear frequency scale instead of a warped scale as in CMS since the
assumption that response functions might have a constant value in each critical band is not
always valid.
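The per-utterance cepstral mean (and variance) normalization described above is a one-liner; in this sketch, adding a constant channel offset to every frame leaves the normalized features unchanged, which is exactly the convolutive-distortion removal CMS is designed for:

```python
import numpy as np

def cmvn(cepstra, variance_norm=True):
    """Per-utterance cepstral mean (and variance) normalization.
    Subtracting the mean removes a constant convolutive channel offset;
    variance normalization additionally equalizes the dynamic range."""
    out = cepstra - cepstra.mean(axis=0)
    if variance_norm:
        out = out / (cepstra.std(axis=0) + 1e-10)
    return out

# A constant channel offset added to every frame vanishes after CMVN
rng = np.random.default_rng(5)
utt = rng.normal(size=(200, 13))        # toy 13-dim cepstra, 200 frames
channel = np.full(13, 3.0)              # constant log-spectral offset
a, b = cmvn(utt), cmvn(utt + channel)
```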
In reverberant environments, to make a similar mean subtraction effective, it is
necessary to estimate the log-spectrum in much longer analysis windows. This is because
room impulse responses, characterized by their T60 reverberation times, usually range
between 200-800 ms. T60 denotes the amount of time required for the reverberant signal to
reduce by 60 dB from the initial direct component value. Successful techniques like long-
term spectral subtraction (LTLSS) [53] hence use analysis windows as long as 2 seconds to
deal with these artifacts depending on the nature of the reverberation.
2.2 Towards Improved Features for ASR
After the time-frequency representation of speech has been processed as described
above, acoustic features for speech recognition are derived from the representation. The
most common feature representations are cepstral vectors of the processed power spectral
envelope derived using 20-30 ms analysis windows every 10 ms. These features are then
typically augmented with time derivatives (first, second and third derivatives). In some sys-
tems instead of using the time derivatives, 9-21 successive frames are concatenated together
and used after a projection to a lower dimension using various transforms [54–56].
Aggregating information from only such a limited temporal context could however
be a reason for lower ASR performances compared to human recognition performances [57].
This argument is further strengthened by information theoretic results showing that features
from longer time intervals (up to several hundred milliseconds) are useful for better
discrimination between speech sounds [58]. This limitation has been addressed using several different
feature extraction/signal processing techniques discussed below.
2.2.1 Long-term acoustic features
Through a number of studies it has been shown that speech perception is sensitive
to relatively slow modulations of the temporal envelope of speech [59, 60]. Most of the
energy in the modulation spectrum peaks around 4 Hz, which also corresponds to the
syllabic rate of speech. Although these components are affected in the presence of noise [61, 62],
modifying modulation components in the 1-16 Hz range results in significant degradation
of speech intelligibility [59, 60].
Information from the modulation spectrum can be derived from a spectral analysis
of temporal trajectories of spectral envelopes of speech [63]. However, in order to achieve
sufficient spectral resolution at the low modulation frequencies described above,
relatively long segments of the speech signal need to be analyzed. For example, to capture
modulation spectrum components around 4 Hz, an analysis window of at least 250 ms is
necessary. Analysis windows of this length are also consistent with the time intervals of
co-articulation - a speech production phenomenon, forward masking - an auditory perception
phenomenon, and the linguistic concept of the syllable [64]. By deriving features for ASR using
these kinds of analysis windows, information about the dynamics of spectral components
is explicitly captured.
In [65, 66], 1 second long temporal trajectories of individual critical sub-band en-
ergies were used for phoneme recognition experiments. In this multi-stream framework,
separate neural network classifiers were trained on long-term features from each sub-band
before being combined by a second level neural network. Since features from
each sub-band were used independently, the comparable performance of this feature
extraction technique with conventional short-term spectral features demonstrates that there
is significant information captured in the local temporal dynamics. These temporal
pattern features (TRAPS) have been extended in different configurations (for example [67])
as modulation features after applying a cosine transform [68] or filtering using modulation
filters [47].
2.2.2 Parametric models of temporal envelopes
The modulation features discussed above are extracted from sub-band energies
of speech using long analysis windows. The sub-band energies are not directly modeled
but are instead produced with an inherent limited resolution as outputs of Bark/Mel scale
integrators on the power spectrum in short analysis windows every 10 ms (see Section
2.1.1). For more effective features that capture the evolution of the temporal envelopes, it
is necessary to directly model the temporal envelopes.
As described in Section 2.1.1, conventional feature extraction techniques use LPC
in time to effectively capture spectral resonances. Based on duality properties, LPC can
similarly be performed in the frequency domain to directly model and capture important
temporal events. This framework is based on the notion that speech can be considered to
be composed of several amplitude modulated signals at different carrier frequencies. The
AM component of each of these signals is the squared magnitude of their corresponding
analytical signals. The squared magnitude of the analytical signal is also called the Hilbert
envelope and is a description of temporal energy. Instead of computing the analytic signal
directly, an auto-regressive modeling approach can be used. This modeling approach also
called Frequency Domain Linear Prediction (FDLP) is the dual of conventional time domain
linear prediction used to model the power spectrum of speech [69,70]. Instead of modeling
the power spectrum, FDLP models the evolution of signal energy in the time domain
by the application of linear prediction in the frequency domain using the discrete cosine
transform of the signal. This parametric model can be used as an alternate technique to
directly model sub-band envelopes of speech [71,72].
2.2.3 Neural network features
The modulation features described in the earlier sections are typically high-dimensional,
correlated features. Both of these properties prevent them from being used directly
with ASR systems. These features have hence been used in conjunction with neural
networks, which have much more relaxed assumptions on feature distributions. As described
in the previous chapter neural networks can be trained to estimate posterior probabilities
of speech classes. These probabilities can then be used directly as scaled likelihoods in the
hybrid HMM-ANN ASR framework.
Another approach to using neural network posterior outputs, is to convert the
posteriors to features similar to traditional acoustic features for ASR systems. In the Tan-
dem processing approach [28], posterior features from neural networks are post-processed
to be decorrelated and to have an approximately normal distribution. This is done in a
two-step procedure - a log transform is first applied to the posteriors to Gaussianize the
vectors, followed by a dimensionality reduction using the KL transform. Several other approaches
have been proposed to derive features from the outputs of neural networks. In the HATS
technique [73] non-linear outputs from the penultimate layer of a network have been used.
This has been further extended to deriving features from an intermediate bottleneck layer
which reduces the feature dimension as well [30].
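The Tandem post-processing described above can be sketched as follows; the small flooring constant and the 25-dimensional output are illustrative assumptions:

```python
import numpy as np

def tandem_postprocess(posteriors, out_dim=25):
    """Tandem post-processing sketch: log transform to Gaussianize MLP
    posteriors, then a KL (PCA) transform to decorrelate and keep the
    top-variance components."""
    logp = np.log(posteriors + 1e-10)                 # Gaussianize
    logp = logp - logp.mean(axis=0)
    cov = np.cov(logp, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)                # KL transform basis
    order = np.argsort(evals)[::-1][:out_dim]         # highest-variance directions
    return logp @ evecs[:, order]
```

The projected features are decorrelated by construction, which suits the diagonal-covariance Gaussians used in HMM-GMM systems.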
2.2.4 Combination of information from multiple streams
A key benefit from the development of long-term features is the significant
LVCSR gains obtained from combining these features with conventional short-term
features [73]. The best combination of features is obtained by first training neural networks
using both the long-term modulation features and short-term spectral energy based features
separately. The outputs of the neural networks are then combined using a merger neural
network or using different combination rules before being used as data-driven features for
LVCSR tasks [74]. As discussed in [75], this approach is useful for several reasons -
• The MLP features derived from neural networks trained on conventional short-term
spectral features and long-term modulation features capture complementary information
about phone classes.
• Although the MLPs are trained on different inputs, since they have the same target
classes the complementary outputs can be effectively combined.
• During the training phase the neural networks are able to discriminatively learn
class boundaries and produce data-driven features that are useful for classification
of sounds. These features are also relatively speaker invariant.
• After the application of post-processing techniques like Tandem, the data-driven neu-
ral network features can easily be modeled by HMM-GMM based LVCSR systems.
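The stream combination just described can be sketched with the simple sum and product rules; these are standard alternatives to a trained merger network and stand in for, rather than reproduce, the specific combination rules of [74]:

```python
import numpy as np

def merge_posteriors(p1, p2, rule="product"):
    """Combine posterior streams from two MLPs trained on different
    features but the same phoneme targets (simple rule-based sketch)."""
    if rule == "sum":
        merged = 0.5 * (p1 + p2)                     # average the two streams
    elif rule == "product":
        merged = p1 * p2                             # reward agreement
    else:
        raise ValueError("unknown rule: %s" % rule)
    return merged / merged.sum(axis=1, keepdims=True)  # renormalize rows
```

A merger network would instead be trained on the concatenated stream posteriors to predict the same phoneme targets.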
2.3 Novel Short-term and Long-term Features for Speech
Recognition
We propose a novel feature extraction scheme along the lines of the techniques
described above, to derive two kinds of features - short-term spectral features and long-
term modulation features for ASR. The technique starts by creating a two-dimensional
auditory spectrogram representation of the input signal. This is formed by stacking sub-
band temporal envelopes in frequency instead of stacking short-term spectral estimates in
time.
The sub-band temporal envelopes are obtained by analyzing speech using Fre-
quency Domain Linear Prediction (FDLP). The FDLP technique, as described earlier, fits
an all pole model to the Hilbert envelope of the signal (See Figure 2.1). These representa-
tions of the speech signal are able to capture fine temporal events associated with transient
Figure 2.1: Illustration of the all-pole modeling property of FDLP. (a) a portion of the speech signal, (b) its Hilbert envelope, (c) all-pole model obtained using FDLP.
events like stop bursts while at the same time summarizing the signal's gross temporal
evolution [76]. Short-term features are derived by integrating the auditory spectrogram in
short analysis windows. Long-term modulation frequency components are obtained after
the application of the cosine transform on compressed (static and adaptive compression)
sub-band temporal envelopes.
2.3.1 FDLP based time-frequency representation
The FDLP time-frequency representation is created through the following steps [72] -
(a) Change of processing domain - The FDLP spectrogram is a 2 dimensional time-frequency
representation of speech constructed by stacking sub-band temporal envelopes of a
speech signal across frequencies. Each of these temporal envelopes corresponds to a
sub-band frequency signal. To facilitate this, the speech signal is first projected into
the frequency domain via the DCT transform.
(b) Analysis of speech into sub-band frequency signals - Sub-band frequency signals are
obtained by windowing the DCT transform using a set of overlapping Gaussian windows
usually placed on a Bark or Mel scale.
(c) Computation of auto-correlation coefficients via a series of dual operations of time do-
main linear prediction (TDLP) - Among the many approaches, one way of applying
TDLP is using the auto-correlation of the time signal. The auto-correlation coeffi-
cients are in turn derived from the power spectrum since the power spectrum and
auto-correlation of the time signal form Fourier transform pairs. In the FDLP case, the
Hilbert envelope and the auto-correlation of the DCT signal form Fourier transform
pairs.
Since the sub-band DCT signals have already been derived in the previous step, their
auto-correlation coefficients can be computed. We start by computing the squared
magnitude of the inverse discrete Fourier transform (IDFT) of the DCT signal. The
application of a second Fourier transform produces the desired auto-correlation coefficients.
(d) Application of linear prediction - By solving a system of linear equations, the auto-
regressive model of each sub-band Hilbert envelope is finally derived from the auto-
correlation coefficients. Using the set of prediction coefficients {ai} the estimated
Hilbert envelope in each sub-band, HE_s, can be represented as

HE_s(n) = G / |∑_{i=0}^{p} a_i e^{−j2πin}|²    (2.4)
The parameter G is called the gain of the model. In [77], by normalizing the gain G, the
estimated sub-band envelopes have been shown to become robust to convolutive distortions
like reverberation and telephone channel artifacts. Additional robustness to additive
distortions through short-term subtraction of a noise estimate has also been shown in [78].
There are several parameters that control the temporal resolution of the estimated envelopes
as well as the type and extent of analysis windows for different applications. These have
been elaborated in [72].
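Steps (a)-(d) can be sketched for a single sub-band as follows; the linear window placement, window width, model order, and transform sizes here are illustrative assumptions rather than the settings elaborated in [72]:

```python
import numpy as np
from scipy.fft import dct, fft, ifft
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, band, n_bands=20, order=40):
    """Sketch of steps (a)-(d): an all-pole estimate of one sub-band
    Hilbert envelope via linear prediction on the DCT of the signal."""
    N = len(x)
    X = dct(x, type=2, norm='ortho')                  # (a) move to the DCT domain
    # (b) one Gaussian window on a linear scale picks out a sub-band
    centres = np.linspace(0, N, n_bands + 2)[1:-1]
    width = N / (2.0 * n_bands)
    w = np.exp(-0.5 * ((np.arange(N) - centres[band]) / width) ** 2)
    Xb = X * w
    # (c) Hilbert envelope and the autocorrelation of the DCT signal form
    # a Fourier transform pair: |IDFT|^2, then a second transform
    env = np.abs(ifft(Xb, n=2 * N)) ** 2
    r = np.real(fft(env))[:order + 1]
    # (d) solve the (Toeplitz) normal equations for the prediction coefficients
    a = solve_toeplitz(r[:order], -r[1:order + 1])
    a = np.concatenate(([1.0], a))
    G = r[0] + np.dot(a[1:], r[1:order + 1])          # model gain (prediction error)
    A = fft(a, n=2 * N)
    return G / np.abs(A[:N]) ** 2                     # HE_s(n) = G / |A(n)|^2
```

Stacking the envelopes returned for all bands gives the FDLP "auditory spectrogram" used in the next sections.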
Figure 2.2 shows the PLP and FDLP spectrograms for a portion of speech. Critically
spaced sub-band energies of speech are derived in short analysis windows in the PLP
case. The representation is hence smooth across frequencies in each analysis window.
Individual sub-bands of speech are directly modeled in the FDLP technique, resulting in
better temporal resolution - for example, the transient regions are well captured in this
representation. Two kinds of features are derived from the two-dimensional time-frequency
representation of speech formed by sub-band temporal envelopes derived using FDLP.
Figure 2.2: PLP (b) and FDLP (c) spectrograms for a portion of speech (a).
2.3.2 Short-term Features
In conventional feature extraction techniques like PLP, the power spectrum is
first integrated using Mel/Bark integrators in short analysis windows to create sub-band
trajectories of spectral energy. In the FDLP time-frequency representation, instead of the
sub-band trajectories of spectral energy, identical distributions of energy in the time domain
(sub-band Hilbert envelopes) are estimated. Short-term cepstral features can be derived
from these representations.
This is done by first integrating the envelopes in short term analysis Hamming
windows (of the order of 25 ms with a shift of 10 ms). The integrated sub-band energies
are then converted to cepstral coefficients by applying the log transform and taking the
DCT transform across the spectral bands in each of the frames. For most applications we
use 13 cepstral coefficients. First and second derivatives of these cepstral coefficients are
also appended to form a 39 dimensional feature vector [79,80], similar to conventional PLP
features. In [72, 81], a set of FDLP modeling parameters that improve the performance
of these short-term features for ASR in noisy environments has been identified. These
parameters and their effects are summarized in Table 2.1. In all these experiments, both
clean and noisy reverberant test data are evaluated on models trained with clean speech.
Gain Normalization - Gain normalization significantly improves feature robustness in reverberant environments [77,82]. Using rectangular analysis windows on a Mel scale for sub-band decomposition also contributes to robustness by reducing the mismatch between the clean and noisy reverberant data.

Number of Sub-bands - Increasing the spectral resolution improves robustness in reverberant conditions. The assumptions made for gain normalization are more valid with an increased number of sub-bands. In reverberant conditions, using up to 96 linear bands has been shown to be useful [77].

Model Order - The model order relates to the model's ability to capture sufficient detail of the envelopes. In clean conditions, a higher model order is useful. A lower model order is however better in reverberant conditions [72,81].

Envelope Expansion - Envelope expansion relates to how the all-pole model models the peaks and valleys of the Hilbert envelope. While envelope expansion is useful in noisy environments to capture dominant reliable peaks, no significant gains are observed in clean conditions [72,81].

Table 2.1: FDLP model parameters that improve robustness of short-term spectral features.
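Given stacked sub-band envelopes such as those produced by FDLP, the short-term cepstral pipeline can be sketched as follows; the 8 kHz envelope sampling rate and the gradient-based delta computation are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct

def short_term_cepstra(envelopes, env_rate=8000, win_ms=25, shift_ms=10, n_ceps=13):
    """Short-term features from stacked sub-band envelopes: integrate in
    ~25 ms Hamming windows every 10 ms, take logs, DCT across bands, and
    append delta and double-delta features."""
    win = int(env_rate * win_ms / 1000)
    shift = int(env_rate * shift_ms / 1000)
    ham = np.hamming(win)
    n_bands, n = envelopes.shape
    frames = []
    for start in range(0, n - win + 1, shift):
        seg = envelopes[:, start:start + win] * ham           # windowed energy
        e = np.log(seg.sum(axis=1) + 1e-10)                   # integrated log energy
        frames.append(dct(e, type=2, norm='ortho')[:n_ceps])  # cepstra across bands
    c = np.array(frames)
    d1 = np.gradient(c, axis=0)                               # simple deltas
    d2 = np.gradient(d1, axis=0)
    return np.hstack([c, d1, d2])                             # 39-dimensional vectors
```

With 13 cepstra plus deltas and double-deltas, each frame yields the 39-dimensional vector described in the text.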
2.3.3 Long-term Features
In techniques like TRAPS and MRASTA described earlier, modulation frequency
features are derived by analyzing temporal trajectories of spectral energy estimates in indi-
vidual sub-bands using long analysis windows. As described earlier, since FDLP estimates
the temporal envelope in sub-bands, modulation features can be derived from these
envelopes as well [79].
Before we derive the long-term features, we compress the sub-band temporal en-
velopes both statically and dynamically. The envelopes are compressed statically using the
logarithmic function. Dynamic compression of the envelopes is achieved using an adapta-
tion circuit which consists of five consecutive nonlinear adaptation loops proposed in [83].
These loops are designed so that sudden transitions in the sub-band envelope that are fast
compared to the time constants of the adaptation loops are amplified linearly at the out-
put, while the steady state regions of the input signal are compressed logarithmically. The
compressed temporal envelopes are then transformed using the Discrete Cosine Transform
(DCT) in long-term windows (200 ms long, with a shift of 10 ms). We use 14 modulation
frequency components from each cosine transform, yielding a modulation spectrum in the
0-35 Hz range with a resolution of 2.5 Hz [84]. The static and dynamic modulation frequency
components of each critical band are then stacked together before being used as features.
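The static-compression stream of this pipeline can be sketched for one critical band as below; the envelope is assumed to be sampled at a 100 Hz frame rate, so a 200 ms window is 20 samples and 14 DCT coefficients span roughly 0-35 Hz at 2.5 Hz resolution. The adaptive stream would replace the log with the five adaptation loops of [83], which are omitted here:

```python
import numpy as np
from scipy.fft import dct

def static_modulation_features(envelope, frame_rate=100, win_ms=200, n_coeffs=14):
    """Static modulation features for one critical band: log-compress the
    sub-band envelope and apply a DCT in 200 ms windows with a one-frame
    (10 ms) shift, keeping 14 coefficients (~0-35 Hz in 2.5 Hz steps)."""
    win = int(win_ms * frame_rate / 1000)            # 200 ms -> 20 samples at 100 Hz
    logenv = np.log(envelope + 1e-10)                # static (log) compression
    feats = [dct(logenv[s:s + win], type=2, norm='ortho')[:n_coeffs]
             for s in range(0, len(logenv) - win + 1)]
    return np.array(feats)
```

Stacking the static and adaptive streams across all critical bands gives the per-frame modulation vector used for MLP training.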
In [85], the proposed modulation features have been compared with other similar
modulation feature techniques - Modulation Spectrogram (MSG) [86], MRASTA [47] and
Fepstrum [87]. In these experiments, FDLP based modulation features are significantly better than
features derived from the other approaches. An additional set of FDLP modeling parameters
that improve the performance of these long-term features for ASR have also been identified
based on a set of phoneme recognition experiments. These parameters and their effects are
summarized in Table 2.2.
Modulation analysis window - The analysis window used to derive the modulation coefficients can be varied. The best recognition performance was obtained using a window of 200 ms, which also corresponds to the syllabic rate of speech.

Extent of modulations - The number of DCT coefficients can be varied to change the extent of the modulation spectrum. The best range was found to be using 14 DCT coefficients covering the 0-35 Hz range.

Type of modulation spectrum - As described earlier, two kinds of compression schemes are used for the modulation features. While the static log modulation features improve the phoneme recognition performance on fricatives and nasals, the dynamic adaptive-loop based features help in better recognition of plosives and affricates [85]. A combination of both these features provides significant improvements for all classes [88].

Table 2.2: FDLP model parameters that improve performance of long-term modulation features.
Figure 2.3: Schematic of the joint spectral envelope and modulation features for posterior-based ASR. FDLP sub-band envelopes yield auditory (spectral envelope) features and, after static and adaptive compression, modulation features; each stream feeds a posterior probability estimator, and the merged posteriors are Tandem-processed into features for ASR.
2.3.4 Data-driven Features
These acoustic features are converted into data-driven features by first using
them to train two separate 3-layer multilayer perceptrons to estimate posterior prob-
abilities of phoneme classes. Each frame of the short-term spectral envelope features is
used with a context of 9 frames during training. As described earlier, static and dynamic
modulation frequency features of each critical band are stacked together and used to train
a separate MLP network. The spectral envelope and modulation frequency features are
then combined at the phoneme posterior level using the Dempster-Shafer (DS) theory of
evidence [74]. These phoneme posteriors are first Gaussianized by using the log function
and then decorrelated using the Karhunen-Loeve Transform (KLT) [28]. This reduces the
dimensionality of the feature vectors by retaining only the feature components which con-
tribute most to the variance of the data. We use 25 dimensional features in our Tandem
representations similar to [75]. Figure 2.3 shows the schematic of the proposed feature
extraction technique.
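Putting the steps of this section together, the feature computation can be sketched as follows. The stream combination is shown as a normalized product - what Dempster's rule of combination reduces to when belief mass is placed only on singleton phoneme classes - which is a simplification of the DS scheme of [74]:

```python
import numpy as np

def data_driven_features(post_spec, post_mod, out_dim=25):
    """End-to-end sketch: merge spectral-envelope and modulation phoneme
    posteriors with a normalized product (a simplification of the DS
    combination in [74]), Gaussianize with a log, and decorrelate and
    reduce to 25 dimensions with a KL (PCA) transform."""
    joint = post_spec * post_mod                       # agreement between streams
    merged = joint / joint.sum(axis=1, keepdims=True)  # renormalize conflicts away
    logp = np.log(merged + 1e-10)                      # Gaussianize
    logp = logp - logp.mean(axis=0)
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ vt[:out_dim].T                       # Tandem features
```

The SVD here plays the role of the Karhunen-Loeve Transform: its top right-singular vectors span the highest-variance directions of the log posteriors.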
2.4 Speech Recognition Experiments and Results
We perform a set of experiments using Tandem representations of the proposed
spectral envelope and modulation frequency features along with other state-of-the-art fea-
tures for ASR. These include a phoneme recognition task, a small vocabulary continuous
digit recognition task and a large vocabulary continuous speech recognition (LVCSR) task.
For each of these experiments, we train three layered MLPs to estimate phoneme posterior
probabilities using these features. The proposed features are compared with three other
feature extraction techniques - PLP features [10] with a 9 frame context which are similar
to spectral envelope features derived using FDLP (FDLP-S), M-RASTA features [47] and
Modulation Spectro-Gram (MSG) features [86] with a 9 frame context, which are both
similar to modulation frequency features (FDLP-M). We combine FDLP-S features with
FDLP-M features using the DS theory of evidence to obtain a joint spectro-temporal fea-
ture set (FDLP-S+FDLP-M). Similarly, we derive two more feature sets by combining PLP
features with M-RASTA features (PLP+M-RASTA) and MSG features (PLP+MSG). 25
dimensional Tandem representations of these features are used for our experiments. We also
experiment with 39 dimensional PLP features without any Tandem processing (PLP-D).
2.4.1 Phoneme Recognition
Our first experiment is to validate the usefulness of Tandem representation of our
features for a phoneme recognition task using HMMs. We perform experiments on the
TIMIT database, excluding ‘sa’ dialect sentences. All speech files are sampled at 16 kHz.
The training data consists of 3000 utterances from 375 speakers, the cross-validation data
set consists of 696 utterances from 87 speakers, and the test data set consists of 1344
utterances from 168 speakers. The TIMIT database, which is hand-labeled using 61 labels, is mapped to
the standard set of 39 phonemes [89]. A three layered MLP is used to estimate the phoneme
posterior probabilities. The network, consisting of 1000 hidden neurons and 39 output
neurons (with softmax nonlinearity) representing the phoneme classes, is trained using the
standard back-propagation algorithm with the cross-entropy error criterion. The learning rate
and stopping criterion are controlled by the error in the frame-based phoneme classification
on the cross validation data.
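A minimal numpy sketch of such a posterior-estimating MLP follows; the hidden layer is smaller than the 1000 units used in the text, a tanh hidden nonlinearity is assumed, and the cross-validation-based learning-rate schedule is omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class PosteriorMLP:
    """Three-layer phoneme posterior estimator: input layer, one hidden
    layer (tanh), and a 39-way softmax output, trained by
    backpropagation on the cross-entropy criterion."""
    def __init__(self, n_in, n_hidden=100, n_out=39, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        self.h = np.tanh(x @ self.W1 + self.b1)       # hidden activations
        return softmax(self.h @ self.W2 + self.b2)    # phoneme posteriors

    def train_step(self, x, y, lr=0.1):
        p = self.forward(x)
        g2 = (p - y) / len(x)                         # softmax + cross-entropy grad
        gh = (g2 @ self.W2.T) * (1.0 - self.h ** 2)   # backprop through tanh
        self.W2 -= lr * self.h.T @ g2
        self.b2 -= lr * g2.sum(axis=0)
        self.W1 -= lr * x.T @ gh
        self.b1 -= lr * gh.sum(axis=0)
        return float(-np.mean(np.sum(y * np.log(p + 1e-10), axis=1)))
```

At test time, the rows of `forward` are the frame-level posterior probabilities that feed the Tandem post-processing.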
The Tandem representation of each feature set is used along with a decision tree
clustered triphone HMM with 3 states per triphone, trained using standard HTK maximum
likelihood training procedures. The emission probability density in each HMM state is mod-
eled with 11 diagonal covariance Gaussians. We use a simple word-loop grammar model
using the same standard set of 39 phonemes. Table 2.3 shows the results for phoneme recog-
nition accuracies across all individual phoneme classes for these techniques. The proposed
features (FDLP-S+FDLP-M) significantly improve the recognition accuracy compared to
the baseline PLP-D feature set.
Table 2.3: Phoneme Recognition Accuracies (%) for different feature extraction techniques on the TIMIT database
Features Phoneme Rec. Acc. (%)
PLP-D 68.3
PLP 70.1
FDLP-S 70.1
M-RASTA 66.8
MSG 65.1
FDLP-M 70.6
PLP+M-RASTA 71.2
PLP+MSG 71.4
FDLP-S+FDLP-M 72.5
2.4.2 Small Vocabulary Digit Recognition
In our second experiment, we use these features on a small vocabulary continuous
digit recognition task (OGI Digits database) to recognize eleven (0-9 and 'oh') digits with 28
pronunciation variants [47]. MLPs are trained using these features to estimate posterior
probabilities of 29 English phonemes using the whole Stories database plus the training
part of the Numbers95 database, with approximately 10% of the data held out for cross-validation.
Tandem representations of the features are used along with a phoneme-based HMM system with
22 context-independent three-state phoneme HMMs, each model distribution represented
by 32 Gaussian mixture components [47]. Table 2.4 shows the results for word recognition
accuracies. For this task, the proposed spectral envelope features (FDLP-S) and modulation
Table 2.4: Word Recognition Accuracies (%) on the OGI Digits database for different feature extraction techniques
Features Word Recog. Acc. (%)
PLP-D 95.9
PLP 96.2
FDLP-S 96.6
M-RASTA 96.3
MSG 96.0
FDLP-M 96.8
PLP+M-RASTA 97.1
PLP+MSG 97.0
FDLP-S+FDLP-M 97.1
frequency features (FDLP-M) improve word recognition accuracies compared to PLP and
M-RASTA features respectively.
2.4.3 Large Vocabulary Continuous Speech Recognition
In our third experiment, we use these features on an LVCSR task using the AMI
LVCSR system for meeting transcription [90]. The training data for this system uses indi-
vidual headset microphone (IHM) data from four meeting corpora: NIST (13 hours), ISL
(10 hours), ICSI (73 hours) and a preliminary part of the AMI corpus (16 hours). MLPs
are trained on the whole training set in order to obtain estimates of phoneme posteriors for
each of the feature sets. Acoustic models are phonetically state tied triphone models trained
using standard HTK maximum likelihood training procedures. The recognition experiments
Table 2.5: Word Recognition Accuracies (%) on RT05 Meeting data for different feature extraction techniques. TOT - total word recognition accuracy (%) for all test sets; AMI, CMU, ICSI, NIST, VT - word recognition accuracies (%) on individual test sets
Features TOT AMI CMU ICSI NIST VT
PLP-D 58.1 57.6 60.6 68.7 49.1 53.6
PLP 53.6 59.1 56.3 70.0 45.3 34.9
FDLP-S 57.5 58.4 58.5 66.9 48.4 54.5
M-RASTA 54.6 53.3 58.4 63.2 46.6 51.0
MSG 55.6 56.1 59.3 65.5 47.9 47.7
FDLP-M 60.5 62.3 66.3 60.6 54.6 58.3
PLP+M-RASTA 59.5 59.5 62.2 71.5 51.1 52.1
PLP+MSG 60.4 61.2 60.7 72.7 53.4 52.4
FDLP-S+FDLP-M 64.1 63.8 65.8 72.2 57.1 61.0
are conducted on the NIST RT05 [91] evaluation data. The AMI-Juicer large vocabulary
decoder is used for recognition with a pruned trigram language model [92]. This is used
along with reference speech segments provided by NIST for decoding and the pronuncia-
tion dictionary used in AMI NIST RT05s system. Table 2.5 shows the results for word
recognition accuracies for these techniques on the RT05 meeting corpus. The proposed
features (FDLP-S+FDLP-M) obtain significant relative improvements for the LVCSR task
compared to the other feature representations.
Table 2.6: Recognition Accuracies (%) of broad phonetic classes obtained from confusion matrix analysis
Class PLP FDLP-S M-RASTA FDLP-M PLP + FDLP-S +
M-RASTA FDLP-M
Vowel 85.3 84.9 82.4 85.7 86.1 87.3
Diphthong 78.2 79.1 74.2 76.8 78.4 79.8
Plosive 83.8 82.8 81.6 84.1 84.6 85.4
Affricative 73.5 74.4 68.6 75.6 72.9 78.0
Fricative 85.8 85.9 83.5 86.8 86.4 88.0
Semi Vowel 76.2 74.9 72.9 77.1 77.8 79.0
Nasal 84.2 82.8 80.4 84.9 85.8 86.6
Avg. 81.0 80.7 77.7 81.6 81.7 83.4
2.5 Conclusions
In this chapter, we proposed a framework for deriving data-driven features for
ASR. The framework uses four key elements -
• A linear prediction technique that models sub-band temporal envelopes of speech -
We outlined the steps involved in building these auto-regressive models. We also
showed that this technique based on FDLP can capture important details in speech
that conventional techniques do not capture.
• Two kinds of acoustic features - a short-term spectral feature and a long-term modula-
tion feature. Table 2.6 shows the results for phoneme recognition accuracies across all
individual phoneme classes for the proposed techniques using the TIMIT database.
The FDLP-S features provide results comparable to the PLP features. The modulation
features (FDLP-M) result in better broad class recognition rates for all the broad
phonetic classes compared to other modulation features.
• A combination of the feature streams at the phoneme posterior level - From Table
2.6, the joint spectral envelope and modulation features yield improved broad class
recognition in all cases compared to the baseline systems.
• Data-driven processing of these features with neural networks followed by Tandem
post-processing allows these features to be used for ASR systems. In all our experi-
ments, Tandem representations of the proposed features improve ASR accuracies over
other features.
In the following chapters we will use this data-driven framework in many other
scenarios. The key scenario is a low-resource setting where the amount of training data
is limited, unlike the ASR settings assumed in this chapter where the amount of training
data is not restricted. We devise techniques to improve the effectiveness of the proposed
front-ends in those settings.
Chapter 3
Data-driven Features for
Low-resource Scenarios
This chapter presents two novel techniques for building data-driven front-ends in
low-resource settings with very limited amounts of transcribed data for acoustic model train-
ing. Both techniques improve performance in low-resource settings by using data from
multiple languages, circumventing issues with the different phone sets used in each language.
3.1 Overview
In LVCSR systems, an important factor that impacts performance is the amount
of available transcribed training data. When LVCSR systems are built for new languages
or domains with only a few hours of transcribed data, the performance is lower. To improve
performance, unlabeled data from this new language or domain has been used to increase
the size of the training set [93]. This is done by first recognizing the unlabeled data and
incrementally adding reliable portions to the original training set. For these self-training
techniques to be effective, a low error rate recognizer is required to annotate the unlabeled
data. However, in several scenarios like ASR systems for new languages, recognizers built
using limited amounts of training data have very high error rates. Additional improvements
are hence not easily achieved via these techniques.
Another potential solution to this problem is to use transcribed data available from
other languages to build acoustic models which can be shared with the low-resource language
[94,95]. However training such systems requires all the multilingual data to be transcribed
using a common phone set across the different languages. This common phone set can
be derived either in a data driven fashion or using phonetic sets such as the International
Phonetic Alphabet (IPA) [96]. More recently, cross-lingual training with Subspace Gaussian
Mixture Models (SGMMs) [97,98] has also been proposed for this task.
An alternative approach to this problem moves the focus from using the shared
data to build acoustic models, to training data-driven front-ends. The key element in
this data-driven approach is a multi-layer perceptron (MLP) which is trained on large
amounts of task independent data. In [99, 100], a task independent approach has been
used to first train MLPs with large amounts of data. Features derived from these nets
are then shown to reduce the requirement of task specific data to train subsequent HMM
stages. In these experiments, although the task specific data comes from the same language
as the task independent data, the data sources are collected in different domains. More
recently this approach has been shown useful also in cross-domain and cross-lingual LVCSR
tasks [75,101]. In [101], Tandem features trained on English CTS data are shown to improve
performance when used in other domains (meeting data) within the language and even
in other languages (Mandarin and Arabic). Even though MLPs are trained on different
phone sets in different languages, Tandem features are able to capture common phonetic
distinctions among languages and improve performance of conventional acoustic features.
In this chapter, we investigate two approaches to building neural network based
data-driven front-ends in low-resource settings. We assume the availability of only 1 hour
of transcribed task specific data to train the acoustic models. To improve over the poor
performance of acoustic models using conventional features in these settings, we use data-
driven feature front-ends that integrate the following additional sources of information -
(a) Multilingual task independent data - Transcribed data from languages other than
the target language is first used to train initial neural network models. These task-
independent models are then adapted using limited amounts of task-specific data.
(b) Multiple feature representations - Significant gains were demonstrated in the previous
chapter using different feature representations. We show how these features can be
effective also in low-resource settings.
One of the key problems in training neural network systems using data from multiple
domains is differences in how the data sources are transcribed. Although there are
phoneme sets like the IPA which can be used to uniformly label data across languages, only
a few data sources are labeled using such sets. This chapter proposes techniques that can be
used to train neural networks in such scenarios.
In low-resource settings, the performance of other modules of the ASR pipeline -
for example, the language model or the pronunciation dictionary - is also affected. We however
focus our attention only on the feature extraction module and acoustic models.
3.2 Training Using a Combined Phone set
In this section we describe a training approach using two data sets - H and L. H
is a task independent data set with significantly more training data than the
low-resource data set L. The two sets are transcribed using different phoneme sets,
which we also denote H and L. We train a neural network system using the following steps -
(a) Train an initial network using data set H - We start by training a multilayer perceptron
(MLP) on the high resource task independent data set. After it has been trained, this
network estimates posterior probabilities of speech sounds in H, conditioned on the
input feature vectors.
(b) Find a mapping between phoneme sets H and L - If the two phoneme sets share the
same phonetic transcription scheme for example the IPA, it is relatively easy to find
such a mapping. However, this is often not the case.
In the proposed training scheme we investigate the use of a data-driven technique based
on an analysis of confusion matrices to find such a mapping. Confusion matrices have
been used in the past to measure the reliability of human speech recognition [102]. More
recently they have also been used to study the performance of ASR systems [103,104].
We start by forward passing the low-resource task specific data L through the MLP
trained on task independent data in step (a) to obtain phoneme posteriors. To
understand the relationship between phonemes, we treat the phoneme recognition system
as a discrete, memory-less, noisy communication channel with the phonemes in L as
source symbols to the system. Using the recognized phonemes belonging to H at the
output of the recognizer as received symbols, confusion matrices that characterize the
data sets are then built.
Each time a feature vector corresponding to phoneme li is passed through the trained
MLP, posterior probabilities corresponding to all phonemes in set H are obtained at
the output of the MLP. We treat each of these posterior probabilities as soft counts
used to populate a phoneme confusion matrix (CM). From a fully populated CM c, the
following counts can be derived. Entry (i, j) of the confusion matrix is the soft-count
aggregate c(i, j) of the total number of times task-specific phoneme li was recognized
as task-independent phoneme hj. The marginal count c(i) of each row is the total number
of times phoneme li occurred in the task-specific data. Similarly, the count c(j) of each
column is the total number of times phoneme hj of the task-independent data set was
recognized. C is the total number of counts in the confusion matrix.
Given such a CM, we would like to find the best map for every phoneme li among the
phones of H based on these counts. A useful information-theoretic quantity that can
be used is the empirical pointwise mutual information [105]; its use in conjunction
with confusion matrices has been shown in [104]. For an input alphabet
A and output alphabet B, using the count-based confusion matrix, the empirical pointwise
mutual information between two symbols ai from A and bj from B is expressed as

I_{AB}(a_i, b_j) = \log \frac{N_{ij} \cdot N}{N_i \cdot N_j},    (3.1)
where $N_{ij}$ is the number of times the joint event $(A = a_i, B = b_j)$ occurs, $N = \sum_{i,j} N_{ij}$
is the total count, and $N_i = \sum_j N_{ij}$ and $N_j = \sum_i N_{ij}$ are the marginal counts.
Using our soft count based confusion matrix between two phone sets H and L, we
similarly define the empirical point wise mutual information between phoneme pairs
(li, hj) as
I(l_i, h_j) = \log \frac{c(i, j) \cdot C}{c(i) \cdot c(j)},    (3.2)
using the quantities defined earlier. For a given task-specific phoneme li we compare
I(li, hj) for all hj ∈ H. Since the total count C and the monotonically increasing log
function are common to every term in this comparison, the simplified count-based measure

J(l_i, h_j) = \frac{c(i, j)}{c(i) \cdot c(j)}    (3.3)

is used instead.
Using this measure, for each label li, the more frequently a particular label hj occurs, the
higher the value of J(li, hj). We hence map each phoneme li in the task-specific phoneme
set to the phoneme hj in the task-independent set which has the highest J(li, hj).
If we can assume that a one-to-one mapping exists between the phoneme sets and that
the cardinality of H is greater than that of L, multiple assignments to the same phoneme
can be avoided. This is done by removing an assigned phoneme from the list of
available phonemes once it has been mapped.
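The soft-count confusion matrix and the mapping based on the measure J of Eq. (3.3) can be sketched in a few lines of NumPy. This is a minimal illustration; the function name `map_phonemes` and the greedy assignment order used to keep the mapping one-to-one are our own, not from the thesis.

```python
import numpy as np

def map_phonemes(posteriors, labels, n_low, n_high):
    """Greedy one-to-one phoneme mapping via a soft-count confusion matrix.

    posteriors : (T, n_high) array of per-frame posteriors from the MLP
                 trained on the high-resource phoneme set H.
    labels     : (T,) array of true low-resource phoneme indices (set L).
    Returns a dict {l_i: h_j} mapping each phoneme in L to one phoneme in H.
    """
    # Accumulate soft counts: c[i, j] += P(h_j | frame) over frames labelled l_i.
    c = np.zeros((n_low, n_high))
    np.add.at(c, labels, posteriors)

    c_i = c.sum(axis=1, keepdims=True)   # row marginals c(i)
    c_j = c.sum(axis=0, keepdims=True)   # column marginals c(j)
    # Count-based measure J(l_i, h_j) = c(i, j) / (c(i) * c(j)), Eq. (3.3).
    J = c / (c_i * c_j + 1e-12)

    mapping, used = {}, set()
    # Assign rows greedily, removing each h_j once mapped so the mapping
    # stays one-to-one (assumes the cardinality of H is at least that of L).
    for li in np.argsort(-J.max(axis=1)):        # most confident rows first
        for hj in np.argsort(-J[li]):
            if hj not in used:
                mapping[int(li)] = int(hj)
                used.add(hj)
                break
    return mapping
```

The greedy order (most confident phonemes first) is one simple way to avoid multiple assignments; the thesis only requires that assigned phonemes are removed from the candidate list.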
(c) Re-transcribe L using a new mapped phone set H - Using the mapping derived using
confusion matrices from above, the task specific data L can now be re-transcribed into
the phone set used to train the initial network.
(d) Adapt the network using data set L - The initial task independent neural network can
now be adapted using the task specific data since it has been mapped to the same phone
set. The neural network is adapted by retraining it using the new data after initializing
it with its original weights.
(e) Extract data-driven features - Posterior features are derived for ASR after Tandem
processing the phoneme posterior outputs of these networks.
3.3 Training Using Multiple Output Layers
In this section we propose a second training technique for training neural network
systems across different data sets without having to map all the data using a common
phoneme set. As before we describe the training approach using two data sets - H and
L. H is a task-independent data set with significantly more training data than the
low-resource data set L. Both H and L are transcribed using different phoneme sets H
and L, with cardinalities h and l respectively. The network is trained using an acoustic
representation with dimension d in the following steps -
(a) Train the MLP on the task independent set H - We start by training a 4 layer MLP
of size d×m1×m2×h on the high resource language with randomly initialized weights.
While the input and output nodes are linear, the hidden nodes are non-linear. While
the dimension of m1 is high, m2 is low dimensional and is known as the ‘bottleneck’
layer. We are motivated to introduce the bottleneck layer to allow the network to learn
a common low dimensional representation among the languages.
Figure 3.1: Schematic of the proposed training technique with multiple output layers. Layers of sizes d, m1 and m2 (the bottleneck layer) are common across data sets; an intermediate output layer of size h is specific to phoneme set H, and the final output layer of size l is specific to phoneme set L, with its weights initialized from a single layer perceptron.
(b) Initialize the network to train on task specific set L - To continue training on the low-
resource data set which has a different phoneme set size, we create a new 4 layer MLP
of size d×m1×m2×l. The first 3 layer weights of this new network are initialized using
weights from the MLP trained on the high resource data set. Instead of using random
weights between the last two layers, we initialize these weights from a separately trained
single layer perceptron. To train the single layer perceptron, non-linear representations
of the low-resource training data are derived by forward passing the data through the
first 3 layers of the MLP. The data is then used to train a single layer network of size
m2×l.
(c) Train the MLP on task specific set L - Once the 4 layer MLP of size d×m1×m2×l
has been initialized, we re-train the MLP on the task specific data. By sharing weights
across data sets the MLP is now able to train better on limited amounts of task specific
data. Figure 3.1 is a schematic of the proposed MLP system.
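The two-stage initialization of steps (a)-(b) can be sketched in NumPy with stand-in random weights for the layers trained on the high-resource set and illustrative layer sizes; names and dimensions here are hypothetical, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Illustrative sizes: d-dim input, m1 hidden, m2 bottleneck, l low-resource phonemes.
d, m1, m2, l = 39, 100, 25, 47
# Stand-ins for the first 3 layers' weights trained on the high-resource set H;
# the task-specific network reuses these and drops H's output layer.
W1 = rng.normal(0.0, 0.1, (d, m1))
W2 = rng.normal(0.0, 0.1, (m1, m2))

def bottleneck(X):
    """Forward pass through the first 3 layers (shared across data sets)."""
    return sigmoid(sigmoid(X @ W1) @ W2)

def init_output_layer(X_low, Y_low, epochs=10, lr=0.5):
    """Train the m2 x l single layer perceptron whose weights initialize the
    connection between the bottleneck and the new output layer."""
    Z = bottleneck(X_low)                       # non-linear representations of L
    W3 = rng.normal(0.0, 0.1, (m2, l))
    for _ in range(epochs):
        P = softmax(Z @ W3)
        W3 -= lr * Z.T @ (P - Y_low) / len(Z)   # cross-entropy gradient step
    return W3
```

After this initialization, the full 4-layer network (W1, W2, W3) would be retrained jointly on the task-specific data, as in step (c).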
(d) Derive data-driven features - The proposed 4 layer MLPs are trained to estimate phoneme
posterior probabilities using the standard back propagation algorithm with the cross en-
tropy error criterion. We derive two kinds of features for LVCSR tasks -
A. Tandem features - These features are derived from the posteriors estimated by
the MLP at the fourth layer. When networks are trained on multiple feature repre-
sentations, better posterior estimates can be derived by combining the outputs from
different systems using posterior probability combination rules. Phoneme posteriors are
then converted to features by Gaussianizing the posteriors using the log function and
decorrelating them by using the Karhunen-Loeve transform (KLT). A dimensionality
reduction is also performed by retaining only the feature components which contribute
most to the variance of the data.
B. Bottleneck features - Unlike Tandem features, bottleneck features are derived as lin-
ear outputs of the neurons from the bottleneck layer. These outputs are used directly
as features for LVCSR without applying any transforms. When bottleneck
features are derived from multiple feature representations, these features are appended
together and a dimensionality reduction is performed using KLT to retain only relevant
components.
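Both post-processing paths can be sketched with NumPy, assuming the KLT is computed as an eigendecomposition of the feature covariance (equivalently, PCA); all function names here are illustrative.

```python
import numpy as np

def klt(features, n_keep):
    """Karhunen-Loeve transform: decorrelate and keep the top n_keep
    components by variance (an eigendecomposition of the covariance)."""
    centred = features - features.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    basis = vecs[:, ::-1][:, :n_keep]         # top-variance directions
    return centred @ basis

def tandem_features(posteriors, n_keep=25):
    """Gaussianize posteriors with log, then decorrelate with the KLT."""
    return klt(np.log(posteriors + 1e-10), n_keep)

def bottleneck_features(streams, n_keep=25):
    """Append bottleneck outputs from several streams, then reduce with KLT."""
    return klt(np.concatenate(streams, axis=1), n_keep)
```

The small constant inside the log guards against zero posteriors; the thesis does not specify such a floor, so it is an implementation assumption.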
3.4 Speech Recognition Experiments and Results
3.4.1 Data sets
We use the English, German and Spanish parts of the Callhome corpora collected
by LDC for our experiments [106–108]. The conversational nature of speech along with high
out-of-vocabulary rates, use of foreign words and telephone channel distortions make the
task of speech recognition on this database challenging. The English database consists of
120 spontaneous telephone conversations between native English speakers. 80 conversations,
corresponding to about 15 hours of speech, form the complete training data. We use 1 hour
of randomly chosen speech covering all the speakers from the complete train set for our
experiments as an example of data from a low-resource language. The English MLPs and
subsequent HMM-GMM systems use this one hour of data. Two sets of 20 conversations,
roughly containing 1.8 hours of speech each, form the test and development sets. Similar to
the English database, the German and Spanish databases consist of 100 and 120 spontaneous
telephone conversations respectively between native speakers. 15 hours of German and 16
hours of Spanish are used as examples of task independent high resource languages for
training the MLPs. Each of these languages uses a different phoneme set - 47 phonemes for
English, 46 for German and 28 for Spanish.
3.4.2 Low-resource LVCSR System
We train a single pass HTK [109] based recognizer with 600 tied states and 4
mixtures per state on the 1 hour of data. We use fewer states and mixtures per state since
the amount of training data is low. The recognizer uses a 62K trigram language model with
an OOV rate of 0.4%, built using the SRILM tools. The language model is interpolated from
individual models created using the English Callhome corpus, the Switchboard corpus [110],
the Gigaword corpus [111] and some web data. The web data is obtained by crawling the
web for sentences containing high frequency bigrams and trigrams occurring in the training
text of the Callhome corpus [97]. The 90K PRONLEX dictionary [112] with 47 phones is
used as the pronunciation dictionary for the system. The test data is decoded using the
HTK decoder - HDecode, and scored with the NIST scoring scripts [91].
3.4.3 Building Data-driven Front-ends using a Common Phoneme Set
We use the steps described in Section 3.2 to build a data-driven front-end for
low-resource settings.
(a) Build a multilingual task independent MLP - We train cross-lingual MLP systems on
data from two other languages - German and Spanish - using a phone set that covers
phonemes from both the languages. We derive spectral envelope and modulation fre-
quency features from 15 hours of German and 16 hours of Spanish data. Even though
these languages have different phonemes from English, they share several common phonetic
attributes of speech. The cross-lingual MLPs capture these attributes from each of
the different feature streams.
(b) Construct the data-driven map for English - One hour of English data is forward passed
using the cross lingual MLP to obtain phoneme posteriors in terms of 52 cross-lingual
phones. The true labels for the English data contain 47 English phonemes. Using the
mapping technique described earlier we then determine to which phone in the German-
Spanish set each English phoneme can be mapped. This one-to-one mapping is created
by associating each English phoneme to the phone which gives the highest count based
score in the German-Spanish set.
(c) Build low-resource MLPs using task specific data - We train a set of low resource MLP
systems for each of the feature streams by adapting the cross-lingual system using 1
Figure 3.2: Deriving cross-lingual and multi-stream posterior features for low-resource LVCSR systems. Spectral envelope and modulation features are each passed through a cross-lingual MLP trained on German and Spanish data and adapted using 1 hour of English data; the resulting posterior streams are merged by a posterior probability merger and Tandem processed into features for ASR.
hour of English data after mapping it to the new phone set. After adapting the nets, we
observe that the systems are able to discriminate better between the phonetic classes of
the low-resource language. The primary challenge in adapting an MLP system using
additional data from a different language is to effectively map the phonetic units of the
new language to the phone set on which the system has already been trained. We
construct this map, as described earlier, between the existing and new language phone
sets. This adaptation allows the systems to capture information about the phonetic classes
of the target language along with common phonetic attributes shared with the other
languages. We adapt each MLP by retraining it using the new data after initializing it
with its original weights.
(d) Extract data-driven features - We use the two FDLP based acoustic streams proposed in
Table 3.1: Word Recognition Accuracies (%) using different Tandem features derived using only 1 hour of English data

39 dimensional PLP features used directly to train HMM-GMM system    28.8
Tandem features derived from PLP features with 9 frame context       28.7
Tandem features derived from FDLP-S features with 9 frame context    29.3
Tandem features derived from 476 dimensional FDLP-M features         27.2
the earlier chapter for our experiments. We derive short-term features (FDLP-S) from
sub-band temporal envelopes, modeled using FDLP by integrating the envelopes in short
term frames (of the order of 25 ms with a shift of 10 ms). These short term sub-band
energies are converted into 13 cepstral features along with their first and second deriva-
tives. Each frame of these spectral envelope features is used with a context of 9 frames
for training an MLP network. To extract modulation frequency features (FDLP-M), we
first compress the sub-band temporal envelopes statically using the logarithmic func-
tion and dynamically with an adaptation circuit consisting of five consecutive nonlinear
adaptation loops. The compressed temporal envelopes are then transformed using the
DCT in long term windows (200 ms long, with a shift of 10 ms). We use 14 modulation
frequency components from each cosine transform, yielding modulation spectrum in the
0-35 Hz range. The static and dynamic modulation frequency features of each sub-band
are stacked together and used to train an MLP network. For telephone channel speech,
Table 3.2: Word Recognition Accuracies (%) using Tandem features enhanced using cross-lingual posterior features

Cross-lingual system                                        FDLP-S  FDLP-M
System 1 - Trained on German data                            30.6    27.9
System 2 - Trained on German and Spanish data                30.9    29.4
System 3 - System 1 adapted with 1 hr of English             32.3    29.9
System 4 - System 2 further adapted with 1 hr of English     33.1    30.2
we use 17 bark spaced bands for extracting these features.
Posterior features from the two acoustic streams (FDLP-M and FDLP-S) are combined
at the posterior level. This allows us to obtain more accurate and robust estimates of
posteriors. Posterior features corresponding to 1 hour of data are Gaussianized, decor-
related and dimensionality reduced to 30 dimensional Tandem features. These features
are used to train the subsequent HMM-GMM system. Figure 3.2 shows the proposed
data-driven front-end.
Table 3.3: Word Recognition Accuracies (%) using multi-stream cross-lingual posterior features

Baseline PLP features                       28.8
Multi-stream cross-lingual Tandem features  36.5
Table 3.1 summarizes the baseline results for our experiments using different fea-
tures with only 1 hour of English data. In our second set of experiments we derive Tandem
features for the 1 hour of English data from the cross-lingual systems. It is clear that systems
built using low amounts of training data perform very poorly. Our subsequent experiments
aim to improve these performances using multi-stream and cross-lingual data. Table 3.2
shows the experiments using Tandem features derived from the spectral envelope and
modulation features with the cross-lingual systems. These experiments show improvements
as more cross-lingual data is used. Adapting the systems with the limited amount of
task-specific language data improves the performance of each system further. As described earlier,
posterior streams derived from two different feature representations are now combined to
derive better representations.
Table 3.3 shows the results of combining the posterior streams from the final cross-
lingual systems (System 4 of Table 3.2) of both feature streams using the Dempster-Shafer
(DS) theory of evidence [74]. The results show significant improvements after combining
posterior streams over the results from individual streams compared to the baseline PLP
system.
3.4.4 Data-driven Front-ends with MLPs Adapted using Multiple Output
Layers
We use an experimental setup similar to the one described in the previous section to demonstrate
the usefulness of the second technique. The primary advantage of this new technique
is that it does not require the multilingual data to be mapped using a common phone set
across various languages.
Training with 2 languages
In our first set of experiments we train a 4 layer MLP system on two languages -
Spanish and English as outlined in Sec. 3.3. We start by training two separate networks
on the task independent language using 16 hours of Spanish. Both these systems have a
first hidden layer of 1000 nodes, a bottleneck layer of 25 nodes and a final output layer
of 28 nodes corresponding to the size of the Spanish phoneme set. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames to train
the first network with architecture - 351×1000×25×28. A second system is trained on 476
dimensional modulation features derived using FDLP. These features correspond to 28 static
and dynamic modulation frequency components extracted from 17 bark spaced bands. This
system has an architecture of 476×1000×25×28. Both the systems are trained using the
standard back propagation algorithm with cross entropy error criteria. The learning rate
and stopping criterion are controlled by the error in the frame-based phoneme classification
on the cross validation data.
After the task independent networks have been trained, the task specific networks
Table 3.4: Word Recognition Accuracies (%) using two languages - Spanish and English
Baseline PLP features 28.8
Tandem features 34.9
Bottleneck features 35.4
to be trained on 1 hour of English are initialized in two stages as discussed in Sec. 3.3. In
the first stage, all weights except the weights between the bottleneck layer and the output
layer are initialized directly from the Spanish network. The second set of weights is
initialized from a single layer network trained on non-linear representations of the 1 hour
of English data derived by forward passing the English data through the Spanish network
till the bottleneck layer. This network has an architecture of 25×47 corresponding to the
dimensionality of the non-linear representations from the bottleneck layer of the Spanish
network and the size of the English phoneme set. These networks are trained on both PLP
and FDLPM features.
Once the networks have been initialized, PLP and FDLPM features derived from 1
hour of English are used to train the new task specific low-resource networks. The networks
trained on PLP and FDLPM features now have an architecture of 351×1000×25×47 and
476×1000×25×47 respectively. 47 dimensional phoneme posteriors from both the networks
are combined using the Dempster Shafer (DS) theory of evidence before deriving the 25
dimensional Tandem set. The 2 sets of 25 dimensional bottleneck features from each of the
networks are appended together before applying a dimensionality reduction to form a final
25 dimensional bottleneck feature vector. Both the Tandem and bottleneck features are
used to train the subsequent low-resource HMM-GMM system on 1 hour of training data.
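For posterior vectors that place belief mass only on singleton phoneme classes (Bayesian basic belief assignments), Dempster's rule of combination reduces to a normalized elementwise product. The following sketch shows only this simplified special case; the scheme in [74] is more general.

```python
import numpy as np

def ds_combine(p1, p2):
    """Combine two posterior streams frame by frame.

    p1, p2 : (T, K) arrays of per-frame phoneme posteriors from two streams.
    With mass only on singleton classes, Dempster's rule reduces to the
    normalized product below; the normalizer discards the conflicting mass.
    """
    prod = p1 * p2
    return prod / prod.sum(axis=1, keepdims=True)
```

The product sharpens posteriors where the two streams agree and suppresses classes supported by only one stream, which is why combining complementary streams improves the estimates.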
Figure 3.3: Tandem and bottleneck features for low-resource LVCSR systems. PLP and FDLP modulation features are fed to networks whose lower layers are common across languages and trained on English, with intermediate output layers trained on Spanish and German; the phoneme posteriors are merged and Tandem processed into 25D Tandem features for ASR, while the two 25D bottleneck outputs are appended and dimensionality reduced to form 25D bottleneck features for ASR.
Table 3.4 shows the results of using the proposed MLP based features. We train
the 1 hour HMM-GMM system on 39 dimensional PLP features (13 cepstral + Δ + ΔΔ
features) as our baseline system.
Training with 3 languages
We extend our training on 2 languages to train a multilingual MLP system on 3
languages - Spanish, German and English. The training procedure starts as outlined earlier
with 16 hours of Spanish. The networks are then initialized to train with the German data
in two stages - with weights from the Spanish system up to the bottleneck layer and with
weights from a single layer network trained on the German data. After the net has been
trained on the German data, we retrain it using the 1 hour of English data. Figure
3.3 is a schematic of the training and feature extraction procedure. Table 3.5 shows the
results of using the proposed MLP based features.
Table 3.5: Word Recognition Accuracies (%) using three languages - Spanish, German and English
Tandem features 35.8
Bottleneck features 37.2
The above results show the advantage of the proposed approach to training MLPs
on multilingual data. Unlike earlier approaches, we are able to train on multiple languages
without using a common phone set among the languages.
3.5 Conclusions
In this chapter we have demonstrated the usefulness of data-driven feature front-
ends over conventional features in low-resource settings. In these settings, data-driven fea-
tures are built using task independent data. However in most cases, this data is transcribed
using different phoneme sets. We have addressed this issue using two methods. Features
extracted using these techniques are used to train LVCSR systems in the low-resource
language. In our experiments, the proposed features provide a relative improvement of about
30% in a low-resource LVCSR setting with only one hour of training data. In the next
chapter we investigate more complex front-ends for these scenarios.
Chapter 4
Wide and Deep MLP Architectures
in Low-resource Settings
Significant improvements in ASR performance have been observed when additional
processing layers have been added to neural network front-ends. To train these additional
parameters, large amounts of training data are also required. This chapter explores how
these additional layers can be incorporated in low-resource settings with only few hours of
task specific training data.
4.1 Overview
In the previous chapter, improvements were observed in low-resource settings by
using multiple feature representations of the acoustic signal. To allow these parallel streams
of information to be trained, task independent data from different languages were used in
Figure 4.1: (a) Wide and (b) Deep neural network topologies for data-driven features. In the wide topology, speech is processed by acoustic feature extraction into parallel feature streams, each feeding an MLP, with interactions between the MLPs via intermediate outputs before feature post-processing yields the data-driven features. In the deep topology, a single feature stream is processed by serially stacked MLPs followed by feature post-processing.
conjunction with simple neural network topologies. In this chapter, in addition to these
parallel feature streams, we explore if more complex neural network architectures which are
currently being used in state-of-the-art ASR systems can also be trained in low-resource
settings.
In [113], these complex neural network architectures have been broadly classified
into two categories - wide networks and deep networks. In wide networks, several parallel
neural network modules that interact with each other are used. On the other hand, in deep
network topologies several interacting neural network layers are stacked one after the other
in a serial fashion. Figure 4.1 illustrates these topologies.
71
CHAPTER 4. WIDE AND DEEP MLP ARCHITECTURES IN LOW-RESOURCESETTINGS
Several wide network topologies have been used for processing long-term modulation
features, for example the architectures used in the TRAPS [66] or HATS [73] frameworks.
In a more recent approach [114], modulation features are first divided into two separate
streams as shown in Figure 4.1. The phoneme posterior outputs of a neural network trained
on high modulations (> 10Hz) are then combined with low modulation features to train
a second network. Tandem processed features from the second network are then used for
ASR.
Hierarchical networks where the outputs of one neural network processing stage are
further processed by a second neural network have been used in [100, 115]. More recently,
Deep Belief Networks with several layers (5-6 hidden layers) have been used in acoustic
modeling. In this approach individual layers of the deep network are usually pre-trained
before being assembled together and trained together [116–118].
In this chapter we discuss techniques to train both these classes of complex net-
works in low-resource settings. Faced with limited amounts of task specific data in these
scenarios we demonstrate the use of task independent data to build these networks.
4.2 Wide Network Topologies
4.2.1 Building the Data-driven Front-ends
We use two kinds of task independent data sources in building the proposed front-
end with wide network topologies -
(a) Up to 20 hours of data from the same language collected for a different task. Although
this data has a different genre, it has similar acoustic channel conditions as the low
Figure 4.2: Data-driven front-end built using data from the same language but from a different genre. Speech from the low-resource setting is converted to PLP acoustic features and passed through an MLP trained on N hours ({1/2/5/10/15/20} hours) of same-language, different-genre data; the resulting posterior features are used for LVCSR/spoken term detection.
resource data.
(b) 200 hours of data from a different language but with similar acoustic channel conditions.
We build two kinds of front-ends on varying amounts of these task independent training
data.
1. A monolingual front-end trained on varying amounts of data from the same language as
the low-resource task. As shown in Figure 4.2, we train different configurations of this
front-end on 1 to 20 hrs of data (N hours). The primary advantage of this kind of a
front-end is that even though the genre is different, the MLP learns useful information
that characterizes the acoustics of the language. This improves as the amount of training
data increases. For our current experiments we also choose task independent data from
similar acoustic conditions as the low resource setting. Features generated using this
front-end are hence enhanced with knowledge about the language and have unwanted
variabilities from the channel and speaker removed. We use conventional short-term
acoustic features to train these nets.
2. A cross-lingual front-end that uses large amounts of data from a different language.
In most low-resource settings, it is unlikely that sufficient transcribed data is available in the
Figure 4.3: A cross-lingual front-end built with data from the same language and with large amounts of additional data from a different language but with the same acoustic conditions. PLP and FDLPM acoustic features are passed through MLPs trained on M (200) hours of data from a different language; the outputs are merged by posterior combination and Tandem processing to produce acoustic features enhanced with multilingual posteriors, which then train an MLP on N hours ({1/2/5/10/15/20} hours) of same-language, different-genre data to yield posterior features for LVCSR.
same language to train a monolingual front-end. However, considerable resources in other
languages might be available. Figure 4.3 outlines the components of the cross-lingual
front-end that we train to include additional data from a different language. This front-
end has two parts. The first part is similar to the monolingual front-end described above
and consists of an MLP trained on various amounts of data from the same language but
a different genre (N hours). The second part includes a set of MLPs trained on large
amounts of data from a different language (M hours). Outputs from these MLPs are
used to enhance the input acoustic features for the former part.
Although languages have common attributes, data from these languages
is transcribed using different phone sets and needs to be combined before it can be used.
In the previous chapter, we use two different approaches to deal with this - a count based
data driven approach to find a common phone set and an MLP training scheme with
intermediate language specific layers. Both these approaches finally involve adaptation
of multilingual MLPs to the low-resource language. In this chapter, we do not adapt
any MLPs; instead, we keep the front-end fixed and use the multilingual MLPs to derive
posterior features.
When MLPs trained on a particular language are used to derive phoneme posteriors from
a different language, the language mismatch results in less sharp posteriors than from an
MLP trained on the same language. However an association can still be seen between
similar speech sounds from the different languages. We use this information to enhance
acoustic features of the task specific language. Phoneme posteriors from two compli-
mentary acoustic streams are combined to improve the quality of the posteriors before
they are converted to features using the Tandem technique. The multilingual posterior
features are finally appended to short-term acoustic features to train a second level of
MLPs on varying amounts of data from the same language as the low-resource task. This
procedure is hence similar to the approaches described earlier with modulation features
and the TRAPS/HATS configurations used to build wide neural network topologies (see
Figure 4.1).
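The enhancement pipeline described above can be sketched end to end under a few assumptions: the posterior matrices from the two Spanish streams are already computed, the posterior combination uses the normalized product (the Bayesian special case of the Dempster-Shafer rule), and the KLT is implemented as an eigendecomposition of the covariance of the log posteriors. The function name and dimensions are hypothetical.

```python
import numpy as np

def cross_lingual_enhanced(plp, post_plp, post_fdlpm, n_keep=20):
    """Build enhanced input features for the second-level MLP.

    plp        : (T, D) short-term acoustic features of the target language.
    post_plp   : (T, K) posteriors from the cross-lingual MLP on PLP features.
    post_fdlpm : (T, K) posteriors from the cross-lingual MLP on FDLPM features.
    """
    # 1. Combine the two posterior streams (normalized elementwise product).
    prod = post_plp * post_fdlpm
    post = prod / prod.sum(axis=1, keepdims=True)
    # 2. Tandem-process: log, KLT decorrelation, keep the top n_keep components.
    logp = np.log(post + 1e-10)
    centred = logp - logp.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    tandem = centred @ vecs[:, ::-1][:, :n_keep]
    # 3. Append the multilingual posterior features to the acoustic features.
    return np.concatenate([plp, tandem], axis=1)
```

With 39-dimensional PLP features and 20 retained posterior components, each frame of the second-level MLP input would be 59-dimensional.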
4.2.2 Experiments and Evaluations
We train two data-driven front-ends for the low-resource LVCSR task as described
in Sec. 4.2.1. We train the monolingual front-end on a separate task independent training set
of 20 hours from the Switchboard corpus. Although this training set has similar telephone
channel conditions as the low-resource task used for our experiments, it has a different
genre. The phone labels for this set are obtained by force aligning word transcripts to
previously trained HMM/GMM models using a set of 45 phones. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames. We
train separate MLPs on subsets of 1, 2, 5, 10, 15 and 20 hours to understand how the
amount of task independent data affects the performance of these features.
In addition to the Switchboard corpus, we train Spanish MLPs on 200 hours of tele-
phone speech from the LDC Spanish Switchboard and Callhome corpora for the cross-lingual
front-end. Phone labels for this database are obtained by force aligning word transcripts
using BBN’s Byblos recognition system with a set of 27 phones. We use two acoustic features
- short-term 39 dimensional PLP features with 9 frames of context and 476 dimensional
long-term modulation features (FDLPM). When networks are trained on multiple feature
representations, better posterior estimates can be derived by combining the outputs from
different systems using posterior probability combination rules. We use the Dempster-Shafer
rule of combination for our experiments. Posteriors from multiple streams are combined to
reduce the effects of language mismatch and improve their quality. Phoneme posteriors are
then converted to features by Gaussianizing the posteriors using the log function and decor-
relating them by using the Karhunen-Loeve transform (KLT). A dimensionality reduction
is also performed by retaining only the top 20 feature components which contribute most
to the variance of the data.
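The posterior-to-feature conversion described above can be sketched as follows. This is a minimal numpy illustration with hypothetical function names; for simplicity, the stream combination shown reduces the Dempster-Shafer rule to a normalized per-frame product, which holds only when all belief mass is placed on singleton classes.

```python
import numpy as np

def combine_streams(post_a, post_b, eps=1e-10):
    """Combine two phoneme posterior streams. With all belief mass on
    singleton classes, the Dempster-Shafer combination rule reduces to a
    normalized per-frame product of the two posterior vectors (an
    assumption of this sketch; the thesis uses the full rule)."""
    prod = post_a * post_b
    return prod / (prod.sum(axis=1, keepdims=True) + eps)

def tandem_features(posteriors, n_components=20, eps=1e-10):
    """Convert (frames x phones) posteriors to Tandem features: log to
    Gaussianize, then a Karhunen-Loeve transform (KLT) estimated from the
    data, retaining the components with the highest variance."""
    logp = np.log(posteriors + eps)
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # top-variance directions
    return centered @ eigvecs[:, order]

# Toy example: 100 frames of 27-dimensional posteriors from two streams.
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(27), size=100)
b = rng.dirichlet(np.ones(27), size=100)
feats = tandem_features(combine_streams(a, b), n_components=20)
```

The same pipeline applies unchanged to any number of retained components; 20 is used here to match the text.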
The English MLPs in the cross-lingual setting are trained on enhanced acoustic
features. These features are created by appending posterior features derived from the
Table 4.1: Word Recognition Accuracies (%) using different amounts of Callhome data to train the LVCSR system with conventional acoustic features

                    1hr     2hr     5hr     10hr    15hr
  PLP features      28.8    33.60   39.70   43.80   46.50
Spanish MLPs to the PLP features used in monolingual training. We similarly train
separate MLPs on subsets of 1, 2, 5, 10, 15 and 20 hours of task independent data.
In our first experiment we use 39 dimensional PLP features directly for the 1
hour Callhome LVCSR task. The acoustic models have a low word accuracy of 28.8%.
These features are then replaced by 25 dimensional posterior features using the monolingual
and cross-lingual front-ends, each trained on varying amounts of task independent data
from the Switchboard corpus. Figure 4.4 shows how the performance changes for both the
monolingual and cross-lingual systems. Using the data-driven front-ends, the word accuracy
improves from 28.8% to 30.1% and 37.1% with just 1 hour of task independent training
data using the monolingual and cross-lingual front-ends respectively. These improvements
continue to 37.2% and 41.5% with the same 1 hour of Callhome LVCSR training data as
the amount of task-independent data is increased for both the front-ends. We draw the
following conclusions from these experiments -
1. With very few hours of task specific training data, posterior features can provide
significant gains over conventional acoustic features. Table 4.1 shows the word accu-
racies when different amounts of Callhome data are used to train the LVCSR system.
By using the cross-lingual front-end, features from only 1 hour of data perform close
to 5-10 hours of the Callhome data with conventional features. This demonstrates
[Figure: word recognition accuracy (%) plotted against the amount of task-independent training data (1 to 20 hours) for three systems - 1 hour of acoustic features only, 1 hour of posterior features using the monolingual front-end, and 1 hour of posterior features using the cross-lingual front-end.]
Figure 4.4: LVCSR word recognition accuracies (%) with 1 hour of task specific training data using the proposed front-ends
the usefulness of our approach where we use task independent data in low-resource
settings to generate better features.
2. When data from a different language is used, additional gains of 4-7% absolute are
achieved over just using task independent data from the same language. It is interest-
ing to observe that the cross-lingual front-end starts from the best performance
achieved with the monolingual front-end and improves further from there.
4.3 Deep Network Topologies
A deep neural network (DNN) is a multilayer MLP with several more layers than
traditionally used networks. The layers of a DNN are often initialized using a pretraining
algorithm before the network is trained to completion using the error back-propagation
algorithm [119]. In this section we discuss the development of a DNN for low-resource
scenarios.
4.3.1 DNN Pretraining and Initialization
The purpose of the pretraining step is to initialize a DNN with a better set
of weights than a randomly selected set. Networks trained from these kinds of initial weights
are observed to be well regularized and converge to a better local optimum than randomly
initialized networks [120, 121]. As with traditional ANNs, deep neural networks have been
used both as acoustic models that directly model context-dependent states of HMMs [117]
and also to derive data-driven features [122, 123]. In both cases, the performances of these
networks are better than traditional shallow networks [117,118].
In the deep belief network (DBN) pretraining procedure [124], by treating layers
of the MLP as restricted Boltzmann machines (RBM), the parameters of the network are
trained in an unsupervised fashion with an approximate contrastive divergence algorithm
[124]. However, various approximations in the training algorithm introduce modeling errors
which in turn decrease the effectiveness of this approach when the number of layers is
increased [119].
A different algorithm that has been shown to be equally effective for pretraining
DNNs is called discriminative pretraining [119, 125]. This pretraining procedure starts by
training an MLP with 1 hidden layer. After this MLP has been trained discriminatively
with the error back-propagation algorithm, a new randomly initialized hidden layer and
softmax layer are introduced to replace the initial softmax layer of the first network. The
deeper network is then trained again discriminatively. This procedure is repeated until the
desired number of hidden layers is in place.
Although pretraining algorithms are effective in initializing DNNs, the key
constraint in low resource settings is often the insufficient amount of data to train these net-
works. We show that in these scenarios, task independent data can instead be used to
pretrain and initialize a DNN before it is finally adapted and used with limited amounts of
task specific data in a low resource setting.
We outline the training of a 5 layer DNN of size - d×m1×m2×m3×h. The training
algorithm is however general and can be extended to more hidden layers. The MLP has a
linear input layer with a size d corresponding to the dimension of the input feature vector,
followed by three non-linear layers m1, m2, m3 and a final linear layer with a size h corre-
sponding to the phone set of the task independent data on which the DNN is being trained.
While the dimensions of m1 and m2 are quite high, m3 is a low dimensional bottleneck layer. Similar
to the data-driven networks described in the previous chapter, both posterior and bottleneck
features can be derived from the DNN. We use the following steps to pretrain a DNN -
1. Initializing the network - We begin the training procedure by initializing a simple
network with 1 hidden layer - d×m1×h. Starting with randomly initialized weights
connecting all the layers of the network, we train this network with one pass of the
entire data, similar to [119].
2. Growing the network - The d×m1×h network is now grown by inserting a new layer
m2 and a set of random weights connecting m1 −m2 and m2 − h. The new network
is again trained with one pass of the entire data using the standard back-propagation
algorithm. The weights d −m1 are copied from the initialization step and are kept
fixed.
The desired network d×m1×m2×m3×h is finally created by adding the bottleneck
layer m3. While weights d − m1, m1 − m2 are copied from the previous step, new
random weights are used to connect m2 −m3 and m3 − h.
3. Final training - With all the layers of the network in place, the complete network is
trained to full convergence.
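The three pretraining steps above can be sketched structurally as follows. This is an illustration with hypothetical helper names; the back-propagation passes are elided, and only the copying and random initialization of weights is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def new_layer(n_in, n_out):
    """Small random weights (plus a bias row) for one layer."""
    return rng.normal(scale=0.1, size=(n_in + 1, n_out))

def grow(weights, hidden_size):
    """Insert a new random hidden layer and output layer in front of the
    old output layer, copying all earlier weights (which are kept fixed
    during the next training pass, as in the pretraining recipe)."""
    n_out = weights[-1].shape[1]
    n_prev = weights[-1].shape[0] - 1
    return weights[:-1] + [new_layer(n_prev, hidden_size),
                           new_layer(hidden_size, n_out)]

# Step 1: initialize d x m1 x h (d=351, m1=1000, h=52 as in the text).
d, m1, m2, m3, h = 351, 1000, 1000, 25, 52
net = [new_layer(d, m1), new_layer(m1, h)]
# ... one pass of the task independent data would be trained here ...

# Step 2: grow to d x m1 x m2 x h, then to d x m1 x m2 x m3 x h.
net = grow(net, m2)   # random m1-m2 and m2-h weights; d-m1 copied
# ... one more pass, copied weights kept fixed ...
net = grow(net, m3)   # random m2-m3 and m3-h weights
# Step 3: train the full network to convergence with back-propagation.

shapes = [w.shape for w in net]
```

Each weight matrix carries an extra bias row, so a d×m1 layer has shape (d+1, m1).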
We use task independent data in all these steps. The DNN is next adapted to the
low-resource setting using limited amounts of task specific data.
4.3.2 DNN Adaptation with task specific data
As described in the previous chapter, one limitation while adapting between do-
mains is the difference in phoneme sets. We have proposed a neural network based tech-
nique for this in the previous chapter that replaces the last language specific layer. We use
this technique in the following steps for adapting the DNN -
1. Initialize the network to train on task specific set - To continue training on the task
specific set which has a different phoneme set size l, we create a new 5 layer DNN of
size d×m1×m2×m3×l. The first 4 layer weights of this new network are initialized
using weights from the DNN trained on the task independent data set. Instead of
using random weights between the last two layers, we initialize these weights from a
separately trained single layer perceptron. To train the single layer perceptron, non-
linear representations of the low-resource training data are derived by forward passing
the data through the first 4 layers of the MLP. These representations are then used to train a
single layer network of size m3×l.
2. Train the MLP on the task specific set - Once the 5 layer MLP of size d×m1×m2×m3×l
has been initialized, we re-train the MLP on the low-resource language. By sharing
weights across languages the MLP is now able to train better on limited amounts of
task specific data.
Outputs from the bottleneck hidden layer of the final DNN are used as features
for ASR.
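The adaptation procedure can be sketched as follows, on a deliberately small network. The sigmoid non-linearity, the gradient-descent softmax trainer and all names are assumptions of this sketch, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def forward_hidden(weights, x):
    """Pass features through all hidden layers (sigmoid non-linearities),
    returning the bottleneck representation."""
    for w in weights:
        x = sigmoid(np.hstack([x, np.ones((len(x), 1))]) @ w)
    return x

def train_softmax(reps, labels, n_classes, lr=0.5, epochs=200):
    """Single layer (m3 x l) softmax network trained with gradient descent;
    its weights initialize the new last layer of the adapted DNN."""
    x = np.hstack([reps, np.ones((len(reps), 1))])
    w = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        z = x @ w
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * x.T @ (p - onehot) / len(x)
    return w

# Hypothetical pretrained hidden weights for a small d x m1 x m2 x m3 net.
d, m1, m2, m3, l = 10, 16, 16, 5, 4
hidden = [rng.normal(scale=0.3, size=(n_in + 1, n_out))
          for n_in, n_out in [(d, m1), (m1, m2), (m2, m3)]]

feats = rng.normal(size=(200, d))        # task specific training features
labels = rng.integers(0, l, size=200)    # task specific phone labels
reps = forward_hidden(hidden, feats)     # forward pass to the bottleneck
adapted = hidden + [train_softmax(reps, labels, l)]  # new m3 x l last layer
```

The adapted network (`hidden` layers copied, last layer newly initialized) would then be re-trained on the task specific data.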
4.3.3 Experiments and Evaluations
Similar to low-resource experiments in the previous chapter, we build a cross-
lingual DNN front-end using data from 3 different languages - Spanish, German and English.
Separate DNNs are trained on two different feature representations - PLP and FDLPM.
Bottleneck features from these front-ends are then combined and used for ASR experiments.
DNN pretraining with cross-lingual data
32 hours of cross-lingual data from Spanish (16 hours), German (15 hours) and
English (1 hour) are used to train a 5 layer DNN with 3 hidden layers. The cross-
lingual data uses a combined phoneme set size of 52 derived from a count-based mapping
scheme (Chapter 3, Section 3.4.3).
Separate DNNs are trained on two feature representations. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames to
train the first network with architecture - 351×1000×1000×25×52. A second system is
trained on modulation features derived using FDLP. These features (FDLPM) correspond
to 28 static and dynamic modulation frequency components extracted from 17 bark spaced
bands. A reduced feature set from only 9 alternate odd bands is used to train a system
Table 4.2: Word Recognition Accuracies (%) using different approaches

  System                                                             Word Rec. Acc. (%)
  Conventional acoustic features (PLP) using 1 hour of English
  training data                                                            28.8
  Data-driven features using data-driven map - 31 hours of
  multilingual data (German + Spanish) and 1 hour of English
  (Chapter 3)                                                              36.5
  Data-driven features using adaptable last layer for MLP training -
  31 hours of multilingual data (German + Spanish) and 1 hour of
  English (Chapter 3)                                                      37.2
  Data-driven features using deep neural network pre-trained using
  31 hours of multilingual data (German + Spanish) and 1 hour of
  English                                                                  41.0
with an architecture of 252×1000×1000×25×52. Both systems are trained with the
standard back-propagation algorithm and the cross-entropy error criterion. The learning rate
and stopping criterion are controlled by the error in the frame-based phoneme classification
on the cross validation data.
The DNN networks are built in stages as described in the previous section. For the
DNN trained using PLP features, a three layer MLP (351×1000×52), initialized with random
weights, is first trained using one pass of the cross-lingual data. In the next step, a four
layer MLP (351×1000×1000×52) is trained starting with copied weights from the 351×1000
section of the earlier network and random weights for the 1000×1000×52 section. A single
pass of the cross-lingual data is used to train this network keeping the copied weights fixed.
The final 5 layer network (351×1000×1000×25×52) is constructed with copied weights for
the 351×1000×1000 section and random weights for the 1000×25×52 part. The network
is then trained to full convergence. A similar 252×1000×1000×25×52 network is trained
using the FDLPM features.
DNN adaptation to low-resource settings
Each of the DNNs trained on task independent data is then adapted
to the low-resource setting with the task-specific 1 hour of English data. The networks are
adapted after the task dependent output layer of the cross-lingual DNN has been replaced.
This is done in two steps.
In the first step, all weights except the weights between the bottleneck layer and
the output layer are initialized directly from the cross-lingual network. The second set
of weights are initialized from a single layer network trained on non-linear representations
of the 1 hour of English data derived by forward passing the English data through the
cross-lingual network till the bottleneck layer. This network has an architecture of 25×47
corresponding to the dimensionality of the non-linear representations from the bottleneck
layer of the cross-lingual network and the size of the English phoneme set.
Once the networks have been initialized, PLP and FDLPM features derived from 1
hour of English are used to train the new low-resource networks. The networks trained on
PLP and FDLPM features now have architectures of 351×1000×1000×25×47 and 252×1000×1000×25×47
respectively. These networks are then used to derive bottleneck features. The 2 sets of 25
dimensional bottleneck features from each of the networks are appended together before
applying a dimensionality reduction to form a final 25 dimensional bottleneck feature vector
for ASR.
ASR Experiments using DNN features
We use the same ASR setup on Callhome English as described earlier. The baseline
HMM-GMM system is trained on 1 hour of data using 39 dimensional PLP features. Table
4.2 shows the recognition accuracies on this task using different approaches. The DNN
features significantly improve ASR accuracies when compared with equivalent systems built
using features from simpler 3 layer MLPs.
4.4 Semi-supervised training in Low-resource Settings
4.4.1 Overview
Semi-supervised training has been effectively used to train acoustic models in
several languages and conditions [93,126–128]. In this section we describe the development
of a semi-supervised approach to improve speech recognition performance in low-resource
settings.
We start by using the best acoustic models trained in the low-resource setting to
decode the available untranscribed data. The decoded data is then used along with the
limited amounts of transcribed training data to train acoustic models in a semi-supervised
fashion.
4.4.2 Selecting Reliable Data
In low-resource settings, since the recognition performance of the recognizers is low,
the quality of the decoded untranscribed data is also poor. It is hence useful to select
reliable portions of the untranscribed data for semi-supervised training. This selection is
done using confidence scores computed for each decoded utterance. Confidence scores are
computed using two techniques -
1. LVCSR based word confidences - LVCSR lattice outputs can be treated as directed
graphs with arcs representing hypothesized words. Each arc spans a duration of
time (ts, tf) during which the word is hypothesized to be present in the speech signal, and is
also associated with acoustic and language model scores. Using these scores, word
posteriors can be computed with the standard forward-backward algorithm [129].
For any given hypothesized word wi, at a given time frame t, several instances of
the word can be present on different lattice arcs simultaneously. A frame-based word
posterior of wi can be computed as
p(wi|t) = Σj p(wi^j | t)    (4.1)
where j corresponds to all the different instances of wi that are present at time frame
t [130]. In our proposed selection technique we use a word confidence measure Cmax
based on these frame level word posteriors [130], given as the maximum posterior of
the word in its hypothesized time interval (ts, tf) -
Cmax(wi, ts, tf) = max_{t ∈ (ts, tf)} p(wi|t)    (4.2)
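A minimal sketch of the two measures above, assuming word instances are given as (start frame, end frame, posterior) triples extracted from the lattice; the function names are ours:

```python
def frame_word_posterior(instances, t):
    """p(w|t): sum the posteriors of all lattice-arc instances of the word
    that span frame t (Eq. 4.1)."""
    return sum(p for (ts, tf, p) in instances if ts <= t <= tf)

def c_max(instances, ts, tf):
    """Maximum frame-level word posterior over the hypothesized interval
    (ts, tf), as in Eq. 4.2."""
    return max(frame_word_posterior(instances, t) for t in range(ts, tf + 1))

# Hypothetical lattice instances of one word: (start frame, end frame, posterior).
arcs = [(10, 20, 0.4), (12, 18, 0.3), (25, 30, 0.2)]
conf = c_max(arcs, 10, 20)   # peaks where the first two arcs overlap
```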
[Figure: binarized posteriogram for the constituent phonemes p1 . . . p4 of a hypothesized word W over the interval (ts, tf), showing the presence of each phoneme and the path along which occurrences are counted.]
Figure 4.5: MLP posteriogram based phoneme occurrence count
2. MLP posteriogram based phoneme occurrence confidence - Similar to the above men-
tioned confidence from the LVCSR classifier, we also derive confidence scores from
the phoneme posterior outputs of a neural network classifier. This confidence measure
uses the posteriogram representation of an utterance, derived by forward passing
acoustic features corresponding to the utterance through the trained MLP classifier.
For each hypothesized word wi in the LVCSR transcripts, we first look up its set of
constituent phonemes {p1, p2 . . . pn} from a pronunciation lexicon. Phoneme posteri-
ors corresponding to each phoneme are then selected from the utterance’s posteriogram
representation and binarized to indicate the phoneme’s presence or absence using a
set threshold. The average number of times the constituent phonemes appear in the
hypothesized time span (ts, tf) along a Viterbi search path is then used as a confidence
measure. The selected path is designed to produce the occurrence count while visit-
ing all constituent phonemes in sequence. The rationale behind this measure is that
if a word is hypothesized correctly, it is likely that all its constituent phonemes will
be present in the posteriogram, hence resulting in a high average occurrence count.
Figure 4.5 is a schematic of the proposed count based measure computed as -
Cocc(wi, ts, tf) = c / N    (4.3)
where c is the total number of phoneme occurrences and N is the total number
of frames in the hypothesized interval (ts, tf).
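A simplified sketch of this count, replacing the Viterbi search with a greedy left-to-right path (an assumption of this illustration, as is the threshold value):

```python
import numpy as np

def c_occ(posteriogram, phoneme_seq, ts, tf, threshold=0.5):
    """Count, along a greedy left-to-right path (a stand-in for the Viterbi
    search used in the thesis), how often the constituent phonemes of a word
    are active in its hypothesized interval (Eq. 4.3: C_occ = c/N)."""
    active = posteriogram >= threshold      # binarize the posteriogram
    k, c = 0, 0
    for t in range(ts, tf + 1):
        if k + 1 < len(phoneme_seq) and active[t, phoneme_seq[k + 1]]:
            k += 1                          # advance to the next phoneme
        if active[t, phoneme_seq[k]]:
            c += 1                          # current phoneme present here
    return c / (tf - ts + 1)

# Toy posteriogram: 10 frames x 4 phonemes; word W = phonemes [0, 2, 3].
post = np.zeros((10, 4))
post[0:3, 0] = 0.9     # phoneme 0 active in frames 0-2
post[3:7, 2] = 0.8     # phoneme 2 active in frames 3-6
post[7:10, 3] = 0.9    # phoneme 3 active in frames 7-9
score = c_occ(post, [0, 2, 3], 0, 9)   # every frame covered in sequence
```

A correctly hypothesized word, whose phonemes fire in order across the interval, gives a score near 1; a wrong hypothesis leaves gaps and lowers the count.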
The two confidence measures are finally combined using logistic regression. The
regressor is trained to predict a combined confidence using word confidence and phoneme
occurrence confidence scores on a held-out data set.
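This fusion step might look as follows; the plain gradient-descent logistic regressor and the synthetic held-out data are illustrative assumptions:

```python
import numpy as np

def train_fusion(word_conf, occ_conf, labels, lr=1.0, epochs=500):
    """Fit logistic-regression weights on held-out data to map the two
    confidence scores (plus a bias) to a combined confidence."""
    x = np.column_stack([word_conf, occ_conf, np.ones(len(labels))])
    w = np.zeros(3)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - labels) / len(labels)
    return w

def combined_confidence(w, word_conf, occ_conf):
    x = np.column_stack([word_conf, occ_conf, np.ones(len(word_conf))])
    return 1.0 / (1.0 + np.exp(-(x @ w)))

# Toy held-out set: correct words (label 1) tend to have higher confidences.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
wc = 0.3 * labels + 0.35 + 0.1 * rng.normal(size=400)
oc = 0.3 * labels + 0.35 + 0.1 * rng.normal(size=400)
w = train_fusion(wc, oc, labels)
scores = combined_confidence(w, wc, oc)
acc = np.mean((scores > 0.5) == (labels == 1))
```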
4.4.3 Experiments and Results
For our experiments in low-resource settings, we use a randomly selected 1 hour
of transcribed data from the complete 15 hour Callhome English data set. In our semi-
supervised training experiments we consider the remaining 14 hours as untranscribed data
and attempt to use it.
Data selection
Using the ASR system trained with features from the cross-lingual DNN front-
end, the 14 hour data set is first decoded. Word lattices produced during the decoding
process are used to generate word confidences for each hypothesized word as described
above. The cross-lingual DNN front-end is also used to produce phoneme posterior outputs
from which phoneme occurrence based confidences are derived. Combination weights for
these confidence scores are then estimated by training a logistic regressor on a 45 minute
held-out data set with the set’s ground truth transcriptions.
After every hypothesized word in the decoded output has been given a score using
the trained logistic regression module, each utterance is assigned an utterance-level score.
This utterance level score is the average of all word-level scores in the utterance.
Table 4.3: Word Recognition Accuracies (%) at different word confidence thresholds

  Threshold    Word Rec. Acc. (%)
  None               38.75
  -0.1               39.5
   0.0               41.7
  +0.1               42.7
  +0.2               44.0
  +0.3               45.5
  +0.4               45.4
  +0.5               44.6
To evaluate the usefulness of the proposed confidence selection scheme we generate
utterance level scores for the held out data. The word recognition accuracy (%) is then
evaluated on selected sentences at different threshold levels. Table 4.3 shows the word
recognition accuracies at different thresholds. As the threshold increases, only fewer reliable
sentences get selected.
Semi-supervised training of DNNs
The initial cross-lingual DNN training experiments described earlier were based on
only 1 hour of transcribed data. For semi-supervised training of DNNs we include additional
data with noisy transcripts. These utterances are selected from the untranscribed data based
on their utterance level confidences.
To avoid detrimental effects from noisy semi-supervised data during discriminative training
of neural networks, we make the following design choices -
(a) During back-propagation training, the semi-supervised data is de-weighted. This is
done by multiplying the cross-entropy error with a small multiplicative factor during
training,
Table 4.4: Word Recognition Accuracies (%) with semi-supervised pre-training

  System                                                  Word Rec. Acc. (%)
  Cross-lingual pre-training                                     41.0
  Cross-lingual pre-training with semi-supervised data           42.7
(b) The semi-supervised data is used only in the final pre-training stage after all the layers
of the DNN have been created,
(c) Only a limited amount of semi-supervised data is added.
For our experiments we select about 4.5 hours of data using utterances with a
score of 0.3 and greater. This data is then combined with the cross-lingual pre-training
data set of 15 hours of German, 16 hours of Spanish and 1 hour of English. During the
DNN training, we use a multiplicative factor of 0.3 to de-weight the cross-entropy error
from the semi-supervised data.
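Design choice (a) amounts to scaling the per-frame cross-entropy error, which can be sketched as follows (a minimal numpy illustration; the function name is ours):

```python
import numpy as np

def weighted_cross_entropy(log_probs, targets, weights):
    """Per-frame cross-entropy error, multiplied by a per-frame weight:
    1.0 for supervised frames and a small factor (0.3 in our experiments)
    for frames with noisy semi-supervised transcripts."""
    ce = -log_probs[np.arange(len(targets)), targets]
    return np.mean(weights * ce)

# Toy batch: 4 frames, 3 classes; the last two frames are semi-supervised.
probs = np.full((4, 3), 1.0 / 3.0)
targets = np.array([0, 1, 2, 0])
weights = np.array([1.0, 1.0, 0.3, 0.3])
loss = weighted_cross_entropy(np.log(probs), targets, weights)
```

The gradient of this loss is the usual cross-entropy gradient scaled by the same per-frame weights, so the semi-supervised frames contribute proportionally less to each update.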
The semi-supervised data is used in the final pre-training stage (Section 4.3.1,
step 3) to train both the DNN networks using PLP (351×1000×1000×25×52 network) and
FDLPM (252×1000×1000×25×52 network) features (Section 4.3.3). After pre-training,
both the networks are adapted with 1 hour of English as before. Bottleneck features from
both the networks are combined and used to train the low-resource ASR system with 1 hour
of data as before. Table 4.4 shows the performance of the system after using semi-supervised
data for pre-training.
Table 4.5: Word Recognition Accuracies (%) with semi-supervised acoustic model training

  Hours of semi-supervised data added    Word Rec. Acc. (%)
  0                                            42.7
  2                                            43.3
  4                                            44.0
  8                                            44.3
  14                                           44.8
Semi-supervised training of Acoustic Models
The DNN front-end trained with semi-supervised data is used to extract
data-driven features for semi-supervised training of the ASR system. Similar to the weighting
of semi-supervised data during the DNN training, we also use a simple corpus weighting while
training the ASR systems. This is done by adding the 1 hour of fully supervised data with
accurate transcripts twice.
To understand the effect of the semi-supervised data, we evaluate the recognition
performance using different amounts of semi-supervised data. From Table 4.5 we observe
that as we double the amount of semi-supervised data, there is roughly a 0.5% absolute
increase in performance.
4.5 Conclusions
In this chapter we have shown how complex neural network architectures can be
built in low resource settings. Using large amounts of multilingual data, we have shown
that task independent data can significantly improve performance in low resource settings.
Training on task independent data compensates for the limited amounts of tran-
scribed task specific data available in low resource settings. Both the deep and wide networks
trained in this fashion improve word recognition accuracies significantly.
Chapter 5

Applications of Data-driven Front-end Outputs
In the previous chapters, the outputs of data-driven front-ends were used as features for
automatic speech recognition. In this chapter, we describe how these front-ends can be used
in other applications - to derive features for speech activity detection, combination weights
in neural network based speaker recognition models, feature representations for zero resource
speech applications and event detectors for speech recognition.
5.1 Application 1 - Speech Activity Detection
5.1.1 Overview
Speech activity detection (SAD) is the first step in most speech processing ap-
plications like speech recognition, speech coding and speaker verification. This module is
an important component that helps subsequent processing blocks focus resources on the
speech parts of the signal. In each of these applications, several approaches have been
used to build reliable SAD modules. These techniques are usually variants of decision rules
based on features from the audio signal like signal energy [131], pitch [132], zero crossing
rate [133] or higher order statistics in the LPC residual domain [134]. Acoustic features
have also been used to train multi-layer perceptrons (MLPs) [135] and hidden Markov
models (HMMs) [136] to differentiate between speech and non-speech classes. All these
approaches in essence focus on characteristic attributes of speech which differentiate it from
other acoustic events that can appear in the signal.
5.1.2 Data-driven Features for SAD
Traditionally, acoustic features derived from the spectrum of speech have been
used to differentiate between speech and other acoustic events. In a different approach, we
train MLPs on large amounts of data to differentiate between two classes - speech versus
non-speech. Instead of using these models to directly produce S/NS decisions, the models
are used as data-driven front-ends to derive features for SAD.
The proposed front-end has a multi-stream architecture with several levels of MLPs
[137]. The motivation behind this multi-stream front-end is to use parallel streams of data
that carry complementary or redundant information while at the same time degrading
differently in noisy environments [138]. We form 3 feature streams by dividing the sub-
band trajectories derived using FDLP on a mel-scale with 45 filters equally into 3 groups.
Similar to deriving short-term spectral features, we then integrate the envelopes in short
term frames (of the order of 25 ms with a shift of 10 ms). We also use a context of
about 1 second by appending 50 frames from the right and left with each sub-band feature
vector to form TRAP like features [65]. The two other streams are formed by dividing the
14 modulation features into 2 groups - the first 5 DCT coefficients corresponding to slow
modulations and the remaining 5 coefficients corresponding to fast modulations.
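The roughly 1 second of temporal context used by the sub-band streams can be formed by simple frame stacking, sketched here with a hypothetical helper (edge-frame replication is an assumption of this sketch):

```python
import numpy as np

def stack_context(frames, left=50, right=50):
    """Append `left` frames from the past and `right` frames from the future
    to each frame (edges replicated), giving roughly 1 second of context at
    a 10 ms frame shift, as in the TRAP-like features described above."""
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)]
                      for i in range(left + right + 1)])

# Toy sub-band energy trajectory: 200 frames of a 15-band stream.
stream = np.random.default_rng(0).normal(size=(200, 15))
traps = stack_context(stream)    # each frame now carries 101 frames of context
```

Each output row concatenates 101 frames (50 left + current + 50 right) of the stream, so a 15-band stream yields 1515-dimensional context vectors.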
5.1.3 Experiments and Results
Speech activity detection is carried out on the proposed features in three main
steps. In the first step, the input frame-level features are projected to a lower-dimensional
space. The reduced features are then used to compute per-frame log likelihood scores with
respect to speech and non-speech classes, each class being represented separately by a GMM.
The frame level log likelihood scores are mapped to S/NS classification decisions to produce
final segmentation outputs in the last step. Figure 5.1 is a brief schematic of the proposed
approach and the processing pipeline for SAD. Each of these steps is described in detail
in [139].
The proposed features are evaluated in terms of speech activity detection (SAD)
accuracy on noisy radio communications audio provided by the Linguistic Data Consortium
(LDC) for the DARPA RATS program [140, 141]. The audio data for the DARPA RATS
program is collected under both controlled and uncontrolled field conditions over highly
degraded, weak and/or noisy communication channels making the SAD task very challeng-
ing [140]. Most of the RATS data released for SAD were obtained by retransmitting existing
audio collections - such as the DARPA EARS Levantine/English Fisher conversational tele-
phone speech (CTS) corpus - over eight radio channels, labeled A through H covering a
wide range of radio channel transmission effects.
Figure 5.1: Schematic of (a) features and (b) the processing pipeline for speech activity detection.
The development corpus used in our SAD experiments consists of 11 hours of
audio from the Arabic Levantine and English Fisher CTS corpus, retransmitted over the
eight channels. The training corpus consists of 73 hours of audio (62 hours from the Fisher
collection, and 11 from the new RATS collection). Although the entire data was also retrans-
mitted over eight channels, since some data from channel F was unusable, all data from
that channel was excluded from both training and development.
The MLPs used for extracting data-driven features are trained on close to 660
hours of audio from the RATS development corpus using LDC provided S/NS annotations.
Outputs from these 5 sub-systems are then fused by a merger MLP at the second level to
derive the final S/NS posterior features. These features are derived from the pre-softmax
                  Dimensionality                       Equal Error Rate (%) on different channels
  Features     #Dims./Frame  #Frame Context  Total #Dims.     A     B     C     D     E     G     H    All
  PLP               15             31            465         3.55  3.00  5.03  2.51  2.75  3.48  2.34  3.34
  FDLPS             15             31            465         3.42  3.10  4.46  2.42  2.78  3.40  2.29  3.20
  FDLPM            340              1            340         3.88  3.80  4.12  3.26  3.52  3.60  2.51  4.15
  MLP                2             31             62         3.05  2.96  3.76  2.20  2.71  3.35  2.10  3.17
  PLP+MLP           17             31            527         3.10  2.84  3.20  2.25  2.63  2.96  2.07  2.84
  FDLPS+MLP         17             31            527         3.15  2.94  3.04  2.17  2.67  2.89  1.93  2.82
  FDLPM+MLP        402              1            402         3.02  2.90  3.73  2.26  2.84  2.42  1.89  2.88

Table 5.1: Equal Error Rate (%) on different channels using different acoustic features and combinations
outputs of the final layer.
SAD models are trained on both acoustic and data-driven features, as well as on
feature combinations. In each case, HLDA was used to reduce dimensionality prior to
GMM training. Table 5.1 shows the dimensionality of the original space, prior to the
application of HLDA, for each feature type used. A context of 31 frames was used for
short-term features. In all cases, the output dimensionality of HLDA was set to 45. A single
Gaussian was used to represent each of the two classes (speech, non-speech) during HLDA
estimation. After the dimensionality reduction, a 512-component GMM is trained for
S/NS classification. The number of contextual frames, HLDA dimensionality, and number
of GMM components were optimized using separate experiments [142]. The derived SAD
models were evaluated on the development set in terms of equal error rate (EER%), which is
the operating point at which the falsely rejected speech rate (probability of missed speech)
is equal to the falsely accepted non-speech rate (probability of false alarm). The results
are shown in Table 5.1 for conventional features (PLP), short-term features derived using
FDLP (FDLPS), long-term modulation features (FDLPM) and data-driven features (MLP).
Although each of the feature sets has varying performance on the individual noisy
channels, they are comparable to each other in terms of overall SAD performance. In a
second set of experiments, acoustic and data-driven features, which capture various kinds
of information about speech, are combined. We observe close to 15% relative improvement
when the acoustic features are used in conjunction with the data-driven features. We draw
the following conclusions from these experiments -
1. MLP based models, which are traditionally used to directly produce S/NS decisions,
can be used as data-driven front-ends to produce complementary data-driven features.
2. Acoustic and data-driven features capture complementary attributes. When combined
these lead to further performance improvements.
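The EER operating point described above can be located with a simple threshold sweep over frame scores. The following is a minimal numpy sketch (illustrative variable names and toy scores; the actual systems score frames with GMM log-likelihoods):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Locate the operating point where the missed-speech rate equals the
    false-alarm rate by sweeping a decision threshold over the scores.

    scores: higher = more speech-like; labels: 1 = speech, 0 = non-speech.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        speech = scores >= t                     # frames accepted as speech
        miss = np.mean(~speech[labels == 1])     # falsely rejected speech
        fa = np.mean(speech[labels == 0])        # falsely accepted non-speech
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return eer
```

In practice the EER is read off a detection error trade-off curve; the exhaustive sweep above is adequate for a development-set sanity check.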
5.2 Application 2 - Neural Network based Speaker Verifica-
tion
5.2.1 Overview
The goal of speaker verification is to verify the truth of a speaker's claimed identity.
The majority of current speaker verification systems model the overall acoustic feature vector
space using a GMM based Universal Background Model (UBM), trained on large amounts of data
from multiple speakers [143,144]. In this section we discuss the development of a mixture of
AANNs for speaker verification. The mixture consists of several AANNs tied using posterior
probabilities of various broad phoneme classes derived from MLPs.
5.2.2 AANN Models for Speaker Verification
Modeling Speaker Data
AANNs are feed-forward neural networks with several layers trained to reconstruct
the input at its output through a hidden compression layer. This is typically done by
modifying the parameters of the network using the back-propagation algorithm such that
the average squared error between the input and output is minimized over the entire training
data. More formally, for an input vector x, the network produces an output x̂(x, W) which
depends both on the input x and the parameters W of the network (the set of weights and
biases). For simplicity, we denote the network output as x̂(W). The training process then
adjusts the parameters such that -

\min_{W} \, E\left[ \| x - \hat{x}(W) \|^2 \right]. \qquad (5.1)
This method of training ensures that for a well trained network, the average reconstruction
error of input vectors that are drawn from the distribution of the training data will be small
compared to vectors drawn from a different distribution [145]. The likelihood of the data x
given the model can then be linked to the error as -

p(x; W) \propto \exp\left( - E\left[ \| x - \hat{x}(W) \|^2 \right] \right). \qquad (5.2)
In [146, 147], these properties have been used to model acoustic data for speaker
verification. A single AANN is first trained as a universal background model (UBM) on
acoustic features from large amounts of data containing multiple speakers. Since data from
many speakers are used, the AANN model learns a speaker independent distribution of the
acoustic vectors. For each speaker in the enrollment set, the UBM-AANN is then adapted to
learn speaker dependent distributions by retraining the entire network using each speaker’s
enrollment data. During the test phase, the average reconstruction error of the test data is
computed using both the UBM-AANN and the claimed speaker AANN model. In an ideal
case, if the claim is true, the average reconstruction error under the speaker specific model
will be smaller than under the UBM-AANN and vice versa if false.
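This decision rule can be sketched with a toy network. The following is an illustrative numpy sketch of the forward pass and the reconstruction-error comparison (hypothetical layer shapes and parameter layout, not the thesis implementation):

```python
import numpy as np

def aann_reconstruct(X, weights, biases):
    """Forward pass of an autoassociative network: tanh hidden layers and a
    linear output layer, returning the reconstruction x_hat(x; W)."""
    h = X
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:     # all but the output layer are nonlinear
            h = np.tanh(h)
    return h

def avg_reconstruction_error(X, params):
    """E[||x - x_hat(W)||^2] over the rows of X, as in Eqn (5.1)."""
    X_hat = aann_reconstruct(X, *params)
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

def accept_claim(X_test, ubm_params, speaker_params):
    """Accept the claim if the claimed speaker's AANN reconstructs the test
    frames better (lower average error) than the UBM-AANN."""
    return avg_reconstruction_error(X_test, speaker_params) < \
           avg_reconstruction_error(X_test, ubm_params)
```

The sketch omits training entirely; in the systems described here the parameters come from back-propagation on the enrollment and background data.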
This approach is similar to conventional UBM-GMM techniques [144] except for
the maximum a posteriori probability (MAP) adaptation to obtain speaker specific models.
In the MAP adaptation of GMMs, only those components that are well represented in the
adaptation data get significantly modified. However in the case of neural networks, there
is no similar mechanism by which only parts of the model can be adapted. This limits the
ability of a single AANN to capture the distribution of acoustic vectors especially when
the space of speakers is large. To address this issue, we introduce a mixture of AANNs as
described in the following section.
Mixture of AANNs
A mixture of AANNs is composed of several independent AANNs each modeling
a separate part of the acoustic feature space [148]. In our experiments we partition the
acoustic space into 5 classes corresponding to the broad phoneme classes of speech - vowels,
fricatives, nasals, stops and silence. The assignment of a feature vector to one of these classes
is done using posterior probabilities of these classes estimated using a separate multilayer
perceptron (MLP). This additional information is incorporated into the objective function
in Eqn. (5.1) as -

\sum_{j=1}^{c} \min_{W_j} \, E\left[ P(C_j \mid x) \, \| x - \hat{x}(W_j) \|^2 \right] \qquad (5.3)

where c denotes the number of mixture components or number of broad phoneme classes,
and the set W_j consists of the parameters of the jth AANN of the mixture. P(C_j | x) is the
posterior probability of the jth broad phonetic class C_j given x, estimated using the MLP. During
back propagation training, since the error is weighted with class posterior probabilities, each
mixture component is trained only on frames corresponding to a particular broad phonetic
class.
Similar to the single AANN case, a UBM-AANN is first trained on large amounts of
data. For each speaker in the enrollment set, the UBM is then adapted using speaker specific
enrollment data. Broad class phoneme posteriors are used in both these cases to guide
the training of each class specific mixture component on the appropriate set of frames. This
approach helps to alleviate the limitation of a single AANN model described earlier since
only parts of the UBM-AANN are now adapted based on the speaker data.
Using the mixture of AANNs, the average reconstruction error of data D =
{x_1, . . . , x_n} is given by

e(D; W_1, \ldots, W_c) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} P(C_j \mid x_i) \, \| x_i - \hat{x}_i(W_j) \|^2. \qquad (5.4)
During the test phase, likelihood scores based on reconstruction errors from both the UBM-
AANN and the claimed speaker models are used to make a decision. In our experiments,
since the amount of adaptation data is usually limited, we adapt only the last layer weights
of each AANN component. We also restrict the number of nodes of the third hidden layer
to the size of the output layer.
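Equation (5.4) and the trial scoring described above can be sketched as follows (a hedged numpy sketch; `forward` stands for any per-component reconstruction function such as an AANN forward pass, and all names are illustrative):

```python
import numpy as np

def mixture_error(X, P, models, forward):
    """Average posterior-weighted reconstruction error e(D; W_1..W_c) of
    Eqn (5.4). X: (n, d) frames; P: (n, c) broad-class posteriors from the
    MLP (rows sum to 1); models: c parameter sets;
    forward(X, params) -> X_hat."""
    n, c = P.shape
    total = 0.0
    for j in range(c):
        X_hat = forward(X, models[j])
        err = np.sum((X - X_hat) ** 2, axis=1)  # per-frame squared error
        total += np.sum(P[:, j] * err)          # weighted by P(C_j | x_i)
    return total / n

def trial_score(X, P, ubm_models, spk_models, forward):
    """Score a verification trial as the UBM error minus the claimed-speaker
    error; larger scores favor accepting the claim."""
    return mixture_error(X, P, ubm_models, forward) - \
           mixture_error(X, P, spk_models, forward)
```

Because the per-frame errors are weighted by P(C_j | x_i), each component only contributes on the frames assigned to its broad phonetic class, mirroring the training objective in Eqn (5.3).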
5.2.3 Experiments and Results
As described earlier, we train a mixture of AANNs with five components on
sufficiently large amounts of data to serve as the UBM. Gender specific UBMs are trained on a
telephone development data set consisting of audio from the NIST 2004 speaker recognition
database, the Switchboard II Phase III corpora and the NIST 2006 speaker recognition
database. We use only 400 male and 400 female utterances each corresponding to about 17
hours of speech. The acoustic features used in our experiments are 39 dimensional FDLP
features [149].
Posteriors to train the UBM are derived from an MLP trained on 300 hours of
conversational telephone speech (CTS) [88]. The 45 phoneme posteriors are combined
appropriately to obtain 5 broad phonetic class posteriors corresponding to vowels, fricatives,
plosives, nasals and silence.
Each AANN component of the UBM has a linear input and a linear output layer
along with three nonlinear (tanh nonlinearity) hidden layers. Both input and output layers
have 39 nodes corresponding to the dimensionality of the input FDLP features. We use 160
nodes in the first hidden layer, 20 nodes in the compression layer and 39 nodes in the third
hidden layer. Speaker specific models are obtained by adapting (retraining) only the last
layer weights (39×39 parameters) of each AANN component.
Once the UBMs and speaker models have been trained, a score for a trial is
computed as the difference between the average reconstruction error values (given by (5.4))
of the test utterance under the UBM and the claimed speaker model.
As a baseline, we train a gender independent UBM-GMM system with 1024 com-
ponents on FDLP features. The UBM-GMM is trained using the entire development data
described in the section above. The speaker specific GMM models are obtained by MAP adapt-
ing the UBM-GMM with a relevance factor of 16. As a second baseline, we train gender specific
AANN systems. These systems use 160 nodes in both the second and fourth hidden layers
and 20 nodes in the compression layer. The UBMs are trained using the same development
data that is used for training the mixture of AANNs.
System                      C6            C7            C8
GMM (1024 comp.)            84.4 (17.3)   60.8 (11.7)   69.1 (14.3)
Baseline AANN               88.3 (28.7)   75.9 (20.6)   77.0 (25.7)
Mixture of AANNs            86.7 (22.5)   60.4 (11.8)   57.3 (12.8)
Mixture of AANNs + GMM      81.3 (16.4)   51.9 (10.9)   54.4 (11.4)

Table 5.2: Performance in terms of Min DCF (×10³) and EER (%) in parentheses on different NIST-08 conditions
The performance is evaluated on a subset of the NIST-08 telephone core conditions
(C6, C7 and C8) consisting of 3851 trials from 188 speakers. Table 5.2 lists both minimum
detection cost function (DCF) and equal error rate (EER) of various systems. The proposed
mixture of AANNs system performs much better than the baseline AANN system and
yields comparable results to the conventional GMM system. The score combination (equal
weighting) of GMM baseline and the proposed system further improves the performance.
However, state-of-the-art GMM systems use factor analysis to obtain much better gains.
In [150,151], the AANN based approach has been further developed to use factor analysis.
5.3 Application 3 - Zero Resource Settings
In zero resource settings, tasks such as spoken term discovery attempt to auto-
matically identify repeated words and phrases in speech without any transcriptions [152].
In recent approaches [152–154] to address this task, a dynamic time warping (DTW) search
of the speech corpus is performed against itself to discover repeated patterns. With no
transcripts to guide the process, results of the search largely depend on the quality of the
underlying speech representation being used. In [155], multiple information retrieval met-
rics have been proposed to evaluate the quality of different speech representations on this
task. These metrics operate by using a large collection of pre-segmented word examples to
first compute the DTW distance between all example pairs and then quantify how well the
DTW distances can differentiate between same-word and different-word example pairs. Better scores
with these metrics are indicative of good speaker independence and high word discrim-
inability of feature representations. Since these are also desirable properties of features for
other downstream recognition applications, these metrics are also predictive of how different
features will perform in those applications. We evaluate posterior features from both the
multilingual and cross-lingual front-ends (Chapter 4, Sec. 4.2) for spoken term discovery
with information retrieval metrics used in [155].
The evaluation metric uses 11K words from the Switchboard corpus resulting in
60.7M word pairs, of which 96K are same-word pairs [155]. The similarity between a word pair
(wi, wj) is measured using the minimum DTW alignment cost DTW(wi, wj) between wi and wj.

[Figure 5.2: Average precision for different configurations of the wide topology front-ends: average precision (0 to 0.7) plotted against the amount of task-independent training data (1 to 20 hours), for acoustic features only, posterior features using the monolingual front-end, and posterior features using the cross-lingual front-end.]

For a particular threshold τ, wi and wj are predicted to be the same if DTW(wi, wj) ≤ τ.
Computing DTW distances also requires a distance metric to be defined between the feature
vector frames that make up words. For this evaluation, cosine distance is used for comparing
frames of raw acoustic features corresponding to words. A more meaningful symmetric KL-
divergence is used for assessing similarities between the phoneme posterior vectors generated
by the proposed front-ends for words.
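The DTW comparison and the two frame-level distances can be sketched as follows (an illustrative sketch; practical systems use optimized DTW implementations, often with band constraints):

```python
import numpy as np

def cosine_dist(x, y, eps=1e-12):
    """Cosine distance between two raw acoustic feature frames."""
    return 1.0 - float(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two phoneme posterior vectors."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def dtw_cost(A, B, frame_dist):
    """Minimum-cost DTW alignment between feature sequences A (m, d) and
    B (n, d), length-normalized so words of different durations compare."""
    m, n = len(A), len(B)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = frame_dist(A[i - 1], B[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[m, n]) / (m + n)
```

The same `dtw_cost` routine serves both representations; only the plugged-in frame distance changes between raw acoustic features and posterior vectors.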
The entire set of word pairs is now used in the context of an information retrieval
task where the goal is to retrieve same word pairs from different word impostors for each
front-end configuration. Sweeping τ allows us to create a standard precision-recall curve
for each setting. The precision-recall curves can then be characterized by several criteria.
We use the average precision metric defined as the area under the precision-recall curve for
our experiments, which summarizes the system performance across all operating points.
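The sweep over τ and the resulting average precision can be summarized by ranking pairs by DTW cost (a minimal sketch with illustrative names):

```python
import numpy as np

def average_precision(dtw_costs, is_same_word):
    """Average precision for retrieving same-word pairs: rank all pairs by
    ascending DTW cost (smaller cost => predicted same word) and average
    the precision at the rank of each true same-word pair."""
    order = np.argsort(dtw_costs)
    labels = np.asarray(is_same_word, dtype=float)[order]
    hits = np.cumsum(labels)                       # true pairs retrieved so far
    precision = hits / np.arange(1, len(labels) + 1)
    return float(np.sum(precision * labels) / labels.sum())
```

A score of 1.0 means every same-word pair ranks above every different-word impostor; ties and interpolation conventions vary slightly across toolkits.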
Figure 5.2 shows the average precision scores for the two front-ends with varying
amounts of training data. The plot shows that posterior features perform significantly
better than the raw acoustic features (39D PLP features with zero mean/unit variance)
which have a very low score of only 0.177. As in the LVCSR case (Chapter 4, Sec. 4.2),
posterior features from the cross-lingual front-end perform even better. Both front-ends
improve as the amount of task independent data increases. Since this evaluation metric is
based on DTW distances over a moderately large set of words, improved performances on
this metric imply more accurate spoken term discovery. These experiments clearly show
the potential of data-driven front-ends not only in low-resource settings but also in zero-resource
settings.
5.4 Application 4 - Event detectors for Speech Recognition
In [156], we present a new application of phoneme posteriors for ASR. We use MLP
based phoneme posteriors to detect phonetic events in the acoustic signal. These phoneme
detectors are then used along with Segmental Conditional Random Fields (SCRFs) [157] to
represent the information in the underlying audio signal.
5.4.1 Building Phoneme Detectors
Multilayer perceptrons are used to estimate the posterior probability of phonemes
given the acoustic evidence. Each output unit of the MLP is associated with a particular
HMM state to allow these probabilities to be used as emission probabilities of an HMM
system. The Viterbi algorithm is then applied on the hybrid system to decode phoneme
sequences. Each time frame in the acoustic signal is associated with a phoneme in the
decoded output. We use the output phonemes along with their corresponding time stamps
as a collection of phoneme detections. A phoneme detection is registered at the mid-point
of the time span in which a phoneme is present. These phoneme detections are subsequently
used in the SCARF framework.
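Collapsing the frame-level Viterbi output into mid-point detections can be sketched as follows (an illustrative function, not the thesis code):

```python
def phoneme_detections(frame_labels, frame_shift=0.01):
    """Turn a frame-level decoded phoneme sequence into (phoneme, time)
    detections, registering each detection at the mid-point of the time
    span in which the phoneme is present (frame_shift in seconds)."""
    detections, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        # close the current span at a label change or at the end of the signal
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            mid = (start + i - 1) / 2.0 * frame_shift
            detections.append((frame_labels[start], mid))
            start = i
    return detections
```

With the common 10 ms frame shift, a phoneme spanning frames 20-29 would be registered as a single detection at 0.245 s.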
To derive reliable detections corresponding to the underlying acoustic signal, pos-
terior probabilities of phonetic sound classes are estimated using a hierarchical configuration
of MLPs. We use both short-term spectral and long-term modulation acoustic features as
input along with the hierarchical configuration to identify phonetic events.
5.4.2 Integrating Detectors with SCARF
An important characteristic of the SCRF approach is that it allows a set of features
from multiple information sources to be integrated together to model the probability of a
word sequence using a log-linear model. SCARF [158] uses several basic kinds of features
to relate the events present in the observation stream to the words being hypothesized.
These include - expectation features, Levenshtein features, existence features, language
model features and baseline features. The expectation and Levenshtein features measure
the similarity between expected and observed phoneme strings, while the existence features
indicate simple co-occurrence between words and phonemes. The baseline feature indicates
(dis)agreement between the label on a lattice link, and the word which occurs in the same
time span in a baseline decoding sequence.
The phoneme detections that we now include capture phonetic events that occur
in the underlying acoustic signal. During the training process SCARF learns weights for
each of the features. In the testing phase, SCARF uses the inputs from the detectors to
search the constrained space of possible hypotheses.
System                                                   WER% on dev04f
Baseline system (LDA + MLLT + VTLN + fMLLR
  + MLLR + fMMI + mMMI + wide beams)                          16.3
SCARF with baseline features                                  16.0
SCARF + Word Detectors                                        15.3
SCARF + Word Detectors + Phoneme Detectors                    15.1

Table 5.3: Integrating MLP based event detectors with ASR
We use SCARF along with the earlier described event detectors on the Broadcast
News task [159]. Table 5.3 shows the results of using the word detector stream along
with all the phoneme detector streams in combination. In this experiment we observe fur-
ther improvements with the phoneme detectors even after the word detectors have been
used. Both experiments clearly show that the detectors capture additional information in the
underlying acoustic signal, which accounts for the further reduction in error
rates. It should be noted that these improvements are on top of results from state-of-the-art
recognition systems.
5.5 Conclusions
In this chapter, we have demonstrated the use of outputs of data-driven front-ends
for four different applications. For speech activity detection, the data-driven front-ends are
used to derive features which improve speech detection in very noisy environments. In the
second application we use broad class posteriors to improve neural network based speaker
verification. By introducing this side information, a mixture of neural networks can
be trained similarly to conventional GMM based models. This technique improves the neural
network framework significantly and makes its performance comparable with state-of-the-art
systems.
We have demonstrated the usefulness of data-driven features for zero-resource
speech applications like spoken term discovery, which operate without any transcribed
speech to train systems. The proposed features provide significant gains over conventional
acoustic features on various information retrieval metrics for this task. In this chapter we have
also explored a different application of phoneme posteriors - as phonetic event detectors for
speech recognition. We show how these detectors can be built to reliably capture phonetic
events in the acoustic signal by integrating both acoustic and phonetic information about
sound classes along with segmental conditional random fields.
Chapter 6
Conclusions
This chapter summarizes the important contributions made in this thesis.
6.1 Contributions
In this thesis we have proposed novel data-driven feature front-ends for differ-
ent speech applications. This approach is different from conventional feature extraction
techniques which derive information only from the spectrum of speech in short analysis
windows.
To build effective data-driven front-ends, we have investigated the use of novel
features based on auto-regressive modeling of sub-band envelopes of speech. In conjunction
with these features, we have explored the use of various data-driven front-ends in different
scenarios, especially in low-resource settings. Several novel neural network architectures
and adaptation techniques have been proposed to improve the performance of these front-
ends when only limited amounts of task specific transcribed data are available. We have also
demonstrated the use of these front-ends for other speech applications like speech activity
detection and speaker verification.
The novel contributions made in this thesis can be summarized as -
1. Exploiting temporal dynamics of speech
Data-driven features for speech recognition (Chap. 2, Sec. 2.3) - We have pro-
posed a new set of data-driven features for speech recognition. These features are derived
by combining posterior outputs of MLPs trained on FDLP based short-term spectral and
long-term modulation features. The proposed data-driven features significantly improve
performances on various ASR tasks - phoneme recognition, digit recognition and large
vocabulary continuous speech recognition [79, 80, 82, 160, 161].
2. Working with limited amounts of training data
Techniques to combine data transcribed using different phoneme sets (Chap.
3, Sec. 3.2) - We have developed a count based technique to map between phoneme
classes used to transcribe data in different languages and domains. This technique is
based on a measure that uses posteriors of phoneme classes as soft counts. We have
demonstrated the use of this approach in combining data from three languages - English,
Spanish and German, to train neural network systems. Significant gains are observed
when data-driven features derived using these multilingual MLPs are used in low-resource
settings [162].
3. Neural network architectures for data-driven front-ends
a. Neural network adaptation scheme using multiple output layers (Chap. 3,
Sec. 3.3) - Instead of using a mapping scheme to combine data from different sources
before training, we have developed an approach to train neural networks using domain
specific output layers that are modified as training progresses across different domains.
This approach has been shown to be useful in sharing trained network layers across
different domains especially in low-resource settings [163]. Both the above mentioned
techniques address a key issue usually encountered while training neural networks
with data transcribed using different phoneme sets from multiple sources.
b. Wide neural network topology using data from multiple languages (Chap.
4, Sec. 4.2) - We have explored the use of a wide neural network topology that uses
several MLPs trained on large amounts of task independent data for low-resource and
zero-resource speech applications. Results using these front-ends demonstrate that
when task dependent training data is scarce, task independent multi-lingual data can
be used to compensate for performance drops [164].
c. Deep neural network with pre-training using task independent data (Chap.
4, Sec. 4.3) - To allow deep neural networks to be effectively trained in low resource
settings, we have investigated the use of multilingual data for initialization and train-
ing. By using deep neural networks, significant gains are observed on a low-resource
task using only 1 hour of training data. We also illustrate the use of unsupervised
acoustic model training in these settings. Table 6.1 summarizes the gains obtained by
using the proposed techniques in a low-resource experimental setup with only 1 hour
of transcribed training data.
System                                                           Word Accuracy (%)
Conventional acoustic features (PLP) using 1 hour of
  English training data (Baseline system)                             28.8
Data-driven features using count based map - 31 hours of
  multilingual data (German + Spanish) and 1 hour of
  English (Contribution 2)                                            36.5
Data-driven features using adaptable last layer for MLP
  training (Contribution 3a)                                          37.2
Data-driven features using wide network topology - 200 hours
  Spanish MLP and 20 hours of English MLP from a different
  domain (Contribution 3b)                                            41.5
Data-driven features using deep neural network pre-trained
  using 31 hours of multilingual data (German + Spanish)
  and 1 hour of English (Contribution 3c)                             41.0
Semi-supervised acoustic training with DNN features
  (Contribution 3c)                                                   44.8
Conventional acoustic features (PLP) with all the available
  15 hours of training data used for acoustic model
  training (Baseline system)                                          46.5

Table 6.1: Performances in a low-resource setting using different data-driven front-ends proposed in the thesis.
The above results clearly show that data-driven features are able to improve recog-
nition accuracies in low-resource settings significantly. With only a small fraction
of task specific training data, the proposed approaches are able to achieve perfor-
mances (44.8%) very close to those obtained with conventional features when all of
the training data is used (46.5%).
4. Applications of data-driven features
• Multi-stream data-driven features for speech activity detection (Chap. 5,
Sec. 5.1) - Neural networks have traditionally been used only as acoustic models
for speech activity detection. We have proposed the use of data-driven features
derived using MLPs for this task. When combined with acoustic features, significant
improvements are observed on the speech activity detection task in noisy environments [139].
• Mixture of AANNs using MLP posteriors for speaker verification (Chap.
5, Sec. 5.2) - To allow neural network models to effectively capture the
distribution of acoustic vectors, a mixture of AANNs has been proposed. Several
independent AANNs are trained on different parts of the acoustic space correspond-
ing to broad phoneme classes of speech. The assignment of a feature vector to one
of these classes is done using posterior probabilities of these classes estimated us-
ing an MLP. Experiments show significant improvements by using the mixtures of
AANNs to model speakers. This novel approach is comparable with conventional
GMM based modeling approaches for this task [148,150].
• Data-driven features for zero-resource settings (Chap. 5, Sec. 5.3) -
Data-driven features provide significant gains over conventional acoustic features
on various information retrieval metrics in zero-resource speech applications like
spoken term discovery [164].
• Event detectors for speech recognition (Chap. 5, Sec. 5.4) - Phoneme
posterior probabilities estimated using MLPs are extensively used both as scaled
likelihoods (HMM-ANN framework) and features (Tandem approach) for speech
recognition. We explore a different application of these posteriors - as phonetic
event detectors for speech recognition [156].
6.2 Summary
In this chapter, we have summarized the contributions of this thesis. Although the
proposed data-driven feature extraction techniques have been shown to be useful in many
applications, they have limitations related to their training and use. These include -
1. Labeled training data - For the neural network systems to be trained, sufficient data
with frame level phonetic transcriptions is required. These labels are often produced
from alignments generated by an LVCSR system. In low-resource settings or zero-
resource settings where no such transcripts are available, building neural network
based front-end systems will be difficult.
2. Mismatch conditions - Neural networks are sensitive to mismatches in train and test
conditions. Neural network based front-ends can be useful for deriving features only
in matched training conditions.
These current limitations open up several interesting avenues for future work. It
would be interesting to see if any of the techniques currently being developed for unsuper-
vised sub-word acoustic model training using universal background models [165], successive
state splitting algorithms for HMMs [166], estimation of sub-word HMMs [167], discrimi-
native clustering objectives [168], non-parametric Bayesian estimation of HMMs [169], and au-
tomatically discovered context independent sub-word units [170] can be used to build data-
driven front-ends when transcribed data is unavailable. An interesting paradigm to deal
with mismatch conditions is multi-stream speech recognition. The multi-stream recognition
paradigm for processing of corrupted signals has been studied for more than a decade [137].
In these approaches, a number of different representations of the signal are processed
and classified in separate processing channels, making it possible to adap-
tively de-emphasize the corrupted channels while preserving the uncorrupted channels for further
processing. It would be interesting to see if robust data-driven front-ends [138,171,172] can
be built using this technique to deal with unexpected or unseen noise environments.
Bibliography
[1] F. Jelinek, “Continuous speech recognition by statistical methods,” Proceedings of the
IEEE, vol. 64, no. 4, pp. 532–556, 1976.
[2] ——, Statistical methods for speech recognition. MIT press, 1998.
[3] H. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach.
Springer, 1994, vol. 247.
[4] L. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring
in the statistical analysis of probabilistic functions of Markov chains,” The Annals of
Mathematical Statistics, pp. 164–171, 1970.
[5] S. Katz, “Estimation of probabilities from sparse data for the language model com-
ponent of a speech recognizer,” IEEE Transactions on Acoustics, Speech and Signal
Processing, vol. 35, no. 3, pp. 400–401, 1987.
[6] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63,
no. 4, pp. 561–580, 1975.
[7] A. Oppenheim and R. Schafer, “Homomorphic analysis of speech,” IEEE Transactions
on Audio and Electroacoustics, vol. 16, no. 2, pp. 221–226, 1968.
[8] R. Schafer and L. Rabiner, “Digital representations of speech signals,” Proceedings of
the IEEE, vol. 63, no. 4, pp. 662–677, 1975.
[9] S. Davis and P. Mermelstein, “Comparison of parametric representations for mono-
syllabic word recognition in continuously spoken sentences,” IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[10] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal
of the Acoustical Society of America, vol. 87, p. 1738, 1990.
[11] S. Furui, “Speaker-independent isolated word recognition using dynamic features of
speech spectrum,” IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 34, no. 1, pp. 52–59, 1986.
[12] K. Fukunaga, Statistical pattern recognition. Academic Press, 1990.
[13] C. Bishop, Pattern recognition and machine learning. Springer-Verlag, 2006.
[14] M. Richard and R. Lippmann, “Neural network classifiers estimate Bayesian a poste-
riori probabilities,” Neural computation, vol. 3, no. 4, pp. 461–483, 1991.
[15] I. Jolliffe, Principal component analysis. Wiley Online Library, 2005.
[16] H. Hermansky and N. Malayath, “Spectral basis functions from discriminant analy-
sis,” in Proceedings of ICSLP. ISCA, 1998.
[17] R. Cole, M. Fanty, M. Noel, and T. Lander, “Telephone speech corpus development
at CSLU,” in Proceedings of ICSLP. ISCA, 1994.
[18] P. Brown, “The acoustic-modeling problem in automatic speech recognition,” Ph.D.
dissertation, Carnegie-Mellon University, 1987.
[19] M. Hunt, “A statistical approach to metrics for word and syllable recognition,” The
Journal of The Acoustical Society of America, vol. 66, no. S1, pp. S35–S36, 1979.
[20] M. Hunt and C. Lefebvre, “A comparison of several acoustic representations for speech
recognition with degraded and undegraded speech,” in Proceedings of ICASSP. IEEE,
1989.
[21] S. Van Vuuren and H. Hermansky, “Data-driven design of RASTA-like filters,” in
Proceedings of Eurospeech. ESCA, 1997.
[22] N. Malayath and H. Hermansky, “Data-driven spectral basis functions for automatic
speech recognition,” Speech Communication, vol. 40, no. 4, pp. 449–466, 2003.
[23] F. Valente and H. Hermansky, “Discriminant linear processing of time-frequency
plane,” in Proceedings of INTERSPEECH. ISCA, 2006.
[24] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254–272,
1981.
[25] A. Jansen and P. Niyogi, “Intrinsic Fourier analysis on the manifold of speech sounds,”
in Proceedings of ICASSP. IEEE, 2006.
[26] V. Jain and L. Saul, “Exploratory analysis and visualization of speech and music by
locally linear embedding,” in Proceedings of ICASSP. IEEE, 2004.
[27] A. Errity and J. McKenna, “An investigation of manifold learning for speech analysis,”
in Proceedings of ICSLP. ISCA, 2006.
[28] H. Hermansky, D. Ellis, and S. Sharma, “Tandem connectionist feature extraction for
conventional HMM systems,” in Proceedings of ICASSP. IEEE, 2000.
[29] B. Chen, S. Chang, and S. Sivadas, “Learning discriminative temporal patterns in
speech: Development of novel TRAPS-like classifiers,” in Proceedings of Eurospeech,
vol. 242. ESCA, 2003.
[30] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, “Probabilistic and bottle-neck
features for LVCSR of meetings,” in Proceedings of ICASSP. IEEE, 2007.
[31] J. Pinto, S. Garimella, M. Magimai-Doss, H. Hermansky, and H. Bourlard, “Analysis
of MLP-based hierarchical phoneme posterior probability estimator,” IEEE Transac-
tions on Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 225–241, 2011.
[32] J. Pinto, G. Sivaram, H. Hermansky, and M. Magimai-Doss, “Volterra series for ana-
lyzing MLP based phoneme posterior estimator,” in Proceedings of ICASSP. IEEE,
2009.
[33] L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information
estimation of hidden Markov model parameters for speech recognition,” in Proceedings
of ICASSP. IEEE, 1986.
[34] P. Woodland, D. Povey et al., “Large scale MMIE training for conversational telephone
speech recognition,” in Proceedings of Speech Transcription Workshop, 2000.
[35] E. McDermott, “Discriminative training for speech recognition,” Ph.D. dissertation,
Waseda University, Japan, 1997.
[36] G. Doddington, “Phonetically sensitive discriminants for improved speech recogni-
tion,” in Proceedings of ICASSP. IEEE, 1989.
[37] E. Schukat-Talamazzini, J. Hornegger, and H. Niemann, “Optimal linear feature
transformations for semi-continuous hidden Markov models,” in Proceedings of
ICASSP. IEEE, 1995.
[38] N. Kumar and A. Andreou, “Heteroscedastic discriminant analysis and reduced rank
HMMs for improved speech recognition,” Speech Communication, vol. 26, no. 4, pp.
283–297, 1998.
[39] R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classi-
fication,” in Proceedings of ICASSP. IEEE, 1998.
[40] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and
K. Visweswariah, “Boosted MMI for model and feature-space discriminative train-
ing,” in Proceedings of ICASSP. IEEE, 2008.
[41] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE: Dis-
criminatively trained features for speech recognition,” in Proceedings of ICASSP.
IEEE, 2005.
[42] N. Kambhatla and T. Leen, “Dimension reduction by local principal component anal-
ysis,” Neural Computation, vol. 9, no. 7, pp. 1493–1516, 1997.
[43] B. Zhang, S. Matsoukas, and R. Schwartz, “Discriminatively trained region dependent
feature transforms for speech recognition,” in Proceedings of ICASSP. IEEE, 2006.
[44] J. Zheng, O. Cetin, M. Hwang, X. Lei, A. Stolcke, and N. Morgan, “Combining
discriminative feature, transform, and model training for large vocabulary speech
recognition,” in Proceedings of ICASSP. IEEE, 2007.
[45] E. Zwicker, G. Flottorp, and S. Stevens, “Critical band width in loudness summation,”
The Journal of the Acoustical Society of America, vol. 29, no. 5, pp. 548–557, 1957.
[46] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on
Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[47] H. Hermansky and P. Fousek, “Multi-resolution RASTA filtering for TANDEM-based
ASR,” in Proceedings of INTERSPEECH. ISCA, 2005.
[48] L. Lee and R. Rose, “Speaker normalization using efficient frequency warping proce-
dures,” in Proceedings of ICASSP. IEEE, 1996.
[49] ETSI, “Speech processing, transmission and quality aspects (STQ); Distributed
speech recognition; Advanced front-end feature extraction algorithm; Compression
algorithms,” ETSI ES, vol. 202, no. 050, p. v1, 2007.
[50] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Communication, vol. 16, no. 3, pp. 261–291, 1995.
[51] M. Gales and S. Young, “The application of hidden Markov models in speech recog-
nition,” Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, 2007.
[52] C. Avendano, “Temporal processing of speech in a time-feature space,” Ph.D. disser-
tation, Oregon Graduate Institute, 1997.
[53] D. Gelbart and N. Morgan, “Double the trouble: handling noise and reverberation in
far-field automatic speech recognition,” in Proceedings of ICSLP. ISCA, 2002.
[54] N. Morgan, H. Bourlard, C. Wooters, P. Kohn, and M. Cohen, “Phonetic context in
hybrid HMM/MLP continuous speech recognition,” in Second European Conference
on Speech Communication and Technology, 1991.
[55] B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz, “Long span features and minimum
phoneme error heteroscedastic linear discriminant analysis,” in Proceedings of EARS
RT-04 Workshop, 2004.
[56] S. Chen, B. Kingsbury, L. Mangu, D. Povey, G. Saon, H. Soltau, and G. Zweig,
“Advances in speech transcription at IBM under the DARPA EARS program,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1596–
1608, 2006.
[57] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf,
P. Jain, H. Hermansky, D. Ellis et al., “Pushing the envelope - aside: Beyond the
spectral envelope as the fundamental representation for speech recognition,” IEEE
Signal Processing Magazine, vol. 22, no. 5, pp. 81–88, 2005.
[58] H. Yang, S. van Vuuren, S. Sharma, and H. Hermansky, “Relevance of time-frequency fea-
tures for phonetic and speaker-channel classification,” Speech Communication, vol. 31,
no. 1, pp. 35–50, 2000.
[59] R. Drullman, J. Festen, and R. Plomp, “Effect of reducing slow temporal modulations
on speech reception,” The Journal of the Acoustical Society of America, vol. 95, p.
2670, 1994.
[60] T. Arai, M. Pavel, H. Hermansky, and C. Avendano, “Syllable intelligibility for tem-
porally filtered LPC cepstral trajectories,” The Journal of the Acoustical Society of
America, vol. 105, p. 2783, 1999.
[61] T. Houtgast and H. Steeneken, “The modulation transfer function in room acoustics as
a predictor of speech intelligibility,” The Journal of the Acoustical Society of America,
vol. 54, no. 2, pp. 557–557, 1973.
[62] T. Chi, Y. Gao, M. Guyton, P. Ru, and S. Shamma, “Spectro-temporal modulation
transfer functions and speech intelligibility,” The Journal of the Acoustical Society of
America, vol. 106, p. 2719, 1999.
[63] T. Houtgast and H. Steeneken, “A review of the MTF concept in room acoustics and
its use for estimating speech intelligibility in auditoria,” The Journal of the Acoustical
Society of America, vol. 77, p. 1069, 1985.
[64] H. Hermansky, “The modulation spectrum in the automatic recognition of speech,”
in Proceedings of ASRU. IEEE, 1997.
[65] H. Hermansky and S. Sharma, “TRAPS-classifiers of temporal patterns,” in Proceed-
ings of ICSLP. ISCA, 1998.
[66] ——, “Temporal patterns (TRAPS) in ASR of noisy speech,” in Proceedings of
ICASSP. IEEE, 1999.
[67] P. Schwarz, “Phoneme recognition based on long temporal context,” Ph.D. disserta-
tion, Brno University of Technology, 2009.
[68] P. Jain and H. Hermansky, “Beyond a single critical-band in TRAP based ASR,” in
Eighth European Conference on Speech Communication and Technology, 2003.
[69] J. Herre and J. Johnston, “Enhancing the performance of perceptual audio coders by
using temporal noise shaping (TNS),” in 101st AES Convention, 1996.
[70] R. Kumaresan and A. Rao, “Model-based approach to envelope and positive instan-
taneous frequency estimation of signals with speech applications,” The Journal of the
Acoustical Society of America, vol. 105, p. 1912, 1999.
[71] M. Athineos, “Linear prediction of temporal envelopes for speech and audio applica-
tions,” Ph.D. dissertation, Columbia University, 2008.
[72] S. Ganapathy, “Signal analysis using autoregressive models of amplitude modulation,”
Ph.D. dissertation, The Johns Hopkins University, 2012.
[73] B. Chen, Q. Zhu, and N. Morgan, “Learning long-term temporal features in LVCSR
using neural networks,” in Proceedings of ICSLP. ISCA, 2004.
[74] F. Valente and H. Hermansky, “Combination of acoustic classifiers based on Dempster-
Shafer theory of evidence,” in Proceedings of ICASSP. IEEE, 2007.
[75] Q. Zhu, B. Chen, N. Morgan, A. Stolcke et al., “On using MLP features in LVCSR,”
in Proceedings of ICSLP. ISCA, 2004.
[76] M. Athineos and D. Ellis, “Autoregressive modeling of temporal envelopes,” IEEE
Transactions on Signal Processing, vol. 55, no. 11, pp. 5237–5245, 2007.
[77] S. Thomas, S. Ganapathy, and H. Hermansky, “Recognition of reverberant speech
using frequency domain linear prediction,” IEEE Signal Processing Letters, vol. 15,
pp. 681–684, 2008.
[78] S. Ganapathy, S. Thomas, and H. Hermansky, “Temporal envelope compensation for
robust phoneme recognition using modulation spectrum,” The Journal of the Acous-
tical Society of America, vol. 128, p. 3769, 2010.
[79] S. Thomas, S. Ganapathy, and H. Hermansky, “Spectro-temporal features for auto-
matic speech recognition using linear prediction in spectral domain,” in Proceedings
of EUSIPCO. EURASIP, 2008.
[80] ——, “Phoneme recognition using spectral envelope and modulation frequency fea-
tures,” in Proceedings of ICASSP. IEEE, 2009.
[81] S. Mallidi, S. Ganapathy, and H. Hermansky, “Modulation spectrum analysis for
recognition of reverberant speech,” in Proceedings of INTERSPEECH. ISCA, 2011.
[82] S. Thomas, S. Ganapathy, and H. Hermansky, “Hilbert envelope based features for
far-field speech recognition,” Machine Learning for Multimodal Interaction, pp. 119–
124, 2008.
[83] T. Dau, D. Püschel, and A. Kohlrausch, “A quantitative model of the effective signal
processing in the auditory system. I. Model structure,” The Journal of the Acoustical
Society of America, vol. 99, p. 3615, 1996.
[84] S. Ganapathy, S. Thomas, and H. Hermansky, “Modulation frequency features for
phoneme recognition in noisy speech,” The Journal of the Acoustical Society of Amer-
ica, vol. 125, no. 1, pp. EL8–EL12, 2008.
[85] ——, “Comparison of modulation features for phoneme recognition,” in Proceedings
of ICASSP. IEEE, 2010.
[86] B. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the
modulation spectrogram,” Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.
[87] V. Tyagi and C. Wellekens, “Fepstrum representation of speech signal,” in Proceedings
of ASRU. IEEE, 2005.
[88] S. Ganapathy, S. Thomas, and H. Hermansky, “Static and dynamic modulation spec-
trum for speech recognition,” in Proceedings of INTERSPEECH. ISCA, 2009.
[89] K. Lee and H. Hon, “Speaker-independent phone recognition using hidden Markov
models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37,
no. 11, pp. 1641–1648, 1989.
[90] T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiat, M. Lincoln, I. McCowan,
D. Moore, V. Wan, R. Ordelman et al., “The 2005 AMI system for the transcription
of speech in meetings,” Machine Learning for Multimodal Interaction, pp. 450–462,
2006.
[91] J. Fiscus, N. Radde, J. Garofolo, A. Le, J. Ajot, and C. Laprun, “The rich transcrip-
tion 2005 spring meeting recognition evaluation,” Machine Learning for Multimodal
Interaction, pp. 369–389, 2006.
[92] D. Moore, J. Dines, M. Doss, J. Vepa, O. Cheng, and T. Hain, “Juicer: A weighted
finite-state transducer speech decoder,” Machine Learning for Multimodal Interaction,
pp. 285–296, 2006.
[93] G. Zavaliagkos, M. Siu, T. Colthurst, and J. Billa, “Using untranscribed training data
to improve performance,” in Proceedings of ICSLP. ESCA, 1998.
[94] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C. Lee, “A study on multilingual
acoustic modeling for large vocabulary ASR,” in Proceedings of ICASSP. IEEE,
2009.
[95] D. Imseng, H. Bourlard, and P. Garner, “Using KL-divergence and multilingual infor-
mation to improve ASR for under-resourced languages,” in Proceedings of ICASSP.
IEEE, 2012.
[96] IPA, Handbook of the International Phonetic Association: A guide to the use of the
International Phonetic Alphabet. Cambridge University Press, 1999.
[97] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek,
N. Goel, M. Karafiat, D. Povey et al., “Multilingual acoustic modeling for speech
recognition based on subspace Gaussian mixture models,” in Proceedings of ICASSP.
IEEE, 2010.
[98] Y. Qian, D. Povey, and J. Liu, “State-level data borrowing for low-resource speech
recognition based on subspace GMMs,” in Proceedings of INTERSPEECH. ISCA,
2011.
[99] S. Sivadas and H. Hermansky, “On use of task independent training data in tandem
feature extraction,” in Proceedings of ICASSP. IEEE, 2004.
[100] J. Pinto, “Multilayer perceptron based hierarchical acoustic modeling for automatic
speech recognition,” Ph.D. dissertation, École Polytechnique Fédérale de Lausanne,
2010.
[101] A. Stolcke, F. Grezl, M. Hwang, X. Lei, N. Morgan, and D. Vergyri, “Cross-domain
and cross-language portability of acoustic features estimated by multilayer percep-
trons,” in Proceedings of ICASSP. IEEE, 2006.
[102] G. Miller and P. Nicely, “An analysis of perceptual confusions among some English
consonants,” The Journal of the Acoustical Society of America, vol. 27, no. 2, pp.
338–352, 1955.
[103] A. Lovitt, J. Pinto, and H. Hermansky, “On confusions in a phoneme recognizer,”
IDIAP Research Report, IDIAP-RR-07-10, Tech. Rep., 2007.
[104] C. Pelaez-Moreno, A. Garcia-Moral, and F. Valverde-Albacete, “Analyzing phonetic
confusions using formal concept analysis,” The Journal of the Acoustical Society of
America, vol. 128, p. 1377, 2010.
[105] R. Fano, Transmission of Information: A Statistical Theory of Communication. MIT
press, 1961.
[106] A. Canavan, D. Graff, and G. Zipperlen, “CALLHOME American English speech,”
Linguistic Data Consortium, 1997.
[107] ——, “CALLHOME German speech,” Linguistic Data Consortium, 1997.
[108] A. Canavan and G. Zipperlen, “CALLHOME Spanish speech,” Linguistic Data Con-
sortium, 1997.
[109] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev,
and P. Woodland, “The HTK book,” Cambridge University Engineering Department,
vol. 3, 2002.
[110] J. Godfrey, E. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech
corpus for research and development,” in Proceedings of ICASSP. IEEE, 1992.
[111] D. Graff, J. Kong, K. Chen, and K. Maeda, “English gigaword,” Linguistic Data
Consortium, Philadelphia, 2003.
[112] P. Kingsbury, S. Strassel, C. McLemore, and R. MacIntyre, “CALLHOME American
English lexicon (PRONLEX),” Linguistic Data Consortium, Philadelphia, 1997.
[113] N. Morgan, “Deep and wide: Multiple layers in automatic speech recognition,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 7–13,
2012.
[114] F. Valente and H. Hermansky, “Hierarchical and parallel processing of modulation
spectrum for ASR applications,” in Proceedings of ICASSP. IEEE, 2008.
[115] G. Sivaram and H. Hermansky, “Sparse multilayer perceptron for phoneme recogni-
tion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1,
pp. 23–29, 2012.
[116] A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief net-
works,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1,
pp. 14–22, 2012.
[117] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neu-
ral networks for large vocabulary speech recognition,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[118] T. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A. Mohamed,
“Making deep belief networks effective for large vocabulary continuous speech recog-
nition,” in Proceedings of ASRU. IEEE, 2011.
[119] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep
neural networks for conversational speech transcription,” in Proceedings of ASRU.
IEEE, 2011.
[120] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio, “Why
does unsupervised pre-training help deep learning?” The Journal of Machine Learning
Research, vol. 11, pp. 625–660, 2010.
[121] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-
dependent DBN-HMMs for real-world speech recognition,” in Proceedings of NIPS
Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
[122] D. Yu and M. Seltzer, “Improved bottleneck features using pretrained deep neural
networks,” in Proceedings of INTERSPEECH. ISCA, 2011.
[123] T. Sainath, B. Kingsbury, and B. Ramabhadran, “Auto-encoder bottleneck features
using deep belief networks,” in Proceedings of ICASSP. IEEE, 2012.
[124] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,”
Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[125] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of
deep networks,” Advances in Neural Information Processing Systems, vol. 19, p. 153,
2007.
[126] T. Kemp and A. Waibel, “Unsupervised training of a speech recognizer: Recent ex-
periments,” in Proceedings of EUROSPEECH. ESCA, 1999.
[127] L. Lamel, J. Gauvain, and G. Adda, “Unsupervised acoustic model training,” in
Proceedings of ICASSP. IEEE, 2002.
[128] J. Ma, S. Matsoukas, O. Kimball, and R. Schwartz, “Unsupervised training on large
amounts of broadcast news data,” in Proceedings of ICASSP. IEEE, 2006.
[129] T. Kemp and T. Schaaf, “Estimating confidence using word lattices,” in Proceedings
of EUROSPEECH. ESCA, 1997.
[130] L. Burget, P. Schwarz, P. Matejka, M. Hannemann, A. Rastrow, C. White, S. Khu-
danpur, H. Hermansky, and J. Cernocky, “Combination of strongly and weakly con-
strained recognizers for reliable detection of OOVs,” in Proceedings of ICASSP.
IEEE, 2008.
[131] K. Woo, T. Yang, K. Park, and C. Lee, “Robust voice activity detection algorithm
for estimating noise spectrum,” IET Electronics Letters, vol. 36, no. 2, pp. 180–181,
2000.
[132] R. Chengalvarayan, “Robust energy normalization using speech/non-speech discrim-
inator for German connected digit recognition,” in Proceedings of EUROSPEECH.
ISCA, 1999.
[133] A. Benyassine, E. Shlomot, H. Su, D. Massaloux, C. Lamblin, and J. Petit, “ITU-T
Recommendation G.729 Annex B: a silence compression scheme for use with G.729
optimized for V.70 digital simultaneous voice and data applications,” IEEE
Communications Magazine, vol. 35, no. 9, pp. 64–73, 1997.
[134] E. Nemer, R. Goubran, and S. Mahmoud, “Robust voice activity detection using
higher-order statistics in the LPC residual domain,” IEEE Transactions on Speech
and Audio Processing, vol. 9, no. 3, pp. 217–231, 2001.
[135] J. Dines, J. Vepa, and T. Hain, “The segmentation of multi-channel meeting record-
ings for automatic speech recognition,” in Proceedings of INTERSPEECH. ISCA,
2006.
[136] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, and R. Sarikaya, “Robust speech
recognition in noisy environments: The 2001 IBM SPINE evaluation system,” in
Proceedings of ICASSP. IEEE, 2002.
[137] H. Hermansky, S. Tibrewala, and M. Pavel, “Towards ASR on partially corrupted
speech,” in Proceedings of ICSLP. ISCA, 1996.
[138] N. Mesgarani, S. Thomas, and H. Hermansky, “Toward optimizing stream fusion in
multistream recognition of speech,” The Journal of the Acoustical Society of America
- Express Letters, 2011.
[139] S. Thomas, S. Mallidi, T. Janu, H. Hermansky, N. Mesgarani, X. Zhou, S. Shamma,
T. Ng, B. Zhang, L. Nguyen et al., “Acoustic and data-driven features for robust
speech activity detection,” in Proceedings of INTERSPEECH. ISCA, 2012.
[140] K. Walker and S. Strassel, “The RATS radio traffic collection system,” in Proceedings
of Odyssey. ISCA, 2012.
[141] X. Ma, D. Graff, and K. Walker, “RATS - first incremental SAD audio delivery,”
Linguistic Data Consortium, 2011.
[142] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, K. Vesely, P. Matejka, X. Zhu, and
N. Mesgarani, “Developing a speech activity detection system for the DARPA RATS
program,” in Proceedings of INTERSPEECH. ISCA, 2012.
[143] D. Reynolds and R. Rose, “Robust text-independent speaker identification using
Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Pro-
cessing, vol. 3, no. 1, pp. 72–83, 1995.
[144] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted Gaussian
mixture models,” Digital Signal Processing, vol. 10, no. 1, pp. 19–41, 2000.
[145] B. Yegnanarayana and S. Kishore, “AANN: an alternative to GMM for pattern recog-
nition,” Neural Networks, vol. 15, no. 3, pp. 459–469, 2002.
[146] M. Shajith Ikbal, H. Misra, and B. Yegnanarayana, “Analysis of autoassociative map-
ping neural networks,” in Proceedings of IJCNN. IEEE, 1999.
[147] K. Murty and B. Yegnanarayana, “Combining evidence from residual phase and
MFCC features for speaker recognition,” IEEE Signal Processing Letters, vol. 13,
no. 1, pp. 52–55, 2006.
[148] G. Sivaram, S. Thomas, and H. Hermansky, “Mixture of auto-associative neural net-
works for speaker verification,” in Proceedings of INTERSPEECH. ISCA, 2011.
[149] S. Ganapathy, J. Pelecanos, and M. Omar, “Feature normalization for speaker verifi-
cation in room reverberation,” in Proceedings of ICASSP. IEEE, 2011.
[150] S. Thomas, S. Mallidi, S. Ganapathy, and H. Hermansky, “Adaptation transforms of
auto-associative neural networks as features for speaker verification,” in Proceedings
of Odyssey. ISCA, 2012.
[151] S. Garimella, “Alternative regularized neural network architectures for speech and
speaker recognition,” Ph.D. dissertation, The Johns Hopkins University, 2012.
[152] A. Jansen, K. Church, and H. Hermansky, “Towards spoken term discovery at scale
with zero resources,” in Proceedings of INTERSPEECH. ISCA, 2010.
[153] A. Muscariello, G. Gravier, F. Bimbot et al., “Audio keyword extraction by unsuper-
vised word discovery,” in Proceedings of INTERSPEECH. ISCA, 2009.
[154] Y. Zhang and J. Glass, “Towards multi-speaker unsupervised speech pattern discov-
ery,” in Proceedings of ICASSP. IEEE, 2010.
[155] M. Carlin, S. Thomas, A. Jansen, and H. Hermansky, “Rapid evaluation of speech
representations for spoken term discovery,” in Proceedings of INTERSPEECH. ISCA,
2011.
[156] S. Thomas, P. Nguyen, G. Zweig, and H. Hermansky, “MLP based phoneme detectors
for automatic speech recognition,” in Proceedings of ICASSP. IEEE, 2011.
[157] G. Zweig and P. Nguyen, “A segmental CRF approach to large vocabulary continuous
speech recognition,” in Proceedings of ASRU. IEEE, 2009.
[158] ——, “SCARF: A segmental conditional random field toolkit for speech recognition,”
in Proceedings of INTERSPEECH. ISCA, 2010.
[159] G. Zweig, P. Nguyen, D. Van Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell,
M. Wang, F. Sha, H. Hermansky et al., “Speech recognition with segmental con-
ditional random fields: A summary of the JHU CLSP 2010 summer workshop,” in
Proceedings of ICASSP. IEEE, 2011.
[160] S. Thomas, S. Ganapathy, and H. Hermansky, “Hilbert envelope based spectro-
temporal features for phoneme recognition in telephone speech,” in Proceedings of
INTERSPEECH. ISCA, 2008.
[161] ——, “Tandem representations of spectral envelope and modulation frequency fea-
tures for ASR,” in Proceedings of INTERSPEECH. ISCA, 2009.
[162] ——, “Cross-lingual and multistream posterior features for low resource LVCSR sys-
tems,” in Proceedings of INTERSPEECH. ISCA, 2010, pp. 877–880.
[163] ——, “Multilingual MLP features for low-resource LVCSR systems,” in Proceedings
of ICASSP. IEEE, 2012.
[164] S. Thomas, S. Ganapathy, A. Jansen, and H. Hermansky, “Data-driven posterior
features for low resource speech recognition applications,” in Proceedings of INTER-
SPEECH. ISCA, 2012.
[165] Y. Zhang and J. Glass, “Unsupervised spoken keyword spotting via segmental DTW
on Gaussian posteriorgrams,” in Proceedings of ASRU. IEEE, 2009.
[166] B. Varadarajan, S. Khudanpur, and E. Dupoux, “Unsupervised learning of acoustic
sub-word units,” in Proceedings of HLT. ACL, 2008.
[167] M. Siu, H. Gish, S. Lowe, and A. Chan, “Unsupervised audio patterns discovery using
HMM-based self-organized units,” in Proceedings of INTERSPEECH. ISCA, 2011.
[168] X. Anguera, “Speaker independent discriminant feature extraction for acoustic
pattern-matching,” in Proceedings of ICASSP. IEEE, 2012.
[169] C. Lee and J. Glass, “A non-parametric Bayesian approach to acoustic model discov-
ery,” in Proceedings of ACL, 2012.
[170] A. Jansen and K. Church, “Towards unsupervised training of speaker independent
acoustic models,” in Proceedings of INTERSPEECH. ISCA, 2011.
[171] N. Mesgarani, S. Thomas, and H. Hermansky, “Adaptive stream fusion in multistream
recognition of speech,” in Proceedings of INTERSPEECH. ISCA, 2011.
[172] ——, “A multistream multiresolution framework for phoneme recognition,” in Pro-
ceedings of INTERSPEECH. ISCA, 2010.
Vita
Samuel Thomas received his Bachelor of Technology degree in Computer Science and Engineering from Cochin University of Science and Technology, Kerala, India, in 2000, and his Master of Science (by Research) degree from the Indian Institute of Technology, Madras, in 2006. He completed his Ph.D. in Electrical and Computer Engineering in 2012 while affiliated with the Center for Language and Speech Processing (CLSP) at the Johns Hopkins University. He is currently a post-doctoral researcher at the IBM T. J. Watson Research Center, Yorktown Heights, USA. His research interests include speech recognition, speaker recognition, speech synthesis and machine learning. In the past, he has been part of several summer workshops at the CLSP and has also worked at the Idiap Research Institute, Switzerland, and the IBM India Research Lab, New Delhi, India.