DATA-DRIVEN NEURAL NETWORK BASED FEATURE
FRONT-ENDS FOR AUTOMATIC SPEECH RECOGNITION
by
Samuel Thomas
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.
Baltimore, Maryland
December, 2012
© Samuel Thomas 2012
All rights reserved
Abstract
Speech contains information about at least three constituent elements - (1)
the message that is being communicated, (2) the speakers who are communicating and
(3) the environment in which the communication occurs. Depending on the final goal,
information about each of these elements is processed by a feature extraction front-end
before being used for subsequent pattern recognition applications. Feature extraction
front-ends for automatic speech recognition (ASR) are designed to derive features that
characterize underlying speech sounds in the signal that are useful in recognizing the
spoken message. Irrelevant variability from speakers and the environment should also
be alleviated to the extent possible.
In this thesis, we improve conventional feature extraction techniques by developing a data-driven feature extraction approach. The key element in this approach is a feed-forward neural network trained on large amounts of data to recognize phonemes, the basic units of speech, occurring at intervals of 5-10 milliseconds. We show that these data-driven features benefit significantly from combining information from multiple acoustic features derived using novel signal processing techniques. In experiments on a variety of ASR tasks, from a small vocabulary continuous digit recognition task to a large vocabulary continuous speech recognition (LVCSR) task, the proposed features provide about a 14% relative reduction in word error rate (WER).
The other problem we address in this thesis relates to the development of LVCSR systems with only a few hours of training data. In conventional systems, performance degrades considerably when the amount of training data is reduced. We propose several techniques to deal with these low-resource scenarios by using features from data-driven feature extractors trained on data from different languages and domains. The proposed techniques allow the feature front-ends to be trained on multilingual data transcribed using different phoneme sets. Our approaches show that with this kind of prior training at the feature extraction level, data-driven features can compensate significantly for the lack of large amounts of training data in downstream speech applications. We demonstrate an absolute WER reduction of about 15% on a low-resource task with only 1 hour of transcribed training data for acoustic modeling.
Apart from being used to generate features, we also show how outputs from the proposed data-driven front-ends can be used for a host of other speech applications. In noisy environments, we show how data-driven features can be used for speech activity detection on acoustic data from multiple languages transmitted over noisy radio communication channels. In a novel speaker recognition model using neural networks, posteriors of speech classes are used to model parts of each speaker's acoustic space, via a training objective function based on posterior probabilities of broad phonetic classes. In zero resource settings, tasks such as spoken term discovery attempt to automatically discover repeated words and phrases in speech without any transcriptions. With no transcripts to guide the process, results of the search depend largely on the quality of the underlying speech representation. Our experiments show that in these settings significant improvements can be obtained using phoneme posterior outputs derived using the proposed front-ends. We also explore a different application of these posteriors - as phonetic event detectors for speech recognition. These event detectors are used along with Segmental Conditional Random Fields (SCRFs) to improve the performance of speech recognition systems.
Thesis Committee
Prof. Mounya Elhilali, Prof. Aren Jansen (Reader) and Prof. Hynek Hermansky
(Reader and Advisor)
Acknowledgments
This thesis would never have been in place without so many great people
around me. I would like to thank my advisor, Prof. Hynek Hermansky for his
guidance and support. He always allowed me to learn, contribute and collaborate on
different projects and work with other research groups. Thank you very much for all
the mentoring!
I owe much to Sriram for always being there for me as a great friend and
collaborator. We have worked together on many interesting ideas and projects, several
of which form the core of this thesis. He has always been around to help - many thanks
for also reading this thesis! My sincere thanks to my colleagues - Sivaram, Harish,
Keith, Vijay, Feipeng, Ehsan, Janu, Kailash, Sridhar, Mike, Bala, Deepu, Joel, Fabio,
Petr, Mathew, John, Tamara, Lakshmi, Hari, Weifeng and Phil. Graduate school
would never have been as it was, without all of you! Thank you very much for the
collaborations and help!
I was fortunate to work with several researchers at UMD (Shihab, Nima, Xinhui, Daniel, Dmitry and Ramani), BBN (Spyros, Stavros, Tim, Long and Bing) and BUT (Lukas, Petr, Pavel, Martin, Ondrej and Honza) on the IARPA BEST, BABEL and DARPA RATS programs. Thank you very much!
I spent three different summers working with CLSP summer workshop teams
led by Dan, Nagendra, Lukas and Richard in 2009, Geoff and Patrick in 2010 and
Aren, Mike and Ken in 2012. These were great workshops! Thanks for having me
on your teams! My sincere thanks to Sanjeev for organizing these workshops and the
CLSP, HLTCOE and ECE for supporting me on various grants.
My sincere thanks to Aren and Mounya for being on my committees from my GBO to the final defense. Thank you very much Aren for the great collaboration, advice and comments on my thesis!
I pulled through all of this because of the love and prayers of my family - my son Joshua, wife Jamie and our parents. No words can express my thanks for the part you have in all this!
Dedication
This thesis is dedicated to my family.
Contents
Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
1.1 Overview of Automatic Speech Recognition
1.2 Conventional Feature Extraction Techniques for ASR
1.3 Integrating Training Data with Feature Extraction
1.4 Review of data-driven feature transforms for ASR
1.4.1 Front-end feature transforms
1.4.2 Back-end feature transforms
1.5 Focus of the thesis
1.6 Outline of Contributions
1.7 Thesis Organization

2 From Acoustic Features to Data-driven Features
2.1 Time-Frequency Representations of Speech
2.1.1 Processing across the frequency axis
2.1.2 Processing across the time axis
2.1.3 Integrating speaker and channel invariance
2.2 Towards Improved Features for ASR
2.2.1 Long-term acoustic features
2.2.2 Parametric models of temporal envelopes
2.2.3 Neural network features
2.2.4 Combination of information from multiple streams
2.3 Novel Short-term and Long-term Features for Speech Recognition
2.3.1 FDLP based time-frequency representation
2.3.2 Short-term Features
2.3.3 Long-term Features
2.3.4 Data-driven Features
2.4 Speech Recognition Experiments and Results
2.4.1 Phoneme Recognition
2.4.2 Small Vocabulary Digit Recognition
2.4.3 Large Vocabulary Continuous Speech Recognition
2.5 Conclusions

3 Data-driven Features for Low-resource Scenarios
3.1 Overview
3.2 Training Using a Combined Phone set
3.3 Training Using Multiple Output Layers
3.4 Speech Recognition Experiments and Results
3.4.1 Data sets
3.4.2 Low-resource LVCSR System
3.4.3 Building Data-driven Front-ends using a Common Phoneme Set
3.4.4 Data-driven Front-ends with MLPs Adapted using Multiple Output Layers
   Training with 2 languages
   Training with 3 languages
3.5 Conclusions

4 Wide and Deep MLP Architectures in Low-resource Settings
4.1 Overview
4.2 Wide Network Topologies
4.2.1 Building the Data-driven Front-ends
4.2.2 Experiments and Evaluations
4.3 Deep Network Topologies
4.3.1 DNN Pretraining and Initialization
4.3.2 DNN Adaptation with task specific data
4.3.3 Experiments and Evaluations
   DNN pretraining with cross-lingual data
   DNN adaptation to low-resource settings
   ASR Experiments using DNN features
4.4 Semi-supervised training in Low-resource Settings
4.4.1 Overview
4.4.2 Selecting Reliable Data
4.4.3 Experiments and Results
   Data selection
   Semi-supervised training of DNNs
   Semi-supervised training of Acoustic Models
4.5 Conclusions

5 Applications of Data-driven Front-end Outputs
5.1 Application 1 - Speech Activity Detection
5.1.1 Overview
5.1.2 Data-driven Features for SAD
5.1.3 Experiments and Results
5.2 Application 2 - Neural Network based Speaker Verification
5.2.1 Overview
5.2.2 AANN Models for Speaker Verification
   Modeling Speaker Data
   Mixture of AANNs
5.2.3 Experiments and Results
5.3 Application 3 - Zero Resource Settings
5.4 Application 4 - Event detectors for Speech Recognition
5.4.1 Building Phoneme Detectors
5.4.2 Integrating Detectors with SCARF
5.5 Conclusions

6 Conclusions
6.1 Contributions
6.2 Summary

Bibliography
Vita
List of Tables
1.1 LDA with different representations of speech.
2.1 FDLP model parameters that improve robustness of short-term spectral features.
2.2 FDLP model parameters that improve performance of long-term modulation features.
2.3 Phoneme Recognition Accuracies (%) for different feature extraction techniques on the TIMIT database
2.4 Word Recognition Accuracies (%) on the OGI Digits database for different feature extraction techniques
2.5 Word Recognition Accuracies (%) on RT05 Meeting data, for different feature extraction techniques. TOT - total word recognition accuracy (%) for all test sets; AMI, CMU, ICSI, NIST, VT - word recognition accuracies (%) on individual test sets
2.6 Recognition Accuracies (%) of broad phonetic classes obtained from confusion matrix analysis
3.1 Word Recognition Accuracies (%) using different Tandem features derived using only 1 hour of English data
3.2 Word Recognition Accuracies (%) using Tandem features enhanced using cross-lingual posterior features
3.3 Word Recognition Accuracies (%) using multi-stream cross-lingual posterior features
3.4 Word Recognition Accuracies (%) using two languages - Spanish and English
3.5 Word Recognition Accuracies (%) using three languages - Spanish, German and English
4.1 Word Recognition Accuracies (%) using different amounts of Callhome data to train the LVCSR system with conventional acoustic features
4.2 Word Recognition Accuracies (%) with semi-supervised pre-training
4.3 Word Recognition Accuracies (%) at different word confidence thresholds
4.4 Word Recognition Accuracies (%) with semi-supervised pre-training
4.5 Word Recognition Accuracies (%) with semi-supervised acoustic model training
5.1 Equal Error Rate (%) on different channels using different acoustic features and combinations
5.2 Performance in terms of Min DCF (×10³) and EER (%) in parentheses on different NIST-08 conditions
5.3 Integrating MLP based event detectors with ASR
6.1 Performances in a low-resource setting using different data-driven front-ends proposed in the thesis.
List of Figures
1.1 Broad Classification of Feature Transforms for ASR.
1.2 Spectral basis functions derived using PCA on the bark-spectrum of speech from the OGI stories database - Eigenvalues of the KLT basis, total covariance matrix projected on the first 8 KLT vectors, first 6 KL spectral basis functions derived by PCA analysis.
1.3 LDA-derived spectral basis functions of the critical band spectral space derived from the OGI Numbers corpus.
1.4 (a) Frequency and impulse responses of the first three discriminant vectors derived by applying LDA on trajectories of critical-band energies from the clean Switchboard database, (b) frequency and impulse responses of the RASTA filter and the RASTA filter combined with the delta and double-delta filters.
1.5 Thesis contributions to developing better data-driven neural network features for the ASR pipeline.
2.1 Illustration of the all-pole modeling property of FDLP. (a) a portion of the speech signal, (b) its Hilbert envelope, (c) all-pole model obtained using FDLP.
2.2 PLP (b) and FDLP (c) spectrograms for a portion of speech (a).
2.3 Schematic of the joint spectral envelope, modulation features for posterior based ASR
3.1 Schematic of the proposed training technique with multiple output layers
3.2 Deriving cross-lingual and multi-stream posterior features for low resource LVCSR systems
3.3 Tandem and bottleneck features for low-resource LVCSR systems.
4.1 (a) Wide and (b) Deep neural network topologies for data-driven features
4.2 Data-driven front-end built using data from the same language but from a different genre.
4.3 A cross-lingual front-end built with data from the same language and with large amounts of additional data from a different language but with the same acoustic conditions.
4.4 LVCSR word recognition accuracies (%) with 1 hour of task specific training data using the proposed front-ends
4.5 MLP posteriogram based phoneme occurrence count
5.1 Schematic of (a) features and (b) the processing pipeline for speech activity detection.
5.2 Average precision for different configurations of the wide topology front-ends
Chapter 1
Introduction
This chapter introduces the automatic speech recognition problem and its machinery. The theme of the thesis, developing data-driven feature extractors for speech recognition, is motivated, along with a discussion of techniques that have been developed in the past. The chapter also outlines the thesis and its contributions.
1.1 Overview of Automatic Speech Recognition
Automatic speech recognition is the process of transcribing speech into text. Current speech recognition systems solve this task in a probabilistic setting using four key components: a feature extraction module, an acoustic model, a pronunciation dictionary and a language model. In a word recognition task, given an acoustic signal corresponding to a sequence of words X = x_1 x_2 ... x_n, the feature extraction module first generates a compact representation of the input as a sequence of feature vectors Y = y_1 y_2 ... y_t. The acoustic model, pronunciation dictionary and language model are then used to find the most probable word sequence X given these feature vectors. This is done by expressing the desired probability p(X|Y) using Bayes theorem as

X̂ = argmax_X p(X|Y) = argmax_X [ p(Y|X) p(X) / p(Y) ]    (1.1)

p(X) is the a priori probability of observing a sequence of words in the language, independent of any acoustic evidence, and is modeled using the language model component. p(Y|X) corresponds to the likelihood of the acoustic features Y being generated given the word sequence X.
In current ASR systems, both the language model and the acoustic model are stochastic models trained using large amounts of training data [1, 2]. Hidden Markov Models (HMMs), or a hybrid combination of neural networks and HMMs [3], are typically used as acoustic models.
For large vocabulary speech recognition, not all words have an adequate number of acoustic examples in the training data, and the acoustic data covers only a limited vocabulary of words. Instead of modeling entire words or utterances with poorly estimated probability distributions from limited examples, acoustic models are built for basic speech sounds. By using these basic units, recognizers can also recognize words that have no acoustic training examples.
To compute the likelihood p(Y|X), each word in the hypothesized word sequence X is first broken down into its constituent phones using the pronunciation dictionary. A single composite model for the hypothesis is then constructed by combining individual phone HMMs. In practice, to account for the large variability of basic speech sounds, HMMs of context dependent speech units with continuous density output distributions are used. There exist efficient algorithms, like the Baum-Welch algorithm, to learn the parameters of these acoustic models from training data [4].
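Given a composite HMM for a hypothesis, the likelihood p(Y|X) is computed efficiently with the forward algorithm, summing over all state paths. The sketch below is a minimal log-domain implementation; the two-state left-to-right topology, transition values and emission scores are hypothetical toy numbers for illustration only.

```python
import numpy as np

def forward_log_likelihood(log_b, log_A, log_pi):
    """Forward algorithm in the log domain.

    log_b:  (T, S) log emission scores log p(y_t | state s)
    log_A:  (S, S) log transition matrix, log_A[i, j] = log p(j | i)
    log_pi: (S,)   log initial state distribution
    Returns log p(Y), summed over all state paths.
    """
    T, S = log_b.shape
    alpha = log_pi + log_b[0]                       # initialization
    for t in range(1, T):
        # alpha_t(j) = log_b[t, j] + logsumexp_i(alpha_{t-1}(i) + log_A[i, j])
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

# Hypothetical 2-state left-to-right "phone" HMM over 3 frames.
log_A = np.array([[np.log(0.6), np.log(0.4)],
                  [-np.inf,     0.0        ]])      # state 1 is absorbing
log_pi = np.array([0.0, -np.inf])                   # always start in state 0
log_b = np.log(np.array([[0.9, 0.1],
                         [0.5, 0.5],
                         [0.2, 0.8]]))
ll = forward_log_likelihood(log_b, log_A, log_pi)
```

The same dynamic-programming structure underlies Baum-Welch training, where the forward pass is paired with a backward pass to accumulate expected state occupancies.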
N-grams, typically bi-grams or tri-grams, are used as language models to generate the a priori probability p(X) [2]. Although p(X) is the probability of a sequence of words, N-grams model this probability assuming that the probability of any word x_i depends on only the N-1 preceding words. These probability distributions are estimated from simple frequency counts that can be obtained directly from large amounts of text. To account for the inability to estimate counts for all possible N-gram sequences, techniques like discounting and back-off are used [5].
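The count-based estimation above can be sketched as follows. Add-one smoothing is used here as a deliberately simple stand-in for the discounting and back-off schemes cited in the text, and the toy corpus is invented.

```python
from collections import Counter

def train_bigram(corpus):
    """Accumulate unigram and bigram frequency counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]            # sentence boundary markers
        unigrams.update(toks)
        bigrams.update(zip(toks[:-1], toks[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """P(w | w_prev) with add-one smoothing, a simple stand-in for the
    discounting/back-off schemes used in real language models."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
uni, bi = train_bigram(corpus)
V = len(uni)
p = bigram_prob("the", "cat", uni, bi, V)   # seen bigram gets high probability
```

Unseen bigrams still receive non-zero probability mass, which is exactly the problem discounting and back-off address in a more principled way.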
1.2 Conventional Feature Extraction Techniques for ASR
Front-ends for ASR, which have traditionally evolved from coding techniques like linear predictive coding (LPC) [6], start by performing a short-term analysis of the speech signal. Based on the assumption that speech is stationary in sufficiently short time intervals, the power spectrum (squared magnitude of the short-time Fourier spectrum) of the signal is computed every 10 ms in overlapping Hamming analysis windows of 25 ms duration [7, 8]. This spectral representation of speech is then transformed into an auditory-like representation by warping the frequency axis to the Mel or Bark scale and applying a non-linear cube-root or logarithmic compression. Mel-frequency Cepstral Coefficients (MFCC) [9] and Perceptual Linear Prediction (PLP) [10] features for speech recognition are cepstral coefficients derived by projecting the auditory-like representation onto a set of discrete cosine transform (DCT) basis functions. Since these techniques analyze the speech signal only in short analysis windows, information about local dynamics of the underlying speech signal is often provided by augmenting these features with derivatives of the cepstral trajectories at each instant [11]. In speech recognition applications, the first 13 cepstral coefficients along with their delta and double-delta derivatives are typically used.
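The pipeline above - short-term Hamming windowing, power spectrum, Mel warping, log compression and DCT projection - can be sketched as a minimal MFCC front-end. The window and hop lengths follow the text; the FFT size, filter count and sampling rate are illustrative choices, not values prescribed by the thesis.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, win=0.025, hop=0.010, n_fft=512, n_filt=20, n_ceps=13):
    """Minimal MFCC sketch: 25 ms Hamming windows every 10 ms, power
    spectrum, Mel filter-bank warping, log compression, DCT projection."""
    n_win, n_hop = int(sr * win), int(sr * hop)
    frames = [signal[i:i + n_win] * np.hamming(n_win)
              for i in range(0, len(signal) - n_win + 1, n_hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # squared magnitude

    # Triangular filters spaced uniformly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for j in range(1, n_filt + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(power @ fbank.T + 1e-10)              # log compression

    # DCT-II basis projects the auditory-like spectrum to cepstral coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return logmel @ dct.T

rng = np.random.default_rng(0)
feats = mfcc(rng.standard_normal(8000))                   # 1 s of noise at 8 kHz
```

Delta and double-delta trajectories would then be appended frame-by-frame to capture the local dynamics mentioned above.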
1.3 Integrating Training Data with Feature Extraction
In practical classification settings, the goal of a classifier is to assign one of J class labels to an entity given an N-dimensional feature vector x. One approach to this problem involves inferring the posterior probability p(C_j|x) of each class given the features. The entity is then assigned to the class with the highest posterior probability [12].

The posterior probability p(C_j|x) of each class can be estimated in multiple ways. In a Bayesian formulation, p(C_j|x) can be expanded as p(x|C_j) p(C_j) / p(x). The quantities p(x|C_j) and p(C_j) are then separately computed from generative models trained to capture these distributions from data. The probability p(C_j|x) can also be estimated directly from a parametric model whose parameters have been optimized using the training data.
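The Bayesian route can be sketched concretely with univariate Gaussians as the generative class-conditional models; the class means, variances and priors below are hypothetical toy numbers.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian likelihood p(x | C_j)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def class_posteriors(x, means, variances, priors):
    """Bayes rule: p(C_j | x) = p(x | C_j) p(C_j) / p(x)."""
    likes = np.array([gaussian_pdf(x, m, v) for m, v in zip(means, variances)])
    joint = likes * np.array(priors)
    return joint / joint.sum()          # dividing by p(x) normalizes

# Two hypothetical classes with equal priors; x = 0.9 lies closer to class 1.
post = class_posteriors(0.9, means=[0.0, 1.0], variances=[1.0, 1.0],
                        priors=[0.5, 0.5])
label = int(np.argmax(post))            # assign to the most probable class
```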
A non-probabilistic approach to the classification problem involves discriminant functions that predict the class label of the input [13]. In this framework, classification is viewed as partitioning the input feature space into different classes using decision boundaries or surfaces. For a simple two-class problem, a linear discriminant function can be constructed as the linear combination of the input feature vector with a weight vector w as

f(x, w) = w^T x + w_0.    (1.2)

In the N-dimensional input space, the function f(x, w) = w^T x + w_0 forms an (N-1)-dimensional hyperplane that assigns x to class C_1 if f(x, w) ≥ 0 and to class C_2 otherwise.
Discriminant functions can be further extended as generalized linear discriminant functions of the form

f(x, w) = w^T φ(x) + w_0,    (1.3)

where φ(.) is a fixed linear or non-linear vector function of the original input vector x. Using these functions, for the J-class problem we can design, for example, a J-class discriminant with J linear functions of the form

f_j(x, w_j) = w_j^T φ(x) + w_0.    (1.4)

x is assigned to class C_k if f_k > f_j for all j ≠ k.
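Operationally, a J-class discriminant of this form is a matrix-vector product followed by an argmax. In the sketch below the three weight vectors are hypothetical, and φ is taken to be the identity.

```python
import numpy as np

def classify(x, W, w0, phi=lambda v: v):
    """J-class generalized linear discriminant:
    f_j(x) = w_j^T phi(x) + w_0; assign x to argmax_j f_j."""
    scores = W @ phi(x) + w0        # one discriminant score per class
    return int(np.argmax(scores)), scores

# Hypothetical weight vectors for a 3-class problem in 2-D
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.zeros(3)
label, scores = classify(np.array([0.2, 0.9]), W, w0)
```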
From a feature extraction perspective, discriminant functions provide an interesting avenue for integrating information from the data through data dependent transformations of the input features. An example of a linear discriminant function is Fisher's linear discriminant. In this method, instead of using the linear combination of the input vector to form a hyperplane for class assignment, the linear combination is used as a dimensionality reduction technique. The weight vector w is designed as a set of basis functions that projects the feature vector x to a lower dimension such that there is maximal separation between class means while the variance within each class is minimized. A common criterion used for this objective is defined as

F(w) = trace(S_w^{-1} S_b),    (1.5)

where S_w and S_b are the within-class and between-class covariance matrices of the data. If the dimensionality of the new projection space is M, the weight vectors can be shown to be the set of basis functions corresponding to the M eigenvectors of S_w^{-1} S_b with the largest eigenvalues [12].
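The eigenvalue formulation of (1.5) can be sketched on synthetic two-class data; the data, dimensions and class separation below are invented for illustration.

```python
import numpy as np
from numpy.linalg import inv, eig

def lda_basis(X, y, M):
    """Basis maximizing trace(Sw^{-1} Sb): top-M eigenvectors of Sw^{-1} Sb."""
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                   # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)      # between-class scatter
    vals, vecs = eig(inv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]                 # largest eigenvalues first
    return vecs.real[:, order[:M]]                      # (N, M) projection basis

# Hypothetical 2-class data in 2-D, separated along the first axis
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.1, (50, 2)),
               rng.normal([3.0, 0.0], 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = lda_basis(X, y, M=1)[:, 0]   # recovered direction lies along the first axis
```

For two classes, S_b has rank one, so a single discriminant direction carries all of the between-class separation.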
More powerful discriminant functions can be designed by using non-linear basis functions. In feed-forward neural networks, which are classic examples of these models, the generalized linear discriminant function is modified as

f(x, w) = g( Σ_{k=1}^{K} w_k φ_k(x) ),    (1.6)

where g(.) is a non-linear activation function and φ_k is now a non-linear basis function. During the training phase, both the basis functions and the weights are adjusted using the training data [13].
In a two-layer neural network, for example, processing starts by creating linear combinations of the N-dimensional feature vector at each of the K hidden layer units. With each of the hidden nodes connected to every input node through a set of weights, an activation input of the form

a_k = Σ_{n=1}^{N} w_{nk} x_n + w_{k0},    (1.7)

is first produced at each node. Each node activation then passes through a differentiable, non-linear activation function ψ(.) to produce output activations b_k = ψ(a_k). Commonly used activation functions are non-linear sigmoidal functions like the logistic sigmoid or the 'tanh' function. The weight w_{nk} is a trainable parameter connecting input node n and hidden node k; w_{k0} is the fixed bias term of the hidden node. Activation outputs of the hidden layer are then linearly combined again to form output unit activations. Each of the M output nodes receives an activation input

a_m = Σ_{k=1}^{K} w_{km} b_k + w_{m0}    (1.8)

to produce an output of the form c_m = σ(a_m), where σ(.) is the 'softmax' activation function defined as

σ(a_m) = exp(a_m) / Σ_{m'} exp(a_{m'}),    (1.9)

for multi-class classification problems. Using (1.7)-(1.9), the overall network function can be written as

h_m(x, w) = σ( Σ_{k=0}^{K} w_{km} ψ( Σ_{n=0}^{N} w_{nk} x_n ) ).    (1.10)

Comparing (1.6) with (1.10) shows how the non-linear basis functions ψ(.) are now also learnt, like the weight parameters. There are different training algorithms to learn these parameters. In commonly used training methods, model parameters are optimized using a cross-entropy error criterion and techniques like error back-propagation. For speech applications, multilayer perceptrons (MLPs) can be used to estimate posterior probabilities of speech classes like phonemes, conditioned on the input features [3, 14].
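Equations (1.7)-(1.10) amount to the following forward pass. The dimensions and random weights below are hypothetical, and tanh stands in for ψ(.).

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """Two-layer MLP forward pass, mirroring Eqs. (1.7)-(1.10):
    hidden activations through tanh, outputs through softmax."""
    a = W1 @ x + b1                  # Eq. (1.7): hidden activation inputs
    b = np.tanh(a)                   # hidden outputs b_k = psi(a_k)
    o = W2 @ b + b2                  # Eq. (1.8): output activation inputs
    e = np.exp(o - o.max())          # Eq. (1.9): softmax (numerically stable)
    return e / e.sum()               # estimated class posteriors

# Hypothetical dimensions: 4-D input, 5 hidden units, 3 output classes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)
post = mlp_posteriors(rng.standard_normal(4), W1, b1, W2, b2)
```

Training would adjust W1, b1, W2, b2 by back-propagating the cross-entropy error, which is omitted here.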
1.4 Review of data-driven feature transforms for ASR
Both the transforms reviewed above - transforms with linear basis functions and transforms with non-linear basis functions - form starting points for the development of more complex data-driven feature transforms and acoustic model back-ends in speech recognition. Although transforms like the discrete Fourier transform and the discrete cosine transform have been used, neither of these transforms is data-driven. There has hence been considerable interest in improving these front-ends with more powerful data-driven techniques.

Figure 1.1: Broad Classification of Feature Transforms for ASR.

Figure 1.1 is a schematic of how data-driven feature extraction or transformation techniques for ASR can be broadly classified. There are clearly two distinct sets of transformation classes - one set of transforms is strongly tied to the feature extraction module, while the second is strongly coupled with the acoustic model and its training criteria. We call the first class front-end feature transforms and the second class back-end feature transforms.
1.4.1 Front-end feature transforms
Data-driven feature extractors at the front-end operate directly on time-frequency representations of speech. As shown in Figure 1.1, these transforms can be further categorized into two broad groups - data independent projections and data-driven projections. Examples of data independent projections are the DCT transforms discussed earlier. Although these are a set of fixed cosine basis functions, they are very similar to basis functions that can be derived from a direct principal component analysis (PCA) [15] on the auditory spectrum of speech. Principal component analysis, or the Karhunen-Loeve transform (KLT), is a mathematical procedure that transforms a set of observations of possibly correlated variables into a new set of values corresponding to linearly uncorrelated variables or principal components. Figure 1.2 (reproduced from [16]) shows a set of spectral basis functions derived using the data-dependent KLT on filter bank outputs from 2 hours of speech from the OGI Stories database [17]. The basis functions are very similar to the cosine functions used in conventional features. The flatness of the first basis function shows that variation in the average energy is what contributes most to the variance of auditory representations.
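A minimal sketch of deriving such a KLT/PCA spectral basis follows. The synthetic frames are invented: each frame carries a shared energy offset across all bands, which reproduces the flat first basis function noted above.

```python
import numpy as np

def pca_basis(S, k):
    """KLT/PCA basis of spectral frames S (frames x bands):
    eigenvectors of the band covariance matrix, sorted by eigenvalue."""
    C = np.cov(S, rowvar=False)                 # band-by-band covariance
    vals, vecs = np.linalg.eigh(C)              # symmetric matrix: use eigh
    order = np.argsort(vals)[::-1]              # largest variance first
    return vals[order], vecs[:, order[:k]]      # top-k spectral basis functions

# Hypothetical "auditory spectrum" frames: 1000 frames x 15 bands, where a
# common per-frame energy offset dominates the variance.
rng = np.random.default_rng(0)
frames = rng.standard_normal((1000, 15)) + 5.0 * rng.standard_normal((1000, 1))
vals, basis = pca_basis(frames, k=6)
first = basis[:, 0]     # nearly constant across bands, i.e. "flat"
```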
LDA using the Fisher discriminant criteria described earlier has been used as a
useful tool in the development of many techniques in the second class of projections - data-
dependent projections. This class is sub-divided further into two groups - a set of transforms
that use linear basis derived by solving a generalized eigenvalue decomposition problem
Figure 1.2: Spectral basis functions derived using PCA on the bark-spectrum of speech from the OGI stories database - eigenvalues of the KLT basis, total covariance matrix projected on the first 8 KLT vectors, and the first 6 KL spectral basis functions derived by PCA analysis.
and those which use neural network based techniques with non-linear basis functions. In
early work, Brown [18] and Hunt [19] used LDA on features in speech recognition.
Hunt and his colleagues integrated LDA with Mel-auditory representations of speech in
a framework they called IMELDA - the integrated Mel-scale representation with LDA [19,20]. A
host of techniques have since been developed based on using LDA with HMM based speech
recognizers to improve recognition performances. These techniques have focused on the use
of different types of output classes like phones, subphones or HMM states, and on addressing
the limitations of LDA - class-conditional distributions are assumed to be normal with equal
covariance matrices. Apart from improving recognition performances, a series of work
by Malayath, van Vuuren, Valente and Hermansky [21–23] have analyzed the usefulness
of LDA with phonemes as output classes. Table 1.1 summarizes their key observations from
using LDA with different time-frequency representations of speech. All these techniques,
while decorrelating the input feature vectors, also maximize the class separability of the
desired output classes, leading to improvements in the recognition performances of ASR
systems.
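As a concrete illustration of the Fisher criterion underlying these techniques, the following numpy sketch builds within- and between-class scatter matrices and solves the generalized eigenvalue problem for the discriminant directions. The two-class synthetic data is purely illustrative and not tied to any of the cited systems:

```python
import numpy as np

def lda_basis(features, labels, n_dims):
    """Fisher LDA: maximize between-class over within-class scatter."""
    mean = features.mean(axis=0)
    d = features.shape[1]
    Sw = np.zeros((d, d))    # within-class scatter
    Sb = np.zeros((d, d))    # between-class scatter
    for c in np.unique(labels):
        Xc = features[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenvalue problem: Sb v = lambda Sw v
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:n_dims]].real.T

# Two synthetic "phoneme" classes separated along the first dimension
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (500, 8)),
               rng.normal(0.0, 1.0, (500, 8)) + np.eye(8)[0] * 3.0])
y = np.repeat([0, 1], 500)
W = lda_basis(X, y, 1)
proj = X @ W.T    # 1-D projection with well-separated class means
```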
Time-Frequency Representation: Short-time Fourier spectrum - LDA is applied to the log-spectra of speech.
Observations: Discriminant vectors have a non-uniform analysis resolution with frequency - low frequency parts of the spectrum are analyzed with higher resolution than high frequency parts. This is consistent with the properties of Mel/Bark filter-bank analysis used in conventional feature extraction techniques. Consistent with the properties of hearing, the sensitivity of features derived using these functions is inversely related to formant frequencies.

Time-Frequency Representation: Critical-band spectrum - LDA is applied to critical band spectral features.
Observations: Unlike the first cosine function, the total energy of the spectrum is not used. The second and third discriminants capture spectral ripples in the central portion of the critical-band spectrum. The fourth basis uses information above 5 Bark. Figure 1.3 (reproduced from [22]) shows the important basis functions.

Time-Frequency Representation: Trajectories of critical-band energies - LDA is applied to long segments of time trajectories.
Observations: Discriminant vectors form a set of FIR filters. The frequency responses of the first three discriminant vectors are consistent with the RASTA, delta and double-delta features used in ASR. Figure 1.4 (reproduced from [21]) compares the basis functions with the RASTA, delta and double-delta filters proposed by Furui [24].

Table 1.1: LDA with different representations of speech.
While the PCA and LDA techniques described above are useful in describing
transforms in the Euclidean space, manifold based techniques characterize data as being
embedded in a manifold space [25–27]. Several generic manifold learning techniques have
been adapted for speech data. While learning the manifold structure, several
of these techniques also model both global and local relationships between data points in
the manifold space as constraints. These learning problems are usually solved as optimization
Figure 1.3: LDA-derived spectral basis functions of the critical band spectral space derived from the OGI Numbers corpus.
problems or as generalized eigenvector problems.
The second important class of front-end transforms uses neural networks. For acoustic
modeling, multilayer perceptrons (MLP) based systems are trained on different kinds of
feature representations of speech to estimate posterior probabilities of output classes like
phonemes, conditioned on the input features [14]. Neural network based acoustic models
provide several key advantages -
Training criteria - Neural networks are trained to discriminate between output classes
using non-linear basis functions, with a cross-entropy training criterion. This training
can also be scaled efficiently to large amounts of training data.
Figure 1.4: (a) Frequency and impulse responses of the first three discriminant vectors derived by applying LDA on trajectories of critical-band energies from the clean Switchboard database, (b) frequency and impulse responses of the RASTA filter and the RASTA filter combined with the delta and double-delta filters.
Input feature assumptions - These networks can model high dimensional input features
without any strong assumptions about the probability distribution of these features.
Several different kinds of correlated feature streams can also be integrated together
since there are also no strong assumptions on statistical independence.
Output representations - MLPs trained on large amounts of data from a diverse collection of speakers and environments can achieve invariance to these unwanted variabilities. Since posterior probabilities are produced by these networks, outputs from
several networks trained on different feature representations can be combined in a
multi-stream fashion to improve the final posterior estimations.
In hybrid HMM/MLP systems [3], these posterior probabilities are used directly
as the scaled likelihoods of sound classes in HMM states instead of conventional state-
emission probabilities from GMM models (discussed in detail in Chapter 2). Alternatively,
these posteriors can be converted to features that replace conventional acoustic features
in HMM/GMM based systems via the Tandem technique [28] (also discussed in detail in
Chapter 2). Features from intermediate layers of neural networks have also been shown to
be useful for speech recognition [29,30].
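A minimal sketch of the Tandem conversion mentioned above (log transform followed by a KLT/PCA decorrelation) might look as follows; random stand-in posteriors are used here in place of real MLP outputs:

```python
import numpy as np

def tandem_features(posteriors, n_dims):
    """Tandem post-processing: log transform (Gaussianization),
    then a PCA/KLT projection for decorrelation and reduction."""
    logp = np.log(posteriors + 1e-10)        # floor to avoid log(0)
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    proj = vecs[:, ::-1][:, :n_dims]         # top-variance KLT directions
    return centered @ proj

# Stand-in posteriors over 5 classes for 1000 frames (not real MLP output)
rng = np.random.default_rng(2)
raw = rng.gamma(1.0, size=(1000, 5))
post = raw / raw.sum(axis=1, keepdims=True)
feats = tandem_features(post, 4)
```

The resulting feature covariance is diagonal by construction, which suits the diagonal-covariance GMMs used in most HMM/GMM back-ends.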
Pinto et al. [31, 32] use a Volterra series based analysis to understand the behavior of
the non-linear transforms that are learned by MLPs trained to estimate phoneme posterior
probabilities. The linear Volterra kernels used to analyze MLPs trained on Mel-filter bank
features reveal interesting spectro-temporal patterns learnt by the trained system for each
phoneme class. An extended study on a hierarchy of MLPs using the same framework
shows that when a second MLP classifier is trained on posteriors estimated by an initial
MLP, it learns phonetic temporal patterns in the posterior features. These patterns include
phonetic confusions at the output of the first MLP as well as phonotactics of the language
learnt from the training data.
1.4.2 Back-end feature transforms
As shown in Figure 1.1, acoustic features after front-end level transforms are used
to train acoustic models. The distribution of each basic speech sound, like a phone, is typically
represented by a Hidden Markov Model (HMM). Phone HMMs are constructed as finite state
machines with typically five states - a start state, three emitting states and an end state,
connected in a simple left-to-right topology. In each of the emitting states, multivariate
continuous density Gaussian mixture models are used to model the emission probability
distribution of feature vectors. To cover the large phonetic variability, separate HMMs
are trained for every basic speech unit, typically a phone, in context with a left and right
neighboring phone. Individual Gaussian parameters along with the mixing coefficients of the
Gaussian mixture models are estimated in a maximum likelihood framework [2]. However,
since the number of trainable tri-phone parameters is huge, additional techniques like
state-tying with phonetic decision trees are used. In a second stage of training, the acoustic
models are then discriminatively trained using objective functions such as maximum mutual
information (MMI) [33, 34], minimum phone error (MPE) [35] or minimum classification
error (MCE) [35]. To improve the performance in each of these two passes of acoustic
model training, separate feature transforms which adapt features to each of the training
phases have been proposed. This set of transforms forms the second major class of feature
transforms, called back-end feature transforms.
In the past, linear discriminant analysis has been investigated in several different
settings - to process feature vectors [18], as a transform to improve the discrimination
between HMM states [36], and also as a feature rotation and reduction technique in a maximum
likelihood setting [37]. Kumar and Andreou generalized LDA with Heteroscedastic linear
discriminant analysis (HLDA) [38] by relaxing the assumption of sharing the same covari-
ance matrix among all output classes. Also developed in a maximum likelihood setting, the
Maximum Likelihood Linear Transform (MLLT) [39] has been shown to be a special case
of HLDA when there is no dimensionality reduction.
Feature space transforms like fMMI [40] and fMPE [41], on the other hand, are
linear transforms applied on feature vectors in a discriminative framework to optimize
the MMI/MPE objective functions. Similar to the early work in [42], region dependent
linear transforms (RDLT) [43] extend fMMI/fMPE by first partitioning the feature space
into different regions using a GMM. Each feature vector is then transformed by a linear
transform corresponding to the region that the vector belongs to, via posterior probabilities
from the pre-trained GMM.
State-of-the-art systems use a combination of both the front-end and back-end
transforms. Studies like [44] have shown that although these transforms are separately
applied at the feature and model level, they can be combined to significantly improve ASR
performances.
1.5 Focus of the thesis
The feature extraction module plays a crucial “gate-keeper” role in any pattern
recognition task. If a poorly designed feature extractor discards information from the signal
that is useful for classification, it cannot be recovered and the classification task suffers.
On the other hand, if the feature extraction module allows irrelevant and redundant detail
to remain in the features, the classification module has to be additionally developed to cope
with this. In speech recognition, a similar setting exists - a feature
extraction front-end first produces features for a pattern recognition back-end to recognize
words. To improve the performances in this setting, this thesis focuses on developing better
features for ASR through an efficiently designed front-end.
The review presented above describes one avenue of improvement for current
speech recognition feature front-ends - the development of better data-driven features.
Figure 1.5 illustrates this. The primary goal of speech recognition is to extract the message
that the human communicator produced using an inventory of basic speech units. However,
the message is embedded among several other constituent components of the speech signal
as it passes through a communication channel influenced by the human speaker, the
transmission mechanism and the environment before it is captured by a machine using a microphone.
It is the goal of the feature extractor module to remove these irrelevant variabilities while
extracting useful features for the speech recognition back-end to recover the message.
Current speech recognition front-ends largely rely on information in the short-term spectrum
of speech. This representation is however very fragile and easily corruptible by channel
artifacts. It is hence necessary to extend the scope of information extraction to other
sources of knowledge. The best source of information is the data itself. This thesis therefore
focuses on data-driven techniques to improve features for ASR.
In earlier sections, several techniques that allow data integration into feature ex-
traction were reviewed. Neural networks provide very interesting mechanisms of integrating
information not only because they are discriminatively trained and use non-linear basis func-
tions to transform the data but also because they have been shown to have several other
Figure 1.5: Thesis contributions to developing better data-driven neural network features for the ASR pipeline.
key advantages. For example, they can accommodate large feature dimensions and do not
place strong assumptions on the distributions of these features. A very significant advantage is that they can also directly produce posterior probabilities of speech classes, making
the posteriogram representation of speech - the evolution of the posteriors of speech classes
like phonemes over time - a useful source of information for speech recognition (see Figure
1.5). As can be seen, this representation is devoid of speaker and channel variabilities and is
linked more closely to the underlying speech message encoded using basic speech units like
phonemes.
The performance of these data-driven feature extractors is however linked to several factors. The MLP estimates posterior probabilities of phoneme classes ci conditioned
on the input acoustic features x and the model parameters w as p(ci|x,w). The factors
that hence determine the goodness of the posteriogram representation are -
(a) The input acoustic features: Robust acoustic features which capture information from
the rich spectro-temporal modulations of speech need to be designed.
(b) The amount of training data: Significant amounts of task dependent data need to be
used to train the parameters of neural network models.
(c) Network architectures: Suitable network architectures have to be used to learn the
data-driven transforms.
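The posterior computation p(ci|x,w) above can be sketched as a forward pass through a small MLP. The dimensions (39 input features, 100 hidden units, 40 phoneme classes) and the random weights are purely illustrative:

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """Forward pass of a single-hidden-layer MLP estimating phoneme
    posteriors p(c_i | x, w): sigmoid hidden layer, softmax output."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # non-linear hidden basis
    z = h @ W2 + b2
    z -= z.max(axis=-1, keepdims=True)          # softmax numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative random weights: 39-dim acoustic frame, 40 phoneme classes
rng = np.random.default_rng(3)
W1, b1 = rng.normal(0, 0.1, (39, 100)), np.zeros(100)
W2, b2 = rng.normal(0, 0.1, (100, 40)), np.zeros(40)
p = mlp_posteriors(rng.normal(size=(5, 39)), W1, b1, W2, b2)
print(p.sum(axis=1))   # each frame's posteriors sum to 1
```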
1.6 Outline of Contributions
The thesis contributes to improvements in each of the above mentioned factors in
developing better data-driven feature front-ends (Figure 1.5).
(a) Exploiting temporal dynamics of speech: We adopt a novel signal processing tech-
nique based on Frequency Domain Linear Prediction (FDLP) to better model sub-band
temporal envelopes of speech. Features from these representations are used to build
data-driven feature front-ends (Chapter 2). In experiments on a variety of ASR tasks -
from a small vocabulary continuous digit recognition task to a large vocabulary continuous speech recognition (LVCSR) task, the proposed data-driven features provide about
14% relative reduction of word error rate (WER).
(b) Working with limited amounts of training data: With significant amounts of training
data the proposed data-driven features perform well (Chapter 2). However, in
several real-world scenarios this is not always the case. In the development of ASR
technologies for new languages and domains, for example, very few hours of transcribed
data are available initially. We hence focus on data-driven features in low resource
scenarios where only up to 1 hour of transcribed task dependent data is available to train
acoustic models. As with every data-driven technique, the performance of these feature
extractors also diminishes in such conditions.
In Chapters 3 and 4 we propose techniques to alleviate these effects. Our proposed
techniques are based on the use of task independent data. In many cases, these sources
of data cannot be used directly. For example, if data from different languages is used to
build ASR systems for a new language, differences in the phone sets used to transcribe each
language come into play. We propose techniques to deal with these kinds of issues in training
data-driven front-ends.
(c) Neural network architectures for data-driven front-ends: We demonstrate the
use of several neural network architectures to allow task independent data to be used.
Using data transcribed with different phone sets from different languages, these im-
provements allow better neural network models to be built. Our contributions lead to
an absolute WER reduction of about 15% on a low-resource task with only 1 hour of
transcribed training data for acoustic modeling.
(d) Applications of data-driven features: Several new applications of the proposed
data-driven front-ends are presented, apart from using them to generate features for
ASR (Chapter 5). These applications include speech activity detection in noisy environments, speaker verification using neural networks, term discovery in zero-resource
settings and event detectors for speech recognition.
1.7 Thesis Organization
This thesis is organized as follows:
1. Chapter 2 is an overview of different feature extraction techniques for ASR and
introduces a set of new features using Frequency Domain Linear Prediction. The
usefulness of these features is demonstrated in a series of ASR experiments.
2. In Chapter 3, we discuss a key weakness of data-driven ASR acoustic modeling
techniques - the performance of these systems drops significantly with only a few hours
of transcribed training data. We show how this can be compensated for using the
proposed data-driven front-ends, which are themselves affected in these scenarios. The
proposed approaches are based on the use of multilingual task independent data.
3. Chapter 4 extends the training approaches introduced in Chapter 3 by employing
wider and deeper neural network architectures in low resource settings. Typically
these kinds of networks cannot be trained well with only few hours of transcribed
training data. We however show how task independent data can be used in these
settings as well.
4. Chapter 5 discusses four different applications of data-driven front-end outputs - as
features for speech activity detection, probabilities of broad phonetic classes to model
parts of each speaker's acoustic space in a neural network based speaker verification
system, feature representations for zero-resource applications and event detectors for
speech recognition.
5. Chapter 6 summarizes the thesis.
Chapter 2
From Acoustic Features to
Data-driven Features
This chapter introduces a novel acoustic feature extraction technique for ASR. Data-driven
front-ends are developed using these features and evaluated on different ASR tasks. Significant
improvements are demonstrated by using the proposed features with neural network based front-ends.
2.1 Time-Frequency Representations of Speech
Conventional feature extraction techniques start with the short-term spectrum of
speech - a representation derived by applying the Fourier transform on short segments
of speech. Typically the short-term analysis is performed using a 25 ms Hamming window
every 10 ms. Although speech is a non-stationary signal, over sufficiently short time inter-
vals, the signal can be considered stationary. In each of these analysis windows, the power
spectrum - the squared magnitude of the short-term Fourier spectrum - is then computed, before
being processed along two dimensions - across frequency and across time. The processing
across frequency attempts to model the gross shape of the spectrum. Temporal dynamics
of the spectrum are, on the other hand, captured by the processing across time.
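The short-term analysis described above (25 ms Hamming windows every 10 ms, squared-magnitude Fourier transform) can be sketched as follows; the FFT size of 512 is an illustrative choice:

```python
import numpy as np

def power_spectrum(signal, fs, win_ms=25, hop_ms=10, nfft=512):
    """Short-term power spectrum: windowed frames, squared |FFT|."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=nfft, axis=1)
    return np.abs(spec) ** 2    # squared magnitude per frame

# 1 second of a 1 kHz tone at 16 kHz: 98 frames x 257 frequency bins
fs = 16000
t = np.arange(fs)
frames = power_spectrum(np.sin(2 * np.pi * 1000 * t / fs), fs)
print(frames.shape)
```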
2.1.1 Processing across the frequency axis
Processing across the frequency axis has two primary objectives. Through a se-
quence of steps the resolution of the spectrum is first modified to be non-uniform instead
of the inherent uniform Fourier transform resolution. The non-uniform resolution has been
shown to be useful for discriminating between basic speech sounds [22]. The spectrum is also
smoothened to capture only its gross shape and remove any rapidly varying fluctuations.
In Perceptual Linear Prediction (PLP) [10], the first objective is achieved through
a set of operations motivated by human auditory perception which convert the power spec-
trum to an auditory-like spectrum. These steps include -
• Using a filter-bank of trapezoidal filters, to warp the power spectrum to a Bark fre-
quency scale. Outputs of these integrators are consistent with the notion of integration
of signal energy in critical bands in the human ear [45].
• Emphasizing each sub-band frequency signal using a scaling function based on the
equal-loudness curve of hearing. This operation has an equivalent effect of pre-
emphasis in the time domain. Pre-emphasis is performed to remove the overall spectral
slope of the spectrum and DC component of the speech signal.
• Compressing the sub-band signals using the cubic root function. This step is moti-
vated by the power law of hearing that relates intensity and perceived loudness.
The gross shape of the auditory spectrum of speech is finally approximated using an auto-
regressive model. The prediction coefficients of this model are obtained via a recursive auto-
correlation based method on the inverse Fourier transform of the auditory spectrum [10]. As
described in the previous chapter, the features for ASR are cepstral coefficients obtained by
projecting the smoothened auditory-like representation onto a set of discrete cosine trans-
form (DCT) basis functions. Based on the source-filter interpretation of LPC, smoothing
the spectrum using LPC allows the features to capture vocal tract filter properties which
are useful in characterizing speech sounds. Apart from its decorrelation and dimensionality
reduction properties, truncating the DCT coefficients also removes higher order coefficients
that capture speaker specifics in the spectrum. This further smooths the spectrum.
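The final steps above can be sketched as follows; this simplified illustration keeps only the cube-root compression and the truncated DCT projection, and omits the auto-regressive (LPC) smoothing stage of true PLP:

```python
import numpy as np

def cepstra_from_auditory(aud_spec, n_ceps=13):
    """Simplified PLP-style finish: cube-root compression of an
    auditory spectrum, then projection onto DCT bases, keeping only
    the lowest-order (smoothest) coefficients."""
    compressed = np.cbrt(aud_spec)                 # power law of hearing
    n = len(compressed)
    k = np.arange(n)
    # DCT-II basis functions: fixed cosine bases over the bands
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n))
    return basis @ compressed                      # truncation smooths

# A toy 20-band auditory spectrum (positive values)
aud = np.abs(np.random.default_rng(4).normal(size=20)) + 1.0
c = cepstra_from_auditory(aud)
```

Note that the zeroth coefficient is simply the sum of the compressed bands, i.e. an overall energy term, consistent with the flat first basis function discussed in Chapter 1.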
2.1.2 Processing across the time axis
Just as the gross shape of the spectrum is useful in characterizing speech sounds,
temporal dynamics of the spectrum are also key to classification. Several important observations [46] useful in capturing these dynamics include the facts that -
• speech is produced at a typical rate by vocal tract movements, and the rate of change
of non-linguistic components is usually outside this range, and
• human perception is more sensitive to relative changes than absolute quantities.
Traditionally, temporal dynamics have been captured through first and second order
time derivatives of cepstral coefficients [11]. These operations can also be interpreted as
filtering operations that enhance components around 10 Hz of the modulation spectrum of
speech while suppressing other higher and lower components. In the RASTA processing of
speech, the above mentioned observations are explicitly integrated into the PLP pipeline by
filtering the temporal trajectories of the spectrum to suppress constant factors while pre-
serving components of the modulation spectrum between 1 and 12 Hz [46]. In an extension
to the RASTA technique, a bank of bandpass filters with varying resolutions has also been
developed in [47] to process the modulation spectrum.
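The first-order time derivatives mentioned above are conventionally computed with a regression over a few neighboring frames; a numpy sketch using the standard +/-2 frame window:

```python
import numpy as np

def delta(cepstra, N=2):
    """First-order time derivatives of cepstral coefficients via the
    standard regression formula over +/- N neighboring frames."""
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n:len(cepstra) + N + n] -
                    padded[N - n:len(cepstra) + N - n])
               for n in range(1, N + 1)) / denom

# A linear ramp in every coefficient has constant slope 1 in the interior
ramp = np.arange(10.0)[:, None] * np.ones((1, 3))
d = delta(ramp)
```

Second-order (double-delta) features are obtained by applying the same operation to the first-order output, matching the filtering interpretation given above.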
2.1.3 Integrating speaker and channel invariance
As illustrated in Figure 1.5, the speech signal is often modified by channel and
speaker characteristics before it is processed by the feature extraction module. It is hence
necessary to compensate for these artifacts as well.
Differences in vocal tract anatomy lead to significant variability in the spectrum
between speakers and genders. Other extrinsic characteristics that produce speaker
variabilities are the socio-linguistic background and emotional state of the speakers. While
some of these artifacts are compensated for by techniques like PLP, techniques such as vocal
tract length normalization (VTLN) [48] are often used by state-of-the-art feature extraction
techniques.
Effects from the channel or environment are usually modeled as additive or convo-
lutive distortions. When the speech signal is corrupted by additive noise, the recorded
speech signal is expressed as
ns[m] = cs[m] + n[m], (2.1)
where ns[m], cs[m], n[m] are discrete representations of the noisy speech, clean speech and
the corrupting noise respectively. If the speech and noise are assumed to be uncorrelated,
in the power spectral domain we can write
Pns(m,ωk) = Pcs(m,ωk) + Pn(m,ωk), (2.2)
where Pns(m,ωk), Pcs(m,ωk), Pn(m,ωk) are the short-term power spectral densities at frequency ωk of the noisy speech, clean speech and noise respectively. Conventional feature
extraction techniques for ASR estimate the short-term (10-30 ms) power spectral density
(PSD) of speech on a Bark or Mel scale. Hence, most of the recently proposed noise robust
feature extraction techniques apply some kind of spectral subtraction in which an estimate
of the noise PSD is subtracted from the noisy speech PSD. The estimate of noise PSD
is usually computed using a speech activity detector from regions likely to contain only
noise (for example the ETSI front-end [49]). A survey of many other common techniques
is available in [50].
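Following Eq. 2.2, a minimal power-spectral-subtraction sketch looks like this; the noise PSD is assumed to have been estimated already (e.g. by a speech activity detector), and a spectral floor keeps the estimate non-negative:

```python
import numpy as np

def spectral_subtraction(noisy_psd, noise_psd, floor=0.01):
    """Subtract an estimated noise PSD from the noisy-speech PSD
    (Eq. 2.2 rearranged), flooring the result to avoid negative power."""
    clean_est = noisy_psd - noise_psd
    return np.maximum(clean_est, floor * noisy_psd)

# Toy frame: clean speech power plus stationary noise power
clean = np.array([4.0, 9.0, 1.0, 0.2])
noise = np.array([0.5, 0.5, 0.5, 0.5])
est = spectral_subtraction(clean + noise, noise)
# With a perfect noise estimate, est recovers the clean PSD exactly;
# in practice estimation errors make the floor essential.
```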
The second class of distortions are convolutive distortions introduced by room
reverberations when speech is recorded using a distant microphone or by telephone com-
munication channels. If the channel effect or room reverberation can be characterized as a
channel impulse response or room impulse response, the noisy speech can be written as
ns[m] = cs[m] ∗ r[m], (2.3)
where ns[m], cs[m], r[m] are the noisy speech, clean speech and the corrupting room or
channel impulse response respectively. These kinds of convolutive distortions are multiplicative
in the spectral domain and additive in the log-spectral domain. However these assumptions
hold true only in appropriate analysis windows.
In cepstral mean subtraction (CMS), the channel impulse response is assumed to
be shorter than the short-term Fourier transform (STFT) analysis window. If the artifact
is assumed to be constant in the short analysis window, its effect appears as an offset term
in the final cepstral representation. If the channel remains the same for each recorded ut-
terance, the artifact can then be removed by subtracting the mean of the cepstral sequences
corresponding to the utterance. State-of-the-art systems, in addition to cepstral mean
normalization (CMN), also perform a variance normalization to improve robustness to these
distortions [51]. Usually this is done on a per speaker basis as well, to achieve additional
speaker normalization. In the log-DFT mean normalization technique [52], mean subtrac-
tion is done on a linear frequency scale instead of a warped scale as in CMS since the
assumption that response functions might have a constant value in each critical band is not
always valid.
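The per-utterance cepstral mean (and variance) normalization described above is a one-liner; in this sketch, adding a constant channel offset to every frame leaves the normalized features unchanged, which is exactly the convolutive-distortion removal CMS is designed for:

```python
import numpy as np

def cmvn(cepstra, variance_norm=True):
    """Per-utterance cepstral mean (and variance) normalization.
    Subtracting the mean removes a constant convolutive channel offset;
    variance normalization additionally equalizes the dynamic range."""
    out = cepstra - cepstra.mean(axis=0)
    if variance_norm:
        out = out / (cepstra.std(axis=0) + 1e-10)
    return out

# A constant channel offset added to every frame vanishes after CMVN
rng = np.random.default_rng(5)
utt = rng.normal(size=(200, 13))        # toy 13-dim cepstra, 200 frames
channel = np.full(13, 3.0)              # constant log-spectral offset
a, b = cmvn(utt), cmvn(utt + channel)
```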
In reverberant environments, to make a similar mean subtraction effective, it is
necessary to estimate the log-spectrum in much longer analysis windows. This is because
room impulse responses, characterized by their T60 reverberation times, usually range
between 200-800 ms. T60 denotes the amount of time required for the reverberant signal to
reduce by 60 dB from the initial direct component value. Successful techniques like long-
term spectral subtraction (LTLSS) [53] hence use analysis windows as long as 2 seconds to
deal with these artifacts depending on the nature of the reverberation.
2.2 Towards Improved Features for ASR
After the time-frequency representation of speech has been processed as described
above, acoustic features for speech recognition are derived from the representation. The
most common feature representations are cepstral vectors of the processed power spectral
envelope derived using 20-30 ms analysis windows every 10 ms. These features are then
typically augmented with time derivatives (first, second and third derivatives). In some sys-
tems instead of using the time derivatives, 9-21 successive frames are concatenated together
and used after a projection to a lower dimension using various transforms [54–56].
Aggregating information from only such a limited temporal context could however
be a reason for lower ASR performances compared to human recognition performances [57].
This argument is further strengthened by information theoretic results showing that features
from longer time intervals (up to several hundred milliseconds) are useful for better
discrimination between speech sounds [58]. This limitation has been addressed using several different
feature extraction/signal processing techniques discussed below.
2.2.1 Long-term acoustic features
Through a number of studies it has been shown that speech perception is sensitive
to relatively slow modulations of the temporal envelope of speech [59, 60]. Most of the
energy in the modulation spectrum peaks around 4 Hz, which also corresponds to the
syllabic rate of speech. Although these components are affected in the presence of noise [61, 62],
modifying modulation components in the 1-16 Hz range results in significant degradation
of speech intelligibility [59, 60].
Information from the modulation spectrum can be derived from a spectral analysis
of temporal trajectories of spectral envelopes of speech [63]. However, in order to achieve
sufficient spectral resolution at the low modulation frequencies described above,
relatively long segments of the speech signal need to be analyzed. For example, to capture
modulation spectrum components around 4 Hz, an analysis window of at least 250 ms is
necessary. Analysis windows of this length are also consistent with the time intervals of
co-articulation - a speech production phenomenon, forward masking - an auditory perception
phenomenon, and the linguistic concept of the syllable [64]. By deriving features for ASR using
these kinds of analysis windows, information about the dynamics of spectral components
is explicitly captured.
In [65, 66], 1 second long temporal trajectories of individual critical sub-band en-
ergies were used for phoneme recognition experiments. In this multi-stream framework,
separate neural network classifiers were trained on long-term features from each sub-band
before being combined by a second level neural network. Since features from
each sub-band were used independently, the comparable performance of this feature
extraction technique with conventional short-term spectral features demonstrates that there
is significant information captured in the local temporal dynamics. These temporal
pattern features (TRAPS) have been extended in different configurations (for example [67])
as modulation features after applying a cosine transform [68] or filtering using modulation
filters [47].
2.2.2 Parametric models of temporal envelopes
The modulation features discussed above are extracted from sub-band energies
of speech using long analysis windows. The sub-band energies are not directly modeled
but are instead produced with an inherent limited resolution as outputs of Bark/Mel scale
integrators on the power spectrum in short analysis windows every 10 ms (see Section
2.1.1). For more effective features that capture the evolution of the temporal envelopes, it
is necessary to directly model the temporal envelopes.
As described in Section 2.1.1, conventional feature extraction techniques use LPC
in time to effectively capture spectral resonances. Based on duality properties, LPC can
similarly be performed in the frequency domain to directly model and capture important
temporal events. This framework is based on the notion that speech can be considered to
be composed of several amplitude modulated signals at different carrier frequencies. The
AM component of each of these signals is the squared magnitude of their corresponding
analytical signals. The squared magnitude of the analytical signal is also called the Hilbert
envelope and is a description of temporal energy. Instead of computing the analytic signal
directly, an auto-regressive modeling approach can be used. This modeling approach also
called Frequency Domain Linear Prediction (FDLP) is the dual of conventional time domain
linear prediction used to model the power spectrum of speech [69,70]. Instead of modeling
the power spectrum, FDLP models the evolution of signal energy in the time domain
by the application of linear prediction in the frequency domain using the discrete cosine
transform of the signal. This parametric model can be used as an alternate technique to
directly model sub-band envelopes of speech [71,72].
2.2.3 Neural network features
The modulation features described in the earlier sections are typically high-dimensional,
correlated features. Both of these properties prevent them from being used directly
with ASR systems. These features have hence been used in conjunction with neural
networks, which have much more relaxed assumptions on feature distributions. As described
in the previous chapter neural networks can be trained to estimate posterior probabilities
of speech classes. These probabilities can then be used directly as scaled likelihoods in the
hybrid HMM-ANN ASR framework.
Another approach to using neural network posterior outputs, is to convert the
posteriors to features similar to traditional acoustic features for ASR systems. In the Tan-
dem processing approach [28], posterior features from neural networks are post-processed
to be decorrelated and to have an approximately normal distribution. This is done in a
two-step procedure - a log transform is first applied to the posteriors to Gaussianize the
vectors, followed by a dimensionality reduction using the KL transform. Several other approaches
have been proposed to derive features from the outputs of neural networks. In the HATS
technique [73] non-linear outputs from the penultimate layer of a network have been used.
This has been further extended to deriving features from an intermediate bottleneck layer
which reduces the feature dimension as well [30].
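The Tandem post-processing described above can be sketched as follows; the small flooring constant and the 25-dimensional output are illustrative assumptions:

```python
import numpy as np

def tandem_postprocess(posteriors, out_dim=25):
    """Tandem post-processing sketch: log transform to Gaussianize MLP
    posteriors, then a KL (PCA) transform to decorrelate and keep the
    top-variance components."""
    logp = np.log(posteriors + 1e-10)                 # Gaussianize
    logp = logp - logp.mean(axis=0)
    cov = np.cov(logp, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)                # KL transform basis
    order = np.argsort(evals)[::-1][:out_dim]         # highest-variance directions
    return logp @ evecs[:, order]
```

The projected features are decorrelated by construction, which suits the diagonal-covariance Gaussians used in HMM-GMM systems.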
2.2.4 Combination of information from multiple streams
A key benefit from the development of long-term features is the significant
LVCSR gains obtained from combining these features with conventional short-term
features [73]. The best combination of features is obtained by first training neural networks
using both the long-term modulation features and short-term spectral energy based features
separately. The outputs of the neural networks are then combined using a merger neural
network or using different combination rules before being used as data-driven features for
LVCSR tasks [74]. As discussed in [75], this approach is useful for several reasons -
• The MLP features derived from neural networks trained on conventional short-term
spectral features and long-term modulation features capture complementary information
about phone classes.
• Although the MLPs are trained on different inputs, since they have the same target
classes the complementary outputs can be effectively combined.
• During the training phase the neural networks are able to discriminatively learn
class boundaries and produce data-driven features that are useful for classification
of sounds. These features are also relatively speaker invariant.
• After the application of post-processing techniques like Tandem, the data-driven neu-
ral network features can easily be modeled by HMM-GMM based LVCSR systems.
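The stream combination just described can be sketched with the simple sum and product rules; these are standard alternatives to a trained merger network and stand in for, rather than reproduce, the specific combination rules of [74]:

```python
import numpy as np

def merge_posteriors(p1, p2, rule="product"):
    """Combine posterior streams from two MLPs trained on different
    features but the same phoneme targets (simple rule-based sketch)."""
    if rule == "sum":
        merged = 0.5 * (p1 + p2)                     # average the two streams
    elif rule == "product":
        merged = p1 * p2                             # reward agreement
    else:
        raise ValueError("unknown rule: %s" % rule)
    return merged / merged.sum(axis=1, keepdims=True)  # renormalize rows
```

A merger network would instead be trained on the concatenated stream posteriors to predict the same phoneme targets.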
2.3 Novel Short-term and Long-term Features for Speech
Recognition
We propose a novel feature extraction scheme along the lines of the techniques
described above, to derive two kinds of features - short-term spectral features and long-
term modulation features for ASR. The technique starts by creating a two-dimensional
auditory spectrogram representation of the input signal. This is formed by stacking sub-
band temporal envelopes in frequency instead of stacking short-term spectral estimates in
time.
The sub-band temporal envelopes are obtained by analyzing speech using Fre-
quency Domain Linear Prediction (FDLP). The FDLP technique, as described earlier, fits
an all pole model to the Hilbert envelope of the signal (See Figure 2.1). These representa-
tions of the speech signal are able to capture fine temporal events associated with transient
Figure 2.1: Illustration of the all-pole modeling property of FDLP. (a) a portion of the speech signal, (b) its Hilbert envelope, (c) all-pole model obtained using FDLP.
events like stop bursts while at the same time summarizing the signal's gross temporal
evolution [76]. Short-term features are derived by integrating the auditory spectrogram in
short analysis windows. Long-term modulation frequency components are obtained after
the application of the cosine transform on compressed (static and adaptive compression)
sub-band temporal envelopes.
2.3.1 FDLP based time-frequency representation
The FDLP time-frequency representation is created through the following steps [72] -
(a) Change of processing domain - The FDLP spectrogram is a 2 dimensional time-frequency
representation of speech constructed by stacking sub-band temporal envelopes of a
speech signal across frequencies. Each of these temporal envelopes corresponds to a
sub-band frequency signal. To facilitate this, the speech signal is first projected into
the frequency domain via the DCT transform.
(b) Analysis of speech into sub-band frequency signals - Sub-band frequency signals are
obtained by windowing the DCT transform using a set of overlapping Gaussian windows
usually placed on a Bark or Mel scale.
(c) Computation of auto-correlation coefficients via a series of dual operations of time do-
main linear prediction (TDLP) - Among the many approaches, one way of applying
TDLP is using the auto-correlation of the time signal. The auto-correlation coeffi-
cients are in turn derived from the power spectrum since the power spectrum and
auto-correlation of the time signal form Fourier transform pairs. In the FDLP case, the
Hilbert envelope and the auto-correlation of the DCT signal form Fourier transform
pairs.
Since the sub-band DCT signals have already been derived in the previous step, their
auto-correlation coefficients can be computed. We start by computing the squared
magnitude of the inverse discrete Fourier transform (IDFT) of the DCT signal. The
application of a second Fourier transform produces the desired auto-correlation coefficients.
(d) Application of linear prediction - By solving a system of linear equations, the auto-
regressive model of each sub-band Hilbert envelope is finally derived from the auto-
correlation coefficients. Using the set of prediction coefficients {ai} the estimated
Hilbert envelope in each sub-band, HE_s, can be represented as

HE_s(n) = G / |∑_{i=0}^{p} a_i e^{−j2πin}|²    (2.4)
The parameter G is called the gain of the model. In [77], by normalizing the gain G, the
estimated sub-band envelopes have been shown to become robust to convolutive distortions
like reverberation and telephone channel artifacts. Additional robustness to additive
distortions through short-term subtraction of a noise estimate has also been shown in [78].
There are several parameters that control the temporal resolution of the estimated envelopes
as well as the type and extent of analysis windows for different applications. These have
been elaborated in [72].
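Steps (a)-(d) can be sketched for a single sub-band as follows; the linear window placement, window width, model order, and transform sizes here are illustrative assumptions rather than the settings elaborated in [72]:

```python
import numpy as np
from scipy.fft import dct, fft, ifft
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, band, n_bands=20, order=40):
    """Sketch of steps (a)-(d): an all-pole estimate of one sub-band
    Hilbert envelope via linear prediction on the DCT of the signal."""
    N = len(x)
    X = dct(x, type=2, norm='ortho')                  # (a) move to the DCT domain
    # (b) one Gaussian window on a linear scale picks out a sub-band
    centres = np.linspace(0, N, n_bands + 2)[1:-1]
    width = N / (2.0 * n_bands)
    w = np.exp(-0.5 * ((np.arange(N) - centres[band]) / width) ** 2)
    Xb = X * w
    # (c) Hilbert envelope and the autocorrelation of the DCT signal form
    # a Fourier transform pair: |IDFT|^2, then a second transform
    env = np.abs(ifft(Xb, n=2 * N)) ** 2
    r = np.real(fft(env))[:order + 1]
    # (d) solve the (Toeplitz) normal equations for the prediction coefficients
    a = solve_toeplitz(r[:order], -r[1:order + 1])
    a = np.concatenate(([1.0], a))
    G = r[0] + np.dot(a[1:], r[1:order + 1])          # model gain (prediction error)
    A = fft(a, n=2 * N)
    return G / np.abs(A[:N]) ** 2                     # HE_s(n) = G / |A(n)|^2
```

Stacking the envelopes returned for all bands gives the FDLP "auditory spectrogram" used in the next sections.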
Figure 2.2 shows the PLP and FDLP spectrograms for a portion of speech. Critically
spaced sub-band energies of speech are derived in short analysis windows in the PLP
case. The representation is hence smooth across frequencies in each analysis window.
Individual sub-bands of speech are directly modeled in the FDLP technique, resulting in
better temporal resolution - for example, the transient regions are well captured in this
representation. Two kinds of features are derived from the two-dimensional time-frequency
representation of speech formed by sub-band temporal envelopes derived using FDLP.
Figure 2.2: PLP (b) and FDLP (c) spectrograms for a portion of speech (a).
2.3.2 Short-term Features
In conventional feature extraction techniques like PLP, the power spectrum is
first integrated using Mel/Bark integrators in short analysis windows to create sub-band
trajectories of spectral energy. In the FDLP time-frequency representation, instead of the
sub-band trajectories of spectral energy, identical distributions of energy in the time domain
(sub-band Hilbert envelopes) are estimated. Short-term cepstral features can be derived
from these representations.
This is done by first integrating the envelopes in short term analysis Hamming
windows (of the order of 25 ms with a shift of 10 ms). The integrated sub-band energies
are then converted to cepstral coefficients by applying the log transform and taking the
DCT transform across the spectral bands in each of the frames. For most applications we
use 13 cepstral coefficients. First and second derivatives of these cepstral coefficients are
also appended to form a 39 dimensional feature vector [79,80], similar to conventional PLP
features. In [72, 81], a set of FDLP modeling parameters that improve the performance
of these short-term features for ASR in noisy environments has been identified. These
parameters and their effects are summarized in Table 2.1. In all these experiments, both
clean and noisy reverberant test data are evaluated on models trained with clean speech.
Gain Normalization - Gain normalization significantly improves feature robustness in reverberant environments [77,82]. Using rectangular analysis windows on a Mel scale for sub-band decomposition also contributes to robustness by reducing the mismatch between the clean and noisy reverberant data.

Number of Sub-bands - Increasing the spectral resolution improves robustness in reverberant conditions. The assumptions made for gain normalization are more valid with an increased number of sub-bands. In reverberant conditions, using up to 96 linear bands has been shown to be useful [77].

Model Order - The model order relates to the model's ability to capture sufficient detail of the envelopes. In clean conditions, a higher model order is useful. A lower model order is however better in reverberant conditions [72,81].

Envelope Expansion - Envelope expansion relates to how the all-pole model models the peaks and valleys of the Hilbert envelope. While envelope expansion is useful in noisy environments to capture dominant reliable peaks, no significant gains are observed in clean conditions [72,81].

Table 2.1: FDLP model parameters that improve robustness of short-term spectral features.
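Given stacked sub-band envelopes such as those produced by FDLP, the short-term cepstral pipeline can be sketched as follows; the 8 kHz envelope sampling rate and the gradient-based delta computation are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct

def short_term_cepstra(envelopes, env_rate=8000, win_ms=25, shift_ms=10, n_ceps=13):
    """Short-term features from stacked sub-band envelopes: integrate in
    ~25 ms Hamming windows every 10 ms, take logs, DCT across bands, and
    append delta and double-delta features."""
    win = int(env_rate * win_ms / 1000)
    shift = int(env_rate * shift_ms / 1000)
    ham = np.hamming(win)
    n_bands, n = envelopes.shape
    frames = []
    for start in range(0, n - win + 1, shift):
        seg = envelopes[:, start:start + win] * ham           # windowed energy
        e = np.log(seg.sum(axis=1) + 1e-10)                   # integrated log energy
        frames.append(dct(e, type=2, norm='ortho')[:n_ceps])  # cepstra across bands
    c = np.array(frames)
    d1 = np.gradient(c, axis=0)                               # simple deltas
    d2 = np.gradient(d1, axis=0)
    return np.hstack([c, d1, d2])                             # 39-dimensional vectors
```

With 13 cepstra plus deltas and double-deltas, each frame yields the 39-dimensional vector described in the text.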
2.3.3 Long-term Features
In techniques like TRAPS and MRASTA described earlier, modulation frequency
features are derived by analyzing temporal trajectories of spectral energy estimates in indi-
vidual sub-bands using long analysis windows. As described earlier, since FDLP estimates
the temporal envelope in sub-bands, modulation features can be derived from these
envelopes as well [79].
Before we derive the long-term features, we compress the sub-band temporal en-
velopes both statically and dynamically. The envelopes are compressed statically using the
logarithmic function. Dynamic compression of the envelopes is achieved using an adapta-
tion circuit which consists of five consecutive nonlinear adaptation loops proposed in [83].
These loops are designed so that sudden transitions in the sub-band envelope that are fast
compared to the time constants of the adaptation loops are amplified linearly at the out-
put, while the steady state regions of the input signal are compressed logarithmically. The
compressed temporal envelopes are then transformed using the Discrete Cosine Transform
(DCT) in long-term windows (200 ms long, with a shift of 10 ms). We use 14 modulation
frequency components from each cosine transform, yielding a modulation spectrum in the
0-35 Hz range with a resolution of 2.5 Hz [84]. The static and dynamic modulation frequency
components of each critical band are then stacked together before being used as features.
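The static-compression stream of this pipeline can be sketched for one critical band as below; the envelope is assumed to be sampled at a 100 Hz frame rate, so a 200 ms window is 20 samples and 14 DCT coefficients span roughly 0-35 Hz at 2.5 Hz resolution. The adaptive stream would replace the log with the five adaptation loops of [83], which are omitted here:

```python
import numpy as np
from scipy.fft import dct

def static_modulation_features(envelope, frame_rate=100, win_ms=200, n_coeffs=14):
    """Static modulation features for one critical band: log-compress the
    sub-band envelope and apply a DCT in 200 ms windows with a one-frame
    (10 ms) shift, keeping 14 coefficients (~0-35 Hz in 2.5 Hz steps)."""
    win = int(win_ms * frame_rate / 1000)            # 200 ms -> 20 samples at 100 Hz
    logenv = np.log(envelope + 1e-10)                # static (log) compression
    feats = [dct(logenv[s:s + win], type=2, norm='ortho')[:n_coeffs]
             for s in range(0, len(logenv) - win + 1)]
    return np.array(feats)
```

Stacking the static and adaptive streams across all critical bands gives the per-frame modulation vector used for MLP training.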
In [85], the proposed modulation features have been compared with other similar
modulation feature techniques - Modulation Spectrogram (MSG) [86], MRASTA [47] and
Fepstrum [87]. In these experiments, FDLP based modulation features are significantly better than
features derived from the other approaches. An additional set of FDLP modeling parameters
that improve the performance of these long-term features for ASR have also been identified
based on a set of phoneme recognition experiments. These parameters and their effects are
summarized in Table 2.2.
Modulation analysis window - The analysis window used to derive the modulation coefficients can be varied. The best recognition performance was obtained using a window of 200 ms, which also corresponds to the syllabic rate of speech.

Extent of modulations - The number of DCT coefficients can be varied to change the extent of the modulation spectrum. The best range was found to be using 14 DCT coefficients covering the 0-35 Hz range.

Type of modulation spectrum - As described earlier, two kinds of compression schemes are used for the modulation features. While the static log modulation features improve the phoneme recognition performance on fricatives and nasals, the dynamic adaptive-loop based features help in better recognition of plosives and affricates [85]. A combination of both these features provides significant improvements for all classes [88].

Table 2.2: FDLP model parameters that improve performance of long-term modulation features.
Figure 2.3: Schematic of the joint spectral envelope and modulation features for posterior-based ASR. FDLP sub-band envelopes yield auditory (spectral envelope) features and, after static and adaptive compression, modulation features; each stream feeds a posterior probability estimator, and the merged posteriors are Tandem-processed into features for ASR.
2.3.4 Data-driven Features
These acoustic features are converted into data-driven features by first using
them to train two separate 3-layer multilayer perceptrons to estimate posterior prob-
abilities of phoneme classes. Each frame of the short-term spectral envelope features is
used with a context of 9 frames during training. As described earlier, static and dynamic
modulation frequency features of each critical band are stacked together and used to train
a separate MLP network. The spectral envelope and modulation frequency features are
then combined at the phoneme posterior level using the Dempster-Shafer (DS) theory of
evidence [74]. These phoneme posteriors are first Gaussianized by using the log function
and then decorrelated using the Karhunen-Loeve Transform (KLT) [28]. This reduces the
dimensionality of the feature vectors by retaining only the feature components which con-
tribute most to the variance of the data. We use 25 dimensional features in our Tandem
representations similar to [75]. Figure 2.3 shows the schematic of the proposed feature
extraction technique.
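Putting the steps of this section together, the feature computation can be sketched as follows. The stream combination is shown as a normalized product - what Dempster's rule of combination reduces to when belief mass is placed only on singleton phoneme classes - which is a simplification of the DS scheme of [74]:

```python
import numpy as np

def data_driven_features(post_spec, post_mod, out_dim=25):
    """End-to-end sketch: merge spectral-envelope and modulation phoneme
    posteriors with a normalized product (a simplification of the DS
    combination in [74]), Gaussianize with a log, and decorrelate and
    reduce to 25 dimensions with a KL (PCA) transform."""
    joint = post_spec * post_mod                       # agreement between streams
    merged = joint / joint.sum(axis=1, keepdims=True)  # renormalize conflicts away
    logp = np.log(merged + 1e-10)                      # Gaussianize
    logp = logp - logp.mean(axis=0)
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ vt[:out_dim].T                       # Tandem features
```

The SVD here plays the role of the Karhunen-Loeve Transform: its top right-singular vectors span the highest-variance directions of the log posteriors.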
2.4 Speech Recognition Experiments and Results
We perform a set of experiments using Tandem representations of the proposed
spectral envelope and modulation frequency features along with other state-of-the-art fea-
tures for ASR. These include a phoneme recognition task, a small vocabulary continuous
digit recognition task and a large vocabulary continuous speech recognition (LVCSR) task.
For each of these experiments, we train three layered MLPs to estimate phoneme posterior
probabilities using these features. The proposed features are compared with three other
feature extraction techniques - PLP features [10] with a 9 frame context which are similar
to spectral envelope features derived using FDLP (FDLP-S), M-RASTA features [47] and
Modulation Spectro-Gram (MSG) features [86] with a 9 frame context, which are both
similar to modulation frequency features (FDLP-M). We combine FDLP-S features with
FDLP-M features using the DS theory of evidence to obtain a joint spectro-temporal fea-
ture set (FDLP-S+FDLP-M). Similarly, we derive two more feature sets by combining PLP
features with M-RASTA features (PLP+M-RASTA) and MSG features (PLP+MSG). 25
dimensional Tandem representations of these features are used for our experiments. We also
experiment with 39 dimensional PLP features without any Tandem processing (PLP-D).
2.4.1 Phoneme Recognition
Our first experiment is to validate the usefulness of Tandem representation of our
features for a phoneme recognition task using HMMs. We perform experiments on the
TIMIT database, excluding ‘sa’ dialect sentences. All speech files are sampled at 16 kHz.
The training data consists of 3000 utterances from 375 speakers, the cross-validation data
set consists of 696 utterances from 87 speakers, and the test data set consists of 1344
utterances from 168 speakers. The TIMIT database, which is hand-labeled using 61 labels, is mapped to
the standard set of 39 phonemes [89]. A three layered MLP is used to estimate the phoneme
posterior probabilities. The network, consisting of 1000 hidden neurons and 39 output
neurons (with softmax nonlinearity) representing the phoneme classes, is trained using the
standard back-propagation algorithm with the cross-entropy error criterion. The learning rate
and stopping criterion are controlled by the error in the frame-based phoneme classification
on the cross validation data.
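A minimal numpy sketch of such a posterior-estimating MLP follows; the hidden layer is smaller than the 1000 units used in the text, a tanh hidden nonlinearity is assumed, and the cross-validation-based learning-rate schedule is omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class PosteriorMLP:
    """Three-layer phoneme posterior estimator: input layer, one hidden
    layer (tanh), and a 39-way softmax output, trained by
    backpropagation on the cross-entropy criterion."""
    def __init__(self, n_in, n_hidden=100, n_out=39, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        self.h = np.tanh(x @ self.W1 + self.b1)       # hidden activations
        return softmax(self.h @ self.W2 + self.b2)    # phoneme posteriors

    def train_step(self, x, y, lr=0.1):
        p = self.forward(x)
        g2 = (p - y) / len(x)                         # softmax + cross-entropy grad
        gh = (g2 @ self.W2.T) * (1.0 - self.h ** 2)   # backprop through tanh
        self.W2 -= lr * self.h.T @ g2
        self.b2 -= lr * g2.sum(axis=0)
        self.W1 -= lr * x.T @ gh
        self.b1 -= lr * gh.sum(axis=0)
        return float(-np.mean(np.sum(y * np.log(p + 1e-10), axis=1)))
```

At test time, the rows of `forward` are the frame-level posterior probabilities that feed the Tandem post-processing.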
The Tandem representation of each feature set is used along with a decision tree
clustered triphone HMM with 3 states per triphone, trained using standard HTK maximum
likelihood training procedures. The emission probability density in each HMM state is mod-
eled with 11 diagonal covariance Gaussians. We use a simple word-loop grammar model
using the same standard set of 39 phonemes. Table 2.3 shows the results for phoneme recog-
nition accuracies across all individual phoneme classes for these techniques. The proposed
features (FDLP-S+FDLP-M) significantly improve the recognition accuracy compared to
the baseline PLP-D feature set.
Table 2.3: Phoneme Recognition Accuracies (%) for different feature extraction techniques on the TIMIT database
Features Phoneme Rec. Acc. (%)
PLP-D 68.3
PLP 70.1
FDLP-S 70.1
M-RASTA 66.8
MSG 65.1
FDLP-M 70.6
PLP+M-RASTA 71.2
PLP+MSG 71.4
FDLP-S+FDLP-M 72.5
2.4.2 Small Vocabulary Digit Recognition
In our second experiment, we use these features on a small vocabulary continuous
digit recognition task (OGI Digits database) to recognize eleven (0-9 and 'oh') digits with 28
pronunciation variants [47]. MLPs are trained using these features to estimate posterior
probabilities of 29 English phonemes using the whole Stories database plus the training
part of the Numbers95 database, with approximately 10% of the data held out for cross-validation.
Tandem representations of the features are used along with a phoneme-based HMM system with
22 context-independent three-state phoneme HMMs, each model distribution represented
by 32 Gaussian mixture components [47]. Table 2.4 shows the results for word recognition
accuracies. For this task, the proposed spectral envelope features (FDLP-S) and modulation
Table 2.4: Word Recognition Accuracies (%) on the OGI Digits database for different feature extraction techniques
Features Word Recog. Acc. (%)
PLP-D 95.9
PLP 96.2
FDLP-S 96.6
M-RASTA 96.3
MSG 96.0
FDLP-M 96.8
PLP+M-RASTA 97.1
PLP+MSG 97.0
FDLP-S+FDLP-M 97.1
frequency features (FDLP-M) improve word recognition accuracies compared to PLP and
M-RASTA features respectively.
2.4.3 Large Vocabulary Continuous Speech Recognition
In our third experiment, we use these features on an LVCSR task using the AMI
LVCSR system for meeting transcription [90]. The training data for this system uses indi-
vidual headset microphone (IHM) data from four meeting corpora: NIST (13 hours), ISL
(10 hours), ICSI (73 hours) and a preliminary part of the AMI corpus (16 hours). MLPs
are trained on the whole training set in order to obtain estimates of phoneme posteriors for
each of the feature sets. Acoustic models are phonetically state tied triphone models trained
using standard HTK maximum likelihood training procedures. The recognition experiments
Table 2.5: Word Recognition Accuracies (%) on RT05 Meeting data for different feature extraction techniques. TOT - total word recognition accuracy (%) for all test sets; AMI, CMU, ICSI, NIST, VT - word recognition accuracies (%) on individual test sets
Features TOT AMI CMU ICSI NIST VT
PLP-D 58.1 57.6 60.6 68.7 49.1 53.6
PLP 53.6 59.1 56.3 70.0 45.3 34.9
FDLP-S 57.5 58.4 58.5 66.9 48.4 54.5
M-RASTA 54.6 53.3 58.4 63.2 46.6 51.0
MSG 55.6 56.1 59.3 65.5 47.9 47.7
FDLP-M 60.5 62.3 66.3 60.6 54.6 58.3
PLP+M-RASTA 59.5 59.5 62.2 71.5 51.1 52.1
PLP+MSG 60.4 61.2 60.7 72.7 53.4 52.4
FDLP-S+FDLP-M 64.1 63.8 65.8 72.2 57.1 61.0
are conducted on the NIST RT05 [91] evaluation data. The AMI-Juicer large vocabulary
decoder is used for recognition with a pruned trigram language model [92]. This is used
along with reference speech segments provided by NIST for decoding and the pronuncia-
tion dictionary used in AMI NIST RT05s system. Table 2.5 shows the results for word
recognition accuracies for these techniques on the RT05 meeting corpus. The proposed
features (FDLP-S+FDLP-M) obtain significant relative improvements for the LVCSR task
compared to the other feature representations.
Table 2.6: Recognition Accuracies (%) of broad phonetic classes obtained from confusion matrix analysis
Class PLP FDLP-S M-RASTA FDLP-M PLP + FDLP-S +
M-RASTA FDLP-M
Vowel 85.3 84.9 82.4 85.7 86.1 87.3
Diphthong 78.2 79.1 74.2 76.8 78.4 79.8
Plosive 83.8 82.8 81.6 84.1 84.6 85.4
Affricative 73.5 74.4 68.6 75.6 72.9 78.0
Fricative 85.8 85.9 83.5 86.8 86.4 88.0
Semi Vowel 76.2 74.9 72.9 77.1 77.8 79.0
Nasal 84.2 82.8 80.4 84.9 85.8 86.6
Avg. 81.0 80.7 77.7 81.6 81.7 83.4
2.5 Conclusions
In this chapter, we proposed a framework for deriving data-driven features for
ASR. The framework uses four key elements -
• A linear prediction technique that models sub-band temporal envelopes of speech -
We outlined the steps involved in building these auto-regressive models. We also
showed that this technique based on FDLP can capture important details in speech
that conventional techniques do not capture.
• Two kinds of acoustic features - a short-term spectral feature and a long-term modula-
tion feature. Table 2.6 shows the results for phoneme recognition accuracies across all
individual phoneme classes for the proposed techniques using the TIMIT database.
The FDLP-S features provide results comparable to the PLP features. The modulation
features (FDLP-M) result in better broad class recognition rates for all the broad
phonetic classes compared to other modulation features.
• A combination of the feature streams at the phoneme posterior level - From Table
2.6, the joint spectral envelope and modulation features yield improved broad class
recognition in all cases compared to the baseline systems.
• Data-driven processing of these features with neural networks followed by Tandem
post-processing allows these features to be used for ASR systems. In all our experi-
ments, Tandem representations of the proposed features improve ASR accuracies over
other features.
In the following chapters we will use this data-driven framework in many other
scenarios. The key scenario is a low-resource setting where the amount of training data
is limited, unlike the ASR settings assumed in this chapter where the amount of training
data is not restricted. We devise techniques to improve the effectiveness of the proposed
front-ends in those settings.
Chapter 3
Data-driven Features for
Low-resource Scenarios
This chapter presents two novel techniques for building data-driven front-ends in
low-resource settings with very limited amounts of transcribed data for acoustic model train-
ing. Both techniques improve performance in low-resource settings by using data from
multiple languages, circumventing issues with the different phone sets used in each language.
3.1 Overview
In LVCSR systems, an important factor that impacts performance is the amount
of available transcribed training data. When LVCSR systems are built for new languages
or domains with only a few hours of transcribed data, the performance is lower. To improve
performance, unlabeled data from this new language or domain has been used to increase
the size of the training set [93]. This is done by first recognizing the unlabeled data and
incrementally adding reliable portions to the original training set. For these self-training
techniques to be effective, a low error rate recognizer is required to annotate the unlabeled
data. However, in several scenarios like ASR systems for new languages, recognizers built
using limited amounts of training data have very high error rates. Additional improvements
are hence not easily achieved via these techniques.
Another potential solution to this problem is to use transcribed data available from
other languages to build acoustic models which can be shared with the low-resource language
[94,95]. However training such systems requires all the multilingual data to be transcribed
using a common phone set across the different languages. This common phone set can
be derived either in a data driven fashion or using phonetic sets such as the International
Phonetic Alphabet (IPA) [96]. More recently, cross-lingual training with Subspace Gaussian
Mixture Models (SGMMs) [97,98] has also been proposed for this task.
An alternative approach to this problem moves the focus from using the shared
data to build acoustic models, to training data-driven front-ends. The key element in
this data-driven approach is a multi-layer perceptron (MLP) which is trained on large
amounts of task independent data. In [99, 100], a task independent approach has been
used to first train MLPs with large amounts of data. Features derived from these nets
are then shown to reduce the requirement of task specific data to train subsequent HMM
stages. In these experiments, although the task specific data comes from the same language
as the task independent data, the data sources are collected in different domains. More
recently this approach has been shown useful also in cross-domain and cross-lingual LVCSR
tasks [75,101]. In [101], Tandem features trained on English CTS data are shown to improve
performance when used in other domains (meeting data) within the language and even
in other languages (Mandarin and Arabic). Even though MLPs are trained on different
phone sets in different languages, Tandem features are able to capture common phonetic
distinctions among languages and improve performance of conventional acoustic features.
In this chapter, we investigate two approaches to building neural network based
data-driven front-ends in low-resource settings. We assume the availability of only 1 hour
of transcribed task specific data to train the acoustic models. To improve over the poor
performance of acoustic models using conventional features in these settings, we use data-
driven feature front-ends that integrate the following additional sources of information -
(a) Multilingual task independent data - Transcribed data from languages other than
the target language is first used to train initial neural network models. These task-
independent models are then adapted using limited amounts of task-specific data.
(b) Multiple feature representations - Significant gains were demonstrated in the previous
chapter using different feature representations. We show how these features can be
effective also in low-resource settings.
One of the key problems in training neural network systems using data from multiple
domains is differences in how the data sources are transcribed. Although there are
phoneme sets like the IPA which can be used to uniformly label data across languages, only
a few data sources are labeled using such sets. This chapter proposes techniques that can be
used to train neural networks in such scenarios.
In low-resource settings, the performance of other modules of the ASR pipeline -
for example, the language model or the pronunciation dictionary - is also affected. We however
focus our attention only on the feature extraction module and acoustic models.
3.2 Training Using a Combined Phone set
In this section we describe a training approach using two data sets - H and L. H
is a task independent data set with significantly more training data than the
low-resource data set L. The two sets are transcribed using different phoneme sets,
which we also denote H and L. We train a neural network system using the following steps -
(a) Train an initial network using data set H - We start by training a multilayer perceptron
(MLP) on the high resource task independent data set. After it has been trained, this
network estimates posterior probabilities of speech sounds in H, conditioned on the
input feature vectors.
(b) Find a mapping between phoneme sets H and L - If the two phoneme sets share the
same phonetic transcription scheme for example the IPA, it is relatively easy to find
such a mapping. However, this is often not the case.
In the proposed training scheme we investigate the use of a data-driven technique based
on an analysis of confusion matrices to find such a mapping. Confusion matrices have
been used in the past to measure the reliability of human speech recognition [102]. More
recently they have also been used to study the performance of ASR systems [103,104].
We start by forward passing the low-resource task specific data L through the MLP
trained on task independent data in step (a) to obtain phoneme posteriors. To
understand the relationship between phonemes, we treat the phoneme recognition system
as a discrete, memory-less, noisy communication channel with the phonemes in L as
source symbols to the system. Using the recognized phonemes belonging to H at the
output of the recognizer as received symbols, confusion matrices that characterize the
data sets are then built.
Each time a feature vector corresponding to phoneme li is passed through the trained
MLP, posterior probabilities corresponding to all phonemes in set H are obtained at
the output of the MLP. We treat each of these posterior probabilities as soft counts
used to populate a phoneme confusion matrix (CM). From a fully populated CM c, the
following counts can be derived. Entry (i, j) of the confusion matrix is the soft-count
aggregate c(i, j) of the total number of times task-specific phoneme li was recognized
as task-independent phoneme hj. The marginal count c(i) of each row is the total number
of times phoneme li occurred in the task-specific data. Similarly, the count c(j) of each
column is the total number of times phoneme hj of the task-independent data set was
recognized. C is the total number of counts in the confusion matrix.
Given such a CM, we would like to find the best map for every phoneme li among the
phones of H based on these counts. A useful information-theoretic quantity that can
be used is the empirical pointwise mutual information [105]; its use in conjunction
with confusion matrices has been shown in [104]. For an input alphabet
A and output alphabet B, using the count-based confusion matrix, the empirical pointwise
mutual information between two symbols ai from A and bj from B is expressed as

I_{AB}(a_i, b_j) = \log \frac{N_{ij} \cdot N}{N_i \cdot N_j},    (3.1)
where $N_{ij}$ is the number of times the joint event $(A = a_i, B = b_j)$ occurs, $N = \sum_{i,j} N_{ij}$
is the total count, and $N_i = \sum_j N_{ij}$ and $N_j = \sum_i N_{ij}$ are the marginal counts.
Using our soft count based confusion matrix between two phone sets H and L, we
similarly define the empirical point wise mutual information between phoneme pairs
(li, hj) as
I(l_i, h_j) = \log \frac{c(i, j) \cdot C}{c(i) \cdot c(j)},    (3.2)
using the quantities defined earlier. For a given task-specific phoneme li we compare
I(li, hj) for all hj ∈ H. Since the total count C and the monotonically increasing log
function are common to every term in this comparison, the simplified count-based measure

J(l_i, h_j) = \frac{c(i, j)}{c(i) \cdot c(j)}    (3.3)

is used instead.
Using this measure, for each label li, the more frequently a particular label hj occurs, the
higher the value of J(li, hj). We hence map each phoneme li in the task-specific phoneme
set to the phoneme hj in the task-independent set which has the highest J(li, hj).
If we can assume that a one-to-one mapping exists between the phoneme sets and that
the cardinality of H is greater than that of L, multiple assignments to the same phoneme
can be avoided. This is done by removing an assigned phoneme from the list of
available phonemes once it has been mapped.
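The soft-count confusion matrix and the mapping based on the measure J of Eq. (3.3) can be sketched in a few lines of NumPy. This is a minimal illustration; the function name `map_phonemes` and the greedy assignment order used to keep the mapping one-to-one are our own, not from the thesis.

```python
import numpy as np

def map_phonemes(posteriors, labels, n_low, n_high):
    """Greedy one-to-one phoneme mapping via a soft-count confusion matrix.

    posteriors : (T, n_high) array of per-frame posteriors from the MLP
                 trained on the high-resource phoneme set H.
    labels     : (T,) array of true low-resource phoneme indices (set L).
    Returns a dict {l_i: h_j} mapping each phoneme in L to one phoneme in H.
    """
    # Accumulate soft counts: c[i, j] += P(h_j | frame) over frames labelled l_i.
    c = np.zeros((n_low, n_high))
    np.add.at(c, labels, posteriors)

    c_i = c.sum(axis=1, keepdims=True)   # row marginals c(i)
    c_j = c.sum(axis=0, keepdims=True)   # column marginals c(j)
    # Count-based measure J(l_i, h_j) = c(i, j) / (c(i) * c(j)), Eq. (3.3).
    J = c / (c_i * c_j + 1e-12)

    mapping, used = {}, set()
    # Assign rows greedily, removing each h_j once mapped so the mapping
    # stays one-to-one (assumes the cardinality of H is at least that of L).
    for li in np.argsort(-J.max(axis=1)):        # most confident rows first
        for hj in np.argsort(-J[li]):
            if hj not in used:
                mapping[int(li)] = int(hj)
                used.add(hj)
                break
    return mapping
```

The greedy order (most confident phonemes first) is one simple way to avoid multiple assignments; the thesis only requires that assigned phonemes are removed from the candidate list.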
(c) Re-transcribe L using a new mapped phone set H - Using the mapping derived using
confusion matrices from above, the task specific data L can now be re-transcribed into
the phone set used to train the initial network.
(d) Adapt the network using data set L - The initial task independent neural network can
now be adapted using the task specific data since it has been mapped to the same phone
set. The neural network is adapted by retraining it using the new data after initializing
it with its original weights.
(e) Extract data-driven features - Posterior features are derived for ASR after Tandem
processing the phoneme posterior outputs of these networks.
3.3 Training Using Multiple Output Layers
In this section we propose a second training technique for training neural network
systems across different data sets without having to map all the data using a common
phoneme set. As before we describe the training approach using two data sets - H and
L. H is a task-independent data set with significantly more training data than the
low-resource data set L. Both H and L are transcribed using different phoneme sets H
and L, with cardinalities h and l respectively. The network is trained using an acoustic
representation with dimension d in the following steps -
(a) Train the MLP on the task independent set H - We start by training a 4 layer MLP
of size d×m1×m2×h on the high resource language with randomly initialized weights.
While the input and output nodes are linear, the hidden nodes are non-linear. While
the dimension of m1 is high, m2 is low dimensional and is known as the ‘bottleneck’
layer. We are motivated to introduce the bottleneck layer to allow the network to learn
a common low dimensional representation among the languages.
Figure 3.1: Schematic of the proposed training technique with multiple output layers. Layers of sizes d, m1 and m2 (the bottleneck layer) are common across data sets; an intermediate output layer of size h is specific to phoneme set H, and the final output layer of size l is specific to phoneme set L, with its weights initialized from a single layer perceptron.
(b) Initialize the network to train on task specific set L - To continue training on the low-
resource data set which has a different phoneme set size, we create a new 4 layer MLP
of size d×m1×m2×l. The first 3 layer weights of this new network are initialized using
weights from the MLP trained on the high resource data set. Instead of using random
weights between the last two layers, we initialize these weights from a separately trained
single layer perceptron. To train the single layer perceptron, non-linear representations
of the low-resource training data are derived by forward passing the data through the
first 3 layers of the MLP. The data is then used to train a single layer network of size
m2×l.
(c) Train the MLP on task specific set L - Once the 4 layer MLP of size d×m1×m2×l
has been initialized, we re-train the MLP on the task specific data. By sharing weights
across data sets the MLP is now able to train better on limited amounts of task specific
data. Figure 3.1 is a schematic of the proposed MLP system.
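The two-stage initialization of steps (a)-(b) can be sketched in NumPy with stand-in random weights for the layers trained on the high-resource set and illustrative layer sizes; names and dimensions here are hypothetical, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Illustrative sizes: d-dim input, m1 hidden, m2 bottleneck, l low-resource phonemes.
d, m1, m2, l = 39, 100, 25, 47
# Stand-ins for the first 3 layers' weights trained on the high-resource set H;
# the task-specific network reuses these and drops H's output layer.
W1 = rng.normal(0.0, 0.1, (d, m1))
W2 = rng.normal(0.0, 0.1, (m1, m2))

def bottleneck(X):
    """Forward pass through the first 3 layers (shared across data sets)."""
    return sigmoid(sigmoid(X @ W1) @ W2)

def init_output_layer(X_low, Y_low, epochs=10, lr=0.5):
    """Train the m2 x l single layer perceptron whose weights initialize the
    connection between the bottleneck and the new output layer."""
    Z = bottleneck(X_low)                       # non-linear representations of L
    W3 = rng.normal(0.0, 0.1, (m2, l))
    for _ in range(epochs):
        P = softmax(Z @ W3)
        W3 -= lr * Z.T @ (P - Y_low) / len(Z)   # cross-entropy gradient step
    return W3
```

After this initialization, the full 4-layer network (W1, W2, W3) would be retrained jointly on the task-specific data, as in step (c).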
(d) Derive data-driven features - The proposed 4 layer MLPs are trained to estimate phoneme
posterior probabilities using the standard back propagation algorithm with the cross en-
tropy error criterion. We derive two kinds of features for LVCSR tasks -
A. Tandem features - These features are derived from the posteriors estimated by
the MLP at the fourth layer. When networks are trained on multiple feature repre-
sentations, better posterior estimates can be derived by combining the outputs from
different systems using posterior probability combination rules. Phoneme posteriors are
then converted to features by Gaussianizing the posteriors using the log function and
decorrelating them by using the Karhunen-Loeve transform (KLT). A dimensionality
reduction is also performed by retaining only the feature components which contribute
most to the variance of the data.
B. Bottleneck features - Unlike Tandem features, bottleneck features are derived as lin-
ear outputs of the neurons from the bottleneck layer. These outputs are used directly
as features for LVCSR without applying any transforms. When bottleneck
features are derived from multiple feature representations, these features are appended
together and a dimensionality reduction is performed using KLT to retain only relevant
components.
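Both post-processing paths can be sketched with NumPy, assuming the KLT is computed as an eigendecomposition of the feature covariance (equivalently, PCA); all function names here are illustrative.

```python
import numpy as np

def klt(features, n_keep):
    """Karhunen-Loeve transform: decorrelate and keep the top n_keep
    components by variance (an eigendecomposition of the covariance)."""
    centred = features - features.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    basis = vecs[:, ::-1][:, :n_keep]         # top-variance directions
    return centred @ basis

def tandem_features(posteriors, n_keep=25):
    """Gaussianize posteriors with log, then decorrelate with the KLT."""
    return klt(np.log(posteriors + 1e-10), n_keep)

def bottleneck_features(streams, n_keep=25):
    """Append bottleneck outputs from several streams, then reduce with KLT."""
    return klt(np.concatenate(streams, axis=1), n_keep)
```

The small constant inside the log guards against zero posteriors; the thesis does not specify such a floor, so it is an implementation assumption.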
3.4 Speech Recognition Experiments and Results
3.4.1 Data sets
We use the English, German and Spanish parts of the Callhome corpora collected
by LDC for our experiments [106–108]. The conversational nature of speech along with high
out-of-vocabulary rates, use of foreign words and telephone channel distortions make the
task of speech recognition on this database challenging. The English database consists of
120 spontaneous telephone conversations between native English speakers. 80 conversations,
corresponding to about 15 hours of speech, form the complete training data. We use 1 hour
of randomly chosen speech covering all the speakers from the complete train set for our
experiments as an example of data from a low-resource language. The English MLPs and
subsequent HMM-GMM systems use this one hour of data. Two sets of 20 conversations,
roughly containing 1.8 hours of speech each, form the test and development sets. Similar to
the English database, the German and Spanish databases consist of 100 and 120 spontaneous
telephone conversations respectively between native speakers. 15 hours of German and 16
hours of Spanish are used as examples of task independent high resource languages for
training the MLPs. Each of these languages uses a different phoneme set - 47 phonemes for
English, 46 for German and 28 for Spanish.
3.4.2 Low-resource LVCSR System
We train a single pass HTK [109] based recognizer with 600 tied states and 4
mixtures per state on the 1 hour of data. We use fewer states and mixtures per state since
the amount of training data is low. The recognizer uses a 62K trigram language model with
an OOV rate of 0.4%, built using the SRILM tools. The language model is interpolated from
individual models created using the English Callhome corpus, the Switchboard corpus [110],
the Gigaword corpus [111] and some web data. The web data is obtained by crawling the
web for sentences containing high frequency bigrams and trigrams occurring in the training
text of the Callhome corpus [97]. The 90K PRONLEX dictionary [112] with 47 phones is
used as the pronunciation dictionary for the system. The test data is decoded using the
HTK decoder - HDecode, and scored with the NIST scoring scripts [91].
3.4.3 Building Data-driven Front-ends using a Common Phoneme Set
We use the steps described in Section 3.2 to build a data-driven front-end for
low-resource settings.
(a) Build a multilingual task independent MLP - We train cross-lingual MLP systems on
data from two other languages - German and Spanish - using a phone set that covers
phonemes from both the languages. We derive spectral envelope and modulation fre-
quency features from 15 hours of German and 16 hours of Spanish data. Even though
these languages have different phonemes from English, they share several common phonetic
attributes of speech. The cross-lingual MLPs capture these attributes from each of
the different feature streams.
(b) Construct the data-driven map for English - One hour of English data is forward passed
using the cross lingual MLP to obtain phoneme posteriors in terms of 52 cross-lingual
phones. The true labels for the English data contain 47 English phonemes. Using the
mapping technique described earlier we then determine to which phone in the German-
Spanish set each English phoneme can be mapped. This one-to-one mapping is created
by associating each English phoneme to the phone which gives the highest count based
score in the German-Spanish set.
(c) Build low-resource MLPs using task specific data - We train a set of low resource MLP
systems for each of the feature streams by adapting the cross-lingual system using 1
Figure 3.2: Deriving cross-lingual and multi-stream posterior features for low-resource LVCSR systems. Spectral envelope and modulation features are each passed through a cross-lingual MLP trained on German and Spanish data and adapted using 1 hour of English data; the resulting posterior streams are merged by a posterior probability merger and Tandem processed into features for ASR.
hour of English data after mapping it to the new phone set. After adapting the nets, we
observe that the systems are able to discriminate better between the phonetic classes of
the low-resource language. The primary challenge in adapting an MLP system using
additional data from a different language is to effectively map the phonetic units of the
new language to the phone set on which the system has already been trained. We
construct this map, as described earlier, between the existing and new language phone
sets. This adaptation allows the systems to capture information about the phonetic classes
of the target language along with common phonetic attributes shared with the other
languages. We adapt each MLP by retraining it using the new data after initializing it
with its original weights.
(d) Extract data-driven features - We use the two FDLP based acoustic streams proposed in
Table 3.1: Word Recognition Accuracies (%) using different Tandem features derived using only 1 hour of English data

39 dimensional PLP features used directly to train HMM-GMM system    28.8
Tandem features derived from PLP features with 9 frame context       28.7
Tandem features derived from FDLP-S features with 9 frame context    29.3
Tandem features derived from 476 dimensional FDLP-M features         27.2
the earlier chapter for our experiments. We derive short-term features (FDLP-S) from
sub-band temporal envelopes, modeled using FDLP by integrating the envelopes in short
term frames (of the order of 25 ms with a shift of 10 ms). These short term sub-band
energies are converted into 13 cepstral features along with their first and second deriva-
tives. Each frame of these spectral envelope features is used with a context of 9 frames
for training an MLP network. To extract modulation frequency features (FDLP-M), we
first compress the sub-band temporal envelopes statically using the logarithmic func-
tion and dynamically with an adaptation circuit consisting of five consecutive nonlinear
adaptation loops. The compressed temporal envelopes are then transformed using the
DCT in long term windows (200 ms long, with a shift of 10 ms). We use 14 modulation
frequency components from each cosine transform, yielding modulation spectrum in the
0-35 Hz range. The static and dynamic modulation frequency features of each sub-band
are stacked together and used to train an MLP network. For telephone channel speech,
Table 3.2: Word Recognition Accuracies (%) using Tandem features enhanced using cross-lingual posterior features

Cross-lingual system                                        FDLP-S  FDLP-M
System 1 - Trained on German data                            30.6    27.9
System 2 - Trained on German and Spanish data                30.9    29.4
System 3 - System 1 adapted with 1 hr of English             32.3    29.9
System 4 - System 2 further adapted with 1 hr of English     33.1    30.2
we use 17 bark spaced bands for extracting these features.
Posterior features from the two acoustic streams (FDLP-M and FDLP-S) are combined
at the posterior level. This allows us to obtain more accurate and robust estimates of
posteriors. Posterior features corresponding to 1 hour of data are Gaussianized, decor-
related and dimensionality reduced to 30 dimensional Tandem features. These features
are used to train the subsequent HMM-GMM system. Figure 3.2 shows the proposed
data-driven front-end.
Table 3.3: Word Recognition Accuracies (%) using multi-stream cross-lingual posterior features

Baseline PLP features                       28.8
Multi-stream cross-lingual Tandem features  36.5
Table 3.1 summarizes the baseline results for our experiments using different fea-
tures with only 1 hour of English data. In our second set of experiments we derive Tandem
features for the 1 hour of English data from the cross-lingual systems. It is clear that systems
built using low amounts of training data perform very poorly. Our subsequent experiments
aim to improve these performances using multi-stream and cross-lingual data. Table 3.2
shows the experiments using Tandem features derived from the spectral envelope and
modulation features with the cross-lingual systems. These experiments show improvements
as more cross-lingual data is used. Adapting the systems with the limited amount of
task-specific language data improves the performance of each system further. As described earlier,
posterior streams derived from two different feature representations are now combined to
derive better representations.
Table 3.3 shows the results of combining the posterior streams from the final cross-
lingual systems (System 4 of Table 3.2) of both feature streams using the Dempster-Shafer
(DS) theory of evidence [74]. The results show significant improvements after combining
posterior streams over the results from individual streams compared to the baseline PLP
system.
3.4.4 Data-driven Front-ends with MLPs Adapted using Multiple Output
Layers
We use an experimental setup similar to the one described in the previous section to demonstrate
the usefulness of the second technique. The primary advantage of this new technique
is that it does not require the multilingual data to be mapped using a common phone set
across various languages.
Training with 2 languages
In our first set of experiments we train a 4 layer MLP system on two languages -
Spanish and English as outlined in Sec. 3.3. We start by training two separate networks
on the task independent language using 16 hours of Spanish. Both these systems have a
first hidden layer of 1000 nodes, a bottleneck layer of 25 nodes and a final output layer
of 28 nodes corresponding to the size of the Spanish phoneme set. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames to train
the first network with architecture - 351×1000×25×28. A second system is trained on 476
dimensional modulation features derived using FDLP. These features correspond to 28 static
and dynamic modulation frequency components extracted from 17 bark spaced bands. This
system has an architecture of 476×1000×25×28. Both the systems are trained using the
standard back propagation algorithm with cross entropy error criteria. The learning rate
and stopping criterion are controlled by the error in the frame-based phoneme classification
on the cross validation data.
After the task independent networks have been trained, the task specific networks
Table 3.4: Word Recognition Accuracies (%) using two languages - Spanish and English
Baseline PLP features 28.8
Tandem features 34.9
Bottleneck features 35.4
to be trained on 1 hour of English are initialized in two stages as discussed in Sec. 3.3. In
the first stage, all weights except the weights between the bottleneck layer and the output
layer are initialized directly from the Spanish network. The second set of weights is
initialized from a single layer network trained on non-linear representations of the 1 hour
of English data derived by forward passing the English data through the Spanish network
till the bottleneck layer. This network has an architecture of 25×47 corresponding to the
dimensionality of the non-linear representations from the bottleneck layer of the Spanish
network and the size of the English phoneme set. These networks are trained on both PLP
and FDLPM features.
Once the networks have been initialized, PLP and FDLPM features derived from 1
hour of English are used to train the new task specific low-resource networks. The networks
trained on PLP and FDLPM features now have an architecture of 351×1000×25×47 and
476×1000×25×47 respectively. 47 dimensional phoneme posteriors from both the networks
are combined using the Dempster Shafer (DS) theory of evidence before deriving the 25
dimensional Tandem set. The 2 sets of 25 dimensional bottleneck features from each of the
networks are appended together before applying a dimensionality reduction to form a final
25 dimensional bottleneck feature vector. Both the Tandem and bottleneck features are
used to train the subsequent low-resource HMM-GMM system on 1 hour of training data.
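For posterior vectors that place belief mass only on singleton phoneme classes (Bayesian basic belief assignments), Dempster's rule of combination reduces to a normalized elementwise product. The following sketch shows only this simplified special case; the scheme in [74] is more general.

```python
import numpy as np

def ds_combine(p1, p2):
    """Combine two posterior streams frame by frame.

    p1, p2 : (T, K) arrays of per-frame phoneme posteriors from two streams.
    With mass only on singleton classes, Dempster's rule reduces to the
    normalized product below; the normalizer discards the conflicting mass.
    """
    prod = p1 * p2
    return prod / prod.sum(axis=1, keepdims=True)
```

The product sharpens posteriors where the two streams agree and suppresses classes supported by only one stream, which is why combining complementary streams improves the estimates.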
Figure 3.3: Tandem and bottleneck features for low-resource LVCSR systems. PLP and FDLP modulation features are fed to networks whose lower layers are common across languages and trained on English, with intermediate output layers trained on Spanish and German; the phoneme posteriors are merged and Tandem processed into 25D Tandem features for ASR, while the two 25D bottleneck outputs are appended and dimensionality reduced to form 25D bottleneck features for ASR.
Table 3.4 shows the results of using the proposed MLP based features. We train
the 1 hour HMM-GMM system on 39 dimensional PLP features (13 cepstral + Δ + ΔΔ
features) as our baseline system.
Training with 3 languages
We extend our training on 2 languages to train a multilingual MLP system on 3
languages - Spanish, German and English. The training procedure starts as outlined earlier
with 16 hours of Spanish. The networks are then initialized to train with the German data
in two stages - with weights from the Spanish system up to the bottleneck layer and with
weights from a single layer network trained on the German data. After the net has been
trained on the German data, we retrain it using the 1 hour of English data. Figure
3.3 is a schematic of the training and feature extraction procedure. Table 3.5 shows the
results of using the proposed MLP based features.
Table 3.5: Word Recognition Accuracies (%) using three languages - Spanish, German and English
Tandem features 35.8
Bottleneck features 37.2
The above results show the advantage of the proposed approach to training MLPs
on multilingual data. Unlike earlier approaches, we are able to train on multiple languages
without using a common phone set among the languages.
3.5 Conclusions
In this chapter we have demonstrated the usefulness of data-driven feature front-
ends over conventional features in low-resource settings. In these settings, data-driven fea-
tures are built using task independent data. However in most cases, this data is transcribed
using different phoneme sets. We have addressed this issue using two methods. Features
extracted using these techniques are used to train LVCSR systems in the low-resource
language. In our experiments, the proposed features provide a relative improvement of about
30% in a low-resource LVCSR setting with only one hour of training data. In the next
chapter we investigate more complex front-ends for these scenarios.
Chapter 4
Wide and Deep MLP Architectures
in Low-resource Settings
Significant improvements in ASR performance have been observed when additional
processing layers have been added to neural network front-ends. To train these additional
parameters, large amounts of training data are also required. This chapter explores how
these additional layers can be incorporated in low-resource settings with only few hours of
task specific training data.
4.1 Overview
In the previous chapter, improvements were observed in low-resource settings by
using multiple feature representations of the acoustic signal. To allow these parallel streams
of information to be trained, task independent data from different languages were used in
Figure 4.1: (a) Wide and (b) Deep neural network topologies for data-driven features. In the wide topology, speech is processed by acoustic feature extraction into parallel feature streams, each feeding an MLP, with interactions between the MLPs via intermediate outputs before feature post-processing yields the data-driven features. In the deep topology, a single feature stream is processed by serially stacked MLPs followed by feature post-processing.
conjunction with simple neural network topologies. In this chapter, in addition to these
parallel feature streams, we explore if more complex neural network architectures which are
currently being used in state-of-the-art ASR systems can also be trained in low-resource
settings.
In [113], these complex neural network architectures have been broadly classified
into two categories - wide networks and deep networks. In wide networks, several parallel
neural network modules that interact with each other are used. On the other hand, in deep
network topologies several interacting neural network layers are stacked one after the other
in a serial fashion. Figure 4.1 illustrates these topologies.
71
CHAPTER 4. WIDE AND DEEP MLP ARCHITECTURES IN LOW-RESOURCESETTINGS
Several wide network topologies have been used for processing long-term modulation
features, for example the architectures used in the TRAPS [66] or HATS [73] frameworks.
In a more recent approach [114], modulation features are first divided into two separate
streams as shown in Figure 4.1. The phoneme posterior outputs of a neural network trained
on high modulations (> 10Hz) are then combined with low modulation features to train
a second network. Tandem processed features from the second network are then used for
ASR.
Hierarchical networks where the outputs of one neural network processing stage are
further processed by a second neural network have been used in [100, 115]. More recently,
Deep Belief Networks with several layers (5-6 hidden layers) have been used in acoustic
modeling. In this approach individual layers of the deep network are usually pre-trained
before being assembled together and trained together [116–118].
In this chapter we discuss techniques to train both these classes of complex net-
works in low-resource settings. Faced with limited amounts of task specific data in these
scenarios we demonstrate the use of task independent data to build these networks.
4.2 Wide Network Topologies
4.2.1 Building the Data-driven Front-ends
We use two kinds of task independent data sources in building the proposed front-
end with wide network topologies -
(a) Up to 20 hours of data from the same language collected for a different task. Although
this data has a different genre, it has similar acoustic channel conditions as the low
Figure 4.2: Data-driven front-end built using data from the same language but from a different genre. Speech from the low-resource setting is converted to PLP acoustic features and passed through an MLP trained on N hours ({1/2/5/10/15/20} hours) of same-language, different-genre data; the resulting posterior features are used for LVCSR/spoken term detection.
resource data.
(b) 200 hours of data from a different language but with similar acoustic channel conditions.
We build two kinds of front-ends on varying amounts of these task independent training
data.
1. A monolingual front-end trained on varying amounts of data from the same language as
the low-resource task. As shown in Figure 4.2, we train different configurations of this
front-end on 1 to 20 hrs of data (N hours). The primary advantage of this kind of a
front-end is that even though the genre is different, the MLP learns useful information
that characterizes the acoustics of the language. This improves as the amount of training
data increases. For our current experiments we also choose task independent data from
similar acoustic conditions as the low resource setting. Features generated using this
front-end are hence enhanced with knowledge about the language and have unwanted
variabilities from the channel and speaker removed. We use conventional short-term
acoustic features to train these nets.
2. A cross-lingual front-end that uses large amounts of data from a different language.
In most low-resource settings, it is unlikely that sufficient transcribed data is available in the
Figure 4.3: A cross-lingual front-end built with data from the same language and with large amounts of additional data from a different language but with the same acoustic conditions. PLP and FDLPM acoustic features are passed through MLPs trained on M (200) hours of data from a different language; the outputs are merged by posterior combination and Tandem processing to produce acoustic features enhanced with multilingual posteriors, which then train an MLP on N hours ({1/2/5/10/15/20} hours) of same-language, different-genre data to yield posterior features for LVCSR.
same language to train a monolingual front-end. However, considerable resources in other
languages might be available. Figure 4.3 outlines the components of the cross-lingual
front-end that we train to include additional data from a different language. This front-
end has two parts. The first part is similar to the monolingual front-end described above
and consists of an MLP trained on various amounts of data from the same language but
a different genre (N hours). The second part includes a set of MLPs trained on large
amounts of data from a different language (M hours). Outputs from these MLPs are
used to enhance the input acoustic features for the former part.
Although languages have common attributes, data from these languages
is transcribed using different phone sets and needs to be combined before it can be used.
In the previous chapter, we use two different approaches to deal with this - a count based
data driven approach to find a common phone set and an MLP training scheme with
intermediate language specific layers. Both these approaches finally involve adaptation
of multilingual MLPs to the low-resource language. In this chapter, we do not adapt
any MLPs; instead, we keep the front-end fixed and use the multilingual MLPs to derive
posterior features.
When MLPs trained on a particular language are used to derive phoneme posteriors from
a different language, the language mismatch results in less sharp posteriors than from an
MLP trained on the same language. However an association can still be seen between
similar speech sounds from the different languages. We use this information to enhance
acoustic features of the task specific language. Phoneme posteriors from two compli-
mentary acoustic streams are combined to improve the quality of the posteriors before
they are converted to features using the Tandem technique. The multilingual posterior
features are finally appended to short-term acoustic features to train a second level of
MLPs on varying amounts of data from the same language as the low-resource task. This
procedure is hence similar to the approaches described earlier with modulation features
and the TRAPS/HATS configurations used to build wide neural network topologies (see
Figure 4.1).
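The enhancement pipeline described above can be sketched end to end under a few assumptions: the posterior matrices from the two Spanish streams are already computed, the posterior combination uses the normalized product (the Bayesian special case of the Dempster-Shafer rule), and the KLT is implemented as an eigendecomposition of the covariance of the log posteriors. The function name and dimensions are hypothetical.

```python
import numpy as np

def cross_lingual_enhanced(plp, post_plp, post_fdlpm, n_keep=20):
    """Build enhanced input features for the second-level MLP.

    plp        : (T, D) short-term acoustic features of the target language.
    post_plp   : (T, K) posteriors from the cross-lingual MLP on PLP features.
    post_fdlpm : (T, K) posteriors from the cross-lingual MLP on FDLPM features.
    """
    # 1. Combine the two posterior streams (normalized elementwise product).
    prod = post_plp * post_fdlpm
    post = prod / prod.sum(axis=1, keepdims=True)
    # 2. Tandem-process: log, KLT decorrelation, keep the top n_keep components.
    logp = np.log(post + 1e-10)
    centred = logp - logp.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    tandem = centred @ vecs[:, ::-1][:, :n_keep]
    # 3. Append the multilingual posterior features to the acoustic features.
    return np.concatenate([plp, tandem], axis=1)
```

With 39-dimensional PLP features and 20 retained posterior components, each frame of the second-level MLP input would be 59-dimensional.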
4.2.2 Experiments and Evaluations
We train two data-driven front-ends for the low-resource LVCSR task as described
in Sec. 4.2.1. We train the monolingual front-end on a separate task independent training set
of 20 hours from the Switchboard corpus. Although this training set has similar telephone
channel conditions as the low-resource task used for our experiments, it has a different
genre. The phone labels for this set are obtained by force aligning word transcripts to
previously trained HMM/GMM models using a set of 45 phones. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames. We
train separate MLPs on subsets of 1, 2, 5, 10, 15 and 20 hours to understand how the
amount of task independent data affects the performance of these features.
In addition to the Switchboard corpus, we train Spanish MLPs on 200 hours of tele-
phone speech from the LDC Spanish Switchboard and Callhome corpora for the cross-lingual
front-end. Phone labels for this database are obtained by force aligning word transcripts
using BBN’s Byblos recognition system with a set of 27 phones. We use two acoustic features
- short-term 39 dimensional PLP features with 9 frames of context and 476 dimensional
long-term modulation features (FDLPM). When networks are trained on multiple feature
representations, better posterior estimates can be derived by combining the outputs from
different systems using posterior probability combination rules. We use the Dempster-Shafer
rule of combination for our experiments. Posteriors from multiple streams are combined to
reduce the effects of language mismatch and improve their quality. Phoneme posteriors are
then converted to features by Gaussianizing the posteriors using the log function and decor-
relating them by using the Karhunen-Loeve transform (KLT). A dimensionality reduction
is also performed by retaining only the top 20 feature components which contribute most
to the variance of the data.
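The posterior-to-feature conversion described above can be sketched as follows. This is a minimal numpy illustration with hypothetical function names; for simplicity, the stream combination shown reduces the Dempster-Shafer rule to a normalized per-frame product, which holds only when all belief mass is placed on singleton classes.

```python
import numpy as np

def combine_streams(post_a, post_b, eps=1e-10):
    """Combine two phoneme posterior streams. With all belief mass on
    singleton classes, the Dempster-Shafer combination rule reduces to a
    normalized per-frame product of the two posterior vectors (an
    assumption of this sketch; the thesis uses the full rule)."""
    prod = post_a * post_b
    return prod / (prod.sum(axis=1, keepdims=True) + eps)

def tandem_features(posteriors, n_components=20, eps=1e-10):
    """Convert (frames x phones) posteriors to Tandem features: log to
    Gaussianize, then a Karhunen-Loeve transform (KLT) estimated from the
    data, retaining the components with the highest variance."""
    logp = np.log(posteriors + eps)
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # top-variance directions
    return centered @ eigvecs[:, order]

# Toy example: 100 frames of 27-dimensional posteriors from two streams.
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(27), size=100)
b = rng.dirichlet(np.ones(27), size=100)
feats = tandem_features(combine_streams(a, b), n_components=20)
```

The same pipeline applies unchanged to any number of retained components; 20 is used here to match the text.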
The English MLPs in the cross-lingual setting are trained on enhanced acoustic
features. These features are created by appending posterior features derived from the
Table 4.1: Word Recognition Accuracies (%) using different amounts of Callhome data to train the LVCSR system with conventional acoustic features

                    1hr     2hr     5hr     10hr    15hr
  PLP features      28.8    33.60   39.70   43.80   46.50
Spanish MLPs to the PLP features used in monolingual training. We similarly train
separate MLPs on subsets of 1, 2, 5, 10, 15 and 20 hours of task independent data.
In our first experiment we use 39 dimensional PLP features directly for the 1
hour Callhome LVCSR task. The acoustic models have a low word accuracy of 28.8%.
These features are then replaced by 25 dimensional posterior features using the monolingual
and cross-lingual front-ends, each trained on varying amounts of task independent data
from the Switchboard corpus. Figure 4.4 shows how the performance changes for both the
monolingual and cross-lingual systems. Using the data-driven front-ends, the word accuracy
improves from 28.8% to 30.1% and 37.1% with just 1 hour of task independent training
data using the monolingual and cross-lingual front-ends respectively. These improvements
continue to 37.2% and 41.5% with the same 1 hour of Callhome LVCSR training data as
the amount of task-independent data is increased for both the front-ends. We draw the
following conclusions from these experiments -
1. With very few hours of task specific training data, posterior features can provide
significant gains over conventional acoustic features. Table 4.1 shows the word accu-
racies when different amounts of Callhome data are used to train the LVCSR system.
By using the cross-lingual front-end, features from only 1 hour of data perform close
to 5-10 hours of the Callhome data with conventional features. This demonstrates
[Figure: word recognition accuracy (%) plotted against the amount of task-independent training data (1 to 20 hours) for three systems - 1 hour of acoustic features only, 1 hour of posterior features using the monolingual front-end, and 1 hour of posterior features using the cross-lingual front-end.]
Figure 4.4: LVCSR word recognition accuracies (%) with 1 hour of task specific training data using the proposed front-ends
the usefulness of our approach where we use task independent data in low-resource
settings to generate better features.
2. When data from a different language is used, additional gains of 4-7% absolute are
achieved over just using task independent data from the same language. It is interest-
ing to observe that the cross-lingual front-end starts from the best performance
achieved with the monolingual front-end and improves further from there.
4.3 Deep Network Topologies
A deep neural network (DNN) is a multilayer MLP with several more layers than
traditionally used networks. The layers of a DNN are often initialized using a pretraining
algorithm before the network is trained to completion using the error back-propagation
algorithm [119]. In this section we discuss the development of a DNN for low-resource
scenarios.
4.3.1 DNN Pretraining and Initialization
The purpose of the pretraining step is to initialize a DNN with a better set
of weights than a randomly selected set. Networks trained from these kinds of initial weights
are observed to be well regularized and converge to a better local optimum than randomly
initialized networks [120, 121]. As with traditional ANNs, deep neural networks have been
used both as acoustic models that directly model context-dependent states of HMMs [117]
and also to derive data-driven features [122, 123]. In both cases, the performances of these
networks are better than traditional shallow networks [117,118].
In the deep belief network (DBN) pretraining procedure [124], by treating layers
of the MLP as restricted Boltzmann machines (RBM), the parameters of the network are
trained in an unsupervised fashion with an approximate contrastive divergence algorithm
[124]. However, various approximations in the training algorithm introduce modeling errors
which in turn decrease the effectiveness of this approach when the number of layers is
increased [119].
A different algorithm that has been shown to be equally effective for pretraining
DNNs is called discriminative pretraining [119, 125]. This pretraining procedure starts by
training an MLP with 1 hidden layer. After this MLP has been trained discriminatively
with the error back-propagation algorithm, a new randomly initialized hidden layer and
softmax layer are introduced to replace the initial softmax layer of the first network. The
deeper network is then trained again discriminatively. This procedure is repeated until the
desired number of hidden layers is in place.
Although pretraining algorithms are effective in initializing DNNs, the key
constraint in low resource settings is often the insufficient amount of data to train these net-
works. We show that in these scenarios, task independent data can instead be used to
pretrain and initialize a DNN before it is finally adapted and used with limited amounts of
task specific data in a low resource setting.
We outline the training of a 5 layer DNN of size - d×m1×m2×m3×h. The training
algorithm is however general and can be extended to more hidden layers. The MLP has a
linear input layer with a size d corresponding to the dimension of the input feature vector,
followed by three non-linear layers m1, m2, m3 and a final linear layer with a size h corre-
sponding to the phone set of the task independent data on which the DNN is being trained.
While the dimensions of m1 and m2 are quite high, m3 is a low dimensional bottleneck layer. Similar
to the data-driven networks described in the previous chapter, both posterior and bottleneck
features can be derived from the DNN. We use the following steps to pretrain a DNN -
1. Initializing the network - We begin the training procedure by initializing a simple
network with 1 hidden layer - d×m1×h. Starting with randomly initialized weights
connecting all the layers of the network, we train this network with one pass of the
entire data, similar to [119].
2. Growing the network - The d×m1×h network is now grown by inserting a new layer
m2 and a set of random weights connecting m1 −m2 and m2 − h. The new network
is again trained with one pass of the entire data using the standard back-propagation
algorithm. The weights d −m1 are copied from the initialization step and are kept
fixed.
The desired network d×m1×m2×m3×h is finally created by adding the bottleneck
layer m3. While weights d − m1, m1 − m2 are copied from the previous step, new
random weights are used to connect m2 −m3 and m3 − h.
3. Final training - With all the layers of the network in place, the complete network is
trained to full convergence.
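The three pretraining steps above can be sketched structurally as follows. This is an illustration with hypothetical helper names; the back-propagation passes are elided, and only the copying and random initialization of weights is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def new_layer(n_in, n_out):
    """Small random weights (plus a bias row) for one layer."""
    return rng.normal(scale=0.1, size=(n_in + 1, n_out))

def grow(weights, hidden_size):
    """Insert a new random hidden layer and output layer in front of the
    old output layer, copying all earlier weights (which are kept fixed
    during the next training pass, as in the pretraining recipe)."""
    n_out = weights[-1].shape[1]
    n_prev = weights[-1].shape[0] - 1
    return weights[:-1] + [new_layer(n_prev, hidden_size),
                           new_layer(hidden_size, n_out)]

# Step 1: initialize d x m1 x h (d=351, m1=1000, h=52 as in the text).
d, m1, m2, m3, h = 351, 1000, 1000, 25, 52
net = [new_layer(d, m1), new_layer(m1, h)]
# ... one pass of the task independent data would be trained here ...

# Step 2: grow to d x m1 x m2 x h, then to d x m1 x m2 x m3 x h.
net = grow(net, m2)   # random m1-m2 and m2-h weights; d-m1 copied
# ... one more pass, copied weights kept fixed ...
net = grow(net, m3)   # random m2-m3 and m3-h weights
# Step 3: train the full network to convergence with back-propagation.

shapes = [w.shape for w in net]
```

Each weight matrix carries an extra bias row, so a d×m1 layer has shape (d+1, m1).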
We use task independent data in all these steps. The DNN is next adapted to the
low-resource setting using limited amounts of task specific data.
4.3.2 DNN Adaptation with task specific data
As described in the previous chapter, one limitation while adapting between do-
mains is the difference in phoneme sets. We have proposed a neural network based tech-
nique for this in the previous chapter that replaces the last language specific layer. We use
this technique in the following steps for adapting the DNN -
1. Initialize the network to train on task specific set - To continue training on the task
specific set which has a different phoneme set size l, we create a new 5 layer DNN of
size d×m1×m2×m3×l. The first 4 layer weights of this new network are initialized
using weights from the DNN trained on the task independent data set. Instead of
using random weights between the last two layers, we initialize these weights from a
separately trained single layer perceptron. To train the single layer perceptron, non-
linear representations of the low-resource training data are derived by forward passing
the data through the first 4 layers of the MLP. These representations are then used to train a
single layer network of size m3×l.
2. Train the MLP on the task specific set - Once the 5 layer MLP of size d×m1×m2×m3×l
has been initialized, we re-train the MLP on the low-resource language. By sharing
weights across languages the MLP is now able to train better on limited amounts of
task specific data.
Outputs from the bottleneck hidden layer of the final DNN are used as features
for ASR.
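The adaptation procedure can be sketched as follows, on a deliberately small network. The sigmoid non-linearity, the gradient-descent softmax trainer and all names are assumptions of this sketch, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def forward_hidden(weights, x):
    """Pass features through all hidden layers (sigmoid non-linearities),
    returning the bottleneck representation."""
    for w in weights:
        x = sigmoid(np.hstack([x, np.ones((len(x), 1))]) @ w)
    return x

def train_softmax(reps, labels, n_classes, lr=0.5, epochs=200):
    """Single layer (m3 x l) softmax network trained with gradient descent;
    its weights initialize the new last layer of the adapted DNN."""
    x = np.hstack([reps, np.ones((len(reps), 1))])
    w = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        z = x @ w
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * x.T @ (p - onehot) / len(x)
    return w

# Hypothetical pretrained hidden weights for a small d x m1 x m2 x m3 net.
d, m1, m2, m3, l = 10, 16, 16, 5, 4
hidden = [rng.normal(scale=0.3, size=(n_in + 1, n_out))
          for n_in, n_out in [(d, m1), (m1, m2), (m2, m3)]]

feats = rng.normal(size=(200, d))        # task specific training features
labels = rng.integers(0, l, size=200)    # task specific phone labels
reps = forward_hidden(hidden, feats)     # forward pass to the bottleneck
adapted = hidden + [train_softmax(reps, labels, l)]  # new m3 x l last layer
```

The adapted network (`hidden` layers copied, last layer newly initialized) would then be re-trained on the task specific data.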
4.3.3 Experiments and Evaluations
Similar to low-resource experiments in the previous chapter, we build a cross-
lingual DNN front-end using data from 3 different languages - Spanish, German and English.
Separate DNNs are trained on two different feature representations - PLP and FDLPM.
Bottleneck features from these front-ends are then combined and used for ASR experiments.
DNN pretraining with cross-lingual data
32 hours of cross-lingual data from Spanish (16 hours), German (15 hours) and
English (1 hour) are used to train a 5 layer DNN with 3 hidden layers. The cross-
lingual data uses a combined phoneme set size of 52 derived from a count-based mapping
scheme (Chapter 3, Section 3.4.3).
Separate DNNs are trained on two feature representations. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames to
train the first network with architecture - 351×1000×1000×25×52. A second system is
trained on modulation features derived using FDLP. These features (FDLPM) correspond
to 28 static and dynamic modulation frequency components extracted from 17 bark spaced
bands. A reduced feature set from only 9 alternate odd bands is used to train a system
Table 4.2: Word Recognition Accuracies (%) using different approaches

  System                                                             Word Rec. Acc. (%)
  Conventional acoustic features (PLP) using 1 hour of English
  training data                                                            28.8
  Data-driven features using data-driven map - 31 hours of
  multilingual data (German + Spanish) and 1 hour of English
  (Chapter 3)                                                              36.5
  Data-driven features using adaptable last layer for MLP training -
  31 hours of multilingual data (German + Spanish) and 1 hour of
  English (Chapter 3)                                                      37.2
  Data-driven features using deep neural network pre-trained using
  31 hours of multilingual data (German + Spanish) and 1 hour of
  English                                                                  41.0
with an architecture of 252×1000×1000×25×52. Both systems are trained with the
standard back-propagation algorithm and the cross-entropy error criterion. The learning rate
and stopping criterion are controlled by the error in the frame-based phoneme classification
on the cross validation data.
The DNN networks are built in stages as described in the previous section. For the
DNN trained using PLP features, a three layer MLP (351×1000×52), initialized with random
weights, is first trained using one pass of the cross-lingual data. In the next step, a four
layer MLP (351×1000×1000×52) is trained starting with copied weights from the 351×1000
section of the earlier network and random weights for the 1000×1000×52 section. A single
pass of the cross-lingual data is used to train this network keeping the copied weights fixed.
The final 5 layer network (351×1000×1000×25×52) is constructed with copied weights for
the 351×1000×1000 section and random weights for the 1000×25×52 part. The network
is then trained to full convergence. A similar 252×1000×1000×25×52 network is trained
using the FDLPM features.
DNN adaptation to low-resource settings
Each of the DNNs trained on task independent data is then adapted
to the low-resource setting with the task-specific 1 hour of English data. The networks are
adapted after the task dependent output layer of the cross-lingual DNN has been replaced.
This is done in two steps.
In the first step, all weights except the weights between the bottleneck layer and
the output layer are initialized directly from the cross-lingual network. The second set
of weights are initialized from a single layer network trained on non-linear representations
of the 1 hour of English data derived by forward passing the English data through the
cross-lingual network till the bottleneck layer. This network has an architecture of 25×47
corresponding to the dimensionality of the non-linear representations from the bottleneck
layer of the cross-lingual network and the size of the English phoneme set.
Once the networks have been initialized, PLP and FDLPM features derived from 1
hour of English are used to train the new low-resource networks. The networks trained on
PLP and FDLPM features now have architectures of 351×1000×1000×25×47 and 252×1000×1000×25×47
respectively. These networks are then used to derive bottleneck features. The 2 sets of 25
dimensional bottleneck features from each of the networks are appended together before
applying a dimensionality reduction to form a final 25 dimensional bottleneck feature vector
for ASR.
ASR Experiments using DNN features
We use the same ASR setup on Callhome English as described earlier. The baseline
HMM-GMM system is trained on 1 hour of data using 39 dimensional PLP features. Table
4.2 shows the recognition accuracies on this task using different approaches. The DNN
features significantly improve ASR accuracies when compared with equivalent systems built
using features from simpler 3 layer MLPs.
4.4 Semi-supervised training in Low-resource Settings
4.4.1 Overview
Semi-supervised training has been effectively used to train acoustic models in
several languages and conditions [93,126–128]. In this section we describe the development
of a semi-supervised approach to improve speech recognition performance in low-resource
settings.
We start by using the best acoustic models trained in the low-resource setting to
decode the available untranscribed data. The decoded data is then used along with the
limited amounts of transcribed training data to train acoustic models in a semi-supervised
fashion.
4.4.2 Selecting Reliable Data
In low-resource settings, since the recognition performance of the recognizers is low,
the quality of the decoded untranscribed data is also poor. It is hence useful to select
reliable portions of the untranscribed data for semi-supervised training. This selection is
done using confidence scores computed for each decoded utterance. Confidence scores are
computed using two techniques -
1. LVCSR based word confidences - LVCSR lattice outputs can be treated as directed
graphs with arcs representing hypothesized words. Each arc spans a duration of
time (ts, tf) during which the word is hypothesized to be present in the speech signal, and is
also associated with acoustic and language model scores. Using these scores, word
posteriors can be computed with the standard forward-backward algorithm [129].
For any given hypothesized word wi, at a given time frame t, several instances of
the word can be present on different lattice arcs simultaneously. A frame-based word
posterior of wi can be computed as
p(wi|t) = Σj p(wi^j | t)    (4.1)
where j corresponds to all the different instances of wi that are present at time frame
t [130]. In our proposed selection technique we use a word confidence measure Cmax
based on these frame level word posteriors [130], given as the maximum posterior of
the word in its hypothesized time interval (ts, tf) -
Cmax(wi, ts, tf) = max_{t ∈ (ts, tf)} p(wi|t)    (4.2)
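A minimal sketch of the two measures above, assuming word instances are given as (start frame, end frame, posterior) triples extracted from the lattice; the function names are ours:

```python
def frame_word_posterior(instances, t):
    """p(w|t): sum the posteriors of all lattice-arc instances of the word
    that span frame t (Eq. 4.1)."""
    return sum(p for (ts, tf, p) in instances if ts <= t <= tf)

def c_max(instances, ts, tf):
    """Maximum frame-level word posterior over the hypothesized interval
    (ts, tf), as in Eq. 4.2."""
    return max(frame_word_posterior(instances, t) for t in range(ts, tf + 1))

# Hypothetical lattice instances of one word: (start frame, end frame, posterior).
arcs = [(10, 20, 0.4), (12, 18, 0.3), (25, 30, 0.2)]
conf = c_max(arcs, 10, 20)   # peaks where the first two arcs overlap
```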
[Figure: binarized posteriogram for the constituent phonemes p1 . . . p4 of a hypothesized word W over the interval (ts, tf), showing the presence of each phoneme and the path along which occurrences are counted.]
Figure 4.5: MLP posteriogram based phoneme occurrence count
2. MLP posteriogram based phoneme occurrence confidence - Similar to the above men-
tioned confidence from the LVCSR classifier, we also derive confidence scores from
the phoneme posterior outputs of a neural network classifier. This confidence measure
uses the posteriogram representation of an utterance, derived by forward passing
acoustic features corresponding to the utterance through the trained MLP classifier.
For each hypothesized word wi in the LVCSR transcripts, we first look up its set of
constituent phonemes {p1, p2 . . . pn} from a pronunciation lexicon. Phoneme posteri-
ors corresponding to each phoneme are then selected from the utterance’s posteriogram
representation and binarized to indicate the phoneme’s presence or absence using a
set threshold. The average number of times the constituent phonemes appear in the
hypothesized time span (ts, tf) along a Viterbi search path is then used as a confidence
measure. The selected path is designed to produce the occurrence count while visit-
ing all constituent phonemes in sequence. The rationale behind this measure is that
if a word is hypothesized correctly, it is likely that all its constituent phonemes will
be present in the posteriogram, hence resulting in a high average occurrence count.
Figure 4.5 is a schematic of the proposed count based measure computed as -
Cocc(wi, ts, tf) = c / N    (4.3)
where c is the total number of phoneme occurrences and N is the total number
of frames in the hypothesized interval (ts, tf).
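A simplified sketch of this count, replacing the Viterbi search with a greedy left-to-right path (an assumption of this illustration, as is the threshold value):

```python
import numpy as np

def c_occ(posteriogram, phoneme_seq, ts, tf, threshold=0.5):
    """Count, along a greedy left-to-right path (a stand-in for the Viterbi
    search used in the thesis), how often the constituent phonemes of a word
    are active in its hypothesized interval (Eq. 4.3: C_occ = c/N)."""
    active = posteriogram >= threshold      # binarize the posteriogram
    k, c = 0, 0
    for t in range(ts, tf + 1):
        if k + 1 < len(phoneme_seq) and active[t, phoneme_seq[k + 1]]:
            k += 1                          # advance to the next phoneme
        if active[t, phoneme_seq[k]]:
            c += 1                          # current phoneme present here
    return c / (tf - ts + 1)

# Toy posteriogram: 10 frames x 4 phonemes; word W = phonemes [0, 2, 3].
post = np.zeros((10, 4))
post[0:3, 0] = 0.9     # phoneme 0 active in frames 0-2
post[3:7, 2] = 0.8     # phoneme 2 active in frames 3-6
post[7:10, 3] = 0.9    # phoneme 3 active in frames 7-9
score = c_occ(post, [0, 2, 3], 0, 9)   # every frame covered in sequence
```

A correctly hypothesized word, whose phonemes fire in order across the interval, gives a score near 1; a wrong hypothesis leaves gaps and lowers the count.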
The two confidence measures are finally combined using logistic regression. The
regressor is trained to predict a combined confidence using word confidence and phoneme
occurrence confidence scores on a held-out data set.
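This fusion step might look as follows; the plain gradient-descent logistic regressor and the synthetic held-out data are illustrative assumptions:

```python
import numpy as np

def train_fusion(word_conf, occ_conf, labels, lr=1.0, epochs=500):
    """Fit logistic-regression weights on held-out data to map the two
    confidence scores (plus a bias) to a combined confidence."""
    x = np.column_stack([word_conf, occ_conf, np.ones(len(labels))])
    w = np.zeros(3)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - labels) / len(labels)
    return w

def combined_confidence(w, word_conf, occ_conf):
    x = np.column_stack([word_conf, occ_conf, np.ones(len(word_conf))])
    return 1.0 / (1.0 + np.exp(-(x @ w)))

# Toy held-out set: correct words (label 1) tend to have higher confidences.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
wc = 0.3 * labels + 0.35 + 0.1 * rng.normal(size=400)
oc = 0.3 * labels + 0.35 + 0.1 * rng.normal(size=400)
w = train_fusion(wc, oc, labels)
scores = combined_confidence(w, wc, oc)
acc = np.mean((scores > 0.5) == (labels == 1))
```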
4.4.3 Experiments and Results
For our experiments in low-resource settings, we use a randomly selected 1 hour
of transcribed data from the complete 15 hour Callhome English data set. In our semi-
supervised training experiments we consider the remaining 14 hours as untranscribed data
and attempt to use it.
Data selection
Using the ASR system trained with features from the cross-lingual DNN front-
end, the 14 hour data set is first decoded. Word lattices produced during the decoding
process are used to generate word confidences for each hypothesized word as described
above. The cross-lingual DNN front-end is also used to produce phoneme posterior outputs
from which phoneme occurrence based confidences are derived. Combination weights for
these confidence scores are then estimated by training a logistic regressor on a 45 minute
held-out data set with the set’s ground truth transcriptions.
After every hypothesized word in the decoded output has been given a score using
the trained logistic regression module, each utterance is assigned an utterance-level score.
This utterance level score is the average of all word-level scores in the utterance.
Table 4.3: Word Recognition Accuracies (%) at different word confidence thresholds

  Threshold    Word Rec. Acc. (%)
  None               38.75
  -0.1               39.5
   0.0               41.7
  +0.1               42.7
  +0.2               44.0
  +0.3               45.5
  +0.4               45.4
  +0.5               44.6
To evaluate the usefulness of the proposed confidence selection scheme we generate
utterance level scores for the held out data. The word recognition accuracy (%) is then
evaluated on selected sentences at different threshold levels. Table 4.3 shows the word
recognition accuracies at different thresholds. As the threshold increases, only fewer reliable
sentences get selected.
Semi-supervised training of DNNs
The initial cross-lingual DNN training experiments described earlier were based on
only 1 hour of transcribed data. For semi-supervised training of DNNs we include additional
data with noisy transcripts. These utterances are selected from the untranscribed data based
on their utterance level confidences.
To avoid detrimental effects from noisy semi-supervised data during discriminative training
of neural networks, we make the following design choices -
(a) During back-propagation training, the semi-supervised data is de-weighted. This is
done by multiplying the cross-entropy error with a small multiplicative factor during
training,
Table 4.4: Word Recognition Accuracies (%) with semi-supervised pre-training

  System                                                  Word Rec. Acc. (%)
  Cross-lingual pre-training                                     41.0
  Cross-lingual pre-training with semi-supervised data           42.7
(b) The semi-supervised data is used only in the final pre-training stage after all the layers
of the DNN have been created,
(c) Only a limited amount of semi-supervised data is added.
For our experiments we select about 4.5 hours of data using utterances with a
score of 0.3 and greater. This data is then combined with the cross-lingual pre-training
data set of 15 hours of German, 16 hours of Spanish and 1 hour of English. During the
DNN training, we use a multiplicative factor of 0.3 to de-weight the cross-entropy error
from the semi-supervised data.
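Design choice (a) amounts to scaling the per-frame cross-entropy error, which can be sketched as follows (a minimal numpy illustration; the function name is ours):

```python
import numpy as np

def weighted_cross_entropy(log_probs, targets, weights):
    """Per-frame cross-entropy error, multiplied by a per-frame weight:
    1.0 for supervised frames and a small factor (0.3 in our experiments)
    for frames with noisy semi-supervised transcripts."""
    ce = -log_probs[np.arange(len(targets)), targets]
    return np.mean(weights * ce)

# Toy batch: 4 frames, 3 classes; the last two frames are semi-supervised.
probs = np.full((4, 3), 1.0 / 3.0)
targets = np.array([0, 1, 2, 0])
weights = np.array([1.0, 1.0, 0.3, 0.3])
loss = weighted_cross_entropy(np.log(probs), targets, weights)
```

The gradient of this loss is the usual cross-entropy gradient scaled by the same per-frame weights, so the semi-supervised frames contribute proportionally less to each update.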
The semi-supervised data is used in the final pre-training stage (Section 4.3.1,
step 3) to train both the DNN networks using PLP (351×1000×1000×25×52 network) and
FDLPM (252×1000×1000×25×52 network) features (Section 4.3.3). After pre-training,
both the networks are adapted with 1 hour of English as before. Bottleneck features from
both the networks are combined and used to train the low-resource ASR system with 1 hour
of data as before. Table 4.4 shows the performance of the system after using semi-supervised
data for pre-training.
Table 4.5: Word Recognition Accuracies (%) with semi-supervised acoustic model training

  Hours of semi-supervised data added    Word Rec. Acc. (%)
  0                                            42.7
  2                                            43.3
  4                                            44.0
  8                                            44.3
  14                                           44.8
Semi-supervised training of Acoustic Models
The DNN front-end trained with semi-supervised data is used to extract
data-driven features for semi-supervised training of the ASR system. Similar to the weighting
of semi-supervised data during the DNN training, we also use a simple corpus weighting while
training the ASR systems. This is done by adding the 1 hour of fully supervised data with
accurate transcripts twice.
To understand the effect of the semi-supervised data, we evaluate the recognition
performance using different amounts of semi-supervised data. From Table 4.5 we observe
that as we double the amount of semi-supervised data, there is roughly a 0.5% absolute
increase in performance.
4.5 Conclusions
In this chapter we have shown how complex neural network architectures can be
built in low resource settings. Using large amounts of multilingual data, we have shown
that task independent data can significantly improve performance in low resource settings.
Training on task independent data compensates for the limited amounts of tran-
scribed task specific data available in low resource settings. Both the deep and wide networks
trained in this fashion improve word recognition accuracies significantly.
Chapter 5

Applications of Data-driven Front-end Outputs
In the previous chapters, the outputs of data-driven front-ends were used as features for
automatic speech recognition. In this chapter, we describe how these front-ends can be used
in other applications - to derive features for speech activity detection, combination weights
in neural network based speaker recognition models, feature representations for zero resource
speech applications and event detectors for speech recognition.
5.1 Application 1 - Speech Activity Detection
5.1.1 Overview
Speech activity detection (SAD) is the first step in most speech processing ap-
plications like speech recognition, speech coding and speaker verification. This module is
an important component that helps subsequent processing blocks focus resources on the
speech parts of the signal. In each of these applications, several approaches have been
used to build reliable SAD modules. These techniques are usually variants of decision rules
based on features from the audio signal like signal energy [131], pitch [132], zero crossing
rate [133] or higher order statistics in the LPC residual domain [134]. Acoustic features
have also been used to train multi-layer perceptrons (MLPs) [135] and hidden Markov
models (HMMs) [136] to differentiate between speech and non-speech classes. All these
approaches in essence focus on characteristic attributes of speech which differentiate it from
other acoustic events that can appear in the signal.
5.1.2 Data-driven Features for SAD
Traditionally, acoustic features derived from the spectrum of speech have been
used to differentiate between speech and other acoustic events. In a different approach, we
train MLPs on large amounts of data to differentiate between two classes - speech versus
non-speech. Instead of using these models to directly produce S/NS decisions, the models
are used as data-driven front-ends to derive features for SAD.
The proposed front-end has a multi-stream architecture with several levels of MLPs
[137]. The motivation behind this multi-stream front-end is to use parallel streams of data
that carry complementary or redundant information while at the same time degrading
differently in noisy environments [138]. We form 3 feature streams by dividing the sub-
band trajectories derived using FDLP on a mel-scale with 45 filters equally into 3 groups.
Similar to deriving short-term spectral features, we then integrate the envelopes in short
term frames (of the order of 25 ms with a shift of 10 ms). We also use a context of
about 1 second by appending 50 frames from the right and left with each sub-band feature
vector to form TRAP like features [65]. The two other streams are formed by dividing the
14 modulation features into 2 groups - the first 5 DCT coefficients corresponding to slow
modulations and the remaining 5 coefficients corresponding to fast modulations.
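The roughly 1 second of temporal context used by the sub-band streams can be formed by simple frame stacking, sketched here with a hypothetical helper (edge-frame replication is an assumption of this sketch):

```python
import numpy as np

def stack_context(frames, left=50, right=50):
    """Append `left` frames from the past and `right` frames from the future
    to each frame (edges replicated), giving roughly 1 second of context at
    a 10 ms frame shift, as in the TRAP-like features described above."""
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)]
                      for i in range(left + right + 1)])

# Toy sub-band energy trajectory: 200 frames of a 15-band stream.
stream = np.random.default_rng(0).normal(size=(200, 15))
traps = stack_context(stream)    # each frame now carries 101 frames of context
```

Each output row concatenates 101 frames (50 left + current + 50 right) of the stream, so a 15-band stream yields 1515-dimensional context vectors.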
5.1.3 Experiments and Results
Speech activity detection is carried out on the proposed features in three main
steps. In the first step, the input frame-level features are projected to a lower-dimensional
space. The reduced features are then used to compute per-frame log likelihood scores with
respect to speech and non-speech classes, each class being represented separately by a GMM.
The frame level log likelihood scores are mapped to S/NS classification decisions to produce
final segmentation outputs in the last step. Figure 5.1 is a brief schematic of the proposed
approach and the processing pipeline for SAD. Each of these steps is described in detail
in [139].
The proposed features are evaluated in terms of speech activity detection (SAD)
accuracy on noisy radio communications audio provided by the Linguistic Data Consortium
(LDC) for the DARPA RATS program [140, 141]. The audio data for the DARPA RATS
program is collected under both controlled and uncontrolled field conditions over highly
degraded, weak and/or noisy communication channels making the SAD task very challeng-
ing [140]. Most of the RATS data released for SAD were obtained by retransmitting existing
audio collections - such as the DARPA EARS Levantine/English Fisher conversational tele-
phone speech (CTS) corpus - over eight radio channels, labeled A through H covering a
wide range of radio channel transmission effects.
Figure 5.1: Schematic of (a) features and (b) the processing pipeline for speech activity detection.
The development corpus used in our SAD experiments consists of 11 hours of
audio from the Arabic Levantine and English Fisher CTS corpus, retransmitted over the
eight channels. The training corpus consists of 73 hours of audio (62 hours from the Fisher
collection, and 11 from the new RATS collection). Although the entire data was also retrans-
mitted over eight channels, since some data from channel F was unusable, all data from
that channel was excluded from both training and development.
The MLPs used for extracting data-driven features are trained on close to 660
hours of audio from the RATS development corpus using LDC provided S/NS annotations.
Outputs from these 5 sub-systems are then fused by a merger MLP at the second level to
derive the final S/NS posterior features. These features are derived from the pre-softmax
                  Dimensionality                       Equal Error Rate (%) on different channels
  Features     #Dims./Frame  #Frame Context  Total #Dims.     A     B     C     D     E     G     H    All
  PLP               15             31            465         3.55  3.00  5.03  2.51  2.75  3.48  2.34  3.34
  FDLPS             15             31            465         3.42  3.10  4.46  2.42  2.78  3.40  2.29  3.20
  FDLPM            340              1            340         3.88  3.80  4.12  3.26  3.52  3.60  2.51  4.15
  MLP                2             31             62         3.05  2.96  3.76  2.20  2.71  3.35  2.10  3.17
  PLP+MLP           17             31            527         3.10  2.84  3.20  2.25  2.63  2.96  2.07  2.84
  FDLPS+MLP         17             31            527         3.15  2.94  3.04  2.17  2.67  2.89  1.93  2.82
  FDLPM+MLP        402              1            402         3.02  2.90  3.73  2.26  2.84  2.42  1.89  2.88

Table 5.1: Equal Error Rate (%) on different channels using different acoustic features and combinations
outputs of the final layer.
SAD models are trained on both acoustic and data-driven features, as well as on
feature combinations. In each case, HLDA was used to reduce dimensionality prior to
GMM training. Table 5.1 shows the dimensionality of the original space, prior to the
application of HLDA, for each feature type used. A context of 31 frames was used for
short-term features. In all cases, the output dimensionality of HLDA was set to 45. A single
Gaussian was used to represent each of the two classes (speech, non-speech) during HLDA
estimation. After the dimensionality reduction, a 512-component GMM is trained for
S/NS classification. The number of contextual frames, HLDA dimensionality, and number
of GMM components were optimized using separate experiments [142]. The derived SAD
models were evaluated on the development set in terms of equal error rate (EER%), which is
the operating point at which the falsely rejected speech rate (probability of missed speech)
is equal to the falsely accepted non-speech rate (probability of false alarm). The results
are shown in Table 5.1 for conventional features (PLP), short-term features derived using
FDLP (FDLPS), long-term modulation features (FDLPM) and data-driven features (MLP).
Although each of the feature sets has varying performance on the individual noisy
channels, they are comparable to each other in terms of overall SAD performance. In a
second set of experiments, acoustic and data-driven features, which capture various kinds
of information about speech, are combined. We observe close to 15% relative improvement
when the acoustic features are used in conjunction with the data-driven features. We draw
the following conclusions from these experiments -
1. MLP based models, which are traditionally used to directly produce S/NS decisions,
can be used as data-driven front-ends to produce complementary data-driven features.
2. Acoustic and data-driven features capture complementary attributes. When combined
these lead to further performance improvements.
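The EER operating point described above can be located with a simple threshold sweep over frame scores. The following is a minimal numpy sketch (illustrative variable names and toy scores; the actual systems score frames with GMM log-likelihoods):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Locate the operating point where the missed-speech rate equals the
    false-alarm rate by sweeping a decision threshold over the scores.

    scores: higher = more speech-like; labels: 1 = speech, 0 = non-speech.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        speech = scores >= t                     # frames accepted as speech
        miss = np.mean(~speech[labels == 1])     # falsely rejected speech
        fa = np.mean(speech[labels == 0])        # falsely accepted non-speech
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return eer
```

In practice the EER is read off a detection error trade-off curve; the exhaustive sweep above is adequate for a development-set sanity check.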
5.2 Application 2 - Neural Network based Speaker Verifica-
tion
5.2.1 Overview
The goal of speaker verification is to verify the truth of a speaker's claimed identity.
The majority of current speaker verification systems model the overall acoustic feature vector
space using a GMM based Universal Background Model (UBM), trained on large amounts of data
from multiple speakers [143,144]. In this section we discuss the development of a mixture of
AANNs for speaker verification. The mixture consists of several AANNs tied using posterior
probabilities of various broad phoneme classes derived from MLPs.
5.2.2 AANN Models for Speaker Verification
Modeling Speaker Data
AANNs are feed-forward neural networks with several layers trained to reconstruct
the input at its output through a hidden compression layer. This is typically done by
modifying the parameters of the network using the back-propagation algorithm such that
the average squared error between the input and output is minimized over the entire training
data. More formally, for an input vector x, the network produces an output x̂(x, W) which
depends both on the input x and the parameters W of the network (the set of weights and
biases). For simplicity, we denote the network output as x̂(W). The training process then
adjusts the parameters such that -

\min_{W} \, E\left[ \| x - \hat{x}(W) \|^2 \right]. \qquad (5.1)
This method of training ensures that for a well trained network, the average reconstruction
error of input vectors that are drawn from the distribution of the training data will be small
compared to vectors drawn from a different distribution [145]. The likelihood of the data x
given the model can then be linked to the error as -

p(x; W) \propto \exp\left( - E\left[ \| x - \hat{x}(W) \|^2 \right] \right). \qquad (5.2)
In [146, 147], these properties have been used to model acoustic data for speaker
verification. A single AANN is first trained as a universal background model (UBM) on
acoustic features from large amounts of data containing multiple speakers. Since data from
many speakers are used, the AANN model learns a speaker independent distribution of the
acoustic vectors. For each speaker in the enrollment set, the UBM-AANN is then adapted to
learn speaker dependent distributions by retraining the entire network using each speaker’s
enrollment data. During the test phase, the average reconstruction error of the test data is
computed using both the UBM-AANN and the claimed speaker AANN model. In an ideal
case, if the claim is true, the average reconstruction error under the speaker specific model
will be smaller than under the UBM-AANN and vice versa if false.
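This decision rule can be sketched with a toy network. The following is an illustrative numpy sketch of the forward pass and the reconstruction-error comparison (hypothetical layer shapes and parameter layout, not the thesis implementation):

```python
import numpy as np

def aann_reconstruct(X, weights, biases):
    """Forward pass of an autoassociative network: tanh hidden layers and a
    linear output layer, returning the reconstruction x_hat(x; W)."""
    h = X
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:     # all but the output layer are nonlinear
            h = np.tanh(h)
    return h

def avg_reconstruction_error(X, params):
    """E[||x - x_hat(W)||^2] over the rows of X, as in Eqn (5.1)."""
    X_hat = aann_reconstruct(X, *params)
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

def accept_claim(X_test, ubm_params, speaker_params):
    """Accept the claim if the claimed speaker's AANN reconstructs the test
    frames better (lower average error) than the UBM-AANN."""
    return avg_reconstruction_error(X_test, speaker_params) < \
           avg_reconstruction_error(X_test, ubm_params)
```

The sketch omits training entirely; in the systems described here the parameters come from back-propagation on the enrollment and background data.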
This approach is similar to conventional UBM-GMM techniques [144] except for
the maximum a posteriori probability (MAP) adaptation to obtain speaker specific models.
In the MAP adaptation of GMMs, only those components that are well represented in the
adaptation data get significantly modified. However in the case of neural networks, there
is no similar mechanism by which only parts of the model can be adapted. This limits the
ability of a single AANN to capture the distribution of acoustic vectors especially when
the space of speakers is large. To address this issue, we introduce a mixture of AANNs as
described in the following section.
Mixture of AANNs
A mixture of AANNs is composed of several independent AANNs each modeling
a separate part of the acoustic feature space [148]. In our experiments we partition the
acoustic space into 5 classes corresponding to the broad phoneme classes of speech - vowels,
fricatives, nasals, stops and silence. The assignment of a feature vector to one of these classes
is done using posterior probabilities of these classes estimated using a separate multilayer
perceptron (MLP). This additional information is incorporated into the objective function
in Eqn. (5.1) as -

\sum_{j=1}^{c} \min_{W_j} \, E\left[ P(C_j \mid x) \, \| x - \hat{x}(W_j) \|^2 \right] \qquad (5.3)

where c denotes the number of mixture components or number of broad phoneme classes,
and the set W_j consists of the parameters of the jth AANN of the mixture. P(C_j | x) is the
posterior probability of the jth broad phonetic class C_j given x, estimated using the MLP. During
back propagation training, since the error is weighted with class posterior probabilities, each
mixture component is trained only on frames corresponding to a particular broad phonetic
class.
Similar to the single AANN case, a UBM-AANN is first trained on large amounts of
data. For each speaker in the enrollment set, the UBM is then adapted using speaker specific
enrollment data. Broad class phoneme posteriors are used in both these cases to guide
the training of each class specific mixture component on the appropriate set of frames. This
approach helps to alleviate the limitation of a single AANN model described earlier since
only parts of the UBM-AANN are now adapted based on the speaker data.
Using the mixture of AANNs, the average reconstruction error of data D =
{x_1, . . . , x_n} is given by

e(D; W_1, \ldots, W_c) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} P(C_j \mid x_i) \, \| x_i - \hat{x}_i(W_j) \|^2. \qquad (5.4)
During the test phase, likelihood scores based on reconstruction errors from both the UBM-
AANN and the claimed speaker models are used to make a decision. In our experiments,
since the amount of adaptation data is usually limited, we adapt only the last layer weights
of each AANN component. We also restrict the number of nodes of the third hidden layer
to the size of the output layer.
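Equation (5.4) and the trial scoring described above can be sketched as follows (a hedged numpy sketch; `forward` stands for any per-component reconstruction function such as an AANN forward pass, and all names are illustrative):

```python
import numpy as np

def mixture_error(X, P, models, forward):
    """Average posterior-weighted reconstruction error e(D; W_1..W_c) of
    Eqn (5.4). X: (n, d) frames; P: (n, c) broad-class posteriors from the
    MLP (rows sum to 1); models: c parameter sets;
    forward(X, params) -> X_hat."""
    n, c = P.shape
    total = 0.0
    for j in range(c):
        X_hat = forward(X, models[j])
        err = np.sum((X - X_hat) ** 2, axis=1)  # per-frame squared error
        total += np.sum(P[:, j] * err)          # weighted by P(C_j | x_i)
    return total / n

def trial_score(X, P, ubm_models, spk_models, forward):
    """Score a verification trial as the UBM error minus the claimed-speaker
    error; larger scores favor accepting the claim."""
    return mixture_error(X, P, ubm_models, forward) - \
           mixture_error(X, P, spk_models, forward)
```

Because the per-frame errors are weighted by P(C_j | x_i), each component only contributes on the frames assigned to its broad phonetic class, mirroring the training objective in Eqn (5.3).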
5.2.3 Experiments and Results
As described earlier, we train a mixture of AANNs with five components on
sufficiently large amounts of data to serve as the UBM. Gender specific UBMs are trained on a
telephone development data set consisting of audio from the NIST 2004 speaker recognition
database, the Switchboard II Phase III corpora and the NIST 2006 speaker recognition
database. We use only 400 male and 400 female utterances each corresponding to about 17
hours of speech. The acoustic features used in our experiments are 39 dimensional FDLP
features [149].
Posteriors to train the UBM are derived from an MLP trained on 300 hours of
conversational telephone speech (CTS) [88]. The 45 phoneme posteriors are combined
appropriately to obtain 5 broad phonetic class posteriors corresponding to vowels, fricatives,
plosives, nasals and silence.
Each AANN component of the UBM has a linear input and a linear output layer
along with three nonlinear (tanh nonlinearity) hidden layers. Both input and output layers
have 39 nodes corresponding to the dimensionality of the input FDLP features. We use 160
nodes in the first hidden layer, 20 nodes in the compression layer and 39 nodes in the third
hidden layer. Speaker specific models are obtained by adapting (retraining) only the last
layer weights (39×39 parameters) of each AANN component.
Once the UBMs and speaker models have been trained, a score for a trial is
computed as the difference between the average reconstruction error values (given by (5.4))
of the test utterance under the UBM and the claimed speaker model.
As a baseline, we train a gender independent UBM-GMM system with 1024 com-
ponents on FDLP features. The UBM-GMM is trained using the entire development data
described in the section above. The speaker specific GMM models are obtained by MAP adapt-
ing the UBM-GMM with a relevance factor of 16. As a second baseline, we train gender specific
AANN systems. These systems use 160 nodes in both the second and fourth hidden layers
and 20 nodes in the compression layer. The UBMs are trained using the same development
data that is used for training the mixture of AANNs.
System                      C6            C7            C8
GMM (1024 comp.)            84.4 (17.3)   60.8 (11.7)   69.1 (14.3)
Baseline AANN               88.3 (28.7)   75.9 (20.6)   77.0 (25.7)
Mixture of AANNs            86.7 (22.5)   60.4 (11.8)   57.3 (12.8)
Mixture of AANNs + GMM      81.3 (16.4)   51.9 (10.9)   54.4 (11.4)

Table 5.2: Performance in terms of Min DCF (×10³) and EER (%) in parentheses on different NIST-08 conditions
The performance is evaluated on a subset of the NIST-08 telephone core conditions
(C6, C7 and C8) consisting of 3851 trials from 188 speakers. Table 5.2 lists both minimum
detection cost function (DCF) and equal error rate (EER) of various systems. The proposed
mixture of AANNs system performs much better than the baseline AANN system and
yields comparable results to the conventional GMM system. The score combination (equal
weighting) of GMM baseline and the proposed system further improves the performance.
However, state-of-the-art GMM systems use factor analysis to obtain much better gains.
In [150,151], the AANN based approach has been further developed to use factor analysis.
5.3 Application 3 - Zero Resource Settings
In zero resource settings, tasks such as spoken term discovery attempt to auto-
matically identify repeated words and phrases in speech without any transcriptions [152].
In recent approaches [152–154] to address this task, a dynamic time warping (DTW) search
of the speech corpus is performed against itself to discover repeated patterns. With no
transcripts to guide the process, results of the search largely depend on the quality of the
underlying speech representation being used. In [155], multiple information retrieval met-
rics have been proposed to evaluate the quality of different speech representations on this
task. These metrics operate by using a large collection of pre-segmented word examples to
first compute the DTW distance between all example pairs and then quantify how well the
DTW distances can differentiate between same-word and different-word example pairs. Better scores
with these metrics are indicative of good speaker independence and high word discrim-
inability of feature representations. Since these are also desirable properties of features for
other downstream recognition applications, these metrics are also predictive of how different
features will perform in those applications. We evaluate posterior features from both the
multilingual and cross-lingual front-ends (Chapter 4, Sec. 4.2) for spoken term discovery
with information retrieval metrics used in [155].
The evaluation metric uses 11K words from the Switchboard corpus resulting in
60.7M word pairs, of which 96K are same-word pairs [155]. The similarity between a word pair
(wi, wj) is measured using the minimum DTW alignment cost DTW(wi, wj) between wi and wj.

[Figure 5.2: Average precision for different configurations of the wide topology front-ends: average precision (0 to 0.7) plotted against the amount of task-independent training data (1 to 20 hours), for acoustic features only, posterior features using the monolingual front-end, and posterior features using the cross-lingual front-end.]

For a particular threshold τ, wi and wj are predicted to be the same if DTW(wi, wj) ≤ τ.
Computing DTW distances also requires a distance metric to be defined between the feature
vector frames that make up words. For this evaluation, cosine distance is used for comparing
frames of raw acoustic features corresponding to words. A more meaningful symmetric KL-
divergence is used for assessing similarities between the phoneme posterior vectors generated
by the proposed front-ends for words.
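The DTW comparison and the two frame-level distances can be sketched as follows (an illustrative sketch; practical systems use optimized DTW implementations, often with band constraints):

```python
import numpy as np

def cosine_dist(x, y, eps=1e-12):
    """Cosine distance between two raw acoustic feature frames."""
    return 1.0 - float(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two phoneme posterior vectors."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def dtw_cost(A, B, frame_dist):
    """Minimum-cost DTW alignment between feature sequences A (m, d) and
    B (n, d), length-normalized so words of different durations compare."""
    m, n = len(A), len(B)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = frame_dist(A[i - 1], B[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[m, n]) / (m + n)
```

The same `dtw_cost` routine serves both representations; only the plugged-in frame distance changes between raw acoustic features and posterior vectors.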
The entire set of word pairs is now used in the context of an information retrieval
task where the goal is to retrieve same word pairs from different word impostors for each
front-end configuration. Sweeping τ allows us to create a standard precision-recall curve
for each setting. The precision-recall curves can then be characterized by several criteria.
We use the average precision metric defined as the area under the precision-recall curve for
our experiments, which summarizes the system performance across all operating points.
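The sweep over τ and the resulting average precision can be summarized by ranking pairs by DTW cost (a minimal sketch with illustrative names):

```python
import numpy as np

def average_precision(dtw_costs, is_same_word):
    """Average precision for retrieving same-word pairs: rank all pairs by
    ascending DTW cost (smaller cost => predicted same word) and average
    the precision at the rank of each true same-word pair."""
    order = np.argsort(dtw_costs)
    labels = np.asarray(is_same_word, dtype=float)[order]
    hits = np.cumsum(labels)                       # true pairs retrieved so far
    precision = hits / np.arange(1, len(labels) + 1)
    return float(np.sum(precision * labels) / labels.sum())
```

A score of 1.0 means every same-word pair ranks above every different-word impostor; ties and interpolation conventions vary slightly across toolkits.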
Figure 5.2 shows the average precision scores for the two front-ends with varying
amounts of training data. The plot shows that posterior features perform significantly
better than the raw acoustic features (39D PLP features with zero mean/unit variance)
which have a very low score of only 0.177. As in the LVCSR case (Chapter 4, Sec. 4.2),
posterior features from the cross-lingual front-end perform even better. Both front-ends
improve as the amount of task independent data increases. Since this evaluation metric is
based on DTW distances over a moderately large set of words, improved performances on
this metric imply more accurate spoken term discovery. These experiments clearly show
the potential of data-driven front-ends not only in low-resource settings but also in zero-resource
settings.
5.4 Application 4 - Event detectors for Speech Recognition
In [156], we present a new application of phoneme posteriors for ASR. We use MLP
based phoneme posteriors to detect phonetic events in the acoustic signal. These phoneme
detectors are then used along with Segmental Conditional Random Fields (SCRFs) [157] to
represent the information in the underlying audio signal.
5.4.1 Building Phoneme Detectors
Multilayer perceptrons are used to estimate the posterior probability of phonemes
given the acoustic evidence. Each output unit of the MLP is associated with a particular
HMM state to allow these probabilities to be used as emission probabilities of an HMM
system. The Viterbi algorithm is then applied on the hybrid system to decode phoneme
sequences. Each time frame in the acoustic signal is associated with a phoneme in the
decoded output. We use the output phonemes along with their corresponding time stamps
as a collection of phoneme detections. A phoneme detection is registered at the mid-point
of the time span in which a phoneme is present. These phoneme detections are subsequently
used in the SCARF framework.
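Collapsing the frame-level Viterbi output into mid-point detections can be sketched as follows (an illustrative function, not the thesis code):

```python
def phoneme_detections(frame_labels, frame_shift=0.01):
    """Turn a frame-level decoded phoneme sequence into (phoneme, time)
    detections, registering each detection at the mid-point of the time
    span in which the phoneme is present (frame_shift in seconds)."""
    detections, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        # close the current span at a label change or at the end of the signal
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            mid = (start + i - 1) / 2.0 * frame_shift
            detections.append((frame_labels[start], mid))
            start = i
    return detections
```

With the common 10 ms frame shift, a phoneme spanning frames 20-29 would be registered as a single detection at 0.245 s.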
To derive reliable detections corresponding to the underlying acoustic signal, pos-
terior probabilities of phonetic sound classes are estimated using a hierarchical configuration
of MLPs. We use both short-term spectral and long-term modulation acoustic features as
input along with the hierarchical configuration to identify phonetic events.
5.4.2 Integrating Detectors with SCARF
An important characteristic of the SCRF approach is that it allows a set of features
from multiple information sources to be integrated together to model the probability of a
word sequence using a log-linear model. SCARF [158] uses several basic kinds of features
to relate the events present in the observation stream to the words being hypothesized.
These include - expectation features, Levenshtein features, existence features, language
model features and baseline features. The expectation and Levenshtein features measure
the similarity between expected and observed phoneme strings, while the existence features
indicate simple co-occurrence between words and phonemes. The baseline feature indicates
(dis)agreement between the label on a lattice link, and the word which occurs in the same
time span in a baseline decoding sequence.
The phoneme detections that we now include capture phonetic events that occur
in the underlying acoustic signal. During the training process SCARF learns weights for
each of the features. In the testing phase, SCARF uses the inputs from the detectors to
search the constrained space of possible hypotheses.
System                                                   WER% on dev04f
Baseline system (LDA + MLLT + VTLN + fMLLR
  + MLLR + fMMI + mMMI + wide beams)                          16.3
SCARF with baseline features                                  16.0
SCARF + Word Detectors                                        15.3
SCARF + Word Detectors + Phoneme Detectors                    15.1

Table 5.3: Integrating MLP based event detectors with ASR
We use SCARF along with the earlier described event detectors on the Broadcast
News task [159]. Table 5.3 shows the results of using the word detector stream along
with all the phoneme detector streams in combination. In this experiment we observe fur-
ther improvements with the phoneme detectors even after the word detectors have been
used. Both experiments clearly show that the detectors capture additional information in the
underlying acoustic signal, which accounts for the further reduction in error
rates. It should be noted that these improvements are on top of results from state-of-the-art
recognition systems.
5.5 Conclusions
In this chapter, we have demonstrated the use of outputs of data-driven front-ends
for four different applications. For speech activity detection, the data-driven front-ends are
used to derive features which improve speech detection in very noisy environments. In the
second application we use broad class posteriors to improve neural network based speaker
verification. By introducing this side information, a mixture of neural networks can
be trained similarly to conventional GMM based models. This technique improves the neural
network framework significantly and makes its performance comparable with state-of-the-art
systems.
We have demonstrated the usefulness of data-driven features for zero-resource
speech applications like spoken term discovery, which operate without any transcribed
speech to train systems. The proposed features provide significant gains over conventional
acoustic features on various information retrieval metrics for this task. In this chapter we have
also explored a different application of phoneme posteriors - as phonetic event detectors for
speech recognition. We show how these detectors can be built to reliably capture phonetic
events in the acoustic signal by integrating both acoustic and phonetic information about
sound classes along with segmental conditional random fields.
Chapter 6
Conclusions
This chapter summarizes the important contributions made in this thesis.
6.1 Contributions
In this thesis we have proposed novel data-driven feature front-ends for differ-
ent speech applications. This approach is different from conventional feature extraction
techniques which derive information only from the spectrum of speech in short analysis
windows.
To build effective data-driven front-ends, we have investigated the use of novel
features based on auto-regressive modeling of sub-band envelopes of speech. In conjunction
with these features, we have explored the use of various data-driven front-ends in different
scenarios, especially in low-resource settings. Several novel neural network architectures
and adaptation techniques have been proposed to improve the performance of these front-
ends when only limited amounts of task specific transcribed data are available. We have also
demonstrated the use of these front-ends for other speech applications like speech activity
detection and speaker verification.
The novel contributions made in this thesis can be summarized as -
1. Exploiting temporal dynamics of speech
Data-driven features for speech recognition (Chap. 2, Sec. 2.3) - We have pro-
posed a new set of data-driven features for speech recognition. These features are derived
by combining posterior outputs of MLPs trained on FDLP based short-term spectral and
long-term modulation features. The proposed data-driven features significantly improve
performances on various ASR tasks - phoneme recognition, digit recognition and large
vocabulary continuous speech recognition [79, 80, 82, 160, 161].
2. Working with limited amounts of training data
Techniques to combine data transcribed using different phoneme sets (Chap.
3, Sec. 3.2) - We have developed a count based technique to map between phoneme
classes used to transcribe data in different languages and domains. This technique is
based on a measure that uses posteriors of phoneme classes as soft counts. We have
demonstrated the use of this approach in combining data from three languages - English,
Spanish and German, to train neural network systems. Significant gains are observed
when data-driven features derived using these multilingual MLPs are used in low-resource
settings [162].
3. Neural network architectures for data-driven front-ends
a. Neural network adaptation scheme using multiple output layers (Chap. 3,
Sec. 3.3) - Instead of using a mapping scheme to combine data from different sources
before training, we have developed an approach to train neural networks using domain
specific output layers that are modified as training progresses across different domains.
This approach has been shown to be useful in sharing trained network layers across
different domains especially in low-resource settings [163]. Both the above mentioned
techniques address a key issue usually encountered while training neural networks
with data transcribed using different phoneme sets from multiple sources.
b. Wide neural network topology using data from multiple languages (Chap.
4, Sec. 4.2) - We have explored the use of a wide neural network topology that uses
several MLPs trained on large amounts of task independent data for low-resource and
zero-resource speech applications. Results using these front-ends demonstrate that
when task dependent training data is scarce, task independent multi-lingual data can
be used to compensate for performance drops [164].
c. Deep neural network with pre-training using task independent data (Chap.
4, Sec. 4.3) - To allow deep neural networks to be effectively trained in low resource
settings, we have investigated the use of multilingual data for initialization and train-
ing. By using deep neural networks, significant gains are observed on a low-resource
task using only 1 hour of training data. We also illustrate the use of unsupervised
acoustic model training in these settings. Table 6.1 summarizes the gains obtained by
using the proposed techniques in a low-resource experimental setup with only 1 hour
of transcribed training data.
System                                                           Word Accuracy (%)
Conventional acoustic features (PLP) using 1 hour of
  English training data (Baseline system)                             28.8
Data-driven features using count based map - 31 hours of
  multilingual data (German + Spanish) and 1 hour of
  English (Contribution 2)                                            36.5
Data-driven features using adaptable last layer for MLP
  training (Contribution 3a)                                          37.2
Data-driven features using wide network topology - 200 hours
  Spanish MLP and 20 hours of English MLP from a different
  domain (Contribution 3b)                                            41.5
Data-driven features using deep neural network pre-trained
  using 31 hours of multilingual data (German + Spanish)
  and 1 hour of English (Contribution 3c)                             41.0
Semi-supervised acoustic training with DNN features
  (Contribution 3c)                                                   44.8
Conventional acoustic features (PLP) with all the available
  15 hours of training data used for acoustic model
  training (Baseline system)                                          46.5

Table 6.1: Performances in a low-resource setting using different data-driven front-ends proposed in the thesis.
The above results clearly show that data-driven features are able to improve recog-
nition accuracies in low-resource settings significantly. With only a small fraction
of task specific training data, the proposed approaches are able to achieve perfor-
mances (44.8%) very close to those obtained with conventional features when all of
the training data is used (46.5%).
4. Applications of data-driven features
• Multi-stream data-driven features for speech activity detection (Chap. 5,
Sec. 5.1) - Neural networks have traditionally been used only as acoustic models
for speech activity detection. We have proposed the use of data-driven features
derived using MLPs for this task. When combined with acoustic features, significant
improvements are observed on the speech activity detection task in noisy environments [139].
• Mixture of AANNs using MLP posteriors for speaker verification (Chap.
5, Sec. 5.2) - To allow neural network models to effectively capture the
distribution of acoustic vectors, a mixture of AANNs has been proposed. Several
independent AANNs are trained on different parts of the acoustic space correspond-
ing to broad phoneme classes of speech. The assignment of a feature vector to one
of these classes is done using posterior probabilities of these classes estimated us-
ing an MLP. Experiments show significant improvements by using the mixtures of
AANNs to model speakers. This novel approach is comparable with conventional
GMM based modeling approaches for this task [148,150].
• Data-driven features for zero-resource settings (Chap. 5, Sec. 5.3) -
Data-driven features provide significant gains over conventional acoustic features
on various information retrieval metrics in zero-resource speech applications like
spoken term discovery [164].
• Event detectors for speech recognition (Chap. 5, Sec. 5.4) - Phoneme
posterior probabilities estimated using MLPs are extensively used both as scaled
likelihoods (HMM-ANN framework) and features (Tandem approach) for speech
recognition. We explore a different application of these posteriors - as phonetic
event detectors for speech recognition [156].
6.2 Summary
In this chapter, we have summarized the contributions of this thesis. Although the
proposed data-driven feature extraction techniques have been shown to be useful in many
applications, they have limitations related to their training and use. These include -
1. Labeled training data - For the neural network systems to be trained, sufficient data
with frame level phonetic transcriptions is required. These labels are often produced
from alignments generated by an LVCSR system. In low-resource settings or zero-
resource settings where no such transcripts are available, building neural network
based front-end systems will be difficult.
2. Mismatch conditions - Neural networks are sensitive to mismatches in train and test
conditions. Neural network based front-ends can be useful for deriving features only
in matched training conditions.
These current limitations open up several interesting avenues for future work. It
would be interesting to see if any of the techniques currently being developed for unsuper-
vised sub-word acoustic model training using universal background models [165], successive
state splitting algorithms for HMMs [166], estimation of sub-word HMMs [167], discrimi-
native clustering objectives [168], non-parametric Bayesian estimation of HMMs [169], and au-
tomatically discovered context independent sub-word units [170] can be used to build data-
driven front-ends when transcribed data is unavailable. An interesting paradigm to deal
with mismatch conditions is multi-stream speech recognition. The multi-stream recognition
paradigm for processing of corrupted signals has been studied for more than a decade [137].
In these approaches, a number of different representations of the signal are processed
and classified in separate processing channels, making it possible to adap-
tively de-emphasize the corrupted channels while preserving the uncorrupted channels for further
processing. It would be interesting to see if robust data-driven front-ends [138,171,172] can
be built using this technique to deal with unexpected or unseen noise environments.
Bibliography
[1] F. Jelinek, “Continuous speech recognition by statistical methods,” Proceedings of the
IEEE, vol. 64, no. 4, pp. 532–556, 1976.
[2] ——, Statistical methods for speech recognition. MIT press, 1998.
[3] H. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach.
Springer, 1994, vol. 247.
[4] L. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring
in the statistical analysis of probabilistic functions of Markov chains,” The Annals of
Mathematical Statistics, pp. 164–171, 1970.
[5] S. Katz, “Estimation of probabilities from sparse data for the language model com-
ponent of a speech recognizer,” IEEE Transactions on Acoustics, Speech and Signal
Processing, vol. 35, no. 3, pp. 400–401, 1987.
[6] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63,
no. 4, pp. 561–580, 1975.
[7] A. Oppenheim and R. Schafer, “Homomorphic analysis of speech,” IEEE Transactions
on Audio and Electroacoustics, vol. 16, no. 2, pp. 221–226, 1968.
[8] R. Schafer and L. Rabiner, “Digital representations of speech signals,” Proceedings of
the IEEE, vol. 63, no. 4, pp. 662–677, 1975.
[9] S. Davis and P. Mermelstein, “Comparison of parametric representations for mono-
syllabic word recognition in continuously spoken sentences,” IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[10] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal
of the Acoustical Society of America, vol. 87, p. 1738, 1990.
[11] S. Furui, “Speaker-independent isolated word recognition using dynamic features of
speech spectrum,” IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 34, no. 1, pp. 52–59, 1986.
[12] K. Fukunaga, Statistical pattern recognition. Academic Press, 1990.
[13] C. Bishop, Pattern recognition and machine learning. Springer-Verlag, 2006.
[14] M. Richard and R. Lippmann, “Neural network classifiers estimate Bayesian a poste-
riori probabilities,” Neural computation, vol. 3, no. 4, pp. 461–483, 1991.
[15] I. Jolliffe, Principal component analysis. Wiley Online Library, 2005.
[16] H. Hermansky and N. Malayath, “Spectral basis functions from discriminant analy-
sis,” in Proceedings of ICSLP. ISCA, 1998.
[17] R. Cole, M. Fanty, M. Noel, and T. Lander, “Telephone speech corpus development
at CSLU,” in Proceedings of ICSLP. ISCA, 1994.
[18] P. Brown, “The acoustic-modeling problem in automatic speech recognition,” Ph.D.
dissertation, Carnegie-Mellon University, 1987.
[19] M. Hunt, “A statistical approach to metrics for word and syllable recognition,” The
Journal of The Acoustical Society of America, vol. 66, no. S1, pp. S35–S36, 1979.
[20] M. Hunt and C. Lefebvre, “A comparison of several acoustic representations for speech
recognition with degraded and undegraded speech,” in Proceedings of ICASSP. IEEE,
1989.
[21] S. Van Vuuren and H. Hermansky, “Data-driven design of RASTA-like filters,” in
Proceedings of Eurospeech. ESCA, 1997.
[22] N. Malayath and H. Hermansky, “Data-driven spectral basis functions for automatic
speech recognition,” Speech Communication, vol. 40, no. 4, pp. 449–466, 2003.
[23] F. Valente and H. Hermansky, “Discriminant linear processing of time-frequency
plane,” in Proceedings of INTERSPEECH. ISCA, 2006.
[24] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254–272,
1981.
[25] A. Jansen and P. Niyogi, “Intrinsic Fourier analysis on the manifold of speech sounds,”
in Proceedings of ICASSP. IEEE, 2006.
[26] V. Jain and L. Saul, “Exploratory analysis and visualization of speech and music by
locally linear embedding,” in Proceedings of ICASSP. IEEE, 2004.
[27] A. Errity and J. McKenna, “An investigation of manifold learning for speech analysis,”
in Proceedings of ICSLP. ISCA, 2006.
[28] H. Hermansky, D. Ellis, and S. Sharma, “Tandem connectionist feature extraction for
conventional HMM systems,” in Proceedings of ICASSP. IEEE, 2000.
[29] B. Chen, S. Chang, and S. Sivadas, “Learning discriminative temporal patterns in
speech: Development of novel TRAPS-like classifiers,” in Proceedings of Eurospeech,
vol. 242. ESCA, 2003.
[30] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, “Probabilistic and bottle-neck
features for LVCSR of meetings,” in Proceedings of ICASSP. IEEE, 2007.
[31] J. Pinto, S. Garimella, M. Magimai-Doss, H. Hermansky, and H. Bourlard, “Analysis
of MLP-based hierarchical phoneme posterior probability estimator,” IEEE Transac-
tions on Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 225–241, 2011.
[32] J. Pinto, G. Sivaram, H. Hermansky, and M. Magimai-Doss, “Volterra series for ana-
lyzing MLP based phoneme posterior estimator,” in Proceedings of ICASSP. IEEE,
2009.
[33] L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information
estimation of hidden Markov model parameters for speech recognition,” in Proceedings
of ICASSP. IEEE, 1986.
[34] P. Woodland, D. Povey et al., “Large scale MMIE training for conversational telephone
speech recognition,” in Proceedings of Speech Transcription Workshop, 2000.
[35] E. McDermott, “Discriminative training for speech recognition,” Ph.D. dissertation,
Waseda University, Japan, 1997.
[36] G. Doddington, “Phonetically sensitive discriminants for improved speech recogni-
tion,” in Proceedings of ICASSP. IEEE, 1989.
[37] E. Schukat-Talamazzini, J. Hornegger, and H. Niemann, “Optimal linear feature
transformations for semi-continuous hidden Markov models,” in Proceedings of
ICASSP. IEEE, 1995.
[38] N. Kumar and A. Andreou, “Heteroscedastic discriminant analysis and reduced rank
HMMs for improved speech recognition,” Speech Communication, vol. 26, no. 4, pp.
283–297, 1998.
[39] R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classi-
fication,” in Proceedings of ICASSP. IEEE, 1998.
[40] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and
K. Visweswariah, “Boosted MMI for model and feature-space discriminative train-
ing,” in Proceedings of ICASSP. IEEE, 2008.
[41] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE: Dis-
criminatively trained features for speech recognition,” in Proceedings of ICASSP.
IEEE, 2005.
[42] N. Kambhatla and T. Leen, “Dimension reduction by local principal component anal-
ysis,” Neural Computation, vol. 9, no. 7, pp. 1493–1516, 1997.
[43] B. Zhang, S. Matsoukas, and R. Schwartz, “Discriminatively trained region dependent
feature transforms for speech recognition,” in Proceedings of ICASSP. IEEE, 2006.
[44] J. Zheng, O. Cetin, M. Hwang, X. Lei, A. Stolcke, and N. Morgan, “Combining
discriminative feature, transform, and model training for large vocabulary speech
recognition,” in Proceedings of ICASSP. IEEE, 2007.
[45] E. Zwicker, G. Flottorp, and S. Stevens, “Critical band width in loudness summation,”
The Journal of the Acoustical Society of America, vol. 29, no. 5, pp. 548–557, 1957.
[46] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on
Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[47] H. Hermansky and P. Fousek, “Multi-resolution RASTA filtering for TANDEM-based
ASR,” in Proceedings of INTERSPEECH. ISCA, 2005.
[48] L. Lee and R. Rose, “Speaker normalization using efficient frequency warping proce-
dures,” in Proceedings of ICASSP. IEEE, 1996.
[49] ETSI, “Speech processing, transmission and quality aspects (STQ); Distributed
speech recognition; Advanced front-end feature extraction algorithm; Compression
algorithms,” ETSI ES, vol. 202, no. 050, p. v1, 2007.
[50] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Communication, vol. 16, no. 3, pp. 261–291, 1995.
[51] M. Gales and S. Young, “The application of hidden Markov models in speech recog-
nition,” Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, 2007.
[52] C. Avendano, “Temporal processing of speech in a time-feature space,” Ph.D. disser-
tation, Oregon Graduate Institute, 1997.
[53] D. Gelbart and N. Morgan, “Double the trouble: handling noise and reverberation in
far-field automatic speech recognition,” in Proceedings of ICSLP. ISCA, 2002.
[54] N. Morgan, H. Bourlard, C. Wooters, P. Kohn, and M. Cohen, “Phonetic context in
hybrid HMM/MLP continuous speech recognition,” in Second European Conference
on Speech Communication and Technology, 1991.
[55] B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz, “Long span features and minimum
phoneme error heteroscedastic linear discriminant analysis,” in Proceedings of EARS
RT-04 Workshop, 2004.
[56] S. Chen, B. Kingsbury, L. Mangu, D. Povey, G. Saon, H. Soltau, and G. Zweig,
“Advances in speech transcription at IBM under the DARPA EARS program,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1596–
1608, 2006.
[57] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf,
P. Jain, H. Hermansky, D. Ellis et al., “Pushing the envelope - aside: Beyond the
spectral envelope as the fundamental representation for speech recognition,” IEEE
Signal Processing Magazine, vol. 22, no. 5, pp. 81–88, 2005.
[58] H. Yang, S. van Vuuren, S. Sharma, and H. Hermansky, “Relevance of time-frequency fea-
tures for phonetic and speaker-channel classification,” Speech Communication, vol. 31,
no. 1, pp. 35–50, 2000.
[59] R. Drullman, J. Festen, and R. Plomp, “Effect of reducing slow temporal modulations
on speech reception,” The Journal of the Acoustical Society of America, vol. 95, p.
2670, 1994.
[60] T. Arai, M. Pavel, H. Hermansky, and C. Avendano, “Syllable intelligibility for tem-
porally filtered LPC cepstral trajectories,” The Journal of the Acoustical Society of
America, vol. 105, p. 2783, 1999.
[61] T. Houtgast and H. Steeneken, “The modulation transfer function in room acoustics as
a predictor of speech intelligibility,” The Journal of the Acoustical Society of America,
vol. 54, no. 2, pp. 557–557, 1973.
[62] T. Chi, Y. Gao, M. Guyton, P. Ru, and S. Shamma, “Spectro-temporal modulation
transfer functions and speech intelligibility,” The Journal of the Acoustical Society of
America, vol. 106, p. 2719, 1999.
[63] T. Houtgast and H. Steeneken, “A review of the MTF concept in room acoustics and
its use for estimating speech intelligibility in auditoria,” The Journal of the Acoustical
Society of America, vol. 77, p. 1069, 1985.
[64] H. Hermansky, “The modulation spectrum in the automatic recognition of speech,”
in Proceedings of ASRU. IEEE, 1997.
[65] H. Hermansky and S. Sharma, “TRAPS-classifiers of temporal patterns,” in Proceed-
ings of ICSLP. ISCA, 1998.
[66] ——, “Temporal patterns (TRAPS) in ASR of noisy speech,” in Proceedings of
ICASSP. IEEE, 1999.
[67] P. Schwarz, “Phoneme recognition based on long temporal context,” Ph.D. disserta-
tion, Brno University of Technology, 2009.
[68] P. Jain and H. Hermansky, “Beyond a single critical-band in TRAP based ASR,” in
Eighth European Conference on Speech Communication and Technology, 2003.
[69] J. Herre and J. Johnston, “Enhancing the performance of perceptual audio coders by
using temporal noise shaping (TNS),” in 101st AES Convention, 1996.
[70] R. Kumaresan and A. Rao, “Model-based approach to envelope and positive instan-
taneous frequency estimation of signals with speech applications,” The Journal of the
Acoustical Society of America, vol. 105, p. 1912, 1999.
[71] M. Athineos, “Linear prediction of temporal envelopes for speech and audio applica-
tions,” Ph.D. dissertation, Columbia University, 2008.
[72] S. Ganapathy, “Signal analysis using autoregressive models of amplitude modulation,”
Ph.D. dissertation, The Johns Hopkins University, 2012.
[73] B. Chen, Q. Zhu, and N. Morgan, “Learning long-term temporal features in LVCSR
using neural networks,” in Proceedings of ICSLP. ISCA, 2004.
[74] F. Valente and H. Hermansky, “Combination of acoustic classifiers based on Dempster-
Shafer theory of evidence,” in Proceedings of ICASSP. IEEE, 2007.
[75] Q. Zhu, B. Chen, N. Morgan, A. Stolcke et al., “On using MLP features in LVCSR,”
in Proceedings of ICSLP. ISCA, 2004.
[76] M. Athineos and D. Ellis, “Autoregressive modeling of temporal envelopes,” IEEE
Transactions on Signal Processing, vol. 55, no. 11, pp. 5237–5245, 2007.
[77] S. Thomas, S. Ganapathy, and H. Hermansky, “Recognition of reverberant speech
using frequency domain linear prediction,” IEEE Signal Processing Letters, vol. 15,
pp. 681–684, 2008.
[78] S. Ganapathy, S. Thomas, and H. Hermansky, “Temporal envelope compensation for
robust phoneme recognition using modulation spectrum,” The Journal of the Acous-
tical Society of America, vol. 128, p. 3769, 2010.
[79] S. Thomas, S. Ganapathy, and H. Hermansky, “Spectro-temporal features for auto-
matic speech recognition using linear prediction in spectral domain,” in Proceedings
of EUSIPCO. EURASIP, 2008.
[80] ——, “Phoneme recognition using spectral envelope and modulation frequency fea-
tures,” in Proceedings of ICASSP. IEEE, 2009.
[81] S. Mallidi, S. Ganapathy, and H. Hermansky, “Modulation spectrum analysis for
recognition of reverberant speech,” in Proceedings of INTERSPEECH. ISCA, 2011.
[82] S. Thomas, S. Ganapathy, and H. Hermansky, “Hilbert envelope based features for
far-field speech recognition,” Machine Learning for Multimodal Interaction, pp. 119–
124, 2008.
[83] T. Dau, D. Püschel, and A. Kohlrausch, “A quantitative model of the effective signal
processing in the auditory system. I. Model structure,” The Journal of the Acoustical
Society of America, vol. 99, p. 3615, 1996.
[84] S. Ganapathy, S. Thomas, and H. Hermansky, “Modulation frequency features for
phoneme recognition in noisy speech,” The Journal of the Acoustical Society of Amer-
ica, vol. 125, no. 1, pp. EL8–EL12, 2008.
[85] ——, “Comparison of modulation features for phoneme recognition,” in Proceedings
of ICASSP. IEEE, 2010.
[86] B. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the
modulation spectrogram,” Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.
[87] V. Tyagi and C. Wellekens, “Fepstrum representation of speech signal,” in Proceedings
of ASRU. IEEE, 2005.
[88] S. Ganapathy, S. Thomas, and H. Hermansky, “Static and dynamic modulation spec-
trum for speech recognition,” in Proceedings of INTERSPEECH. ISCA, 2009.
[89] K. Lee and H. Hon, “Speaker-independent phone recognition using hidden Markov
models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37,
no. 11, pp. 1641–1648, 1989.
[90] T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiat, M. Lincoln, I. McCowan,
D. Moore, V. Wan, R. Ordelman et al., “The 2005 AMI system for the transcription
of speech in meetings,” Machine Learning for Multimodal Interaction, pp. 450–462,
2006.
[91] J. Fiscus, N. Radde, J. Garofolo, A. Le, J. Ajot, and C. Laprun, “The rich transcrip-
tion 2005 spring meeting recognition evaluation,” Machine Learning for Multimodal
Interaction, pp. 369–389, 2006.
[92] D. Moore, J. Dines, M. Doss, J. Vepa, O. Cheng, and T. Hain, “Juicer: A weighted
finite-state transducer speech decoder,” Machine Learning for Multimodal Interaction,
pp. 285–296, 2006.
[93] G. Zavaliagkos, M. Siu, T. Colthurst, and J. Billa, “Using untranscribed training data
to improve performance,” in Proceedings of ICSLP. ESCA, 1998.
[94] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C. Lee, “A study on multilingual
acoustic modeling for large vocabulary ASR,” in Proceedings of ICASSP. IEEE,
2009.
[95] D. Imseng, H. Bourlard, and P. Garner, “Using KL-divergence and multilingual infor-
mation to improve ASR for under-resourced languages,” in Proceedings of ICASSP.
IEEE, 2012.
[96] IPA, Handbook of the International Phonetic Association: A guide to the use of the
International Phonetic Alphabet. Cambridge University Press, 1999.
[97] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek,
N. Goel, M. Karafiat, D. Povey et al., “Multilingual acoustic modeling for speech
recognition based on subspace Gaussian mixture models,” in Proceedings of ICASSP.
IEEE, 2010.
[98] Y. Qian, D. Povey, and J. Liu, “State-level data borrowing for low-resource speech
recognition based on subspace GMMs,” in Proceedings of INTERSPEECH. ISCA,
2011.
[99] S. Sivadas and H. Hermansky, “On use of task independent training data in tandem
feature extraction,” in Proceedings of ICASSP. IEEE, 2004.
[100] J. Pinto, “Multilayer perceptron based hierarchical acoustic modeling for automatic
speech recognition,” Ph.D. dissertation, École Polytechnique Fédérale de Lausanne,
2010.
[101] A. Stolcke, F. Grezl, M. Hwang, X. Lei, N. Morgan, and D. Vergyri, “Cross-domain
and cross-language portability of acoustic features estimated by multilayer percep-
trons,” in Proceedings of ICASSP. IEEE, 2006.
[102] G. Miller and P. Nicely, “An analysis of perceptual confusions among some English
consonants,” The Journal of the Acoustical Society of America, vol. 27, no. 2, pp.
338–352, 1955.
[103] A. Lovitt, J. Pinto, and H. Hermansky, “On confusions in a phoneme recognizer,”
IDIAP Research Report, IDIAP-RR-07-10, Tech. Rep., 2007.
[104] C. Pelaez-Moreno, A. Garcia-Moral, and F. Valverde-Albacete, “Analyzing phonetic
confusions using formal concept analysis,” The Journal of the Acoustical Society of
America, vol. 128, p. 1377, 2010.
[105] R. Fano, Transmission of Information: A Statistical Theory of Communication. MIT
press, 1961.
[106] A. Canavan, D. Graff, and G. Zipperlen, “CALLHOME American English speech,”
Linguistic Data Consortium, 1997.
[107] ——, “CALLHOME German speech,” Linguistic Data Consortium, 1997.
[108] A. Canavan and G. Zipperlen, “CALLHOME Spanish speech,” Linguistic Data Con-
sortium, 1997.
[109] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev,
and P. Woodland, “The HTK book,” Cambridge University Engineering Department,
vol. 3, 2002.
[110] J. Godfrey, E. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech
corpus for research and development,” in Proceedings of ICASSP. IEEE, 1992.
[111] D. Graff, J. Kong, K. Chen, and K. Maeda, “English gigaword,” Linguistic Data
Consortium, Philadelphia, 2003.
[112] P. Kingsbury, S. Strassel, C. McLemore, and R. MacIntyre, “CALLHOME American
English lexicon (PRONLEX),” Linguistic Data Consortium, Philadelphia, 1997.
[113] N. Morgan, “Deep and wide: Multiple layers in automatic speech recognition,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 7–13,
2012.
[114] F. Valente and H. Hermansky, “Hierarchical and parallel processing of modulation
spectrum for ASR applications,” in Proceedings of ICASSP. IEEE, 2008.
[115] G. Sivaram and H. Hermansky, “Sparse multilayer perceptron for phoneme recogni-
tion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1,
pp. 23–29, 2012.
[116] A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief net-
works,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1,
pp. 14–22, 2012.
[117] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neu-
ral networks for large vocabulary speech recognition,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[118] T. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A. Mohamed,
“Making deep belief networks effective for large vocabulary continuous speech recog-
nition,” in Proceedings of ASRU. IEEE, 2011.
[119] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep
neural networks for conversational speech transcription,” in Proceedings of ASRU.
IEEE, 2011.
[120] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio, “Why
does unsupervised pre-training help deep learning?” The Journal of Machine Learning
Research, vol. 11, pp. 625–660, 2010.
[121] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-
dependent DBN-HMMs for real-world speech recognition,” in Proceedings of NIPS
Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
[122] D. Yu and M. Seltzer, “Improved bottleneck features using pretrained deep neural
networks,” in Proceedings of INTERSPEECH. ISCA, 2011.
[123] T. Sainath, B. Kingsbury, and B. Ramabhadran, “Auto-encoder bottleneck features
using deep belief networks,” in Proceedings of ICASSP. IEEE, 2012.
[124] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,”
Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[125] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of
deep networks,” Advances in Neural Information Processing Systems, vol. 19, p. 153,
2007.
[126] T. Kemp and A. Waibel, “Unsupervised training of a speech recognizer: Recent ex-
periments,” in Proceedings of EUROSPEECH. ESCA, 1999.
[127] L. Lamel, J. Gauvain, and G. Adda, “Unsupervised acoustic model training,” in
Proceedings of ICASSP. IEEE, 2002.
[128] J. Ma, S. Matsoukas, O. Kimball, and R. Schwartz, “Unsupervised training on large
amounts of broadcast news data,” in Proceedings of ICASSP. IEEE, 2006.
[129] T. Kemp and T. Schaaf, “Estimating confidence using word lattices,” in Proceedings
of EUROSPEECH. ESCA, 1997.
[130] L. Burget, P. Schwarz, P. Matejka, M. Hannemann, A. Rastrow, C. White, S. Khu-
danpur, H. Hermansky, and J. Cernocky, “Combination of strongly and weakly con-
strained recognizers for reliable detection of OOVs,” in Proceedings of ICASSP.
IEEE, 2008.
[131] K. Woo, T. Yang, K. Park, and C. Lee, “Robust voice activity detection algorithm
for estimating noise spectrum,” IET Electronics Letters, vol. 36, no. 2, pp. 180–181,
2000.
[132] R. Chengalvarayan, “Robust energy normalization using speech/non-speech discrim-
inator for German connected digit recognition,” in Proceedings of EUROSPEECH.
ISCA, 1999.
[133] A. Benyassine, E. Shlomot, H. Su, D. Massaloux, C. Lamblin, and J. Petit, “ITU-T
Recommendation G.729 Annex B: a silence compression scheme for use with G.729
optimized for V.70 digital simultaneous voice and data applications,” IEEE
Communications Magazine, vol. 35, no. 9, pp. 64–73, 1997.
[134] E. Nemer, R. Goubran, and S. Mahmoud, “Robust voice activity detection using
higher-order statistics in the LPC residual domain,” IEEE Transactions on Speech
and Audio Processing, vol. 9, no. 3, pp. 217–231, 2001.
[135] J. Dines, J. Vepa, and T. Hain, “The segmentation of multi-channel meeting record-
ings for automatic speech recognition,” in Proceedings of INTERSPEECH. ISCA,
2006.
[136] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, and R. Sarikaya, “Robust speech
recognition in noisy environments: The 2001 IBM SPINE evaluation system,” in
Proceedings of ICASSP. IEEE, 2002.
[137] H. Hermansky, S. Tibrewala, and M. Pavel, “Towards ASR on partially corrupted
speech,” in Proceedings of ICSLP. ISCA, 1996.
[138] N. Mesgarani, S. Thomas, and H. Hermansky, “Toward optimizing stream fusion in
multistream recognition of speech,” The Journal of the Acoustical Society of America
- Express Letters, 2011.
[139] S. Thomas, S. Mallidi, T. Janu, H. Hermansky, N. Mesgarani, X. Zhou, S. Shamma,
T. Ng, B. Zhang, L. Nguyen et al., “Acoustic and data-driven features for robust
speech activity detection,” in Proceedings of INTERSPEECH. ISCA, 2012.
[140] K. Walker and S. Strassel, “The RATS radio traffic collection system,” in Proceedings
of Odyssey. ISCA, 2012.
[141] X. Ma, D. Graff, and K. Walker, “RATS - first incremental SAD audio delivery,”
Linguistic Data Consortium, 2011.
[142] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, K. Vesely, P. Matejka, X. Zhu, and
N. Mesgarani, “Developing a speech activity detection system for the DARPA RATS
program,” in Proceedings of INTERSPEECH. ISCA, 2012.
[143] D. Reynolds and R. Rose, “Robust text-independent speaker identification using
Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Pro-
cessing, vol. 3, no. 1, pp. 72–83, 1995.
[144] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted Gaussian
mixture models,” Digital Signal Processing, vol. 10, no. 1, pp. 19–41, 2000.
[145] B. Yegnanarayana and S. Kishore, “AANN: an alternative to GMM for pattern recog-
nition,” Neural Networks, vol. 15, no. 3, pp. 459–469, 2002.
[146] M. Shajith Ikbal, H. Misra, and B. Yegnanarayana, “Analysis of autoassociative map-
ping neural networks,” in Proceedings of IJCNN. IEEE, 1999.
[147] K. Murty and B. Yegnanarayana, “Combining evidence from residual phase and
MFCC features for speaker recognition,” IEEE Signal Processing Letters, vol. 13,
no. 1, pp. 52–55, 2006.
[148] G. Sivaram, S. Thomas, and H. Hermansky, “Mixture of auto-associative neural net-
works for speaker verification,” in Proceedings of INTERSPEECH. ISCA, 2011.
[149] S. Ganapathy, J. Pelecanos, and M. Omar, “Feature normalization for speaker verifi-
cation in room reverberation,” in Proceedings of ICASSP. IEEE, 2011.
[150] S. Thomas, S. Mallidi, S. Ganapathy, and H. Hermansky, “Adaptation transforms of
auto-associative neural networks as features for speaker verification,” in Proceedings
of Odyssey. ISCA, 2012.
[151] S. Garimella, “Alternative regularized neural network architectures for speech and
speaker recognition,” Ph.D. dissertation, The Johns Hopkins University, 2012.
[152] A. Jansen, K. Church, and H. Hermansky, “Towards spoken term discovery at scale
with zero resources,” in Proceedings of INTERSPEECH. ISCA, 2010.
[153] A. Muscariello, G. Gravier, F. Bimbot et al., “Audio keyword extraction by unsuper-
vised word discovery,” in Proceedings of INTERSPEECH. ISCA, 2009.
[154] Y. Zhang and J. Glass, “Towards multi-speaker unsupervised speech pattern discov-
ery,” in Proceedings of ICASSP. IEEE, 2010.
[155] M. Carlin, S. Thomas, A. Jansen, and H. Hermansky, “Rapid evaluation of speech
representations for spoken term discovery,” in Proceedings of INTERSPEECH. ISCA,
2011.
[156] S. Thomas, P. Nguyen, G. Zweig, and H. Hermansky, “MLP based phoneme detectors
for automatic speech recognition,” in Proceedings of ICASSP. IEEE, 2011.
[157] G. Zweig and P. Nguyen, “A segmental CRF approach to large vocabulary continuous
speech recognition,” in Proceedings of ASRU. IEEE, 2009.
[158] ——, “SCARF: A segmental conditional random field toolkit for speech recognition,”
in Proceedings of INTERSPEECH. ISCA, 2010.
[159] G. Zweig, P. Nguyen, D. Van Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell,
M. Wang, F. Sha, H. Hermansky et al., “Speech recognition with segmental con-
ditional random fields: A summary of the JHU CLSP 2010 summer workshop,” in
Proceedings of ICASSP. IEEE, 2011.
[160] S. Thomas, S. Ganapathy, and H. Hermansky, “Hilbert envelope based spectro-
temporal features for phoneme recognition in telephone speech,” in Proceedings of
INTERSPEECH. ISCA, 2008.
[161] ——, “Tandem representations of spectral envelope and modulation frequency fea-
tures for ASR,” in Proceedings of INTERSPEECH. ISCA, 2009.
[162] ——, “Cross-lingual and multistream posterior features for low resource LVCSR sys-
tems,” in Proceedings of INTERSPEECH. ISCA, 2010, pp. 877–880.
[163] ——, “Multilingual MLP features for low-resource LVCSR systems,” in Proceedings
of ICASSP. IEEE, 2012.
[164] S. Thomas, S. Ganapathy, A. Jansen, and H. Hermansky, “Data-driven posterior
features for low resource speech recognition applications,” in Proceedings of INTER-
SPEECH. ISCA, 2012.
[165] Y. Zhang and J. Glass, “Unsupervised spoken keyword spotting via segmental DTW
on Gaussian posteriorgrams,” in Proceedings of ASRU. IEEE, 2009.
[166] B. Varadarajan, S. Khudanpur, and E. Dupoux, “Unsupervised learning of acoustic
sub-word units,” in Proceedings of HLT. ACL, 2008.
[167] M. Siu, H. Gish, S. Lowe, and A. Chan, “Unsupervised audio patterns discovery using
HMM-based self-organized units,” in Proceedings of INTERSPEECH. ISCA, 2011.
[168] X. Anguera, “Speaker independent discriminant feature extraction for acoustic
pattern-matching,” in Proceedings of ICASSP. IEEE, 2012.
[169] C. Lee and J. Glass, “A non-parametric Bayesian approach to acoustic model discov-
ery,” in Proceedings of ACL, 2012.
[170] A. Jansen and K. Church, “Towards unsupervised training of speaker independent
acoustic models,” in Proceedings of INTERSPEECH. ISCA, 2011.
[171] N. Mesgarani, S. Thomas, and H. Hermansky, “Adaptive stream fusion in multistream
recognition of speech,” in Proceedings of INTERSPEECH. ISCA, 2011.
[172] ——, “A multistream multiresolution framework for phoneme recognition,” in Pro-
ceedings of INTERSPEECH. ISCA, 2010.
Vita
Samuel Thomas received his Bachelor of Technology degree in Computer Science and Engineering from Cochin University of Science and Technology, Kerala, India, in 2000, and his Master of Science (by Research) degree from the Indian Institute of Technology, Madras, in 2006. He completed his Ph.D. in Electrical and Computer Engineering in 2012 while affiliated with the Center for Language and Speech Processing (CLSP) at the Johns Hopkins University. He is currently a post-doctoral researcher at the IBM T. J. Watson Research Center, Yorktown Heights, USA. His research interests include speech recognition, speaker recognition, speech synthesis and machine learning. In the past, he has been part of several summer workshops at the CLSP and has also worked at the Idiap Research Institute, Switzerland, and the IBM India Research Lab, New Delhi, India.