
Discriminant Feature Space Transformation for Automatic Speech Recognition

Vikrant Tomar

Department of Electrical & Computer Engineering, McGill University, Montreal, Canada

December 15, 2010

Traditional ASR – The Big Picture

Speech signal (Aurora TIDigits corpus, multi-noise mixed data; separate data for training and testing)
→ MFCC feature extraction (13 coefficients)
→ time derivatives (delta- and acceleration-cepstrum; 39 dimensions)
→ train HMM → run recognition → results

ASR – A Machine Learning Perspective

• Objective: find the optimal word sequence Ŵ, from among all possible word sequences W, that yields the highest probability given the acoustic feature data X [3].

Ŵ = argmax_W p(W|X)
  = argmax_W p(X|W) p(W) / p(X)
  = argmax_W p(X|W) × p(W)
            (acoustic model) (language model)

• p(X|W): probability of observing X under the assumption that W is the true utterance.

• p(W): probability that word sequence W is uttered.

• p(X): average probability that X will be observed.
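The decision rule above can be sketched with a toy example. Since p(X) is constant over W, it drops out of the argmax, and in practice everything is done in log space. The word hypotheses and scores below are made-up stand-ins, not real model outputs:

```python
# Hypothetical toy example of the Bayes decision rule W* = argmax_W p(X|W) p(W).
# The scores are invented log-probabilities, not outputs of a real ASR system.
log_acoustic = {"one": -12.0, "won": -11.5, "wan": -15.0}  # log p(X|W)
log_language = {"one": -1.0, "won": -4.0, "wan": -9.0}     # log p(W)

# p(X) is the same for every hypothesis, so it can be dropped from the argmax.
best = max(log_acoustic, key=lambda w: log_acoustic[w] + log_language[w])
print(best)  # "one": total -13.0 beats "won": -15.5
```

Note how the language model overrides a slightly better acoustic score for "won"; this interplay is exactly why both terms appear in the product.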

MFCC Feature Extraction

Mel-filter bank
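As a hedged sketch of this stage (pure NumPy, with assumed parameters: 23 triangular filters, a 512-point FFT, and 8 kHz sampling, none of which are specified in the slides), the mel-filterbank followed by the DCT in the final step can be illustrated as:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=512, sr=8000):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

# MFCC = DCT of the log mel-filterbank energies of one power-spectrum frame.
power_spectrum = np.abs(np.fft.rfft(np.random.randn(512))) ** 2  # stand-in frame
log_energies = np.log(mel_filterbank() @ power_spectrum + 1e-10)

# DCT-II, keeping the first 13 coefficients (the static cepstrum).
n = log_energies.size
k = np.arange(13)
basis = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1) * k[:, None] / (2 * n))
mfcc = basis @ log_energies
print(mfcc.shape)  # (13,)
```

The DCT here is the fixed linear operator whose optimality the Problem Statement below calls into question.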

Problem Statement

• The MFCC cepstrum gives fairly accurate static information about speech. However, we are also interested in the time evolution of speech.

• For this, the static cepstrum vector is augmented with Δ- and ΔΔ-cepstrum features. But ...

• The components of the augmented feature vector are strongly correlated [5].

• The obtained features are not necessarily the most parsimonious representation for capturing the dynamics of speech.

• It is not even clear that the linear operator (the discrete cosine transform) used in the final step of cepstrum-coefficient extraction is an optimal choice.

• To address this, LDA-based methods of capturing the dynamics of speech have been suggested [1, 4].
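The Δ/ΔΔ augmentation mentioned above can be sketched with the standard regression formula for time derivatives (the HTK-style window of ±2 frames is an assumption; the slides do not state the window size):

```python
import numpy as np

def deltas(c, k=2):
    """Regression-based time derivatives over a +/- k frame window,
    the standard formula for delta-cepstrum features."""
    T, _ = c.shape
    padded = np.vstack([c[:1]] * k + [c] + [c[-1:]] * k)  # replicate edge frames
    num = sum(t * (padded[k + t:k + t + T] - padded[k - t:k - t + T])
              for t in range(1, k + 1))
    return num / (2 * sum(t * t for t in range(1, k + 1)))

cep = np.random.randn(100, 13)             # 100 frames of 13 static MFCCs
delta = deltas(cep)                        # velocity
accel = deltas(delta)                      # acceleration
augmented = np.hstack([cep, delta, accel]) # 39-dimensional vector per frame
print(augmented.shape)  # (100, 39)
```

Because delta and acceleration are linear combinations of neighbouring static frames, the 39 components are strongly correlated, which is precisely the objection raised in [5].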


ASR as a classification problem

• Divide the speech feature vectors into various classes, and apply class-based discrimination algorithms.

• The best definition of the classes is not clear: one can use words, phones, HMM states, or some other arbitrarily specified notion of class.

• Use Linear Discriminant Analysis (LDA) for a discriminant feature-space transformation on an augmented supervector (dim. 117 → 39).

• Advantages:
  • a proper mathematical basis for extracting information about the time evolution of speech frames
  • maximizes the separability of the different classes (however defined)
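The 117 → 39 LDA projection can be sketched by solving the generalized eigenproblem Sb v = λ Sw v (between-class vs. within-class scatter). The data and the 60 stand-in state classes below are random placeholders, not the Aurora features:

```python
import numpy as np

# Hedged sketch: project 117-dim supervectors (e.g. 9 stacked 13-dim frames,
# an assumed stacking) down to 39 dims with classical LDA.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 117))
y = rng.integers(0, 60, size=5000)      # hypothetical HMM-state class labels

mean = X.mean(axis=0)
Sw = np.zeros((117, 117))               # within-class scatter
Sb = np.zeros((117, 117))               # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    diff = (mc - mean)[:, None]
    Sb += len(Xc) * (diff @ diff.T)

# Solve Sw^{-1} Sb and keep the 39 leading discriminant directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(-eigvals.real)
A = eigvecs[:, order[:39]].real         # 117 x 39 projection matrix
Z = X @ A
print(Z.shape)  # (5000, 39)
```

On random data the projection is meaningless, but on state-aligned features the leading directions maximize class separability, which is the advantage claimed above.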


LDA sounds good, but...

• ASR models are often trained under the assumption that the data has diagonal covariance.

• LDA produces a projected space whose dimensions may be highly correlated (full covariance matrix).

• To tackle this, we can use a separate covariance matrix for each class. Or,

• a semi-tied covariance transform [2].


Semitied Covariance Transform

• Instead of having a distinct full covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
  • a component-specific diagonal covariance element
  • a semi-tied, class-dependent, non-diagonal matrix

• This effectively transforms LDA's full covariance matrix into a near-diagonal form, which results in increased ASR performance.
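The two-element decomposition can be illustrated numerically. A real semi-tied estimator iterates maximum-likelihood updates of the shared matrix and the diagonals (Gales [2]); in this sketch the shared matrix H is simply fixed at an arbitrary value to show why likelihoods stay cheap: transform the observation by A = H⁻¹ and evaluate a diagonal Gaussian.

```python
import numpy as np

# Semi-tied parameterization: Sigma_m = H diag_m H^T, where diag_m is
# component-specific and the non-diagonal H is shared ("tied") across a class.
rng = np.random.default_rng(1)
d = 4
H = np.eye(d) + 0.1 * rng.standard_normal((d, d))  # shared matrix (stand-in)
diag_m = np.abs(rng.standard_normal(d)) + 0.5      # component-specific diagonal
Sigma_m = H @ np.diag(diag_m) @ H.T

# Likelihood via the transform A = H^{-1}: z = A x is diagonally distributed.
A = np.linalg.inv(H)
x = rng.standard_normal(d)
z = A @ x
log_det = np.sum(np.log(diag_m)) - 2.0 * np.log(abs(np.linalg.det(A)))
ll_diag = -0.5 * (d * np.log(2 * np.pi) + log_det + np.sum(z * z / diag_m))

# Cross-check against the full-covariance Gaussian log-likelihood.
_, full_logdet = np.linalg.slogdet(Sigma_m)
ll_full = -0.5 * (d * np.log(2 * np.pi) + full_logdet
                  + x @ np.linalg.solve(Sigma_m, x))
print(np.isclose(ll_diag, ll_full))  # True
```

The two log-likelihoods agree exactly, so the recognizer keeps its diagonal-covariance machinery while still modeling the correlations that LDA introduces.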

LDA-based ASR – The Big Picture

Speech signal (Aurora TIDigits corpus, multi-noise mixed data; separate data for training and testing)
→ MFCC feature extraction (13 coefficients)
→ feature augmentation of state-aligned frames (force alignment; 117 dimensions)
→ LDA (117 → 39)
→ semi-tied covariance transform (39)
→ train HMM → run recognition
→ compare results against the baseline pipeline (13 MFCCs + delta- and acceleration-cepstrum, 39 dimensions)

Results-Test Case A

[Figures: word accuracy (%) vs. SNR (dB), 5–25 dB, for Test Case A; panels include Subway, Car, and Exhibition noise; curves: Baseline, LDA, STC.]

Results-Test Case B

[Figures: word accuracy (%) vs. SNR (dB), 5–25 dB, for Test Case B; panels include Restaurant, Street, and Airport noise; curves: Baseline, LDA, STC.]

Results-Test Case C

[Figures: word accuracy (%) vs. SNR (dB), 5–25 dB, for Test Case C; panels include Subway and Street noise; curves: Baseline, LDA, STC.]

References

[1] P. F. Brown. The Acoustic-Modeling Problem in Automatic Speech Recognition. PhD thesis, Carnegie Mellon University, May 1987.

[2] M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 7(3):272–281, May 1999.

[3] X. Huang, A. Acero, and H.-W. Hon. Spoken Language Processing. Prentice Hall PTR, 2001.

[4] G. Saon, M. Padmanabhan, R. Gopinath, and S. Chen. Maximum likelihood discriminant feature spaces. Technical report, IBM T. J. Watson Research Center, 2000.

[5] Y. Tang and R. Rose. A study of using locality preserving projections for feature extraction in speech recognition. In ICASSP: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.
