

Discriminant Feature Space Transformation for Automatic Speech Recognition

Vikrant Tomar

Department of Electrical & Computer Engineering, McGill University, Montreal, Canada

December 15, 2010


Traditional ASR – The Big Picture

[Block diagram: speech signal (Aurora TIDIGITS corpus, multi-noise mixed data; separate data for training and testing) → MFCC feature extraction (13 coefficients) → time derivatives (delta- and acceleration-cepstrum, 39-dimensional features) → the ASR system: train HMM → run recognition → results]


ASR – A Machine Learning Perspective

• Objective: Find the optimal word sequence Ŵ, out of all possible word sequences W, that yields the highest probability given the acoustic feature data X [3].

\hat{W} = \arg\max_{W} p(W \mid X) = \arg\max_{W} \frac{p(X \mid W)\, p(W)}{p(X)} = \arg\max_{W} \underbrace{p(X \mid W)}_{\text{acoustic model}} \times \underbrace{p(W)}_{\text{language model}}

• p(X|W): probability of observing X under the assumption that W is the true utterance.

• p(W): probability that word sequence W is uttered.

• p(X): average probability that X will be observed.
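Below is a minimal sketch of this decision rule, assuming we already have (hypothetical) acoustic and language model log-scores for a few candidate word sequences; a real decoder searches over HMM state sequences rather than enumerating hypotheses like this.

```python
# Toy illustration of the decision rule above. The hypotheses and their
# scores are made-up placeholders, not output of a real recognizer.
hypotheses = {
    "one two three": {"log_p_x_given_w": -120.4, "log_p_w": -3.2},
    "one too free":  {"log_p_x_given_w": -118.9, "log_p_w": -9.7},
    "won two tree":  {"log_p_x_given_w": -121.6, "log_p_w": -8.1},
}

def score(h):
    # log p(X|W) + log p(W); p(X) is constant over W and can be dropped.
    return h["log_p_x_given_w"] + h["log_p_w"]

best = max(hypotheses, key=lambda w: score(hypotheses[w]))
print(best)
```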


MFCC Feature Extraction


Mel-filter bank
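The two slides above show the standard MFCC pipeline as figures. The following is a minimal numpy sketch of that pipeline (pre-emphasis, framing, windowing, power spectrum, mel filter bank, log, DCT); the frame size, hop, filter count, and keeping 13 coefficients are common defaults assumed here, not values read from the slides.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=200, hop=80, n_fft=256, n_filt=23, n_ceps=13):
    # Pre-emphasis to boost high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filter bank.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    fb_energy = np.log(power @ fbank.T + 1e-10)
    # The DCT decorrelates the log filter-bank energies; keep the first 13.
    return dct(fb_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Example: 13-dimensional static MFCCs for one second of (random) audio.
feats = mfcc(np.random.randn(8000))
print(feats.shape)  # (n_frames, 13)
```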


Problem Statement

• The MFCC cepstrum gives a fairly accurate static representation of speech. However, we are also interested in the time evolution of speech.

• To capture it, the static cepstrum vector is augmented with ∆- and ∆∆-cepstrum features (see the sketch after this list). But ...

• The components of the augmented feature vector are strongly correlated [5].

• The resulting features are not necessarily the most parsimonious representation for capturing the dynamics of speech.

• It is not even clear that the linear operator (discrete cosine transform) used in the final step of cepstrum coefficient extraction is an optimal choice.

• To address these issues, LDA-based methods for capturing the dynamics of speech have been suggested [1, 4].
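A minimal sketch of the ∆/∆∆ augmentation mentioned above; the regression window width of 2 frames is a typical choice and an assumption here, not a value taken from the slides.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression-based time derivatives over frames (rows)."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    num = sum(k * (padded[width + k:len(feats) + width + k] -
                   padded[width - k:len(feats) + width - k])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

static = np.random.randn(98, 13)          # stand-in for 13 static MFCCs per frame
d = deltas(static)                        # delta-cepstrum
dd = deltas(d)                            # acceleration (delta-delta) cepstrum
augmented = np.hstack([static, d, dd])    # 13 + 13 + 13 = 39 dimensions
print(augmented.shape)                    # (98, 39)
```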



ASR as a classification problem

• Divide the speech feature vectors into various classes, and apply class-based discrimination algorithms.

• The best definition of the classes is not clear.
• One can use words, phones, HMM states, or some other arbitrarily specified notion of class.

• Use Linear Discriminant Analysis (LDA) for a discriminant feature-space transformation on an augmented supervector (dim. 117 → 39); a minimal sketch follows this list.

• Advantages:
• a proper mathematical basis for extracting information about the time evolution of speech frames
• maximizes the separability of the different classes (however defined).
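A minimal sketch of this LDA transformation using scikit-learn on placeholder data; the random 117-dimensional supervectors (e.g., 9 spliced 13-dimensional static frames) and the state-level class labels are stand-ins, not the actual Aurora setup.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_frames, n_classes = 5000, 120
X = np.random.randn(n_frames, 117)             # stand-in augmented supervectors
y = np.random.randint(0, n_classes, n_frames)  # stand-in state-level class labels

lda = LinearDiscriminantAnalysis(n_components=39)
X_proj = lda.fit_transform(X, y)               # learn and apply the 117 -> 39 projection
print(X_proj.shape)                            # (5000, 39)
```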



LDA sounds good, but...

• ASR models are often trained with the assumption that the data has diagonal covariance.

• LDA produces a projected space whose dimensions might be highly correlated (full covariance matrix).

• To tackle that, we can use a separate full covariance matrix for each class, or

• a semi-tied covariance transform [2].



Semi-tied Covariance Transform

• Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
• a component-specific diagonal covariance element
• a semi-tied, class-dependent, non-diagonal matrix

• In effect, this turns LDA's full covariance structure into a semi-diagonal form, which results in increased ASR performance (a small sketch follows).
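A minimal sketch of the semi-tied covariance idea: one shared non-diagonal transform H is combined with component-specific diagonal covariances, so each component's covariance is modelled as H diag_m Hᵀ. The shared transform below is a random stand-in; in practice it is estimated by maximum likelihood (Gales' iterative row-wise update [2]), which is not shown.

```python
import numpy as np

dim, n_components = 39, 4
H = np.linalg.qr(np.random.randn(dim, dim))[0]      # stand-in shared (semi-tied) transform
diags = [np.abs(np.random.randn(dim)) + 0.1 for _ in range(n_components)]
means = [np.random.randn(dim) for _ in range(n_components)]

A = np.linalg.inv(H)                                # features are decorrelated by H^{-1}
log_det_A = np.linalg.slogdet(A)[1]

def log_gauss_stc(x, m):
    """Log N(x; mean_m, H diag_m H^T) evaluated via the shared transform."""
    z = A @ (x - means[m])                          # transform, then use diagonal covariance
    return (log_det_A
            - 0.5 * np.sum(np.log(2 * np.pi * diags[m]))
            - 0.5 * np.sum(z * z / diags[m]))

x = np.random.randn(dim)
print([round(log_gauss_stc(x, m), 2) for m in range(n_components)])
```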


LDA-based ASR – The Big Picture

[Block diagram: speech signal (Aurora TIDIGITS corpus, multi-noise mixed data; separate data for training and testing) → MFCC feature extraction (13 coefficients) → feature augmentation (117-dim supervector) → forced alignment to obtain state-aligned augmented features → LDA (117 → 39) → semi-tied covariance transform (39) → train HMM → run recognition → compare results against the baseline 39-dimensional delta- and acceleration-cepstrum features]
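A minimal sketch of the feature augmentation step in the diagram: each 13-dimensional static frame is spliced with its ±4 neighbouring frames to form a 117-dimensional supervector (9 × 13 = 117). The ±4 context width is an assumption consistent with the dimensions on the slide.

```python
import numpy as np

def splice(feats, context=4):
    # Pad at the edges so every frame gets a full context window.
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

mfccs = np.random.randn(98, 13)        # stand-in static MFCC frames
supervectors = splice(mfccs)           # input to forced alignment and LDA
print(supervectors.shape)              # (98, 117)
```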


Results – Test Case A

[Figure: % word accuracy vs. SNR (dB, 5–25) for Test Case A noise conditions; panels include Subway, Car, and Exhibition; curves compare Baseline, LDA, and STC.]


Results – Test Case B

[Figure: % word accuracy vs. SNR (dB, 5–25) for Test Case B noise conditions; panels labelled Restaurant, Street, Airport, and Airport; curves compare Baseline, LDA, and STC.]


Results – Test Case C

[Figure: % word accuracy vs. SNR (dB, 5–25) for Test Case C noise conditions; panels labelled Subway and Streets; curves compare Baseline, LDA, and STC.]


References

[1] P. F. Brown. The Acoustic-Modeling Problem in Automatic Speech Recognition. PhD thesis, Carnegie Mellon University, May 1987.

[2] M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 7(3):272–281, May 1999.

[3] X. Huang, A. Acero, and H.-W. Hon. Spoken Language Processing. Prentice Hall PTR, 2001.

[4] G. Saon, M. Padmanabhan, R. Gopinath, and S. Chen. Maximum likelihood discriminant feature spaces. Technical report, IBM T. J. Watson Research Center, 2000.

[5] Y. Tang and R. Rose. A study of using locality preserving projections for feature extraction in speech recognition. In ICASSP: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.