learning long-term temporal features

May 4, 2004 Speech Lunch Talk

Learning Long-Term Temporal Features

A Comparative Study

Barry Chen


Log-Critical Band Energies



Conventional Feature Extraction



TRAPS/HATS Feature Extraction


What is a TRAP? (Background Tangent)

• TRAPs were originally developed by our colleagues at OGI: Sharma, Jain (now at SRI), Hermansky and Sivadas (both now at IDIAP)

• Stands for TempRAl Pattern

• TRAP = a narrow frequency speech energy pattern over a period of time (usually 0.5 – 1 second long)
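As a rough illustration of the definition above, the following sketch cuts one temporal pattern out of a matrix of log critical band energies. The function name `extract_trap`, the 25-frame context (51 frames total, matching the "15 Bands x 51 Frames" system later in the talk), the assumed 10 ms frame step, and the edge handling are all assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

def extract_trap(log_energies, band, center, context=25):
    """Return the temporal pattern of one critical band around a center frame.

    log_energies: array of shape (num_frames, num_bands).
    context: frames on each side of the center; 25 gives 51 frames,
             roughly 0.5 s at an assumed 10 ms frame step.
    """
    num_frames = log_energies.shape[0]
    # Clamp indices so windows near the edges repeat the first/last frame
    # (an assumed padding choice).
    idx = np.clip(np.arange(center - context, center + context + 1),
                  0, num_frames - 1)
    return log_energies[idx, band]

# Toy input: 200 frames x 15 bands of synthetic values (not real speech).
rng = np.random.default_rng(0)
log_energies = rng.normal(size=(200, 15))
trap = extract_trap(log_energies, band=4, center=100)
```

Each call yields a single-band trajectory; stacking the trajectories of all bands for one center frame gives the input that the one-stage system would see at once.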


Example of TRAPS

Mean Temporal Patterns for 45 phonemes at 500 Hz


TRAPS Motivation

• Psychoacoustic studies suggest that the human peripheral auditory system integrates information over a longer time scale

• Information measurements (joint mutual information) show that information still exists more than 100 ms away within a single critical band

• Potential robustness to speech degradations


Let’s Explore

• TRAPS and HATS are examples of a specific two-stage approach to learning long-term temporal features

• Is this constrained two-stage approach better than an unconstrained one-stage approach?

• Are the non-linear transformations of critical band trajectories, provided in different ways by TRAPS and HATS, actually necessary?


Learn Everything in One Step


Learn in Individual Bands


One-Stage Approach


2-Stage Linear Approaches


PCA/LDA Comments

• PCA on log critical band energy trajectories scales and rotates dimensions in directions of highest variance

• LDA projects in directions that maximize class separability, measured as the ratio of between-class covariance to within-class covariance

• Keep top 40 dimensions for comparison with MLP-based approaches
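The two linear transforms can be sketched with plain NumPy as below. This is a minimal illustration on synthetic data, assuming 51-frame trajectories from a single band and 45 phone classes; the function names, the toy data, and the choice to apply the transforms per band are assumptions, and the PCA shown only rotates (it omits any variance scaling).

```python
import numpy as np

def pca_projection(X, k):
    """Top-k directions of highest variance (rows of X are trajectories)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # largest variance first
    return eigvecs[:, order]

def lda_projection(X, y, k):
    """Top-k directions maximizing between-class over within-class scatter."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))                       # within-class scatter
    Sb = np.zeros((d, d))                       # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Eigenvectors of Sw^{-1} Sb; at most (num_classes - 1) are informative.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:k]
    return eigvecs[:, order].real

# Synthetic stand-in for one band's trajectories and phone labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 51))
y = rng.integers(0, 45, size=2000)

W_pca = pca_projection(X, 40)
W_lda = lda_projection(X, y, 40)
X_pca = (X - X.mean(axis=0)) @ W_pca            # 40-dimensional features
X_lda = (X - X.mean(axis=0)) @ W_lda
```

Both projections reduce a 51-frame trajectory to 40 dimensions, which is what makes them comparable in size to the MLP-based transforms.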


2-Stage MLP-Based Approaches


MLP Comments

• As with the other 2-stage approaches, we first learn patterns independently in separate critical band trajectories, and then learn correlations among these discriminative trajectories

• Interpretation of various MLP layers:

1. Input to hidden weights – discriminant linear transformations

2. Hidden unit outputs – non-linear discriminant transforms

3. Before Softmax – transforms hidden activation space to unnormalized phone probability space

4. Output Activations – critical band phone probabilities
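The four layer interpretations correspond to four points at which a single-band MLP can be tapped. The sketch below runs a forward pass through a three-layer MLP with random weights and returns all four taps, labeled with the system names used later in the talk. The layer sizes (51 inputs, 40 hidden units, 45 outputs) and the weights are placeholders chosen for illustration, not the trained networks from the study.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def band_mlp_taps(x, W1, b1, W2, b2):
    """Forward pass of one critical-band MLP, exposing every tap point."""
    pre_sigmoid = x @ W1 + b1                   # 1. linear discriminant transform
    hidden = 1.0 / (1.0 + np.exp(-pre_sigmoid)) # 2. non-linear discriminant transform
    pre_softmax = hidden @ W2 + b2              # 3. unnormalized phone scores
    posteriors = softmax(pre_softmax)           # 4. critical band phone probabilities
    return {
        "HATS Before Sigmoid": pre_sigmoid,
        "HATS": hidden,
        "TRAPS Before Softmax": pre_softmax,
        "TRAPS": posteriors,
    }

# Placeholder sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
n_in, n_hidden, n_phones = 51, 40, 45
W1, b1 = rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_phones)), np.zeros(n_phones)

taps = band_mlp_taps(rng.normal(size=n_in), W1, b1, W2, b2)
```

In the two-stage setup, the chosen tap from each of the band MLPs would be concatenated and passed to a second, merging MLP.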


Experimental Setup

• Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular

– 1/10 used as the cross-validation set for MLPs

• Testing: 2001 Hub-5 Evaluation Set (Eval2001) – 2,255,609 frames and 62,890 words

• Back-end recognizer: SRI’s Decipher System. 1st pass decoding using a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke for all his help)


Frame Accuracy Performance

[Bar chart: frame accuracy (y-axis, 62.0% to 68.0%) for 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, and PLP 9 Frames]


Standalone Feature System

• Transform MLP outputs by:

1. log transform to make features more Gaussian

2. PCA for decorrelation

• Same as Tandem setup introduced by Hermansky, Ellis, and Sharma

• Use transformed MLP outputs as front-end features for the SRI recognizer
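The two-step transform can be sketched as follows. The function name `tandem_features`, the small `eps` floor inside the logarithm, and the synthetic posteriors are assumptions for illustration; in practice the PCA would be estimated on training data and then applied unchanged to test data, whereas this sketch estimates and applies it on the same array.

```python
import numpy as np

def tandem_features(posteriors, k=None, eps=1e-10):
    """Log-transform MLP posteriors, then decorrelate them with PCA.

    posteriors: array of shape (num_frames, num_classes), rows sum to 1.
    k: number of principal components to keep (None keeps all).
    """
    logp = np.log(posteriors + eps)             # step 1: make features more Gaussian
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # largest variance first
    if k is not None:
        order = order[:k]                       # optional truncation
    return centered @ eigvecs[:, order]         # step 2: decorrelated features

# Synthetic posteriors over 46 classes (hypothetical size, not from the talk).
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(46), size=500)
features = tandem_features(posteriors, k=25)
```

After the projection the feature dimensions are uncorrelated on the data used to fit the PCA, which suits recognizers that model features with diagonal-covariance Gaussians.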


Standalone Features

[Bar chart: word error rate (y-axis, 36.0% to 50.0%) of the standalone feature systems]


Combination W/State-of-the-Art Front-End Feature

• SRI’s 2003 PLP front-end feature is 12th-order PLP with three deltas. Heteroscedastic linear discriminant analysis (HLDA) then transforms this 52-dimensional feature vector into the 39-dimensional HLDA(PLP+3d)

• Concatenate PCA-truncated MLP features to HLDA(PLP+3d) and use as augmented front-end feature

– Similar to Qualcomm-ICSI-OGI features in AURORA


Combo W/PLP Baseline Features

[Bar chart: word error rate (y-axis, 32.0% to 38.0%) for HLDA(PLP+3d) alone and augmented with each feature system, including HATS + PLP 9 Frames]


Ranking Table

System                 Frame Acc.  Standalone  Combination
15 Bands x 51 Frames   6           6           6
PCA 40                 5           2           2
LDA 40                 4           3           2
HATS Before Sigmoid    3           4           2
HATS                   1           1           1
TRAPS Before Softmax   2           4           5
TRAPS                  7           7           7


Observations

• Across the three testing setups:

1. HATS is always #1

2. The one-stage 15 Bands x 51 Frames is always #6 (second to last)

3. TRAPS is always last

4. PCA, LDA, HATS before sigmoid, and TRAPS before softmax flip flop in performance


Interpretation

• The learning constraints introduced by the 2-stage approach are helpful if done right.

• The non-linear discriminant transform of HATS is better than the linear discriminant transforms from LDA and HATS before sigmoid

• The further mapping from hidden activations to critical-band phone posteriors is not helpful

– Perhaps mapping to critical-band phones is too difficult and inherently noisy

• Finally, like TRAPS, HATS is complementary to the more conventional features and combines synergistically with PLP 9 Frames.


Frame Accuracy Performance

System                 Frame Acc.  Rel. Improvement
15 Bands x 51 Frames   64.7%       -
PCA 40                 65.5%       1.2%
LDA 40                 65.5%       1.2%
HATS Before Sigmoid    65.8%       1.7%
HATS                   66.9%       3.4%
TRAPS Before Softmax   65.9%       1.7%
TRAPS                  64.0%       -1.2%
PLP 9 Frames           67.6%       N/A


Standalone Features WER

System                 WER     Rel. Improvement
15 Bands x 51 Frames   48.0%   -
PCA 40                 45.3%   5.6%
LDA 40                 46.5%   3.1%
HATS Before Sigmoid    45.9%   4.4%
HATS                   44.5%   7.3%
TRAPS Before Softmax   45.9%   4.4%
TRAPS                  48.2%   -0.4%
PLP 9 Frames           41.2%   N/A
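The relative improvements in this table appear to be computed against the one-stage 15 Bands x 51 Frames baseline as (baseline WER - system WER) / baseline WER. The short check below recomputes them from the WER values listed above; the function name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(baseline_wer, system_wer):
    """Relative WER reduction in percent; positive means fewer errors."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# WER values (in percent) copied from the standalone results above.
baseline = 48.0
standalone_wer = {
    "PCA 40": 45.3,
    "LDA 40": 46.5,
    "HATS Before Sigmoid": 45.9,
    "HATS": 44.5,
    "TRAPS Before Softmax": 45.9,
    "TRAPS": 48.2,
}

improvements = {
    name: round(relative_improvement(baseline, wer), 1)
    for name, wer in standalone_wer.items()
}
```

For frame accuracy the sign flips, since higher is better: the improvement is (system - baseline) / baseline.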


Combo W/PLP Baseline Features

System                                      WER     Rel. Improvement
HLDA(PLP+3d)                                37.2%   -
15 Bands x 51 Frames                        37.1%   0.3%
PCA 40                                      36.8%   1.1%
LDA 40                                      36.8%   1.1%
HATS Before Sigmoid                         36.8%   1.1%
HATS                                        36.0%   3.2%
TRAPS Before Softmax                        36.9%   0.8%
TRAPS                                       37.2%   0.0%
PLP 9 Frames                                36.1%   3.0%
Inverse Entropy Combo HATS + PLP 9 Frames   34.0%   8.6%
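The slides do not spell out how the inverse entropy combination of HATS and PLP 9 Frames is computed. One common formulation, sketched below as an assumption, weights each stream's posterior vector frame by frame in proportion to the inverse of its entropy, so the more confident stream contributes more. The function name and the toy posteriors are illustrative only.

```python
import numpy as np

def inverse_entropy_combine(p_a, p_b, eps=1e-10):
    """Combine two posterior streams with per-frame inverse-entropy weights.

    p_a, p_b: arrays of shape (num_frames, num_classes), rows sum to 1.
    """
    def entropy(p):
        return -(p * np.log(p + eps)).sum(axis=-1, keepdims=True)

    inv_a = 1.0 / (entropy(p_a) + eps)          # low entropy -> large weight
    inv_b = 1.0 / (entropy(p_b) + eps)
    w_a = inv_a / (inv_a + inv_b)               # weights sum to 1 per frame
    return w_a * p_a + (1.0 - w_a) * p_b

# Toy example: stream A is confident, stream B is uniform.
p_a = np.array([[0.98, 0.01, 0.01]])
p_b = np.array([[1 / 3, 1 / 3, 1 / 3]])
combined = inverse_entropy_combine(p_a, p_b)
```

Because the weights are non-negative and sum to one for every frame, the combined output is again a valid posterior distribution.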
