TRANSCRIPT
An Alternative Approach of Finding Competing Hypotheses
for Better Minimum Classification Error Training
Mr. Yik-Cheung Tam
Dr. Brian Mak
Motivation
Overview of MCE training
Problem using N-best hypotheses
Alternative: 1-nearest hypothesis
What?
Why?
How?
Evaluation
Conclusion
Outline
MCE Overview
The MCE loss function (standard form, reconstructed from the slides' parameters):
l(d(X)) = 1 / (1 + exp(-gamma * d(X) + theta)), where gamma controls the slope and theta is the offset
Distance measure:
d(X) = -g(X; correct) + G(X), where g(X; correct) is the score of the correct string and G(X) is the score of the competing hypotheses
G(X) may be computed using the N-best hypotheses.
l(.) = 0-1 soft error-counting function (Sigmoid)
Gradient descent method to obtain a better estimate.
When d(X) gets large enough, it falls out of the steep trainable region of the sigmoid.
[Figure: sigmoid loss with its steep trainable region marked]
Problem Using N-best Hypotheses
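The saturation problem above can be sketched numerically. A minimal illustration of the sigmoid soft error and its gradient (function names are my own; gamma and theta follow the slides' slope/offset parameters):

```python
import math

def soft_error(d, gamma=0.1, theta=0.0):
    """0-1 soft error-counting sigmoid: l(d) = 1 / (1 + exp(-gamma*d + theta))."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

def soft_error_grad(d, gamma=0.1, theta=0.0):
    """dl/dd = gamma * l * (1 - l); this shrinks towards 0 as |d| grows,
    so tokens with a large |d(X)| contribute almost nothing to training."""
    l = soft_error(d, gamma, theta)
    return gamma * l * (1.0 - l)
```

At d = 0 the gradient is gamma/4, but far from 0 it effectively vanishes, which is the saturation problem these slides describe.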
What is the 1-nearest Hypothesis?
d(1-nearest) <= d(1-best)
The idea can be generalized to N-nearest hypotheses.
Keep the training data inside the steep trainable region.
[Figure: sigmoid loss with its steep trainable region marked]
Using the 1-nearest Hypothesis
Method 1 (exact approach)
Stack-based N-best decoder
Drawback:
• N may be very large => memory problem
• Need to limit the size of N.
Method 2 (approximated approach)
Modify the Viterbi algorithm with a special
pruning scheme.
How to Find the 1-nearest Hypothesis?
Approximated 1-nearest Hypothesis
Notation:
V(t+1, j) : accumulated score at time t+1 and state j
a_ij : transition probability from state i to state j
b_j(x_{t+1}) : observation probability at time t+1 and state j
V_c(t+1) : accumulated score of the Viterbi path of the correct string at time t+1
Beam(t+1) : beam width applied at time t+1
There exists some “nearest” path in the search space (shaded area).
Approximated 1-nearest Hypothesis (cont.)
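The modified Viterbi pruning scheme can be sketched as follows. This is a minimal illustration under my own naming (not the authors' implementation): at each frame, partial paths whose accumulated score drifts more than a beam width away from the correct string's score are pruned, so only "nearby" paths survive.

```python
import math

def nearest_path_viterbi(log_a, log_b, v_c, beam):
    """Viterbi decoding with a special pruning scheme: at every frame t,
    any partial path whose accumulated score V(t, j) falls outside
    `beam` of the correct string's score v_c[t] is pruned, so the best
    surviving path approximates the 1-nearest hypothesis.

    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log observation probability at time t, state j
    v_c[t]      : accumulated Viterbi score of the correct string at t
    """
    NEG_INF = float("-inf")
    n, T = len(log_a), len(log_b)
    # initialize, pruning states already outside the beam
    V = [log_b[0][j] if abs(log_b[0][j] - v_c[0]) <= beam else NEG_INF
         for j in range(n)]
    back = []
    for t in range(1, T):
        prev, V, bp = V, [], []
        for j in range(n):
            i = max(range(n), key=lambda i: prev[i] + log_a[i][j])
            s = prev[i] + log_a[i][j] + log_b[t][j]
            if abs(s - v_c[t]) > beam:  # prune: too far from correct path
                s = NEG_INF
            V.append(s)
            bp.append(i)
        back.append(bp)
    # backtrace from the best surviving end state
    end = max(range(n), key=lambda j: V[j])
    score, path = V[end], [end]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    path.reverse()
    return path, score
```

In a full decoder the beam would be applied over word-level hypotheses; this sketch only shows the state-level pruning idea.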
System Evaluation
Corpus: Aurora
Noisy connected digits derived from TIDIGITS.
Multi-condition training: (train on noisy conditions)
{subway, babble, car, exhibition} x {clean, 20, 15, 10, 5 dB SNR} (5 noise levels)
8440 training utterances.
Testing: (test on matched noisy conditions)
Same as above, plus additional samples at 0 and -5 dB (7 noise levels)
28,028 testing utterances.
System Configuration
Standard 39-dimension MFCC (cepstra + delta + delta-delta)
11 whole-word digit HMMs (0-9, oh)
16 states, 3 Gaussians per state
3-state silence HMM, 6 Gaussians per state
1-state short pause HMM tied to the 2nd state of the
silence model.
Baum-Welch training to obtain the initial HMM.
Corrective MCE training on HMM parameters.
Compare 3 kinds of competing hypotheses:
1-best hypothesis
Exact 1-nearest hypothesis
Approx. 1-nearest hypothesis
Sigmoid parameters:
Various gamma (controls the slope of the sigmoid)
Offset theta = 0
System Configuration (cont.)
Learning rate = 0.05, with different gamma: 0.5 (steeper), 0.1 (best test performance), 0.02 and 0.004 (flatter)
Experiment I: Effect of Sigmoid Slope
Baseline: 12.71%
1-best: 11.01%
Approx. 1-nearest: 10.71%
Exact 1-nearest: 10.45%
A training token with soft error < 0.95 is defined to be "effective".
The 1-nearest approach has more effective training data when the sigmoid slope is relatively steep.
Effective Amount of Training Data
1-best (40%)
Approx. 1-nearest (51%)
Exact 1-nearest (67%)
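Using the slides' criterion (a token is "effective" when its soft error is below 0.95), the effective fraction could be computed as in this small sketch (function name and offset = 0 assumed, as in the slides' configuration):

```python
import math

def effective_fraction(distances, gamma, threshold=0.95):
    """Fraction of training tokens counted as "effective" under the
    slides' criterion: soft error l(d) = 1/(1 + exp(-gamma*d)) < threshold."""
    def l(d):
        return 1.0 / (1.0 + math.exp(-gamma * d))
    return sum(1 for d in distances if l(d) < threshold) / len(distances)
```

A steeper sigmoid (larger gamma) saturates more tokens, which is why the 1-nearest methods, by keeping d(X) small, retain a larger effective fraction.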
With 100% effective training data, apply more training iterations:
gamma = 0.004, learning rate = 0.05
Result: slow improvement compared to the best case.
Experiment II: Compensation With More Training Iterations
Exact 1-nearest with gamma = 0.1
Use a larger learning rate (0.05 -> 1.25)
Fix gamma = 0.004 (100% effective training data)
Result: the 1-nearest approach is better than the 1-best approach after compensation.
Experiment II: Compensation Using a Larger Learning Rate
System              Before compensation   After compensation
Baseline            12.71%                12.71%
1-best              12.07%                11.55%
Approx. 1-nearest   12.27%                10.70%
Exact 1-nearest     12.16%                10.79%
Using a Larger Learning Rate (cont.)
Training performance: MCE loss versus # of
training iterations.
1-best
Approx. 1-nearest
Exact 1-nearest
Using a Larger Learning Rate (cont. 2)
Test performance: WER versus # of training
iterations.
Approx. 1-nearest (10.70%)
Exact 1-nearest (10.79%)
1-best (11.55%)
Conclusion
1-best and 1-nearest methods were compared in
MCE training.
Effect of the sigmoid slope.
Compensation when using a flat sigmoid.
The 1-nearest method is better than the 1-best approach.
More trainable data are available in the 1-nearest
approach.
Approx. and exact 1-nearest methods yield
comparable performance.
Questions and Answers