TRANSCRIPT
An Alternative Approach of Finding Competing Hypotheses
for Better Minimum Classification Error Training
Mr. Yik-Cheung Tam
Dr. Brian Mak
Motivation
Overview of MCE training
Problem using N-best hypotheses
Alternative: 1-nearest hypothesis
What?
Why?
How?
Evaluation
Conclusion
Outline
MCE Overview
The MCE loss function (standard form, reconstructed from the slides' parameters):
l(d(X)) = 1 / (1 + exp(-gamma * d(X) + theta)), where gamma controls the slope and theta is the offset
Distance measure:
d(X) = -g(X; correct) + G(X), where g(X; correct) is the score of the correct string and G(X) is the score of the competing hypotheses
G(X) may be computed using the N-best hypotheses.
l(.) = 0-1 soft error-counting function (Sigmoid)
Gradient descent method to obtain a better estimate.
When d(X) gets large enough, it falls out of the steep trainable region of the sigmoid.
[Figure: sigmoid loss with its steep trainable region marked]
Problem Using N-best Hypotheses
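The saturation problem above can be sketched numerically. A minimal illustration of the sigmoid soft error and its gradient (function names are my own; gamma and theta follow the slides' slope/offset parameters):

```python
import math

def soft_error(d, gamma=0.1, theta=0.0):
    """0-1 soft error-counting sigmoid: l(d) = 1 / (1 + exp(-gamma*d + theta))."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

def soft_error_grad(d, gamma=0.1, theta=0.0):
    """dl/dd = gamma * l * (1 - l); this shrinks towards 0 as |d| grows,
    so tokens with a large |d(X)| contribute almost nothing to training."""
    l = soft_error(d, gamma, theta)
    return gamma * l * (1.0 - l)
```

At d = 0 the gradient is gamma/4, but far from 0 it effectively vanishes, which is the saturation problem these slides describe.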
What is the 1-nearest Hypothesis?
d(1-nearest) <= d(1-best)
The idea can be generalized to N-nearest hypotheses.
Keep the training data inside the steep trainable region.
[Figure: sigmoid loss with its steep trainable region marked]
Using the 1-nearest Hypothesis
Method 1 (exact approach)
Stack-based N-best decoder
Drawback:
• N may be very large => memory problem
• Need to limit the size of N.
Method 2 (approximated approach)
Modify the Viterbi algorithm with a special
pruning scheme.
How to Find the 1-nearest Hypothesis?
Approximated 1-nearest Hypothesis
Notation:
V(t+1, j) : accumulated score at time t+1 and state j
a_ij : transition probability from state i to state j
b_j(x_{t+1}) : observation probability at time t+1 and state j
V_c(t+1) : accumulated score of the Viterbi path of the correct string at time t+1
Beam(t+1) : beam width applied at time t+1
There exists some “nearest” path in the search space (shaded area).
Approximated 1-nearest Hypothesis (cont.)
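The modified Viterbi pruning scheme can be sketched as follows. This is a minimal illustration under my own naming (not the authors' implementation): at each frame, partial paths whose accumulated score drifts more than a beam width away from the correct string's score are pruned, so only "nearby" paths survive.

```python
import math

def nearest_path_viterbi(log_a, log_b, v_c, beam):
    """Viterbi decoding with a special pruning scheme: at every frame t,
    any partial path whose accumulated score V(t, j) falls outside
    `beam` of the correct string's score v_c[t] is pruned, so the best
    surviving path approximates the 1-nearest hypothesis.

    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log observation probability at time t, state j
    v_c[t]      : accumulated Viterbi score of the correct string at t
    """
    NEG_INF = float("-inf")
    n, T = len(log_a), len(log_b)
    # initialize, pruning states already outside the beam
    V = [log_b[0][j] if abs(log_b[0][j] - v_c[0]) <= beam else NEG_INF
         for j in range(n)]
    back = []
    for t in range(1, T):
        prev, V, bp = V, [], []
        for j in range(n):
            i = max(range(n), key=lambda i: prev[i] + log_a[i][j])
            s = prev[i] + log_a[i][j] + log_b[t][j]
            if abs(s - v_c[t]) > beam:  # prune: too far from correct path
                s = NEG_INF
            V.append(s)
            bp.append(i)
        back.append(bp)
    # backtrace from the best surviving end state
    end = max(range(n), key=lambda j: V[j])
    score, path = V[end], [end]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    path.reverse()
    return path, score
```

In a full decoder the beam would be applied over word-level hypotheses; this sketch only shows the state-level pruning idea.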
System Evaluation
Corpus: Aurora
Noisy connected digits derived from TIDIGITS.
Multi-condition training: (train on noisy conditions)
{subway, babble, car, exhibition} x {clean, 20, 15, 10, 5 dB SNR} (5 noise levels)
8440 training utterances.
Testing: (test on matched noisy conditions)
Same as above, plus additional samples at 0 and -5 dB (7 noise levels)
28,028 testing utterances.
System Configuration
Standard 39-dimension MFCC (cepstra + delta + delta-delta)
11 whole-word digit HMMs (0-9, oh)
16 states, 3 Gaussians per state
3-state silence HMM, 6 Gaussians per state
1-state short pause HMM tied to the 2nd state of the
silence model.
Baum-Welch training to obtain the initial HMM.
Corrective MCE training on HMM parameters.
Compare 3 kinds of competing hypotheses:
1-best hypothesis
Exact 1-nearest hypothesis
Approx. 1-nearest hypothesis
Sigmoid parameters:
Various gamma (controls the slope of the sigmoid)
Offset theta = 0
System Configuration (cont.)
Learning rate = 0.05, with different gamma: 0.5 (steeper), 0.1 (best test performance), 0.02 and 0.004 (flatter)
Experiment I: Effect of Sigmoid Slope
Baseline: 12.71%
1-best: 11.01%
Approx. 1-nearest: 10.71%
Exact 1-nearest: 10.45%
A training token with soft error < 0.95 is defined to be "effective".
The 1-nearest approach has more effective training data when the sigmoid slope is relatively steep.
Effective Amount of Training Data
1-best (40%)
Approx. 1-nearest (51%)
Exact 1-nearest (67%)
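Using the slides' criterion (a token is "effective" when its soft error is below 0.95), the effective fraction could be computed as in this small sketch (function name and offset = 0 assumed, as in the slides' configuration):

```python
import math

def effective_fraction(distances, gamma, threshold=0.95):
    """Fraction of training tokens counted as "effective" under the
    slides' criterion: soft error l(d) = 1/(1 + exp(-gamma*d)) < threshold."""
    def l(d):
        return 1.0 / (1.0 + math.exp(-gamma * d))
    return sum(1 for d in distances if l(d) < threshold) / len(distances)
```

A steeper sigmoid (larger gamma) saturates more tokens, which is why the 1-nearest methods, by keeping d(X) small, retain a larger effective fraction.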
With 100% effective training data, apply more training iterations:
gamma = 0.004, learning rate = 0.05
Result: slow improvement compared to the best case.
Experiment II: Compensation With More Training Iterations
Exact 1-nearest with gamma = 0.1
Use a larger learning rate (0.05 -> 1.25)
Fix gamma = 0.004 (100% effective training data)
Result: the 1-nearest approach is better than the 1-best approach after compensation.
Experiment II: Compensation Using a Larger Learning Rate
System              Before compensation   After compensation
Baseline            12.71%                12.71%
1-best              12.07%                11.55%
Approx. 1-nearest   12.27%                10.70%
Exact 1-nearest     12.16%                10.79%
Using a Larger Learning Rate (cont.)
Training performance: MCE loss versus # of
training iterations.
1-best
Approx. 1-nearest
Exact 1-nearest
Using a Larger Learning Rate (cont. 2)
Test performance: WER versus # of training
iterations.
Approx. 1-nearest (10.70%)
Exact 1-nearest (10.79%)
1-best (11.55%)
Conclusion
1-best and 1-nearest methods were compared in
MCE training.
Effect of the sigmoid slope.
Compensation when using a flat sigmoid.
The 1-nearest method is better than the 1-best approach.
More trainable data are available in the 1-nearest
approach.
Approx. and exact 1-nearest methods yield
comparable performance.
Questions and Answers