The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments
MITRE / MS State - ISIP
Burhan Necioglu
Bryan George
George Shuttic
The MITRE Corporation
Ramasubramanian Sundaram
Joe Picone
Mississippi State U.
Inst. for Signal & Information Processing
INTRODUCTION
Collaboration between The MITRE Corporation and the Mississippi State Institute for Signal and Information Processing (ISIP)
– Primary goal: Evaluate the impact of noise pre-processing developed for other DoD applications
MITRE:
– Focus on robust speech recognition using noise reduction techniques, including effects of tactical communications links
– Distributed information access systems for military applications (DARPA Communicator)
Mississippi State:
– Focus on stable, practical, advanced LVCSR technology
– Open source large vocabulary speech recognition tools
– Training, education and dissemination of information related to all aspects of speech research
The ISIP-STT system used a combination of technologies from both organizations
OVERVIEW OF THE SYSTEM
Standard MFCC front-end with side-based CMS
Acoustic modeling:
– Left-right model topology
– Skip states for special models like silence
– Continuous density mixture Gaussian HMMs
– Both Baum-Welch and Viterbi training supported
– Phonetic decision tree-based state-tying
Hierarchical search Viterbi decoder
STATE-TYING: MOTIVATION
Context-dependent models for better performance
Increased parameter count
Need to reduce computations without degrading performance
FEATURES AND PERFORMANCE
Batch processing
Real-time performance of the training process during various stages:
DECODER: OVERVIEW
Algorithmic features:
– Single-pass decoding
– Hierarchical Viterbi search
– Dynamic network expansion
Functional features:
– Cross-word context-dependent acoustic models
– Word graph rescoring, forced alignments, N-gram decoding
Structural features:
– Word graph compaction
– Multiple pronunciations
– Memory management
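To make the "Viterbi search" item concrete, here is a minimal sketch of a flat, log-domain Viterbi pass over a single small HMM. It is not the ISIP-STT decoder: the hierarchical search, dynamic network expansion, and word graph handling listed above are all omitted, and the function name and argument layout are assumptions chosen for illustration.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return (best_state_path, best_log_score) for one observation sequence.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []  # backptr[t - 1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best


# Toy two-state left-right model: state 1 cannot return to state 0.
NEG_INF = float("-inf")
toy_init = [0.0, NEG_INF]
toy_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
toy_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.2), math.log(0.8)],
            [math.log(0.1), math.log(0.9)]]
toy_path, toy_score = viterbi(toy_init, toy_trans, toy_emit)
```

In the toy model the forbidden backward transition is encoded as a log probability of negative infinity, which is one simple way to express a left-right topology.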
EVALUATION SYSTEM - NOISE PREPROCESSING
Using Harsh Environment Noise Pre-Processor (HENPP) front-end to remove noise from input speech
HENPP developed by AT&T to address background noise effects in DoD speech coding environments (see Accardi and Cox, Malah et al, ICASSP 1999)
Multiplicative spectral processing - minimal distortion, eliminates “doodley-doos” (aka “musical noise”)
“Minimum statistics” noise adaptation - handles quasi-stationary additive noise (random and stochastic) without assumptions
Limitations:
– Not designed to address transient noise
– Noise adaptation sensitive to “push-to-talk” effects
Integrated 2.4 kbps MELP/HENPP demonstrated successfully in low- to moderate-perplexity ASR:
[Figure: comparison of LPC-10, MELP, and MELP/HENPP]
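The two ideas named above, "minimum statistics" noise adaptation and multiplicative spectral processing, can be sketched generically as follows. This is not the HENPP algorithm: the smoothing constant, window length, gain rule, and gain floor are all assumed values, and the bias compensation used in full minimum-statistics estimators is left out.

```python
from collections import deque


def track_noise_floor(power_frames, window=50, alpha=0.9):
    """Estimate per-bin noise power as the minimum of a smoothed spectrum.

    power_frames is a list of frames, each a list of per-bin power values.
    Returns one noise estimate per frame, in the same layout.
    """
    smoothed = None
    history = deque(maxlen=window)  # the last `window` smoothed frames
    estimates = []
    for frame in power_frames:
        if smoothed is None:
            smoothed = list(frame)
        else:
            # First-order recursive smoothing of each frequency bin.
            smoothed = [alpha * s + (1.0 - alpha) * p
                        for s, p in zip(smoothed, frame)]
        history.append(smoothed)
        estimates.append([min(h[k] for h in history)
                          for k in range(len(frame))])
    return estimates


def spectral_gain(power, noise, floor=0.1):
    """Per-bin multiplicative gain, clamped to the range [floor, 1]."""
    gains = []
    for p, n in zip(power, noise):
        g = 1.0 - n / p if p > 0.0 else floor
        gains.append(min(1.0, max(floor, g)))
    return gains
```

Because the noise estimate is a minimum over a sliding window, it follows slowly varying noise without a speech/non-speech decision, but it cannot follow short transients, which is consistent with the first limitation listed above.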
EVALUATION SYSTEM - DATA AND TRAINING
10 hours of SPINE data used for training - no DRT words
100 frames per second, 25 ms Hamming window
12 base FFT-derived mel cepstra with side-based CMS and log-energy
Delta and acceleration coefficients
44 phone set to cover SPINE data
909 models, 2725 states
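As a rough illustration of "side-based CMS", the sketch below subtracts the per-dimension cepstral mean computed over all frames of one conversation side. The function names and the list-of-lists layout are assumptions for illustration, not the ISIP-STT front-end. The dimension count also assumes that delta and acceleration coefficients are appended to all 13 static features, which the slides do not state explicitly.

```python
def side_based_cms(frames):
    """Subtract the per-dimension mean computed over one conversation side.

    frames is a list of equal-length cepstral vectors (lists of floats)
    covering every frame of a single side.
    """
    if not frames:
        return []
    dim = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def feature_dimension(num_cepstra=12, use_energy=True):
    """Static features plus their delta and acceleration coefficients."""
    static = num_cepstra + (1 if use_energy else 0)
    return 3 * static


# 100 frames per second implies a 10 ms frame shift, so consecutive
# 25 ms analysis windows overlap by 15 ms.
FRAME_SHIFT_MS = 1000 / 100
WINDOW_MS = 25
OVERLAP_MS = WINDOW_MS - FRAME_SHIFT_MS
```

Under these assumptions each frame carries 39 coefficients (13 static, 13 delta, 13 acceleration).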
EVALUATION SYSTEM - LM and LEXICON
5226 words in the SPINE lexicon, provided by CMU
CMU language model
Bigrams obtained by throwing away the trigrams
LM size: 5226 unigrams, 12511 bigrams
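One way "throwing away the trigrams" could be done is sketched below, assuming the language model is stored in the text-based ARPA back-off format (the slides do not say which format was used). The function name is hypothetical. Bigram entries keep their trailing back-off weights, and no probabilities are re-estimated.

```python
def drop_trigrams(arpa_lines):
    """Return ARPA-format lines with the 3-gram header and section removed."""
    kept = []
    in_trigrams = False
    for line in arpa_lines:
        stripped = line.strip()
        if stripped.startswith("ngram 3="):
            continue  # drop the trigram count from the \data\ header
        if stripped == "\\3-grams:":
            in_trigrams = True
            continue
        if in_trigrams and stripped.startswith("\\"):
            in_trigrams = False  # reached \end\ or another section marker
        if not in_trigrams:
            kept.append(line)
    return kept


example_lm = [
    "\\data\\", "ngram 1=1", "ngram 2=1", "ngram 3=1", "",
    "\\1-grams:", "-1.0 a -0.5", "",
    "\\2-grams:", "-0.3 a b -0.2", "",
    "\\3-grams:", "-0.1 a b c", "",
    "\\end\\",
]
bigram_lm = drop_trigrams(example_lm)
```

The entries in `example_lm` are made-up values that only exercise the section markers; they are not taken from the CMU model.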
EVALUATION SYSTEM - DECODING
Single stage decoding using word-internal acoustic models and bigram LM
RESULTS AND ANALYSIS
Lattice generation/lattice rescoring will improve results. Informal analysis of evaluation data and results:
– Negative correlation between recognition performance and SNR
Experiment                                        WER (%)  Subs (%)  Dels (%)  Ins (%)
Baseline ISIP-STT                                 56.2     26.0      21.1      9.0
Noise pre-processed training & evaluation data    58.4     27.1      24.9      6.5
RESULTS AND ANALYSIS (cont.)
Clean speech: “B” side of spine_eval_033 (281 total words):

Experiment                                        Correct  Subs  Dels  Ins  Tot err
Baseline ISIP-STT                                 221      36    24    4    64
Noise pre-processed training & evaluation data    198      37    46    6    89

Low SNR example: “A” side of spine_eval_021 (115 total words):

Experiment                                        Correct  Subs  Dels  Ins  Tot err
Baseline ISIP-STT                                 72       25    18    4    47
Noise pre-processed training & evaluation data    80       18    17    3    38
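The per-conversation counts above can be turned into word error rates with the usual definition, WER = (substitutions + deletions + insertions) / reference words. The short sketch below does that arithmetic; the dictionary layout and names are only for illustration, and the counts are copied from the two tables.

```python
def word_error_rate(subs, dels, ins, total_words):
    """Word error rate in percent."""
    return 100.0 * (subs + dels + ins) / total_words


# (subs, dels, ins, total reference words), taken from the tables above.
counts = {
    ("clean", "baseline"): (36, 24, 4, 281),
    ("clean", "pre-processed"): (37, 46, 6, 281),
    ("low SNR", "baseline"): (25, 18, 4, 115),
    ("low SNR", "pre-processed"): (18, 17, 3, 115),
}

wer = {key: round(word_error_rate(*value), 1) for key, value in counts.items()}
# clean:   22.8 (baseline) vs 31.7 (pre-processed)
# low SNR: 40.9 (baseline) vs 33.0 (pre-processed)
```

On these two sides, pre-processing raises the error rate by about 9 points on the clean example and lowers it by about 8 points on the low SNR example.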
RESULTS AND ANALYSIS (cont.)
HENPP designed for human listening purposes
– Optimized to raise DRT scores in presence of noise and coding
– DRT scores, WER tend to be poorly correlated; minor perceptual distortions often have magnified adverse effect on speech recognizers
Need to retune the HENPP
– Algorithm is very effective for robust recognition of noisy speech at low SNRs
– Too aggressive when applied to clean speech - some information is lost
– Minor adjustments will preserve noisy speech performance and boost clean speech performance
ISSUES
Decoding slow on this task
– 100x real-time (on 600 MHz Pentium)
– Newer version of ISIP-STT decoder will be faster
– Had to use bigram LM in the allowed time frame
Large amount of eval data
– With slow decoding, seriously limited experiments
The devil is in the details:
– Certain training data problematic: “Noise field is <long silence> up”
– Automatic segmentation (having eval segmentations would help)
CONCLUSIONS
MITRE / MS State-ISIP system; standard recognition approach using advanced noise preprocessing front end
Time limitation: could only officially report on the baseline system
Performed initial experiment with noise pre-processing (AT&T HENPP)
– Overall word error rate did not improve
– Informal analysis suggests that for low SNR conversations, noise pre-processing does help
– Difficulty with high SNR conversations
There is potential for improvement with application-specific tuning of HENPP.
Approach is very promising for coded speech in commercial and military environments