Recent Work on Acoustic Modeling for CTS at ISL
Florian Metze, Hagen Soltau, Christian Fügen,
Hua Yu
Interactive Systems Laboratories
Universität Karlsruhe, Carnegie Mellon University
EARS Workshop, December 2003, St. Thomas
Overview
• ISL‘s RT-03 system revisited
– System combination of Tree-150 & Tree-6
• Richer Acoustic Modeling
– Across-phone Clustering
– Gaussian Transition Modeling
– Modalities
– Articulatory Features
Decoding Strategy
• System Combination
– Combine tree-150 and tree-6 systems; 8 ms and 10 ms outputs
– Confusion networks over multiple lattices, plus ROVER
– Confidences computed from the combined CNs
– Best single output (Tree-150): 25.4% WER
– CNC + ROVER: 24.9%
• Results on eval03
– Tree-150 single system: 24.2%
– CNC + ROVER: 23.4%
Vocabulary
• Vocabulary Size: 41k words selected from SWB, BN, and CNN
• Pronunciation Variants: 95k entries generated by a rule-based approach
• Pronunciation Probabilities: from frequencies (forced alignment of training data)
– Viterbi decoding: penalties (e.g. max = 1)
– Confusion networks: real probabilities (e.g. sum = 1)
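The two normalizations above can be sketched as follows; the variant counts are hypothetical and stand in for frequencies obtained from a forced alignment of the training data:

```python
# Hypothetical variant counts from a forced alignment (made up for illustration).
counts = {"B EH T AXR": 30, "B EH DX AXR": 70}

# Confusion networks: real probabilities over the variants (sum = 1).
total = sum(counts.values())
probs = {v: c / total for v, c in counts.items()}

# Viterbi decoding: penalties relative to the best variant (max = 1).
best = max(counts.values())
penalties = {v: c / best for v, c in counts.items()}
```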
Clustering
• Entropy-based Divisive Clustering
• Standard way:
– Grow a tree for each context-independent HMM state
– 50 phones, 3 states: 150 trees
• Alternative: clustering across phones
– Global tree: parameter sharing across phones
– Computationally expensive, so we cluster 6 trees (begin, middle, end states for vowels and consonants)
– Quint-phone context
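As a rough illustration of the entropy criterion behind divisive clustering (a simplified discrete version; real systems typically score splits by the likelihood gain of Gaussian models), the best question is the one that most reduces the weighted entropy of the split:

```python
import math

def entropy(counts):
    """Entropy (bits) of a discrete count distribution."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def split_gain(parent, yes, no):
    """Entropy reduction when a question splits `parent` into `yes`/`no`."""
    n = sum(parent)
    return entropy(parent) - sum(yes) / n * entropy(yes) - sum(no) / n * entropy(no)

# Hypothetical class counts at a node; a good question separates the classes,
# a poor one leaves both children with the parent's distribution.
gain_good = split_gain([50, 50], [45, 5], [5, 45])
gain_poor = split_gain([50, 50], [25, 25], [25, 25])
```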
Motivation for Alternative Clustering
• Pronunciation modeling is important for recognizing conversational speech
• Adding pronunciation variants often gives marginal improvements due to increased confusability
• Case study: Flapping of /T/
BETTER     B EH T AXR
BETTER(2)  B EH DX AXR

The dictionary contains only a single pronunciation, and the phonetic decision tree chooses whether or not to flap /T/.
Clustering Across Phones: Tree Construction
• How to grow a single tree?
We expand the question set to allow questions about the sub-state identity and the center phone identity. This is computationally expensive on 600k SWB quint-phones.
• Two dictionaries:
– a conventional dictionary with 2.2 variants per word
– an (almost) single-pronunciation dictionary with 1.1 variants per word

A simple procedure reduces the number of pronunciation variants: variants with a relative frequency below 20% are removed; for unobserved words, only the baseform is kept.
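That reduction procedure can be sketched as follows (the function and the example counts are ours, for illustration only):

```python
def prune_variants(variant_counts, baseforms, threshold=0.2):
    """Keep only variants whose relative frequency in the forced alignment
    is at least `threshold`; for unobserved words, keep only the baseform."""
    pruned = {}
    for word, baseform in baseforms.items():
        counts = variant_counts.get(word, {})
        total = sum(counts.values())
        if total == 0:
            pruned[word] = [baseform]          # unobserved word: baseform only
        else:
            pruned[word] = [v for v, c in counts.items() if c / total >= threshold]
    return pruned

# Hypothetical counts: the unflapped variant of BETTER is rare (10% < 20%),
# and ZYGOTE was never observed in the training data.
lexicon = prune_variants(
    {"BETTER": {"B EH T AXR": 10, "B EH DX AXR": 90}},
    {"BETTER": "B EH T AXR", "ZYGOTE": "Z AY G OW T"},
)
```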
• Allows better parameter tying (tying now possible across phones and sub-states)
• Alleviates lexical problems (over-specification and inconsistencies): no need for an optimal phone set; preferable for multilingual / non-native speech recognition
• Implicitly models subtle reduction in sloppy speech
[Decision-tree diagram ("Clustering Across Phones"): questions such as 0=vowel?, 0=obstruent?, 0=begin-state?, -1=syllabic?, 0=mid-state?, 0=end-state?, with leaves shared across phones, e.g. AX-b, IX-m, AX-m]
Clustering Across Phones: Experiments
• Cross-substate clustering doesn’t make any difference
• Cross-phone clustering with 6 trees: {vowel|consonant}-{b|m|e}
• Single pronunciation lexicon has 1.1 variants per word (instead of 2.2 variants per word)
| Dictionary           | Clustering  | WER, 66 hr training set | WER, 180 hr training set |
|----------------------|-------------|-------------------------|--------------------------|
| multi-pronunciation  | traditional | 34.4                    | 33.4                     |
| multi-pronunciation  | cross-phone | 33.9                    | -                        |
| single pronunciation | traditional | 34.1                    | -                        |
| single pronunciation | cross-phone | 33.1                    | 31.6                     |

Results are based on first-pass decoding on dev01
Analysis
• Flexible tying works better with single pronunciation lexicon: Higher consistency, data-driven approach
• Significant cross-phone sharing:~30% of the leaf nodes are shared by multiple phones
• Commonly tied vowels: AXR & ER, AE & EH, AH & AX
  Commonly tied consonants: DX & HH, L & W, N & NG
[Decision-tree diagram for Vowel-b: questions such as -1=voiced?, -1=consonant?, 0=high-vowel?, 1=front-vowel?, -1=obstruent?, 0=L | R | W?]
Gaussian Transition Modeling
• A linear sequence of GMMs may contain a mix of different model sequences.
• To further distinguish these paths, we can model transitions between Gaussians in adjacent states.
Frame-independence Assumption
• HMMs assume each speech frame to be conditionally independent given the hidden state sequence
[Diagram: HMM as a generative model — a sequence of models emitting a sequence of frames]
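The frame-independence assumption amounts to the standard HMM factorization of the observation likelihood: given the hidden state sequence $S = s_1, \dots, s_T$, the frames $x_1, \dots, x_T$ are emitted independently:

```latex
P(X \mid W) \;=\; \sum_{S} \prod_{t=1}^{T} p(x_t \mid s_t)\, P(s_t \mid s_{t-1})
```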
Gaussian Transition Modeling
GTM models transition probabilities between Gaussians
GTM for Modeling Sloppy Speech
• Partial reduction/realization may be better modeled at the sub-phoneme level
• GTM can be thought of as a pronunciation network at the Gaussian level
• GTM can handle a large number of trajectories
• Advantages over Parallel-Path HMMs / Segmental HMMs, where
– the number of paths is very limited
– it is hard to determine the right number of paths
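A toy numerical sketch of the idea (all sizes, means, and transition values are made up): instead of independent mixture weights in the next state, the weight of each Gaussian is conditioned on which Gaussian was used in the previous state.

```python
import numpy as np

def gauss(x, mean, var=1.0):
    """1-D Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Two adjacent states with 2 Gaussians each (hypothetical parameters).
means_a = np.array([0.0, 2.0])
means_b = np.array([1.0, 3.0])
weights_a = np.array([0.5, 0.5])       # priors for the first state
trans = np.array([[0.9, 0.1],          # P(Gaussian j in B | Gaussian i in A)
                  [0.2, 0.8]])

def gtm_likelihood(xa, xb):
    """Joint likelihood of two frames, summing over Gaussian pairs (i, j)."""
    pa = gauss(xa, means_a)            # per-Gaussian likelihoods, state A
    pb = gauss(xb, means_b)            # per-Gaussian likelihoods, state B
    return float(np.sum(weights_a * pa * (trans @ pb)))

# A frame pair following a high-probability Gaussian transition scores
# higher than one jumping between rarely linked Gaussians.
```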
Experiments
• GTM can be readily trained using the Baum-Welch algorithm
• Data sufficiency is an issue, since we are modeling a first-order variable
• Pruning transitions is important (backing off)
| Pruning Threshold | Avg. # transitions per Gaussian | WER (%) |
|-------------------|---------------------------------|---------|
| Baseline          | 14.4                            | 34.1    |
| 1e-5              | 9.7                             | 33.7    |
| 1e-3              | 6.6                             | 33.7    |
| 0.01              | 4.6                             | 33.6    |
| 0.05              | 2.7                             | 33.9    |

WERs on Switchboard (hub5e-01)
Experiments II
• GTM offers better discrimination between trajectories
• All trajectories are nonetheless still allowed
• Pruning away unlikely transitions leads to a more compact and prudent model
• However, we need to be careful not to prune away unseen trajectories due to a limited training set
• Using a first-order acoustic model in decoding requires maintaining the left history, which is expensive at word boundaries; the Viterbi approximation is used in the current implementation
• Log-likelihood improvement during Baum-Welch training: -50.67 to -49.18
Modalities
• Would like to include additional information in divisive clustering, e.g.:
– Gender
– Signal-to-noise ratio
– Speaking rate
– Speaking style (normal vs hyper-articulated)
– Dialect
– Show-type, Data-type (CNN, NBC, ...)
• Data-driven approach: sharing still possible
Modalities II
• Suitable for different corpora?
• Example:
– German dialects
– Male/Female

[Decision-tree diagram with modality questions such as -1=vowel?, -1=obstruent?, -1=syllabic?, 0=bavarian?, 0=suabian?, 0=female?]
Modalities III
• Tested on German Verbmobil data
• Not enough time to test on SWB / RT-03
• Proved beneficial in several applications
– Labeled data is needed
– Our tests were not done on highly optimized systems (VTLN)
– Hyper-articulation: -1.7% for hyper-articulated speech, +0.3% for normal speech
Articulatory Features
• Idea: combine very specific sub-phone models with generic models
• Articulatory Features are Linguistically Motivated:
  /F/ = UNVOICED, FRICATIVE, LAB-DNT, ...
• Introduce new Degrees of Freedom for
– Modeling
– Adaptation
• Integrate into existing architecture, use existing training techniques (GMMs) for feature detectors
• Articulatory (Voicing) Features in Front-end did not help
Articulatory Features
• Output from Feature Detectors:
p(FEAT)-p(NON_FEAT)+p0
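A minimal sketch of such a detector output (the two single-Gaussian "GMMs" and their means are made up; the real detectors are full GMMs trained on labeled frames):

```python
import math

def gauss(x, mean, var=1.0):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def detector_score(x, p0=0.0):
    """Stream score p(FEAT) - p(NON_FEAT) + p0, with toy models:
    the FEAT model sits at +1, the NON_FEAT model at -1."""
    return gauss(x, 1.0) - gauss(x, -1.0) + p0

# Frames near the FEAT model score positive, frames near NON_FEAT negative.
```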
Articulatory Features
Asymmetric Stream Setup: ~4k models
– ~4k GMMs in stream 0
– 2 GMMs in each of streams 1...N („Feature Streams“)
Articulatory Features Results I
• Test on Read Speech (BN-F0): 13.4% → 11.6% with Articulatory Features
• Test on Multilingual Data: 13.1% → 11.5% (English with ML detectors)
• Significant Improvements also seen on– Hyper-Articulated Speech
– Spontaneous, Clean Speech (ESST)
Articulatory Features Results II
• Test on Switchboard (RT-03 devset):

| System   | Corr | Sub  | Del  | Ins | WER  |      |
|----------|------|------|------|-----|------|------|
| Baseline | 72.5 | 20.0 | 7.5  | 4.4 | 31.9 | 67.2 |
| Features | 68.3 | 18.3 | 13.4 | 2.2 | 33.9 | 68.4 |

• Result: substitutions and insertions decrease, deletions increase
• No overall improvement yet; we will work on the setup
Related Work
• D. Jurafsky, et al.: What kind of pronunciation variation is hard for triphones to model? ICASSP’01
• T. Hain: Implicit pronunciation modeling in ASR. ISCA Pronunciation Modeling Workshop, 2002
• M. Saraclar, et al.: Pronunciation modeling by sharing Gaussian densities across phonetic models. Computer Speech and Language, Apr. 2000
Related Work
• R. Iyer, et al.: Hidden Markov models for trajectory modeling, ICSLP’98
• M. Ostendorf, et al.: From HMMs to segment models: A unified view of stochastic modeling for speech recognition. IEEE Trans. Speech and Audio Processing, 1996
Publications
• F. Metze and A. Waibel: A Flexible Stream Architecture for ASR using Articulatory Features; ICSLP 2002; Denver, CO
• C. Fügen and I. Rogina: Integrating Dynamic Speech Modalities into Context Decision Trees; ICASSP 2000; Istanbul, Turkey
• H. Yu and T. Schultz: Enhanced Tree Clustering with Single Pronunciation Dictionary for Conversational Speech Recognition; Eurospeech 2003; Geneva
• H. Soltau, H. Yu, F. Metze, C. Fügen, Q. Jin, and S. Jou: The ISL transcription system for conversational telephony speech; submitted to ICASSP 2004; Vancouver
• ISL web page:
http://isl.ira.uka.de