Large Models for Large Corpora: Preliminary Findings
Patrick Nguyen, Ambroise Mutel, Jean-Claude Junqua
Panasonic Speech Technology Laboratory (PSTL)

TRANSCRIPT

Page 1: Large Models for Large Corpora: preliminary findings

Large Models for Large Corpora: preliminary findings

Patrick Nguyen

Ambroise Mutel

Jean-Claude Junqua

Panasonic Speech Technology Laboratory (PSTL)

Page 2

… from RT-03S workshop

• Lots of data helps

• Standard training can be done in reasonable time with current resources

• 10kh: it’s coming soon

• Promises:
– Change the paradigm
– Use data more efficiently
– Keep it simple

Page 3

The Dawn of a New Era?

• Merely increasing the model size is insufficient

• HMMs will live on

• Layering above HMM classifiers will do

• Change the topology of training

• Large models with no impact on decoding

• Data-greedy algorithms

• Only meaningful with large amounts of data

Page 4

Two approaches: changing topology

• Greedy models
– Syllable units
– Same model size, but consume more info
– Increase the data / parameter ratio
– Add linguistic info

• Factorize training: increase model size
– Bubble splitting (generalized SAT)
– Almost no penalty in decoding
– Split according to acoustics

Page 5

Syllable units

• Supra-segmental info

• Pronunciation modeling (subword units)

• Literature blames lack of data

• TDT+ coverage is limited by construction (all words are in the decoding lexicon)

• Better alternative to n-phones

Page 6

What syllables?

• NIST / Bill Fisher tsyl software (ignores ambi-syllabic phenomena)

• “Planned speech” mode

• Always a schwa in the lexicon (e.g. “little”)

• Phone del/sub/ins + supra-segmental info is good

[Diagram: syllable structure: onset, rhyme, peak, coda (Body, Tail)]
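The onset / peak / coda decomposition above can be illustrated with a toy maximal-onset syllabifier. This is a sketch only, not the NIST tsyl tool the deck actually uses; the VOWELS and ONSETS inventories are invented for the example.

```python
# Toy maximal-onset syllabifier, illustrating the onset/peak/coda split on
# the slide. NOT the NIST tsyl tool; VOWELS and ONSETS are invented.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw", "er"}
ONSETS = {(), ("b",), ("d",), ("k",), ("n",), ("s",), ("sh",), ("t",),
          ("s", "t"), ("t", "r")}   # tiny subset of legal English onsets

def syllabify(phones):
    """Split a phone list into syllables: each inter-vowel consonant
    cluster is divided so the next syllable gets the longest legal onset,
    and the remainder becomes the previous syllable's coda."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [phones]                     # no vowel: give up, one chunk
    syls, start = [], 0
    for j in range(len(nuclei) - 1):
        lo, hi = nuclei[j] + 1, nuclei[j + 1]
        k = lo
        while k < hi and tuple(phones[k:hi]) not in ONSETS:
            k += 1                          # shrink until a legal onset
        syls.append(phones[start:k])
        start = k
    syls.append(phones[start:])
    return syls

# "abduction" -> the syllabification used on the back-off slide
print(["_".join(s) for s in syllabify("ae b d ah k sh ih n".split())])
# -> ['ae_b', 'd_ah_k', 'sh_ih_n']
```

On this toy inventory the word “abduction” comes out as ae_b d_ah_k sh_ih_n, matching the deck’s “true sequence” example.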

Page 7

Syllable units: facts

• Hybrid (syl + phones)
– [Ganapathiraju, Goel, Picone, Corrada, Doddington, Kirchhoff, Ordowski, Wheatley; 1997]

• Seeding with CD-phones works
– [Sethy, Narayanan; 2003]

• State tying works [PSTL]

• Position-dependent syllables work [PSTL]

• CD-syllables kind of work [PSTL]

• ROVER should work
– [Wu, Kingsbury, Morgan, Greenberg; 1998]

• Full embedded re-estimation does not work [PSTL]

Page 8

Coverage and large corpus

• Warning: biased by construction

• In the lexicon: 15k syllables

• About 15M syllable tokens total (10M words)

• Total: 1600h / 950h filtered

[Figure: syllable frequency distribution, with marked points at 127 examples, 14 examples, and 1 example]

Page 9

Coverage

Syllables   CI-Syl   Pos-CI Syl   Pos-CD Syl
10k         99.7%    99.5%        94.7%
6k          99.7%    98.5%        90.7%
3k          98.2%    95.0%        82.2%
2k          95.9%    91.1%        76.0%
1.5k        93.4%    87.6%        71.3%
1k          88.7%    81.6%        64.6%
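The numbers in this table are cumulative token coverage: the fraction of running syllable tokens accounted for by the N most frequent syllable types. A minimal sketch of that computation, on invented counts:

```python
from collections import Counter

def coverage(counts, top_n):
    """Fraction of running syllable tokens covered by the top_n most
    frequent syllable types (the quantity tabulated on this slide)."""
    freqs = sorted(counts.values(), reverse=True)
    return sum(freqs[:top_n]) / sum(freqs)

# Invented toy counts; the real figures come from ~15M TDT syllable tokens.
toy = Counter({"dh_iy": 500, "ax_n": 300, "t_uw": 120,
               "s_t_r_ih_k": 3, "g_z_ae_m": 1})
print(round(coverage(toy, 2), 3))   # top 2 types cover 800/924 tokens
```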

Page 10

Hybrid: backing off

• Cannot train all syllables => back off to a phone sequence

• “Context breaks”: the context-dependency chain will break

• Two kinds of back-off (example: “abduction”)
– True sequence: ae_b d_ah_k sh_ih_n
– [Sethy+]: ae_b d_ah_k sh+ih sh-ih+n ih-n
– [Doddington+]: ae_b d_ah_k k-sh+ih sh-ih+n ih-n

• Tricky

• We don’t care: with state-tying we almost never back off
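The hybrid back-off idea can be sketched as a simple lexicon rule: keep a whole-syllable unit only when it has enough training examples, otherwise expand it to its phone sequence. The threshold and toy counts below are assumptions, not the paper’s numbers.

```python
# Sketch of the hybrid back-off rule: syllable unit if trainable,
# phone sequence otherwise. MIN_EXAMPLES and the counts are invented.
MIN_EXAMPLES = 100

def units_for(word_syllables, syllable_counts):
    """Map a word's syllable sequence to training units, backing off
    rarely-seen syllables to their phone sequences."""
    units = []
    for syl in word_syllables:
        if syllable_counts.get(syl, 0) >= MIN_EXAMPLES:
            units.append(syl)                # whole-syllable unit
        else:
            units.extend(syl.split("_"))     # back off: phone sequence
    return units

counts = {"ae_b": 150, "d_ah_k": 200, "sh_ih_n": 12}
print(units_for(["ae_b", "d_ah_k", "sh_ih_n"], counts))
# -> ['ae_b', 'd_ah_k', 'sh', 'ih', 'n']
```

Note the context break this creates at the back-off point, which is exactly the problem the [Sethy+] and [Doddington+] variants above try to patch.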

Page 11

Seeding

• Copy CD models instead of flat-starting

• Problem at the syllable boundary (context break)

• CI < Seed syl < CD

• Imposes constraints on topology

[Diagram: context break at the syllable boundary: ih-n+?? ??-sh+ih sh-ih+n]

Page 12

Seeding (results)

• Mono-Gaussian models

• Trend continues even with iterative split

• CI: 69% WER

• CD: 26% WER

• Syl-flat: 41% WER

• Syl-seed: 31% WER (CD init)

Page 13

State-tying

• Data-driven approach

• Backing off is a problem => train all syllables
– too many states/distributions
– too little data (skewed distribution)

• Same strategy as for CD-phones: entropy-based merging

• Can add info (pos, CD) without worrying about an explosion in the number of states

Page 14

State-tying (2)

• Compression without performance loss

• Phone-internal, state-internal bottom-up merging to limit computation

• Count about 10 states per syllable (3.3 phones)

• Pos-dep CI syllables

• 6000-syllable model, 59k Gaussians: 38.7% WER

• Merged to 6k Gaussians: 38.6% WER

• Trend continues with iterative splitting
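The bottom-up entropy merge used for state-tying can be sketched on discrete emission counts. This is a stand-in: the deck’s states are Gaussian mixtures and the real criterion operates on those, but the greedy “merge the cheapest pair” loop is the same shape.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a discrete emission-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def merge_cost(a, b):
    """Increase in total count-weighted entropy from pooling two states."""
    m = [x + y for x, y in zip(a, b)]
    return sum(m) * entropy(m) - sum(a) * entropy(a) - sum(b) * entropy(b)

def greedy_merge(states, target):
    """Bottom-up merging: repeatedly pool the pair of states whose merge
    costs the least entropy, until only `target` states remain."""
    states = [list(s) for s in states]
    while len(states) > target:
        i, j = min(((i, j) for i in range(len(states))
                    for j in range(i + 1, len(states))),
                   key=lambda ij: merge_cost(states[ij[0]], states[ij[1]]))
        b = states.pop(j)
        a = states.pop(i)
        states.append([x + y for x, y in zip(a, b)])
    return states
```

Restricting candidate pairs phone-internally and state-internally, as the slide says, turns the O(n²) pair scan into many small ones, which is what keeps the computation tractable.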

Page 15

Position-dependent syllables

• Word-boundary info (cf. triphones)

• Example:
– Worthiness: _w_er dh_iy n_ih_s
– The: _dh_iy_

• Missing from [Sethy+] and [Dod.+]

• Results (3% absolute at every split):
– Pos-indep (6k syl): 39.2% WER (2dps)
– Pos-dep: 35.7% WER (2dps)
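The position marking above can be sketched as a tiny lexicon transform. The trailing marker on word-final syllables is our reading of the slide’s convention (its one-syllable example “the” appears as _dh_iy_, with markers on both sides).

```python
def mark_positions(syllables):
    """Attach word-boundary markers to a word's syllable sequence: a
    leading '_' on the word-initial syllable and a trailing '_' on the
    word-final one (assumed reading of the slide's convention)."""
    out = list(syllables)
    out[0] = "_" + out[0]
    out[-1] = out[-1] + "_"
    return out

print(mark_positions(["dh_iy"]))                    # -> ['_dh_iy_']
print(mark_positions(["w_er", "dh_iy", "n_ih_s"]))
```

Because the markers multiply the syllable inventory, this is exactly the kind of added information the state-tying slide says can be absorbed without an explosion in the number of states.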

Page 16

CD-Syllables

• Inter-syllable context breaks

• Context = next phone

• Next syl? Next vowel? (Nucleus/peak)

• CD(phone)-syl >= CD triphones

• Small gains!

• All GD results

• CI-syl (6k syl): 19.0% WER

• CD-syl (20k syl): 18.5% WER

• CD-phones: 18.9% WER

Page 17

Segmentation

• Word and subword units give poor segmentation

• Speaker-adapted overgrown CD-phones are always better

• A problem for MMI and adaptation

• Results (ML):
– Word-internal: 21.8% WER
– Syl-internal: 19.9% WER

Page 18

MMI/ADP didn’t work well

• MMI: time-constrained to +/- 3 ms within word boundaries

• Blame it on the segmentation (word-internal)

            ML      MMI             +adp
CD-phones   18.9%   17.5% (-1.4%)   15.5% (-2.0%)
CD-syl      18.5%   17.4% (-1.1%)   16.0% (-1.4%)

Page 19

ROVER

• Two different systems can be combined

• Two-pass “transatlantic” ROVER architecture

• CD-phones align, phonetic classes

• No gain (broken confidence), deletions
– MMI+adp: 15.5% (CDp) and 16.0% (SY)
– Best ROVER: 15.5% WER (4-pass, 2-way)

[Diagram: adapted CD-phones combined with syllable models]
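The ROVER combination step can be sketched as word-level voting over aligned hypotheses. This toy version assumes the alignment is already given and uses flat confidences; real ROVER also builds the alignment itself (dynamic programming over a word transition network), and the slide’s “broken confidence” remark is about the weighted variant.

```python
from collections import Counter

def rover_vote(aligned_hyps, confidences=None):
    """Toy ROVER: hypotheses are pre-aligned word sequences of equal
    length ('' marks a deletion); each slot is decided by (optionally
    confidence-weighted) voting across systems."""
    n = len(aligned_hyps)
    if confidences is None:
        confidences = [1.0] * n
    out = []
    for slot in zip(*aligned_hyps):
        votes = Counter()
        for word, w in zip(slot, confidences):
            votes[word] += w
        best, _ = votes.most_common(1)[0]
        if best:                      # drop slots where deletion wins
            out.append(best)
    return out

hyps = [["the", "cat", "sat"],
        ["the", "hat", "sat"],
        ["the", "cat", ""]]
print(rover_vote(hyps))   # -> ['the', 'cat', 'sat']
```

Slots where the empty string wins produce deletions, which is one way the deletion problem noted on this slide shows up.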

Page 20

Summary: architecture

[Diagram: POS-CI syllables are merged (6k), merged again (3k), then trained with GD / MMI to give the POS-CD syllable system; CD-phones and POS-CD syllables each run decode and adapt+decode passes and are combined with ROVER]

Page 21

Conclusion (Syllable)

• Observed effects similar to those in the literature

• Added some observations (state tying, CD, pos, ADP/MMI)

• Performance does not beat CD-phones yet
– CD-phones: 15.5% WER; syl: 16.0% WER

• Some assumptions might cancel the benefit of syllable modeling

Page 22

Open questions

• Is syllabification (grouping) better than random grouping?

• Planned vs. spontaneous speech?

• Did we oversimplify?

• Why do subword units resist auto-segmentation?

• Why didn’t CD-syl work better?

• Language-dependent effects

Page 23

Bubble Splitting

• Outgrowth of SAT

• Increase model size 15-fold without computational penalty in training/decoding

• Also covers a VTLN implementation

• Basic idea:
– Split training into locally homogeneous regions (bubbles), and then apply SAT

Page 24

SAT vs Bubble Splitting

• SAT relies on locally linearly compactable variabilities

• Each bubble has local variability

• Simple acoustic factorization

[Diagram: SAT vs bubble, adaptation (MLLR)]

Page 25

TDT and speaker labels

• TDT is not speaker-labeled

• Hub4 has 2400 nominative speakers

• Use decoding clustering (show-internal clusters)

• Males: 33k speakers

• Females: 18k speakers

• Shows: 2137 (TDT) + 288 (Hub4)

Page 26

Bubble-Splitting: Overview

[Diagram: TDT input speech is decoded (decoded words, maximum likelihood); male and female branches each pass through SPLIT, ADAPT and NORMALIZE steps and are multiplexed into Compact Bubble Models (CBM)]

Page 27

VTLN implementation

• VTLN is used for clustering

• VTLN is a linear feature transformation (almost)

• Finding the best warp

Page 28

VTLN: Linear equivalence

• According to [Pitz, ICSLP 2000], VTLN is equivalent to a linear transformation in the cepstral domain:

c_k = (1/π) ∫_0^π cos(kω) log|X(e^{jω})|² dω

c̃_n = (1/π) ∫_0^π cos(nω̃) log|X̃(e^{jω̃})|² dω̃,   with the warping ω = Φ(ω̃)

• The relationship between a cepstral coefficient c_k and a warped (stretched or compressed) one is as follows:

c̃_n = Σ_{k=0}^{K} A_nk c_k,   A_nk = ((2 − δ_{k0})/π) ∫_0^π cos(nω̃) cos(k Φ(ω̃)) dω̃

• Energy, filter banks, and cepstral liftering imply non-linear effects

• The authors didn’t take the Mel scale into account:

M(f) = 1127 log(1 + f/700)

With the Mel warping there is no closed-form solution for A_nk.
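The claimed linearity of VTLN in the cepstral domain can be checked numerically. The sketch below assumes a simple two-piece linear warp (an illustrative choice, not PSTL’s actual implementation) and deliberately excludes the Mel-scale effects that, as the slide notes, break the closed form.

```python
import numpy as np

def warp(omega, alpha, omega0=0.7 * np.pi):
    """Piecewise-linear warp on [0, pi]: slope alpha up to omega0, then a
    second linear piece chosen so that warp(pi) == pi."""
    hi = alpha * omega0 + (np.pi - alpha * omega0) * (omega - omega0) / (np.pi - omega0)
    return np.where(omega <= omega0, alpha * omega, hi)

def vtln_matrix(alpha, K=13, grid=4096):
    """A_nk = ((2 - delta_k0)/pi) * integral_0^pi cos(n w) cos(k warp(w)) dw,
    evaluated with the midpoint rule."""
    w = (np.arange(grid) + 0.5) * np.pi / grid
    phi = warp(w, alpha)
    A = np.empty((K, K))
    for n in range(K):
        for k in range(K):
            scale = (1.0 if k == 0 else 2.0) / np.pi
            A[n, k] = scale * np.cos(n * w).dot(np.cos(k * phi)) * np.pi / grid
    return A

# Sanity check: no warping (alpha = 1) must give the identity matrix.
assert np.allclose(vtln_matrix(1.0), np.eye(13), atol=1e-6)
```

Applying `vtln_matrix(alpha) @ c` to a cepstral vector then replaces recomputing features from a warped spectrum, which is what makes the warp cheap to search over.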

Page 29

VTLN is linear

Decoding algorithm: decode the input speech with each candidate warp transform A_i and model λ, and keep the warp that maximizes the likelihood:

Â = argmax_{A_i} L(O | λ, A_i)

INPUT SPEECH → decode with A_i and λ → DECODED WORDS

Per-warp auxiliary function over frames t and Gaussians m (o_t: observation; μ_m, R_m: mean and precision; γ_m(t): occupancy):

Q(A_i) = (1/2) Σ_{t,m} γ_m(t) [ log|A_i|² − (A_i o_t − μ_m)^T R_m (A_i o_t − μ_m) ]

Experimental results: [not recoverable from the transcript]

Page 30

Statistical multiplex

• GD-mode

• Faster than Brent search

• 3 times faster than exhaustive search

• Based on the prior distribution of alpha

• Test 0.98, 1, and 1.02

• If 0.98 wins, continue

[Garbled formula for the expected number of Q evaluations E(N); not recoverable]

[Diagram: if 0.98 (or 1.02) wins, 4 Q evaluations (N_0.98 = N_1.02 = 4); if 1.00 wins, 3 Q evaluations (N_1.00 = 3)]
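The search rule on this slide can be sketched as a greedy hill-climb over the warp grid. The step size, starting point, and evaluation cap are illustrative parameters; `score` stands for the per-warp Q evaluation.

```python
def pick_warp(score, step=0.02, start=1.0, max_evals=8):
    """Greedy warp search sketched on the slide: score 0.98, 1.00, 1.02
    first (3 Q evaluations); if an off-center warp wins, keep stepping in
    that direction while the score improves. `score` maps a warp factor
    to a likelihood; step/start/max_evals are illustrative assumptions."""
    cand = {a: score(a) for a in (start - step, start, start + step)}
    best = max(cand, key=cand.get)
    evals = 3
    if best == start:
        return best, evals            # the slide's N_1.00 = 3 case
    direction = step if best > start else -step
    while evals < max_evals:
        nxt = best + direction
        evals += 1                    # one more Q evaluation
        s = score(nxt)
        if s <= cand[best]:
            return best, evals        # the slide's N_0.98 = N_1.02 = 4 case
        cand[nxt] = s
        best = nxt
    return best, evals
```

A unimodal score peaked near 1.00 costs 3 evaluations, and one peaked at 0.98 costs 4, matching the counts in the slide’s diagram.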

Page 31

Bubble Splitting: Principle

1. Separate conditions
• VTLN
2. Train bubble models
3. Compact using SAT
• Feature-space SAT

SAT works on homogeneous conditions

[Diagram: each training speaker is assigned to a bubble B_i with a partial center: SAT model λ_i]
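Step 1 above (separate conditions via VTLN) can be sketched as grouping training speakers by quantized warp factor. The bin count, warp range, and toy speakers below are assumptions, not PSTL’s settings.

```python
from collections import defaultdict

def split_into_bubbles(speaker_warps, n_bins=5, lo=0.88, hi=1.12):
    """Quantize each training speaker's VTLN warp factor into one of
    n_bins equal-width bins; each bin is one 'bubble' of acoustically
    similar speakers (bin count and warp range are assumptions)."""
    width = (hi - lo) / n_bins
    bubbles = defaultdict(list)
    for spk, warp in speaker_warps.items():
        b = min(max(int((warp - lo) / width), 0), n_bins - 1)
        bubbles[b].append(spk)
    return dict(bubbles)

spks = {"m01": 0.96, "m02": 0.98, "f01": 1.06, "f02": 1.08, "f03": 1.10}
print(split_into_bubbles(spks))
```

Each bubble would then get its own model (step 2) before SAT compacts them around a shared center (step 3).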

Page 32

Results

              1st-pass   Adapted
Baseline GD   19.0%      16.6%
SAT           18.7%      16.5%
Bubbles       18.5%      16.0%
192k => 384k  18.6%      16.3%

• About 0.5% WER reduction

• Doubling the model size => only 0.3% WER

Page 33

Conclusion (Bubble)

• Gain: 0.5% WER

• Extension of SAT model compaction

• More efficient VTLN implementation

Page 34

Open questions

• Baseline SAT does not work?

• Speaker definition?

• Best splitting strategy? (One per warp)

• Best decoding strategy? (Closest warp)

• Best bubble training? (MAP/MLLR)

• MMIE

Page 35

Conclusion

• What do we do with all of this data?

• Syllable + bubble splitting

• Two narrowly explored paths among many

• Promising results but nothing breathtaking

• Not ambitious enough?

Page 36

System setup

• RT03 eval

• 6x RT

• Same parameters as the RT-03S eval system
– WI triphones, gender-dependent, MMI
– 2-pass
– Global MLLU + 7-class MLLR
– 39 MFCC + non-causal CMS (2 s)
– 192k Gaussians, 3400 mixtures
– 128 Gaussians / mixture => merged