Large Models for Large Corpora: Preliminary Findings
Patrick Nguyen, Ambroise Mutel, Jean-Claude Junqua
Panasonic Speech Technology Laboratory (PSTL)

TRANSCRIPT

Page 1: Large Models for Large Corpora: preliminary findings

Large Models for Large Corpora: preliminary findings

Patrick Nguyen

Ambroise Mutel

Jean-Claude Junqua

Panasonic Speech Technology Laboratory (PSTL)

Page 2

… from RT-03S workshop

• Lots of data helps

• Standard training can be done in reasonable time with current resources

• 10kh: it’s coming soon

• Promises:
– Change the paradigm
– Use data more efficiently
– Keep it simple

Page 3

The Dawn of a New Era?

• Merely increasing the model size is insufficient

• HMMs will live on

• Layering above HMM classifiers will do

• Change the topology of training

• Large models with no impact on decoding

• Data-greedy algorithms

• Only meaningful with large amounts of data

Page 4

Two approaches: changing topology

• Greedy models
– Syllable units
– Same model size, but consume more info
– Increase the data / parameter ratio
– Add linguistic info

• Factorize training: increase model size
– Bubble splitting (generalized SAT)
– Almost no penalty in decoding
– Split according to acoustics

Page 5

Syllable units

• Supra-segmental info

• Pronunciation modeling (subword units)

• Literature blames lack of data

• TDT+ coverage is limited by construction (all words are in the decoding lexicon)

• Better alternative to n-phones

Page 6

What syllables?

• NIST / Bill Fisher tsyl software (ignores ambi-syllabic phenomena)

• “Planned speech” mode

• Always a schwa in the lexicon (e.g. “little”)

• Phone del/sub/ins + supra-segmental info is good

[Diagram: syllable structure: onset, rhyme, peak, coda (Body, Tail)]
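The onset / peak / coda decomposition above can be illustrated with a toy maximal-onset syllabifier. This is a sketch only, not the NIST tsyl tool the deck actually uses; the VOWELS and ONSETS inventories are invented for the example.

```python
# Toy maximal-onset syllabifier, illustrating the onset/peak/coda split on
# the slide. NOT the NIST tsyl tool; VOWELS and ONSETS are invented.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw", "er"}
ONSETS = {(), ("b",), ("d",), ("k",), ("n",), ("s",), ("sh",), ("t",),
          ("s", "t"), ("t", "r")}   # tiny subset of legal English onsets

def syllabify(phones):
    """Split a phone list into syllables: each inter-vowel consonant
    cluster is divided so the next syllable gets the longest legal onset,
    and the remainder becomes the previous syllable's coda."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [phones]                     # no vowel: give up, one chunk
    syls, start = [], 0
    for j in range(len(nuclei) - 1):
        lo, hi = nuclei[j] + 1, nuclei[j + 1]
        k = lo
        while k < hi and tuple(phones[k:hi]) not in ONSETS:
            k += 1                          # shrink until a legal onset
        syls.append(phones[start:k])
        start = k
    syls.append(phones[start:])
    return syls

# "abduction" -> the syllabification used on the back-off slide
print(["_".join(s) for s in syllabify("ae b d ah k sh ih n".split())])
# -> ['ae_b', 'd_ah_k', 'sh_ih_n']
```

On this toy inventory the word “abduction” comes out as ae_b d_ah_k sh_ih_n, matching the deck’s “true sequence” example.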

Page 7

Syllable units: facts

• Hybrid (syl + phones)
– [Ganapathiraju, Goel, Picone, Corrada, Doddington, Kirchhoff, Ordowski, Wheatley; 1997]

• Seeding with CD-phones works
– [Sethy, Narayanan; 2003]

• State tying works [PSTL]

• Position-dependent syllables work [PSTL]

• CD-syllables kind of work [PSTL]

• ROVER should work
– [Wu, Kingsbury, Morgan, Greenberg; 1998]

• Full embedded re-estimation does not work [PSTL]

Page 8

Coverage and large corpus

• Warning: biased by construction

• In the lexicon: 15k syllables

• About 15M syllable tokens total (10M words)

• Total: 1600h / 950h filtered

[Figure: syllable frequency distribution, with marked points at 127 examples, 14 examples, and 1 example]

Page 9

Coverage

Syllables   CI-Syl   Pos-CI Syl   Pos-CD Syl
10k         99.7%    99.5%        94.7%
6k          99.7%    98.5%        90.7%
3k          98.2%    95.0%        82.2%
2k          95.9%    91.1%        76.0%
1.5k        93.4%    87.6%        71.3%
1k          88.7%    81.6%        64.6%
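The numbers in this table are cumulative token coverage: the fraction of running syllable tokens accounted for by the N most frequent syllable types. A minimal sketch of that computation, on invented counts:

```python
from collections import Counter

def coverage(counts, top_n):
    """Fraction of running syllable tokens covered by the top_n most
    frequent syllable types (the quantity tabulated on this slide)."""
    freqs = sorted(counts.values(), reverse=True)
    return sum(freqs[:top_n]) / sum(freqs)

# Invented toy counts; the real figures come from ~15M TDT syllable tokens.
toy = Counter({"dh_iy": 500, "ax_n": 300, "t_uw": 120,
               "s_t_r_ih_k": 3, "g_z_ae_m": 1})
print(round(coverage(toy, 2), 3))   # top 2 types cover 800/924 tokens
```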

Page 10

Hybrid: backing off

• Cannot train all syllables => back off to a phone sequence

• “Context breaks”: the context-dependency chain will break

• Two kinds of back-off (example: “abduction”)
– True sequence: ae_b d_ah_k sh_ih_n
– [Sethy+]: ae_b d_ah_k sh+ih sh-ih+n ih-n
– [Doddington+]: ae_b d_ah_k k-sh+ih sh-ih+n ih-n

• Tricky

• We don’t care: with state-tying we almost never back off
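The hybrid back-off idea can be sketched as a simple lexicon rule: keep a whole-syllable unit only when it has enough training examples, otherwise expand it to its phone sequence. The threshold and toy counts below are assumptions, not the paper’s numbers.

```python
# Sketch of the hybrid back-off rule: syllable unit if trainable,
# phone sequence otherwise. MIN_EXAMPLES and the counts are invented.
MIN_EXAMPLES = 100

def units_for(word_syllables, syllable_counts):
    """Map a word's syllable sequence to training units, backing off
    rarely-seen syllables to their phone sequences."""
    units = []
    for syl in word_syllables:
        if syllable_counts.get(syl, 0) >= MIN_EXAMPLES:
            units.append(syl)                # whole-syllable unit
        else:
            units.extend(syl.split("_"))     # back off: phone sequence
    return units

counts = {"ae_b": 150, "d_ah_k": 200, "sh_ih_n": 12}
print(units_for(["ae_b", "d_ah_k", "sh_ih_n"], counts))
# -> ['ae_b', 'd_ah_k', 'sh', 'ih', 'n']
```

Note the context break this creates at the back-off point, which is exactly the problem the [Sethy+] and [Doddington+] variants above try to patch.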

Page 11

Seeding

• Copy CD models instead of flat-starting

• Problem at the syllable boundary (context break)

• CI < Seed syl < CD

• Imposes constraints on topology

[Diagram: context break at the syllable boundary: ih-n+?? ??-sh+ih sh-ih+n]

Page 12

Seeding (results)

• Mono-Gaussian models

• Trend continues even with iterative split

• CI: 69% WER

• CD: 26% WER

• Syl-flat: 41% WER

• Syl-seed: 31% WER (CD init)

Page 13

State-tying

• Data-driven approach

• Backing off is a problem => train all syllables
– too many states/distributions
– too little data (skewed distribution)

• Same strategy as for CD-phones: entropy-based merging

• Can add info (pos, CD) without worrying about an explosion in the number of states

Page 14

State-tying (2)

• Compression without performance loss

• Phone-internal, state-internal bottom-up merging to limit computation

• Count about 10 states per syllable (3.3 phones)

• Pos-dep CI syllables

• 6000-syllable model, 59k Gaussians: 38.7% WER

• Merged to 6k Gaussians: 38.6% WER

• Trend continues with iterative splitting
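The bottom-up entropy merge used for state-tying can be sketched on discrete emission counts. This is a stand-in: the deck’s states are Gaussian mixtures and the real criterion operates on those, but the greedy “merge the cheapest pair” loop is the same shape.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a discrete emission-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def merge_cost(a, b):
    """Increase in total count-weighted entropy from pooling two states."""
    m = [x + y for x, y in zip(a, b)]
    return sum(m) * entropy(m) - sum(a) * entropy(a) - sum(b) * entropy(b)

def greedy_merge(states, target):
    """Bottom-up merging: repeatedly pool the pair of states whose merge
    costs the least entropy, until only `target` states remain."""
    states = [list(s) for s in states]
    while len(states) > target:
        i, j = min(((i, j) for i in range(len(states))
                    for j in range(i + 1, len(states))),
                   key=lambda ij: merge_cost(states[ij[0]], states[ij[1]]))
        b = states.pop(j)
        a = states.pop(i)
        states.append([x + y for x, y in zip(a, b)])
    return states
```

Restricting candidate pairs phone-internally and state-internally, as the slide says, turns the O(n²) pair scan into many small ones, which is what keeps the computation tractable.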

Page 15

Position-dependent syllables

• Word-boundary info (cf. triphones)

• Example:
– Worthiness: _w_er dh_iy n_ih_s
– The: _dh_iy_

• Missing from [Sethy+] and [Dod.+]

• Results (3% absolute at every split):
– Pos-indep (6k syl): 39.2% WER (2dps)
– Pos-dep: 35.7% WER (2dps)
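The position marking above can be sketched as a tiny lexicon transform. The trailing marker on word-final syllables is our reading of the slide’s convention (its one-syllable example “the” appears as _dh_iy_, with markers on both sides).

```python
def mark_positions(syllables):
    """Attach word-boundary markers to a word's syllable sequence: a
    leading '_' on the word-initial syllable and a trailing '_' on the
    word-final one (assumed reading of the slide's convention)."""
    out = list(syllables)
    out[0] = "_" + out[0]
    out[-1] = out[-1] + "_"
    return out

print(mark_positions(["dh_iy"]))                    # -> ['_dh_iy_']
print(mark_positions(["w_er", "dh_iy", "n_ih_s"]))
```

Because the markers multiply the syllable inventory, this is exactly the kind of added information the state-tying slide says can be absorbed without an explosion in the number of states.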

Page 16

CD-Syllables

• Inter-syllable context breaks

• Context = next phone

• Next syl? Next vowel? (Nucleus/peak)

• CD(phone)-syl >= CD triphones

• Small gains!

• All GD results

• CI-syl (6k syl): 19.0% WER

• CD-syl (20k syl): 18.5% WER

• CD-phones: 18.9% WER

Page 17

Segmentation

• Word and subword units give poor segmentation

• Speaker-adapted overgrown CD-phones are always better

• A problem for MMI and adaptation

• Results (ML):
– Word-internal: 21.8% WER
– Syl-internal: 19.9% WER

Page 18

MMI/ADP didn’t work well

• MMI: time-constrained to +/- 3 ms within word boundaries

• Blame it on the segmentation (word-internal)

            ML      MMI             +adp
CD-phones   18.9%   17.5% (-1.4%)   15.5% (-2.0%)
CD-syl      18.5%   17.4% (-1.1%)   16.0% (-1.4%)

Page 19

ROVER

• Two different systems can be combined

• Two-pass “transatlantic” ROVER architecture

• CD-phones align, phonetic classes

• No gain (broken confidence), deletions
– MMI+adp: 15.5% (CDp) and 16.0% (SY)
– Best ROVER: 15.5% WER (4-pass, 2-way)

[Diagram: adapted CD-phones combined with syllable models]
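The ROVER combination step can be sketched as word-level voting over aligned hypotheses. This toy version assumes the alignment is already given and uses flat confidences; real ROVER also builds the alignment itself (dynamic programming over a word transition network), and the slide’s “broken confidence” remark is about the weighted variant.

```python
from collections import Counter

def rover_vote(aligned_hyps, confidences=None):
    """Toy ROVER: hypotheses are pre-aligned word sequences of equal
    length ('' marks a deletion); each slot is decided by (optionally
    confidence-weighted) voting across systems."""
    n = len(aligned_hyps)
    if confidences is None:
        confidences = [1.0] * n
    out = []
    for slot in zip(*aligned_hyps):
        votes = Counter()
        for word, w in zip(slot, confidences):
            votes[word] += w
        best, _ = votes.most_common(1)[0]
        if best:                      # drop slots where deletion wins
            out.append(best)
    return out

hyps = [["the", "cat", "sat"],
        ["the", "hat", "sat"],
        ["the", "cat", ""]]
print(rover_vote(hyps))   # -> ['the', 'cat', 'sat']
```

Slots where the empty string wins produce deletions, which is one way the deletion problem noted on this slide shows up.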

Page 20

Summary: architecture

[Diagram: POS-CI syllables are merged (6k), merged again (3k), then trained with GD / MMI to give the POS-CD syllable system; CD-phones and POS-CD syllables each run decode and adapt+decode passes and are combined with ROVER]

Page 21

Conclusion (Syllable)

• Observed effects similar to those in the literature

• Added some observations (state tying, CD, pos, ADP/MMI)

• Performance does not beat CD-phones yet
– CD-phones: 15.5% WER; syl: 16.0% WER

• Some assumptions might cancel the benefit of syllable modeling

Page 22

Open questions

• Is syllabification (grouping) better than random grouping?

• Planned vs. spontaneous speech?

• Did we oversimplify?

• Why do subword units resist auto-segmentation?

• Why didn’t CD-syl work better?

• Language-dependent effects

Page 23

Bubble Splitting

• Outgrowth of SAT

• Increase model size 15-fold without computational penalty in training/decoding

• Also covers a VTLN implementation

• Basic idea:
– Split training into locally homogeneous regions (bubbles), and then apply SAT

Page 24

SAT vs Bubble Splitting

• SAT relies on locally linearly compactable variabilities

• Each bubble has local variability

• Simple acoustic factorization

[Diagram: SAT vs bubble, adaptation (MLLR)]

Page 25

TDT and speaker labels

• TDT is not speaker-labeled

• Hub4 has 2400 nominative speakers

• Use decoding clustering (show-internal clusters)

• Males: 33k speakers

• Females: 18k speakers

• Shows: 2137 (TDT) + 288 (Hub4)

Page 26

Bubble-Splitting: Overview

[Diagram: TDT input speech is decoded (decoded words, maximum likelihood); male and female branches each pass through SPLIT, ADAPT and NORMALIZE steps and are multiplexed into Compact Bubble Models (CBM)]

Page 27

VTLN implementation

• VTLN is used for clustering

• VTLN is a linear feature transformation (almost)

• Finding the best warp

Page 28

VTLN: Linear equivalence

• According to [Pitz, ICSLP 2000], VTLN is equivalent to a linear transformation in the cepstral domain:

c_k = (1/π) ∫_0^π cos(kω) log|X(e^{jω})|² dω

c̃_n = (1/π) ∫_0^π cos(nω̃) log|X̃(e^{jω̃})|² dω̃,   with the warping ω = Φ(ω̃)

• The relationship between a cepstral coefficient c_k and a warped (stretched or compressed) one is as follows:

c̃_n = Σ_{k=0}^{K} A_nk c_k,   A_nk = ((2 − δ_{k0})/π) ∫_0^π cos(nω̃) cos(k Φ(ω̃)) dω̃

• Energy, filter banks, and cepstral liftering imply non-linear effects

• The authors didn’t take the Mel scale into account:

M(f) = 1127 log(1 + f/700)

With the Mel warping there is no closed-form solution for A_nk.
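The claimed linearity of VTLN in the cepstral domain can be checked numerically. The sketch below assumes a simple two-piece linear warp (an illustrative choice, not PSTL’s actual implementation) and deliberately excludes the Mel-scale effects that, as the slide notes, break the closed form.

```python
import numpy as np

def warp(omega, alpha, omega0=0.7 * np.pi):
    """Piecewise-linear warp on [0, pi]: slope alpha up to omega0, then a
    second linear piece chosen so that warp(pi) == pi."""
    hi = alpha * omega0 + (np.pi - alpha * omega0) * (omega - omega0) / (np.pi - omega0)
    return np.where(omega <= omega0, alpha * omega, hi)

def vtln_matrix(alpha, K=13, grid=4096):
    """A_nk = ((2 - delta_k0)/pi) * integral_0^pi cos(n w) cos(k warp(w)) dw,
    evaluated with the midpoint rule."""
    w = (np.arange(grid) + 0.5) * np.pi / grid
    phi = warp(w, alpha)
    A = np.empty((K, K))
    for n in range(K):
        for k in range(K):
            scale = (1.0 if k == 0 else 2.0) / np.pi
            A[n, k] = scale * np.cos(n * w).dot(np.cos(k * phi)) * np.pi / grid
    return A

# Sanity check: no warping (alpha = 1) must give the identity matrix.
assert np.allclose(vtln_matrix(1.0), np.eye(13), atol=1e-6)
```

Applying `vtln_matrix(alpha) @ c` to a cepstral vector then replaces recomputing features from a warped spectrum, which is what makes the warp cheap to search over.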

Page 29

VTLN is linear

Decoding algorithm: decode the input speech with each candidate warp transform A_i and model λ, and keep the warp that maximizes the likelihood:

Â = argmax_{A_i} L(O | λ, A_i)

INPUT SPEECH → decode with A_i and λ → DECODED WORDS

Per-warp auxiliary function over frames t and Gaussians m (o_t: observation; μ_m, R_m: mean and precision; γ_m(t): occupancy):

Q(A_i) = (1/2) Σ_{t,m} γ_m(t) [ log|A_i|² − (A_i o_t − μ_m)^T R_m (A_i o_t − μ_m) ]

Experimental results: [not recoverable from the transcript]

Page 30

Statistical multiplex

• GD-mode

• Faster than Brent search

• 3 times faster than exhaustive search

• Based on the prior distribution of alpha

• Test 0.98, 1, and 1.02

• If 0.98 wins, continue

[Garbled formula for the expected number of Q evaluations E(N); not recoverable]

[Diagram: if 0.98 (or 1.02) wins, 4 Q evaluations (N_0.98 = N_1.02 = 4); if 1.00 wins, 3 Q evaluations (N_1.00 = 3)]
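The search rule on this slide can be sketched as a greedy hill-climb over the warp grid. The step size, starting point, and evaluation cap are illustrative parameters; `score` stands for the per-warp Q evaluation.

```python
def pick_warp(score, step=0.02, start=1.0, max_evals=8):
    """Greedy warp search sketched on the slide: score 0.98, 1.00, 1.02
    first (3 Q evaluations); if an off-center warp wins, keep stepping in
    that direction while the score improves. `score` maps a warp factor
    to a likelihood; step/start/max_evals are illustrative assumptions."""
    cand = {a: score(a) for a in (start - step, start, start + step)}
    best = max(cand, key=cand.get)
    evals = 3
    if best == start:
        return best, evals            # the slide's N_1.00 = 3 case
    direction = step if best > start else -step
    while evals < max_evals:
        nxt = best + direction
        evals += 1                    # one more Q evaluation
        s = score(nxt)
        if s <= cand[best]:
            return best, evals        # the slide's N_0.98 = N_1.02 = 4 case
        cand[nxt] = s
        best = nxt
    return best, evals
```

A unimodal score peaked near 1.00 costs 3 evaluations, and one peaked at 0.98 costs 4, matching the counts in the slide’s diagram.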

Page 31

Bubble Splitting: Principle

1. Separate conditions
• VTLN
2. Train bubble models
3. Compact using SAT
• Feature-space SAT

SAT works on homogeneous conditions

[Diagram: each training speaker is assigned to a bubble B_i with a partial center: SAT model λ_i]
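Step 1 above (separate conditions via VTLN) can be sketched as grouping training speakers by quantized warp factor. The bin count, warp range, and toy speakers below are assumptions, not PSTL’s settings.

```python
from collections import defaultdict

def split_into_bubbles(speaker_warps, n_bins=5, lo=0.88, hi=1.12):
    """Quantize each training speaker's VTLN warp factor into one of
    n_bins equal-width bins; each bin is one 'bubble' of acoustically
    similar speakers (bin count and warp range are assumptions)."""
    width = (hi - lo) / n_bins
    bubbles = defaultdict(list)
    for spk, warp in speaker_warps.items():
        b = min(max(int((warp - lo) / width), 0), n_bins - 1)
        bubbles[b].append(spk)
    return dict(bubbles)

spks = {"m01": 0.96, "m02": 0.98, "f01": 1.06, "f02": 1.08, "f03": 1.10}
print(split_into_bubbles(spks))
```

Each bubble would then get its own model (step 2) before SAT compacts them around a shared center (step 3).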

Page 32

Results

              1st-pass   Adapted
Baseline GD   19.0%      16.6%
SAT           18.7%      16.5%
Bubbles       18.5%      16.0%
192k => 384k  18.6%      16.3%

• About 0.5% WER reduction

• Doubling the model size => only 0.3% WER

Page 33

Conclusion (Bubble)

• Gain: 0.5% WER

• Extension of SAT model compaction

• More efficient VTLN implementation

Page 34

Open questions

• Baseline SAT does not work?

• Speaker definition?

• Best splitting strategy? (One per warp)

• Best decoding strategy? (Closest warp)

• Best bubble training? (MAP/MLLR)

• MMIE

Page 35

Conclusion

• What do we do with all of this data?

• Syllable + bubble splitting

• Two narrowly explored paths among many

• Promising results but nothing breathtaking

• Not ambitious enough?

Page 36

System setup

• RT03 eval

• 6x RT

• Same parameters as the RT-03S eval system
– WI triphones, gender-dependent, MMI
– 2-pass
– Global MLLU + 7-class MLLR
– 39 MFCC + non-causal CMS (2 s)
– 192k Gaussians, 3400 mixtures
– 128 Gaussians / mixture => merged