
Page 1: Integrated Stochastic Pronunciation Modeling

Integrated Stochastic Pronunciation Modeling

Dong Wang

Supervisors: Simon King, Joe Frankel, James Scobbie

Page 2: Integrated Stochastic Pronunciation Modeling

Contents

Problems we are addressing
Previous research
Integrated stochastic pronunciation modeling
Current experimental results
Work plan

Page 3: Integrated Stochastic Pronunciation Modeling

Problems we are addressing

1. Constructing a lexicon is time consuming.
2. Traditional lexicon-based triphone systems lack robustness to pronunciation variation in real speech.

• Linguistics-based lexica seldom consider real speech.
• Deterministic decomposition from words to acoustic units, through lexica and decision trees.

Page 4: Integrated Stochastic Pronunciation Modeling

Previous research

Alternative pronunciation generation: Utilize real speech to expand the lexicon.

Automatic lexicon generation: Utilize real speech to create a lexicon.

Hidden sequence modeling (HSM): Build a probabilistic mapping from phonemes to context-dependent phones.

Page 5: Integrated Stochastic Pronunciation Modeling

Previous research

Problems:
1. Linguistics-based lexica
2. Deterministic mapping

Page 6: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

Integrated Stochastic Pronunciation Modeling (ISPM): Build a flexible three-layer architecture which represents pronunciation variation in probabilistic mappings, achieving better performance than traditional triphone-based systems.

Focus on the grapheme-based ISPM system, eliminating human effort on lexicon construction.

Page 7: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

Grapheme-based ISPM

Page 8: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

Spelling simplification model (SSM)
• Map a letter string with a regular pronunciation to a simple grapheme according to the context, e.g., EA -> E.
• Map a letter string with several pronunciations to simple graphemes, with appearance probabilities attached, e.g., OUGH -> O (0.6), AF (0.4) (see the sketch below).
• Comparing the transcription from grapheme decoding against the reference transcription helps find the mappings.
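As a concrete illustration, here is a minimal sketch of how such a stochastic rule table could be applied to a spelling. The rule set and the `expand` helper are hypothetical; a real SSM would learn its mappings from grapheme decoding against reference transcriptions.

```python
# Minimal sketch of a stochastic spelling simplification model (SSM).
# The rules are illustrative examples; a trained SSM would supply them.
SSM_RULES = {
    "EA": [("E", 1.0)],                 # regular pronunciation: one mapping
    "OUGH": [("O", 0.6), ("AF", 0.4)],  # several pronunciations, with probabilities
}

def expand(spelling):
    """Return all simplified grapheme strings for a spelling, with probabilities."""
    results = [("", 1.0)]
    i = 0
    while i < len(spelling):
        # Try the longest matching rule at this position; single letters map to themselves.
        for length in range(len(spelling) - i, 0, -1):
            chunk = spelling[i:i + length]
            options = SSM_RULES.get(chunk, [(chunk, 1.0)] if length == 1 else None)
            if options:
                results = [(prefix + g, p * q)
                           for prefix, p in results for g, q in options]
                i += length
                break
    return results

print(expand("THOUGHT"))  # [('THOT', 0.6), ('THAFT', 0.4)]
```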

Grapheme pronunciation model (GPM): The probabilistic mapping between the canonical layer and the acoustic layer. LMs, decision trees and ANNs can all be examined here.

Page 9: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

Why graphemes?
• The simple relationship between word spellings and sub-word units makes it easy to generate baseforms for any word, avoiding human effort on lexicon construction (see the sketch below).
• It is easy to handle OOV words and to reconstruct words from grapheme strings.
• Building and applying grapheme-based LMs is simple.
• The internal composition of phonological rules and acoustic clues makes it suitable for some applications, such as spoken term detection and language identification.
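A minimal sketch of the baseform argument: single-letter decomposition (the "simplest grapheme" configuration) yields a pronunciation for any spelling, so no hand-built lexicon entry is ever needed. The word list is arbitrary.

```python
# Sketch: grapheme baseforms exist for any word, including OOVs, without a
# hand-built lexicon; a phoneme lexicon would need a new entry (or a trained
# G2P model) for each unseen word.

def grapheme_baseform(word):
    """Decompose a spelling into single-letter grapheme units."""
    return [ch for ch in word.upper() if ch.isalpha()]

for word in ["MEETING", "PODCAST", "WIKIPEDIA"]:
    print(word, "->", " ".join(grapheme_baseform(word)))
# MEETING -> M E E T I N G
# PODCAST -> P O D C A S T
# WIKIPEDIA -> W I K I P E D I A
```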

Page 10: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

Direct grapheme ISPM

Direct grapheme ISPM: SSM is a 1:1 mapping

Page 11: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

Hidden grapheme ISPM

Hidden grapheme ISPM: SSM is an n:m mapping
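The contrast between the two configurations can be sketched as follows; the toy OUGH rule is an illustrative assumption, not a learned mapping.

```python
# Direct grapheme ISPM: the SSM is a 1:1 identity mapping, so the canonical
# layer is just the spelling and all variation is modeled by the GPM.
def direct_ssm(letters):
    return [(letters, 1.0)]

# Hidden grapheme ISPM: the SSM is an n:m stochastic mapping, so one spelling
# may yield several canonical grapheme strings with attached probabilities.
def hidden_ssm(letters):
    if letters == "OUGH":  # toy rule for illustration
        return [("O", 0.6), ("AF", 0.4)]
    return [(letters, 1.0)]

print(direct_ssm("OUGH"))  # [('OUGH', 1.0)]
print(hidden_ssm("OUGH"))  # [('O', 0.6), ('AF', 0.4)]
```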

Page 12: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

Training: A divide-and-conquer approach, as in HSM, will be used for ISPM training. With this approach, the SSM, GPM and AM are optimized iteratively and alternately within an EM framework, which ensures that the process converges to a local optimum (sketched below).

The acoustic units will be grown from a set of initial single-letter grapheme HMMs, as in the automatic lexicon generation approach.
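A schematic of this alternating optimization, assuming each model exposes a re-estimation step and a likelihood function is supplied; all names are placeholders rather than the actual training API.

```python
# Schematic of the divide-and-conquer EM training loop (placeholder API).
# SSM, GPM and the acoustic model (AM) are re-estimated alternately; each
# step cannot decrease the training likelihood, so the loop converges to
# a local optimum.

def train_ispm(ssm, gpm, am, data, likelihood, max_iters=10, tol=1e-4):
    prev_ll = float("-inf")
    for _ in range(max_iters):
        ssm.reestimate(data, fixed=(gpm, am))  # update SSM, others fixed
        gpm.reestimate(data, fixed=(ssm, am))  # update GPM, others fixed
        am.reestimate(data, fixed=(ssm, gpm))  # update grapheme HMMs
        ll = likelihood(data, ssm, gpm, am)
        if ll - prev_ll < tol:                 # converged to a local optimum
            break
        prev_ll = ll
    return ssm, gpm, am
```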

Decoding: The optimized ISPM will be used to expand the search graphs fed to the Viterbi decoder. No changes are required in the decoder itself.
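One way this could look in practice is writing the weighted variants out as dictionary entries, so the decoder consumes them like any other pronunciation dictionary; the file format and variant lists below are illustrative assumptions.

```python
# Sketch: expand each word into probability-weighted pronunciation variants
# and emit them as dictionary entries; the Viterbi decoder stays unchanged.
variants = {
    "THOUGHT": [("T H O T", 0.6), ("T H AF T", 0.4)],
    "MEETING": [("M E E T I N G", 1.0)],
}

with open("ispm.dict", "w") as f:
    for word, prons in sorted(variants.items()):
        for units, prob in prons:
            # word, pronunciation probability, grapheme unit sequence
            f.write(f"{word}\t{prob:.2f}\t{units}\n")
```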

Implementation steps: The SSM and GPM are well separated, so they can be designed and implemented independently and then combined. The SSM is relatively simpler and will therefore be implemented first.

Page 13: Integrated Stochastic Pronunciation Modeling

Integrated stochastic pronunciation modeling

The proposed ISPM will be evaluated on three tasks: Large vocabulary speech recognition (LVSR) Spoken term detection (STD) Language identification (LID)

Performance gain expectation from ISPM:

      Simplest grapheme  Simple grapheme  Direct grapheme  Hidden grapheme
      (NONO)             (SSM)            (GPM)            (SSM+GPM)
LVSR  -                  ★                ★                ★★
STD   -                  ★                ★★               ★★
LID   -                  ★                -                ★

Page 14: Integrated Stochastic Pronunciation Modeling

Current experimental results

Large vocabulary speech recognition

Data corpora for the LVSR task:

         Training (h)  Development (h)  Evaluation (h)
WSJCAM0  14.9          0.65             1.00
RT04S    103.9         1.40             1.66

Experiment settings for the LVSR task:

         Training voc  Test voc  Language model
WSJCAM0  WSJCAM0       WSJ-5k    WSJ 3-gram
RT04S    CMU+festival  CMU       AMI 3-gram

WSJCAM0 provides read speech; RT04S provides spontaneous speech in the meeting domain.

Page 15: Integrated Stochastic Pronunciation Modeling

Current experimental results

Large vocabulary speech recognition

Experimental results of the LVSR task (WER):

         Phoneme system  Grapheme system
WSJCAM0  11.3%           15.8%
RT04S    44.5%           54.5%

Contribution of context-dependent modeling (WER):

          CI     CD
Phoneme   21.2%  9.8%
Grapheme  48.4%  13.0%

Page 16: Integrated Stochastic Pronunciation Modeling

Current experimental results

Conclusions
• The grapheme-based system usually performs worse than the phoneme-based one, especially on the RT04S meeting-domain task, where a 10% absolute performance degradation is observed.
• A grapheme-based system relies on context-dependent modeling more than a phoneme-based system does, and requires more Gaussian mixture components.
• State-tying questions that reflect phonological rules are helpful. Other experiments showed that manually-designed multi-letter graphemes do not help significantly.

Large vocabulary speech recognition

Contribution of phonology-oriented questions to the grapheme system (WER):

Phoneme  Grapheme (extended questions)  Grapheme (singleton questions)
11.3%    15.8%                          16.5%

Page 17: Integrated Stochastic Pronunciation Modeling

Current experimental results

Spoken term detection

Sub-word lattice-based architecture for STD (figure)

Page 18: Integrated Stochastic Pronunciation Modeling

Current experimental results

Figure of Merit (FOM): average detection rate over the range [1,10] false alarms per hour.

STD performance on the RT04S task:

      Phoneme  Grapheme
FOM   20.5     18.0
OCC   0.44     0.34
ATWV  0.25     0.16
WER   44.5%    54.5%

Spoken term detection

Occurrence-weighted value (OCC):

$$\mathrm{OCC} = \frac{\sum_{\text{term}} \left\{ N_{\text{correct}}(\text{term}) - 0.1\, N_{\text{spurious}}(\text{term}) \right\}}{\sum_{\text{term}} N_{\text{true}}(\text{term})}$$

Actual term-weighted value (ATWV):

$$\mathrm{ATWV} = 1 - \operatorname*{average}_{\text{term}} \left\{ P_{\text{Miss}}(\text{term}) + P_{\text{FA}}(\text{term}) \right\}$$
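A sketch of the three STD metrics computed from per-term detection statistics. The formulas follow the definitions above; the per-term input structure is an assumption.

```python
# Sketch of the STD metrics, from per-term detection statistics.
# Each term carries N_true (reference occurrences), N_correct (hits),
# N_spurious (false alarms), and P_miss / P_fa for ATWV.

def fom(detection_rate_at_fa):
    """Figure of Merit: average detection rate at 1..10 false alarms/hour."""
    return sum(detection_rate_at_fa[k] for k in range(1, 11)) / 10.0

def occ(terms):
    """Occurrence-weighted value: a false alarm costs 0.1 of a hit."""
    gained = sum(t["N_correct"] - 0.1 * t["N_spurious"] for t in terms)
    possible = sum(t["N_true"] for t in terms)
    return gained / possible

def atwv(terms):
    """Actual term-weighted value: 1 minus the average miss+FA cost per term."""
    return 1.0 - sum(t["P_miss"] + t["P_fa"] for t in terms) / len(terms)
```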

Page 19: Integrated Stochastic Pronunciation Modeling

Current experimental results

Spoken term detection

• A grapheme-based STD system is attractive because OOV words can be handled easily and the lattice search is efficient and simple.

• In our experiments the phoneme-based STD system works better. We suppose this is because some infrequent terms are more difficult for the grapheme-based system to recognize.

• If similar ASR performance can be achieved, the grapheme-based system will outperform the phoneme-based one, as shown in the right figure.

Page 20: Integrated Stochastic Pronunciation Modeling

Current experimental results

Spoken term detection

We have demonstrated that in Spanish, which has a simple grapheme-to-phoneme relationship and where the phoneme- and grapheme-based systems achieve similar ASR performance, the grapheme-based STD system outperforms the phoneme-based one.

Page 21: Integrated Stochastic Pronunciation Modeling

Current experimental results

Language identification

Parallel phone/grapheme recognizer architecture for LID (figure)

Page 22: Integrated Stochastic Pronunciation Modeling

Current experimental results

DER (%):

                     Phone  Grapheme  Phone+grapheme
Unit likelihood      35.6   32.1      27.9
Sentence likelihood  46.8   39.6      39.4

Language identification

• GlobalPhone is used for the initial experiments, but we will move to NIST standard corpora.

• Detection error rate (DER), defined as the number of incorrect detections divided by the total number of trials, is used as the metric. Results on 3 seconds of speech across 4 languages are reported.

• Both whole-sentence scores and scores averaged over sub-word units are tested as the ANN input (see the sketch below).
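A sketch of the DER metric and the two ANN input variants; the score values would come from the parallel phone/grapheme recognizers, and the function names are placeholders.

```python
# Sketch: the LID metric and the two scoring variants fed to the ANN.

def detection_error_rate(n_incorrect, n_trials):
    """DER = incorrect detections / total trials."""
    return n_incorrect / n_trials

def ann_inputs(unit_loglikes, sentence_loglike):
    """Variant 1: score averaged over sub-word units; variant 2: whole-sentence score."""
    unit_feature = sum(unit_loglikes) / len(unit_loglikes)
    return {"unit": unit_feature, "sentence": sentence_loglike}
```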

Page 23: Integrated Stochastic Pronunciation Modeling

Work plan

Phase I: Simple grapheme-based system
1. Finish the STD experiments with high-order LMs (by Jan. 2008).
2. Finish the LID-oriented tuning (by Nov. 2007).
3. Apply powerful LMs to the LID task (by Jan. 2008).
4. Finish the SSM design (by Jan. 2008).
5. Apply the SSM to LVSR RT04S and STD (by Feb. 2008).

Phase II: Integrated stochastic pronunciation modeling
1. Finish the direct-grapheme architecture (GPM) design (by Jul. 2008).
2. Test the direct-grapheme architecture on the LVSR RT04S task (by Oct. 2008).
3. Finish the hidden-grapheme architecture (GPM+SSM) (by Jan. 2009).
4. Test the hidden-grapheme architecture on the LVSR RT04S task (by Feb. 2009).

Phase III: Applications based on ISPM
1. Finish the test on the STD task (by May 2009).
2. Finish the test on the LID task (by May 2009).