Investigation on Mandarin Broadcast News Speech Recognition


Mei-Yuh Hwang, Xin Lei, Wen Wang*, Takahiro Shinozaki
University of Washington, *SRI
9/19/2006, Interspeech, Pittsburgh


Page 1: Investigation on Mandarin Broadcast News Speech Recognition

Investigation on Mandarin Broadcast News Speech Recognition

Page 2: Investigation on Mandarin Broadcast News Speech Recognition

Outline

The task
Text training data and language modeling
Acoustic training data and acoustic modeling
Decoding structure
Experimental results
Recent progress and future direction

Page 3: Investigation on Mandarin Broadcast News Speech Recognition

The Task

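Mandarin broadcast news (BN) transcription
Mainland Mandarin speech
TV/radio programs in China and the USA:

CCTV 中央电视台 (China Central Television)
NTDTV 新唐人电视台 (New Tang Dynasty Television)
PHOENIX TV 凤凰卫视 (Phoenix Satellite Television)
VOA 美国之音 (Voice of America)
RFA 自由亚洲电台 (Radio Free Asia)
CNR 中国广播网 (China National Radio)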

Page 4: Investigation on Mandarin Broadcast News Speech Recognition

Text Training Data

LM1: 1997 Mandarin BN Hub4 transcriptions; Chinese TDT2, 3, 4; Multiple-Translation Chinese (MTC) corpus, parts 1, 2, 3
LM2: Gigaword XIN 2001-2004 (China)
LM3: Gigaword ZBN 2001-2004 (Singapore)
LM4: Gigaword CNA 2001-2004 (Taiwan)

420M words in total. The 4 LMs are interpolated.
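The slide says only that the four LMs are interpolated. A minimal sketch of linear interpolation follows, assuming the weights are tuned by EM on held-out text. That is a common choice, but the slides do not state how the weights were obtained, and the function names `interpolate` and `em_weights` are hypothetical.

```python
def interpolate(probs, weights):
    """Linearly interpolate the probabilities that several LMs assign to one word."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probs))


def em_weights(stream_probs, iters=50):
    """Estimate interpolation weights by EM on held-out data (an assumption,
    not a method stated in the slides).

    stream_probs[t][k] is the probability LM k assigns to held-out token t.
    """
    k = len(stream_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for probs in stream_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for i in range(k):
                # Posterior probability that LM i "generated" this token.
                counts[i] += weights[i] * probs[i] / mix
        weights = [c / len(stream_probs) for c in counts]
    return weights
```

With four component LMs, `interpolate([p1, p2, p3, p4], weights)` gives the mixed probability for one word in one context; the weights would typically be tuned on text that resembles the target broadcast news data.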

Page 5: Investigation on Mandarin Broadcast News Speech Recognition

Chinese Word Segmentation

BBN 64k-word lexicon, derived from LDC
Longest-first match with the 64k-word lexicon
Choose the most frequent 49k words as the new lexicon
Train an n-gram LM
Use the unigram part to redo word segmentation based on the maximum-likelihood (ML) path

Page 6: Investigation on Mandarin Broadcast News Speech Recognition

Chinese Word Segmentation

Longest-first: 民进党 / 和亲 / 民党…
"The Green Party made peace with the Min Party via marriage…"

Maximum-likelihood: 民进党 / 和 / 亲民党…
"The Green Party and the Qin-Min Party…"
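The two segmentation strategies can be sketched as follows. This is an illustration rather than the authors' implementation: the log-probabilities in `TOY_LOGP` are made-up values chosen to reproduce the example above, and `max_len` and `oov_logp` are hypothetical parameters.

```python
import math

# Hypothetical unigram log-probabilities, for illustration only.
TOY_LOGP = {"民进党": -8.0, "和": -4.0, "亲民党": -9.0, "和亲": -12.0, "民党": -13.0}


def longest_first(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:  # single characters are always allowed
                words.append(cand)
                i += n
                break
    return words


def max_likelihood(text, unigram_logp, max_len=4, oov_logp=-20.0):
    """Viterbi search for the segmentation with the highest unigram log-probability."""
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)  # (score, backpointer)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            cand = text[start:end]
            if cand in unigram_logp:
                lp = unigram_logp[cand]
            elif end - start == 1:
                lp = oov_logp  # unknown single character
            else:
                continue
            score = best[start][0] + lp
            if score > best[end][0]:
                best[end] = (score, start)
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With these toy values, `longest_first("民进党和亲民党", set(TOY_LOGP))` returns `['民进党', '和亲', '民党']`, while `max_likelihood("民进党和亲民党", TOY_LOGP)` returns `['民进党', '和', '亲民党']`, because the total log-probability of the second path (-21) beats the first (-33).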

Page 7: Investigation on Mandarin Broadcast News Speech Recognition

Perplexity

Word perplexity with the 49k-word lexicon:

2-gram 495
4-gram 288
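For reference, word perplexity is the inverse geometric mean of the per-word probabilities the LM assigns to the test text. A minimal sketch, assuming the per-word scores are available as log10 probabilities (the function name `perplexity` is hypothetical; conventions such as whether sentence-end tokens are counted are not stated in the slides):

```python
def perplexity(log10_probs):
    """Word perplexity from per-word log10 probabilities: 10 ** (-mean log10 p)."""
    if not log10_probs:
        raise ValueError("need at least one scored word")
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))
```

Under this definition, a perplexity of 288 means the 4-gram model is, on average, as uncertain as a uniform choice among 288 words at each position.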

Page 8: Investigation on Mandarin Broadcast News Speech Recognition

Acoustic Training Data

Corpus Size

1997 Hub4 BN 28.5 hrs

*TDT4-CCTV 25 hrs

*TDT4-VOA 43.5 hrs

Total 97 hours

*Automatically selected via flexible alignment with closed captions

Page 9: Investigation on Mandarin Broadcast News Speech Recognition

Acoustic Feature Representation

39-dim MFCC cepstra + Δ + ΔΔ
3-dim pitch + Δ + ΔΔ
Auto speaker clustering
VTLN per auto speaker
Speaker-based CMN + CVN for training
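Speaker-based cepstral mean and variance normalization (CMN + CVN) can be sketched as below. This assumes the features are a frames-by-dimensions NumPy array and that each frame carries the label of its automatically derived speaker cluster; the function name `speaker_cmvn` and the `eps` guard are hypothetical.

```python
import numpy as np


def speaker_cmvn(features, speaker_ids, eps=1e-8):
    """Normalize each speaker cluster's frames to zero mean and unit variance
    in every feature dimension."""
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        frames = features[mask]
        # eps guards against division by zero for constant dimensions.
        out[mask] = (frames - frames.mean(axis=0)) / (frames.std(axis=0) + eps)
    return out
```

The statistics are pooled over all frames of a speaker cluster rather than per utterance, which gives more reliable variance estimates for short segments.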

Page 10: Investigation on Mandarin Broadcast News Speech Recognition

Acoustic Models

2500 senones (clustered states) x 32 Gaussians
ML training vs. MPE training with phone lattices
Gender independent
nonCW (non-cross-word) vs. CW (cross-word) triphones
Speaker-adaptive training (SAT):

$\mathcal{N}(x; A\mu + b, A\Sigma A^{T}) = |A|^{-1}\,\mathcal{N}(A^{-1}(x - b); \mu, \Sigma)$

The linear transformation $A^{-1}x + (-A^{-1}b)$ is applied in the feature domain.
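The SAT identity says that transforming the Gaussian's mean and covariance by (A, b) is equivalent to transforming the feature vector by the inverse transform and adding a Jacobian term. A small numerical check in the log domain, using random values (the helper names `gauss_logpdf` and `check_identity` are hypothetical):

```python
import numpy as np


def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(x; mean, cov)."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def check_identity(seed=0, d=3):
    """Return both sides of the identity (as log-densities) for random A, b, mu, Sigma."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d, d)) + d * np.eye(d)  # shifted to keep A invertible
    b, mu, x = rng.normal(size=(3, d))
    L = rng.normal(size=(d, d))
    Sigma = L @ L.T + np.eye(d)                  # symmetric positive definite
    lhs = gauss_logpdf(x, A @ mu + b, A @ Sigma @ A.T)
    _, logdet_A = np.linalg.slogdet(A)           # log of |det A|
    rhs = -logdet_A + gauss_logpdf(np.linalg.solve(A, x - b), mu, Sigma)
    return lhs, rhs
```

The two returned values agree up to floating-point error, which is why the speaker transform can be applied once to the features instead of to every Gaussian in the model.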

Page 11: Investigation on Mandarin Broadcast News Speech Recognition

2-Pass Search Architecture

Search 1: nonCW, nonSAT, ML model + small bigram → hypothesis
Adaptation on the hypothesis: SAT, MLLR
Search 2: CW, SAT, MPE model + big 4-gram → final word sequence

Page 12: Investigation on Mandarin Broadcast News Speech Recognition

Adding Pitch: SA Results (CER)

Smoothing                   Dev04    Eval04
No pitch                    14.5%    24.1%
IBM-style (mean based)      14.0%    22.2%
SPLINE (cubic smoothing)    12.7%    21.4%

Page 13: Investigation on Mandarin Broadcast News Speech Recognition

2-pass Search Results (CER)

Acoustic model        Dev04    Eval04
nonCW, nonSAT, ML     7.4%     --
nonCW, nonSAT, MPE    6.9%     --
nonCW, SAT, ML        6.8%     --
CW, SAT, ML           6.4%     --
CW, SAT, MPE          6.0%     16.0%
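The slides report character error rate (CER) without defining it. A minimal sketch, assuming the usual definition (character-level edit distance divided by the number of reference characters, with whitespace ignored); the function name `char_error_rate` is hypothetical.

```python
def char_error_rate(reference, hypothesis):
    """Character error rate: Levenshtein distance over characters,
    divided by the number of reference characters (whitespace ignored)."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Because scoring is per character, two hypotheses that differ only in word segmentation, such as "民进党 和 亲民党" and "民进党 和亲 民党", receive the same CER.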

Page 14: Investigation on Mandarin Broadcast News Speech Recognition

More Recent Progress

Added more acoustic training data (440 hrs) and text training data (840M words).
Enlarged and improved the lexicon (60k words).
fMPE training.
Added ICSI features as a second system.
5-gram LM.
Between the MFCC system and the ICSI system: cross adaptation and ROVER.
3.7% CER on dev04, 12.1% on eval04.
Submitted to ICASSP 2007.

Page 15: Investigation on Mandarin Broadcast News Speech Recognition

Challenges

Channel compensation
Conversational speech
Overlapped speech
Speech with music background
Commercials
Language ID (in addition to English)
Is CER the best measurement for MT?