1
Using TDT Data to Improve BN Acoustic Models
Long Nguyen and Bing Xiang
STT Workshop, Martigny, Switzerland, Sept. 5-6, 2003
2
Overview
TDT Data
Selection procedure
  – Data pre-processing
  – Lightly-supervised decoding
  – Selection
Experimental results
Conclusion & future work
3
TDT Data
TDT2:
  – Jan 1998 – June 1998
  – Four sources (ABC, CNN, PRI, VOA)
  – 1034 shows, 633 hrs
TDT3:
  – Oct 1998 – Dec 1998
  – Six sources (ABC, CNN, MNB/MSN, NBC, PRI, VOA)
  – 731 shows, 475 hrs
TDT4:
  – Oct 2000 – Jan 2001
  – Same six sources as in TDT3
  – 425 shows, 294 hrs
4
Selection Procedure
[Flow diagram] The TDT closed captions are normalized into SNOR and STM transcripts; the SNOR text is combined with the H4 LM data to build a biased LM. The TDT audio is recognized using the H4 acoustic models and the biased LM; the hypotheses are scored against the STM transcripts with sclite, and the resulting SGML alignment output drives the selection of acoustic training transcripts.
5
Closed Caption Format
The TDT audio has closed caption (CC) transcripts:
  – TDT2: *.sgm
  – TDT3: *.src_sgm
  – TDT4: *.tkn_sgm or *.src_sgm (in a different tagging scheme)
Example: 20001103_2100_2200_VOA_ENG.src_sgm
<DOC>
<DOCNO> VOA20001103.2100.0345 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 11/03/2000 21:05:45.18 </DATE_TIME>
<BODY><TEXT>
US share prices closed mixed, Friday. The DOW Jones Industrial average ended the day down 63 points. The NASDAQ Composite Index was 23 points higher. I'm John Bashard, VOA News.
</TEXT></BODY>
<END_TIME> 11/03/2000 21:06:04.42 </END_TIME>
</DOC>
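As a minimal sketch of the pre-processing step, the fields shown in the example above (DOCNO, DATE_TIME, TEXT, END_TIME) can be pulled out of a *.src_sgm file with a simple pattern match. The field names are taken from the example; real TDT files vary by corpus and tagging scheme, so this is illustrative only.

```python
import re

# Illustrative parser for TDT *.src_sgm closed-caption files.
# Field names follow the example document; other tagging schemes differ.
DOC_RE = re.compile(
    r"<DOC>.*?<DOCNO>\s*(?P<docno>\S+)\s*</DOCNO>"
    r".*?<DATE_TIME>\s*(?P<begin>[\d/:. ]+?)\s*</DATE_TIME>"
    r".*?<TEXT>(?P<text>.*?)</TEXT>"
    r".*?<END_TIME>\s*(?P<end>[\d/:. ]+?)\s*</END_TIME>",
    re.DOTALL)

def parse_sgm(raw: str):
    """Yield (docno, begin_time, end_time, caption_text) per <DOC> block."""
    for m in DOC_RE.finditer(raw):
        yield (m.group("docno"), m.group("begin"),
               m.group("end"), m.group("text").strip())
```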
6
CC to SNOR
Normalize CC to SNOR transcripts for LM training:
  – Break into sentences
  – 'Verbalize' numbers (63 => SIXTY THREE)
  – Normalize acronyms and abbreviations (US => U. S.)
  – Etc.
U. S. SHARE PRICES CLOSED MIXED FRIDAY
THE DOW JONES INDUSTRIAL AVERAGE ENDED THE DAY DOWN SIXTY THREE POINTS
THE NASDAQ COMPOSITE INDEX WAS TWENTY THREE POINTS HIGHER
I'M JOHN BASHARD V. O. A. NEWS
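The normalization steps above can be sketched as follows. This is not BBN's actual normalizer: the acronym table and the number verbalizer (here limited to integers below 100) are assumptions for illustration.

```python
import re

# Verbalization tables for integers below 100 (sketch only).
ONES = ["ZERO", "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN", "EIGHT",
        "NINE", "TEN", "ELEVEN", "TWELVE", "THIRTEEN", "FOURTEEN", "FIFTEEN",
        "SIXTEEN", "SEVENTEEN", "EIGHTEEN", "NINETEEN"]
TENS = ["", "", "TWENTY", "THIRTY", "FORTY", "FIFTY", "SIXTY", "SEVENTY",
        "EIGHTY", "NINETY"]

def verbalize(n: int) -> str:
    """Spell out an integer below 100 (63 -> SIXTY THREE)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

# Assumed acronym lookup table; a real system would use a larger list.
ACRONYMS = {"US": "U. S.", "VOA": "V. O. A."}

def cc_to_snor(text: str) -> list:
    """Break CC text into sentences and normalize each into SNOR form."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    out = []
    for s in sentences:
        words = []
        for w in s.rstrip(".!?").replace(",", "").split():
            if w.isdigit():
                words.append(verbalize(int(w)))
            elif w in ACRONYMS:
                words.append(ACRONYMS[w])
            else:
                words.append(w.upper())
        out.append(" ".join(words))
    return out
```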
7
CC and SNOR to STM
Convert CC (and SNOR) to STM format for scoring/aligning later
20001103_2100_2200_VOA_ENG 1 S8 345.000 364.000 <o,f0,unknown> u. s. share prices closed mixed friday the dow jones industrial average ended the day down sixty three points the nasdaq composite index was twenty three points higher i'm john bashard v. o. a. news
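Formatting an utterance as an STM record, as in the example above, amounts to emitting the show name, channel, speaker, begin/end times, condition label, and lower-cased transcript on one line. A minimal sketch (the field layout follows the example; the default label is assumed):

```python
def to_stm(show: str, channel: int, speaker: str,
           begin: float, end: float, words: list,
           label: str = "<o,f0,unknown>") -> str:
    """Format one utterance as an STM record for sclite scoring."""
    text = " ".join(w.lower() for w in words)
    return f"{show} {channel} {speaker} {begin:.3f} {end:.3f} {label} {text}"
```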
8
Lightly-supervised Decoding
Start with a reasonable Hub4 system:
  – Acoustic models: ML-trained on the H4 141-hour corpus
  – Language models: trigrams estimated on the 1998-2000 subset of the GigaWord corpus
Make up biased LMs by adding TDT data with bigger weights:
  – Three LMs, one for each of TDT2, TDT3, and TDT4
  – 40k-word lexicon including all new words found in TDT that have phonetic pronunciations
Decode each show separately (as if it were a new test set):
  – N-best decoder followed by N-best rescoring using SI acoustic models (GD, band-specific)
  – Decode again after adapting the acoustic models
  – Total runtime is about 10xRT
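The idea behind a biased LM can be sketched as an interpolation between a background LM and an LM estimated on the show's TDT captions, with most of the weight on the caption side. The sketch below uses unigrams to stay short (real systems interpolate trigram models), and the weight value is an assumption:

```python
from collections import Counter

def biased_unigram_lm(background: Counter, tdt: Counter, weight: float = 0.9) -> dict:
    """Interpolate a background LM with a TDT-caption LM, biased toward TDT.

    'weight' is the (assumed) interpolation weight on the TDT side; real
    systems do this over trigrams, but unigrams illustrate the idea.
    """
    bg_total = sum(background.values())
    tdt_total = sum(tdt.values())
    vocab = set(background) | set(tdt)
    return {w: weight * tdt[w] / tdt_total + (1 - weight) * background[w] / bg_total
            for w in vocab}
```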
9
Alignment
Use sclite to align the hypotheses with the CC transcripts, taking advantage of the time-stamped word alignments stored in the SGML output:
C,"u.","u.",345.270+345.440:C,"s.","s.",345.440+345.610:C,"share","share",345.610+345.840:C,"prices","prices",345.840+346.290:
C,"closed","closed",346.290+346.630:C,"mixed","mixed",346.630+347.000:C,"friday","friday",347.000+347.490:C,"the","the",347.490+347.580:
C,"dow","dow",347.580+347.800:C,"jones","jones",347.800+348.110:C,"industrial","industrial",348.110+348.720:C,"average","average",348.720+349.130:
C,"ended","ended",349.130+349.350:C,"the","the",349.350+349.430:C,"day","day",349.430+349.610:C,"down","down",349.610+350.020:
C,"sixty","sixty",350.020+350.410:C,"three","three",350.410+350.680:C,"points","points",350.680+351.220:C,"the","the",351.840+351.940:
C,"nasdaq","nasdaq",351.940+352.500:C,"composite","composite",352.500+352.960:C,"index","index",352.960+353.330:C,"was","was",353.330+353.490:
C,"twenty","twenty",353.490+353.800:C,"three","three",353.800+354.020:C,"points","points",354.020+354.380:C,"higher","higher",354.380+354.860:
C,"i'm","i'm",355.710+355.900:C,"john","john",355.900+356.180:I,,"burr",356.180+356.310:S,"bashard","shard",356.310+356.860:
C,"v.","v.",356.860+357.010:C,"o.","o.",357.010+357.160:C,"a.","a.",357.150+357.300:C,"news","news",357.300+358.060
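The alignment field above is a ':'-joined list of entries of the form OP,"ref","hyp",begin+end, where OP is C (correct), S (substitution), I (insertion), or D (deletion). A minimal sketch of a parser for this field (it assumes words contain no commas, which holds for the example):

```python
def parse_sclite_alignment(field: str) -> list:
    """Parse a sclite SGML word-alignment field into
    (op, ref_word, hyp_word, begin, end) tuples.

    Entries look like: C,"word","word",345.270+345.440 joined by ':'.
    Assumes no commas inside words (true of the example above).
    """
    words = []
    for entry in field.split(":"):
        op, ref, hyp, times = entry.split(",")
        begin, end = (float(t) for t in times.split("+"))
        words.append((op, ref.strip('"'), hyp.strip('"'), begin, end))
    return words
```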
10
Selection Strategy
Search through the SGML file to select:
  – Utterances having no errors
  – Phrases of 3+ contiguous correct words
In effect, use only the subset of words on which the CC transcripts and the decoder's hypotheses agree.
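Given the per-word alignment ops, the selection rule reduces to finding runs of correct ('C') words of length 3 or more; a run covering the whole utterance means the utterance is error-free. A sketch:

```python
def select_segments(ops: list, min_run: int = 3) -> list:
    """Return (start, end) index ranges of runs of >= min_run correct words,
    given sclite per-word ops ('C', 'S', 'I', 'D')."""
    runs, start = [], None
    for i, op in enumerate(ops + ["X"]):  # sentinel closes the final run
        if op == "C" and start is None:
            start = i
        elif op != "C" and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs
```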
11
Selection Results
The amount of data selected from TDT2, TDT3, and TDT4 (in hours):

Set    Raw    Transcribed   Cor. Utts   Cor. Utts & Phrases
TDT2    633    425           143         305
TDT3    475    328           119         241
TDT4    294    213            73         156
All    1402    966           335         702

Only 68% of the TDT audio has CC transcripts (966/1402 hrs), based on the observation of long passages of contiguous insertion errors.
The selection yield rate is 72% (702/966), or a 50% yield rate relative to the amount of raw audio data.
12
Scalability
Training Set         Amount    #spkrs   #cbks   #gauss
h4                   141 hrs   7k       6k      164k
h4+tdt4              297 hrs   12k      13k     354k
h4+tdt4+tdt2         602 hrs   23k      26k     720k
h4+tdt4+tdt2+tdt3    843 hrs   31k      34k     983k

Trained 4 sets of acoustic models:
  – ML and HLDA-SAT only
  – (Not quite ready to use MMI training yet)
System parameters grow as more and more data is added, if the thresholds and/or criteria for speaker clustering, state clustering, and Gaussian mixing stay fixed.
13
Experimental Results
AM trained on   SI     Adapt 1   Adapt 2
141 hrs         17.2   13.0      12.7
297 hrs         15.4   12.2      12.0
602 hrs         14.7   11.6      11.4
843 hrs         14.5   ?         ?

Tested on the BN dev03 test set (h4d03), using the same RT03 Eval LMs.
Doubling the data (150 => 300 hrs) provided a 0.7% absolute reduction in WER; doubling again (300 => 600 hrs) provided an additional 0.6% absolute reduction.
14
Un-Adapted Results in Detail
Significant reduction across all shows when adding the TDT4 data to the Hub-4 BN data:

Set       ABC    CNN    MSN    NBC    PRI    VOA    All
141 hrs   15.5   22.3   13.2   13.6   12.1   25.6   17.2
297 hrs   13.8   21.4   11.5   11.8   10.6   22.9   15.4
602 hrs   13.6   20.0   11.7   10.9   10.3   21.3   14.7
15
Adapted Results in Detail
No noticeable reduction observed for the MSN and NBC shows when adding the TDT2 data (these two types of shows are not part of the TDT2 corpus):

Set       ABC    CNN    MSN    NBC    PRI    VOA    All
141 hrs   11.4   18.6   9.8    11.5   9.7    15.6   12.7
297 hrs   10.8   18.0   9.0    10.2   9.2    15.0   12.0
602 hrs   10.3   16.7   9.0    10.0   8.8    14.0   11.4
16
Summary
Proposed an effective strategy for automatically selecting BN audio data that has closed-caption transcripts as (additional) acoustic training data:
  – 68% of the TDT audio data are captioned
  – The selection yield rate is 72% of the captioned data
  – Adding 450 hrs of selected data from the TDT2 and TDT4 corpora provides a 1.3% absolute reduction in WER on the BN dev03 test set
17
Future Work
Obtain results when adding the TDT3 data
Improve the biased LMs and retry
Understand the differences/errors in aligning the hypotheses and the closed captions, to refine the selection criteria
Cooperate with other sites to speed up and improve the data selection process
Use MMI training with this large amount of training data