EARS STT Workshop, March 24, 2005
A Study of Some Factors Impacting SuperARV Language Modeling
Wen Wang1
Andreas Stolcke1
Mary P. Harper2
1. Speech Technology & Research Laboratory, SRI International
2. School of Electrical and Computer Engineering, Purdue University
Motivation
• RT-03 SuperARV gave excellent results using a backoff N-gram approximation [ICASSP’04 paper]
• N-gram backoff approximation of RT-04 SuperARV did not generalize to RT-04 evaluation test set
– Dev04: achieved 1.0% absolute WER reduction over baseline LM
– Eval04: no gain in WER (in fact, a small loss)
• RT-04 SARV LM was developed under considerable time pressure
– Training procedure is very time consuming (weeks to months), due to syntactic parsing of the training data
– Did not have time to examine all design choices in combination
• Reexamine all design decisions in detail
What Changed?
RT-04 SARV training differed from RT-03 training in two aspects:
• Retrained the Charniak parser using a combination of the Switchboard Penn Treebank and the Wall Street Journal Penn Treebank
– The 2003 parser was trained on the WSJ Treebank only.
• Built a SuperARV LM with additional modifiee lexical feature constraints (Standard+ model)
– The 2003 LM was a SuperARV LM without these additional constraints (Standard model)
These changes had given improvements at various points, but were not tested in complete systems on the new Fisher data.
Plan of Attack
• Revisit changes to the training procedure
– Check effect on old and new data sets and systems
• Revisit the backoff N-gram approximation
– Did we just get lucky in 2003?
– Evaluate full SuperARV LM in N-best rescoring
– Find better approximations
• Start the investigation by going back to the 2003 LM, then move to the current system
• Validate training software (and document and release it)
• Work in progress
• Holding out on eval04 testing (to avoid implicit tuning)
Perplexity of RT-03 LMs
• RT-03 LM training data
• LM types tested:
– “Word”: Word backoff 4-gram, KN smoothed
– “SARV N-gram”: N-gram approximation to standard SuperARV LM
– “SARV Standard”: full SuperARV (without additional constraints)
• Full-model gains smaller on dev04
• N-gram approximation breaks down
Test set     Word     SARV N-gram    SARV Standard
dev2001      64.34    53.74          52.70
eval2003     70.80    56.25          54.18
dev2004      63.45    62.87          56.97
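The numbers above are the usual per-word perplexity. As a reminder, a minimal sketch of how it is computed from per-word log probabilities (the scoring itself is assumed already done by an LM toolkit):

```python
import math

def perplexity(logprobs_base10):
    """Per-word perplexity from a list of per-word log10 probabilities,
    as reported by standard LM toolkits."""
    avg_logprob = sum(logprobs_base10) / len(logprobs_base10)
    return 10 ** (-avg_logprob)

# Example: four words, each with probability 1/64 -> perplexity 64
lp = [math.log10(1.0 / 64.0)] * 4
print(round(perplexity(lp), 2))  # -> 64.0
```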
N-best Rescoring with Full SuperARV LM
• Evaluated full Standard SARV LM in final N-best rescoring
• Based on the PLP subsystem of the RT-03 CTS system
• Full SARV rescoring is expensive, so we tried increasingly longer N-best lists:
– Top-50
– Top-500
– Top-2000 (max used in the eval system)
• Early passes (including MLLR) use baseline LM, so gains will be limited
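Schematically, N-best rescoring replaces each hypothesis's LM score with one from the stronger model and re-ranks. A minimal sketch, with a hypothetical `lm_score` function and illustrative weight values (not those of the eval system):

```python
import math

def rescore_nbest(nbest, lm_score, lmw=8.0, wip=0.0):
    """Re-rank an N-best list with a new LM.  `nbest` is a list of
    (words, acoustic_logprob) pairs; `lm_score` maps a word sequence to
    a total LM log probability.  The LM weight (lmw) and word-insertion
    penalty (wip) values here are illustrative defaults."""
    def total(hyp):
        words, ac = hyp
        return ac + lmw * lm_score(words) + wip * len(words)
    return max(nbest, key=total)

# Toy example with a hypothetical unigram "LM" (every word p = 0.1)
def toy_lm(words):
    return sum(math.log10(0.1) for _ in words)

nbest = [(["hello", "world"], -120.0), (["hello", "word"], -121.0)]
best = rescore_nbest(nbest, toy_lm)
print(best[0])  # -> ['hello', 'world']
```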
RT-03 LM N-best Rescoring Results
• Standard SuperARV reduces WER on eval02 and eval03
• No gain on dev04
• Identical gains on eval03-SWB and eval03-Fisher
• The SuperARV gain increases with a larger hypothesis space
             Top-50                  Top-500                 Top-2000
Test set     Word   SARV Standard    Word   SARV Standard    Word   SARV Standard
eval2002     26.7   26.1             26.6   25.8             26.3   25.6
eval2003     ---    ---              26.4   26.1             ---    ---
dev2004      18.2   18.2             18.1   18.1             ---    ---
Adding Modifiee Constraints
• Constraints enforced by a Constraint Dependency Grammar (on which SuperARV is based) can be enhanced by utilizing modifiee information in unary and binary constraints
• We expected this information to improve the SuperARV LM.
• In RT-04 development, we explored using only the modifiee’s lexical category in the LM, adding it to the SuperARV tag structure.
• This reduced perplexity and WER in early experiments.
• But: the additional tag constraints could have hurt LM generalization!
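As a rough illustration of why the extra feature can hurt generalization, the Standard+ change can be pictured as one more field in the tag structure; the field names below are invented for illustration and do not reflect the exact SuperARV encoding:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SuperARVTag:
    """Schematic SuperARV tag; field names are illustrative only."""
    lex_category: str                            # lexical category of the word
    role_labels: Tuple[str, ...]                 # dependency role labels
    modifiee_positions: Tuple[str, ...]          # relative modifiee positions
    modifiee_lex_category: Optional[str] = None  # the Standard+ addition

# Same word, Standard vs. Standard+ view: the extra feature splits what was
# one tag into several, sharpening prediction but thinning the counts
# behind each tag.
standard = SuperARVTag("verb", ("vroot",), ("left",))
standard_plus = SuperARVTag("verb", ("vroot",), ("left",), "noun")
print(standard == standard_plus)  # -> False
```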
Perplexity with Modifiee Constraints
• Trained a SuperARV LM augmented with modifiee lexical features on RT-03 LM data (“Standard+” model)
• Standard+ model reduces perplexity on the eval02 and eval03 test sets (relative to Standard)
• But not on Fisher (dev04) test set!
Test set     Word N-gram   SARV N-gram   SARV Standard   SARV Standard+
dev2001      64.34         53.74         52.70           51.35
eval2003     70.80         56.25         54.18           53.09
dev2004      63.45         62.87         56.97           57.53
N-best Rescoring with Modifiee Constraints
• WER reductions consistent with perplexity results
• No improvement on dev04.
             Top-50                                          Top-500
Test set     Word N-gram   SARV Standard   SARV Standard+    Word N-gram   SARV Standard   SARV Standard+
eval2002     26.7          26.1            26.0              26.6          25.8            25.6
eval2003     ---           ---             ---               26.4          26.1            25.8
dev2004      18.2          18.2            18.2              18.1          18.1            ---
In-domain vs. Out-of-domain Parser Training
• SuperARVs are collected from CDG parses that are obtained by transforming CFG parses
• CFG parses are generated using existing state-of-the-art parsers.
• In 2003: CTS data parsed with parser trained on Wall Street Journal Treebank (out-of-domain parser)
• In 2004: Obtained a trainable version of the Charniak parser
• Retrained the parser on a combination of the Switchboard Treebank and the WSJ Treebank (in-domain parser)
– Expected improved consistency and accuracy of parse structures
– However, there were bugs in that retraining; fixed for the current experiment.
Rescoring Results with In-domain Parser
• Reparsed the RT-03 LM training data with the in-domain parser
• Retrained the Standard SuperARV model (“Standard-retrained”)
• N-best rescoring system as before
• In-domain parsing helps
• Also: the number of distinct SuperARV tags was reduced in retraining (improved parser consistency)
Top-500 rescoring WER (%)
Test set     Word N-gram   SARV Standard   SARV Standard+   SARV Standard-retrained   SARV Standard-retrained+
eval2002     26.6          25.8            25.6             25.6                      25.4
Summary So Far
• Prior design decisions have been validated
• Adding modifiee constraints helps the LM on matched data
• Reparsing with the retrained in-domain parser improves LM quality
• Now: reexamine the approximation used in decoding
• (work in progress)
N-best Rescoring with RT-04 Full Standard+ Model
• The RT-04 model is the “Standard+” model (includes modifiee constraints)
• RT-04 had been built with the in-domain parser
• Caveat: the old parser runs were fraught with some (not catastrophic) bugs; we still need to reparse the RT-04 LM training data (significantly more than the RT-03 data)
• Improved WER, but smaller gains than on older test sets
• Gains improve with more hypotheses
• Suggests the need for a better approximation, to enable use of the SuperARV LM in search
             Top-50                           Top-500
Test set     Word N-gram   SARV Standard+     Word N-gram   SARV Standard+
dev2004      18.0          17.8               17.9          17.6
Original N-gram Approximation Algorithm
• Algorithm description:
1. For each N-gram observed in the training data (whose SuperARV tag information is known), calculate its probability using the Standard or Standard+ SuperARV LM, generating a new LM after renormalization.
2. For each of these N-grams w1…wn (with tags t1…tn):
   a. Extract the short-SuperARV (a subset of components of a SuperARV) sequence from t1…tn, denoted st1…stn.
   b. Find the list of word sequences sharing the same short-SuperARV sequence st1…stn, using the lexicon constructed after training.
   c. From this list, select N-grams that do not occur in the training data: keep those that, when added, reduce perplexity on a held-out test set, or increase it by less than a threshold.
3. The resulting LM can be pruned to make its size comparable to a word-based LM.
• If the held-out set is small, the algorithm will overfit
• If the held-out set is large, the algorithm will be slow
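The held-out selection step of the algorithm above can be sketched as a greedy loop; everything here is a toy stand-in (a renormalized unigram "LM", pre-listed candidates) for the actual SuperARV machinery:

```python
import math

def heldout_perplexity(lm, heldout, floor=1e-6):
    """Held-out perplexity of a toy unigram 'LM' (dict word -> weight),
    renormalized so that added candidates compete for probability mass."""
    total = sum(lm.values())
    logsum = sum(math.log(max(lm.get(w, 0.0), floor) / total) for w in heldout)
    return math.exp(-logsum / len(heldout))

def select_generalized_ngrams(candidates, full_model_prob, lm, heldout,
                              threshold=0.0):
    """Greedy held-out selection (schematic): a candidate proposed by the
    short-SuperARV lexicon is kept only if adding it, with its probability
    taken from the full SuperARV model, does not raise held-out perplexity
    by more than `threshold`."""
    for cand in candidates:
        if cand in lm:
            continue  # already observed in training
        before = heldout_perplexity(lm, heldout)
        lm[cand] = full_model_prob(cand)
        if heldout_perplexity(lm, heldout) > before + threshold:
            del lm[cand]  # hurts generalization: reject
    return lm

# Toy run: "b" occurs in the held-out set, so adding it lowers perplexity
# and it is kept; "c" does not, so the mass it steals raises perplexity
# and it is rejected.
lm = select_generalized_ngrams(["b", "c"], lambda w: 0.25,
                               {"a": 0.5}, heldout=["a", "b", "a"])
print(sorted(lm))  # -> ['a', 'b']
```

This also makes the slide's caveat concrete: every candidate triggers a full pass over the held-out set, so a large held-out set makes the loop slow, while a small one makes the keep/reject decisions noisy.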
Revised N-gram Approximation for SuperARV LMs
• Idea: build a testset-specific N-gram LM that approximates the SuperARV LM [suggested by Dimitra Vergyri]
• Include all N-grams that “matter” to the decoder
• Method:
Step 1: perform first-pass decoding with a word-based language model on the test set, and generate HTK lattices
Step 2: extract N-grams from the HTK lattices; prune based on posterior counts
Step 3: compute conditional probabilities for these N-grams using the standard SuperARV language model
Step 4: compute backoff weights based on the conditional probabilities
Step 5: apply the resulting N-gram LM in all subsequent decoding passes (using standard tools)
• Some approximations left:
– Due to pruning in Step 2
– From using only N-gram context, not the full sentence prefix
• Drawback: Step 3 takes significant compute time
– currently 10×RT, but not yet optimized for speed
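Steps 2 through 4 can be sketched as follows; `sarv_prob` stands in for the full SuperARV model, and the backoff computation is shown only as the leftover-mass numerator of the standard Katz-style weight:

```python
from collections import defaultdict

def lattice_ngram_lm(lattice_ngrams, sarv_prob, posterior_threshold=1e-3):
    """Schematic of Steps 2-4 of the revised approximation.
    `lattice_ngrams` maps each N-gram (a tuple of words) from the
    first-pass lattices to its total posterior count; `sarv_prob(ng)` is
    a placeholder for P(w_n | w_1..w_{n-1}) under the full SuperARV LM."""
    # Step 2: prune N-grams with small total posterior count
    kept = {ng for ng, c in lattice_ngrams.items() if c >= posterior_threshold}
    # Step 3: score the surviving N-grams with the full model
    probs = {ng: sarv_prob(ng) for ng in kept}
    # Step 4: backoff weight per context -- shown only as the leftover
    # probability mass (the lower-order denominator of the true Katz-style
    # weight is omitted for brevity)
    mass = defaultdict(float)
    for ng, p in probs.items():
        mass[ng[:-1]] += p
    backoff = {ctx: max(0.0, 1.0 - m) for ctx, m in mass.items()}
    return probs, backoff

# Toy run with a hypothetical scoring function
probs, bow = lattice_ngram_lm(
    {("the", "cat"): 0.9, ("the", "cad"): 1e-4},
    sarv_prob=lambda ng: 0.6)
print(("the", "cad") in probs, round(bow[("the",)], 2))  # -> False 0.4
```

The resulting probabilities and backoff weights can then be written out in a standard backoff format and used by the decoder like any word N-gram LM (Step 5).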
Lattice N-gram Approximation Experiment
• Based on the RT-03 Standard SuperARV LM
• Extracted N-grams from first-pass HTK lattices
• Pruned N-grams with total posterior count < 10^-3
• Left with 3.6M N-grams on a 6h test set
• RT-02/03 experiment:
– Uses 2003 acoustic models
– 2000-best rescoring (1st pass)
• Dev-04 experiment:
– Uses 2004 acoustic models
– Lattice rescoring (1st pass)
Lattice N-gram Approximation Results
• 1.2% absolute gain on old (matched) test sets
• Small 0.2% gain on the Fisher (mismatched) test set
• Recall: no Fisher gain previously with N-best rescoring
• Better exploitation of the full hypothesis space yields results
Test set     Word N-gram   SARV Lattice N-gram
eval2002     32.1          30.9
eval2003     32.1          30.9
dev2004      20.7          20.5
Conclusions and Future Work
• There is a tradeoff between the generality and selectivity of a SuperARV model, much as was observed in our past CDG grammar induction experiments.
– Making a model more constrained may reduce its generality.
– Modifiee lexical features are helpful for strengthening word-prediction constraints, but they may require more, or better-matched, training data
– We need a better understanding of the interaction between this knowledge source and characteristics of the training data, e.g., the Fisher domain.
• For a structured model like the SuperARV model, it is beneficial to improve the quality of the training syntactic structures, e.g., by making them less errorful or more consistent.
– Observed LM win from better parses (using the retrained parser)
– Can expect further gains from advances in parse accuracy
Conclusions and Future Work (Cont.)
• The old N-gram approximation was flawed
• The new N-gram approximation looks promising, but also needs more work:
– Tests using the full system
– The rescoring algorithm needs speeding up
• Still to do: reparse current CTS LM training set.
• Longer term: plan to investigate how conversational speech phenomena (sentence fragments, disfluencies) can be modeled better in the SuperARV framework.