NRC Report Conclusion
Tu Zhaopeng, 2009-09-08
NIST06
The Portage System
For Chinese large-track entry, used simple,
but carefully-tuned, phrase-based system:
Pre-process source text
Viterbi decoding using loglinear model
Nbest rescoring using fancier loglinear model
Post-process raw translation
NIST06
Pre-processing:
Convert to GB2312, removing traditional
characters with no GB2312 representation
Segment using LDC segmenter
Translate numbers and dates using rules
Strip non-ASCII OOVs
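The last pre-processing step can be illustrated with a minimal sketch. It assumes the input is already segmented into tokens and that the known source vocabulary is available as a set; the function name and both arguments are hypothetical, since the slides do not show Portage's actual implementation.

```python
def strip_non_ascii_oovs(tokens, vocab):
    """Drop tokens that are out-of-vocabulary AND contain non-ASCII characters.

    ASCII OOVs (e.g. Latin-script names, URLs) are kept so they can pass
    through to the output untranslated. `vocab` is an assumed set of known
    source-side tokens.
    """
    kept = []
    for tok in tokens:
        is_oov = tok not in vocab
        if is_oov and not tok.isascii():
            continue  # unknown Chinese token: remove rather than copy through
        kept.append(tok)
    return kept
```

For example, with a vocabulary of {"我们", "使用"}, the unknown token "魑魅" would be dropped while the unknown ASCII token "Portage" would be kept.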
NIST06
Post-processing
Truecase using 4-gram HMM (via SRILM disambig)
trained on parallel corpus
Detokenization heuristics
NIST06 Rescoring
Rescoring based on 5k-best lists, using Powell’s
algorithm to find max-BLEU weights
Features (22)
All 12 decoder features
Character length
IBM2 scores in both directions
IBM1-based “missing word” feature (compare score of best
translation for each word to best known)
Posterior probabilities calculated from nbest list for:
sentence length, phrases, words, unigrams, and bigrams.
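A rough sketch of how n-best rescoring with a loglinear model might look, together with one of the posterior features (sentence length). The data layout (each hypothesis as a pair of token list and feature vector) and all function names are assumptions for illustration; weight tuning with Powell's algorithm is outside the sketch and the weights are simply taken as given.

```python
import math
from collections import defaultdict

def loglinear_score(features, weights):
    # Loglinear model: the score is the weighted sum of feature values.
    return sum(w * f for w, f in zip(weights, features))

def rescore(nbest, weights):
    """Return the tokens of the highest-scoring hypothesis.

    nbest: list of (tokens, feature_vector) pairs (assumed layout).
    """
    return max(nbest, key=lambda h: loglinear_score(h[1], weights))[0]

def length_posteriors(nbest, weights):
    """Posterior probability of each sentence length over the n-best list."""
    scores = [loglinear_score(f, weights) for _, f in nbest]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    post = defaultdict(float)
    for (tokens, _), e in zip(nbest, exps):
        post[len(tokens)] += e / z
    return dict(post)
```

The same normalization over the list could be reused for the phrase, word, unigram, and bigram posteriors by accumulating mass per item instead of per length.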
NIST06 Search Parameters
NIST08
Towards Tighter Integration of Rule-based and
Statistical MT in Serial System Combination
Rule-based
Systran
Phrase-based
Portage
NIST08
Annotation of Systran output, five
different chunk types:
named entities, numbers, dates
unknown words or unlikely sequences of short
words
‘strong’ rules : very reliable chunks, e.g., rules
based on a long distance syntactic relationship, or
a long multiword expression
NIST09 Serial system combination
NIST09
NRC system trained on SY/EN parallel corpus:
use SYSTRAN to translate ZH half of parallel ZH/EN
training corpus, discarding UN, HKH/L corpora for
efficiency → 3M sentence pairs
preprocess SY: strip markup, tokenize, lowercase
standard phrase-based training
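The SY pre-processing step (strip markup, tokenize, lowercase) could be sketched as below. This assumes the markup consists of XML-like tags and uses a simple regex tokenizer; both are illustrative assumptions, not the tools actually used by the NRC system.

```python
import re

def preprocess_sy(line):
    """Strip markup, tokenize, and lowercase one line of SYSTRAN output.

    Assumes markup is XML-like (<...>); the real annotation format is not
    given in the slides.
    """
    text = re.sub(r"<[^>]+>", " ", line)          # strip markup
    tokens = re.findall(r"\w+|[^\w\s]", text)     # words and single punctuation marks
    return [t.lower() for t in tokens]            # lowercase
```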
NIST09
Two strategies that didn't work:
Exploit SY/EN surface similarity: boost HMM ttable
scores of similar forms, prior to phrase extraction →
no improvement
Use SY case information: adopt SY case for aligned
EN words: no improvement compared to baseline
independent truecaser
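The second strategy, adopting SY case for aligned EN words, might be sketched as follows. The alignment format (a list of (SY index, EN index) pairs) and the function name are hypothetical; the slides only report that the idea gave no gain over the independent truecaser.

```python
def transfer_case(sy_tokens, en_tokens, alignment):
    """Copy the case pattern of each SY token onto its aligned EN token.

    alignment: list of (sy_index, en_index) pairs (assumed format).
    EN tokens are assumed to be lowercase on input.
    """
    out = list(en_tokens)
    for sy_i, en_i in alignment:
        sy = sy_tokens[sy_i]
        if sy.isupper() and len(sy) > 1:
            out[en_i] = en_tokens[en_i].upper()        # all caps, e.g. acronyms
        elif sy[:1].isupper():
            out[en_i] = en_tokens[en_i].capitalize()   # initial capital
    return out
```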
NIST09
Common features:
phrase table based on symmetrized HMM word
alignments (4 features: lex+rf, fwd+bkw)
5g mixture LM from parallel corpus (Foster &
Kuhn, WMT07)
6g LM from GW
word count and distortion
NIST09
Useful
rescoring with IBM- and nbest-based features
(Ueffing and Ney, CL07; Chen et al., IWSLT05):
+0.3 BLEU
greedy feature pruning for rescoring: +0.3
BLEU
truecasing with "title trick": +0.3 BLEU
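Greedy feature pruning for rescoring can be sketched as backward elimination. The slides do not say which greedy variant was used, so this is only one plausible reading. The `evaluate` callable is hypothetical: in practice it would retune the rescoring weights on the given feature subset and return dev-set BLEU.

```python
def greedy_prune(features, evaluate):
    """Repeatedly drop the feature whose removal most improves the dev score.

    features: list of feature names.
    evaluate: assumed callable mapping a feature subset to a score
              (higher is better).
    Returns the surviving features and their score.
    """
    active = list(features)
    best = evaluate(active)
    while len(active) > 1:
        # Score every subset obtained by removing exactly one feature.
        candidates = [
            (evaluate([f for f in active if f != drop]), drop)
            for drop in active
        ]
        score, drop = max(candidates, key=lambda c: c[0])
        if score <= best:
            break  # no single removal helps any more
        active.remove(drop)
        best = score
    return active, best
```

Each round costs one evaluation per remaining feature, so with 22 rescoring features the first round alone would need 22 retuning runs under this scheme.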