NRC Report Conclusion Tu Zhaopeng 2009-09-08

NRC Report Conclusion

Tu Zhaopeng, 2009-09-08

NIST06

The Portage System

For the Chinese large-track entry, used a simple but carefully tuned phrase-based system:

Pre-process source text

Viterbi decoding using loglinear model

Nbest rescoring using fancier loglinear model

Post-process raw translation
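As a rough illustration of the decoding and rescoring steps, a loglinear model scores each candidate translation as a weighted sum of feature values and keeps the highest-scoring candidate. The sketch below is not Portage code; the function names, feature names, values, and weights are all made up for illustration.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, Dict[str, float]]


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_hypothesis(candidates: List[Candidate], weights: Dict[str, float]) -> str:
    """Return the candidate translation with the highest loglinear score."""
    return max(candidates, key=lambda c: loglinear_score(c[1], weights))[0]


# Hypothetical candidates with log-probability-style feature values.
candidates = [
    ("the cat sat", {"tm": -2.0, "lm": -3.0, "word_count": 3.0}),
    ("cat the sat", {"tm": -1.5, "lm": -6.0, "word_count": 3.0}),
]
weights = {"tm": 1.0, "lm": 1.0, "word_count": 0.1}
print(best_hypothesis(candidates, weights))
```

Rescoring uses the same form of scoring, only with a larger feature set applied to an n-best list rather than inside the search.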

NIST06

Pre-processing:

Convert to GB2312, removing traditional characters with no GB2312 representation

Segment using LDC segmenter

Translate numbers and dates using rules

Strip non-ASCII OOVs

NIST06

Post-processing

Truecase using 4-gram HMM (via SRILM disambig) trained on parallel corpus

Detokenization heuristics

NIST06 Rescoring

Rescoring based on 5k-best lists, using Powell’s algorithm to find max-BLEU weights

Features (22)

All 12 decoder features

Character length

IBM2 scores in both directions

IBM1-based “missing word” feature (compare score of best translation for each word to best known)

Posterior probabilities calculated from nbest list for: sentence length, phrases, words, unigrams, and bigrams.
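The slides do not give the formula for the nbest-based posteriors. A common construction, assumed here, normalises the model scores of the n-best hypotheses into a probability distribution and then sums the probability mass of all hypotheses that share a property. The sketch below does this for sentence length; the function names and the toy n-best list are illustrative only.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def hypothesis_posteriors(scores: List[float]) -> List[float]:
    """Softmax over log-domain model scores (max subtracted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def length_posteriors(nbest: List[Tuple[str, float]]) -> Dict[int, float]:
    """Posterior probability of each sentence length within the n-best list."""
    posteriors = hypothesis_posteriors([score for _, score in nbest])
    by_length: Dict[int, float] = defaultdict(float)
    for (hyp, _), p in zip(nbest, posteriors):
        by_length[len(hyp.split())] += p
    return dict(by_length)


# Toy 4-best list with equal scores (hypothetical).
nbest = [("a b c", 0.0), ("a b d", 0.0), ("a b", 0.0), ("x", 0.0)]
length_post = length_posteriors(nbest)
```

Each hypothesis would then receive, as a rescoring feature, the posterior of its own length; the phrase, word, and n-gram posteriors follow the same pattern with a different grouping key.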

NIST06 Search Parameters

NIST08

Towards Tighter Integration of Rule-based and Statistical MT in Serial System Combination

Rule-based

Systran

Phrase-based

Portage

NIST08

Annotation of Systran output, five different chunk types:

named entities, numbers, dates

unknown words or unlikely sequences of short words

‘strong’ rules: very reliable chunks, e.g., rules based on a long-distance syntactic relationship, or a long multiword expression

NIST09 Serial system combination

NIST09

NRC system trained on SY/EN parallel corpus:

use SYSTRAN to translate ZH half of parallel ZH/EN training corpus, discarding UN, HKH/L corpora for efficiency → 3M sentence pairs

preprocess SY: strip markup, tokenize, lowercase

standard phrase-based training
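The SY pre-processing step (strip markup, tokenize, lowercase) might look roughly like the following. The slides do not describe the SYSTRAN markup or the tokenizer, so this sketch assumes angle-bracket tags and a simple word/punctuation split; `preprocess_sy` is a hypothetical name.

```python
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")          # assumed markup: angle-bracket tags
TOKEN_RE = re.compile(r"\w+|[^\w\s]")    # runs of word characters, or one punctuation mark


def preprocess_sy(line: str) -> List[str]:
    """Strip tags, split into word and punctuation tokens, and lowercase."""
    without_markup = TAG_RE.sub(" ", line)
    return [tok.lower() for tok in TOKEN_RE.findall(without_markup)]


tokens = preprocess_sy("<b>Hello</b>, World!")
```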

NIST09

Two strategies that didn't work:

Exploit SY/EN surface similarity: boost HMM ttable scores of similar forms, prior to phrase extraction → no improvement

Use SY case information: adopt SY case for aligned EN words; no improvement compared to baseline independent truecaser

NIST09

Common features:

phrase table based on symmetrized HMM word alignments (4 features: lex+rf, fwd+bkw)

5g mixture LM from parallel corpus (Foster & Kuhn, WMT07)

6g LM from GW

word count and distortion

NIST09

NIST09

Useful

rescoring with IBM- and nbest-based features (Ueffing and Ney, CL07; Chen et al., IWSLT05): +0.3 BLEU

greedy feature pruning for rescoring: +0.3 BLEU

truecasing with “title trick”: +0.3 BLEU
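The slides do not say how the greedy feature pruning was carried out. One plausible reading, assumed here, is backward elimination: repeatedly drop the single rescoring feature whose removal most improves the dev-set score, and stop when no removal helps. In the sketch below, `evaluate` is a hypothetical callback that would re-tune weights on the given feature set and return dev BLEU; the toy scorer stands in for it.

```python
from typing import Callable, FrozenSet, Iterable, Set


def greedy_prune(features: Iterable[str],
                 evaluate: Callable[[FrozenSet[str]], float]) -> Set[str]:
    """Backward elimination: drop one feature at a time while the score improves."""
    current = set(features)
    best = evaluate(frozenset(current))
    while len(current) > 1:
        # Score every feature set obtained by removing a single feature.
        trials = [(evaluate(frozenset(current - {f})), f) for f in sorted(current)]
        score, to_drop = max(trials)
        if score <= best:
            break  # no single removal improves the score
        current.remove(to_drop)
        best = score
    return current


# Toy scorer (hypothetical): useful features add 1.0, noisy ones cost 0.5.
USEFUL = {"lm", "tm"}
NOISY = {"n1", "n2"}


def toy_evaluate(fs: FrozenSet[str]) -> float:
    return len(fs & USEFUL) - 0.5 * len(fs & NOISY)


kept_features = greedy_prune(USEFUL | NOISY, toy_evaluate)
```

In a real setting each call to `evaluate` involves re-optimising the rescoring weights, so the cost grows quadratically with the number of features; with 22 features that remains manageable.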