NRC Report Conclusion

Tu Zhaopeng, 2009-09-08

NIST06

The Portage System

For the Chinese large-track entry, used a simple but carefully tuned phrase-based system:

Pre-process source text

Viterbi decoding using loglinear model

Nbest rescoring using fancier loglinear model

Post-process raw translation
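The decoding and rescoring steps above both rank hypotheses with a loglinear model, i.e. a weighted sum of feature values. A minimal sketch of that scoring rule, assuming made-up feature values and weights (the names `loglinear_score` and `best_hypothesis` are hypothetical, not part of Portage):

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (typically log-probabilities and counts)."""
    return sum(w * h for w, h in zip(weights, features))


def best_hypothesis(nbest, weights):
    """Return the (text, features) pair with the highest loglinear score."""
    return max(nbest, key=lambda hyp: loglinear_score(hyp[1], weights))


# Toy n-best list: each entry is (translation, [feature values]).
# The three features are illustrative only (e.g. TM score, LM score, word count).
nbest = [
    ("the cat sat", [-2.0, -1.5, 3.0]),
    ("cat the sat", [-1.8, -4.0, 3.0]),
]
weights = [1.0, 0.5, 0.1]

best_text, best_features = best_hypothesis(nbest, weights)
print(best_text)
```

Rescoring uses the same rule with a larger feature vector over the n-best list instead of the decoder's search space.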

NIST06

Pre-processing:

Convert to GB2312, removing traditional characters with no GB2312 representation

Segment using LDC segmenter

Translate numbers and dates using rules

Strip non-ASCII OOVs

NIST06

Post-processing

Truecase using 4-gram HMM (via SRILM disambig) trained on parallel corpus

Detokenization heuristics

NIST06 Rescoring

Rescoring based on 5k-best lists, using Powell’s algorithm to find max-BLEU weights
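The idea of tuning rescoring weights on n-best lists can be illustrated with a much simpler search than Powell's algorithm. The sketch below is a plain coordinate-wise grid search, not Powell's method, and it maximizes the sum of a precomputed per-hypothesis quality score as a stand-in for corpus BLEU. All names and numbers are hypothetical.

```python
def pick(nbest, weights):
    """Index of the top-scoring hypothesis under the loglinear model."""
    return max(
        range(len(nbest)),
        key=lambda i: sum(w * h for w, h in zip(weights, nbest[i][0])),
    )


def corpus_quality(dev, weights):
    """Sum of quality scores of the selected hypotheses (stand-in for BLEU)."""
    return sum(nbest[pick(nbest, weights)][1] for nbest in dev)


def coordinate_search(dev, weights, grid, sweeps=3):
    """Try each grid value for one weight at a time, keeping improvements."""
    weights = list(weights)
    best = corpus_quality(dev, weights)
    for _ in range(sweeps):
        for k in range(len(weights)):
            for value in grid:
                trial = weights[:k] + [value] + weights[k + 1:]
                score = corpus_quality(dev, trial)
                if score > best:
                    best, weights = score, trial
    return weights, best


# Toy dev set: one n-best list per sentence, entries are (features, quality).
dev = [
    [([-1.0, -3.0], 0.2), ([-2.0, -1.0], 0.9)],
    [([-0.5, -2.0], 0.7), ([-1.5, -2.5], 0.1)],
]
tuned, score = coordinate_search(dev, [1.0, 0.0], grid=[0.0, 0.5, 1.0, 2.0])
```

Because the objective only changes when the top hypothesis of some sentence changes, it is piecewise constant in the weights, which is why derivative-free line searches are a natural fit for this kind of tuning.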

Features (22)

All 12 decoder features

Character length

IBM2 scores in both directions

IBM1-based “missing word” feature (compare score of best translation for each word to best known)

Posterior probabilities calculated from nbest list for: sentence length, phrases, words, unigrams, and bigrams.
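As an illustration of the n-best posterior features listed above, the following sketch computes word posteriors: hypothesis scores are normalized with a softmax, and each word receives the total probability mass of the hypotheses that contain it. The `scale` parameter and the function names are assumptions for illustration; the slides do not specify how the posteriors were normalized.

```python
import math
from collections import defaultdict


def hypothesis_posteriors(scores, scale=1.0):
    """Softmax over model scores, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_posteriors(nbest, scale=1.0):
    """Posterior of each word = mass of the hypotheses in which it occurs."""
    probs = hypothesis_posteriors([score for _, score in nbest], scale)
    posterior = defaultdict(float)
    for (tokens, _), p in zip(nbest, probs):
        for word in set(tokens):  # count each word once per hypothesis
            posterior[word] += p
    return dict(posterior)


# Toy n-best list: (tokens, model score). Equal scores give equal mass.
post = word_posteriors([(["the", "cat"], 0.0), (["a", "cat"], 0.0)])
```

The same pattern extends to sentence length, bigrams, or phrases by changing what is extracted from each hypothesis.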

NIST06 Search Parameters

NIST08

Towards Tighter Integration of Rule-based and Statistical MT in Serial System Combination

Rule-based

Systran

Phrase-based

Portage

NIST08

Annotation of Systran output, five different chunk types:

named entities, numbers, dates

unknown words or unlikely sequences of short words

‘strong’ rules: very reliable chunks, e.g., rules based on a long-distance syntactic relationship, or a long multiword expression

NIST09 Serial system combination

NIST09

NRC system trained on SY/EN parallel corpus:

use SYSTRAN to translate ZH half of parallel ZH/EN training corpus, discarding UN, HKH/L corpora for efficiency → 3M sentence pairs

preprocess SY: strip markup, tokenize, lowercase

standard phrase-based training
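The SY pre-processing step (strip markup, tokenize, lowercase) might look roughly like the sketch below. The regular expressions are assumptions chosen for illustration; the actual markup format of the SYSTRAN output and the tokenizer used are not described in the slides.

```python
import re


def preprocess_sy(line):
    """Strip tag-like markup, split words from punctuation, and lowercase."""
    text = re.sub(r"<[^>]+>", " ", line)        # remove anything shaped like <tag>
    tokens = re.findall(r"\w+|[^\w\s]", text)   # word runs or single punctuation marks
    return " ".join(tokens).lower()


example = preprocess_sy("<b>Hello</b>, World!")
print(example)  # hello , world !
```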

NIST09

Two strategies that didn't work:

Exploit SY/EN surface similarity: boost HMM ttable scores of similar forms, prior to phrase extraction → no improvement

Use SY case information: adopt SY case for aligned EN words → no improvement compared to baseline independent truecaser

NIST09

Common features:

phrase table based on symmetrized HMM word alignments (4 features: lex+rf, fwd+bkw)

5g mixture LM from parallel corpus (Foster & Kuhn, WMT07)

6g LM from GW

word count and distortion

NIST09

Useful

rescoring with IBM- and nbest-based features (Ueffing and Ney, CL07; Chen et al., IWSLT05): +0.3 BLEU

greedy feature pruning for rescoring: +0.3 BLEU

truecasing with “title trick”: +0.3 BLEU
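The slides do not say how the greedy feature pruning was carried out. One common reading is backward elimination: repeatedly drop the feature whose removal gives the best dev-set score, and stop once every removal hurts. The sketch below assumes that reading; `evaluate` is a hypothetical callback standing in for "re-tune weights and measure dev BLEU", and the toy gains are invented.

```python
def greedy_prune(features, evaluate):
    """Backward elimination: drop features while the dev score does not fall."""
    kept = list(features)
    best = evaluate(kept)
    while len(kept) > 1:
        # Score every candidate set obtained by removing exactly one feature.
        candidates = [
            (evaluate([f for f in kept if f != drop]), drop) for drop in kept
        ]
        score, drop = max(candidates)
        if score < best:  # every removal hurts: stop
            break
        best = score
        kept.remove(drop)
    return kept, best


# Toy stand-in for dev-set evaluation: each feature adds a fixed gain
# (in hundredths of a point); "noise" hurts and should be pruned.
GAINS = {"lm": 30, "tm": 20, "noise": -5}


def toy_evaluate(feature_set):
    return sum(GAINS[f] for f in feature_set)


kept, best = greedy_prune(["lm", "tm", "noise"], toy_evaluate)
```

With a real system each call to `evaluate` is expensive, so the number of candidate sets per round (one per remaining feature) is the main cost to watch.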
