Machine Translation Decoder for Phrase-Based SMT
Stephan Vogel - Machine Translation 1
Machine Translation
Decoder for Phrase-Based SMT
Stephan Vogel
Spring Semester 2011
Stephan Vogel - Machine Translation 2
Decoder
Decoding issues
Two step decoding
- Generation of translation lattice
- Best path search
  - With limited word reordering
Specific issues (next session)
- Recombination of hypotheses
- Pruning
- N-best list generation
- Future cost estimation
Stephan Vogel - Machine Translation 3
Decoding Issues
The decoder takes the source sentence and all available knowledge (translation model, distortion model, language model, etc.) and generates a target sentence
Many alternative translations are possible
- Too many to explore them all -> pruning is necessary
- Pruning leads to search errors
The decoder outputs the model-best translation
- Ranking of hypotheses according to the model differs from the ranking according to an external metric
- Bad translations can get better model scores than good translations -> model errors
Models see only limited context
- Different hypotheses become identical under the model -> hypothesis recombination
Stephan Vogel - Machine Translation 4
Decoding Issues
Languages have different word order
- Modeled by distortion models
- Exploring all possible reorderings is too expensive (essentially O(J!))
- Need to restrict reordering -> different reordering strategies
Optimizing the system
- We use a bunch of models (features), need to optimize the scaling factors (feature weights); see the log-linear formulation below
- Decoding is expensive
- Optimize on n-best lists -> need to generate n-best lists
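For reference, the weighted combination of feature scores used throughout these slides is the standard log-linear model of phrase-based SMT; tuning the feature weights (e.g. on n-best lists) amounts to optimizing the lambda_k below:

```latex
% Log-linear decision rule: the decoder searches for the target sentence e
% that maximizes a weighted sum of feature scores h_k (TM, LM, DM, word count, ...)
% given the source sentence f.
\hat{e} = \arg\max_{e} \sum_{k=1}^{K} \lambda_k \, h_k(e, f)
```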
Stephan Vogel - Machine Translation 5
Decoder: The Knowledge Sources
Translation models
- Phrase translation table
- Statistical lexicon and/or manual lexicon
- Named entities
- Translation information stored as transducers or extracted on the fly
Language model: standard n-gram LM
Distortion model: distance-based or lexicalized
Sentence length model: typically simulated by a word-count feature
Other features: phrase count, number of untranslated words, ...
Stephan Vogel - Machine Translation 6
The Decoder: Two Level Approach
Build translation lattice
- Run left to right over the test sentence
- Search for matching phrases between the source sentence and the phrase table (and other translation tables)
- For each translation, insert edges into the lattice
First-best search (or n-best search)
- Run left to right over the lattice
- Apply the n-gram language model
- Combine translation model scores and language model score
- Recombine and prune hypotheses
- At sentence end: add the sentence length model score
- Trace back the best hypothesis (or n-best hypotheses)
Note: this two-step view is convenient for describing the decoder
- The implementation can interleave both processes
- The implementation can make a difference due to pruning
Stephan Vogel - Machine Translation 7
Building Translation Lattice
Sentence: ich komme morgen zu dir
Reference: I will come to you tomorrow
Search in the corpus for phrases and their translations
Insert edges into the lattice

[Lattice figure: nodes 0 ... J over the source sentence; edges carry phrase translations, e.g. "I come" and "I will come" for "ich komme", and "tomorrow", "to you", "to your office" for "morgen zu dir".]
Stephan Vogel - Machine Translation 8
Phrase Table in Hash Map
Store the phrase table in a hash map (source phrase as key)
For each n-gram in the source sentence, access the hash map

  foreach j = 1 to J-1            // start position of phrase
    foreach l = 0 to lmax-1       // phrase length
      SourcePhrase = (wj ... wj+l)
      TargetPhrases = Hashmap.Get( SourcePhrase )
      foreach TargetPhrase t in TargetPhrases
        create new edge e' = (j-1, j+l, t)    // add TM scores

Works fine for sentence input, but too expensive for lattices
- Lattices from a speech recognizer
- Paraphrases
- Reordering as a preprocessing step
- Hierarchical transducers
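As a concrete illustration of the pseudo-code above, here is a minimal Python sketch of the hash-map lookup. The phrase-table layout (a dict from source-word tuples to lists of (target, scores) pairs) and the edge tuple format are assumptions made for this sketch, not the decoder's actual data structures.

```python
# Minimal sketch: phrase lookup via a hash map (Python dict).
# phrase_table: dict mapping a source phrase (tuple of words) to a list of
# (target_phrase, scores) pairs -- an assumed layout for this example.

def build_lattice_edges(source_words, phrase_table, max_len=7):
    """Return lattice edges (start, end, target, scores) for all matches."""
    J = len(source_words)
    edges = []
    for j in range(J):                           # start position of the phrase
        for l in range(min(max_len, J - j)):     # phrase length - 1
            src = tuple(source_words[j:j + l + 1])
            for target, scores in phrase_table.get(src, []):
                edges.append((j, j + l + 1, target, scores))   # add TM scores
    return edges

# Example:
# phrase_table = {("ich", "komme"): [("I come", 0.4), ("I will come", 0.3)]}
# build_lattice_edges("ich komme morgen zu dir".split(), phrase_table)
# -> [(0, 2, "I come", 0.4), (0, 2, "I will come", 0.3)]
```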
Stephan Vogel - Machine Translation 9
Example: Paraphrase Lattice
[Figures: paraphrase lattice, large version (top-5 paraphrases) and pruned version.]
Stephan Vogel - Machine Translation 10
Phrase Table as Prefix Tree
Stephan Vogel - Machine Translation 11
Phrase Table as Prefix Tree
[Prefix-tree figure over the source sentence "ja , okay dann Montag bei mir"; matching in the tree yields phrase translations such as "okay then" and "okay on Monday then".]
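Where a flat hash map over all n-grams becomes expensive (e.g. for lattice input), the phrase table can instead be stored as a prefix tree. Below is a minimal Python sketch of such a trie; the class and function names are illustrative only.

```python
# Sketch: phrase table as a prefix tree (trie). Matching walks the tree
# word by word, so all phrases starting at position j are found in one pass
# without hashing every n-gram separately.

class TrieNode:
    def __init__(self):
        self.children = {}       # source word -> TrieNode
        self.translations = []   # target phrases ending at this node

def insert(root, source_phrase, target):
    node = root
    for w in source_phrase:
        node = node.children.setdefault(w, TrieNode())
    node.translations.append(target)

def match_from(root, words, j):
    """Yield (end_position, target) for all phrases starting at position j."""
    node = root
    for k in range(j, len(words)):
        node = node.children.get(words[k])
        if node is None:
            break
        for t in node.translations:
            yield k + 1, t

# root = TrieNode()
# insert(root, ["okay", "dann"], "okay then")
# list(match_from(root, "ja , okay dann Montag bei mir".split(), 2))
# -> [(4, "okay then")]
```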
Stephan Vogel - Machine Translation 12
Building the Translation Lattice
Book-keeping: hypothesis h = (n, q0, hprev, e)
- n – node
- q0 – initial state in transducer
- hprev – previous hypothesis
- e – edge
Convert the sentence into a lattice structure
At each node n, insert an 'empty' hypothesis
h = (n, q0, hprev = nil, e = nil) as starting point for the phrase search from this position
Note: Previous hyp and edge are only needed for hierarchical transducers, to be able to ‘propagate’ partial translations
Stephan Vogel - Machine Translation 13
Algorithm for Building Translation Lattice
foreach node n = 0 to J
  create empty hypothesis h0 = (n, q0, NIL, NIL)
  Hyps( n ) = Hyps( n ) + h0
  foreach incoming edge e of node n
    w = WordAt( e )
    n_prev = FromNode( e )
    foreach hypothesis hprev = (s, q, hx, ex) in Hyps( n_prev )
      if transducer T has transition (q -> q' : w)
        if q' is emitting state
          foreach translation t emitted in q'
            create new edge e' = (s, n, t)      // add TM scores
        if q' is not final state
          create new hypothesis h' = (s, q', hprev, e)
          Hyps( n ) = Hyps( n ) + h'
Stephan Vogel - Machine Translation 14
Searching for Best Translation
We have constructed a graph
- Directed
- No cycles
- Each edge carries a partial translation (with scores)
Now we need to find the best path
- Adding additional information (DM, LM, ...)
- Allowing for some reordering
Stephan Vogel - Machine Translation 15
Monotone Search
Hypotheses describe partial translations: coverage information, translation, scores
Expand a hypothesis over its outgoing edges (a small sketch follows the example below)
[Lattice figure as before: edges "I come", "I will come", "tomorrow", "to you", "to your office" over "ich komme morgen zu dir".]
h: c=0..3, t=I will come tomorrow
h: c=0..4, t=I will come tomorrow to
h: c=0..4, t=I will come tomorrow zu
h: c=0..5, t=I will come tomorrow to your office
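For illustration, a minimal Python sketch of this monotone expansion step, assuming edges of the form (start, end, target, score) as in the earlier lookup sketch; names are hypothetical.

```python
# Sketch: monotone hypothesis expansion. A hypothesis covers the source
# prefix [0, end) and carries a partial translation; it is extended by
# every edge that starts exactly at `end`.

from collections import namedtuple

Hyp = namedtuple("Hyp", ["end", "translation", "score"])

def expand_monotone(hyp, edges):
    """Yield new hypotheses for all edges continuing at hyp.end."""
    for start, end, target, score in edges:
        if start == hyp.end:
            # accumulate the model score of the traversed edge
            yield Hyp(end, hyp.translation + [target], hyp.score + score)

# h = Hyp(end=3, translation=["I will come", "tomorrow"], score=-3.2)
# new_hyps = list(expand_monotone(h, edges))
```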
Stephan Vogel - Machine Translation 16
Reordering Strategies
1. All permutations
   - Any reordering possible
   - Complexity of the traveling salesman problem -> only possible for very short sentences
2. Small jumps ahead, filling the gaps pretty soon
   - Only local word reordering
   - Implemented in the STTK decoder
3. Leaving a small number of gaps, filled in at any time
   - Allows for global but limited reordering
   - Similar decoding complexity, exponential in the number of gaps
   - IBM-style reordering (described in an IBM patent)
4. Merging neighboring regions with swap, no gaps at all
   - Allows for global reordering
   - Complexity lower than 1, but higher than 2 and 3
Stephan Vogel - Machine Translation 17
IBM Style Reordering
Example: first word translated last!
[Figure: source positions covered step by step, leaving a gap, another gap, and a partially filled gap.]
Resulting reordering: 2 3 7 8 9 10 11 5 6 4 12 13 14 1
Stephan Vogel - Machine Translation 18
Sliding Window Reordering
Local reordering within sliding window of size 6
[Figure: source positions covered step by step; the reordering window (shown as brackets) slides forward from the first untranslated word, opening new gaps and filling old ones.]
Stephan Vogel - Machine Translation 19
Coverage Information
Need to know which source words have already been translated
- Don't want to miss some words
- Don't want to translate words twice
- Can compare hypotheses which cover the same words
Use a coverage vector to store this information
- For 'small jumps ahead': position of first gap plus short bit vector
- For 'small number of gaps': array of positions of uncovered words
- For 'merging neighboring regions': left and right position
Stephan Vogel - Machine Translation 20
Limited Distance Word Reordering
Word and phrase reordering within a given window
- From the first untranslated source word over the next k positions
- Window length 1: monotone decoding
- Restrict the total number of reorderings (typically 3 per 10 words)
Simple 'jump' model or lexicalized distortion model
Use a bit vector: 1001100… = words 1, 4, and 5 translated
For long sentences this means long bit vectors, but only limited reordering is allowed, therefore:
Coverage = (first untranslated word, bit vector), i.e. 111100110… -> (4, 00110…)
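A minimal Python sketch of this compact coverage representation (first untranslated position plus the remaining bit vector); the function name is hypothetical.

```python
# Sketch: compact coverage as (first untranslated position, remaining bits),
# matching the example 111100110... -> (4, 00110...).

def compact_coverage(bits):
    """bits[i] == 1 if source word i is already translated."""
    first_gap = 0
    while first_gap < len(bits) and bits[first_gap] == 1:
        first_gap += 1
    return first_gap, bits[first_gap:]

# compact_coverage([1, 1, 1, 1, 0, 0, 1, 1, 0])  # -> (4, [0, 0, 1, 1, 0])
```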
Stephan Vogel - Machine Translation 21
Jumping ahead in the Lattice
Hypotheses describe a partial translation: coverage information, translation, scores
Expand a hypothesis over uncovered positions (within the window)
[Lattice figure as before over "ich komme morgen zu dir".]
h: c=11000, t=I will come
h: c=11011, t=I will come to your office
h: c=11111, t=I will come to your office tomorrow
Stephan Vogel - Machine Translation 22
Hypothesis for Search
Organize search according to number of translated words c
It is expensive to expand the translation at every step
- Replace it by back-trace information
- Generate the full translation only for the best (or n-best) final hypotheses
Book-keeping: hypothesis h = (Q, C, q, i, hprev, e)
- Q – total cost (we also keep cumulative costs for the individual models)
- C – coverage information: positions already translated
- q – language model state, e.g. the last n-1 words for an n-gram LM
- i – number of target words
- hprev – pointer to previous hypothesis
- e – edge traversed to expand hprev into h
hprev and e are the back-trace information, used to reconstruct the full translation
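A minimal Python sketch of the hypothesis record and the trace back; the field and type names (Hyp, Edge) are illustrative, not the decoder's actual ones.

```python
# Sketch: hypothesis book-keeping with back-trace. Only the back-pointer
# and the traversed edge are stored; the full translation is rebuilt once,
# at the end, for the best (or n-best) final hypotheses.

from collections import namedtuple

Edge = namedtuple("Edge", ["start", "end", "target", "scores"])

# Q: total cost, C: coverage, q: LM state (last n-1 words),
# i: number of target words, h_prev: back-pointer, edge: traversed edge
Hyp = namedtuple("Hyp", ["Q", "C", "q", "i", "h_prev", "edge"])

def trace_back(hyp):
    """Rebuild the full translation by following the back-pointers."""
    phrases = []
    while hyp is not None and hyp.edge is not None:
        phrases.append(hyp.edge.target)
        hyp = hyp.h_prev
    return " ".join(reversed(phrases))
```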
Stephan Vogel - Machine Translation 23
Algorithm for Applying Language Model
for coverage c = 0 to J-1
  foreach h in Hyps( c )
    foreach node n within reordering window
      foreach outgoing edge e of n
        if no coverage collision between h.C and C(e)
          TMScore = -log p( t | s )        // typically several scores
          DMScore = -log p( jump )         // or lexicalized DM score
          // other scores like word count, phrase count, etc.
          foreach target word tk in t
            LMScore += -log p( tk | qk-1 )
            qk = qk-1 + tk                 // extend LM state with tk
          endfor
          Q' = k1*TMScore + k2*LMScore + k3*DMScore + ...
          h' = ( h.Q + Q', h.C ∪ C(e), q', h.i + |t|, h, e )
          Hyps( c' ) += h'
Stephan Vogel - Machine Translation 24
Algorithm for Applying LM cont.
// coverage is now J, i.e. sentence end reached
foreach h in Hyps( J )
  SLScore = -log p( h.i | J )          // sentence length model
  LMScore = -log p( </s> | h.q )       // end-of-sentence LM score
  q' = h.q + </s>
  Q' = a*LMScore + b*SLScore
  h' = ( h.Q + Q', h.C, q', h.i, h, e = nil )
  Hyps( J+1 ) += h'

Sort Hyps( J+1 ) according to total score Q
Trace back over the sequence of (h, e) to construct the actual translation
Stephan Vogel - Machine Translation 25
Sentence Length Model
Different languages have different levels of 'wordiness'
- A histogram of source sentence length vs. target sentence length shows that the distribution is rather flat -> p( J | I ) is not very helpful
Very simple sentence length model: the more, the better
- i.e. give a bonus for each word (not a probabilistic model)
- Balances the shortening effect of the LM
- Can be applied immediately, as the absolute length is not important
However: this is insensitive to what's in the sentence
- It optimizes the length of translations for the entire test set, not for each sentence
- Some sentences are made too long to compensate for sentences which are too short