
Page 1: Machine Translation Decoder for Phrase-Based SMT


Machine Translation

Decoder for Phrase-Based SMT

Stephan Vogel, Spring Semester 2011

Page 2: Machine Translation Decoder for Phrase-Based SMT


Decoder

Decoding issues

Two step decoding
Generation of translation lattice
Best path search with limited word reordering

Specific Issues (Next Session)
Recombination of hypotheses
Pruning
N-best list generation
Future cost estimation

Page 3: Machine Translation Decoder for Phrase-Based SMT


Decoding Issues

Decoder takes the source sentence and all available knowledge (translation model, distortion model, language model, etc.) and generates a target sentence

Many alternative translations are possible
Too many to explore them all -> pruning is necessary
Pruning leads to search errors

Decoder outputs the model-best translation
Ranking of hypotheses according to the model is different from the ranking according to an external metric
Bad translations get better model scores than good translations -> model errors

Models see only limited context
Different hypotheses become identical under the model -> hypothesis recombination

Page 4: Machine Translation Decoder for Phrase-Based SMT


Decoding Issues

Languages have different word order
Modeled by distortion models
Exploring all possible reorderings is too expensive (essentially O(J!))
Need to restrict reordering -> different reordering strategies

Optimizing the system
We use a bunch of models (features), need to optimize the scaling factors (feature weights)
Decoding is expensive
Optimize on an n-best list -> need to generate n-best lists

Page 5: Machine Translation Decoder for Phrase-Based SMT


Decoder: The Knowledge Sources

Translation models
Phrase translation table
Statistical lexicon and/or manual lexicon
Named entities
Translation information stored as transducers or extracted on the fly

Language model: standard n-gram LM

Distortion model: distance-based or lexicalized

Sentence length model
Typically simulated by a word-count feature

Other features
Phrase count
Number of untranslated words
…

Page 6: Machine Translation Decoder for Phrase-Based SMT


The Decoder: Two Level Approach

Build translation lattice
Run left to right over the test sentence
Search for matching phrases between the source sentence and the phrase table (and other translation tables)
For each translation, insert edges into the lattice

First-best search (or n-best search)
Run left to right over the lattice
Apply the n-gram language model
Combine translation model scores and language model score
Recombine and prune hypotheses
At sentence end: add the sentence length model score
Trace back the best hypothesis (or the n-best hypotheses)

Notice: this two-step view is convenient for describing the decoder
The implementation can interleave both processes
The implementation can make a difference due to pruning

Page 7: Machine Translation Decoder for Phrase-Based SMT


Building Translation Lattice

Sentence: ich komme morgen zu dir
Reference: I will come to you tomorrow

Search in corpus for phrases and their translations
Insert edges into the lattice

[Lattice figure: nodes 0, 1, 2, …, J over the source words "ich komme morgen zu dir"; edges carry the translations "I come", "I will come", "tomorrow", "to you", "to your office"]

Page 8: Machine Translation Decoder for Phrase-Based SMT


Phrase Table in Hash Map

Store the phrase table in a hash map (source phrase as key)
For each n-gram in the source sentence, access the hash map (see the Python sketch below)

foreach j = 1 to J-1                 // start position of phrase
  foreach l = 0 to lmax-1            // phrase length
    SourcePhrase = (w_j … w_{j+l})
    TargetPhrases = Hashmap.Get( SourcePhrase )
    foreach TargetPhrase t in TargetPhrases
      create new edge e' = (ν_{j-1}, ν_{j+l}, t)   // add TM scores

Works fine for sentence input, but too expensive for lattices
Lattices from speech recognizer
Paraphrases
Reordering as preprocessing step
Hierarchical transducers
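For plain sentence input, the lookup above can be sketched in Python roughly like this (a minimal sketch; the phrase table entries, costs, and function names are hypothetical, with costs read as negative log probabilities):

# Hash-map phrase lookup over all source spans (hypothetical sketch).
# edge = (start_node, end_node, target_phrase, tm_cost)

phrase_table = {
    ("ich", "komme"): [("I come", 1.2), ("I will come", 0.9)],
    ("morgen",):      [("tomorrow", 0.4)],
    ("zu", "dir"):    [("to you", 0.7), ("to your office", 2.1)],
}

def build_lattice(source_words, phrase_table, max_len=7):
    edges = []
    for j in range(len(source_words)):             # start position of phrase
        for l in range(1, max_len + 1):            # phrase length
            src = tuple(source_words[j:j + l])
            for target, tm_cost in phrase_table.get(src, []):
                edges.append((j, j + l, target, tm_cost))
    return edges

print(build_lattice("ich komme morgen zu dir".split(), phrase_table))
# [(0, 2, 'I come', 1.2), (0, 2, 'I will come', 0.9), (2, 3, 'tomorrow', 0.4),
#  (3, 5, 'to you', 0.7), (3, 5, 'to your office', 2.1)]

For lattice input (speech lattices, paraphrase lattices), every path would have to be hashed span by span, which is one reason to move to the prefix-tree organization of the following slides.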

Page 9: Machine Translation Decoder for Phrase-Based SMT


Example: Paraphrase Lattice

[Figures: a paraphrase lattice built from the top-5 paraphrases (large), and the same lattice after pruning]

Page 10: Machine Translation Decoder for Phrase-Based SMT


Phrase Table as Prefix Tree

Page 11: Machine Translation Decoder for Phrase-Based SMT


Phrase Table as Prefix Tree

[Figure: prefix-tree match over the source "ja , okay dann Montag bei mir", yielding the translations "okay then" and "okay on Monday then"]
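A prefix tree (trie) over source phrases lets the matcher extend a phrase one word at a time while scanning the input, instead of hashing every span separately. A minimal sketch using the example above (structure and costs are hypothetical, not the lecture's actual implementation):

# Minimal prefix-tree (trie) sketch for phrase lookup (hypothetical names).

def build_trie(phrase_table):
    root = {}
    for src, translations in phrase_table.items():
        node = root
        for word in src:
            node = node.setdefault(word, {})
        node["<translations>"] = translations
    return root

def match_phrases(trie, words, start):
    """Yield (end_position, translations) for all phrases starting at `start`."""
    node = trie
    for end in range(start, len(words)):
        node = node.get(words[end])
        if node is None:
            return
        if "<translations>" in node:
            yield end + 1, node["<translations>"]

trie = build_trie({("ja", ",", "okay", "dann"): [("okay then", 0.8)],
                   ("ja", ",", "okay", "dann", "Montag"): [("okay on Monday then", 1.1)]})
for end, translations in match_phrases(trie, "ja , okay dann Montag bei mir".split(), 0):
    print(end, translations)   # 4 [('okay then', 0.8)], then 5 [('okay on Monday then', 1.1)]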

Page 12: Machine Translation Decoder for Phrase-Based SMT


Building the Translation Lattice

Book-keeping: hypothesis h = (ν, q, h_prev, e)
ν – node
q – state in transducer (q_0 – initial state)
h_prev – previous hypothesis
e – edge

Convert sentence into lattice structure
At each node ν, insert an 'empty' hypothesis h = (ν, q_0, h_prev = nil, e = nil) as starting point for phrase search from this position

Note: previous hypothesis and edge are only needed for hierarchical transducers, to be able to 'propagate' partial translations

Page 13: Machine Translation Decoder for Phrase-Based SMT


Algorithm for Building Translation Lattice

foreach node ν = 0 to J
  create empty hypothesis h_0 = (ν, q_0, NIL, NIL)
  Hyps( ν ) = Hyps( ν ) + h_0
  foreach incoming edge e of ν
    w = WordAt( e )
    ν_prev = FromNode( e )
    foreach hypothesis h_prev = (ν_start, q_prev, h_x, e_x) in Hyps( ν_prev )
      if transducer T has transition (q_prev -> q' : w)
        if q' is emitting state
          foreach translation t emitted in q'
            create new edge e' = (ν_start, ν, t)    // add TM scores
        if q' is not final state
          create new hypothesis h' = (ν_start, q', h_prev, e)
          Hyps( ν ) = Hyps( ν ) + h'

Page 14: Machine Translation Decoder for Phrase-Based SMT


Searching for Best Translation

We have constructed a graph
Directed
No cycles
Each edge carries a partial translation (with scores)

Now we need to find the best path
Adding additional information (DM, LM, …)
Allowing for some reordering

Page 15: Machine Translation Decoder for Phrase-Based SMT


Monotone Search

Hypotheses describe partial translations
Coverage information, translation, scores

Expand hypothesis over outgoing edges

[Lattice figure as on page 7: "ich komme morgen zu dir" with edges "I come", "I will come", "tomorrow", "to you", "to your office"]

h: c=0..3, t=I will come tomorrow
h: c=0..4, t=I will come tomorrow to

h: c=0..4, t=I will come tomorrow zu

h: c=0..5, t=I will come tomorrow to your office
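A minimal sketch of this monotone expansion (hypothetical names; model scores are ignored here so that the growing coverage and translation mirror the hypotheses above):

# Monotone expansion sketch: a hypothesis covers the source prefix 0..end and
# is extended over every lattice edge that starts exactly at `end`.
# edge = (start, end, target_phrase, tm_cost)

from collections import defaultdict

edges = [(0, 2, "I come", 1.2), (0, 2, "I will come", 0.9),
         (2, 3, "tomorrow", 0.4),
         (3, 5, "to you", 0.7), (3, 5, "to your office", 2.1)]

def monotone_expand(edges, J):
    hyps = defaultdict(list)          # covered prefix length -> partial translations
    hyps[0].append("")
    for end in range(J):
        for partial in hyps[end]:
            for start, new_end, target, _ in edges:
                if start == end:
                    hyps[new_end].append((partial + " " + target).strip())
    return hyps[J]                    # hypotheses covering the whole sentence

print(monotone_expand(edges, 5))
# ['I come tomorrow to you', 'I come tomorrow to your office',
#  'I will come tomorrow to you', 'I will come tomorrow to your office']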

Page 16: Machine Translation Decoder for Phrase-Based SMT


Reordering Strategies

All permutations
Any reordering possible
Complexity of the traveling salesman problem -> only possible for very short sentences

Small jumps ahead – filling the gaps pretty soon
Only local word reordering
Implemented in the STTK decoder

Leaving a small number of gaps – fill in at any time
Allows for global but limited reordering
Similar decoding complexity – exponential in the number of gaps
IBM-style reordering (described in an IBM patent)

Merging neighboring regions with swap – no gaps at all
Allows for global reordering
Complexity lower than 1, but higher than 2 and 3

Page 17: Machine Translation Decoder for Phrase-Based SMT


IBM Style Reordering

Example: first word translated last!

[Figure: coverage of source positions 0–7 during decoding, showing a gap, another gap, and a partially filled gap]

Resulting reordering: 2 3 7 8 9 10 11 5 6 4 12 13 14 1

Page 18: Machine Translation Decoder for Phrase-Based SMT


Sliding Window Reordering

Local reordering within sliding window of size 6

[Figure: source positions 0–8 with a sliding window of size 6 moving left to right; gaps open, are partially filled, and a new gap appears as the window advances]

Page 19: Machine Translation Decoder for Phrase-Based SMT


Coverage Information

Need to know which source words have already been translated
Don't want to miss some words
Don't want to translate words twice
Can compare hypotheses which cover the same words

Use a coverage vector to store this information
For 'small jumps ahead': position of first gap plus short bit vector
For 'small number of gaps': array of positions of uncovered words
For 'merging neighboring regions': left and right position
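One common way to realize the coverage vector is an integer used as a bit vector; a minimal sketch (hypothetical helper names):

# Coverage as an integer bit vector: bit j set = source word j translated
# (hypothetical sketch).

def span_mask(start, end):
    """Bits for source positions start..end-1."""
    return ((1 << (end - start)) - 1) << start

def collides(coverage, start, end):
    """True if any word in the span is already translated."""
    return coverage & span_mask(start, end) != 0

def extend(coverage, start, end):
    return coverage | span_mask(start, end)

cov = extend(0, 0, 2)            # "ich komme" translated -> bits 0 and 1 set
print(collides(cov, 1, 3))       # True: position 1 is already covered
print(bin(extend(cov, 3, 5)))    # 0b11011: positions 0, 1, 3, 4 covered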

Page 20: Machine Translation Decoder for Phrase-Based SMT


Limited Distance Word Reordering

Word and phrase reordering within a given window
From the first untranslated source word over the next k positions
Window length 1: monotone decoding
Restrict the total number of reorderings (typically 3 per 10 words)

Simple 'jump' model or lexicalized distortion model

Use bit vector: 1001100… = words 1, 4, and 5 translated

For long sentences this gives long bit vectors, but only limited reordering is allowed, therefore:
Coverage = (first untranslated word, bit vector), i.e. 111100110… -> (4, 00110…)
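The compaction in the last bullet can be sketched as follows (hypothetical helper; bit strings are written left to right, position 0 first, as on the slide):

# Compact coverage: (index of first untranslated word, remaining bits).
# Since reordering is limited to a window starting at the first gap,
# only a short bit vector has to be stored per hypothesis.

def compact(bits):
    first_gap = bits.find("0")
    if first_gap == -1:
        first_gap = len(bits)
    return first_gap, bits[first_gap:]

print(compact("111100110"))   # -> (4, '00110'), matching the slide's example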

Page 21: Machine Translation Decoder for Phrase-Based SMT


Jumping ahead in the Lattice

Hypotheses describe a partial translation
Coverage information, translation, scores

Expand hypothesis over uncovered position (within window)

[Lattice figure as on page 7: "ich komme morgen zu dir" with edges "I come", "I will come", "tomorrow", "to you", "to your office"]

h: c=11000, t=I will come

h: c=11011, t=I will come to your office

h: c=11111, t=I will come to your office tomorrow

Page 22: Machine Translation Decoder for Phrase-Based SMT


Hypothesis for Search

Organize search according to number of translated words c

It is expensive to carry the expanded translation in every hypothesis
Replace it by back-trace information
Generate the full translation only for the best (n-best) final translations

Book-keeping: hypothesis h = (Q, C, λ, i, h_prev, e)
Q – total cost (we also keep the cumulative costs for the individual models)
C – coverage information: positions already translated
λ – language model state: e.g. the last n-1 words for an n-gram LM
i – number of target words
h_prev – pointer to previous hypothesis
e – edge traversed to expand h_prev into h

h_prev and e are the back-trace information, used to reconstruct the full translation
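A minimal Python rendering of this record and of the trace-back it enables (hypothetical names; edges are assumed to carry the target phrase as their third element, as in the earlier sketches):

# Book-keeping record for a search hypothesis (hypothetical sketch).
from collections import namedtuple

Hypothesis = namedtuple("Hypothesis", [
    "Q",         # total cost so far
    "C",         # coverage: source positions already translated
    "lm_state",  # last n-1 target words for the n-gram LM
    "i",         # number of target words produced so far
    "h_prev",    # pointer to the previous hypothesis
    "edge",      # lattice edge traversed to expand h_prev into this hypothesis
])

def trace_back(h):
    """Reconstruct the full translation by following the back-pointers."""
    phrases = []
    while h is not None and h.edge is not None:
        phrases.append(h.edge[2])    # target phrase stored on the edge
        h = h.h_prev
    return " ".join(reversed(phrases))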

Page 23: Machine Translation Decoder for Phrase-Based SMT


Algorithm for Applying Language Model

for coverage c = 0 to J-1
  foreach h in Hyps( c )
    foreach node ν within reordering window
      foreach outgoing edge e of ν
        if no coverage collision between h.C and C( e )
          TMScore = -log p( t | s )          // typically several scores
          DMScore = -log p( jump )           // or lexicalized DM score
          // other scores like word count, phrase count, etc.
          foreach target word t_k in t
            LMScore += -log p( t_k | λ_{k-1} )
            λ_k = λ_{k-1} · t_k              // append t_k to the LM history
          endfor
          Q' = k1*TMScore + k2*LMScore + k3*DMScore + …
          h' = ( h.Q + Q', h.C ∪ C( e ), λ', h.i + |t|, h, e )
          Hyps( c' ) += h'
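The weighted combination in the loop above, sketched for a single expansion (hypothetical names and weights; lm_cost is only a stand-in for a real n-gram LM query returning -log p(word | state)):

# Score combination for one hypothesis expansion (hypothetical sketch).

def lm_cost(word, state):
    return 1.0                                   # placeholder for -log p(word | state)

def expansion_cost(tm_cost, jump, target_words, lm_state, weights, n=3):
    lm_total = 0.0
    for word in target_words:                    # apply the LM word by word
        lm_total += lm_cost(word, lm_state)
        lm_state = (lm_state + (word,))[-(n - 1):]   # keep only the last n-1 words
    dm_cost = abs(jump)                          # simple distance-based jump cost
    q = (weights["tm"] * tm_cost + weights["lm"] * lm_total
         + weights["dm"] * dm_cost)
    return q, lm_state

q, state = expansion_cost(0.9, 0, ("I", "will", "come"), ("<s>",),
                          {"tm": 1.0, "lm": 0.5, "dm": 0.2})
print(q, state)    # 0.9 + 0.5*3.0 + 0.0 = 2.4, new LM state ('will', 'come')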

Page 24: Machine Translation Decoder for Phrase-Based SMT


Algorithm for Applying LM cont.

// coverage is now J, i.e. sentence end reached
foreach h in Hyps( J )
  SLScore = -log p( h.i | J )          // sentence length model
  LMScore = -log p( </s> | λ_h )       // end-of-sentence LM score
  λ' = λ_h · </s>
  Q' = a*LMScore + b*SLScore
  h' = ( h.Q + Q', h.C, λ', h.i, h, nil )
  Hyps( J+1 ) += h'

Sort Hyps( J+1 ) according to total score Q

Trace back over the sequence of (h, e) to construct the actual translation
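A small sketch of this sentence-end step (hypothetical names and numbers; eos_lm_cost and sl_cost stand in for the end-of-sentence LM score and the sentence length model):

# Sentence-end step: add end-of-sentence LM cost and sentence length cost,
# then sort the finished hypotheses by total cost (hypothetical sketch).

def finish(final_hyps, eos_lm_cost, sl_cost, a=1.0, b=1.0):
    finished = [(q + a * eos_lm_cost(lm_state) + b * sl_cost(i), hyp)
                for q, lm_state, i, hyp in final_hyps]   # hyps covering all J words
    finished.sort(key=lambda x: x[0])                    # model-best first
    return finished

best = finish([(2.5, ("your", "office"), 6, "I will come tomorrow to your office")],
              eos_lm_cost=lambda state: 0.25,
              sl_cost=lambda i: 0.25)
print(best[0])   # (3.0, 'I will come tomorrow to your office')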

Page 25: Machine Translation Decoder for Phrase-Based SMT


Sentence Length Model

Different languages have different levels of 'wordiness'
A histogram of source sentence length vs. target sentence length shows that the distribution is rather flat
-> p( J | I ) is not very helpful

Very simple sentence length model: the more – the better
i.e. give a bonus for each word (not a probabilistic model)
Balances the shortening effect of the LM
Can be applied immediately, as the absolute length is not important

However: this is insensitive to what's in the sentence
It optimizes the length of the translations for the entire test set, not for each sentence
Some sentences are made too long to compensate for sentences which are too short
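With costs as negative log probabilities, this word bonus is simply a negative weight on the word-count feature; a tiny worked example with hypothetical numbers:

# Word-count feature as a per-word bonus (hypothetical weights): a negative
# weight per target word lowers the cost of longer translations, which
# counteracts the shortening pressure of the language model.

def total_cost(other_costs, n_target_words, w_wordcount=-0.25):
    return sum(other_costs) + w_wordcount * n_target_words

print(total_cost([2.5], 4))   # 2.5 - 1.0 = 1.5   (longer hypothesis)
print(total_cost([2.25], 2))  # 2.25 - 0.5 = 1.75 (shorter hypothesis, now worse)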