
Page 1: Machine Translation

Decoder for Phrase-Based SMT

Stephan Vogel, Spring Semester 2011

Page 2: Decoder

- Decoding issues (previous session)
- Two-step decoding
  - Generation of the translation lattice
  - Best path search
  - With limited word reordering
- Specific issues
  - Recombination of hypotheses
  - Pruning
  - N-best list generation
  - Future cost estimation

Page 3: Recombination of Hypotheses

- Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
- Notice: this depends on the models
  - Model score depends on the current partial translation and the extension, e.g. LM
  - Model score depends on global features known only at the sentence end, e.g. sentence length model
- The models define equivalence classes for the hypotheses
- Expand only the best hypothesis in each equivalence class

Page 4: Recombination of Hypotheses: Example

- n-gram LM; hypotheses:
  H1: I would like to go
  H2: I would not like to go
- Assume as possible expansions: to the movies | to the cinema | and watch a film
- The LM score is identical for H1 + expansion and for H2 + expansion for bi-, tri-, and four-gram LMs
  E.g. the 3-gram LM score of expansion 1 is:
  - log p( to | to go ) - log p( the | go to ) - log p( movies | to the )
- Therefore: Cost(H1) < Cost(H2) => Cost(H1+E) < Cost(H2+E) for all possible expansions E
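As an aside, a minimal sketch of why this holds: with an n-gram LM, the cost added by an expansion depends only on the last n-1 words of the hypothesis (its LM state), so hypotheses that share this state receive identical expansion costs. The scoring function and the toy log-probabilities below are illustrative and not taken from the original slides.

# Minimal sketch (not the course decoder): the n-gram cost added by an
# expansion depends only on the last n-1 words of the hypothesis, so two
# hypotheses with the same LM state receive identical expansion costs.
# The log-probability values below are made up for illustration.

def expansion_lm_cost(history, expansion, lm, n=3):
    """Sum of -log p(w | previous n-1 words) over the words of the expansion."""
    cost = 0.0
    context = list(history[-(n - 1):])             # only the LM state matters
    for w in expansion:
        cost += lm.get((tuple(context), w), 10.0)  # default cost for unseen n-grams
        context = (context + [w])[-(n - 1):]
    return cost

# Toy trigram table: -log p(word | two-word context), illustration only.
lm = {(("to", "go"), "to"): 1.2, (("go", "to"), "the"): 0.8, (("to", "the"), "movies"): 2.1}

h1 = "I would like to go".split()
h2 = "I would not like to go".split()
exp = "to the movies".split()

# Same LM state ("to", "go") -> same expansion cost for both hypotheses.
assert expansion_lm_cost(h1, exp, lm) == expansion_lm_cost(h2, exp, lm)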

Page 5: Recombination of Hypotheses: Example 2

- Sentence length model p( I | J ); hypotheses:
  H1: I would like to go
  H2: I would not like to go
- Assume as possible expansions: to the movies | to the cinema | and watch a film
- Length( H1 ) = 5, Length( H2 ) = 6
- For identical expansions the lengths will remain different
- Situation at the sentence end:
  - Possible that -log P( len( H1 + E ) | J ) > -log P( len( H2 + E ) | J )
  - Then possible that TotalCost( H1 + E ) > TotalCost( H2 + E )
  - I.e. reranking of hypotheses
  - Therefore: cannot recombine H2 into H1

Page 6: Recombination: Keep ‘em around

- Expand only the best hyp
- Store pointers to the recombined hyps for n-best list generation

[Figure: best hypotheses (hb) with recombined hypotheses (hr) attached via pointers; vertical axis: better score, horizontal axis: increasing coverage]

Page 7: Recombination of Hypotheses

- Typical features for recombination of partial hypotheses:
  - LM history
  - Positions of covered source words – some translations are more expensive
  - Number of generated words on the target side – for the sentence length model
- Often only the number of covered source words is considered, rather than the actual positions
  - Fits with the typical organization of the decoder: hyps are stored according to the number of covered source words
  - Hyps are recombined which are not strictly comparable
  - Use the future cost estimate to lessen its impact
- Overall: trade-off between speed and ‘correctness’ of the search
  - Ideally: only compare (and recombine) hyps if all models used in the search see them as equivalent
  - Realistically: use fewer, coarser equivalence classes by ‘forgetting’ some of the models (they still add to the scores); see the sketch below
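A minimal sketch (Python, not the decoder described in the course) of recombination keyed on such an equivalence class, assuming each hypothesis carries a coverage set, an LM history, a target length and a cost; all field and function names are illustrative.

# Sketch of recombination, assuming each hypothesis carries a coverage set,
# an LM history, a target length and a cost (field names are illustrative).
from dataclasses import dataclass, field

@dataclass
class Hyp:
    coverage: frozenset      # covered source positions
    lm_state: tuple          # last n-1 target words
    length: int              # generated target words (for the length model)
    cost: float
    recombined: list = field(default_factory=list)  # losers, kept for n-best

def recombine(hyps, key=lambda h: (h.coverage, h.lm_state, h.length)):
    """Keep only the best hypothesis per equivalence class; attach the rest."""
    best = {}
    for h in hyps:
        k = key(h)
        if k not in best or h.cost < best[k].cost:
            if k in best:
                h.recombined.append(best[k])   # old best becomes a recombined hyp
            best[k] = h
        else:
            best[k].recombined.append(h)
    return list(best.values())

# A coarser key, e.g. only the number of covered words plus the LM state,
# recombines more aggressively but risks the search errors discussed later:
coarse_key = lambda h: (len(h.coverage), h.lm_state)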

Page 8: Pruning

- Even after recombination there are too many hyps
- Remove bad hyps and keep only the best ones
- In recombination we compared hyps which are equivalent under the models
- Now we need to compare hyps which are not strictly equivalent under the models
  - We risk removing hyps which would have won the race in the long run
  - I.e. we introduce errors into the search
- Search errors – model errors
  - Model errors: our models give a higher probability to a worse translation
  - Search errors: our decoder loses translations with a higher probability

Page 9: Pruning: Which Hyps to Compare?

- Which hyps are we comparing? How many should we keep?

[Figure: illustration of the hypothesis sets involved in recombination vs. pruning]

Page 10: Pruning: Which Hyps to Compare?

- Coarser equivalence relation => need to drop at least one of the models, or replace it by a simpler model
  - Recombination according to translated positions and LM state; pruning according to number of translated positions and LM state
  - Recombination according to number of translated positions and LM state; pruning according to number of translated positions OR LM state
  - Recombination with 5-gram LM; pruning with 3-gram LM
- Question: which is the more important feature?
  - Which leads to more search errors?
  - How much loss in translation quality?
  - Quality is more important than speed in most applications!
- Not one correct answer – depends on the other components of the system
- Ideally, the decoder allows for different recombination and pruning settings

Page 11: How Many Hyps to Keep?

Beam search: keep hyp h if Cost(h) < Cost(hbest) + const

[Figure: cost vs. number of translated words; hyps inside the beam are kept, bad hyps are pruned. Where the models separate the alternatives a lot, few hyps are kept; where they do not separate them, many hyps are kept.]

Page 12: Additive Beam

Is additive constant (in log domain) the right thing to do?

Hyps may spread more and more

[Figure: cost vs. number of translated words; as the hyps spread, fewer and fewer hyps remain inside the additive beam.]

Page 13: Multiplicative Beam

Beam search: keep hyp h if Cost(h) < Cost(hbest) * const

[Figure: cost vs. number of translated words; the multiplicative beam opens up and covers more hyps as costs grow.]

Page 14: Pruning and Optimization

- Each feature has a feature weight
- Optimization by adjusting the feature weights can result in compressing or spreading the scores
- This actually happened in our first MERT implementation:
  higher and higher feature weights => hyps spreading further and further apart => fewer hyps inside the beam => lower and lower Bleu score
- Two-pronged repair:
  - Normalizing the feature weights
  - Not proper beam pruning, but restricting the number of hyps

Page 15: How Many Hyps to Keep?

- Keep the n-best hyps
- Does not use the information from the models to decide how many hyps to keep (see the sketch below)

[Figure: cost vs. number of translated words; a constant number of hyps is kept at each step, the bad hyps are pruned.]
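A minimal sketch of how beam pruning and a hard cap on the number of hyps can be combined, assuming hypotheses expose a cost attribute; the threshold values are placeholders, not tuned settings from the course.

# Sketch of combined pruning: an additive beam relative to the best hyp plus a
# hard limit on the number of hyps (histogram pruning). The hyps are assumed
# to have a .cost attribute; beam and max_hyps are illustrative defaults.

def prune(hyps, beam=5.0, max_hyps=1000):
    if not hyps:
        return hyps
    hyps = sorted(hyps, key=lambda h: h.cost)
    best = hyps[0].cost
    inside_beam = [h for h in hyps if h.cost < best + beam]
    return inside_beam[:max_hyps]    # histogram pruning on top of the beam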

Page 16: Efficiency

- Two problems:
  - Sorting
  - Generating lots of hyps which are pruned (what a waste of time)
- Can we avoid generating hyps which would most likely be pruned?

Page 17: Efficiency

- Assumptions:
  - We want to generate hyps which cover n positions
  - All hyp sets Hk, k < n, are sorted according to total score
  - All phrase pairs (edges in the translation lattice) which can be used to expand a hyp h in Hk to cover n positions are sorted according to their score (weighted sum of the individual scores)

[Figure: sorted hyps h1..h5 combined with sorted phrases p1..p4 yield new hyps (h1p1, h1p2, h2p3, h4p2, h1p3, h3p2, ...) which come out roughly sorted; poor combinations such as h2p1 are pruned.]

Page 18: Naïve Way

Naïve way:

foreach hyp h
  foreach phrase pair p
    newhyp = h ∘ p
    Cost(newhyp) = Cost(h) + Cost(p) + Cost_LM + Cost_DM + …

This generates many hyps which will be pruned.

Page 19: Early Termination

If Cost(newhyp) = Cost(h) + Cost(p) it would be easy:

besthyp = h1 ∘ p1
loop: h = next hyp
  loop: p = next p
    newhyp = h ∘ p
    Cost(newhyp) = Cost(h) + Cost(p)
  until Cost(newhyp) > Cost(besthyp) + const
until Cost(newhyp) > Cost(besthyp) + const

- That's for proper beam pruning; it would still generate too many hyps for the max-number-of-hyps strategy
- In addition, we have the LM and DM costs, etc.
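A small runnable sketch of this early-termination idea, assuming the hyps and phrase pairs are pre-sorted by cost and that, for the moment, the combined cost is just Cost(h) + Cost(p) (ignoring LM and DM, as noted above); the names and the beam parameter are illustrative.

# Sketch of early termination over sorted hyps and phrase pairs, assuming the
# combination cost is simply the sum of the two costs so loops can stop early.

def expand_with_early_termination(hyps, phrases, beam):
    """hyps, phrases: lists of (cost, label) pairs sorted by cost."""
    best = hyps[0][0] + phrases[0][0]            # cost of h1 ∘ p1
    new_hyps = []
    for h_cost, h in hyps:
        if h_cost + phrases[0][0] > best + beam: # even the cheapest phrase is too bad
            break
        for p_cost, p in phrases:
            cost = h_cost + p_cost
            if cost > best + beam:               # later phrases only get worse
                break
            new_hyps.append((cost, (h, p)))
    return new_hyps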

Page 20: ‘Cube’ Pruning

- Always expand the best hyp (see the sketch below), until
  - there are no hyps within the beam anymore, or
  - the max number of hyps is reached

[Figure: grid of hyps h1..h4 (rows, sorted by cost) and phrase pairs p1..p3 (columns, sorted by cost); each cell holds the combined cost, and cells are expanded in order of increasing cost.]

Page 21: Effect of Recombination and Pruning

Average number of expanded hypotheses and NIST scores for different recombination (R) and pruning (P) combinations and different beam sizes (= number of hyps). Test set: Arabic DevTest (203 sentences).

                        Beam width
R : P               1        2        5        10       20

Av. hyps expanded
C  : c              825      899      1,132    1,492    1,801
CL : c              1,174    1,857    6,213    30,293   214,402
CL : C              2,792    4,248    12,921   53,228   287,278

NIST
C  : c              8.18     8.81     8.21     8.22     8.27
CL : c              8.41     8.62     8.88     8.95     8.96
CL : C              8.47     8.68     8.85     8.98     8.98

c = number of translated words, C = coverage vector (i.e. positions), L = LM history. NIST scores: higher is better.

Page 22: Number of Hypotheses versus NIST

- Language model state is required as a recombination feature
- More hypotheses – better quality
- Different ways to achieve similar translation quality
- CL : C generates more ‘useless’ hypotheses (the number of bad hyps grows faster than the number of good hyps)

[Figure: NIST score (about 8.0 to 9.2) vs. number of expanded hypotheses (100 to 1,000,000, log scale) for the settings C : c, CL : c and CL : C]

Page 23: N-Best List Generation

- Benefit:
  - Required for optimizing the model scaling factors
  - Rescoring with richer models
  - For down-stream processing
    - Translation with a pivot language: L1 -> L2 -> L3
    - Information extraction
    - …
- We have n-best translations at the sentence end
- But: hypotheses are recombined -> many good translations don't reach the sentence end
- Recover those translations

Page 24: Storing Multiple Backpointers

When recombining hypotheses, store them with the best (i.e. surviving) hypothesis, but don’t expand them

[Figure: best hypotheses (hb) with the recombined hypotheses (hr) stored alongside them, not expanded]

Page 25: Calculating True Score

- Propagate the final score backwards
  - For the best hypothesis we have the correct final score Qf(hb)
  - For a recombined hypothesis we know its current score Qc(hr) and the difference to the current score Qc(hb) of the best hypothesis
  - The final score of the recombined hypothesis is then:
    Q(hr) = Qf(hb) + ( Qc(hr) - Qc(hb) )
- Use B = (Q, h, B') to store sequences of hypotheses which make up a translation (see the sketch below)
  - Start with the n-best final hypotheses
  - For each of the top n Bs, go to the predecessor hypothesis and to the recombined hypotheses of the predecessor hypothesis
  - Store the Bs according to coverage
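A simplified sketch of this backtracking, assuming each hypothesis stores its predecessor, the target phrase added by its last expansion, its partial cost Qc, and the list of hyps recombined into it (field names are illustrative). The score adjustment follows the formula above; a real implementation would keep only the top n candidates at each step rather than enumerating everything.

# Sketch of recovering alternatives from recombined hypotheses. The final
# score of a recombined hyp follows the slide: Q(hr) = Qf(hb) + (Qc(hr) - Qc(hb)).

def backtrack(hyp, final_score):
    """Yield (score, target words) for hyp and for the hyps recombined into it."""
    if hyp.predecessor is None:                  # initial empty hypothesis
        yield final_score, []
        return
    for alt in [hyp] + hyp.recombined:           # surviving hyp plus recombined ones
        alt_score = final_score + (alt.cost - hyp.cost)
        for score, words in backtrack(alt.predecessor, alt_score):
            yield score, words + alt.phrase

def nbest(final_hyps, n):
    seen, out = set(), []
    for h in sorted(final_hyps, key=lambda h: h.cost):
        for score, words in backtrack(h, h.cost):
            s = " ".join(words)
            if s not in seen:                    # drop duplicate surface strings
                seen.add(s)
                out.append((score, s))
    return sorted(out, key=lambda x: x[0])[:n]

The surface-string hashing in nbest() also addresses the duplicate problem discussed on the next slide.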

Page 26: Problem with N-Best Generation

- Duplicates when using phrases:
  US # companies # and # other # institutions
  US companies # and # other # institutions
  US # companies and # other # institutions
  US # companies # and other # institutions
  . . .
- Example run: 1000-best -> ~400 different strings on average
- Extreme case: only 10 different strings
- Possible solution: checking uniqueness during backtracking, i.e. creating and hashing partial translations

Page 27: Rest-Cost Estimation

- In pruning we compare hyps which are not strictly equivalent under the models
  - Risk: prefer hypotheses which have covered the easy parts
  - Remedy: estimate the remaining cost for each hypothesis; compare hypotheses based on ActualCost + FutureCost
- Want to know the minimum expected cost (similar to A* search)
  - Gives a bound for pruning
  - However, not possible with acceptable effort for all models
- Want to include as many models as possible
  - Translation model costs, word count, phrase count
  - Language model costs
  - Distortion model costs
- Calculate the expected cost R(l, r) for each span (l, r)

Page 28: Rest Cost for Translation Models

- Translation model, word count and phrase count features are ‘local’ costs
  - Depend only on the current phrase pair
  - Strictly additive: R(l, m) + R(m, r) = R(l, r)
- Minimize over alternative translations
  - For each source phrase span (l, r): initialize with the cost of the best translation
  - Combine adjacent spans, take the best combination (see the sketch below)

Initialization: R0_TM(l, r) = min_t { Cost_TM( t(l, r) ) }
Recursion:      R_TM(l, r) = min { R0_TM(l, r), min_m [ R_TM(l, m) + R_TM(m, r) ] }
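A small sketch of this dynamic program, assuming phrase_costs maps a source span (l, r) (half-open, 0-based) to the cost of its best translation under the weighted TM, word-count and phrase-count features; the mapping and the default cost for spans without any phrase are assumptions of the sketch.

# Sketch of the span-based rest-cost computation over one source sentence.

def tm_rest_costs(phrase_costs, sentence_length, no_translation=1e9):
    # Initialization: cost of the best single-phrase translation per span.
    R = {(l, r): phrase_costs.get((l, r), no_translation)
         for l in range(sentence_length) for r in range(l + 1, sentence_length + 1)}
    # Recursion over span lengths: combine adjacent spans, keep the best split.
    for length in range(2, sentence_length + 1):
        for l in range(sentence_length - length + 1):
            r = l + length
            for m in range(l + 1, r):
                R[(l, r)] = min(R[(l, r)], R[(l, m)] + R[(m, r)])
    return R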

Page 29: Rest Cost for Language Models

- We do not have the history -> only an approximation
- For each span (l, r) calculate the LM score without history
- Combine the LM scores of adjacent spans
- Notice: p(e1 … em) * p(em+1 … en) != p(e1 … en) beyond a 1-gram LM, since
  p(e1 … en) = p(e1) * p(e2 | e1) * … * p(en | e1 … en-1)

Initialization: R0_LM(l, r) = min_e { -log p_LM( e(l, r) ) }
Recursion:      R_LM(l, r) = min { R0_LM(l, r), min_m [ R_LM(l, m) + R_LM(m, r) ] }

- Alternative: fast monotone decoding with the TM-best translations
  - The history is then available
  - Then R(l, r) = R(1, r) - R(1, l)

Page 30: Rest Cost for Distance-Based DM

- Distance-based DM: the rest cost depends on the coverage pattern
- Too many different coverage patterns, cannot pre-calculate
- Estimate by jumping to the first gap, then filling the gaps in sequence
- Moore & Quirk 2007: DM cost plus rest cost, with S = current phrase, S' = previous phrase, S'' = gap-free initial segment:
  - S adjacent to S'': d = 0
  - S left of S': d = 2 L(S)
  - S' subsequence of S'': d = 2 ( D(S, S'') + L(S) )
  - Otherwise: d = 2 ( D(S, S') + L(S) )
- L(.) = length of phrase, D(.,.) = distance between phrases
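A rough sketch of one possible reading of these four cases, assuming phrases are represented as half-open source spans (start, end), with S the current phrase, S_prev the previous phrase and S_seg the gap-free initial segment; the exact conditions are those of Moore & Quirk 2007, and this illustration may simplify them.

# Illustration of the case distinction above, not a reference implementation.

def length(span):
    return span[1] - span[0]

def distance(a, b):
    """Gap between two spans (0 if they touch or overlap)."""
    return max(a[0] - b[1], b[0] - a[1], 0)

def dm_rest_cost(S, S_prev, S_seg):
    if S[0] == S_seg[1]:                                   # S adjacent to S''
        return 0
    if S[1] <= S_prev[0]:                                  # S left of S'
        return 2 * length(S)
    if S_seg[0] <= S_prev[0] and S_prev[1] <= S_seg[1]:    # S' inside S''
        return 2 * (distance(S, S_seg) + length(S))
    return 2 * (distance(S, S_prev) + length(S))           # otherwise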

Page 31: Rest Cost for Lexicalized DM

- Lexicalized DM per phrase pair (f, e) = (f, t(f))
- DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
- Treat as a local cost for each span (l, r)
- Minimize over alternative translations and the different orientations in-* and out-*

Initialization: R0_DM(l, r) = min_t { min_{o in in-*} Cost_DM( f(l, r), t, o ) + min_{o in out-*} Cost_DM( f(l, r), t, o ) }
Recursion:      R_DM(l, r) = min { R0_DM(l, r), min_m [ R_DM(l, m) + R_DM(m, r) ] }

Page 32: Effect of Rest-Cost Estimation

- Results from Richard Zens 2008 (we did not describe the ‘per position’ variant)
- LM rest cost is important, DM rest cost is important

Page 33: Summary

- Different translation strategies – related to word reordering
- Two-level decoding strategy (one possible way to do it)
  - Generating the translation lattice: contains all word and phrase translations
  - Finding the best path
- Word reordering as an extension to the best path search
  - Jump ahead in the lattice, fill in the gap later
  - Short reordering window: decoding time exponential in the size of the window
- Recombination of hypotheses
  - If the models cannot re-rank hypotheses, keep only the best
  - Depends on the models used

Page 34: Summary

- Pruning of hypotheses
  - Beam pruning
  - Problem with too few hyps in the beam (e.g. when running MERT)
  - Keeping a maximum number of hyps
- Efficiency of the implementation
  - Try to avoid generating hyps which are pruned
  - Cube pruning
- N-best list generation
  - Needed for MERT
  - Spurious ambiguity