Stephan Vogel - Machine Translation 1
Machine Translation
Decoder for Phrase-Based SMT
Stephan Vogel, Spring Semester 2011
Stephan Vogel - Machine Translation 2
Decoder
Decoding issues (previous session): two-step decoding
- Generation of the translation lattice
- Best-path search, with limited word reordering
Specific issues (this session):
- Recombination of hypotheses
- Pruning
- N-best list generation
- Future cost estimation
Stephan Vogel - Machine Translation 3
Recombination of Hypotheses
Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking.
Notice: this depends on the models
- Model score depends on the current partial translation and the extension, e.g. LM
- Model score depends on global features known only at the sentence end, e.g. sentence length model
The models define equivalence classes for the hypotheses
- Expand only the best hypothesis in each equivalence class
Stephan Vogel - Machine Translation 4
Recombination of Hypotheses: Example
n-gram LM. Hypotheses:
H1: I would like to go
H2: I would not like to go
Assume as possible expansions: to the movies | to the cinema | and watch a film
The LM score is identical for H1+Expansion and for H2+Expansion under bigram, trigram, and 4-gram LMs.
E.g. the 3-gram LM score of expansion 1 is:
    -log p( to | to go ) - log p( the | go to ) - log p( movies | to the )
Therefore: Cost(H1) < Cost(H2) => Cost(H1+E) < Cost(H2+E) for all possible expansions E
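A minimal sketch of this argument (the record fields and costs are illustrative, not the decoder's actual data structures): two hypotheses with the same LM state shift by the same amount under every expansion, so only the cheaper one needs to be expanded.

    from collections import namedtuple

    # Illustrative hypothesis record (assumed field names).
    Hypothesis = namedtuple("Hypothesis", ["words", "cost"])

    def lm_state(hyp, n=3):
        # Under an n-gram LM only the last n-1 target words matter
        # for any future expansion, so they define the equivalence class.
        return tuple(hyp.words[-(n - 1):])

    h1 = Hypothesis("I would like to go".split(), 4.2)
    h2 = Hypothesis("I would not like to go".split(), 5.1)

    if lm_state(h1) == lm_state(h2):                  # both end in ("to", "go")
        survivor = min(h1, h2, key=lambda h: h.cost)  # keep H1, recombine H2

A real decoder would also require identical coverage before recombining; slide 8 lists the full feature set.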
Stephan Vogel - Machine Translation 5
Recombination of Hypotheses: Example 2
Sentence length model p( I | J ). Hypotheses:
H1: I would like to go
H2: I would not like to go
Assume as possible expansions: to the movies | to the cinema | and watch a film
Length( H1 ) = 5, Length( H2 ) = 6. For identical expansions the lengths remain different.
Situation at sentence end:
- Possible that -log P( len( H1 + E ) | J ) > -log P( len( H2 + E ) | J )
- Then possible that TotalCost( H1 + E ) > TotalCost( H2 + E ), i.e. a re-ranking of the hypotheses
- Therefore: cannot recombine H2 into H1
Stephan Vogel - Machine Translation 7
Recombination: Keep ‘em around
- Expand only the best hyp
- Store pointers to the recombined hyps for n-best list generation
[Figure: search graph over increasing coverage (x-axis) and better score (y-axis); the best hypotheses hb are expanded, while recombined hypotheses hr are kept via backpointers to the surviving hb]
Stephan Vogel - Machine Translation 8
Recombination of Hypotheses
Typical features for recombination of partial hypotheses:
- LM history
- Positions of covered source words - some translations are more expensive
- Number of generated words on the target side - for the sentence length model
Often only the number of covered source words is considered, rather than the actual positions
- Fits the typical organization of a decoder: hyps are stored according to the number of covered source words
- Hyps are then recombined which are not strictly comparable
- Use the future cost estimate to lessen the impact
Overall: a trade-off between speed and 'correctness' of the search
- Ideally: only compare (and recombine) hyps if all models used in the search see them as equivalent
- Realistically: use fewer, coarser equivalence classes by 'forgetting' some of the models (they still add to the scores); see the sketch below
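As a sketch, the equivalence class can be a hashable signature; which fields go into it controls the speed/correctness trade-off above (the record fields are assumptions for illustration):

    from collections import namedtuple

    # Illustrative hypothesis record: covered source positions, target words, cost.
    Hyp = namedtuple("Hyp", ["coverage", "words", "cost"])

    def signature(hyp, lm_order=3, exact_coverage=True):
        # Coarser signatures (len(coverage) instead of the exact positions)
        # recombine more aggressively but risk more search errors.
        cov = frozenset(hyp.coverage) if exact_coverage else len(hyp.coverage)
        return (cov, tuple(hyp.words[-(lm_order - 1):]), len(hyp.words))

    def recombine(hyps):
        # Keep only the cheapest hypothesis per equivalence class.
        table = {}
        for h in hyps:
            key = signature(h)
            if key not in table or h.cost < table[key].cost:
                table[key] = h
        return list(table.values())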
Stephan Vogel - Machine Translation 11
Pruning
Pruning:
- Even after recombination there are too many hyps
- Remove bad hyps and keep only the best ones
In recombination we compared hyps which are equivalent under the models
Now we compare hyps which are not strictly equivalent under the models
- We risk removing hyps which would have won the race in the long run
- I.e. we introduce errors into the search
Search errors vs. model errors
- Model errors: our models give higher probability to a worse translation
- Search errors: our decoder loses translations with higher probability
Stephan Vogel - Machine Translation 12
Pruning: Which Hyps to Compare?
Which hyps are we comparing? How many should we keep?
[Figure: hypothesis stacks contrasted under recombination and pruning]
Stephan Vogel - Machine Translation 13
Pruning: Which Hyps to Compare?
Coarser equivalence relation => need to drop at least one of the models, or replace it by a simpler model:
- Recombination according to translated positions and LM state; pruning according to number of translated positions and LM state
- Recombination according to number of translated positions and LM state; pruning according to number of translated positions OR LM state
- Recombination with a 5-gram LM; pruning with a 3-gram LM
Question: which is the more important feature? Which leads to more search errors? How much loss in translation quality? (Quality is more important than speed in most applications!)
There is not one correct answer - it depends on the other components of the system
- Ideally, the decoder allows for different recombination and pruning settings
Stephan Vogel - Machine Translation 14
How Many Hyps to Keep?
Beam search: keep hyp h if Cost(h) < Cost(hbest) + const
[Figure: cost vs. number of translated words; hyps outside the beam around the best hyp are pruned]
- Models separate the alternatives a lot -> keep few hyps
- Models do not separate the alternatives -> keep many hyps
Stephan Vogel - Machine Translation 15
Additive Beam
Is an additive constant (in the log domain) the right thing to do?
Hyps may spread more and more
[Figure: cost vs. number of translated words; as the scores spread, fewer and fewer hyps fall inside a constant-width beam]
Stephan Vogel - Machine Translation 16
Multiplicative Beam
Beam search: keep hyp h if Cost(h) < Cost(hbest) * const
[Figure: cost vs. number of translated words; the multiplicative beam opens up with growing cost and covers more hyps]
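A compact sketch of both beam variants (additive from slide 14, multiplicative from this slide); hypothesis records with a .cost field are assumed, and costs are negative log probabilities, i.e. positive:

    def beam_prune(hyps, width, multiplicative=False):
        # Additive beam:       keep h if Cost(h) < Cost(best) + width
        # Multiplicative beam: keep h if Cost(h) < Cost(best) * width
        # (for the multiplicative variant, width is a factor such as 1.1,
        # so the beam opens up as the costs grow)
        best = min(h.cost for h in hyps)
        limit = best * width if multiplicative else best + width
        return [h for h in hyps if h.cost < limit]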
Stephan Vogel - Machine Translation 17
Pruning and Optimization
Each feature has a feature weight; optimization adjusts the feature weights, which can compress or spread the scores
This actually happened in our first MERT implementation:
- Higher and higher feature weights => hyps spread further and further apart => fewer hyps inside the beam => lower and lower BLEU score
Two-pronged repair:
- Normalizing the feature weights
- Not proper beam pruning, but restricting the number of hyps
Stephan Vogel - Machine Translation 18
How Many Hyps to Keep?
Keep the n best hyps
- Does not use the information from the models to decide how many hyps to keep
[Figure: cost vs. number of translated words; a constant number of hyps is kept at each stack, the rest are pruned]
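A sketch of this histogram-style pruning (again assuming records with a .cost field):

    import heapq

    def histogram_prune(hyps, n=100):
        # Keep at most n hyps regardless of how the scores spread; this
        # ignores the models' view of how close the alternatives are, but
        # bounds the work per stack and avoids the empty-beam problem
        # seen with MERT on the previous slide.
        return heapq.nsmallest(n, hyps, key=lambda h: h.cost)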
Stephan Vogel - Machine Translation 19
Efficiency
Two problems:
- Sorting
- Generating lots of hyps which are then pruned (what a waste of time)
Can we avoid generating hyps which would most likely be pruned?
Stephan Vogel - Machine Translation 20
Efficiency
Assumptions:
- We want to generate hyps which cover n positions
- All hyp sets Hk, k < n, are sorted according to total score
- All phrase pairs (edges in the translation lattice) which can be used to expand a hyp h in Hk to cover n positions are sorted according to their score (weighted sum of the individual scores)
[Figure: sorted hyps h1..h5 combined with sorted phrases p1..p4; the new hyps (h1p1, h1p2, h2p3, h4p2, h1p3, h3p2, ...) come out nearly sorted, and combinations such as h2p1 are pruned]
Stephan Vogel - Machine Translation 21
Naïve Way
Naïve way:

    foreach hyp h
        foreach phrase pair p
            newhyp = h * p
            Cost(newhyp) = Cost(h) + Cost(p) + CostLM + CostDM + ...

This generates many hyps which will be pruned.
Stephan Vogel - Machine Translation 22
Early Termination
If Cost(newhyp) were simply Cost(h) + Cost(p), early termination would be easy:

    besthyp = h1 * p1
    loop h = next hyp
        loop p = next p
            newhyp = h * p
            Cost(newhyp) = Cost(h) + Cost(p)
        until Cost(newhyp) > Cost(besthyp) + const
    until Cost(newhyp) > Cost(besthyp) + const

That works for proper beam pruning, but would still generate too many hyps for the max-number-of-hyps strategy (see the runnable sketch below).
In addition, we have the LM and DM costs, etc.
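A runnable version of the early-termination loop, under the slide's simplifying assumptions: hyps and phrases are lists of records with a .cost field, sorted ascending, and the combined cost is purely additive (no LM/DM terms).

    def expand_with_early_termination(hyps, phrases, beam=5.0):
        # Beam threshold relative to the cheapest possible combination.
        threshold = hyps[0].cost + phrases[0].cost + beam
        new_hyps = []
        for h in hyps:                          # cheapest hypothesis first
            if h.cost + phrases[0].cost > threshold:
                break                           # no later h can enter the beam
            for p in phrases:                   # cheapest phrase first
                cost = h.cost + p.cost
                if cost > threshold:
                    break                       # all remaining p are worse
                new_hyps.append((cost, h, p))
        return new_hyps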
Stephan Vogel - Machine Translation 23
‘Cube’ Pruning
Always expand the best hyp, until:
- no hyps are within the beam anymore, or
- the max number of hyps is reached
(see the sketch after the figure)
[Figure: cube-pruning grid of sorted hyps h1..h4 (rows) x sorted phrases p1..p3 (columns) with combined costs such as 3.1, 3.4, 4.2, 4.6, 5.6, 6.5; cells are popped in best-first order 1, 2, 3, 4]
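A sketch of the best-first grid exploration (hyps and phrases: lists of records with a .cost field, sorted ascending):

    import heapq

    def cube_prune(hyps, phrases, max_hyps=100):
        # Lazily explore the hyps x phrases grid: pop the cheapest cell,
        # then push its right and lower neighbours. With an LM the popped
        # score is only an estimate, which is why cube pruning is
        # approximate; a beam check on the popped cost could also stop
        # the loop early.
        heap = [(hyps[0].cost + phrases[0].cost, 0, 0)]
        seen, out = {(0, 0)}, []
        while heap and len(out) < max_hyps:
            cost, i, j = heapq.heappop(heap)
            out.append((cost, hyps[i], phrases[j]))
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < len(hyps) and nj < len(phrases) and (ni, nj) not in seen:
                    seen.add((ni, nj))
                    heapq.heappush(heap, (hyps[ni].cost + phrases[nj].cost, ni, nj))
        return out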
Stephan Vogel - Machine Translation 24
Effect of Recombination and Pruning
Average number of expanded hypotheses and NIST scores for different recombination (R) and pruning (P) combinations and different beam sizes (= number of hyps).
Test set: Arabic DevTest (203 sentences).

    R : P        Beam width:   1        2        5        10       20

    Av. hyps expanded
    C  : c                     825      899      1,132    1,492    1,801
    CL : c                     1,174    1,857    6,213    30,293   214,402
    CL : C                     2,792    4,248    12,921   53,228   287,278

    NIST
    C  : c                     8.18     8.81     8.21     8.22     8.27
    CL : c                     8.41     8.62     8.88     8.95     8.96
    CL : C                     8.47     8.68     8.85     8.98     8.98

c = number of translated words, C = coverage vector (i.e. positions), L = LM history. NIST scores: higher is better.
Stephan Vogel - Machine Translation 25
Number of Hypotheses versus NIST
- Language model state is required as a recombination feature
- More hypotheses - better quality
- Different ways to achieve similar translation quality
- CL : C generates more 'useless' hypotheses (the number of bad hyps grows faster than the number of good hyps)
[Figure: NIST score (8.0-9.2) vs. number of expanded hypotheses (100 to 1,000,000, log scale), curves for C : c, CL : c, and CL : C]
Stephan Vogel - Machine Translation 26
N-Best List Generation
Benefits:
- Required for optimizing the model scaling factors
- Rescoring with richer models
- Down-stream processing: translation with a pivot language (L1 -> L2 -> L3), information extraction, ...
We have the n best translations at the sentence end
But: hypotheses were recombined -> many good translations do not reach the sentence end
=> Recover those translations
Stephan Vogel - Machine Translation 27
Storing Multiple Backpointers
When recombining hypotheses, store them with the best (i.e. surviving) hypothesis, but don’t expand them
[Figure: as on slide 7, the surviving hypotheses hb carry backpointers to the recombined hypotheses hr]
Stephan Vogel - Machine Translation 28
Calculating True Score
Propagate the final score backwards:
- For the best hypothesis we have the correct final score Qf(hb)
- For a recombined hypothesis we know its current score Qc(hr) and the difference to the current score Qc(hb) of the best hypothesis
- The final score of the recombined hypothesis is then (see the sketch below):
    Q(hr) = Qf(hb) + ( Qc(hr) - Qc(hb) )
Use triples B = (Q, h, B') to store the sequences of hypotheses which make up a translation
- Start with the n best final hypotheses
- For each of the top n Bs, go to the predecessor hypothesis and to the recombined hypotheses of the predecessor hypothesis
- Store the Bs according to coverage
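The score propagation itself is a one-liner; here it is spelled out as a sketch:

    def true_score(qf_best, qc_best, qc_rec):
        # The recombined hyp hr was frozen when it lost to hb, but the
        # score difference at that point carries over unchanged to the
        # sentence end:  Q(hr) = Qf(hb) + ( Qc(hr) - Qc(hb) )
        return qf_best + (qc_rec - qc_best)

N-best extraction then walks the backpointers (the B triples above) using these reconstructed scores.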
Stephan Vogel - Machine Translation 29
Problem with N-Best Generation
Duplicates arise when using phrases:
US # companies # and # other # institutions
US companies # and # other # institutions
US # companies and # other # institutions
US # companies # and other # institutions
. . .
Example run: 1000-best -> ~400 different strings on average; extreme case: only 10 different strings
Possible solution: check uniqueness during backtracking, i.e. create and hash the partial translations (see the sketch below)
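A sketch of the deduplication step, assuming the backtracking yields (cost, string) pairs cheapest-first:

    def unique_nbest(candidates, n=1000):
        # Hash the surface strings to drop duplicates that differ only
        # in their phrase segmentation (the '#' boundaries above).
        seen, out = set(), []
        for cost, string in candidates:
            if string not in seen:
                seen.add(string)
                out.append((cost, string))
            if len(out) == n:
                break
        return out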
Stephan Vogel - Machine Translation 30
Rest-Cost Estimation
In pruning we compare hyps which are not strictly equivalent under the models
- Risk: we prefer hypotheses which have covered the easy parts
- Remedy: estimate the remaining cost for each hypothesis and compare hypotheses based on ActualCost + FutureCost
We want the minimum expected cost (similar to A* search)
- It gives a bound for pruning
- However, it is not possible with acceptable effort for all models
We want to include as many models as possible:
- Translation model costs, word count, phrase count
- Language model costs
- Distortion model costs
Calculate the expected cost R(l, r) for each span (l, r)
Stephan Vogel - Machine Translation 31
Rest Cost for Translation Models
Translation model, word count, and phrase count features are 'local' costs
- They depend only on the current phrase pair
- Strictly additive: R(l, m) + R(m, r) = R(l, r)
Minimize over the alternative translations:
- For each source phrase span (l, r): initialize with the cost of the best translation
- Combine adjacent spans, take the best combination
Initialization:  R_TM(l, r) = min over translations t of span (l, r) of Cost_TM(t, l, r)
Recursion:       R_TM(l, r) = min over m of { R_TM(l, r), R_TM(l, m) + R_TM(m, r) }
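A sketch of the span dynamic program behind these recursions (the input format is an assumption: a dict mapping spans (l, r), r exclusive, to the best local translation cost):

    def rest_costs(best_span_cost, J):
        # best_span_cost[(l, r)]: cost of the cheapest translation option
        # for source span [l, r); missing spans count as infinite.
        # Combine adjacent spans bottom-up, keeping the minimum.
        R = dict(best_span_cost)
        for length in range(2, J + 1):
            for l in range(0, J - length + 1):
                r = l + length
                for m in range(l + 1, r):
                    combined = R.get((l, m), float("inf")) + R.get((m, r), float("inf"))
                    if combined < R.get((l, r), float("inf")):
                        R[(l, r)] = combined
        return R

The LM and DM rest costs on the following slides reuse the same recursion and differ only in how the spans are initialized.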
Stephan Vogel - Machine Translation 32
Rest Cost for Language Models
We do not have the history -> only an approximation
- For each span (l, r), calculate the LM score without history
- Combine the LM scores of adjacent spans
Notice: p(e_1 ... e_m) * p(e_m+1 ... e_n) != p(e_1 ... e_n) beyond a 1-gram LM
p(e_1 ... e_n) = p(e_1) p(e_2 | e_1) ... p(e_n | e_1 ... e_n-1)

Initialization:  R_LM(l, r) = min over translations t of span (l, r) of { -log p( e(t) ) }
Recursion:       R_LM(l, r) = min over m of { R_LM(l, r), R_LM(l, m) + R_LM(m, r) }
Alternative: fast monotone decoding with the TM-best translations
- The history is then available
- Then R(l, r) = R(1, r) - R(1, l)
Stephan Vogel - Machine Translation 33
Rest Cost for Distance-Based DM
Distance-based DM: the rest cost depends on the coverage pattern
- Too many different coverage patterns -> cannot pre-calculate
- Estimate by jumping to the first gap, then filling the gaps in sequence
Moore & Quirk 2007: DM cost plus rest cost, with S = current phrase, S' = previous phrase, S'' = gap-free initial segment, L(.) = length of a phrase, D(.,.) = distance between phrases:
- S adjacent to S'':       d = 0
- S left of S':            d = 2 L(S)
- S' subsequence of S'':   d = 2 ( D(S, S'') + L(S) )
- otherwise:               d = 2 ( D(S, S') + L(S) )
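A direct transcription of the four cases as a sketch; the span predicates and the distance function are my guessed readings of the slide, not Moore & Quirk's exact definitions:

    def dm_rest_cost(S, S_prev, S_init):
        # Spans are (start, end) pairs, end exclusive.
        L = lambda s: s[1] - s[0]            # L(.)   = length of a phrase
        D = lambda a, b: abs(a[0] - b[1])    # D(.,.) = distance (assumed)
        if S[0] == S_init[1]:                # S adjacent to S''
            return 0
        if S[1] <= S_prev[0]:                # S left of S'
            return 2 * L(S)
        if S_prev[1] <= S_init[1]:           # S' subsequence of S''
            return 2 * (D(S, S_init) + L(S))
        return 2 * (D(S, S_prev) + L(S))     # otherwise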
Stephan Vogel - Machine Translation 34
Rest Cost for Lexicalized DM
Lexicalized DM per phrase pair (f, e) = (f, t(f))
- DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
- Treat as a local cost for each span (l, r)
- Minimize over the alternative translations and over the different orientations in-* and out-*
Initialization:  R_DM(l, r) = min over translations t of span (l, r) of { min over o in in-* of Cost( f_(l,r), t, o )  +  min over o in out-* of Cost( f_(l,r), t, o ) }
Recursion:       R_DM(l, r) = min over m of { R_DM(l, r), R_DM(l, m) + R_DM(m, r) }
Stephan Vogel - Machine Translation 35
Effect of Rest-Cost Estimation
From Richard Zens, 2008 (we did not describe the 'per position' variant)
- The LM rest cost is important, and the DM rest cost is important
Stephan Vogel - Machine Translation 36
Summary
Different translation strategies - related to word reordering
Two-level decoding strategy (one possible way to do it)
- Generating the translation lattice: contains all word and phrase translations
- Finding the best path
Word reordering as an extension of the best-path search
- Jump ahead in the lattice, fill in the gap later
- Short reordering window: decoding time exponential in the size of the window
Recombination of hypotheses
- If the models cannot re-rank hypotheses, keep only the best
- Depends on the models used
Stephan Vogel - Machine Translation 37
Summary
Pruning of hypotheses
- Beam pruning
- Problem with too few hyps in the beam (e.g. when running MERT)
- Keeping a maximum number of hyps
Efficiency of the implementation
- Try to avoid generating hyps which are pruned
- Cube pruning
N-best list generation
- Needed for MERT
- Spurious ambiguity