Machine Translation: MIRA and MBR. Stephan Vogel, Spring Semester 2011
TRANSCRIPT
11-711 Machine Translation 2
Overview
MBR – Minimum Bayes Risk Decoding
MIRA – Margin Infused Relaxed Algorithm
Optimization
We learned about tuning the MT system: the decoder uses a set of feature functions, and we optimize towards an MT metric by adjusting the feature weights.
Different optimization approaches: Simplex, Powell, MERT a la Och.
Problem: these only work well for a small number (< 30) of features.
MBR for Translation
Translate source sentence f into target sentence e. The decoder generates many alternative translations; select the least risky one:

\[
\hat{e} = \operatorname*{argmin}_{e' \in \mathcal{E}_h} R(e')
        = \operatorname*{argmin}_{e' \in \mathcal{E}_h} \sum_{e \in \mathcal{E}_e} L(e, e')\, P(e \mid f)
\]

where \(P(e \mid f)\) is the true distribution, \(L(e, e')\) the loss function, \(\mathcal{E}_e\) the evidence space, and \(\mathcal{E}_h\) the hypothesis space.
Hypothesis Space
Hypothesis Space: we can only select a translation which we have generated, and the decoder prunes most of the possible hypotheses.
We typically generate n-best translations from the search graph. The search graph contains many more paths (exponentially many in the sentence length J, for reordering window k).
Ueffing et al. 2002, Generation of Word Graphs in Statistical Machine Translation, describe the generation of an output lattice from the search graph.
Typically an n-best list is used (e.g. in the Moses package); Tromble et al. describe lattice MBR.
Note about terminology: I used 'Translation Lattice' for the lattice which includes all phrase translations for the source sentence; others use 'Translation Lattice' for the output word graph.
The Loss Function
The loss function gives the 'cost' of generating a wrong translation.
Kumar & Byrne (2004) study different loss functions:
- Lexical: compare on the word level only, e.g. WER, PER, 1-BLEU
- Target-language parse tree, e.g. tree edit distance between parse trees
- Bilingual parse tree: uses information from word strings, alignments, and parse trees in both languages
Ehling, Zens & Ney (2007) use BLEU.
Any automatic MT evaluation metric, or an appropriate approximation, can be used. Some metrics, like BLEU, are defined on the test set and need a sentence-level approximation, which may require some smoothing, e.g. simple count+1 smoothing in BLEU.
Probability Distribution
We don't have the true distribution, so we approximate it with the model distribution.
Use a scaling factor α to smooth the distribution: α < 1 flattens, α > 1 sharpens. α needs to be tuned (a simple line search suffices).
\[
P_\alpha(e \mid f) = \frac{\exp(\alpha\, H(e,f))}{\sum_{e' \in \mathcal{E}} \exp(\alpha\, H(e',f))},
\qquad \text{with } H(e,f) = \sum_m \lambda_m h_m(e,f)
\]
Evidence Space
Summation over 'all' translations for the source sentence f.
We can use more translations as evidence than those from which we select the new 1-best: e.g. use the top 10k as evidence but only the top 1k for selection, or even the entire lattice.
\[
\hat{e} = \operatorname*{argmin}_{e' \in \mathcal{E}_h} \sum_{e \in \mathcal{E}_e} L(e, e')\, P(e \mid f)
\]
MBR on n-best List
for (iter = nBestList.begin(); iter != nBestList.end(); ++iter) {
    joint_prob = …
    marginal += joint_prob;
}

/* Main MBR computation done here */
for (unsigned int i = 0; i < nBestList.GetSize(); i++) {
    weightedLossCumul = 0;
    for (unsigned int j = 0; j < nBestList.GetSize(); j++) {
        if (i != j) {
            bleu = calculate_score(translations, j, i, ngram_stats);
            weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
            weightedLossCumul += weightedLoss;
            if (weightedLossCumul > minMBRLoss)
                break;
        }
    }
    if (weightedLossCumul < minMBRLoss) {
        minMBRLoss = weightedLossCumul;
        minMBRLossIdx = i;
    }
}
/* Find sentence that minimises Bayes Risk under 1-BLEU loss */
return translations[minMBRLossIdx];
}
MBR on n-best List
(From Moses::scripts/training/mbr/mbr.cpp)
void process(int sent, const vector<candidate_t*>& sents) {
    for (int i = 0; i < sents.size(); i++) {
        // Calculate marginal and cache the posteriors
        joint_prob = calculate_probability(sents[i]->features, weights, SCALE);
        marginal += joint_prob;
        …
    }
    …
    /* Main MBR computation done here */
    for (int i = 0; i < sents.size(); i++) {
        weightedLossCumul = 0;
        for (int j = 0; j < sents.size(); j++) {
            if (i != j) {
                bleu = calculate_score(sents, j, i, ngram_stats);
                weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
                weightedLossCumul += weightedLoss;
                if (weightedLossCumul > minMBRLoss)
                    break;
            }
        }
        if (weightedLossCumul < minMBRLoss) {
            minMBRLoss = weightedLossCumul;
            minMBRLossIdx = i;
        }
    }
}
MBR on n-best List
Runtime is O(N_h · N_e), where N_h is the number of (top-ranking) hypotheses considered for selection and N_e the number of hypotheses summed over.
Typically the full n-best list is used for both hypothesis and evidence space, so runtime is quadratic in the n-best list size.
MBR on Lattice
A bit more complicated :-) We cannot enumerate all paths, so we need a local loss (gain) function. Tromble et al. show how this can be done:
- Assume the gain function can be written as a sum of local gain functions, i.e. gains for individual n-grams.
- Calculate the local gain function in terms of n-gram posteriors.
- This reduces the summation over exponentially many paths to a summation over the number of n-grams, which is polynomial in the worst case.
- This uses a different approximation to test-set BLEU; its parameters need to be tuned on a (second) development set.
Two-pass decoding: Pass 1: standard decoding with lattice generation; Pass 2: MBR decoding over the lattice.
MBR on Lattice - Results
             Ar-En  Ch-En  En-Ch
MAP           43.7   27.9   41.4
N-best MBR    43.9   28.3   42.0
Lattice MBR   44.9   28.5   42.6
Reported are BLEU scores on the NIST 2008 test set. MBR on the lattice outperforms MBR on the n-best list, which in turn outperforms MAP decoding.
Hypothesis Space and Evidence Space
Hyp Space   Evid Space   Ar-En  Ch-En  En-Ch
Lattice     Lattice       44.9   28.5   42.6
1000-best   Lattice       44.6   28.5   42.6
Lattice     1000-best     44.1   28.0   42.1
1000-best   1000-best     44.2   28.1   42.2
A larger evidence space is more important than a larger hypothesis space.
Notice: this experiment used a different BLEU approximation, giving higher scores.
Tuning MBR Decoder
Tuning the scaling factor α is important: flattening the distribution (α < 1) makes it easier to select a hypothesis other than the MAP hypothesis.
MBR for System Combination
We have seen system combination based on combined n-best lists, e.g. Silja's hypothesis selection system; essentially n-best list rescoring on the combined n-best list.
MBR works on n-best translations -> it can be used to combine systems.
Example: Gispert et al. 2008, MBR Combination of Translation Hypotheses from Alternative Morphological Decomposition:
- Preprocessing with the MADA and Sakhr taggers
- Build 2 translation systems
- MBR combination
             Dev (mt02-mt05)  Test (mt02-mt05)  mt08
MADA-based        53.3             52.7         43.7
+MBR              53.7             53.3         44.0
SAKHR-based       52.7             52.8         43.3
+MBR              53.2             53.2         43.8
MBR-combi         54.6             54.6         45.6
MBR for System Combination
In-house experiments combining the 200-best lists of 3 decoders; results on a newswire test set.
MBR improvements for the individual systems: ~0.4-0.5 BLEU, 0.1-0.4 TER.
MBR-combi improvement over the best single system: 0.9 BLEU, 0.6 TER.
System TER BLEU
PSMT1 60.79 28.82
PSMT2 60.76 27.09
SAMT 60.27 28.98
PSMT1-MBR 60.51 29.36
PSMT2-MBR 60.57 27.61
SAMT-MBR 60.11 29.40
MBR-Combi 59.53 30.23
MIRA
Online large-margin discriminative training:
- Online: update after each training example
- Large margin: move training examples away from the 'gray' area
- Discriminative: compare against alternatives; this also means that it is supervised
Originally described by Crammer and Singer, 2003. Applied to statistical MT by Watanabe et al., 2007; also used by Chiang et al., 2009.
Training Algorithm
For each training example:
- Generate N-best translations
- Update the oracle list
- Update the feature weights
Finally, return the averaged feature weights.
Weight Update
\[
w_{i+1} = \operatorname*{argmin}_{w} \; \|w - w_i\|^2 + C \sum_{\hat{e}, e'} \xi(\hat{e}, e')
\]
subject to
\[
s(f_t, \hat{e}) - s(f_t, e') \ge L(\hat{e}, e'; e_t) - \xi(\hat{e}, e'), \qquad \xi(\hat{e}, e') \ge 0
\]
with \(s(e, f) = w^T h(e, f)\), \(\hat{e} \in O_t\) the oracle translations, \(e'\) the competing translations, and \(e_t\) the reference translation for \(f_t\).
The slack variables ξ are ≥ 0; the parameter C ≥ 0 controls how strongly margin violations are penalized: a larger C means larger updates to the weight vector.
L(…) is the loss function, e.g. the loss in BLEU. The argmin means that we change the weights only as much as needed.
Results (Chiang 2009)
GALE 2008 Chinese-English

System  Training  Features  BLEU
Hiero   MERT          11    36.1
Hiero   MIRA      10,990    37.6
Syntax  MERT          25    39.5
Syntax  MIRA         285    40.6