Machine Translation: MIRA and MBR. Stephan Vogel, Spring Semester 2011
TRANSCRIPT
11-711 Machine Translation 2
Overview
MBR – Minimum Bayes Risk Decoding
MIRA – Margin Infused Relaxed Algorithm
Optimization
We learned about tuning the MT system: the decoder uses a set of feature functions, and we optimize towards an MT metric by adjusting the feature weights.
Different optimization approaches: Simplex, Powell, MERT a la Och.
Problem: these only work well for a small number (< 30) of features.
MBR for Translation
Translate source sentence f into target sentence e. The decoder generates many alternative translations; select the least risky one:

\[
\hat{e} = \operatorname*{argmin}_{e' \in \mathcal{E}_h} R(e')
        = \operatorname*{argmin}_{e' \in \mathcal{E}_h} \sum_{e \in \mathcal{E}_e} L(e, e')\, P(e \mid f)
\]

where \(P(e \mid f)\) is the true distribution, \(L(e, e')\) the loss function, \(\mathcal{E}_e\) the evidence space, and \(\mathcal{E}_h\) the hypothesis space.
Hypothesis Space
Hypothesis Space: we can only select a translation which we have generated, and the decoder prunes most of the possible hypotheses.
We typically generate n-best translations from the search graph. The search graph contains many more paths (exponentially many in the sentence length J, for reordering window k).
Ueffing et al. 2002, Generation of Word Graphs in Statistical Machine Translation, describe the generation of an output lattice from the search graph.
Typically an n-best list is used (e.g. in the Moses package); Tromble et al. describe lattice MBR.
Note about terminology: I used 'Translation Lattice' for the lattice which includes all phrase translations for the source sentence; others use 'Translation Lattice' for the output word graph.
The Loss Function
The loss function gives the 'cost' of generating a wrong translation.
Kumar & Byrne (2004) study different loss functions:
- Lexical: compare on the word level only, e.g. WER, PER, 1-BLEU
- Target-language parse tree, e.g. tree edit distance between parse trees
- Bilingual parse tree: uses information from word strings, alignments, and parse trees in both languages
Ehling, Zens & Ney (2007) use BLEU.
Any automatic MT evaluation metric, or an appropriate approximation, can be used. Some metrics, like BLEU, are defined on the test set and need a sentence-level approximation, which may require some smoothing, e.g. simple count+1 smoothing in BLEU.
Probability Distribution
We don't have the true distribution, so we approximate it with the model distribution.
Use a scaling factor α to smooth the distribution: α < 1 flattens, α > 1 sharpens. α needs to be tuned (a simple line search suffices).
\[
P_\alpha(e \mid f) = \frac{\exp(\alpha\, H(e,f))}{\sum_{e' \in \mathcal{E}} \exp(\alpha\, H(e',f))},
\qquad \text{with } H(e,f) = \sum_m \lambda_m h_m(e,f)
\]
Evidence Space
Summation over 'all' translations for the source sentence f.
We can use more translations as evidence than those from which we select the new 1-best: e.g. use the top 10k as evidence but only the top 1k for selection, or even the entire lattice.
\[
\hat{e} = \operatorname*{argmin}_{e' \in \mathcal{E}_h} \sum_{e \in \mathcal{E}_e} L(e, e')\, P(e \mid f)
\]
MBR on n-best List
for (iter = nBestList.begin(); iter != nBestList.end(); ++iter) {
    joint_prob = …
    marginal += joint_prob;
}

/* Main MBR computation done here */
for (unsigned int i = 0; i < nBestList.GetSize(); i++) {
    weightedLossCumul = 0;
    for (unsigned int j = 0; j < nBestList.GetSize(); j++) {
        if (i != j) {
            bleu = calculate_score(translations, j, i, ngram_stats);
            weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
            weightedLossCumul += weightedLoss;
            if (weightedLossCumul > minMBRLoss)
                break;
        }
    }
    if (weightedLossCumul < minMBRLoss) {
        minMBRLoss = weightedLossCumul;
        minMBRLossIdx = i;
    }
}
/* Find sentence that minimises Bayes Risk under 1-BLEU loss */
return translations[minMBRLossIdx];
}
MBR on n-best List
(From Moses::scripts/training/mbr/mbr.cpp)
void process(int sent, const vector<candidate_t*>& sents) {
    for (int i = 0; i < sents.size(); i++) {
        // Calculate marginal and cache the posteriors
        joint_prob = calculate_probability(sents[i]->features, weights, SCALE);
        marginal += joint_prob;
        …
    }
    …
    /* Main MBR computation done here */
    for (int i = 0; i < sents.size(); i++) {
        weightedLossCumul = 0;
        for (int j = 0; j < sents.size(); j++) {
            if (i != j) {
                bleu = calculate_score(sents, j, i, ngram_stats);
                weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
                weightedLossCumul += weightedLoss;
                if (weightedLossCumul > minMBRLoss)
                    break;
            }
        }
        if (weightedLossCumul < minMBRLoss) {
            minMBRLoss = weightedLossCumul;
            minMBRLossIdx = i;
        }
    }
}
MBR on n-best List
Runtime is O(N_h · N_e), where N_h is the number of (top-ranking) hypotheses considered for selection and N_e the number of hypotheses summed over.
Typically the full n-best list is used for both hypothesis and evidence space, so runtime is quadratic in the n-best list size.
MBR on Lattice
A bit more complicated :-) We cannot enumerate all paths, so we need a local loss (gain) function. Tromble et al. show how this can be done:
- Assume the gain function can be written as a sum of local gain functions, i.e. gains for individual n-grams.
- Calculate the local gain function in terms of n-gram posteriors.
- This reduces the summation over exponentially many paths to a summation over the number of n-grams, which is polynomial in the worst case.
- This uses a different approximation to test-set BLEU; its parameters need to be tuned on a (second) development set.
Two-pass decoding: Pass 1: standard decoding with lattice generation; Pass 2: MBR decoding over the lattice.
MBR on Lattice - Results
             Ar-En  Ch-En  En-Ch
MAP           43.7   27.9   41.4
N-best MBR    43.9   28.3   42.0
Lattice MBR   44.9   28.5   42.6
Reported are BLEU scores on the NIST 2008 test set. MBR on the lattice outperforms MBR on the n-best list, which in turn outperforms MAP decoding.
Hypothesis Space and Evidence Space
Hyp Space   Evid Space   Ar-En  Ch-En  En-Ch
Lattice     Lattice       44.9   28.5   42.6
1000-best   Lattice       44.6   28.5   42.6
Lattice     1000-best     44.1   28.0   42.1
1000-best   1000-best     44.2   28.1   42.2
A larger evidence space is more important than a larger hypothesis space.
Notice: this experiment used a different BLEU approximation, giving higher scores.
Tuning MBR Decoder
Tuning the scaling factor α is important: flattening the distribution (α < 1) makes it easier to select a hypothesis other than the MAP hypothesis.
MBR for System Combination
We have seen system combination based on combined n-best lists, e.g. Silja's hypothesis selection system; essentially n-best list rescoring on the combined n-best list.
MBR works on n-best translations -> it can be used to combine systems.
Example: Gispert et al. 2008, MBR Combination of Translation Hypotheses from Alternative Morphological Decomposition:
- Preprocessing with the MADA and Sakhr taggers
- Build 2 translation systems
- MBR combination
             Dev (mt02-mt05)  Test (mt02-mt05)  mt08
MADA-based        53.3             52.7         43.7
+MBR              53.7             53.3         44.0
SAKHR-based       52.7             52.8         43.3
+MBR              53.2             53.2         43.8
MBR-combi         54.6             54.6         45.6
MBR for System Combination
In-house experiments combining the 200-best lists of 3 decoders; results on a newswire test set.
MBR improvements for the individual systems: ~0.4-0.5 BLEU, 0.1-0.4 TER.
MBR-combi improvement over the best single system: 0.9 BLEU, 0.6 TER.
System TER BLEU
PSMT1 60.79 28.82
PSMT2 60.76 27.09
SAMT 60.27 28.98
PSMT1-MBR 60.51 29.36
PSMT2-MBR 60.57 27.61
SAMT-MBR 60.11 29.40
MBR-Combi 59.53 30.23
MIRA
Online large-margin discriminative training:
- Online: update after each training example
- Large margin: move training examples away from the 'gray' area
- Discriminative: compare against alternatives; this also means that it is supervised
Originally described by Crammer and Singer, 2003. Applied to statistical MT by Watanabe et al., 2007; also used by Chiang et al., 2009.
Training Algorithm
For each training example:
- Generate N-best translations
- Update the oracle list
- Update the feature weights
Finally, return the averaged feature weights.
Weight Update
\[
w_{i+1} = \operatorname*{argmin}_{w} \; \|w - w_i\|^2 + C \sum_{\hat{e}, e'} \xi(\hat{e}, e')
\]
subject to
\[
s(f_t, \hat{e}) - s(f_t, e') \ge L(\hat{e}, e'; e_t) - \xi(\hat{e}, e'), \qquad \xi(\hat{e}, e') \ge 0
\]
with \(s(e, f) = w^T h(e, f)\), \(\hat{e} \in O_t\) the oracle translations, \(e'\) the competing translations, and \(e_t\) the reference translation for \(f_t\).
The slack variables ξ are ≥ 0; the parameter C ≥ 0 controls how strongly margin violations are penalized: a larger C means larger updates to the weight vector.
L(…) is the loss function, e.g. the loss in BLEU. The argmin means that we change the weights only as much as needed.
Results (Chiang 2009)
GALE 2008 Chinese-English

System  Training  Features  BLEU
Hiero   MERT          11    36.1
Hiero   MIRA      10,990    37.6
Syntax  MERT          25    39.5
Syntax  MIRA         285    40.6