
Machine Translation

MIRA and MBR

Stephan Vogel, Spring Semester 2011

11-711 Machine Translation 2

Overview

MBR – Minimum Bayes Risk Decoding
MIRA – Margin Infused Relaxed Algorithm

Optimization

We learned about tuning the MT system: the decoder uses a set of feature functions, and we optimize towards an MT metric by adjusting the feature weights.

Different optimization approaches: Simplex, Powell, MERT a la Och.

Problem: these only work well for a small number (< 30) of features.

MBR – Minimum Bayes Risk

MBR for Translation

Translate source sentence f into target sentence e.
The decoder generates many alternative translations.
Select the least risky translation:

ê = argmin_{e ∈ E_h} R(e) = argmin_{e ∈ E_h} Σ_{e' ∈ E_e} L(e, e') · P(e' | f)

where P(e' | f) stands in for the true distribution, L(e, e') is the loss function, E_e is the evidence space (translations summed over), and E_h is the hypothesis space (translations we can select from).

Hypothesis Space

Hypothesis Space: we can only select a translation which we have generated.
The decoder prunes most of the possible hypotheses.
We typically generate n-best translations from the search graph.
The search graph contains many more paths: the number grows exponentially with the sentence length J and the reordering window k.
Ueffing et al. 2002, "Generation of Word Graphs in Statistical Machine Translation", describe the generation of an output lattice from the search graph.
Typically an n-best list is used (e.g. in the Moses package); Tromble et al. describe lattice MBR.

Note on terminology: I used 'Translation Lattice' for the lattice which includes all phrase translations for the source sentence; others use 'Translation Lattice' for the output word graph.

The Loss Function

The loss function gives the 'cost' for generating a wrong translation.

Kumar & Byrne (2004) study different loss functions:
Lexical: compare on the word level only, e.g. WER, PER, 1-BLEU
Target language parse tree: e.g. tree edit distance between parse trees
Bilingual parse tree: uses information from word strings, alignments and parse trees in both languages

Ehling, Zens & Ney (2007) use BLEU.

Any automatic MT evaluation metric, or an appropriate approximation, can be used. Some metrics, like BLEU, are defined on the whole test set and need a sentence-level approximation, which may require some 'smoothing', e.g. simple count +1 smoothing in BLEU.
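To make the count +1 smoothing concrete, here is a small self-contained sketch of a smoothed sentence-level BLEU-4 (an illustration of mine, not code from any particular toolkit; it assumes a single reference and a non-empty hypothesis):

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Count n-grams of a given order in a tokenized sentence.
std::map<std::string, int> ngramCounts(const std::vector<std::string>& toks, int n) {
  std::map<std::string, int> counts;
  for (size_t i = 0; i + n <= toks.size(); ++i) {
    std::string gram;
    for (int j = 0; j < n; ++j) gram += toks[i + j] + " ";
    ++counts[gram];
  }
  return counts;
}

// Sentence-level BLEU-4 with add-1 smoothing on each n-gram precision,
// so a single sentence never gets a zero score just because one
// n-gram order has no matches.
double smoothedSentenceBleu(const std::vector<std::string>& hyp,
                            const std::vector<std::string>& ref) {
  double logBleu = 0.0;
  for (int n = 1; n <= 4; ++n) {
    std::map<std::string, int> hypCounts = ngramCounts(hyp, n);
    std::map<std::string, int> refCounts = ngramCounts(ref, n);
    int match = 0, total = 0;
    for (const auto& kv : hypCounts) {
      total += kv.second;
      auto it = refCounts.find(kv.first);
      if (it != refCounts.end()) match += std::min(kv.second, it->second);
    }
    // Add-1 smoothed precision: (match + 1) / (total + 1)
    logBleu += 0.25 * std::log((match + 1.0) / (total + 1.0));
  }
  // Standard brevity penalty.
  if (hyp.size() < ref.size())
    logBleu += 1.0 - static_cast<double>(ref.size()) / hyp.size();
  return std::exp(logBleu);
}
```

With this smoothing an identical hypothesis still scores exactly 1, and a hypothesis with no 4-gram matches still gets a non-zero score.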

Probability Distribution

We don’t have the true distribution; approximate it with the model distribution.

Use a scaling factor α to smooth the distribution:
α < 1 flattens, α > 1 sharpens
α needs to be tuned (a simple line search suffices)

P(e | f) = exp(α · H(e, f)) / Σ_{e' ∈ E} exp(α · H(e', f)),   with H(e, f) = Σ_m λ_m · h_m(e, f)
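A minimal sketch of this scaled posterior computed from the n-best model scores (the function name and the max-subtraction trick for numerical stability are mine):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Turn the model scores H(e,f) of the n-best entries into the scaled
// posterior P(e|f) = exp(alpha * H(e,f)) / sum_e' exp(alpha * H(e',f)).
// alpha < 1 flattens the distribution, alpha > 1 sharpens it.
std::vector<double> scaledPosteriors(const std::vector<double>& scores, double alpha) {
  // Subtract the maximum score before exponentiating, for numerical stability.
  double maxScore = *std::max_element(scores.begin(), scores.end());
  std::vector<double> post(scores.size());
  double z = 0.0;
  for (size_t i = 0; i < scores.size(); ++i) {
    post[i] = std::exp(alpha * (scores[i] - maxScore));
    z += post[i];
  }
  for (double& p : post) p /= z;
  return post;
}
```

As α grows, the mass concentrates on the MAP hypothesis; as α approaches 0, the distribution becomes uniform and the loss term alone decides the MBR choice.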

Evidence Space

Summation over ‘all’ translations for the source sentence f.

Can use more translations in the evidence space than in the hypothesis space from which we select the new 1-best:
e.g. use the top 10k as evidence, but only the top 1k for selection
or even the entire lattice

ê = argmin_{e ∈ E_h} Σ_{e' ∈ E_e} L(e, e') · P(e' | f)

MBR on n-best List

for (iter = nBestList.begin(); iter != nBestList.end(); ++iter) {
  joint_prob = …
  marginal += joint_prob;
}

/* Main MBR computation done here */
for (unsigned int i = 0; i < nBestList.GetSize(); i++) {
  weightedLossCumul = 0;
  for (unsigned int j = 0; j < nBestList.GetSize(); j++) {
    if (i != j) {
      bleu = calculate_score(translations, j, i, ngram_stats);
      weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
      weightedLossCumul += weightedLoss;
      if (weightedLossCumul > minMBRLoss)
        break;
    }
  }
  if (weightedLossCumul < minMBRLoss) {
    minMBRLoss = weightedLossCumul;
    minMBRLossIdx = i;
  }
}
/* Find sentence that minimises Bayes Risk under 1-BLEU loss */
return translations[minMBRLossIdx];

MBR on n-best List

(From Moses: scripts/training/mbr/mbr.cpp)

void process(int sent, const vector<candidate_t*> & sents) {
  for (int i = 0; i < sents.size(); i++) {
    // Calculate marginal and cache the posteriors
    joint_prob = calculate_probability(sents[i]->features, weights, SCALE);
    marginal += joint_prob;
    …
  }
  …
  /* Main MBR computation done here */
  for (int i = 0; i < sents.size(); i++) {
    weightedLossCumul = 0;
    for (int j = 0; j < sents.size(); j++) {
      if (i != j) {
        bleu = calculate_score(sents, j, i, ngram_stats);
        weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
        weightedLossCumul += weightedLoss;
        if (weightedLossCumul > minMBRLoss)
          break;
      }
    }
    if (weightedLossCumul < minMBRLoss) {
      minMBRLoss = weightedLossCumul;
      minMBRLossIdx = i;
    }
  }
}

MBR on n-best List

Runtime is O(Nh · Ne), where Nh is the number of (top-ranking) hypotheses considered for selection and Ne is the number of hypotheses summed over as evidence.

Typically the full n-best list serves as both hypothesis and evidence space, so the runtime is quadratic in the n-best list size.

MBR on Lattice

A bit more complicated :-) We cannot enumerate all paths, so we need a local loss (gain) function. Tromble et al. show how this can be done:

Assume the gain function can be written as a sum of local gain functions, i.e. gains for individual n-grams.
Calculate the local gain function in terms of n-gram posteriors.
This reduces the summation over exponentially many paths to a summation over the number of n-grams, which is polynomial in the worst case.
They use different approximations to test-set BLEU; the parameters need to be tuned on a (second) development set.

2-pass decoding:
Pass 1: standard decoding with lattice generation
Pass 2: MBR decoding over the lattice
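As a rough illustration of this decomposition, the sketch below applies a linearized, n-gram-level gain to a plain n-best list instead of a lattice. The theta values are made-up placeholders, not the tuned constants from Tromble et al.:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Sketch of a linearized BLEU gain (after Tromble et al.), computed over an
// n-best list rather than a lattice: score each hypothesis e by
//   theta0 * |e| + sum_u thetaN * #_u(e) * p(u),
// where p(u) is the posterior probability that n-gram u occurs in a translation.
// theta0 and thetaN below are illustrative values only.
int selectLinearizedMbr(const std::vector<std::vector<std::string>>& nbest,
                        const std::vector<double>& posterior) {
  const double theta0 = -0.5;                        // length term (assumed)
  const double thetaN[5] = {0, 1.0, 1.0, 1.0, 1.0};  // per-order weights (assumed)

  // n-gram posteriors: p(u) = sum_e P(e|f) * [u occurs in e]
  std::map<std::string, double> p;
  for (size_t k = 0; k < nbest.size(); ++k) {
    std::set<std::string> seen;
    for (int n = 1; n <= 4; ++n)
      for (size_t i = 0; i + n <= nbest[k].size(); ++i) {
        std::string u = std::to_string(n) + ":";
        for (int j = 0; j < n; ++j) u += nbest[k][i + j] + " ";
        seen.insert(u);
      }
    for (const auto& u : seen) p[u] += posterior[k];
  }

  // Pick the hypothesis with the highest linearized gain.
  int best = 0;
  double bestGain = -1e300;
  for (size_t k = 0; k < nbest.size(); ++k) {
    double gain = theta0 * nbest[k].size();
    for (int n = 1; n <= 4; ++n)
      for (size_t i = 0; i + n <= nbest[k].size(); ++i) {
        std::string u = std::to_string(n) + ":";
        for (int j = 0; j < n; ++j) u += nbest[k][i + j] + " ";
        gain += thetaN[n] * p[u];
      }
    if (gain > bestGain) { bestGain = gain; best = static_cast<int>(k); }
  }
  return best;
}
```

The point of the decomposition: the inner work is proportional to the number of distinct n-grams, not to the number of paths, which is what makes the lattice version tractable.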

MBR on Lattice - Results

             Ar-En  Ch-En  En-Ch
MAP          43.7   27.9   41.4
N-best MBR   43.9   28.3   42.0
Lattice MBR  44.9   28.5   42.6

Reported are BLEU scores on the NIST 2008 test set. MBR on the lattice outperforms MBR on the n-best list, which in turn outperforms MAP decoding.

Hypothesis Space and Evidence Space

Hyp Space   Evid Space  Ar-En  Ch-En  En-Ch
Lattice     Lattice     44.9   28.5   42.6
1000-best   Lattice     44.6   28.5   42.6
Lattice     1000-best   44.1   28.0   42.1
1000-best   1000-best   44.2   28.1   42.2

A larger evidence space is more important than a larger hypothesis space.

Notice: this experiment used a different BLEU approximation, giving higher scores.

Tuning MBR Decoder

Tuning the scaling parameter α is important: flattening the distribution (α < 1) makes it easier to select a hypothesis other than the MAP hypothesis.

MBR for System Combination

We have seen system combination based on combined n-best lists, e.g. Silja’s hypothesis selection system: essentially n-best list rescoring on a combined n-best list.

MBR works on n-best translations -> it can be used to combine systems.

Example: Gispert et al. 2008, "MBR Combination of Translation Hypotheses from Alternative Morphological Decomposition":
Preprocessing with the MADA and Sakhr taggers
Build 2 translation systems
MBR combination

             mt02-mt05       mt08
             Dev     Test
MADA-based   53.3    52.7    43.7
+MBR         53.7    53.3    44.0
SAKHR-based  52.7    52.8    43.3
+MBR         53.2    53.2    43.8
MBR-combi    54.6    54.6    45.6

MBR for System Combination

In-house experiments combining the 200-best lists of 3 decoders; results on a newswire test set. MBR improvements for the individual systems: ~0.4-0.5 BLEU, 0.1-0.4 TER. MBR-combi improvement over the best single system: 0.9 BLEU, 0.6 TER.

System     TER    BLEU
PSMT1      60.79  28.82
PSMT2      60.76  27.09
SAMT       60.27  28.98
PSMT1-MBR  60.51  29.36
PSMT2-MBR  60.57  27.61
SAMT-MBR   60.11  29.40
MBR-Combi  59.53  30.23

MIRA – Margin Infused Relaxed Algorithm

MIRA

Online large-margin discriminative training:
Online: update after each training example
Large margin: move training examples away from the ‘gray’ area
Discriminative: compare against alternatives (which also means it is supervised)

Originally described by Crammer and Singer, 2003. Applied to statistical MT by Watanabe et al., 2007; also used by Chiang et al., 2009.

Training Algorithm

Generate n-best translations

Update oracle list

Update feature weights

Return averaged feature weights

Weight Update

w_{i+1} = argmin_w ||w - w_i||^2 + C · Σ ξ(ê, e')

subject to, for the oracle translation ê ∈ O_t and each rival translation e' in the n-best list:

  s(f_t, ê) - s(f_t, e') ≥ L(ê, e') - ξ(ê, e'),   ξ(ê, e') ≥ 0

with s(f, e) = w^T · h(f, e), where h(f, e) is the feature vector, O_t the oracle list, and ê the hypothesis closest to the reference translation.

C ≥ 0 controls how aggressive the updates are: a larger C allows larger updates to the weight vector (the ξ are slack variables).
L(…) is the loss function, e.g. the loss in BLEU.
The argmin means that we want to change the weights only as much as needed.
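For a single oracle/rival pair the constrained problem above has a closed-form solution; the sketch below shows that one-constraint update (a simplification of my own for illustration: Watanabe et al. and Chiang et al. handle many constraints per sentence with an iterative optimizer):

```cpp
#include <algorithm>
#include <vector>

// One MIRA update for a single (oracle, rival) pair.
// hOracle, hComp: feature vectors h(f, e_hat) and h(f, e');
// loss is L(e_hat, e'), e.g. the difference in (sentence) BLEU.
// Closed form of the one-constraint QP:
//   w += tau * (hOracle - hComp),
//   tau = min(C, max(0, loss - margin) / ||hOracle - hComp||^2)
void miraUpdate(std::vector<double>& w,
                const std::vector<double>& hOracle,
                const std::vector<double>& hComp,
                double loss, double C) {
  double margin = 0.0, sqNorm = 0.0;
  std::vector<double> delta(w.size());
  for (size_t i = 0; i < w.size(); ++i) {
    delta[i] = hOracle[i] - hComp[i];
    margin += w[i] * delta[i];  // s(f, e_hat) - s(f, e') under current w
    sqNorm += delta[i] * delta[i];
  }
  if (sqNorm == 0.0) return;    // identical feature vectors: nothing to do
  double tau = std::min(C, std::max(0.0, loss - margin) / sqNorm);
  for (size_t i = 0; i < w.size(); ++i) w[i] += tau * delta[i];
}
```

After the update the oracle outscores the rival by at least the loss (unless the step is capped by C), and if the margin constraint already holds, the weights do not move at all, which is exactly the "change only as much as needed" behavior.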

MIRA

[Figure: hypotheses plotted by model score vs. metric score, with the margin and the loss between oracle and rival hypotheses indicated]

MIRA

[Figure: the same model score vs. metric score plot]

Results (Chiang 2009)

GALE 2008 Chinese-English

System  Training  Features  BLEU
Hiero   MERT      11        36.1
Hiero   MIRA      10,990    37.6
Syntax  MERT      25        39.5
Syntax  MIRA      285       40.6

Summary

MBR decoding: select the less risky hypothesis from an n-best list (or lattice).

MIRA: optimize the feature weights for the decoder; works with a very large number of features.