Translation Models: Taking Translation Direction into Account
Gennadi Lembersky, Noam Ordan, Shuly Wintner. ISCOL, 2011
Statistical Machine Translation (SMT)
• Given a foreign sentence f:
 ▫ "Maria no dio una bofetada a la bruja verde"
• Find the most likely English translation e:
 ▫ "Maria did not slap the green witch"
• The most likely English translation e is given by arg max P(e|f)
 ▫ P(e|f) estimates the conditional probability of any e given f
• How to estimate P(e|f)? Noisy channel:
 ▫ Decompose P(e|f) into P(f|e) · P(e) / P(f)
 ▫ Estimate P(f|e) from a parallel corpus (translation model)
 ▫ Estimate P(e) from a monolingual corpus (language model)
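The noisy-channel decomposition above can be sketched as a reranking loop. This is a minimal illustration with made-up toy probabilities, not an actual decoder; a real system gets P(f|e) from the translation model and P(e) from the language model:

```python
import math

def noisy_channel_best(candidates, tm_prob, lm_prob):
    """Return the candidate e maximizing P(f|e) * P(e).

    P(f) is constant over all candidates, so it can be ignored.
    Scores are summed in log space for numerical stability."""
    return max(candidates,
               key=lambda e: math.log(tm_prob[e]) + math.log(lm_prob[e]))

candidates = ["Maria did not slap the green witch",
              "Maria no gave a slap to the witch green"]
tm_prob = {candidates[0]: 0.02, candidates[1]: 0.05}    # toy P(f|e) values
lm_prob = {candidates[0]: 0.01, candidates[1]: 0.0001}  # toy P(e) values

best = noisy_channel_best(candidates, tm_prob, lm_prob)
```

Note how the language model overrides the translation model here: the fluent candidate wins even though its P(f|e) is lower.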
Translation Model
• How to model P(f|e)?
 ▫ Learn the parameters of P(f|e) from a parallel corpus
 ▫ Estimate translation model parameters at the phrase level: explicit modeling of word context captures local reorderings and local dependencies
• IBM Models define how words in a source sentence can be aligned to words in a parallel target sentence
 ▫ EM is used to estimate the parameters
• Aligned words are extended to phrases
• Result: a phrase table
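The "aligned words are extended to phrases" step can be sketched as follows. This is a simplified version of the standard consistency check, assuming the word alignment is given as a set of (source index, target index) pairs; it is not the exact Moses implementation:

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Extract phrase pairs consistent with the word alignment.

    A (source span, target span) pair is consistent if no word inside
    the target span is aligned to a source word outside the span."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions aligned to words in the source span
            tps = [t for (s, t) in alignment if i1 <= s <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # consistency check: no crossing alignment links
            if any(j1 <= t <= j2 and not (i1 <= s <= i2)
                   for (s, t) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = ["la", "bruja", "verde"]
tgt = ["the", "green", "witch"]
alignment = {(0, 0), (1, 2), (2, 1)}  # la-the, bruja-witch, verde-green
pairs = extract_phrase_pairs(src, tgt, alignment)
```

On this toy example the reordered pair ("bruja verde", "green witch") is extracted, while inconsistent spans such as ("la bruja", "the green") are rejected.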
Log-Linear Models
• Log-linear models:

  ê = arg max_e P(e|f) = arg max_e Σ_i λ_i h_i(f, e)

 ▫ where the h_i are the feature functions and the λ_i are the model parameters
 ▫ typical feature functions: phrase translation probabilities, lexical translation probabilities, language model probability, reordering model
• Model parameters are estimated (tuned) with discriminative training, e.g. the MERT algorithm (Och, 2003)
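The log-linear score can be sketched as a weighted sum of feature functions. The feature values and weights below are made up for illustration, not tuned parameters:

```python
import math

def loglinear_score(features, weights):
    """Score a translation hypothesis: sum_i lambda_i * h_i(f, e)."""
    return sum(weights[name] * value for name, value in features.items())

# Toy feature values (log-probabilities) for two hypotheses.
weights = {"tm": 0.3, "lex": 0.2, "lm": 0.5}
h1 = {"tm": math.log(0.05), "lex": math.log(0.04), "lm": math.log(0.0001)}
h2 = {"tm": math.log(0.02), "lex": math.log(0.03), "lm": math.log(0.01)}

best = max([h1, h2], key=lambda h: loglinear_score(h, weights))
```

Tuning (e.g. MERT) searches for the weight vector λ that maximizes translation quality on a development set; here the weights are fixed by hand.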
Evaluation
• Human evaluation is not practical – too slow and costly
• Automatic evaluation is based on a human reference translation
 ▫ The output of an MT system is compared to the human translation of the same set of sentences
 ▫ The metrics basically calculate the distance between the MT output and the reference translation
• Dozens of metrics have been developed
 ▫ BLEU is the most popular one
 ▫ METEOR and TER are close behind
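The distance computation can be illustrated with a BLEU-style modified (clipped) n-gram precision. This is only one ingredient of the real metric, which combines precisions up to 4-grams with a brevity penalty:

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference (clipping)."""
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

# Classic clipping example: repeating a reference word does not help.
cand = "the the the cat".split()
ref = "the cat sat".split()
p1 = modified_precision(cand, ref, 1)
```

Without clipping, the degenerate candidate above would score 4/4 on unigrams; clipping brings it down to 2/4.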
Original vs. Translated Texts
Given this simplified model:
Two points are made with regard to the “intermediate component” (TM and LM):
1. TM is blind to direction (but see Kurokawa et al., 2009)
2. LMs are based on originally written texts.
[Diagram: source text → translation model (TM) → target text; the target side is scored by the language model (LM)]
Original vs. Translated Texts
Translated texts are ontologically different from non-translated texts; they generally exhibit:
1. Simplification of the message, the grammar, or both (Al-Shabab, 1996; Laviosa, 1998);
2. Explicitation, the tendency to spell out implicit utterances that occur in the source text (Blum-Kulka, 1986).
Original vs. Translated Texts
• Translated texts can be distinguished from non-translated texts with high accuracy (87% and more)
 ▫ for Italian (Baroni & Bernardini, 2006)
 ▫ for Spanish (Ilisei et al., 2010)
 ▫ for English (Koppel & Ordan, 2011)
How Does Translation Direction Affect MT?
• Language models
 ▫ Our work (accepted to EMNLP) shows that LMs trained on translated texts are better for MT systems than LMs trained on original texts.
• Translation models
 ▫ Kurokawa et al. (2009) showed that when translating French into English, it is better to use a parallel corpus that was translated from French into English, and vice versa.
 ▫ This work supports that claim and extends it (in review for WMT).
Our Setup
• Canadian Hansard corpus: a parallel French–English corpus
 ▫ 80% original English (EO)
 ▫ 20% original French (FO)
 ▫ The 'source' language is marked
• Two scenarios:
 ▫ Balanced: 750K FO sentences and 750K EO sentences
 ▫ Biased: 750K FO sentences and 3M EO sentences
• MOSES PB-SMT toolkit
• Tuning & evaluation:
 ▫ 1,000 FO sentences for tuning and 5,000 FO sentences for evaluation
Baseline Experiments
• We translate French-to-English
• EO – train the phrase table on the EO portion of the parallel corpus
• FO – train the phrase table on the FO portion of the parallel corpus
• FO+EO – train the phrase table on the whole parallel corpus
Baseline Results

Set       System  BLEU   Size       Time
Balanced  EO      28.44  1,391,365  1.04
          FO      31.92  1,308,726  0.98
          FO+EO   31.72  2,429,807  1.09
Biased    EO      29.53  4,236,189  1.22
          FO      31.92  1,308,726  0.98
          FO+EO   32.85  5,101,973  1.15
System A: Two Phrase Tables
• EO – train a phrase table on the EO portion of the parallel corpus
• FO – train a phrase table on the FO portion of the parallel corpus
• System A – let MOSES use both phrase tables
 ▫ Log-linear model training gives each phrase table different weights
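The two-table idea can be sketched as follows. This is a toy re-implementation of the concept, not the Moses multiple-table mechanism, and the weights are made up:

```python
import math

def best_translation(src_phrase, fo_table, eo_table, w_fo, w_eo):
    """Pick the translation maximizing a weighted log score over both tables.

    Each table maps a source phrase to {translation: probability};
    each table contributes with its own tuned weight."""
    scored = []
    for table, w in ((fo_table, w_fo), (eo_table, w_eo)):
        for tgt, p in table.get(src_phrase, {}).items():
            scored.append((w * math.log(p), tgt))
    return max(scored)[1] if scored else None

fo = {"bruja verde": {"green witch": 0.6}}
eo = {"bruja verde": {"green witch": 0.3, "green sorceress": 0.1}}
choice = best_translation("bruja verde", fo, eo, w_fo=1.0, w_eo=0.5)
```

In the real system the per-table weights are not set by hand: they fall out of log-linear tuning, which is exactly what gives each phrase table "different weights".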
System A Results

Set       System   BLEU   Size       Time
Balanced  SystemA  33.21  2,700,091  1.89
Biased    SystemA  33.54  5,544,915  2.39

• In the balanced scenario we gained 1.29 BLEU
• In the biased scenario we gained 0.69 BLEU
• The cost is decoding time and memory
Looking Inside…
• Complete table – the phrase table obtained after training
• Filtered table – a phrase table that contains only phrases that appear in the evaluation set
A Few Observations… / 1
• Balanced set / complete tables
 ▫ The FO table has many more unique French phrases (15.8M vs. 13M)
 ▫ The EO table has more translation options per source phrase (1.42 vs. 1.33)
 ▫ The source phrases in the intersection are shorter (3.76 vs. 5.07–5.16), but they have more translations (3.08–3.21 vs. 1.09–1.10)
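Statistics like these can be computed directly from two phrase tables. A sketch, assuming each table is represented as a dict mapping source phrases to lists of translations:

```python
def table_stats(fo_table, eo_table):
    """Compare two phrase tables: unique source phrases, average number
    of translations per source phrase, and the size of the intersection
    of their source-phrase sets."""
    fo_src, eo_src = set(fo_table), set(eo_table)
    avg = lambda t: sum(len(v) for v in t.values()) / len(t)
    return {
        "fo_unique_sources": len(fo_src),
        "eo_unique_sources": len(eo_src),
        "fo_avg_translations": avg(fo_table),
        "eo_avg_translations": avg(eo_table),
        "intersection": len(fo_src & eo_src),
    }

# Tiny toy tables, just to exercise the function.
fo = {"bruja": ["witch"], "bruja verde": ["green witch"]}
eo = {"bruja": ["witch", "hag"], "verde": ["green"]}
stats = table_stats(fo, eo)
```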
A Few Observations… / 2
• Balanced set / filtered tables
 ▫ The intersection comprises 96.1% of the translation phrase pairs in the FO table and 98.3% of those in the EO table.
A Few Observations… / 3
• Biased set – we added 2,250,000 English-original sentences. What happens?
 ▫ In the 'complete' EO table – everything grows
• In the filtered tables:
 ▫ the number of phrase pairs increases by a factor of 3
 ▫ the number of unique source phrases increases by a third, but the coverage of French phrases does not improve by much
 ▫ the average number of translations increases by a factor of 2.3 (from 13.2 to 30.3)
  ▫ Long tail – the probability mass is split among a larger number of translations; good translations get lower probability than in the FO table
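The long-tail effect follows directly from relative-frequency estimation: p(e|f) = count(f, e) / count(f), so adding translation options for the same source phrase drains probability from the good ones. A toy calculation with made-up counts:

```python
from collections import Counter

def translation_probs(counts):
    """Relative-frequency estimates p(e|f) for one source phrase."""
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

# FO-style table: few options, the good translation keeps most of the mass.
fo = translation_probs(Counter({"green witch": 8, "green sorceress": 2}))

# EO-style table: the same good translation plus a long tail of rare options.
eo = translation_probs(Counter({"green witch": 8, "green sorceress": 2,
                                "witch green": 1, "verdant witch": 1,
                                "the green hag": 1, "emerald witch": 1}))
```

Here the best translation drops from p = 0.8 to p ≈ 0.57 merely because more alternatives share the denominator.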
How Does MOSES Select Phrases?
• Balanced set
 ▫ 96.5% come from the FO table
 ▫ 99.3% of the phrase pairs selected from the intersection originated in the FO table
• Biased set
 ▫ 94.5% come from the FO table
 ▫ 98.2% of the phrase pairs selected from the intersection originated in the FO table
The Tuning Effect / 1
• A question: is the FO phrase table better than the EO phrase table, or does it only become better during tuning?
• Let's test System A with the initial (pre-tuning) configuration and with the configuration generated by tuning.
The Tuning Effect / 2
• Balanced set / before tuning
 ▫ only 58% come from the FO table
 ▫ 57.7% of the phrase pairs selected from the intersection originated in the FO table
• Balanced set / after tuning
 ▫ 95.4% come from the FO table
 ▫ 99.3% of the phrase pairs selected from the intersection originated in the FO table
The Tuning Effect / 3
• The decoder prefers the FO table in the initial configuration (58%).
• The preference becomes much stronger after tuning (95.4%).
• Interestingly, the decoder doesn't just replace EO phrases with FO phrases; it searches for longer phrases:
 ▫ The average length of a phrase selected from the EO table increases by about 1.5 words.
New Experiment: System B
• Based on these results, we can throw away the intersection subset of the EO phrase table
 ▫ We expect a small loss in quality, but a significant improvement in translation speed.
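The System B construction can be sketched as a filter over the EO table. A sketch assuming phrase tables are dicts keyed by source phrase, not the actual implementation:

```python
def build_system_b(fo_table, eo_table):
    """Keep the full FO table, but drop every EO entry whose source
    phrase also appears in the FO table (i.e. the intersection)."""
    eo_only = {src: trans for src, trans in eo_table.items()
               if src not in fo_table}
    return fo_table, eo_only

fo = {"bruja": ["witch"], "bruja verde": ["green witch"]}
eo = {"bruja": ["witch", "hag"], "verde": ["green"]}
fo_kept, eo_kept = build_system_b(fo, eo)
```

The EO table shrinks to the phrases the FO table cannot cover at all, which is why decoding gets faster while quality barely drops.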
System B Results

Set       System   BLEU   Size       Time
Balanced  EO       28.44  1,391,365  1.04
          FO       31.92  1,308,726  0.98
          FO+EO    31.72  2,429,807  1.09
          SystemA  33.21  2,700,091  1.89
          SystemB  33.19  1,327,955  0.94
Biased    EO       29.53  4,236,189  1.22
          FO       31.92  1,308,726  0.98
          FO+EO    32.85  5,101,973  1.15
          SystemA  33.54  5,544,915  2.39
          SystemB  33.34  1,382,017  0.95
What About a Classified Corpus?
• Annotation of the source language is rarely available in parallel corpora.
 ▫ Will our System A and System B still outperform the FO+EO and FO MT systems?
• We use an SVM for classification; our features are punctuation marks and n-grams of part-of-speech tags.
• We train the classifier on an English–French subset of the Europarl corpus.
• Accuracy is about 73.5%
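The feature extraction for this classifier can be sketched as follows. This only builds the feature vector (punctuation-mark counts plus POS-tag bigram counts) that would be fed to an SVM; the tag set and helper are illustrative, not the talk's exact feature set:

```python
from collections import Counter

PUNCT = set(".,;:!?()\"'-")

def features(tokens, pos_tags, n=2):
    """Feature vector for translationese classification:
    punctuation-mark counts plus POS-tag n-gram counts (here bigrams).
    The resulting sparse dict would be fed to a linear SVM."""
    feats = Counter(tok for tok in tokens if tok in PUNCT)
    feats.update(tuple(pos_tags[i:i + n])
                 for i in range(len(pos_tags) - n + 1))
    return feats

toks = ["Maria", "did", "not", "slap", "the", "green", "witch", "."]
tags = ["NNP", "VBD", "RB", "VB", "DT", "JJ", "NN", "."]
f = features(toks, tags)
```

Both feature families are content-independent, which is what lets the classifier generalize across topics when detecting translated text.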
Classified System Results

Set       System                BLEU
Balanced  EO+FO                 31.72
          FO (annotated)        31.92
          FO (classified)       32.04
          SystemA (classified)  32.91
          SystemB (classified)  32.57
          SystemA (annotated)   33.21
          SystemB (annotated)   33.19
Thank You!