
Predicting Success in Machine Translation

Alexandra Birch, Miles Osborne, Philipp Koehn

School of Informatics, University of Edinburgh

10 Crichton Street, Edinburgh, EH8 9AB, UK

Abstract

The performance of machine translation systems varies greatly depending on the source and target languages involved. Determining the contribution of different characteristics of language pairs to system performance is key to knowing what aspects of machine translation to improve and which are irrelevant. This paper investigates the effect of different explanatory variables on the performance of a phrase-based system for 110 European language pairs. We show that three factors are strong predictors of performance in isolation: the amount of reordering, the morphological complexity of the target language and the historical relatedness of the two languages. Together, these factors contribute 75% to the variability of the performance of the system.

1 Introduction

Statistical machine translation (SMT) has improved over the last decade of intensive research, but for some language pairs translation quality is still low. Certain systematic differences between languages can be used to predict this. Many researchers have speculated on the reasons why machine translation is hard. However, there has never been, to our knowledge, an analysis of what the actual contribution of different aspects of language pairs is to translation performance. This understanding of where the difficulties lie will allow researchers to know where to most gainfully direct their efforts in improving the current models of machine translation.

Many of the challenges of SMT were first outlined by Brown et al. (1993). The original IBM Models were broken down into separate translation and distortion models, recognizing the importance of word order differences in modeling translation. Brown et al. also highlighted the importance of modeling morphology, both for reducing sparse counts and improving parameter estimation and for the correct production of translated forms. We see these two factors, reordering and morphology, as fundamental to the quality of machine translation output, and we would like to quantify their impact on system performance.

It is not sufficient, however, to analyze the morphological complexity of the source and target languages. It is also very important to know how similar the morphology is between the two languages, as two languages which are morphologically complex in very similar ways could be relatively easy to translate. Therefore, we also include a measure of the family relatedness of languages in our analysis.

The impact of these factors on translation is measured by using linear regression models. We perform the analysis with data from 110 different language pairs drawn from the Europarl project (Koehn, 2005). This contains parallel data for the 11 official languages of the European Union, providing a rich variety of different language characteristics for our experiments. Many research papers report results on only one or two language pairs. By analyzing so many language pairs, we are able to provide a much wider perspective on the challenges facing machine translation. This analysis is important as it provides very strong motivation for further research.

The findings of this paper are as follows: (1) each of the main effects, reordering, target language complexity and language relatedness, is a highly significant predictor of translation performance, (2) individually these effects account for just over a third of the variation of the BLEU score, (3) taken together, they account for 75% of the variation of the BLEU score, (4) when removing Finnish results as outliers, reordering explains the most variation, and finally (5) the morphological complexity of the source language is uncorrelated with performance, which suggests that any difficulties that arise with sparse counts are insignificant under the experimental conditions outlined in this paper.

2 Europarl

In order to analyze the influence of different language pair characteristics on translation performance, we need access to a large variety of comparable parallel corpora. A good data source for this is the Europarl Corpus (Koehn, 2005). It is a collection of the proceedings of the European Parliament, dating back to 1996. Version 3 of the corpus consists of up to 44 million words for each of the 11 official languages of the European Union: Danish (da), German (de), Greek (el), English (en), Spanish (es), Finnish (fi), French (fr), Italian (it), Dutch (nl), Portuguese (pt), and Swedish (sv).

In trying to determine the effect of the properties of the languages involved on translation performance, it is very important that other variables be kept constant. Using Europarl, the size of the training data for the different language pairs is very similar, and there are no domain differences, as all systems are trained on translations of roughly the same data.

3 Morphological Complexity

The morphological complexity of the language pairs involved in translation is widely recognized as one of the factors influencing translation performance. However, most statistical translation systems treat different inflected forms of the same lemma as completely independent of one another. This can result in sparse statistics and poorly estimated models. Furthermore, different variations of the lemma may result in crucial differences in meaning that affect the quality of the translation.

Work on improving MT systems' treatment of morphology has focussed on either reducing word forms to lemmas to reduce sparsity (Goldwater and McClosky, 2005; Talbot and Osborne, 2006) or including morphological information in decoding (Dyer, 2007).

Figure 1. Average vocabulary size for each language (en, fr, it, es, pt, el, nl, sv, da, de, fi; vertical axis: average vocabulary size, 100k to 500k).

Although there is a significant amount of research into improving the treatment of morphology, in this paper we aim to discover the effect that different levels of morphology have on translation. We measure the amount of morphological complexity that exists in both languages and then relate this to translation performance.

Some languages seem to be intuitively more complex than others; for instance, Finnish appears more complex than English. There is, however, no obvious way of measuring this complexity. One method of measuring complexity is to choose a number of hand-picked, intuitive properties called complexity indicators (Bickel and Nichols, 2005) and then to count their occurrences. Examples of morphological complexity indicators could be the number of inflectional categories or morpheme types in a typical sentence. This method suffers from the major drawback of finding a principled way of choosing which of the many possible linguistic properties should be included in the list of indicators.

A simple alternative, employed by Koehn (2005), is to use vocabulary size as a measure of morphological complexity. Vocabulary size is strongly influenced by the number of word forms affected by number, case, tense etc., and it is also affected by the amount of agglutination in the language. The complexity of the morphology of a language can therefore be approximated by looking at its vocabulary size.
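As an illustration of this measure (a sketch, not code from the paper), vocabulary size can be computed directly from a tokenized corpus; the file names and the assumption of whitespace-tokenized, lowercased text are hypothetical:

```python
def vocabulary_size(path):
    """Count the number of distinct word forms (types) in a corpus file,
    assuming one sentence per line and whitespace tokenization."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            vocab.update(line.lower().split())
    return len(vocab)

# Hypothetical usage: average a language's vocabulary size over the
# parallel corpora it occurs in (one corpus per language pair).
fi_sides = ["europarl.fi-en.fi", "europarl.fi-de.fi"]  # placeholder file names
print(sum(vocabulary_size(p) for p in fi_sides) / len(fi_sides))
```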


Figure 1 shows the vocabulary size for all relevant languages. Each language pair has a slightly different parallel corpus, and so the size of the vocabulary for each language needs to be averaged. You can see that the Finnish vocabulary is about six times larger (510,632 words) than the English vocabulary (88,880 words). The reason for the large vocabulary size is that Finnish is characterized by a rich inflectional morphology, and it is typologically classified as an agglutinative-fusional language. As a result, words are often polymorphemic and can become remarkably long.

4 Language Relatedness

The morphological complexity of each language in isolation could be misleading. Large differences in morphology between two languages could be more relevant to translation performance than a complex morphology that is very similar in both languages. Languages which are closely related could share morphological forms which might be captured reasonably well in translation models. We include a measure of language relatedness in our analyses to take this into account.

Comparative linguistics is a field of linguistics which aims to determine the historical relatedness of languages. Lexicostatistics, developed by Morris Swadesh in the 1950s (Swadesh, 1955), is an approach to comparative linguistics that is appropriate for our purposes because it results in a quantitative measure of relatedness obtained by comparing lists of lexical cognates.

The lexicostatistic percentages are extracted as follows. First, a list of universal, culture-free meanings is generated. Words are then collected for these meanings for each language under consideration. Lists for particular purposes have been generated; for example, we use the data from Dyen et al. (1992), who developed a list of 200 meanings for 84 Indo-European languages. Cognacy decisions are then made by a trained linguist. For each pair of lists, the cognacy of a form can be positive, negative or indeterminate. Finally, the lexicostatistic percentage is calculated. This percentage is the proportion of meanings for a particular language pair that are cognates, relative to the total without indeterminacy. Factors such as borrowing, tradition and taboo can skew the results.

Language   “animal”   “black”
French     animal     noir
Italian    animale    nero
Spanish    animal     negro
English    animal     black
German     tier       schwarz
Swedish    djur       svart
Danish     dyr        sort
Dutch      dier       zwart

Table 1. An example from the Dyen et al. (1992) cognate list.

A portion of the Dyen et al. (1992) data set is shown in Table 1 as an example. From this data a trained linguist would calculate the relatedness of French, Italian and Spanish as 100%, because their words for “animal” and “black” are cognates. The Romance languages share one cognate with English (“animal” but not “black”), which gives a lexicostatistic percentage of 50%, and no cognates with the rest of the languages, which gives 0%.
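As a sketch of that calculation (our illustration, not the authors' procedure or the Dyen et al. data format), the percentage is the share of cognate judgments among the determinate ones:

```python
def lexicostatistic_percentage(judgements):
    """judgements: one cognacy decision per meaning for a language pair,
    each being 'positive', 'negative' or 'indeterminate'."""
    determinate = [j for j in judgements if j != "indeterminate"]
    if not determinate:
        return None  # no usable meanings for this pair
    cognates = sum(1 for j in determinate if j == "positive")
    return 100.0 * cognates / len(determinate)

# Toy example mirroring Table 1: English and French share "animal" but not "black".
print(lexicostatistic_percentage(["positive", "negative"]))  # -> 50.0
```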

We use the Dyen lexicostatistic percentages as our measure of language relatedness or similarity for all bidirectional language pairs except those involving Finnish, for which there is no data. Finnish is a Finno-Ugric language and is not part of the Indo-European language family, and is therefore not included in the Dyen results. We were not able to recreate the conditions of this study to generate the data for Finnish, as expert linguists with knowledge of all the languages would be required. Excluding Finnish would have been a shame, as it is an interesting language to look at; however, we took care to confirm which of the effects found in this paper still hold when Finnish is excluded. Not being part of the Indo-European languages means that its historical similarity with the other languages is very low; for example, English is more closely related to Hindi than to Finnish. We therefore assume that Finnish has zero similarity with the other languages in the set.

Figure 2 shows the symmetric matrix of language relatedness, where the width of each square is proportional to the value of relatedness. Finnish is the language which is least related to the other languages and has a relatedness score of 0. Spanish-Portuguese is the most related language pair, with a score of 0.87.


Figure 2. Language relatedness: the width of the squares indicates the lexicostatistical relatedness for each language pair (it, sv, en, el, pt, da, es, fr, nl, fi, de).

A measure of family relatedness should improve our understanding of the relationship between morphological complexity and translation performance.

5 Reordering

Reordering refers to differences in word order that occur in a parallel corpus, and the amount of reordering affects the performance of a machine translation system. In order to determine how much it affects performance, we first need to measure it.

5.1 Extracting Reorderings

Reordering is largely driven by syntactic differences between languages and can involve complex rearrangements between nodes in synchronous trees. Modeling reordering exactly would require a synchronous tree-substitution grammar. This representation would be sparse and heterogeneous, limiting its usefulness as a basis for analysis. We make an important simplifying assumption in order for the detection and extraction of reordering data to be tractable and useful. We assume that reordering is a binary process occurring between two blocks that are adjacent in the source. This is similar to the ITG constraint (Wu, 1997); however, our reorderings are not dependent on a synchronous grammar or a derivation which covers the sentences. There are also similarities with the Human-Targeted Translation Edit Rate metric (HTER) (Snover et al., 2006), which attempts to find the minimum number of human edits needed to correct a hypothesis and admits moving blocks of words; however, our algorithm is automatic and does not consider inserts or deletes.

Before describing the extraction of reorderings we need to define some concepts. We define a block $A$ as consisting of a source span, $A_s$, which contains the positions from $A_{s\,min}$ to $A_{s\,max}$ and is aligned to a set of target words. The minimum and maximum positions ($A_{t\,min}$ and $A_{t\,max}$) of the aligned target words mark the block's target span, $A_t$.

A reordering $r$ consists of two blocks, $r_A$ and $r_B$, which are adjacent in the source and whose relative order in the source is reversed in the target. More formally:

$$ r_{A_s} < r_{B_s}, \qquad r_{A_t} > r_{B_t}, \qquad r_{A_{s\,max}} = r_{B_{s\,min}} - 1 $$

A consistent block means that between $A_{t\,min}$ and $A_{t\,max}$ there are no target word positions aligned to source words outside of the block's source span $A_s$. A reordering is consistent if the block projected from $r_{A_{s\,min}}$ to $r_{B_{s\,max}}$ is consistent.
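The consistency test can be written down compactly. The sketch below is our reading of this definition, assuming the word alignment is given as a set of (source position, target position) pairs; it is not the authors' implementation.

```python
def target_span(alignment, s_min, s_max):
    """Project the source span [s_min, s_max] onto the target side."""
    targets = [t for s, t in alignment if s_min <= s <= s_max]
    return (min(targets), max(targets)) if targets else None

def is_consistent(alignment, s_min, s_max):
    """True if no target position inside the projected target span is
    aligned to a source word outside [s_min, s_max]."""
    span = target_span(alignment, s_min, s_max)
    if span is None:
        return True  # unaligned span: nothing contradicts consistency (our assumption)
    t_min, t_max = span
    return all(s_min <= s <= s_max
               for s, t in alignment if t_min <= t <= t_max)
```

A reordering of blocks A and B would then be checked by applying is_consistent to the block projected from the first source position of A to the last source position of B.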

The following algorithm detects reorderings and determines the dimensions of the blocks involved. We step through all the source words, and if a word is reordered in the target with respect to the previous source word, then a reordering is said to have occurred. These two words initially define the blocks A and B. The algorithm then attempts to grow block A from this point towards the source start position, while the target span of A is greater than that of block B and the new block A is consistent. Finally, it attempts to grow block B towards the source end position, while the target span of B is less than that of A and the new reordering is inconsistent.

See Figure 3 for an example of a sentence pair with two reorderings. Initially, a reordering is detected between the Chinese words aligned to “from” and “late”. Block A is grown from “late” to include the whole phrase pair “late last night”. Then block B is grown from “from” to include “Beijing”, and stops because the reordering is then consistent. The next reordering is detected between “arrived in” and “Beijing”. We can see that block A attempts to grow as large a block as possible and block B attempts to grow the smallest block possible.


Figure 3. A sentence pair from the test corpus, with its alignment. Two reorderings are shown with two different dash styles.

The reorderings thus extracted are comparable to those of a right-branching ITG with inversions. This allows for syntactically plausible embedded reorderings. This algorithm has a worst-case complexity of O(n²/2), which occurs when the words in the target appear in the reverse order to the words in the source.

5.2 Measuring Reordering

Our reordering extraction technique allows us to analyze reorderings in corpora according to the distribution of reordering widths. In order to facilitate the comparison of different corpora, we combine statistics about individual reorderings into a sentence-level metric which is then averaged over a corpus.

$$ \mathrm{RQuantity} = \frac{\sum_{r \in R} |r_{A_s}| + |r_{B_s}|}{I} $$

where $R$ is the set of reorderings for a sentence, $I$ is the source sentence length, $A$ and $B$ are the two blocks involved in the reordering, and $|r_{A_s}|$ is the size, or span, of block $A$ on the source side. RQuantity is thus the sum of the spans of all the reordering blocks on the source side, normalized by the length of the source sentence.
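A minimal sketch of this metric, assuming reorderings have already been extracted and are represented here (our own convention) as pairs of inclusive source spans for blocks A and B:

```python
def rquantity(reorderings, source_length):
    """reorderings: list of ((a_start, a_end), (b_start, b_end)) inclusive
    source spans for the blocks A and B of each detected reordering."""
    total = sum((a_end - a_start + 1) + (b_end - b_start + 1)
                for (a_start, a_end), (b_start, b_end) in reorderings)
    return total / source_length

# One reordering whose blocks cover source words 2-4 (A) and 5 (B)
# in a 10-word source sentence contributes (3 + 1) / 10:
print(rquantity([((2, 4), (5, 5))], 10))  # -> 0.4
```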

Corpus                      RQuantity
Europarl, auto align        0.620
WMT06 test, auto align      0.647
WMT06 test, manual align    0.668

Table 2. The reordering quantity for the different reordering corpora for DE-EN.

5.3 Automatic Alignments

Reorderings extracted from manually aligned data can reliably be assumed to be correct. The only exception to this is that embedded reorderings are always right-branching, and these might contradict syntactic structure. In this paper, however, we use alignments that are automatically extracted from the training corpus using GIZA++. Automatic alignments could give very different reordering results. In order to justify using reordering data extracted from automatic alignments, we must show that they are similar enough to gold-standard alignments to be useful as a measure of reordering.

5.3.1 Experimental Design

We select the German-English language pair because it has a reasonably high level of reordering. A manually aligned German-English corpus was provided by Chris Callison-Burch and consists of the first 220 sentences of test data from the 2006 ACL Workshop on Machine Translation (WMT06) test set. This test set is from a held-out portion of the Europarl corpus.

The automatic alignments were extracted by appending the manually aligned sentences to the respective Europarl v3 corpora and aligning them using GIZA++ (Och and Ney, 2003) and the grow-diag-final heuristic (Koehn et al., 2003).

5.3.2 Results

In order to use automatic alignments to extract reordering statistics, we need to show that reorderings from automatic alignments are comparable to those from manual alignments.

We first look at global reordering statistics and then we look in more detail at the reordering distribution of the corpora. Table 2 shows the amount of reordering in the WMT06 test corpus, with both manual and automatic alignments, and in the automatically aligned Europarl DE-EN parallel corpus. We can see that all three corpora show a similar amount of reordering.


Figure 4. Average number of reorderings per sentence mapped against the total width of the reorderings for DE-EN (series: ACL Test Manual, ACL Test Automatic, Euromatrix).

Figure 4 shows that the distribution of reorderings in the three corpora is also very similar. These results provide evidence to support our use of automatic reorderings in lieu of manually annotated alignments. Firstly, they show that our WMT06 test corpus is very similar to the Europarl data, which means that any conclusions that we reach using the WMT06 test corpus will be valid for the Europarl data. Secondly, they show that the reordering behavior of this corpus is very similar when looking at automatic vs. manual alignments.

Although the differences between the reorderings detected in the manually and automatically aligned German-English corpora are minor, we accept that there could be a language pair whose real reordering amount is very different from the expected amount given by the automatic alignments. A particular language pair could have alignments that are very unsuited to the stochastic assumptions of the IBM or HMM alignment models. However, manually aligning 110 language pairs is impractical.

5.4 Amount of reordering for the matrix

Extracting the amount of reordering for each of the 110 language pairs in the matrix required a sampling approach. We randomly extracted a subset of 2000 sentences from each of the parallel training corpora, and from this subset we then extracted the average RQuantity.
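A sketch of this sampling step (an illustration under our own data representation, reusing the rquantity function sketched in Section 5.2):

```python
import random

def average_rquantity(sentence_pairs, sample_size=2000, seed=0):
    """sentence_pairs: list of (source_length, reorderings) tuples, one per
    aligned sentence pair; 'reorderings' uses the format of the rquantity()
    sketch in Section 5.2."""
    random.seed(seed)  # fixed seed only to make the sketch reproducible
    sample = random.sample(sentence_pairs, min(sample_size, len(sentence_pairs)))
    return sum(rquantity(r, n) for n, r in sample) / len(sample)
```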

Figure 5. Reordering amount: the width of the squares indicates the amount of reordering (RQuantity) for each source language (rows) and target language (columns).

In Figure 5 the amount of reordering for each of the language pairs is proportional to the width of the relevant square. Note that the matrix is not quite symmetrical: reordering results differ depending on which language is chosen to measure the reordering span. The lowest reordering scores are generally for languages in the same language group (like Portuguese-Spanish, 0.20, and Danish-Swedish, 0.24) and the highest for languages from different groups (like German-French, 0.64, and Finnish-Spanish, 0.61).

5.5 Language similarity and reordering

In this paper we use linear regression models to determine the correlation and significance of various explanatory variables with respect to the dependent variable, the BLEU score. Ideally the explanatory variables involved should be independent of each other; however, the amount of reordering in a parallel corpus could easily be influenced by family relatedness. We therefore investigate the correlation between these variables.

Figure 6 shows the plot of the reordering amount against language similarity. The regression is highly significant and has an R² of 0.2347. This means that reordering is correlated with language similarity and that 23% of the variation in reordering can be explained by language similarity.


Figure 6. Reordering compared to language similarity, with regression (x-axis: reordering amount; y-axis: language similarity).

6 Experimental Design

We used the phrase-based model Moses (Koehn et al., 2007) for the experiments, with all the standard settings, including a lexicalized reordering model and a 5-gram language model. Tests were run on the ACL WSMT 2008 test set (Callison-Burch et al., 2008).

6.1 Evaluation of Translation Performance

We use the BLEU score (Papineni et al., 2002) to evaluate our systems. While the role of BLEU in machine translation evaluation is a much discussed topic, it is generally assumed to be an adequate metric for comparing systems of the same type.

Figure 7 shows the BLEU score results for the matrix. Comparing this figure to Figure 5, there seems to be a clear negative correlation between reordering amount and translation performance.

6.2 Regression Analysis

We perform multiple linear regression analyses using measures of morphological complexity, language relatedness and reordering amount as our independent variables. The dependent variable is the translation performance metric, the BLEU score.

We then use a t-test to determine whether the coefficients for the independent variables are reliably different from zero. We also test how well the model explains the data using an R² test. The two-tailed significance levels of coefficients and R² are also given, where * means p < 0.05, ** means p < 0.01, and *** means p < 0.001.
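As an illustration of this kind of analysis (not the authors' scripts), a regression of BLEU on the explanatory variables could be fit with statsmodels; the data frame and column names here are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: one row per language pair, with its BLEU score and
# the explanatory variables described above.
df = pd.read_csv("language_pairs.csv")  # columns: bleu, tgt_vocab, similarity, reordering

# Normalize the explanatory variables so their coefficients are comparable.
for col in ["tgt_vocab", "similarity", "reordering"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

model = smf.ols("bleu ~ tgt_vocab + similarity + reordering", data=df).fit()
print(model.summary())         # coefficients with t-statistics and two-tailed p-values
print("R^2:", model.rsquared)  # fraction of the BLEU variance explained
```

Squared terms and the reordering/similarity interaction reported later in Table 3 can be expressed in the same formula syntax, for example as I(tgt_vocab**2) and reordering:similarity.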

Figure 7. System performance: the width of the squares indicates the system performance in terms of the BLEU score for each source language (rows) and target language (columns).

Explanatory Variable        Coefficient
Target Vocab. Size          -3.885 ***
Language Similarity          3.274 ***
Reordering Amount           -1.883 ***
Target Vocab. Size²          1.017 ***
Language Similarity²        -1.858 **
Interaction: Reord/Sim      -1.4536 ***

Table 3. The impact of the various explanatory features on the BLEU score via their coefficients in the minimal adequate model.

7 Results

7.1 Combined Model

The first question we are interested in answering is which factors contribute most and how they interact. We fit a multiple regression model to the data. The source vocabulary size has no significant effect on the outcome. All explanatory variable vectors were normalized to be more comparable.

In Table 3 we can see the relative contribution of the different features to the model. Source vocabulary size did not contribute significantly to the explanatory power of this multiple regression model and was therefore not included. The fraction of the variance explained by the model, or its goodness of fit, the R², is 0.750, which means that 75% of the variation in BLEU can be explained by these three factors. The interaction of reordering amount and language relatedness is the product of the values of these two features, and it is in itself an important explanatory feature.

To make sure that our regression is valid, we need to consider the special case of Finnish. Data points where Finnish is the target language are outliers. Finnish has the lowest language similarity with all other languages and the largest vocabulary size. It also has very high amounts of reordering, and the lowest BLEU scores when it is the target language. When Finnish is excluded as a source and target language, the multiple regression of Table 3 shows that all the effects are still very significant, with the model's R² dropping only slightly to 0.68.

The coefficients of the variables in the multiple regression model have only limited usefulness as a measure of the impact of the explanatory variables in the model. One important factor to consider is that if the explanatory variables are highly correlated, then the values of the coefficients are unstable: the model could attribute more importance to one or the other variable without changing the overall fit. This is the problem of multicollinearity. Our explanatory variables are all correlated, but a large amount of this correlation can be explained by looking at language pairs with Finnish as the target language. Excluding these data points, only language relatedness and reordering amount are still correlated; see Section 5.5 for more details.

7.2 Contribution in isolation

In order to establish the relative contribution of the variables, we isolate their impact on the BLEU score by modeling them in separate linear regression models.

Figure 8 shows a simple regression model over the plot of BLEU scores against target vocabulary size. This figure shows groups of data points with the same target language in almost vertical lines. Each language pair has a separate parallel training corpus, but the target vocabulary size for one language will be very similar in all of them. The variance in BLEU amongst the group with the same target language is then largely explained by the other factors, similarity and reordering.

Figure 8. BLEU score of experiments compared to target vocabulary size, showing the regression.

Figure 9 shows a simple regression model over the plot of BLEU scores against source vocabulary size. This regression model shows that in isolation source vocabulary size is significant (p < 0.05), but that this is due to the distorting effect of Finnish. Excluding results that include Finnish, there is no longer any significant correlation with BLEU. The source morphology might be significant for models trained on smaller data sets, where model parameters are more sensitive to sparse counts.

Figure 10 shows the simple regression model over the plot of BLEU scores against the amount of reordering. This graph shows that with more reordering, the performance of the translation model decreases. Data points with low levels of reordering and high BLEU scores tend to be language pairs where both languages are Romance languages. High BLEU scores with high levels of reordering tend to have German as the source language and a Romance language as the target.

Figure 11 shows the simple regression model over the plot of BLEU scores against language relatedness. The left-hand line of points contains the results involving Finnish. The vertical group of points just to the right contains the results where Greek is involved. The next set of points contains the results where the translation is between Germanic and Romance languages. The final cloud to the right contains results where both languages are in the same family, either the Romance or the Germanic languages.

Table 4 shows the amount of the variance of BLEU explained by the different models.


Figure 9. BLEU score of experiments compared to source vocabulary size, highlighting the Finnish source vocabulary data points. The regression includes Finnish in the model.

Explanatory Variable     R²
Target Vocab. Size       0.388 ***
Reordering Amount        0.384 ***
Language Similarity      0.366 ***
Source Vocab. Size       0.045 *

Excluding Finnish
Target Vocab. Size       0.219 ***
Reordering Amount        0.332 ***
Language Similarity      0.188 ***
Source Vocab. Size       0.007

Table 4. Goodness of fit of different simple linear regression models which each use just one explanatory variable. The significance level represents the probability that the regression is appropriate. The second set of results excludes Finnish as the source and target language.

As these are simple regression models, with just one explanatory variable each, multicollinearity is avoided. This table shows that each of the main effects explains about a third of the variance of BLEU, which means that they can be considered to be of equal importance. When the Finnish examples are removed, only reordering retains its explanatory power: target vocabulary size and language similarity become less important, and source vocabulary size no longer correlates with performance.

8 Conclusion

We have broken down the relative impact of the characteristics of different language pairs on translation performance.

Figure 10. BLEU score of experiments compared to amount of reordering.

Figure 11. BLEU score of experiments compared to language relatedness.

The analysis is able to account for a large percentage (75%) of the variability of the performance of the system, which shows that we have captured the core challenges for the phrase-based model. We have shown that the impact of the main factors is about the same, with reordering and target vocabulary size each accounting for an R² of about 0.38 in isolation.

These conclusions are only strictly relevant to the model for which this analysis has been performed, the phrase-based model. However, we suspect that the conclusions would be similar for most statistical machine translation models because of their dependence on automatic alignments. This will be the topic of future work.


References

Balthasar Bickel and Johanna Nichols. 2005. The World Atlas of Language Structures, chapter Inflectional synthesis of the verb. Oxford University Press.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 70–106, Columbus, Ohio, June. Association for Computational Linguistics.

Isidore Dyen, Joseph Kruskal, and Paul Black. 1992. An Indoeuropean classification, a lexicostatistical experiment. Transactions of the American Philosophical Society, 82(5).

Chris Dyer. 2007. The ’noisier channel’: Translation from morphologically complex languages. In Proceedings of the Workshop on Statistical Machine Translation, Prague, Czech Republic.

Sharon Goldwater and David McClosky. 2005. Improving statistical MT through morphological analysis. In Proceedings of Empirical Methods in Natural Language Processing.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference, pages 127–133, Edmonton, Canada. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Association for Computational Linguistics Companion Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):9–51.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics, pages 311–318, Philadelphia, USA.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In AMTA.

Morris Swadesh. 1955. Lexicostatistic dating of prehistoric ethnic contacts. In Proceedings of the American Philosophical Society, volume 96, pages 452–463.

David Talbot and Miles Osborne. 2006. Modelling lexical redundancy for machine translation. In Proceedings of the Association of Computational Linguistics, Sydney, Australia.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

