
Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 440–451, Sofia, Bulgaria, August 8-9, 2013. © 2013 Association for Computational Linguistics

Efficient solutions for word reordering in German-English phrase-based statistical machine translation

Arianna Bisazza and Marcello Federico
Fondazione Bruno Kessler

Trento, Italy
{bisazza,federico}@fbk.eu

Abstract

Despite being closely related languages, German and English are characterized by important word order differences. Long-range reordering of verbs, in particular, represents a real challenge for state-of-the-art SMT systems and is one of the main reasons why translation quality is often so poor in this language pair. In this work, we review several solutions to improve the accuracy of German-English word reordering while preserving the efficiency of phrase-based decoding. Among these, we consider a novel technique to dynamically shape the reordering search space and effectively capture long-range reordering phenomena. Through an extensive evaluation including diverse translation quality metrics, we show that these solutions can significantly narrow the gap between phrase-based and hierarchical SMT.

1 Introduction

Modeling the German-English language pair is known to be a challenging task for state-of-the-art statistical machine translation (SMT) methods. A major factor of difficulty is given by word order differences that yield important long-range reordering phenomena.

Thanks to specific reordering modeling components, phrase-based SMT (PSMT) systems (Zens et al., 2002; Koehn et al., 2003; Och and Ney, 2002) are generally good at handling local reordering phenomena that are not captured inside phrases. However, they typically fail to predict long reorderings. On the other hand, hierarchical SMT (HSMT) systems (Chiang, 2005) can learn reordering patterns by means of discontinuous translation rules, and are therefore considered a better choice for language pairs characterized by massive and hierarchical reordering.

Looking at the results of the last edition of the Workshop on Statistical Machine Translation (WMT12) (Callison-Burch et al., 2012), no particular SMT approach appears to be clearly dominating. In both language directions (official results excluding the online systems), the rule-based systems outperformed all SMT approaches, and among the best SMT systems we find a variety of approaches: pure phrase-based, phrase-based and hierarchical system combination, n-gram based, a rich syntax-based approach, and a phrase-based system coupled with POS-based pre-ordering. This gives an idea of how challenging this language pair is for SMT and raises the question of which SMT approach is best suited to model it.

In this work, we aim at answering this question by focusing on the word reordering problem, which is known to be an important factor of SMT performance (Birch et al., 2008). We hypothesize that PSMT can be as successful for German-English as the more computationally costly HSMT approach, provided that the reordering-related parameters are carefully chosen and the best available reordering models are used. More specifically, our study covers the following topics: distortion functions and limits, and dynamic shaping of the reordering search space based on a discriminative reordering model.

We first review these topics, and then evaluate them systematically on the WMT task using both generic and reordering-specific metrics, with the aim of providing a reference for future system developers' choices.

2 Background

Word order differences between German and English are mainly found at the clause (global) level, as opposed to the phrase (local) level. We refer to Collins et al. (2005) and Gojun and Fraser (2012) for a detailed description of the German clause structure. To briefly summarize, we can say that the verb-second order of German main clauses contrasts with the rigid SVO structure of English, as does the clause-final verb position of German subordinate clauses. A further difficulty is given by the German discontinuous verb phrases, where the main verb is separated from the inflected auxiliary or modal. The distance between the two parts of a verb phrase can be arbitrarily long, as shown in the following example:

[DE] Jedoch konnten sie Kinder in Teilen von Helmand und Kandahar im Süden aus Sicherheitsgründen nicht erreichen.

[EN] But they could not reach children in parts of Helmand and Kandahar in the south for security reasons.

Translating this sentence with a PSMT engine implies performing two very long jumps that are not even considered by typical systems employing a distortion limit of 6 or 8 words. At the same time, increasing the distortion limit to very high values is known to have a negative impact on both efficiency and translation quality (cf. results presented later in this paper).

Because reordering patterns of this kind are very common between German and English, this paper focuses on techniques that enable the PSMT decoder to explore long jumps and thus improve reordering accuracy without hurting efficiency or general translation quality.

2.1 Alternative approaches

German-English reordering in SMT has been widely studied and is still an open topic. In this work, we only consider efficient solutions that are fully integrated into the decoding process, and that do not require syntactic parsers or manual reordering rules. Still, it has to be mentioned that several alternative solutions were proposed in the literature. A well-known strategy consists of pre-ordering the German sentence in an English-like order by applying a set of manually written rules to its syntactic parse tree (Collins et al., 2005).1

Other approaches learn the pre-ordering rules automatically, from syntactic parses (Xia and McCord, 2004; Genzel, 2010) or from part-of-speech labels (Niehues and Kolss, 2009). In the former case, pre-ordering decisions are typically taken deterministically (i. e. one permutation per sentence), whereas in the latter, multiple alternatives are represented as word lattices, and the optimal path is selected by the decoder at translation time. In (Tromble and Eisner, 2009), pre-ordering is cast as a permutation problem and solved by a model that estimates the probability of reversing the relative order of any two input words.

1 A similar solution for the opposite translation direction (English-German) was proposed by Gojun and Fraser (2012).

In the field of tree-based SMT, positive results in German-English were achieved by combining syntactic translation rules with unlabeled hierarchical SMT rules (Hoang and Koehn, 2010). More recently, Braune et al. (2012) proposed to improve the long-range reordering capability of an HSMT system by integrating constraints based on clausal boundaries and by manually selecting the rule patterns applicable to long word spans. The paper did not analyse the impact of the technique on efficiency.

2.2 Evaluation methods

A large number of previous works on word reordering measured their success with general-purpose metrics such as BLEU (Papineni et al., 2001) or METEOR (Banerjee and Lavie, 2005). These metrics, however, are only indirectly sensitive to word order and do not sufficiently penalize long-range reordering errors, as demonstrated for instance by Birch et al. (2010). While BLEU remains a standard choice for many evaluation campaigns, we believe it is extremely important to complement it with metrics that are specifically designed to capture word order differences. In this work, we adopt two reordering-specific metrics in addition to BLEU and METEOR:

Kendall Reordering Score (KRS). As proposed by Birch et al. (2010), the KRS measures the similarity between the input-output reordering and the input-reference reordering. This is done by converting word alignments to permutations and computing a permutation distance among them. When interpolated with BLEU, this score is called LRscore.2

2 Thus, our KRS results correspond exactly to the LRscore(α=1) presented in other papers.

Verb-specific KRS (KRS-V). The ideal way to automatically evaluate our systems would be to use syntax- or semantics-based metrics, as the impact of long reordering errors is particularly important at these levels. As a light-weight alternative, we instead concentrate the evaluation on those word classes that are typically crucial to guess the general structure of a sentence. To this end, we adopt a word-weighted version of the KRS and set the weights to 1 for verbs and 0 for all other words, so that only verb reordering errors are captured. We call the resulting metric KRS-V. The KRS-V rates a translation hypothesis as perfect (100%) when the translations of all source verbs are located in their correct position, regardless of the other words' ordering.
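To make the metric computation concrete, here is a minimal Python sketch of a word-weighted Kendall reordering score over two permutations of the source positions. It is an illustration only: the input permutations are assumed to be already derived from word alignments, and the pair weighting and normalization used here are our own simplifying assumptions rather than the exact formulation of Birch et al. (2010).

```python
from itertools import combinations

def weighted_kendall_score(perm_sys, perm_ref, weights):
    """Word-weighted Kendall reordering score (illustrative sketch).

    perm_sys / perm_ref: for each source position, its rank in the
    system output / reference order (derived from word alignments).
    weights: one weight per source position; uniform weights give a
    plain KRS-like score, verb-only weights a KRS-V-like score.
    """
    total = disc = 0.0
    for i, j in combinations(range(len(perm_sys)), 2):
        w = (weights[i] + weights[j]) / 2.0   # assumed pair weighting
        total += w
        # discordant pair: the two orders disagree on i vs. j
        if (perm_sys[i] < perm_sys[j]) != (perm_ref[i] < perm_ref[j]):
            disc += w
    if total == 0:
        return 100.0   # no weighted pairs: vacuously perfect
    return 100.0 * (1.0 - disc / total)

# One swapped word pair out of four words:
print(weighted_kendall_score([0, 2, 1, 3], [0, 1, 2, 3], [1, 1, 1, 1]))
```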

3 Early distortion cost

In its original formulation, the PSMT approach includes a basic reordering model, called distortion cost, that exponentially penalizes longer jumps among consecutively translated phrases simply based on their distance. Thus, a completely monotonic translation has a total distortion cost of zero.

A weakness of this model is that it penalizes long jumps only when they are performed, rather than accumulating their cost gradually. As an effect, hypotheses with gaps (i. e. uncovered input positions) can proliferate and cause the pruning of more monotonic hypotheses that could lead to overall better translations.

To solve this problem, Moore and Quirk (2007) proposed an improved version of the distortion cost function that anticipates the gradual accumulation of the total distortion cost, making hypotheses with the same number of covered words more comparable with one another. Early distortion cost (as called in Moses, or "distortion penalty estimation" in the original paper) is computed by a simple algorithm that keeps track of the uncovered input positions. Note that this option affects the distortion feature function, but not the distortion limit, which always corresponds to the maximum distance allowed between consecutively translated phrases.
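To illustrate the idea, here is a simplified Python sketch: at each phrase expansion we charge the change in (distortion accumulated so far plus an admissible estimate of the future distortion needed to jump back to the leftmost uncovered position). This is our own simplification, not the exact algorithm of Moore and Quirk (2007) or its Moses implementation, but it shares the key property: a complete hypothesis receives the same total distortion as under the standard cost, while hypotheses that open gaps pay for them immediately.

```python
def step_distortion(prev_end, start):
    """Standard distortion of one jump: |start - prev_end - 1|."""
    return abs(start - prev_end - 1)

def pending_estimate(coverage, prev_end):
    """Admissible future-cost estimate: the decoder must at least
    jump back to the leftmost uncovered position."""
    gaps = [i for i, covered in enumerate(coverage) if not covered]
    return abs(gaps[0] - prev_end - 1) if gaps else 0

def early_costs(phrase_spans, sent_len):
    """Per-step early distortion costs for a sequence of source
    spans (start, end) translated in the given order."""
    coverage = [False] * sent_len
    prev_end, accumulated, prev_bound, costs = -1, 0, 0, []
    for start, end in phrase_spans:
        accumulated += step_distortion(prev_end, start)
        for i in range(start, end + 1):
            coverage[i] = True
        prev_end = end
        bound = accumulated + pending_estimate(coverage, prev_end)
        costs.append(bound - prev_bound)   # charge the increase now
        prev_bound = bound
    return costs

# Jumping ahead to positions 3-4 first is charged for the pending
# return jump immediately: [8, 2, 0] instead of the standard [3, 5, 2].
print(early_costs([(3, 4), (0, 2), (5, 5)], 6))
```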

Early distortion cost was shown by its authors to yield similar BLEU scores as the standard one but with stricter pruning parameters, i. e. faster decoding. Experiments were performed on an English-French task, with a fixed distortion limit of 5 and without lexicalized reordering models. Our study deals with a language pair that is arguably more difficult at the level of reordering. Moreover, we start from a stronger baseline and measure the impact of early distortion cost in various distortion limit settings, using also reordering-specific metrics. Results are presented in Section 6.2.

4 Word-after-word reordering modeling and pruning

Phrase orientation (lexicalized reordering) models (Tillmann, 2004; Koehn et al., 2005; Galley and Manning, 2008) have proven very useful for short and medium-range reordering and are probably the most widely used in PSMT nowadays. However, their coarse classification of reordering steps makes them unsuitable to capture long-range reordering phenomena, such as those attested in German-English. Indeed, Galley and Manning (2008) reported a decrease of translation quality when the distortion limit was set beyond 6 in Chinese-English and beyond 4 in Arabic-English.

To address this problem, we have developed a different reordering model that predicts what input word should be translated at a given decoding state (Bisazza, 2013; Bisazza and Federico, 2013). The model is similar to the one proposed by Visweswariah et al. (2011); however, we use it differently: that is, not simply for data pre-processing but as an additional feature function fully integrated in the phrase-based decoder. More importantly, we propose to use the same model to dynamically shape the space of reorderings explored during decoding (cf. Section 4.2), which was never done before.

Another related work is the source-side decoding sequence model by Feng et al. (2010), which is a generative n-gram model trained on a corpus of pre-ordered source sentences. Although reminiscent of a source-side bigram model, our model has two important differences: (i) the discriminative modeling framework enables us to design a much richer feature set including, for instance, the context of the next word to pick; (ii) all our features are independent of the decoding history, which allows for an efficient decoder integration with no effect on hypothesis recombination.

Finally, we have to mention the models by Al-Onaizan and Papineni (2006) and Green et al. (2010), who predict the direction and (binned) length of a jump to perform after a given input word. Those models too were only used as additional feature functions, and were not shown to maintain translation quality and efficiency at very high distortion limits.

4.1 The model

The Word-after-word (WaW) reordering model is trained to predict whether a given input position should be translated right after another, given the words at those positions and their contexts. It is based on the following maximum-entropy binary classifier:

$$P(R_{i,j}{=}Y \mid f_1^J, i, j) = \frac{\exp\left[\sum_m \lambda_m\, h_m(f_1^J, i, j, R_{i,j}{=}Y)\right]}{\sum_{Y'} \exp\left[\sum_m \lambda_m\, h_m(f_1^J, i, j, R_{i,j}{=}Y')\right]}$$

where $f_1^J$ is a source sentence of $J$ words, $h_m$ are feature functions, and $\lambda_m$ the corresponding feature weights. The outcome $Y$ can be either 1 or 0, with $R_{i,j}=1$ meaning that the word at position $j$ is translated right after the word at position $i$.

Training examples are extracted from a corpus of reference reorderings, obtained by converting the word-aligned parallel data into a set of source sentence permutations. A heuristic similar to the one proposed by Visweswariah et al. (2011) is used to this end. For each input word, we generate: (i) one positive example for the word that should be translated right after it; (ii) negative examples for all the uncovered words that lie within a given sampling window of width τ. The latter parameter serves to control the proportion between positive and negative examples.
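As an illustration, the following Python sketch generates such examples from a single reference permutation. The actual extraction heuristic (Bisazza and Federico, 2013) operates on word-aligned parallel data and differs in its details; the sampling rule and names used here (including centering the window on the last translated word) are our own assumptions.

```python
def waw_training_examples(ref_permutation, tau=10):
    """Extract WaW training examples from one reference reordering,
    given as the list of source positions in target order. Emits
    (i, j, label) triples: should j be translated right after i?"""
    examples, covered = [], set()
    n = len(ref_permutation)
    for k in range(n - 1):
        i, j_pos = ref_permutation[k], ref_permutation[k + 1]
        covered.add(i)
        examples.append((i, j_pos, 1))           # one positive example
        # negatives: uncovered words inside the sampling window
        for j_neg in range(max(0, i - tau), min(n, i + tau + 1)):
            if j_neg != j_pos and j_neg != i and j_neg not in covered:
                examples.append((i, j_neg, 0))
    return examples

# Reference order "2 0 1 3" of a 4-word sentence:
for example in waw_training_examples([2, 0, 1, 3], tau=2):
    print(example)
```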

The WaW model builds on binary features that are extracted from the local context of positions i and j, and from the words occurring between them. In addition to the actual words, the features may include POS tags and shallow syntax labels (i. e. chunk types and boundaries). For instance, one feature may indicate that the last translated word (wi) is an adjective while the currently translated one (wj) is a noun:

POS(wi)=adj ∧ POS(wj)=noun

Other features indicate that a given word or punctuation mark is found between wi and wj:

wb=‘jedoch’ ... wb=‘.’

or that wi and wj belong to the same shallow syntax chunk.
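A minimal Python sketch of this kind of feature extraction is given below. The feature templates and naming scheme are purely illustrative and are not the model's exact feature set.

```python
def waw_features(sent, pos, chunk, i, j):
    """Binary feature names for a candidate reordering step i -> j,
    given tokens, POS tags and shallow-syntax chunk ids (sketch)."""
    feats = [
        f"POS(wi)={pos[i]}^POS(wj)={pos[j]}",    # POS conjunction
        f"wi={sent[i]}", f"wj={sent[j]}",        # lexical features
    ]
    lo, hi = sorted((i, j))
    feats += [f"wb={w}" for w in sent[lo + 1:hi]]  # in-between words
    if chunk[i] == chunk[j]:
        feats.append("same_chunk")               # shallow-syntax feature
    return feats

sent  = ["Jedoch", "konnten", "sie", "Kinder", "nicht", "erreichen"]
pos   = ["ADV", "VMFIN", "PPER", "NN", "PTKNEG", "VVINF"]
chunk = [0, 1, 2, 3, 4, 5]
print(waw_features(sent, pos, chunk, 1, 4))
```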

The WaW reordering model can be seamlessly integrated into a standard phrase-based decoder that already includes phrase orientation models. When a partial hypothesis is expanded with a given phrase pair, the model returns the log-probability of translating its words in the order defined by the phrase-internal word alignment. Moreover, the global WaW score is independent of phrase segmentation, and normalized across outputs of different lengths.
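A sketch of this scoring step, under the assumption that model(i, j) returns the probability P(R_{i,j}=1) from the classifier of Section 4.1 (the real integration additionally has to handle unaligned words and interacts with hypothesis recombination):

```python
import math

def waw_expansion_score(model, last_covered, phrase_src_order):
    """Log-probability of translating the source words of a new
    phrase in the order given by its internal word alignment,
    starting from the last covered source position (sketch)."""
    score, prev = 0.0, last_covered
    for j in phrase_src_order:     # source positions in target order
        score += math.log(model(prev, j))
        prev = j
    return score

# Toy model that merely prefers adjacent forward steps:
toy = lambda i, j: 0.9 if j == i + 1 else 0.1
print(waw_expansion_score(toy, 2, [3, 4]))   # 2*log(0.9)
```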

The complete list of features, the training data generation algorithm and other implementation details are presented in (Bisazza, 2013) and (Bisazza and Federico, 2013).

4.2 Early reordering pruning

Besides providing an additional feature function for the log-linear PSMT framework, the WaW model's predictions can be used as an early indication of whether or not a given reordering path should be further explored. In fact, we have mentioned that the existing reordering models are not capable of guiding the search through very large reordering search spaces. As a solution, we propose to decode with loose reordering constraints (i. e. high distortion limit) but only explore those long reorderings that are promising according to the WaW model.

More specifically, at each hypothesis expansion, we consider the set of input positions that are reachable within the fixed distortion limit. Based only on the WaW score, we apply histogram and threshold pruning to this set and then proceed to expand only the non-pruned positions.3 Furthermore, it is possible to ensure that local reorderings are always allowed, by setting a so-called non-prunable zone of width δ around the last covered input position.4 In this way, we can ensure that the usual space of short to medium-range reordering is exhaustively explored in addition to a few promising long-range reorderings.
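The following Python sketch shows how the two pruning criteria and the non-prunable zone could be combined at a single hypothesis expansion. The interface is illustrative, and the default parameter values anticipate the settings used later in Section 6.3; the actual integration with phrase-level expansion is more involved (Bisazza, 2013).

```python
def prune_reachable(waw_scores, last_covered, reachable,
                    histo=2, thresh=0.25, delta=5):
    """Early reordering pruning (sketch). waw_scores maps each input
    position to P(R=1) for jumping there from last_covered. Keeps
    (i) every reachable position inside the non-prunable zone of
    width delta, and (ii) among the remaining ones, the positions
    surviving histogram (top-k) and relative-threshold pruning."""
    local = [p for p in reachable if abs(p - last_covered) <= delta]
    distant = [p for p in reachable if abs(p - last_covered) > delta]
    if not distant:
        return sorted(local)
    best = max(waw_scores[p] for p in distant)
    kept = [p for p in distant if waw_scores[p] >= thresh * best]
    kept = sorted(kept, key=lambda p: waw_scores[p], reverse=True)[:histo]
    return sorted(local + kept)

scores = {3: 0.6, 4: 0.3, 9: 0.5, 14: 0.4, 17: 0.02}
print(prune_reachable(scores, 2, [3, 4, 9, 14, 17]))
# -> [3, 4, 9, 14]: positions 3-4 survive as local reorderings,
#    9 and 14 as promising long jumps, 17 is pruned early.
```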

The rationale of this approach is two-fold: First, to avoid costly hypothesis expansions for very unlikely reordering steps and thus speed up decoding under loose reordering constraints. Second, to decrease the risk of model errors by exploiting the fact that some components of the PSMT log-linear model are more important than others at different stages of the translation process.

The WaW model is not the only scoring function that can be used for early reordering pruning. In principle, even phrase orientation model scores could be used, but we expect them to perform poorly due to the coarse classification of reordering steps (all phrases that are not adjacent to the current one are treated as discontinuous steps).

3 The idea is reminiscent of early pruning by Moore and Quirk (2007): an optimization technique that consists of discarding hypothesis extensions based on their estimated score before computing the exact language model score.

4 See (Bisazza, 2013) for technical details on the integration of word-level pruning with phrase-level hypothesis expansion.


5 Reordering in hierarchical SMT

To allow for a fair evaluation of our systems, we also perform a contrastive experiment using a tree-based SMT approach: namely, hierarchical phrase-based SMT (HSMT) (Chiang, 2005).

Reordering in HSMT is not modeled separately but is embedded in the translation model itself, which contains lexicalized, non-syntactically motivated rules that are directly learnt from word-aligned parallel text. The major strength of HSMT compared to PSMT is the ability to learn discontinuous phrases and long-range lexicalized reordering rules. However, this modeling power has a cost in terms of model size and decoding complexity.

To have a concrete idea, consider that the phrase table trained on our SMT training data (cf. Section 6.1) with a maximum phrase length of 7 contains 127 million entries (before phrase table pruning). The hierarchical rule table trained on the same data with a comparable span constraint (10) contains instead 1.2 billion entries – one order of magnitude larger.

Furthermore, the HSMT decoder is based on a chart parsing algorithm, whose complexity is cubic in the input length, and even higher when taking into account the target language model. This issue can be partially addressed by different strategies such as cube pruning (Chiang, 2007), which reduces the LM complexity to a constant, or rule application constraints. One such constraint is the maximum number of source words that may be covered by non-terminal symbols (span constraint). Setting a span constraint – which is essential to obtain reasonable decoding times – means preventing long-range reordering, similarly to setting a distortion limit in PSMT. In our experiments, we consider two settings for this parameter: 10 to capture short to medium-range reorderings, and 20 to also capture long-range reorderings.

6 Experiments

In this section we evaluate the impact on translation quality and efficiency of the techniques presented above. Our main objective is to empirically verify the hypothesis that better reordering modeling and better reordering space definition can significantly improve the accuracy of PSMT in German-English without sacrificing its efficiency.

6.1 Experimental setup

We choose the WMT German-English news translation task as our case study. More specifically, we use the WMT10 training data: Europarl (v.5) plus News-commentary-2010 for a total of 1.6M parallel sentences, 44M German tokens. The target LM is trained on the monolingual news data provided for the constrained track of WMT10 (1133M English tokens). For development we use the WMT08 news benchmark, while for testing we use the following data sets:

tests(09-11): the concatenation of three previous years' benchmarks from 2009 to 2011 (8017 sentences, 21K German tokens).

test12: the latest released benchmark (3003 sentences, 8K German tokens).

Each data set includes one reference translation. Note that our goal is not to reach the performance of the best systems participating in the last WMT edition, but rather to assess the usefulness of our techniques on a larger and therefore more reliable test set, while starting from a reasonable baseline.5

For German tokenization and compound splitting we use Tree Tagger (Schmid, 1994) and the Gertwol morphological analyser (Koskenniemi and Haapalainen, 1994).6

All our SMT systems are built with the Moses toolkit (Koehn et al., 2007; Hoang et al., 2009), and word alignments are generated by the Berkeley Aligner (Liang et al., 2006). The target language model is estimated by the IRSTLM toolkit (Federico et al., 2008) with modified Kneser-Ney smoothing (Chen and Goodman, 1999).

The phrase-based baseline decoder includes a phrase translation model (two phrasal and two lexical probability features), a lexicalized reordering model (six features), a 6-gram target language model, distortion cost, and word and phrase penalties. As lexicalized reordering model, we use a hierarchical phrase orientation model (Galley and Manning, 2008) trained on all the parallel data using three orientation classes – monotone, swap or discontinuous – in bidirectional mode. Statistically improbable phrase pairs are pruned from the translation model as proposed by Johnson et al. (2007).

5 Our results on test12 are not directly comparable to the WMT12 submissions due to the different training data: that is, the WMT12 parallel data includes 50M German tokens of Europarl data and 4M of news-commentary, as opposed to the 41M and 2.5M released for WMT10 and used in our experiments.

6 http://www2.lingsoft.fi/cgi-bin/gertwol

The hierarchical system is trained and tested using the standard Moses configuration, which includes: a rule table (two phrasal and two lexical probability features), a 6-gram target language model, and word and rule penalties. We set the span constraint (cf. Section 5) to the default value of 10 words for rule extraction, while for decoding we consider two different settings: the default 10 words and a large value of 20 to enable very long-range reorderings.

Feature weights for all systems are optimized by minimum BLEU-error training (Och, 2003) on test08. To reduce the effects of the optimizer instability, we tune each configuration four times and use the average of the resulting weight vectors for testing, as suggested by Cettolo et al. (2011).

The source-to-reference word alignments that are needed to compute the reordering scores are generated by the Berkeley Aligner previously trained on the training data. Source-to-output alignments are obtained from the decoder's trace.

6.2 Distortion function and limit

We start by measuring the difference between standard and early distortion cost.7 Figure 1 shows the results in terms of BLEU and KRS, plotted against the distortion limit (DL).

Indeed, early distortion cost (Moore and Quirk, 2007) outperforms the standard one in all the tested configurations and according to both metrics. We can see that the quality of both systems deteriorates as the distortion limit increases; however, the system with early distortion cost is more robust to this effect. In particular, when passing from DL=12 to DL=18, the baseline system loses 1.2 BLEU and no less than 6.8 KRS, whereas the system with early distortion cost loses 0.8 BLEU and 4.9 KRS. Given these results, we decide to use early distortion cost in all the remaining experiments.

7 For this first series of experiments, feature weights are tuned in the DL=8 setting and the two resulting weight vectors (one for standard, one for early distortion) are re-used in the higher-DL experiments.

Figure 1: Standard vs. early distortion cost performance measured in terms of BLEU and KRS on tests(09-11) under different distortion limits.

6.3 WaW reordering pruning

We have seen that early distortion cost can effectively reduce the loss of translation quality, but cannot totally prevent it. Moreover, increasing the distortion limit means exploring many more hypotheses and, consequently, slowing down the decoding process. With our WaW model-based reordering pruning technique, we aim at solving both issues.

We generate the WaW training data from the first 30K sentences of the News-commentary-2010 parallel corpus, using a sampling window of width τ=10. This results in 8 million training samples, which are fed to the binary classifier implementation of the MegaM Toolkit.8 Features with fewer than 20 occurrences are ignored and the maximum number of training iterations is set to 100.

Evaluated intrinsically on test08, the model achieves the following classification accuracy: 67.0% precision, 50.0% recall, 57.2% F-score. While these figures are rather low, we recall that the WaW model is not meant to be used as a stand-alone classifier, but rather as one of several SMT feature functions and as a way to detect very unlikely reordering steps. Hence, we also evaluate its ability to rank a typical set of reordering options during decoding: that is, we traverse the source words in target order and, for each of them, we examine the ranking of all words that may be translated next (i. e. the uncovered positions within a given DL). We find that, even when the DL is very high (18), the correct jump is ranked among the top 3 reachable jumps in the large majority of cases (81.4%). If we only consider long jumps – i. e. spanning more than 6 words – the Top-3 accuracy is 56.4%, while that of a baseline that simply favors shorter jumps (as the distortion cost does) is only 26.5%.

8 http://www.cs.utah.edu/~hal/megam/ (Daume III, 2004).

System                    DL | tests(09-11)                 | test12                       | ms/word
                             | BLEU  MET   KRS   KRS-V      | BLEU  MET   KRS   KRS-V      |
Allowing only short to medium-range reordering:
PSMT, early disto          8 | 19.2  28.1  67.4  65.4       | 19.0  28.1  67.8  66.1▲      | 202
 +WaW (feature only)       8 | 19.4▲ 28.2▲ 67.6▲ 65.5▲      | 19.5▲ 28.3▲ 67.8  66.2       | 212
HSMT, max.span=10            | 20.1▲ 28.5▲ 68.4▲ 66.7▲      | 19.7△ 28.4△ 68.6▲ 67.3▲      | 406
Allowing also long-range reordering:
PSMT, early disto         18 | 18.2  28.0  62.9  62.0       | 18.2  28.1  63.4  62.5       | 408
 +WaW (feature only)      18 | 18.4▲ 28.0  61.8▼ 61.3▼      | 18.1  28.1  62.2▼ 61.7▼      | 428
 +WaW reo.pruning (δ=5)   18 | 19.5▲ 28.3▲ 67.9▲ 66.3▲      | 19.3▲ 28.4▲ 67.8▲ 66.3▲      | 142
HSMT, max.span=20            | 20.0▲ 28.5▲ 68.1▲ 66.7▲      | 19.7▲ 28.4  68.2▲ 67.1▲      | 706

Table 1: Effects of the WaW reordering model and early reordering pruning on PSMT translation quality and efficiency, compared against a hierarchical SMT baseline. Translation quality is measured with % BLEU, METEOR, and Kendall Reordering Score: regular (KRS) and verb-specific (KRS-V). Statistically significant differences with respect to the previous row are marked with ▲/▼ at the p ≤ .05 level and △/▽ at the p ≤ .10 level. Decoding time is measured in milliseconds per input word.

For the early reordering pruning experiment, we set the pruning parameters to 2 for histogram and 0.25 for relative threshold.9 A non-prunable zone of width δ=5 is set around the last covered position. The resulting configuration is re-optimized by MERT on test08 for the final experiment.

9 Pruning parameters were optimized for BLEU with a grid search over the values (1, 2, 3, 4, 5) for histogram and (0.5, 0.25, 0.1) for threshold.

Table 1 shows the effects of integrating the WaW reordering model into a PSMT decoder that already includes a state-of-the-art hierarchical phrase orientation model. The same table also presents the results of the contrastive HSMT experiments. Two scenarios are considered: in the first block, the PSMT distortion limit is set to a medium value (8) and the HSMT maximum span constraint is set to 10. Although not directly comparable, these settings have the same effect of disallowing long-range reorderings. In the second block, long-range reorderings are instead allowed, with a DL of 18 and an HSMT span constraint of 20.

Feature weights are optimized for each experiment using the procedure described above (four averaged MERT runs). Statistical significance is computed for each experiment against the previous one (i. e. the previous row), using approximate randomization as in (Riezler and Maxwell, 2005). Run times are obtained on an Intel Xeon X5650 processor on the first 500 sentences of tests(09-11), excluding the loading time of all models.

Medium reordering space. Integrating the WaW model as an additional feature function yields small but consistent improvements (second row of Table 1). Concerning the run time, we notice just a small overhead of about 5%: that is, from 202 to 212 ms/word.

In comparison, the tree-based system (third row) has almost double the decoding time but achieves statistically significantly higher translation quality, especially at the level of reordering.

Large reordering space. As expected, raising the DL to 18 with no special pruning (fourth row) results in much slower decoding (from 202 to 408 ms/word) but also in very poor translation quality. This loss is especially visible on the reordering scores: e. g. from 67.4 to 62.9 KRS on tests(09-11). Unfortunately, adding the WaW model as a feature function (fifth row) does not appear to be helpful under the high DL condition.

On the other hand, when using the WaW model also for reordering pruning (sixth row), we are able to recover the performance of the medium-DL baseline and even to slightly improve on it. It is interesting to note that the largest improvement concerns the accuracy of verb reordering on tests(09-11): from 65.4 to 66.3 KRS-V. Although the other gains are rather small, we emphasize the fact that our solutions mostly affect rare and isolated events, which have a limited impact on the general-purpose evaluation metrics but are essential to produce readable translations. WaW reordering pruning also has a remarkable effect on efficiency, making decoding time decrease from 428 ms/word to 142 ms/word, which is even faster than a baseline that does not explore any long-range reordering at all (202 ms/word).

SRC (de):  Jedoch[adv.] konnten[verb-mod] sie[subj.] Kinder[obj.] in Teilen von Helmand und Kandahar im Süden aus Sicherheitsgründen[compl.] nicht[neg] erreichen[verb-inf].
(gloss)    however could they children in parts of Helmand and Kandahar in South for security reasons not reach
REF:       But they could not reach children in parts of Helm. and Kand. in the south for security reasons.
BASE-8:    However, they were children in parts of Helm. and Kand. in the south, for security reasons.
HIER-10:   However, they were children in parts of Helm. and Kand. in the south not reach for security reasons.
BASE-18:   However, they were children in parts of Helm. and Kand. in the south do not reach for security reasons.
WAWP-18:   However, they could not reach children in parts of Helm. and Kand. in the south for security reasons.
HIER-20:   However, they were children in parts of Helm. and Kand. in the south not reach for security reasons.

Table 2: Long-range reordering example showing the behavior of different systems: [BASE-*] are phrase-based systems with a DL of 8 and 18 respectively; [WAWP-18] refers to the WaW-pruning PSMT system; [HIER-*] are hierarchical SMT systems with a span constraint of 10 and 20 words respectively.

Finally, we can see from the last row of Table 1 that the gap between PSMT and HSMT has been narrowed significantly. While more work is needed to reach and outperform the quality of the HSMT system, we were able to closely approach it with five times lower decoding time (142 versus 706 ms/word) and about ten times smaller models (cf. Section 5). Comparing our best system with the best HSMT system (i. e. span constraint 10), we see that the gap in translation accuracy is slightly larger and that the decoding speed-up is smaller (142 versus 406 ms/word). However, the better performance and efficiency of HSMT-10 comes at the expense of all long-range reorderings.

Thus, our enhanced PSMT appears to be an optimal choice in terms of the trade-off between translation quality and efficiency.

Table 3 reports two kinds of decoding statistics that allow us to explain the very different decoding times observed, and to verify that the WaW-pruning system actually performs long-range reorderings: #hyp/sent is the average number of partial translation hypotheses created10 per test sentence; (#jumps/sent)×100 is the average number of phrase-to-phrase jumps included in the 1-best translation of every 100 test sentences. Only medium and long jumps are shown (distortion D ≥ 6), divided into three distortion buckets.

System          DL  #hyp/sent   (#jumps/sent)×100
                                D: [6..8]  [9..12]  [13..18]
baseline         8  600K            90        –        –
baseline        18  1278K           88        61       48
 +WaW r.prun.   18  364K            52        29       17

Table 3: Decoding statistics of three PSMT systems exploring different reordering search spaces for the translation of test12.

We can see that the early-pruning system indeed performed several long jumps but explored a much smaller search space compared to the high-distortion baseline (364K versus 1278K partial hypotheses). As for the lower number of long jumps (e. g. 29 versus 61 with D in [9..12], and 17 versus 48 in [13..18]), this suggests that the early-pruning system is more precise, while the high-distortion baseline is over-reordering.

The output of different systems for our example sentence is shown in Table 2. In this sentence, a jump forward with D=12 and a jump backward with D=14 were necessary to achieve the correct reordering of the verb and its negation. Although these jumps were reachable for both the [BASE-18] and [HIER-20] systems, only the WaW-pruning PSMT system actually performed them.

10 That is, the hypotheses that were scored by all the PSMT model components and added to a hypothesis stack.

Figure 2: Effects of beam size on translation quality measured by BLEU, KRS and KRS-V, in two baseline PSMT systems (DL=8 and DL=18) and in the WaW early-pruning system (test12). For comparison, the hierarchical system performance (span constraint 20) is provided as a dotted line.

6.4 Interaction with beam-search pruning

During the beam-search decoding process, early reordering pruning interacts with regular hypothesis pruning based on the weighted sum of all model scores. In particular, all the PSMT systems presented so far apply a default histogram threshold of 200 to each hypothesis stack. To examine this interaction, we increase the histogram threshold (beam size) from the default value of 200 up to 800, while keeping all other parameters and feature weights fixed. The results on test12 are plotted against the beam size and reported in Figure 2. The dotted line in each plot represents the performance of the hierarchical system presented in the last row of Table 1 (span constraint 20).

We can see that increasing the beam size is more beneficial for the high-DL baseline (baseDL18) than for the medium one (baseDL8). This is not surprising, as the risk of search error is higher when a larger search space is explored with equal models and pruning parameters. Nevertheless, baseDL18 remains by far the worst performing system, even in our largest beam setting (800), corresponding to four times longer decoding time (1582 ms/word). What is remarkable, instead, is that the larger beam size also results in better performance by the WaW-pruning system, which is the PSMT system that explores by far the smallest search space (cf. Table 3). The superiority of the WaW-pruning system over the PSMT baselines is maintained in all tested settings and according to all metrics, which confirms the usefulness of our methods not only as optimization techniques, but also for reducing model errors of a baseline that already includes strong reordering models.

With a very large beam size (800), our enhanced PSMT system can closely approach the performance of HSMT-20 in terms of BLEU and KRS-V, and even surpass it in terms of KRS (statistically significant), while still remaining faster: that is, 554 versus 706 ms/word.

Overall, HSMT-10 remains the best system, with slightly higher KRS and KRS-V and lower decoding time than our best enhanced PSMT system (406 versus 554 ms/word). However, we note once more that this performance comes at the expense of all long-range reorderings. For a completely fair comparison, the HSMT system should also be enhanced with similar reordering-pruning techniques – a research path that we plan to explore in the future, possibly drawing inspiration from the approach of Braune et al. (2012).

7 Conclusions

We have presented a few techniques that can improve the accuracy of the word reordering performed by a German-English phrase-based SMT system. In particular, we have shown how long-range reorderings can be captured without worsening the general quality of translation and without sacrificing efficiency. Our best PSMT system is actually faster than a system that does not even attempt to perform long-range reordering, and it obtains significantly higher evaluation scores.

In comparison to a more computationally costly tree-based approach (hierarchical SMT), our enhanced PSMT system produces slightly lower translation quality but in five times lower decoding time when long-range reordering is allowed. Moreover, when a larger beam size is explored, the performance of our system can equal that of the long-reordering hierarchical system, still with faster decoding.

In summary, we have shown that appropriate modeling of the word reordering problem can narrow or even close the gap between phrase-based and hierarchical SMT in this difficult language pair. We have also disproved the common belief that sacrificing long-range reorderings by setting a low distortion limit is the only way to obtain well-performing PSMT systems.

Acknowledgments

This work was partially funded by the European Union under FP7 grant agreement EU-BRIDGE, Project Number 287658.

References

Yaser Al-Onaizan and Kishore Papineni. 2006. Distortion models for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 529–536, Sydney, Australia, July.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 745–754, Stroudsburg, PA, USA.

Alexandra Birch, Miles Osborne, and Phil Blunsom. 2010. Metrics for MT evaluation: evaluating reordering. Machine Translation, 24(1):15–26.

Arianna Bisazza and Marcello Federico. 2013. Dynamically shaping the reordering search space of phrase-based statistical machine translation. To appear in Transactions of the ACL.

Arianna Bisazza. 2013. Linguistically Motivated Reordering Modeling for Phrase-Based Statistical Machine Translation. Ph.D. thesis, University of Trento. http://eprints-phd.biblio.unitn.it/1019/.

Fabienne Braune, Anita Gojun, and Alexander Fraser. 2012. Long-distance reordering during search for hierarchical phrase-based SMT. In Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT), pages 28–30, Trento, Italy.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montreal, Canada, June.

Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. In MT Summit XIII: the Thirteenth Machine Translation Summit, pages 32–39, Xiamen, China.

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 4(13):359–393.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263–270, Ann Arbor, Michigan, June.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 531–540, Ann Arbor, Michigan, June.

Hal Daume III. 2004. Notes on CG and LM-BFGS optimization of logistic regression. Paper available at http://pub.hal3.name, implementation available at http://hal3.name/megam.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, pages 1618–1621, Brisbane, Australia.

Minwei Feng, Arne Mauser, and Hermann Ney. 2010. A source-side decoding sequence model for statistical machine translation. In Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado, USA.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 848–856, Morristown, NJ, USA.


Dmitriy Genzel. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 376–384, Stroudsburg, PA, USA.

Anita Gojun and Alexander Fraser. 2012. Determining the placement of German verbs in English-to-German SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 726–735, Avignon, France, April.

Spence Green, Michel Galley, and Christopher D. Manning. 2010. Improved models of distortion cost for statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 867–875, Los Angeles, California.

Hieu Hoang and Philipp Koehn. 2010. Improved translation with source syntax labels. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 409–417, Uppsala, Sweden, July.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A unified framework for phrase-based, hierarchical, and syntax-based statistical machine translation. In International Workshop on Spoken Language Translation (IWSLT), pages 152–159, Tokyo, Japan.

H. Johnson, J. Martin, G. Foster, and R. Kuhn. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of EMNLP-CoNLL 2007, pages 967–975.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, pages 127–133, Edmonton, Canada.

Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proc. of the International Workshop on Spoken Language Translation, October.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic.

Kimmo Koskenniemi and Mariikka Haapalainen. 1994. GERTWOL – Lingsoft Oy, chapter 11, pages 121–140. Roland Hausser, Niemeyer, Tübingen.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 104–111, New York City, USA, June.

Robert C. Moore and Chris Quirk. 2007. Faster beam-search decoding for phrasal statistical machine translation. In Proceedings of MT Summit XI, pages 321–327, Copenhagen, Denmark.

Jan Niehues and Muntsin Kolss. 2009. A POS-based model for long-range reorderings in SMT. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 206–214, Athens, Greece.

F. Och and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 295–302, Philadelphia, PA.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Erhard Hinrichs and Dan Roth, editors, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Research Report RC22176, IBM Research Division, Thomas J. Watson Research Center.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan, June.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL).

Roy Tromble and Jason Eisner. 2009. Learning linear ordering problems for better translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1007–1016, Singapore, August.

Karthik Visweswariah, Rajakrishnan Rajkumar, Ankur Gandhe, Ananthakrishnan Ramanathan, and Jiri Navratil. 2011. A word reordering model for improved machine translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 486–496, Edinburgh, Scotland, UK, July.


Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of Coling 2004, pages 508–514, Geneva, Switzerland, Aug 23–Aug 27. COLING.

R. Zens, F. J. Och, and H. Ney. 2002. Phrase-based statistical machine translation. In 25th German Conference on Artificial Intelligence (KI2002), pages 18–32, Aachen, Germany. Springer Verlag.
