

This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”. This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452.

Deliverable D4.1

Report on the First Quality Translation Shared Task

Matthias Huck (UEDIN), Philipp Koehn (UEDIN), Ondřej Bojar (CUNI), Chris Hokamp (DCU), Amir Kamran (UvA), Matteo Negri (FBK), Khalil Sima’an (UvA), Lucia Specia (USFD), Miloš Stanojević (UvA), Marco Turchi (FBK)

Dissemination Level: Public

30th January, 2016


Grant agreement no.            645452
Project acronym                QT21
Project full title             Quality Translation 21
Type of action                 Research and Innovation Action
Coordinator                    Prof. Josef van Genabith (DFKI)
Start date, duration           1st February, 2015, 36 months
Dissemination level            Public
Contractual date of delivery   31st January, 2016
Actual date of delivery        30th January, 2016
Deliverable number             D4.1
Deliverable title              Report on the First Quality Translation Shared Task
Type                           Report
Status and version             Final (Version 1.0)
Number of pages                84
Contributing partners          UEDIN, UvA, DCU, CUNI, FBK, USFD
WP leader                      UEDIN
Author(s)                      Matthias Huck (UEDIN), Philipp Koehn (UEDIN), Ondřej Bojar (CUNI), Chris Hokamp (DCU), Amir Kamran (UvA), Matteo Negri (FBK), Khalil Sima’an (UvA), Lucia Specia (USFD), Miloš Stanojević (UvA), Marco Turchi (FBK)
EC project officer             Susan Fraser

The partners in QT21 are:
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany
• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany
• Universiteit van Amsterdam (UvA), Netherlands
• Dublin City University (DCU), Ireland
• University of Edinburgh (UEDIN), United Kingdom
• Karlsruher Institut für Technologie (KIT), Germany
• Centre National de la Recherche Scientifique (CNRS), France
• Univerzita Karlova v Praze (CUNI), Czech Republic
• Fondazione Bruno Kessler (FBK), Italy
• University of Sheffield (USFD), United Kingdom
• TAUS b.v. (TAUS), Netherlands
• text & form GmbH (TAF), Germany
• TILDE SIA (TILDE), Latvia
• Hong Kong University of Science and Technology (HKUST), Hong Kong

For copies of reports, updates on project activities and other QT21-related information, contact:

Prof. Stephan Busemann, DFKI GmbH
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany
[email protected]
Phone: +49 (681) 85775 5286
Fax: +49 (681) 85775 5338

Copies of reports and other material can also be accessed via the project’s homepage: http://www.qt21.eu/

© 2016, The Individual Authors

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Contents

1 Executive Summary
2 First Quality Translation Shared Task
  2.1 Overview
  2.2 Translation Task
  2.3 Quality Estimation Task
  2.4 Metrics Task
  2.5 Tuning Task
  2.6 Automatic Post-editing Task
3 Plans for the Second Quality Translation Shared Task
  3.1 Plans for the Translation Task
  3.2 Plans for the Quality Estimation Task
  3.3 Plans for the Metrics Task
  3.4 Plans for the Tuning Task
  3.5 Plans for the Automatic Post-editing Task
4 Conclusion
References
Appendices
  Appendix A Findings of the 2015 Workshop on Statistical Machine Translation
  Appendix B Results of the WMT15 Metrics Shared Task
  Appendix C Results of the WMT15 Tuning Shared Task


1 Executive Summary

This deliverable reports on the First Quality Translation Shared Task campaign, organized by QT21 as part of work package 4 (WP4) during the first year of the project.

The First Quality Translation Shared Task took place in 2015, in conjunction with the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT). WMT and its shared tasks are realized in close collaboration with a number of other European projects (such as Cracker, MosesCore, EXPERT) and with members of international universities and research labs (such as the Johns Hopkins University and Microsoft Research).

In Section 2 of this document, we give a compact outline of the main objectives, the setup of the individual tasks, and some of the most important results of the QT21 First Quality Translation Shared Task campaign.[1] Three papers presenting an in-depth description of all tasks, the evaluation results, as well as their analysis have been published in the Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (Bojar et al., 2015; Stanojević et al., 2015b; Stanojević et al., 2015a). For reference to the detailed findings of the campaign, we include the three WMT papers as appendices to this document (Appendices A, B, C).

We are currently in the preparation phase for the Second Quality Translation Shared Task campaign, to be run in 2016. We therefore briefly outline plans for the 2016 campaign in Section 3 of this document, with a focus on changes as compared to the 2015 campaign.

[1] Note that Section 2 naturally overlaps with the WP4 section in the Interim Progress Report from month 9 (Deliverable D6.1), since the First Quality Translation Shared Task campaign had already been run by month 9.


2 First Quality Translation Shared Task

2.1 Overview

During its project time, QT21 will organize three annual shared task campaigns, in which all research partners participate and which are open to outside machine translation research and development groups. This deliverable reports on the first annual shared task campaign, whose results were discussed in September 2015 in Lisbon.

The QT21 Quality Translation Shared Tasks are designed to be the core testing and validation activity of the project. New research methods developed in QT21 are evaluated with respect to their impact on translation quality. The shared tasks also serve the purpose of providing a platform for sharing ideas and expertise, thereby driving research through competition.

Each project task in WP4 corresponds to the organization of one annual shared task campaign, with the following schedule:

• WP4 Task 4.1: First Quality Translation Shared Task [M01-M12]
• WP4 Task 4.2: Second Quality Translation Shared Task [M13-M24]
• WP4 Task 4.3: Third Quality Translation Shared Task [M25-M36]

Each shared task involves creation and distribution of training data, creation of test data, definition of an evaluation protocol, infrastructure to collect participant submissions, as well as automatic and manual scoring of results.

The First Quality Translation Shared Task (WP4 Task 4.1) was organized in the middle of the first year of the project, in conjunction with the EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT).[2] Organizing the QT21 Quality Translation Shared Tasks at WMT helps us maximize their impact, since we are able to build on the popularity of an established event. Thus it is not surprising that high-profile research groups throughout the world actively participate. We keep the shared task open to all interested parties in this framework. Progress achieved within QT21 can thus be compared directly against the state of the art elsewhere.

The open character of the Quality Translation Shared Task allows for a verification of the success of the research conducted in QT21. Three project milestones (MS1, MS4, MS6) are coupled with the shared tasks:

• MS1: Shared Task 1 [M8] — QT21 participants should perform well compared to others.
• MS4: Shared Task 2 [M20] — QT21 participants should demonstrate improvements over Shared Task 1 and perform better than others.
• MS6: Shared Task 3 [M30] — QT21 participants should demonstrate continuous improvements over Shared Tasks 1 and 2 and perform better than others.

The First Quality Translation Shared Task evaluation campaign comprised five distinct tasks:

• a translation task
• a quality estimation task
• a metrics task
• a tuning task
• an automatic post-editing task

In the remainder of this section, we give a brief description of the five distinct tasks and highlight key results of the evaluation campaign. Overview papers published in the WMT proceedings provide more details (Bojar et al., 2015; Stanojević et al., 2015a; Stanojević et al., 2015b), cf. Appendices A, B, C of this report. The results show that Milestone 1 (MS1) of the project has been reached.

[2] http://www.statmt.org/wmt15/


Language Pair      Rank range        Partner
English→Czech      1 (out of 15)     CUNI
Czech→English      2 (out of 16)     UEDIN
English→French     1 (out of 7)      LIMSI-CNRS
French→English     1-3 (out of 7)    LIMSI-CNRS, UEDIN
English→Finnish    6-8 (out of 10)   UEDIN
Finnish→English    4-7 (out of 14)   UEDIN
English→German     1-2 (out of 16)   UEDIN
German→English     2-3 (out of 13)   UEDIN
English→Russian    4-5 (out of 10)   LIMSI-CNRS
Russian→English    7-10 (out of 13)  LIMSI-CNRS

Table 1: Results of the translation task: top ranked systems submitted by QT21 project partners.

2.2 Translation Task

In the translation task, participants were asked to translate a shared test set, optionally restricting themselves to the provided training data. We covered five language pairs in 2015: Czech, German, French, Russian, and Finnish, each paired with English, in both translation directions. Finnish was added as a new language that had not been covered at WMT in previous years, providing a lesser-resourced data condition on a challenging language pair.

The system outputs for each language pair and translation direction were evaluated both automatically[3] and manually. Ranking according to human judgment is the primary evaluation metric. The human evaluation involves asking human judges to rank sentences output by anonymized systems. A rank range is computed for each system by collecting the absolute ranks assigned to that system, and then clustering systems into equivalence classes containing systems with overlapping ranges, yielding a partial ordering over systems at the 95% confidence level (Bojar et al. (2015) describe the details).
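The clustering step can be sketched as follows. The sketch is illustrative only: it assumes the per-system rank ranges have already been derived from the collected human rankings, and the system names and ranges below are invented.

```python
# Minimal sketch: group systems into equivalence classes by overlapping rank
# ranges (rank 1 = best). The official procedure derives the ranges from the
# collected human judgments; names and ranges here are made up.

def cluster_by_rank_range(rank_ranges):
    """rank_ranges: dict mapping system name -> (lowest rank, highest rank)."""
    ordered = sorted(rank_ranges.items(), key=lambda kv: kv[1])
    clusters, current, current_hi = [], [], None
    for system, (lo, hi) in ordered:
        if current and lo > current_hi:
            clusters.append(current)          # no overlap: start a new cluster
            current, current_hi = [], None
        current.append(system)
        current_hi = hi if current_hi is None else max(current_hi, hi)
    if current:
        clusters.append(current)
    return clusters

print(cluster_by_rank_range({"sysA": (1, 2), "sysB": (1, 3), "sysC": (4, 4), "sysD": (4, 5)}))
# [['sysA', 'sysB'], ['sysC', 'sysD']]
```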

In total, 68 machine translation systems from 24 institutions were submitted for the ten translation directions in the translation task, alongside 7 anonymized commercial and online systems. QT21 participants performed well compared to the other entries (Milestone 1). Table 1 lists the top ranked systems submitted by QT21 project partners.

An in-depth description of all aspects of the shared translation task (such as the evaluation framework and methodology, the data provided, and all official results) is given in (Bojar et al., 2015), cf. Appendix A of this report.

2.3 Quality Estimation Task

The fourth edition of the WMT shared task on quality estimation (QE) of machine translation continued work from previous editions of the task, with subtasks including both sentence- and word-level estimation, and a new subtask on document-level prediction. A novel component of the 2015 task which is particularly relevant for QT21 was the use of much larger training sets for predicting quality at the word level: 280,755 words for training and 40,899 for testing. Word-level quality prediction is one of the goals of WP3 in QT21 and was supported by the evaluation campaign in WP4. The hypothesis was that with a larger dataset, more positive results could be achieved. Our findings showed that the label distribution in the data is as important as the size of the dataset, and this will be taken into account during the data collection activities in WP3. The three subtasks are specifically:

[3] http://matrix.statmt.org


Subtask 1: Predicting sentence-level quality. This subtask aimed at scoring (and ranking) translated sentences according to the predicted percentage of words that need to be fixed (HTER). We provided a dataset with 12,271 + 1,817 English→Spanish translations generated by a statistical machine translation system. The labels were automatically derived from the human post-editing of the machine translation (a small sketch of the HTER computation follows the subtask descriptions below).

Subtask 2: Predicting word-level quality. The goal of this subtask was to evaluate the extent to which we can detect word-level errors in machine translation output, aiming at making a binary distinction between GOOD and BAD tokens. The same dataset as in subtask 1 was provided.

Subtask 3: Predicting document-level quality. This subtask explored scoring and ranking of short documents according to their predicted METEOR score. We provided datasets for two language pairs: English→German and German→English translations taken from all participating systems in WMT 2013, labelled against reference translations using METEOR.
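For illustration, HTER can be approximated as the word-level edit distance between the MT output and its human post-edit, normalized by the length of the post-edit. The sketch below uses plain Levenshtein distance over tokens; the official labels were produced with a full TER-style tool that also handles block shifts, so this is only meant to convey the definition, and the example strings are invented.

```python
# Hedged sketch: approximate HTER as word-level edit distance between the MT
# output and its human post-edit, divided by the post-edit length.

def word_edit_distance(hyp_tokens, ref_tokens):
    m, n = len(hyp_tokens), len(ref_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

def approximate_hter(mt_output, post_edit):
    hyp, ref = mt_output.split(), post_edit.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

print(approximate_hter("la casa es azul grande", "la casa azul es grande"))  # 2 edits / 5 words = 0.4
```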

Participants were provided with a baseline set of features for each task, and a software package to extract these and other quality estimation features and perform model learning, with suggested methods for all levels of prediction.
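As a hedged illustration of the sentence-level setup, the snippet below fits a regressor on per-segment feature vectors with HTER labels and predicts a score for an unseen segment. The feature values and labels are invented, and scikit-learn's SVR merely stands in for whatever learner a participant (or the provided baseline software) might use.

```python
# Illustration only: tiny invented feature matrix (e.g., segment length,
# language-model probabilities, ...) with HTER labels; fit a regressor and
# predict HTER for an unseen test segment.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_train = np.array([[14, 0.62, 1.1], [33, 0.40, 0.8], [21, 0.55, 1.0]])  # fake features
y_train = np.array([0.12, 0.47, 0.25])                                   # fake HTER labels
X_test = np.array([[18, 0.58, 1.05]])

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.05))
model.fit(X_train, y_train)
print(model.predict(X_test))   # predicted HTER for the unseen segment
```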

We received a total of 34 system submissions by 10 teams. Participants included groups from Europe and China. The experimental settings, system descriptions, and results are detailed in the WMT 2015 Findings paper (Bojar et al., 2015), cf. Appendix A.

2.4 Metrics Task

The participants of the metrics task were asked to score the outputs of the MT systems involved in the WMT 2015 Shared Translation Task. We collected scores of 46 metrics from 11 research groups. In addition, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER, and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with the WMT 2015 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence). The best metrics for translation out of English (into French, Finnish, German, Czech, and Russian), on both the system and segment level, are BEER (UvA-ILLC) and chrF3 (DFKI). The best metric on the system level for translation into English is DPMFcomb (DCU), while on the segment level it is DPMFcomb (DCU) and BEER_Treepel (UvA-ILLC). All of the winning metrics have been developed by members of QT21. The detailed results are reported in (Stanojević et al., 2015b), cf. Appendix B.
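The two meta-evaluation views can be sketched as follows, with invented numbers: a Pearson correlation between a metric's system-level scores and the human system scores, and a Kendall's tau-style statistic at the segment level. The official segment-level protocol works on pairwise preferences; plain rank correlation is used here only as a simplified stand-in.

```python
# Illustration with made-up scores; requires scipy.
from scipy.stats import pearsonr, kendalltau

# One score per system.
human_system_scores = [0.61, 0.55, 0.48, 0.30]
metric_system_scores = [32.1, 30.4, 29.8, 24.5]
print("system-level r:", pearsonr(human_system_scores, metric_system_scores)[0])

# One quality score per segment (higher = better) from humans and from the metric.
human_segment_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
metric_segment_scores = [0.85, 0.70, 0.75, 0.30, 0.25]
print("segment-level tau:", kendalltau(human_segment_scores, metric_segment_scores)[0])
```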

2.5 Tuning Task

The participants of the tuning task were provided with a complete machine translation system and asked to tune its internal parameters (feature weights). The tuned systems were used to translate the test set, and the outputs were manually ranked for translation quality. We received 4 submissions in the English→Czech and 6 in the Czech→English translation direction. In addition, we ran 3 baseline setups, tuning the parameters with standard optimizers for BLEU score. The best submitted systems for English→Czech and Czech→English are from DCU and UvA-ILLC respectively (both members of QT21). The detailed report on this task is presented in (Stanojević et al., 2015a), cf. Appendix C.
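As a toy illustration of what the BLEU-based baseline optimizers do, the sketch below searches randomly for feature weights that maximize corpus BLEU of the 1-best translations re-ranked from invented n-best lists (assuming the sacrebleu package is available). Real tuning uses MERT- or MIRA-style optimizers over the decoder's full feature set, but the objective is the same.

```python
# Toy weight tuning by random search over invented n-best lists.
import random
import sacrebleu

# Each source sentence has an n-best list of (hypothesis, feature vector).
nbest = [
    [("the house is blue", [-2.1, -3.0]), ("house the blue is", [-1.9, -4.2])],
    [("he reads a book",   [-1.5, -2.2]), ("he read a books",   [-1.4, -2.9])],
]
references = ["the house is blue", "he reads a book"]

def decode(weights):
    """Return the highest-scoring hypothesis per sentence under w . f."""
    return [max(cands, key=lambda c: sum(w * f for w, f in zip(weights, c[1])))[0]
            for cands in nbest]

random.seed(0)
best = max((tuple(random.uniform(-1, 1) for _ in range(2)) for _ in range(200)),
           key=lambda w: sacrebleu.corpus_bleu(decode(w), [references]).score)
print(best, decode(best))
```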

2.6 Automatic Post-editing Task

In addition to the planned tasks (cf. the Grant Agreement), in 2015 WMT hosted for the first time a shared task on MT automatic post-editing (APE). This pilot task was organized by FBK to focus the attention of the MT community on one of the core problems addressed in QT21 (see Task 3.3 of WP3).


The APE task requires systems to automatically correct errors present in a machine-translated text. By solving this problem, APE components would make it possible to:

• Improve MT output by exploiting information unavailable to the decoder, or by performing deeper text analysis that is too expensive at the decoding stage;
• Cope with systematic errors of an MT system whose decoding process is not accessible;
• Adapt the output of a general-purpose MT system to the style requested in a specific application domain, thus reducing (human) post-editing effort.

The APE pilot task focused on the challenges posed by the “black-box” scenario in which the MT system is unknown and cannot be modified. The main objectives were to:

1. define a sound evaluation framework for the task,
2. identify and understand the most critical aspects in terms of data acquisition and system evaluation, and
3. make an inventory of current approaches and evaluate the state of the art.

Participants were provided with English-Spanish train/dev data consisting of (source, target, human post-edit) triplets, and were asked to return automatic post-edits for a test set of unseen (source, target) pairs. Training, dev and test data contained 11,272, 1,000 and 1,817 tuples respectively, drawn from the news domain. Post-edits were collected by means of a crowdsourcing platform. All data were provided by Unbabel,[4] since the data collected by QT21 will only be available for the next round in 2016.

Four teams participated in the APE pilot task by submitting a total of seven runs. Two of them, FBK and LIMSI-CNRS, are also QT21 partners. Both participations were successful, even though none of the systems managed to beat this year’s baseline. In terms of the two metrics used to rank participants (average TER, both case sensitive and case insensitive), their primary runs ranked first and second (with FBK being the best in the case-sensitive and LIMSI-CNRS the best in the case-insensitive evaluation mode). This can be considered a good result in light of the first milestone of QT21.

Overall, the APE pilot task was a positive experience. With respect to the first objective for the initial year (define a sound evaluation framework), no major issues emerged or required radical changes in future evaluation rounds. Concerning the second objective (identify and understand the most critical aspects), we learned a lot, especially about the strong relation between the domain and the source of the data. Most likely, the next rounds will focus on domain-specific data from professional translators (i.e. those collected in WP3), which are the most suitable for learning reliable and reusable correction rules. These findings are also useful and informative for driving the data collection activities in WP3, which will follow similar criteria. Concerning the third objective (make an inventory of current approaches), it was interesting to observe that, despite sharing the same underlying approach, each system included original solutions that improved over the state of the art.

The detailed report on the APE pilot task is contained in (Bojar et al., 2015), cf. Appendix A.

[4] https://unbabel.com/


3 Plans for the Second Quality Translation Shared Task

The Second Quality Translation Shared Task campaign is under preparation and will be held in conjunction with the ACL 2016 First Conference on Machine Translation.[5] In the next paragraphs, we give an outlook on the planned characteristics of the different individual tasks of this follow-up QT21 evaluation campaign.

3.1 Plans for the Translation Task

We will continue to cover the Czech-English and German-English language pairs in both translation directions. We will add two new language pairs, Romanian and Turkish, both paired with English as the target language. Thanks to the sponsorship of the University of Helsinki and of Yandex, Finnish and Russian (both paired with English) can be continued.[6]

Training corpora will be made available for the language pairs that we did not cover before. For the established language pairs, our focus is on increasing the amount of provided training data. Test sets are in the news domain and have been created by means of the allocated subcontracting budget.

Looking further ahead, we intend to cover the Latvian-English language pair at the Third Quality Translation Shared Task campaign in 2017.

3.2 Plans for the Quality Estimation Task

The next quality estimation task will build on its previous four editions to further examine automatic methods for estimating the quality of machine translation output. It will include word-level, sentence-level and document-level estimation, as in previous years, as well as a new prediction level: phrase-level quality estimation. Identifying errors at the phrase level is one of the objectives in WP3. Another important novelty for the next edition concerns the data to be used for the sentence-, word- and phrase-level tasks: a large English-German dataset (15,000 segments) produced from post-edits of machine translations by professional translators (as opposed to crowdsourced post-edits as in the previous year). Also, for the first time, the data will be domain-specific (Information Technology domain). We are in the process of collecting and annotating this dataset as part of WP3. Finally, the document-level task will use, for the first time, entire documents, which have been human-annotated for quality indirectly in two ways: through reading comprehension tests and through a two-stage post-editing exercise. The language pairs for the document-level subtask will be German-English and English-Spanish. The different subtasks planned for WMT 2016 have the following main goals:

• To advance work on sentence and word-level quality estimation by providing domain-specific, larger and professionally annotated datasets.

• To study the utility of detailed information logged during post-editing (time, keystrokes, actual edits) for different levels of prediction.

• To analyze the effectiveness of different types of quality labels provided by humans for longer texts in document-level prediction.

• To investigate quality estimation at a new level of granularity: phrases.

3.3 Plans for the Metrics Task

The 2016 metrics task will be co-organized by UvA and CUNI, and potentially also other teams. The 2016 edition of the task is largely similar to the 2015 edition, with the following potential difference (still under study): we would like to evaluate not only system outputs from the translation task but additionally include evaluation of data from other tasks (such as the multimodal task, which is a separate new WMT shared task in 2016). The type of output from the multimodal task can be quite different, so it would be interesting to see how machine translation evaluation metrics can handle the difference.

[5] http://www.statmt.org/wmt16/
[6] We thank Yandex for their donation of data for the Russian-English and Turkish-English language pairs, and the University of Helsinki for their donation for the Finnish-English language pair.

3.4 Plans for the Tuning Task

The 2016 tuning task will be co-organized by UvA and CUNI, and potentially also other partners from QT21. The 2016 edition of the task is largely similar to the 2015 edition, with the potential difference that system tuning is expected to be done on a much larger dataset than the standard size used for WMT translation system tuning or the dataset from the 2015 edition.

3.5 Plans for the Automatic Post-editing Task

The second round of the automatic post-editing task will build on the findings of the pilot edition, organized in 2015. The main goal will be to set up an evaluation scenario in which the data repetitiveness/representativeness issues encountered last year are mitigated, making the task easier to approach for newcomers and baseline improvements easier to obtain. In organizing the new round, attention is hence being focused on the following two critical aspects:

1. Domain of the data. The use of news data raised the problem of data sparseness and did not satisfy the requirement of testing new APE approaches in realistic industrial conditions. News data covers a wide range of topics, which makes it of limited usefulness for learning reusable correction rules. With the limited amount of data available for the pilot, this made it particularly hard for participants to find and apply corrections where needed to improve the test instances. Moreover, since news does not seem to represent interesting material from the industry point of view, following some participants’ suggestions the type of data chosen will shift towards a specific domain (Information Technology) represented by a collection from a single vendor.

2. Source of the post-edits. Another issue in the pilot task was the high variability of valid MT corrections, which mostly depends on the different attitudes and criteria of the crowdsourced workforce involved in data acquisition. Compared to professional translators, who tend to use a coherent style within the same translation project, non-expert crowdsourced workers hired to translate single isolated sentences will inevitably show large differences in lexical choice, word ordering, etc. This represents another possible cause of data sparsity and, in turn, an additional source of complexity for the task. For this reason, the post-edits that will be used in the second edition of the APE task are created by professional post-editors following their normal translation guidelines.

To enhance the interpretability of the results and provide useful findings for future research, the organizers will also try to extend the evaluation by adding a manual analysis of the post-edited sentences. As in the translation task, the mere use of automatic metrics (TER in the case of APE) for the evaluation of the automatically post-edited sentences might limit the analysis of the results (especially when only one reference is available). For this reason, it is our intention to perform a manual evaluation of a subset of sentences selected from the outputs of the submitted systems, aimed at avoiding penalizing valid post-edits that differ from the reference.

Unlike the pilot edition, the task will involve a new language setting, English to German translation, chosen among those that are relevant to QT21. All these changes, which stem from the thorough analysis of the pilot experience reported in (Bojar et al., 2015), will make a precise estimation of the technology’s progress since last year rather difficult. Nevertheless, they represent a significant step forward towards a stable task definition for the coming years.


4 Conclusion

The QT21 First Quality Translation Shared Task campaign was a success, with QT21 organizing the three distinct tasks originally envisaged in the Grant Agreement (a translation task, a quality estimation task, and a metrics task), plus two novel distinct tasks (a tuning task and an automatic post-editing task). Over the course of the preparation and scientific analysis of the results of the campaign, we have closely collaborated with partners from other European projects, international research labs, and high-profile companies.

For participation in the evaluation campaign, we were able to attract many external research groups who are renowned in the area of machine translation. Members of the QT21 project performed well compared to others, showing that Milestone 1 of the project has been reached.

The Second Quality Translation Shared Task campaign will be held in conjunction with the ACL 2016 First Conference on Machine Translation. All five distinct tasks will be continued in 2016, with new challenges such as additional language pairs, domain-specific data, subtasks with different granularities of the prediction level, and refined evaluation scenarios.


References

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal, September. Association for Computational Linguistics.

Miloš Stanojević, Amir Kamran, and Ondřej Bojar. 2015a. Results of the WMT15 Tuning Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 274–281, Lisbon, Portugal, September. Association for Computational Linguistics.

Miloš Stanojević, Amir Kamran, Philipp Koehn, and Ondřej Bojar. 2015b. Results of the WMT15 Metrics Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273, Lisbon, Portugal, September. Association for Computational Linguistics.


A Findings of the 2015 Workshop on Statistical Machine Translation

Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisboa, Portugal, 17-18 September 2015. © 2015 Association for Computational Linguistics.

Findings of the 2015 Workshop on Statistical Machine Translation

Ondřej Bojar (Charles Univ. in Prague), Rajen Chatterjee (FBK), Christian Federmann (Microsoft Research), Barry Haddow (Univ. of Edinburgh), Matthias Huck (Univ. of Edinburgh), Chris Hokamp (Dublin City Univ.), Philipp Koehn (JHU / Edinburgh), Varvara Logacheva (Univ. of Sheffield), Christof Monz (Univ. of Amsterdam), Matteo Negri (FBK), Matt Post (Johns Hopkins Univ.), Carolina Scarton (Univ. of Sheffield), Lucia Specia (Univ. of Sheffield), Marco Turchi (FBK)

Abstract

This paper presents the results of the WMT15 shared tasks, which included a standard news translation task, a metrics task, a tuning task, a task for run-time estimation of machine translation quality, and an automatic post-editing task. This year, 68 machine translation systems from 24 institutions were submitted to the ten translation directions in the standard translation task. An additional 7 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had three subtasks, with a total of 10 teams, submitting 34 entries. The pilot automatic post-editing task had a total of 4 teams, submitting 7 entries.

1 Introduction

We present the results of the shared tasks of the Workshop on Statistical Machine Translation (WMT) held at EMNLP 2015. This workshop builds on eight previous WMT workshops (Koehn and Monz, 2006; Callison-Burch et al., 2007, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013, 2014). This year we conducted five official tasks: a translation task, a quality estimation task, a metrics task, a tuning task[1], and an automatic post-editing task.

In the translation task (§2), participants were asked to translate a shared test set, optionally restricting themselves to the provided training data. We held ten translation tasks this year, between English and each of Czech, French, German, Finnish, and Russian. The Finnish translation tasks were new this year, providing a lesser-resourced data condition on a challenging language pair. The system outputs for each task were evaluated both automatically and manually.

[1] The metrics and tuning tasks are reported in separate papers (Stanojević et al., 2015a,b).

The human evaluation (§3) involves asking human judges to rank sentences output by anonymized systems. We obtained large numbers of rankings from researchers who contributed evaluations proportional to the number of tasks they entered. We made data collection more efficient and used TrueSkill as the ranking method.

The quality estimation task (§4) this year included three subtasks: sentence-level prediction of post-editing effort scores, word-level prediction of good/bad labels, and document-level prediction of Meteor scores. Datasets were released with English→Spanish news translations for the sentence and word level, and English↔German news translations for the document level.

The first round of the automatic post-editing task (§5) examined automatic methods for correcting errors produced by an unknown machine translation system. Participants were provided with training triples containing source, target and human post-edition, and were asked to return automatic post-editions for unseen (source, target) pairs. This year we focused on correcting English→Spanish news translations.

The primary objectives of WMT are to evaluate the state of the art in machine translation, to disseminate common test sets and public training data with published performance numbers, and to refine evaluation and estimation methodologies for machine translation. As before, all of the data, translations, and collected human judgments are publicly available.[2] We hope these datasets serve as a valuable resource for research into statistical machine translation and automatic evaluation or prediction of translation quality.

[2] http://statmt.org/wmt15/results.html

2 Overview of the Translation Task

The recurring task of the workshop examines translation between English and other languages. As in previous years, the other languages include German, French, Czech and Russian.

Finnish replaced Hindi as the special language this year. Finnish is a lesser-resourced language compared to the other languages and has challenging morphological properties. Finnish also represents a different language family that we had not tackled since we included Hungarian in 2008 and 2009 (Callison-Burch et al., 2008, 2009).

We created a test set for each language pair by translating newspaper articles and provided training data, except for French, where the test set was drawn from user-generated comments on the news articles.

2.1 Test data

The test data for this year’s task was selected from online sources, as before. We took about 1500 English sentences and translated them into the other 5 languages, and then took an additional 1500 sentences from each of the other languages and translated them into English. This gave us test sets of about 3000 sentences for our English-X language pairs, which were either originally written in English and translated into X, or vice versa.

For the French-English discussion forum test set, we collected 38 discussion threads each from the Guardian for English and from Le Monde for French. See Figure 1 for an example.

The composition of the test documents is shown in Table 1.

The stories were translated by the professional translation agency Capita, funded by the EU Framework Programme 7 project MosesCore, and by Yandex, a Russian search engine company.[3] All of the translations were done directly, and not via an intermediate language.

2.2 Training data

As in past years we provided parallel corpora to train translation models, monolingual corpora to train language models, and development sets to tune system parameters. Some training corpora were identical to last year’s (Europarl[4], United Nations, French-English 10^9 corpus, CzEng, Common Crawl, Russian-English parallel data provided by Yandex, Russian-English Wikipedia Headlines provided by CMU), some were updated (News Commentary, monolingual data), and new corpora were added (Finnish Europarl, Finnish-English Wikipedia Headlines corpus).

[3] http://www.yandex.com/

Some statistics about the training materials are given in Figure 2.

2.3 Submitted systems

We received 68 submissions from 24 institutions. The participating institutions and their entry names are listed in Table 2; each system did not necessarily appear in all translation tasks. We also included 1 commercial off-the-shelf MT system and 6 online statistical MT systems, which we anonymized.

For presentation of the results, systems are treated as either constrained or unconstrained, depending on whether their models were trained only on the provided data. Since we do not know how they were built, these online and commercial systems are treated as unconstrained during the automatic and human evaluations.

3 Human Evaluation

Following what we had done for previous workshops, we again conduct a human evaluation campaign to assess translation quality and determine the final ranking of candidate systems. This section describes how we prepared the evaluation data, collected human assessments, and computed the final results.

This year’s evaluation campaign differed from last year’s in several ways:

• In previous years each ranking task compared five different candidate systems which were selected without any pruning or redundancy cleanup. This had resulted in a noticeable amount of near-identical ranking candidates in WMT14, making the evaluation process unnecessarily tedious as annotators ran into a fair amount of ranking tasks containing very similar segments which are hard to inspect. For WMT15, we perform redundancy cleanup as an initial preprocessing step and

[4] As of Fall 2011, the proceedings of the European Parliament are no longer translated into all official languages.


This is perfectly illustrated by the UKIP numbties banning people with HIV.
You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome.
You raise a straw man and then knock it down with thinly veiled homophobia.
Every time I or my family need to use the NHS we have to queue up behind bigots with a sense of entitlement and chronic hypochondria.
I think the straw man is yours.
Health tourism as defined by the right wing loonies is virtually none existent.
I think it’s called democracy.
So no one would be affected by UKIP’s policies against health tourism so no problem.
Only in UKIP La La Land could Carswell be described as revolutionary.
Quoting the bollox The Daily Muck spew out is not evidence.
Ah, shoot the messenger.
The Mail didn’t write the report, it merely commented on it.
Whoever controls most of the media in this country should undead be shot for spouting populist propaganda as fact.
I don’t think you know what a straw man is.
You also don’t know anything about my personal circumstances or identity so I would be very careful about trying to eradicate a debate with accusations of homophobia.
Farage’s comment came as quite a shock, but only because it is so rarely addressed.
He did not express any homophobic beliefs whatsoever.
You will just have to find a way of getting over it.
I’m not entirely sure what you’re trying to say, but my guess is that you dislike the media reporting things you disagree with.
It is so rarely addressed because unlike Fararge and his Thatcherite loony disciples who think aids and floods are a signal from the divine and not a reflection on their own ignorance in understanding the complexities of humanity as something to celebrate,then no.

Figure 1: Example news discussion thread used in the French–English translation task.

Language: Sources (Number of Documents)
Czech: aktualne.cz (4), blesk.cz (1), blisty.cz (1), ctk.cz (1), deník.cz (1), e15.cz (1), iDNES.cz (19), ihned.cz (3), lidovky.cz (6), Novinky.cz (2), tyden.cz (1).
English: ABC News (4), BBC (6), CBS News (1), Daily Mail (1), Euronews (1), Financial Times (1), Fox News (2), Globe and Mail (1), Independent (1), Los Angeles Times (1), News.com Australia (9), Novinite (2), Reuters (2), Sydney Morning Herald (1), stv.tv (1), Telegraph (8), The Local (1), The Nation (1), UPI (1), Washington Post (3).
German: Abendzeitung Nürnberg (1), Aachener Nachrichten (1), Der Standard (2), Deutsche Welle (1), Frankfurter Neue Presse (1), Frankfurter Rundschau (1), Generalanzeiger Bonn (2), Göttinger Tageblatt (1), Haller Kreisblatt (1), Hellweger Anzeiger (1), Junge Welt (1), Kreisanzeiger (1), Mainpost (1), Merkur (3), Mittelbayerische Nachrichten (2), Morgenpost (1), Mitteldeutsche Zeitung (1), Neue Presse Coburg (1), Nürtinger Zeitung (1), OE24 (1), Kölnische Rundschau (1), Tagesspiegel (1), Volksfreund (1), Volksstimme (1), Wiener Zeitung (1), Westfälische Nachrichten (2).
Finnish: Aamulehti (2), Etelä-Saimaa (1), Etelä-Suomen Sanomat (3), Helsingin Sanomat (13), Ilkka (7), Ilta-Sanomat (18), Kaleva (4), Karjalainen (2), Kouvolan Sanomat (1), Lapin Kansa (3), Maaseudun Tulevaisuus (1).
Russian: 168.ru (1), aif (6), altapress.ru (1), argumenti.ru (8), BBC Russian (1), dp.ru (2), gazeta.ru (4), interfax (2), Kommersant (12), lenta.ru (8), lgng (3), mk (5), novinite.ru (1), rbc.ru (1), rg.ru (2), rusplit.ru (1), Sport Express (6), vesti.ru (10).

Table 1: Composition of the test set. For more details see the XML test files. The docid tag gives the source and the date for each document in the test set, and the origlang tag indicates the original source language.


Europarl Parallel Corpus
  French↔English:  2,007,723 sentences; 60,125,563 / 55,642,101 words; 140,915 / 118,404 distinct words
  German↔English:  1,920,209 sentences; 50,486,398 / 53,008,851 words; 381,583 / 115,966 distinct words
  Czech↔English:     646,605 sentences; 14,946,399 / 17,376,433 words; 172,461 / 63,039 distinct words
  Finnish↔English: 1,926,114 sentences; 37,814,266 / 52,723,296 words; 693,963 / 115,896 distinct words

News Commentary Parallel Corpus
  French↔English:  200,239 sentences; 6,270,748 / 5,161,906 words; 75,462 / 71,767 distinct words
  German↔English:  216,190 sentences; 5,513,985 / 5,499,625 words; 157,682 / 74,341 distinct words
  Czech↔English:   152,763 sentences; 3,435,458 / 3,759,874 words; 142,943 / 58,817 distinct words
  Russian↔English: 174,253 sentences; 4,394,974 / 4,625,898 words; 172,021 / 67,402 distinct words

Common Crawl Parallel Corpus
  French↔English:  3,244,152 sentences; 91,328,790 / 81,096,306 words; 889,291 / 859,017 distinct words
  German↔English:  2,399,123 sentences; 54,575,405 / 58,870,638 words; 1,640,835 / 823,480 distinct words
  Czech↔English:     161,838 sentences; 3,529,783 / 3,927,378 words; 210,170 / 128,212 distinct words
  Russian↔English:   878,386 sentences; 21,018,793 / 21,535,122 words; 764,203 / 432,062 distinct words

United Nations Parallel Corpus
  French↔English: 12,886,831 sentences; 411,916,781 / 360,341,450 words; 565,553 / 666,077 distinct words

10^9 Word Parallel Corpus
  French↔English: 22,520,400 sentences; 811,203,407 / 668,412,817 words; 2,738,882 / 2,861,836 distinct words

Yandex 1M Parallel Corpus
  Russian↔English: 1,000,000 sentences; 24,121,459 / 26,107,293 words; 701,809 / 387,646 distinct words

CzEng Parallel Corpus
  Czech↔English: 14,833,358 sentences; 200,658,857 / 228,040,794 words; 1,389,803 / 920,824 distinct words

Wiki Headlines Parallel Corpus
  Russian↔English: 514,859 sentences; 1,191,474 / 1,230,644 words; 282,989 / 251,328 distinct words
  Finnish↔English: 153,728 sentences; 269,429 / 354,362 words; 127,576 / 96,732 distinct words

Europarl Language Model Data
  English: 2,218,201 sentences; 59,848,044 words; 123,059 distinct words
  French:  2,190,579 sentences; 63,439,791 words; 145,496 distinct words
  German:  2,176,537 sentences; 53,534,167 words; 394,781 distinct words
  Czech:     668,595 sentences; 14,946,399 words; 172,461 distinct words
  Finnish: 2,120,739 sentences; 39,511,068 words; 711,868 distinct words

News Language Model Data
  English: 118,337,431 sentences; 2,744,428,620 words; 4,895,080 distinct words
  French:   42,110,011 sentences; 1,025,132,098 words; 2,352,451 distinct words
  German:  135,693,607 sentences; 2,427,581,519 words; 13,727,336 distinct words
  Czech:    45,149,206 sentences; 745,645,366 words; 3,513,784 distinct words
  Russian:  45,835,812 sentences; 823,284,188 words; 3,885,756 distinct words
  Finnish:   1,378,582 sentences; 16,501,511 words; 925,201 distinct words

Test Set
  French↔English:  1500 sentences; 29,858 / 27,173 words; 5,798 / 5,148 distinct words
  German↔English:  2169 sentences; 44,081 / 46,828 words; 9,710 / 7,483 distinct words
  Czech↔English:   2656 sentences; 46,005 / 54,055 words; 13,013 / 7,757 distinct words
  Russian↔English: 2818 sentences; 55,655 / 65,744 words; 15,795 / 8,695 distinct words
  Finnish↔English: 1370 sentences; 19,840 / 27,811 words; 8,553 / 5,279 distinct words

Figure 2: Statistics for the training and test sets used in the translation task. The number of words and the number of distinct words (case-insensitive) is based on the provided tokenizer.


ID                    Institution
AALTO                 Aalto University (Grönroos et al., 2015)
ABUMATRAN             Abu-MaTran (Rubino et al., 2015)
AFRL-MIT-*            Air Force Research Laboratory / MIT Lincoln Lab (Gwinnup et al., 2015)
CHALMERS              Chalmers University of Technology (Kolachina and Ranta, 2015)
CIMS                  University of Stuttgart and Munich (Cap et al., 2015)
CMU                   Carnegie Mellon University
CU-CHIMERA            Charles University (Bojar and Tamchyna, 2015)
CU-TECTO              Charles University (Dušek et al., 2015)
DFKI                  Deutsches Forschungszentrum für Künstliche Intelligenz (Avramidis et al., 2015)
ILLINOIS              University of Illinois (Schwartz et al., 2015)
IMS                   University of Stuttgart (Quernheim, 2015)
KIT                   Karlsruhe Institute of Technology (Cho et al., 2015)
KIT-LIMSI             Karlsruhe Institute of Technology / LIMSI (Ha et al., 2015)
LIMSI                 LIMSI (Marie et al., 2015)
MACAU                 University of Macau
MONTREAL              University of Montreal (Jean et al., 2015)
PROMT                 ProMT
RWTH                  RWTH Aachen (Peter et al., 2015)
SHEFF*                University of Sheffield (Steele et al., 2015)
UDS-SANT              University of Saarland (Pal et al., 2015a)
UEDIN-JHU             University of Edinburgh / Johns Hopkins University (Haddow et al., 2015)
UEDIN-SYNTAX          University of Edinburgh (Williams et al., 2015)
USAAR-GACHA           University of Saarland, Liling Tan
UU                    Uppsala University (Tiedemann et al., 2015)
COMMERCIAL-1          Commercial machine translation system
ONLINE-[A,B,C,E,F,G]  Six online statistical machine translation systems

Table 2: Participants in the shared translation task. Not all teams participated in all language pairs. The translations from the commercial and online systems were not submitted by their respective companies but were obtained by us, and are therefore anonymized in a fashion consistent with previous years of the workshop.


create multi-system translations. As a consequence, we get ranking tasks with varying numbers of candidate systems. To avoid overloading the annotators we still allow a maximum of five candidates per ranking task. If we have more multi-system translations, we choose randomly.

A brief example should illustrate this more clearly: say we have the following two candidate systems:

sysA="This, is ’Magic’"
sysX="this is magic"

After lowercasing, removal of punctuation and whitespace normalization, which are our criteria for identifying near-identical outputs, both would be collapsed into a single multi-system:

sysA+sysX="This, is ’Magic’"

The first representative of a group of near-identical outputs is used as a proxy representing all candidates in the group throughout the evaluation.

While there is a good chance that users would have used some of the stripped information, e.g., case, to differentiate between the two systems relative to each other, the collapsed system’s comparison result against the other candidates should be a good approximation of how human annotators would have ranked them individually. We get a near 2x increase in the number of pairwise comparisons, so the general approach seems helpful. (A short code sketch of this collapsing step follows the list of changes below.)

• After dropping external, crowd-sourced translation assessment in WMT14 we ended up with approximately seventy-five percent less raw comparison data. Still, we were able to compute good confidence intervals on the clusters based on our improved ranking approach.

This year, due to the aforementioned cleanup, annotators spent their time more efficiently, resulting in an increased number of final ranking results. We collected a total of 542,732 individual “A > B” judgments this year, nearly double the amount of data compared to WMT14.

• Last year we compared three different models of producing the final system rankings: Expected Wins (used in WMT13), Hopkins and May (HM) and TrueSkill (TS). Overall, we found the TrueSkill method to work best, which is why we decided to use it as our only approach in WMT15.
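The collapsing step referenced in the first bullet can be sketched as follows (illustrative only; apart from the sysA/sysX example above, the system names and outputs are invented):

```python
# Sketch of the redundancy cleanup: outputs that are identical after
# lowercasing, punctuation removal and whitespace normalization are collapsed
# into one "multi-system" candidate, represented by the first member's
# original (un-normalized) output.
import re
from collections import OrderedDict

def normalize(segment):
    segment = segment.lower()
    segment = re.sub(r"[^\w\s]", "", segment)   # strip punctuation
    return " ".join(segment.split())            # normalize whitespace

def collapse(outputs):
    """outputs: dict system name -> translation of one source segment."""
    groups = OrderedDict()
    for system, text in outputs.items():
        groups.setdefault(normalize(text), []).append((system, text))
    # Each group becomes one candidate, named after all of its members.
    return {"+".join(s for s, _ in members): members[0][1]
            for members in groups.values()}

print(collapse({"sysA": "This, is 'Magic'", "sysX": "this is magic", "sysZ": "That is magic"}))
# {'sysA+sysX': "This, is 'Magic'", 'sysZ': 'That is magic'}
```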

We keep using clusters in our final system rankings, providing a partial ordering (clustering) of all evaluated candidate systems. The semantics remain unchanged from previous years: systems in the same cluster could not be meaningfully distinguished and hence are considered to be of equal quality.
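For illustration, the open-source trueskill Python package can turn a stream of pairwise “A > B” judgments into per-system skill estimates; the judgments below are invented, and the actual WMT pipeline relies on its own adapted TrueSkill implementation together with the clustering described above.

```python
# Illustration only: update TrueSkill ratings from invented pairwise wins and
# read off a ranking by inferred skill (mu).
import trueskill

ratings = {name: trueskill.Rating() for name in ["sysA", "sysB", "sysC"]}
judgments = [("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC"), ("sysA", "sysB")]

for winner, loser in judgments:          # each tuple means "winner beat loser"
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

for name, rating in sorted(ratings.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{name}: mu={rating.mu:.2f} sigma={rating.sigma:.2f}")
```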

3.1 Evaluation campaign overview

WMT15 featured the largest evaluation campaign to date. Similar to last year, we decided to collect researcher-based judgments only. A total of 137 individual annotator accounts were actively involved. Users came from 24 different research groups and contributed judgments on 9,669 HITs.

Overall, these correspond to 29,007 individual ranking tasks (plus some more from incomplete HITs), each of which would have spawned exactly 10 individual “A > B” judgments last year, so we expected at least 290,070 binary data points. Due to our redundancy cleanup, we are able to get a lot more, namely 542,732. We report our inter/intra-annotator agreement scores based on the actual work done (otherwise, we would artificially boost scores based on inferred rankings) and use the full set of data to compute clusters (where the inferred rankings contribute meaningful data).

Human annotation effort was exceptional and we are grateful to all participating individuals and teams. We believe that human rankings provide the best decision basis for machine translation evaluation and it is great to see contributions on this large a scale. In total, our human annotators spent 32 days and 20 hours working in Appraise.

The average annotation time per HIT amounts to 4 minutes 53 seconds. Several annotators passed the mark of 100 HITs annotated, and some worked for more than 24 hours. We do not take this enormous amount of effort for granted and will make sure to improve the evaluation platform and overall process for upcoming workshops.

3.2 Data collection

The system ranking is produced from a large set of pairwise judgments on the translation quality of candidate systems. Annotations are collected in an evaluation campaign that enlists participants in the shared task to help. Each team is asked to contribute one hundred “Human Intelligence Tasks” (HITs) per primary system submitted.

Each HIT consists of three so-called ranking tasks. In a ranking task, an annotator is presented with a source segment, a human reference translation, and the outputs of up to five anonymized candidate systems, randomly selected from the set of participating systems, and displayed in random order. This year, we perform redundancy cleanup as an initial preprocessing step and create multi-system translations. As a consequence, we get ranking tasks with varying numbers of candidate outputs.

There are two main benefits to this approach:

• Annotators are more efficient as they do not have to deal with near-identical translations, which are notoriously hard to differentiate; and
• Potentially, we get higher quality annotations as near-identical systems will be assigned the same “A > B” ranks, improving consistency.

As in previous years, the evaluation campaign is conducted using Appraise[5] (Federmann, 2012), an open-source tool built using Python’s Django framework. At the top of each HIT, the following instructions are provided:

  “You are shown a source sentence followed by several candidate translations. Your task is to rank the translations from best to worst (ties are allowed).”

Annotators can decide to skip a ranking task but are instructed to do this only as a last resort, e.g., if the translation candidates shown on screen are clearly misformatted or contain data issues (wrong language or similar problems). Only a small number of ranking tasks was skipped in WMT15. A screenshot of the Appraise ranking interface is shown in Figure 3.

Annotators are asked to rank the outputs from 1 (best) to 5 (worst), with ties permitted. Note that a lower rank is better. The joint rankings provided by a ranking task are then reduced to the fully expanded set of pairwise rankings produced by considering all (n choose 2) ≤ 10 combinations of the n ≤ 5 outputs in the respective ranking task.

[5] https://github.com/cfedermann/Appraise

For example, consider the following annotationprovided among outputs A,B, F,H , and J :

1 2 3 4 5F •A •B •J •H •

As the number of outputs n depends on the num-ber of corresponding multi-system translations inthe original data, we get varying numbers of re-sulting binary judgments. Assuming that outputsA and F from above are actually near-identical,the annotator this year would see a shorter rankingtask:

     1   2   3   4   5
AF           •
B    •
J                •
H        •

Note that AF is a multi-system translation covering two candidate systems.

Both examples would be reduced to the following set of pairwise judgments:

A > B, A = F, A > H, A < J
B < F, B < H, B < J
F > H, F < J
H < J

Here, A > B should be read as “A is ranked higher than (worse than) B”. Note that by this procedure, the absolute value of ranks and the magnitude of their differences are discarded. Our WMT15 approach including redundancy cleanup allows us to obtain these judgments at a lower cognitive cost for the annotators. This partially explains why we were able to collect more results this year.
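To make the reduction concrete, the following minimal Python sketch (not the campaign's own scripts) expands one ranking task into its pairwise judgments, using the convention above that a higher rank number is worse; the input format is an assumption for the example.

    # Hypothetical sketch: expand a ranking task (one rank per output, ties
    # allowed, lower rank = better) into all C(n,2) pairwise judgments.
    from itertools import combinations

    def expand_ranking(ranks):
        """ranks: dict mapping an output id to its rank, e.g. {'A': 1, 'B': 2}."""
        judgments = []
        for a, b in combinations(sorted(ranks), 2):
            if ranks[a] == ranks[b]:
                relation = "="
            elif ranks[a] > ranks[b]:
                relation = ">"   # a is ranked higher (worse) than b
            else:
                relation = "<"
            judgments.append((a, relation, b))
        return judgments

    print(expand_ranking({"A": 1, "B": 2, "C": 2, "D": 4}))
    # [('A', '<', 'B'), ('A', '<', 'C'), ('A', '<', 'D'),
    #  ('B', '=', 'C'), ('B', '<', 'D'), ('C', '<', 'D')]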

For WMT13, nearly a million pairwise annotations were collected from both researchers and paid workers on Amazon's Mechanical Turk, in a roughly 1:2 ratio. Last year, we collected data from researchers only, a change enabled by the use of TrueSkill for producing the partial ranking for each task (§3.4). This year, based on our redundancy cleanup, we were able to nearly double the amount of annotations, collecting 542,732. See Table 3 for more details.

3.3 Annotator agreement

Figure 3: Screenshot of the Appraise interface used in the human evaluation campaign. The annotator is presented with a source segment, a reference translation, and up to five outputs from competing systems (anonymized and displayed in random order), and is asked to rank these according to their translation quality, with ties allowed.

Each year we calculate annotator agreement scores for the human evaluation as a measure of the reliability of the rankings. We measured pairwise agreement among annotators using Cohen's kappa coefficient (κ) (Cohen, 1960). If P(A) is the proportion of times that the annotators agree, and P(E) is the proportion of times that they would agree by chance, then Cohen's kappa is:

\[ \kappa = \frac{P(A) - P(E)}{1 - P(E)} \]

Note that κ is basically a normalized version of P(A), one which takes into account how meaningful it is for annotators to agree with each other by incorporating P(E). The values for κ range from 0 to 1, with zero indicating no agreement and 1 perfect agreement.

We calculate P(A) by examining all pairs of outputs6 which had been judged by two or more judges, and calculating the proportion of times that they agreed that A < B, A = B, or A > B. In other words, P(A) is the empirical, observed rate at which annotators agree, in the context of pairwise comparisons.

6 Regardless of whether they correspond to an individual system or to a set of systems (“multi-system”) producing nearly identical translations.

As for P(E), it captures the probability that two annotators would agree randomly. Therefore:

\[ P(E) = P(A\!<\!B)^2 + P(A\!=\!B)^2 + P(A\!>\!B)^2 \]

Note that each of the three probabilities in P(E)'s definition is squared to reflect the fact that we are considering the chance that two annotators agree by chance. Each of these probabilities is computed empirically, by observing how often annotators actually assign each of the three relations (better, worse, or tied) to a pair of systems.
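As an illustration (a minimal sketch, not the official scoring code), κ can be computed from doubly-judged pairwise comparisons as follows; the input format is an assumption for the example.

    # Hypothetical sketch: each element of `pairs` holds the judgments of two
    # annotators on the same pair of outputs; a label is one of '<', '=', '>'.
    from collections import Counter

    def cohen_kappa(pairs):
        p_a = sum(a == b for a, b in pairs) / len(pairs)          # observed agreement
        counts = Counter(label for pair in pairs for label in pair)
        total = sum(counts.values())
        p_e = sum((c / total) ** 2 for c in counts.values())      # chance agreement
        return (p_a - p_e) / (1 - p_e)

    print(cohen_kappa([("<", "<"), ("=", ">"), (">", ">"), ("<", "=")]))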

Table 4 shows final κ values for inter-annotator agreement for WMT11–WMT15, while Table 5 details intra-annotator agreement scores, including the division of researchers' (WMT13r) and MTurk (WMT13m) data. The exact interpretation of the kappa coefficient is difficult, but according to Landis and Koch (1977), 0–0.2 is slight, 0.2–0.4 is fair, 0.4–0.6 is moderate, 0.6–0.8 is substantial, and 0.8–1.0 is almost perfect.


Language Pair      Systems   Rankings   Average
Czech→English      17        85,877     5,051.6
English→Czech      16        136,869    8,554.3
German→English     14        40,535     2,895.4
English→German     17        55,123     3,242.5
French→English     8         29,770     3,721.3
English→French     8         34,512     4,314.0
Russian→English    14        46,193     3,299.5
English→Russian    11        49,582     4,507.5
Finnish→English    15        31,577     2,105.1
English→Finnish    11        32,694     2,972.2
Totals WMT15       131       542,732    4,143.0
       WMT14       110       328,830    2,989.3
       WMT13       148       942,840    6,370.5
       WMT12       103       101,969    999.6
       WMT11       133       63,045     474.0

Table 3: Amount of data collected in the WMT15 manual evaluation campaign. The final four rows report summary information from previous editions of the workshop. Note how many rankings we get for the Czech language pairs; these include systems from the tuning shared task. Finnish, as a new language, sees a shortage of rankings for Finnish→English. Interest in French seems to have declined this year, with only seven systems. Overall, we see a nice increase in pairwise rankings, especially considering that we have dropped crowd-sourced annotation and are instead relying on researchers' judgments exclusively.

The inter-annotator agreement rates improve for most language pairs. On average, these are the best scores we have ever observed in one of our evaluation campaigns, including WMT11, where results were inflated due to the inclusion of the reference in the agreement rates. The results for intra-annotator agreement are more mixed: some improve greatly (Czech and German) while others degrade (French, Russian). Our special language, Finnish, also achieves very respectable scores. On average, again, we see the best intra-annotator agreement scores since WMT11.

It should be noted that the improvement is not caused by the “ties forced by our redundancy cleanup”. If two systems A and F produced near-identical outputs, they are collapsed to one multi-system output AF and treated jointly in our agreement calculations, i.e. only in comparison with other outputs. It is only the final TrueSkill scores that include the tie A = F.

3.4 Producing the human ranking

The collected pairwise rankings are used to produce the official human ranking of the systems. For WMT14, we introduced a competition among multiple methods of producing this human ranking, selecting the method based on which could best predict the annotations in a portion of the collected pairwise judgments. The results of this competition were that (a) the competing metrics produced almost identical rankings across all tasks but that (b) one method, TrueSkill, had less variance across randomized runs, allowing us to make more confident cluster predictions. In light of these findings, this year we produced the human ranking for each task using TrueSkill in the following fashion, following procedures adopted for WMT12: We produce 1,000 bootstrap-resampled runs over all of the available data. We then compute a rank range for each system by collecting the absolute rank of each system in each fold, throwing out the top and bottom 2.5%, and then clustering systems into equivalence classes containing systems with overlapping ranges, yielding a partial ordering over systems at the 95% confidence level.

The full list of the official human rankings for each task can be found in Table 6, which also reports all system scores, rank ranges, and clusters for all language pairs and all systems. The official interpretation of these results is that systems in the same cluster are considered tied. Given the large number of judgments that we collected, it was possible to group on average about two systems in a cluster, even though the systems in the middle are typically in larger clusters.

In Figures 4 and 5, we plot the human evaluation results against everybody's favorite metric, BLEU (some of the outlier online systems are not included to keep the graphs readable).


Language Pair     WMT11   WMT12   WMT13   WMT13r  WMT13m  WMT14   WMT15
Czech→English     0.400   0.311   0.244   0.342   0.279   0.305   0.458
English→Czech     0.460   0.359   0.168   0.408   0.075   0.360   0.438
German→English    0.324   0.385   0.299   0.443   0.324   0.368   0.423
English→German    0.378   0.356   0.267   0.457   0.239   0.427   0.423
French→English    0.402   0.272   0.275   0.405   0.321   0.357   0.343
English→French    0.406   0.296   0.231   0.434   0.237   0.302   0.317
Russian→English   —       —       0.278   0.315   0.324   0.324   0.372
English→Russian   —       —       0.243   0.416   0.207   0.418   0.336
Finnish→English   —       —       —       —       —       —       0.388
English→Finnish   —       —       —       —       —       —       0.549
Mean              0.395   0.330   0.260   0.403   0.251   0.367   0.405

Table 4: κ scores measuring inter-annotator agreement for WMT15. See Table 5 for corresponding intra-annotator agreement scores. WMT13r and WMT13m refer to researchers' judgments and crowd-sourced judgments obtained using Mechanical Turk, respectively. WMT14 and WMT15 results are based on researchers' judgments only (hence, comparable to WMT13r).

Language Pair     WMT11   WMT12   WMT13   WMT13r  WMT13m  WMT14   WMT15
Czech→English     0.597   0.454   0.479   0.483   0.478   0.382   0.694
English→Czech     0.601   0.390   0.290   0.547   0.242   0.448   0.584
German→English    0.576   0.392   0.535   0.643   0.515   0.344   0.801
English→German    0.528   0.433   0.498   0.649   0.452   0.576   0.676
French→English    0.673   0.360   0.578   0.585   0.565   0.629   0.510
English→French    0.524   0.414   0.495   0.630   0.486   0.507   0.426
Russian→English   —       —       0.450   0.363   0.477   0.629   0.506
English→Russian   —       —       0.513   0.582   0.500   0.570   0.492
Finnish→English   —       —       —       —       —       —       0.562
English→Finnish   —       —       —       —       —       —       0.697
Mean              0.583   0.407   0.479   0.560   0.464   0.522   0.595

Table 5: κ scores measuring intra-annotator agreement, i.e., self-consistency of judges, across the past few years of the human evaluation campaign. Scores are much higher for WMT15, which makes sense as we enforce annotation consistency through our initial preprocessing, which joins near-identical translation candidates into multi-system entries. It seems that the focus on actual differences in our annotation tasks, as well as the possibility of having “easier” ranking scenarios for n < 5 candidate systems, results in a higher annotator agreement, both for inter- and intra-annotator agreement scores.

The plots clearly suggest that a fair comparison of systems of different kinds cannot rely on automatic scores. Rule-based systems receive a much lower BLEU score than statistical systems (see for instance English–German, e.g., PROMT-RULE). The same is true to a lesser degree for statistical syntax-based systems (see English–German, UEDIN-SYNTAX) and online systems that were not tuned to the shared task (see Czech–English, CU-TECTO vs. the cluster of tuning task systems TT-*).

4 Quality Estimation Task

The fourth edition of the WMT shared task on quality estimation (QE) of machine translation (MT) builds on the previous editions of the task (Callison-Burch et al., 2012; Bojar et al., 2013, 2014), with tasks including both sentence- and word-level estimation, using new training and test datasets, and an additional task: document-level prediction.

The goals of this year’s shared task were:

• Advance work on sentence- and word-level quality estimation by providing larger datasets.

• Investigate the effectiveness of quality labels, features and learning methods for document-level prediction.

• Explore differences between sentence-level and document-level prediction.

• Analyse the effect of training data sizes and quality for sentence- and word-level prediction, particularly the use of annotations obtained from crowdsourced post-editing.


Czech–English
#  score   range  system
1   0.619  1      ONLINE-B
2   0.574  2      UEDIN-JHU
3   0.532  3-4    UEDIN-SYNTAX
    0.518  3-4    MONTREAL
4   0.436  5      ONLINE-A
5  -0.125  6      CU-TECTO
6  -0.182  7-9    TT-BLEU-MIRA-D
   -0.189  7-10   TT-ILLC-UVA
   -0.196  7-11   TT-BLEU-MERT
   -0.210  8-11   TT-AFRL
   -0.220  9-11   TT-USAAR-TUNA
7  -0.263  12-13  TT-DCU
   -0.297  13-15  TT-METEOR-CMU
   -0.320  13-15  TT-BLEU-MIRA-SP
   -0.320  13-15  TT-HKUST-MEANT
   -0.358  15-16  ILLINOIS

English–Czech
#  score   range  system
1   0.686  1      CU-CHIMERA
2   0.515  2-3    ONLINE-B
    0.503  2-3    UEDIN-JHU
3   0.467  4      MONTREAL
4   0.426  5      ONLINE-A
5   0.261  6      UEDIN-SYNTAX
6   0.209  7      CU-TECTO
7   0.114  8      COMMERCIAL1
8  -0.342  9-11   TT-DCU
   -0.342  9-11   TT-AFRL
   -0.346  9-11   TT-BLEU-MIRA-D
9  -0.373  12     TT-USAAR-TUNA
10 -0.406  13     TT-BLEU-MERT
11 -0.563  14     TT-METEOR-CMU
12 -0.808  15     TT-BLEU-MIRA-SP

Russian–English
#  score   range  system
1   0.494  1      ONLINE-G
2   0.311  2      ONLINE-B
3   0.129  3-6    PROMT-RULE
    0.116  3-6    AFRL-MIT-PB
    0.113  3-6    AFRL-MIT-FAC
    0.104  3-7    ONLINE-A
    0.051  6-8    AFRL-MIT-H
    0.010  7-10   LIMSI-NCODE
   -0.021  8-10   UEDIN-SYNTAX
   -0.031  8-10   UEDIN-JHU
4  -0.218  11     USAAR-GACHA
5  -0.278  12     USAAR-GACHA
6  -0.781  13     ONLINE-F

German–English
#  score   range  system
1   0.567  1      ONLINE-B
2   0.319  2-3    UEDIN-JHU
    0.298  2-4    ONLINE-A
    0.258  3-5    UEDIN-SYNTAX
    0.228  4-5    KIT
3   0.141  6-7    RWTH
    0.095  6-7    MONTREAL
4  -0.172  8-10   ILLINOIS
   -0.177  8-10   DFKI
   -0.221  9-10   ONLINE-C
5  -0.304  11     ONLINE-F
6  -0.489  12-13  MACAU
   -0.544  12-13  ONLINE-E

French–English
#  score   range  system
1   0.498  1-2    ONLINE-B
    0.446  1-3    LIMSI-CNRS
    0.415  1-3    UEDIN-JHU
2   0.275  4-5    MACAU
    0.223  4-5    ONLINE-A
3  -0.423  6      ONLINE-F
4  -1.434  7      ONLINE-E

English–French
#  score   range  system
1   0.540  1      LIMSI-CNRS
2   0.304  2-3    ONLINE-A
    0.258  2-4    UEDIN-JHU
    0.215  3-4    ONLINE-B
3  -0.001  5      CIMS
4  -0.338  6      ONLINE-F
5  -0.977  7      ONLINE-E

English–Russian
#  score   range  system
1   1.015  1      PROMT-RULE
2   0.521  2      ONLINE-G
3   0.217  3      ONLINE-B
4   0.122  4-5    LIMSI-NCODE
    0.075  4-5    ONLINE-A
5   0.014  6      UEDIN-JHU
6  -0.138  7      UEDIN-SYNTAX
7  -0.276  8      USAAR-GACHA
8  -0.333  9      USAAR-GACHA
9  -1.218  10     ONLINE-F

English–German
#  score   range  system
1   0.359  1-2    UEDIN-SYNTAX
    0.334  1-2    MONTREAL
2   0.260  3-4    PROMT-RULE
    0.235  3-4    ONLINE-A
3   0.148  5      ONLINE-B
4   0.086  6      KIT-LIMSI
5   0.036  7-9    UEDIN-JHU
    0.003  7-11   ONLINE-F
   -0.001  7-11   ONLINE-C
   -0.018  8-11   KIT
   -0.035  9-11   CIMS
6  -0.133  12-13  DFKI
   -0.137  12-13  ONLINE-E
7  -0.235  14     UDS-SANT
8  -0.400  15     ILLINOIS
9  -0.501  16     IMS

Finnish–English
#  score   range  system
1   0.675  1      ONLINE-B
2   0.280  2-4    PROMT-SMT
    0.246  2-5    ONLINE-A
    0.236  2-5    UU
    0.182  4-7    UEDIN-JHU
    0.160  5-7    ABUMATRAN-COMB
    0.144  5-8    UEDIN-SYNTAX
    0.081  7-8    ILLINOIS
3  -0.081  9      ABUMATRAN-HFS
4  -0.177  10     MONTREAL
5  -0.275  11     ABUMATRAN
6  -0.438  12-13  LIMSI
   -0.513  13-14  SHEFFIELD
   -0.520  13-14  SHEFF-STEM

English–Finnish
#  score   range  system
1   1.069  1      ONLINE-B
2   0.548  2      ONLINE-A
3   0.210  3      UU
4   0.042  4      ABUMATRAN-COMB
5  -0.059  5      ABUMATRAN-COMB
6  -0.143  6-7    AALTO
   -0.184  6-8    UEDIN-SYNTAX
   -0.212  6-8    ABUMATRAN
7  -0.342  9      CMU
8  -0.929  10     CHALMERS

Table 6: Official results for the WMT15 translation task. Systems are ordered by their inferred system means, though systems within a cluster (first column) are considered tied. Clusters are determined according to bootstrap resampling at p-level p ≤ .05. Systems with grey background in the original report indicate use of resources that fall outside the constraints provided for the shared task.


[Figure 4: scatter plots of human evaluation score (HUMAN) versus BLEU, one panel each for English–German, German–English, English–Czech, and Czech–English, with individual systems labelled.]

Figure 4: Human evaluation scores versus BLEU scores for the German–English and Czech–English language pairs illustrate the need for human evaluation when comparing systems of different kinds. Confidence intervals are indicated by the shaded ellipses. Rule-based systems and, to a lesser degree, syntax-based statistical systems receive a lower BLEU score than their human score would indicate. The big cluster in the Czech–English plot are tuning task submissions.


[Figure 5: scatter plots of human evaluation score (HUMAN) versus BLEU, one panel each for English–French, French–English, Russian–English, English–Russian, Finnish–English, and English–Finnish, with individual systems labelled.]

Figure 5: Human evaluation versus BLEU scores for the French–English, Russian–English, and Finnish–English language pairs.



Three tasks were proposed: Task 1 at sentence level (Section 4.3), Task 2 at word level (Section 4.4), and Task 3 at document level (Section 4.5). Tasks 1 and 2 provide the same dataset, with English-Spanish translations generated by a statistical machine translation (SMT) system, while Task 3 provides two different datasets, for two language pairs: English-German (EN-DE) and German-English (DE-EN) translations taken from all participating systems in WMT13 (Bojar et al., 2013). These datasets were annotated with different labels for quality: for Tasks 1 and 2, the labels were automatically derived from the post-editing of the machine translation output, while for Task 3, scores were computed based on reference translations using Meteor (Banerjee and Lavie, 2005). Any external resource, including additional quality estimation training data, could be used by participants (no distinction between open and closed tracks was made). As presented in Section 4.1, participants were also provided with a baseline set of features for each task, and a software package to extract these and other quality estimation features and perform model learning, with suggested methods for all levels of prediction. Participants, described in Section 4.2, could submit up to two systems for each task.

Data used to build MT systems or internal system information (such as model scores or n-best lists) were not made available this year, as multiple MT systems were used to produce the datasets, especially for Task 3, including online and rule-based systems. Therefore, as a general rule, participants could only use black-box features.

4.1 Baseline systems

Sentence-level baseline system: For Task 1, QUEST7 (Specia et al., 2013) was used to extract 17 MT system-independent features from the source and translation (target) files and parallel corpora:

• Number of tokens in the source and target sentences.

• Average source token length.

• Average number of occurrences of the target word within the target sentence.

• Number of punctuation marks in source and target sentences.

• Language model (LM) probability of source and target sentences based on models for the WMT News Commentary corpus.

• Average number of translations per source word in the sentence as given by IBM Model 1 extracted from the WMT News Commentary parallel corpus, and thresholded such that P(t|s) > 0.2 / P(t|s) > 0.01.

• Percentage of unigrams, bigrams and trigrams in frequency quartiles 1 (lower frequency words) and 4 (higher frequency words) in the source language extracted from the WMT News Commentary corpus.

• Percentage of unigrams in the source sentence seen in the source side of the WMT News Commentary corpus.

7 https://github.com/lspecia/quest

These features were used to train a Support Vector Regression (SVR) algorithm using a Radial Basis Function (RBF) kernel within the SCIKIT-LEARN toolkit.8 The γ, ε and C parameters were optimised via grid search with 5-fold cross-validation on the training set. We note that although the system is referred to as “baseline”, it is in fact a strong system. It has proved robust across a range of language pairs, MT systems, and text domains for predicting various forms of post-editing effort (Callison-Burch et al., 2012; Bojar et al., 2013, 2014).
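For illustration, a minimal sketch of such a setup with scikit-learn follows; the parameter grids and file names are placeholders, not the official configuration.

    # Hypothetical sketch of the sentence-level baseline learner: SVR with an
    # RBF kernel, tuning gamma, epsilon and C by 5-fold cross-validated grid search.
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    def train_baseline(X_train, y_train):
        param_grid = {                    # illustrative grid, not the official one
            "C": [1, 10, 100],
            "gamma": [1e-3, 1e-2, 1e-1],
            "epsilon": [0.01, 0.1, 0.2],
        }
        search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                              scoring="neg_mean_absolute_error")
        search.fit(X_train, y_train)
        return search.best_estimator_

    # model = train_baseline(X, y)              # X: 17 baseline features, y: HTER
    # hter_predictions = model.predict(X_test)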

Word-level baseline system: For Task 2, the baseline features were extracted with the MARMOT tool9. For the baseline system we used a number of features that have been found to be the most informative in previous research on word-level quality estimation. Our baseline set of features is loosely based on the one described in (Luong et al., 2014). It contains the following 25 features:

• Word count in the source and target sentences, and the source/target token count ratio. Although these features are sentence-level (i.e. their values will be the same for all words in a sentence), the length of a sentence might influence the probability of a word being incorrect.

8 http://scikit-learn.org/
9 https://github.com/qe-team/marmot


• Target token, its left and right contexts of one word.

• Source token aligned to the target token, its left and right contexts of one word. The alignments were produced with the force_align.py script, which is part of cdec (Dyer et al., 2010). It allows aligning new parallel data with a pre-trained alignment model built with the cdec word aligner (fast_align). The alignment model was trained on the Europarl corpus (Koehn, 2005).

• Boolean dictionary features: whether the target token is a stopword, a punctuation mark, a proper noun, or a number.

• Target language model features:

  – The order of the highest-order n-gram which starts or ends with the target token.

  – Backoff behaviour of the n-grams $(t_{i-2}, t_{i-1}, t_i)$, $(t_{i-1}, t_i, t_{i+1})$, $(t_i, t_{i+1}, t_{i+2})$, where $t_i$ is the target token (the backoff behaviour is computed as described in (Raybaud et al., 2011)).

• The order of the highest-order n-gram which starts or ends with the source token.

• Boolean pseudo-reference feature: 1 if the token is contained in a pseudo-reference, 0 otherwise. The pseudo-reference used for this feature is the automatic translation generated by an English-Spanish phrase-based SMT system trained on the Europarl corpus with standard settings.10

• The part-of-speech tags of the target and source tokens.

• The number of senses of the target and source tokens in WordNet.

We model the task as a sequence prediction problem and train our baseline system using the Linear-Chain Conditional Random Fields (CRF) algorithm with the CRF++ tool.11

10 http://www.statmt.org/moses/?n=Moses.Baseline
11 http://taku910.github.io/crfpp/
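The baseline itself was trained with CRF++; purely as an illustration of the same linear-chain sequence-labelling setup in Python, a sketch using the sklearn-crfsuite bindings (a different toolkit, shown here only to stay consistent with the other Python examples) might look like this, with per-token feature dicts assumed to come from MARMOT-style extraction.

    # Hypothetical sketch (sklearn-crfsuite instead of CRF++): linear-chain CRF
    # over per-token feature dicts, predicting one 'OK'/'BAD' tag per target token.
    import sklearn_crfsuite

    def train_word_level(X_train, y_train):
        """X_train: list of sentences, each a list of {feature_name: value} dicts;
        y_train: list of sentences, each a list of 'OK'/'BAD' tags."""
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit(X_train, y_train)
        return crf

    # predicted_tags = train_word_level(X, y).predict(X_test)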

Document-level baseline system: For Task 3, the baseline features for sentence-level prediction were used. These are aggregated by summing or averaging their values for the entire document. Features that were summed: number of tokens in the source and target sentences, and number of punctuation marks in source and target sentences. All other features were averaged. The implementation for document-level feature extraction is available in QUEST++ (Specia et al., 2015).12

These features were then used to train an SVR algorithm with RBF kernel using the SCIKIT-LEARN toolkit. The γ, ε and C parameters were optimised via grid search with 5-fold cross-validation on the training set.
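A small sketch of the aggregation step described above follows; the feature names are assumptions for the example, not the exact QUEST++ identifiers.

    # Hypothetical sketch: turn per-sentence baseline feature dicts into one
    # document-level vector by summing count-like features and averaging the rest.
    SUMMED = {"source_token_count", "target_token_count",
              "source_punct_count", "target_punct_count"}   # assumed feature names

    def aggregate_document(sentence_features):
        """sentence_features: list of dicts, one per sentence of the document."""
        doc = {}
        for key in sentence_features[0]:
            values = [features[key] for features in sentence_features]
            doc[key] = sum(values) if key in SUMMED else sum(values) / len(values)
        return doc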

4.2 Participants

Table 7 lists all participating teams submitting systems to any of the tasks. Each team was allowed up to two submissions for each task and language pair. In the descriptions below, participation in specific tasks is denoted by a task identifier.

DCU-SHEFF (Task 2): The system uses the baseline set of features provided for the task. Two pre-processing data manipulation techniques were used: data selection and data bootstrapping. Data selection filters out sentences which have the smallest proportion of erroneous tokens and are assumed to be the least useful for the task. Data bootstrapping enhances the training data with incomplete training sentences (e.g. the first k words of a sentence of length N, where k < N). This technique creates additional data instances and boosts the importance of errors occurring in the training data. The combination of these techniques doubled the F1 score for the “BAD” class, as compared to a model trained on the entire dataset given for the task. The labelling was performed with a CRF model trained using the CRF++ tool, as in the baseline system.

HDCL (Task 2): HDCL's submissions are based on a deep neural network that learns continuous feature representations from scratch, i.e. from bilingual contexts. The network was pre-trained by initialising the word lookup-table with distributed word representations, and fine-tuned for the quality estimation classification task by back-propagating word-level prediction errors using stochastic gradient descent. In addition to the continuous space deep model, a shallow linear classifier was trained on the provided baseline features and their quadratic expansion. One of the submitted systems (QUETCH) relies on the deep model only; the other (QUETCHPLUS) is a linear combination of the QUETCH system score, the linear classifier score, and binary and binned baseline features. The system combination yielded significant improvements, showing that the deep and shallow models each contribute complementary information to the combination.

12 https://github.com/ghpaetzold/questplusplus


ID            Participating team
DCU-SHEFF     Dublin City University, Ireland and University of Sheffield, UK (Logacheva et al., 2015)
HDCL          Heidelberg University, Germany (Kreutzer et al., 2015)
LORIA         Lorraine Laboratory of Research in Computer Science and its Applications, France (Langlois, 2015)
RTM-DCU       Dublin City University, Ireland (Bicici et al., 2015)
SAU-KERC      Shenyang Aerospace University, China (Shang et al., 2015)
SHEFF-NN      University of Sheffield Team 1, UK (Shah et al., 2015)
UAlacant      Alicant University, Spain (Espla-Gomis et al., 2015a)
UGENT         Ghent University, Belgium (Tezcan et al., 2015)
USAAR-USHEF   University of Sheffield, UK and Saarland University, Germany (Scarton et al., 2015a)
USHEF         University of Sheffield, UK (Scarton et al., 2015a)
HIDDEN        Undisclosed

Table 7: Participants in the WMT15 quality estimation shared task.


LORIA (Task 1): The LORIA system for Task 1 is based on a standard machine learning approach where source-target sentences are described by numerical vectors and SVR is used to learn a regression model between these vectors and quality scores. Feature vectors used the 17 baseline features, two Latent Semantic Indexing (LSI) features and 31 features based on pseudo-references. The LSI approach considers source-target pairs as documents, and projects the TF-IDF words-documents matrix into a reduced numerical space. This leads to a measure of similarity between a source and a target sentence, which was used as a feature. Two of these features were used, based on two matrices: one from the Europarl corpus and one from the official training data. Pseudo-references were produced by three online systems. These features measure the intersection between n-gram sets of the target sentence and of the pseudo-references. Three sets of features were extracted from each online system, and a fourth feature was extracted measuring the inter-agreement among the three online systems and the target system.

RTM-DCU (Tasks 1, 2, 3): RTM-DCU systems are based on referential translation machines (RTM) (Bicici, 2013; Bicici and Way, 2014). RTMs propose a language-independent approach and avoid the need to access any task- or domain-specific information or resource. The submissions used features that indicate the closeness of instances to the available training data, the difficulty of translating them, and the presence of acts of translation for data transformation. SVR was used for the document- and sentence-level prediction tasks, also in combination with feature selection or partial least squares, and global linear models with dynamic learning were used for the word-level prediction task.

SAU (Task 2): The SAU submissions used a CRF model to predict the binary labels for Task 2. They rely on 12 basic features and 85 combination features. The ratio between OK and BAD labels was found to be 4:1 in the training set. Two strategies were proposed to address this label imbalance. The first strategy is to replace “OK” labels with sub-labels to balance the label distribution, where the sub-labels are OK_B, OK_I, OK_E, OK (depending on the position of the token in the sentence). The second strategy is to reconstruct the training set to include more “BAD” words.

SHEFF-NN (Tasks 1, 2): SHEFF-NN submissions were based on (i) a Continuous Space Language Model (CSLM) to extract additional features for Task 1 (SHEF-GP and SHEF-SVM), (ii) a Continuous Bag-of-Words (CBOW) model to produce word embeddings as features for Task 2 (SHEF-W2V), and (iii) a combination of features produced by QUEST++ and a feature produced with word embedding models (SHEF-QuEst++). SVR and Gaussian Processes were used to learn prediction models for Task 1, and a CRF algorithm for binary tagging models in Task 2 (Pystruct Linear-chain CRF trained with a structured SVM for system SHEF-W2V, and CRFSuite Adaptive Regularisation of Weight Vector (AROW) and Passive Aggressive (PA) algorithms for system SHEF-QuEst++). Interesting findings for Task 1 were that (i) CSLM features always bring improvements whenever added to either baseline or complete feature sets and (ii) CSLM features alone perform better than the baseline features. For Task 2, the results obtained by SHEF-W2V are promising: although it uses only features learned in an unsupervised fashion (CBOW word embeddings), it was able to outperform the baseline as well as many other systems. Further, combining the source-to-target cosine similarity feature with the ones produced by QUEST++ led to improvements in the F1 of “BAD” labels.

UAlacant (Task 2): The submissions of the Universitat d'Alacant team were obtained by applying the approach in (Espla-Gomis et al., 2015b), which uses any source of bilingual information available as a black box in order to spot sub-segment correspondences between a sentence S in the source language and a given translation hypothesis T in the target language. These sub-segment correspondences are used to extract a collection of features that is then used by a multilayer perceptron to determine the word-level predicted score. Three sources of bilingual information available online were used: two online machine translation systems, Apertium13 and Google Translate, and the bilingual concordancer Reverso Context.14 Two submissions were made for Task 2: one using only the 70 features described in (Espla-Gomis et al., 2015b), and one combining them with the baseline features provided by the task organisers.

UGENT (Tasks 1, 2): The submissions for the word-level task used 55 new features in combination with the baseline feature set to train binary classifiers. The new features try to capture either accuracy (meaning transfer from source to target sentence) using word and phrase alignments, or fluency (well-formedness of the target sentence) using language models trained on word surface forms and on part-of-speech tags. Based on the combined feature set, SCATE-MBL uses a memory-based learning (MBL) algorithm for binary classification. SCATE-HYBRID uses the same feature set and forms a classifier ensemble using CRFs in combination with the MBL system for predicting word-level quality. For the sentence-level task, SCATE-SVM-single uses a single feature to train SVR models, which is based on the percentage of words that are labelled as “BAD” by the word-level quality estimation system SCATE-HYBRID. SCATE-SVM adds 16 new features to this single feature and the baseline feature set to train SVR models using an RBF kernel. Additional language resources are used to extract the new features for both tasks.

13 http://www.apertium.org
14 http://context.reverso.net/translation/

USAAR-USHEF (Task 3): The systems submitted for both EN-DE and DE-EN (called BFF) were built by using an exhaustive search for feature selection over the official baseline features. In order to select the best features, a Bayesian Ridge classifier was trained for each feature combination and the classifiers were evaluated in terms of Mean Average Error (MAE): the classifier with the smallest MAE was considered the best. For EN-DE, the selected features were: average source token length, and percentage of unigrams and of trigrams in the fourth quartile of frequency in a corpus of the source language. For DE-EN, the best features were: number of occurrences of the target word within the target hypothesis, and percentage of unigrams and of trigrams in the first quartile of frequency in a corpus of the source language. This provides an indication of which features of the baseline set contribute to document-level quality estimation.

USHEF (Task 3): The system submitted for the EN-DE document-level task was built by using the 17 official baseline features, plus discourse features (repetition of words, lemmas and nouns, and ratio of repetitions), as implemented in QUEST++. For DE-EN, a combination of the 17 baseline features, the discourse repetition features and discourse-aware features extracted from syntactic and discourse parsers was used. The new discourse features are: number of pronouns, number of connectives, number of satellite and nucleus relations in the RST (Rhetorical Structure Theory) tree for the document, and number of EDU (Elementary Discourse Units) breaks in the text. A backward feature selection approach, based on the feature rank of SCIKIT-LEARN's Random Forest implementation, was also applied. For both language pairs, the same algorithm as that of the baseline system was used: the SCIKIT-LEARN implementation of SVR with RBF kernel and hyper-parameters optimised via grid search.

HIDDEN (Task 3): This submission, whose creators preferred to remain anonymous, estimates the quality of a given document by explicitly identifying potential translation errors in it. Translation error detection is implemented as a combination of human expert knowledge and different language processing tools, including named entity recognition, part-of-speech tagging and word alignments. In particular, the system looks for patterns of errors defined by human experts, taking into account the actual words and the additional linguistic information. With this approach, a wide variety of errors can be detected: from simple misspellings and typos to complex lack of agreement (in gender, number and tense), or lexical inconsistencies. Each error category is assigned an “importance”, again according to human knowledge, and the amount of error in the document is computed as the weighted sum of the identified errors. Finally, the documents are sorted according to this figure to generate the final submission to the ranking variant of Task 3.

4.3 Task 1: Predicting sentence-level quality

This task consists in scoring (and ranking) translation sentences according to the percentage of their words that need to be fixed. It is similar to Task 1.2 in WMT14. HTER (Snover et al., 2006b) is used as the quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1].

As in previous years, two variants of the results could be submitted:

• Scoring: An absolute HTER score for each sentence translation, to be interpreted as an error metric: lower scores mean better translations.

• Ranking: A ranking of sentence translations for all source sentences from best to worst. For this variant, it does not matter how the ranking is produced (from HTER predictions or by other means). The reference ranking is defined based on the true HTER scores.

Data The data is the same as that used for the WMT15 Automatic Post-editing task,15 as kindly provided by Unbabel.16 Source segments are tokenized English sentences from the news domain with at least four tokens. Target segments are tokenized Spanish translations produced by an online SMT system. The human post-editions are a manual revision of the target, collected using Unbabel's crowd post-editing platform. HTER labels were computed using the TERCOM tool17 with default settings (tokenised, case insensitive, exact matching only), but with scores capped to 1.
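As a rough illustration of the label (a simplification that ignores TER's block shifts and TERCOM's own tokenisation and options), the quantity being predicted is essentially an edit rate between the MT output and its post-edition, capped at 1.

    # Hypothetical sketch: word-level edit distance between the MT output and its
    # post-edition, normalised by the post-edition length and capped at 1.
    def approximate_hter(mt_tokens, pe_tokens):
        n, m = len(mt_tokens), len(pe_tokens)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if mt_tokens[i - 1] == pe_tokens[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return min(d[n][m] / max(m, 1), 1.0)

    print(approximate_hter("la casa azul grande".split(), "la gran casa azul".split()))  # 0.5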

As training and development data, we provided English-Spanish datasets with 11,271 and 1,000 source sentences, their machine translations, post-editions and HTER scores, respectively. As test data, we provided an additional set of 1,817 English-Spanish source–translation pairs produced by the same MT system used for the training data.

15 http://www.statmt.org/wmt15/ape-task.html
16 https://unbabel.com/
17 http://www.cs.umd.edu/~snover/tercom/

Evaluation Evaluation was performed against the true HTER label and/or ranking, using the same metrics as in previous years:

• Scoring: Mean Average Error (MAE) (primary metric, official score for ranking submissions), Root Mean Squared Error (RMSE).

• Ranking: DeltaAvg (primary metric) and Spearman's ρ rank correlation.

Additionally, we included Pearson's r correlation against the true HTER label, as suggested by Graham (2015).

Statistical significance on MAE and DeltaAvg was computed using a pairwise bootstrap resampling (1K times) approach with 95% confidence intervals.18 For Pearson's r correlation, we measured significance using the Williams test, as also suggested in (Graham, 2015).

18 http://www.quest.dcs.shef.ac.uk/wmt15_files/bootstrap-significance.pl
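For reference, a sketch of the scoring metrics on a set of predictions (DeltaAvg and the significance tests are not reproduced here; scipy/numpy are used for the correlations):

    # Hypothetical sketch of the scoring-variant metrics against the true HTER labels.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def scoring_metrics(predicted, gold):
        predicted, gold = np.asarray(predicted, float), np.asarray(gold, float)
        return {
            "MAE": float(np.mean(np.abs(predicted - gold))),
            "RMSE": float(np.sqrt(np.mean((predicted - gold) ** 2))),
            "Pearson r": pearsonr(predicted, gold)[0],
            "Spearman rho": spearmanr(predicted, gold)[0],
        }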

Results Table 8 summarises the results for the ranking variant of Task 1. They are sorted from best to worst using the DeltaAvg metric scores as primary key and the Spearman's ρ rank correlation scores as secondary key.

The results for the scoring variant are presented in Table 9, sorted from best to worst by using the MAE metric scores as primary key and the RMSE metric scores as secondary key.

Pearson’s r coefficients for all systems againstHTER is given in Table 10. As discussed in(Graham, 2015), the results according to this met-ric can rank participating systems differently. Inparticular, we note the SHEF/GP submission, arewhich is deemed significantly worse than the base-line system according to MAE, but substantiallybetter than the baseline according to Pearson’scorrelation. Graham (2015) argues that the useof MAE as evaluation score for quality estima-tion tasks is inadequate, as MAE is very sensitiveto variance. This means that a system that out-puts predictions with high variance is more likelyto have high MAE score, even if the distributionfollows that of the true labels. Interestingly, ac-cording to Pearson’s correlation, the systems are

18http://www.quest.dcs.shef.ac.uk/wmt15_files/bootstrap-significance.pl

ranked exactly in the same way as according toour DeltaAvg metric. The only difference is thatthe 4th place is now considered significantly dif-ferent from the three winning submissions. Shealso argues that the significance tests used withMAE, based on randomised resampling, assumethat the data is independent, which is not the case.Therefore, we apply the suggested Williams sig-nificance test for this metric.

4.4 Task 2: Predicting word-level quality

The goal of this task is to evaluate the extent to which we can detect word-level errors in MT output. Often, the overall quality of a translated segment is significantly harmed by specific errors in a small proportion of words. Various classes of errors can be found in translations, but for this task we consider all error types together, aiming at making a binary distinction between 'GOOD' and 'BAD' tokens. The decision to bucket all error types together was made because of the lack of sufficient training data that could allow consideration of more fine-grained error tags.

Data This year’s word-level task uses the samedataset as Task 1, for a single language pair:English-Spanish. Each instance of the training,development and test sets consists of the follow-ing elements:

• Source sentence (English).

• Automatic translation (Spanish).

• Manual post-edition of the automatic translation.

• Word-level binary (“OK”/“BAD”) labelling of the automatic translation.

The binary labels for the datasets were acquired automatically with the TERCOM tool (Snover et al., 2006b).19 This tool computes the edit distance between the machine-translated sentence and its reference (in this case, its post-edited version). It identifies four types of errors: substitution of a word with another word, deletion of a word (a word was omitted by the translation system), insertion of a word (a redundant word was added by the translation system), and word or word-sequence shift (word order error). Every word in the machine-translated sentence is tagged with one of these error types, or not tagged if it matches a word from the reference.

19 http://www.cs.umd.edu/~snover/tercom/


English-Spanish
System ID                       DeltaAvg ↑   Spearman's ρ ↑
• LORIA/17+LSI+MT+FILTRE        6.51         0.36
• LORIA/17+LSI+MT               6.34         0.37
• RTM-DCU/RTM-FS+PLS-SVR        6.34         0.37
• RTM-DCU/RTM-FS-SVR            6.09         0.35
UGENT-LT3/SCATE-SVM             6.02         0.34
UGENT-LT3/SCATE-SVM-single      5.12         0.30
SHEF/SVM                        5.05         0.28
SHEF/GP                         3.07         0.28
Baseline SVM                    2.16         0.13

Table 8: Official results for the ranking variant of the WMT15 quality estimation Task 1. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to pairwise bootstrap resampling (1K times) with 95% confidence intervals. The systems in the grey area are not different from the baseline system at a statistically significant level according to the same test.

English-Spanish
System ID                       MAE ↓   RMSE ↓
• RTM-DCU/RTM-FS+PLS-SVR        13.25   17.48
• LORIA/17+LSI+MT+FILTRE        13.34   17.35
• RTM-DCU/RTM-FS-SVR            13.35   17.68
• LORIA/17+LSI+MT               13.42   17.45
• UGENT-LT3/SCATE-SVM           13.71   17.45
UGENT-LT3/SCATE-SVM-single      13.76   17.79
SHEF/SVM                        13.83   18.01
Baseline SVM                    14.82   19.13
SHEF/GP                         15.16   18.97

Table 9: Official results for the scoring variant of the WMT15 quality estimation Task 1. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to bootstrap resampling (1K times) with 95% confidence intervals. The systems in the grey area are not different from the baseline system at a statistically significant level according to the same test.

System ID                       Pearson's r ↑
• LORIA/17+LSI+MT+FILTRE        0.39
• LORIA/17+LSI+MT               0.39
• RTM-DCU/RTM-FS+PLS-SVR        0.38
RTM-DCU/RTM-FS-SVR              0.38
UGENT-LT3/SCATE-SVM             0.37
UGENT-LT3/SCATE-SVM-single      0.32
SHEF/SVM                        0.29
SHEF/GP                         0.19
Baseline SVM                    0.14

Table 10: Alternative results for the scoring variant of the WMT15 quality estimation Task 1. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to the Williams test with 95% confidence intervals. The systems in the grey area are not different from the baseline system at a statistically significant level according to the same test.

All the untagged (correct) words were tagged with “OK”, while the words tagged with substitution and insertion errors were assigned the tag “BAD”. The deletion errors are not associated with any word in the automatic translation, so we could not consider them. We also disabled the shift errors by running TERCOM with the option ‘-d 0’. The reason for that is the fact that searching for shifts introduces significant noise in the annotation. The system cannot discriminate between cases where a word was really shifted and where a word (especially common words such as prepositions, articles and pronouns) was deleted in one part of the sentence and then independently inserted in another part of this sentence, i.e. to correct an unrelated error. The statistics of the datasets are outlined in Table 11.

           Sentences   Words     % of “BAD” words
Training   11,271      257,548   19.14
Dev        1,000       23,207    19.18
Test       1,817       40,899    18.87

Table 11: Datasets for Task 2.
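Schematically, the labelling rule above can be sketched as follows; the per-token list of edit operations is a hypothetical input format for illustration, not TERCOM's actual output.

    # Hypothetical sketch: exact matches stay 'OK', substitutions and insertions
    # become 'BAD'; deletions have no MT-side token to tag, and shifts are
    # disabled ('-d 0') when TERCOM is run.
    def ops_to_tags(edit_ops):
        """edit_ops: one of 'match', 'sub', 'ins' per token of the MT output."""
        return ["OK" if op == "match" else "BAD" for op in edit_ops]

    print(ops_to_tags(["match", "sub", "match", "ins"]))  # ['OK', 'BAD', 'OK', 'BAD']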

Evaluation Submissions were evaluated in terms of classification performance against the original labels. The main evaluation metric is the average F1 for the “BAD” class. Statistical significance on F1 for the “BAD” class was computed using approximate randomization tests.20
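A sketch of the primary metric on flattened tag sequences (using scikit-learn; the significance machinery is not reproduced):

    # Hypothetical sketch: F1 of the 'BAD' class over all tokens of the test set.
    from sklearn.metrics import f1_score

    def f1_bad(gold_sentences, predicted_sentences):
        gold = [tag for sentence in gold_sentences for tag in sentence]
        pred = [tag for sentence in predicted_sentences for tag in sentence]
        return f1_score(gold, pred, pos_label="BAD")

    print(f1_bad([["OK", "BAD", "OK"]], [["OK", "BAD", "BAD"]]))  # ~0.667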

Results The results for Task 2 are summarised in Table 12. The results are ordered by F1 score for the error (BAD) class.

Using the F1 score for the word-level estimation task has a number of drawbacks. First of all, we cannot use it as the single metric to evaluate the system's quality. The F1 score of the class “BAD” becomes an inadequate metric when one is also interested in the tagging of correct words. In fact, a naive baseline which tags all words with the class “BAD” would yield a 31.75 F1 score for the “BAD” class in the test set of this task, which is close to some of the submissions and by far exceeds the baseline, although this tagging is uninformative.

We could instead use the weighted F1 score, which would lead to a single F1 figure where every class is given a weight according to its frequency in the test set. However, we believe the weighted F1 score does not reflect the real quality of the systems either. Since there are many more instances of the “GOOD” class than there are of the “BAD” class, the performance on the “BAD” class does not contribute much weight to the overall score, and changes in accuracy of error prediction on this less frequent class can go unnoticed. The weighted F1 score for the strategy which tags all words as “GOOD” would be 72.66, which is higher than the score of many submissions. However, similar to the case of tagging all words as “BAD”, this strategy is uninformative. In an attempt to find more intuitive ways of evaluating word-level tasks, we introduce a new metric called sequence correlation. It gives higher importance to the instances of the “BAD” class and is robust against uninformative tagging.

20 http://www.nlpado.de/~sebastian/software/sigf.shtml

The basis of the sequence correlation metric is the number of matching labels in the reference and the hypothesis, analogously to a precision metric. However, it has some additional features that are aimed at making it more reliable. We consider the tagging of each sentence separately as a sequence of tags. We divide each sequence into sub-sequences tagged with the same tag; for example, the sequence “OK BAD OK OK OK” will be represented as a list of 3 sub-sequences: [ “OK”, “BAD”, “OK OK OK” ]. Each sub-sequence also carries the information on its position in the original sentence. The sub-sequences of the reference and the hypothesis are then intersected, and the number of matching tags in the corresponding sub-sequences is computed so that every sub-sequence can be used only once. Let us consider the following example:

Reference:  OK BAD OK OK OK
Hypothesis: OK OK  OK OK OK

Here, the reference has three sub-sequences, as in the previous example, and the hypothesis consists of only one sub-sequence, which coincides with the hypothesis itself, because all the words were tagged with the “OK” label. The precision score for this sentence will be 0.8, as 4 of 5 labels match in this example. However, we notice that the hypothesis sub-sequence covers two matching sub-sequences of the reference: word 1 and words 3–5. According to our metric, the hypothesis sub-sequence can be used for the intersection only once, giving either 1 of 5 or 3 of 5 matching words. We choose the highest value and get a score of 0.6. Thus, the intersection procedure downweighs the uninformative hypotheses where all words are tagged with one tag.
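The decomposition into same-tag sub-sequences can be sketched as follows (an illustration, not the metric's reference implementation):

    # Hypothetical sketch: split a tag sequence into maximal runs of identical
    # tags, keeping their positions (end index exclusive).
    from itertools import groupby

    def spans(tags):
        result, start = [], 0
        for tag, run in groupby(tags):
            length = len(list(run))
            result.append((tag, start, start + length))
            start += length
        return result

    print(spans(["OK", "BAD", "OK", "OK", "OK"]))
    # [('OK', 0, 1), ('BAD', 1, 2), ('OK', 2, 5)]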

In order to compute the sequence correlation we need to get the set of spans for each label in both the prediction and the reference, and then intersect them. The set of spans of each tag t in the string w is computed as follows:


English-Spanish
System ID                        weighted F1 (All)   F1 BAD ↑   F1 GOOD
• UAlacant/OnLine-SBI-Baseline   71.47               43.12      78.07
• HDCL/QUETCHPLUS                72.56               43.05      79.42
UAlacant/OnLine-SBI              69.54               41.51      76.06
SAU/KERC-CRF                     77.44               39.11      86.36
SAU/KERC-SLG-CRF                 77.40               38.91      86.35
SHEF2/W2V-BI-2000                65.37               38.43      71.63
SHEF2/W2V-BI-2000-SIM            65.27               38.40      71.52
SHEF1/QuEst++-AROW               62.07               38.36      67.58
UGENT/SCATE-HYBRID               74.28               36.72      83.02
DCU-SHEFF/BASE-NGRAM-2000        67.33               36.60      74.49
HDCL/QUETCH                      75.26               35.27      84.56
DCU-SHEFF/BASE-NGRAM-5000        75.09               34.53      84.53
SHEF1/QuEst++-PA                 26.25               34.30      24.38
UGENT/SCATE-MBL                  74.17               30.56      84.32
RTM-DCU/s5-RTM-GLMd              76.00               23.91      88.12
RTM-DCU/s4-RTM-GLMd              75.88               22.69      88.26
Baseline                         75.31               16.78      88.93

Table 12: Official results for the WMT15 quality estimation Task 2. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to approximate randomization tests with 95% confidence intervals. Submissions whose results are statistically different from others according to the same test are grouped by a horizontal line in the original report.

\[ S_t(w) = \{ w_{[b:e]} \},\quad \forall i \text{ s.t. } b \leq i \leq e : w_i = t \]

where $w_{[b:e]}$ is the substring $w_b, w_{b+1}, \ldots, w_{e-1}, w_e$. Then the intersection of spans for all labels is:

\[ \mathrm{Int}(y, \hat{y}) = \sum_{t \in \{0,1\}} \lambda_t \sum_{s_y \in S_t(y)} \sum_{s_{\hat{y}} \in S_t(\hat{y})} |s_y \cap s_{\hat{y}}| \]

Here $\lambda_t$ is the weight of tag $t$ in the overall result. It is inversely proportional to the number of instances of this tag in the reference:

\[ \lambda_t = \frac{|y|}{c_t(y)} \]

where $c_t(y)$ is the number of words labelled with the label $t$ in the reference. Thus we give equal importance to all tags.

The sum of matching spans is also weighted by the ratio of the number of spans in the hypothesis and the reference. This is done to downweigh the system tagging if the number of its spans differs from the number of spans provided in the gold standard. This ratio is computed as follows:

\[ r(y, \hat{y}) = \min\!\left( \frac{|S(y)|}{|S(\hat{y})|},\ \frac{|S(\hat{y})|}{|S(y)|} \right) \]

This ratio is 1 if the number of spans ($|S(\cdot)|$ above) is equal for the hypothesis and the reference, and less than 1 otherwise.

The final score for a sentence is then produced as follows:

\[ \mathrm{SeqCor}(y, \hat{y}) = \frac{r(y, \hat{y}) \cdot \mathrm{Int}(y, \hat{y})}{|y|} \qquad (1) \]

The overall sequence correlation for the whole dataset is then the average of the sentence scores.

Table 13 shows the results of the evaluation according to the sequence correlation metric. The results for the two metrics are quite different: one of the highest scoring submissions according to the F1-BAD score is only third under the sequence correlation metric, and vice versa: the submissions with the highest sequence correlation feature in 3rd place according to F1-BAD score. However, the system rankings produced by the two metrics are correlated: the Spearman's correlation coefficient between them is 0.65.


English-Spanish
System ID                        Sequence Correlation
• SAU/KERC-CRF                   34.22
• SAU/KERC-SLG-CRF               34.09
• UAlacant/OnLine-SBI-Baseline   33.84
UAlacant/OnLine-SBI              32.81
HDCL/QUETCH                      32.13
HDCL/QUETCHPLUS                  31.38
DCU-SHEFF/BASE-NGRAM-5000        31.23
UGENT/SCATE-HYBRID               30.15
DCU-SHEFF/BASE-NGRAM-2000        29.94
UGENT/SCATE-MBL                  28.43
SHEF2/W2V-BI-2000                27.65
SHEF2/W2V-BI-2000-SIM            27.61
SHEF1/QuEst++-AROW               27.36
RTM-DCU/s5-RTM-GLMd              25.92
SHEF1/QuEst++-PA                 25.49
RTM-DCU/s4-RTM-GLMd              24.95
Baseline                         0.2044

Table 13: Alternative results for the WMT15 quality estimation Task 2 according to the sequence correlation metric. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to approximate randomization tests with 95% confidence intervals. Submissions whose results are statistically different from others according to the same test are grouped by a horizontal line in the original report.

The sequence correlation metric gives preference to systems that use sequence labelling (modelling dependencies between the assigned tags). We consider this a desirable feature, as we are generally not interested in maximising the prediction accuracy for individual words, but in maximising the accuracy for word-level labelling in the context of the whole sentence. However, using the TER alignment to tag errors cannot capture “phrase-level errors”, and each token is considered independently when the dataset is built. This is a fundamental issue with the current definition of word-level quality estimation that we intend to address in future work.

Our intuition is that the sequence correlation metric should be closer to human perception of word-level QE than F1 scores. The goal of word-level QE is to identify incorrect segments of a sentence, and the sequence correlation metric evaluates how well the sentence is segmented into correct and incorrect phrases. A system can get a very high F1 score by (almost) randomly assigning a correct tag to a word, while giving very little information on correct and incorrect areas in the text. That was illustrated by the WMT14 word-level QE task results, where the baseline strategy that assigned the tag “BAD” to all words had a significantly higher F1 score than any of the submissions.

4.5 Task 3: Predicting document-level quality

Predicting the quality of units larger than sentences can be useful in many scenarios. For example, consider a user searching for information about a product on the web. The user can only find reviews in German but does not speak the language, so he/she uses an MT system to translate the reviews into English. In this case, predictions on the quality of individual sentences in a translated review are not as informative as predictions on the quality of the entire review.

With the goal of exploring quality estimation beyond the sentence level, this year we proposed a document-level task for the first time. Due to the lack of large datasets with machine-translated documents (by various MT systems), we consider short paragraphs as documents. The task consisted in scoring and ranking paragraphs according to their predicted quality.


Data The paragraphs were extracted from the WMT13 translation task test data (Bojar et al., 2013), using submissions from all participating MT systems. Source paragraphs were randomly chosen using the paragraph markup in the SGML files. For each source paragraph, a translation was taken from a different MT system, so as to select approximately the same number of instances from each MT system. We considered EN-DE and DE-EN as language pairs, extracting 1,215 paragraphs for each language pair. 800 paragraphs were used for training and 415 for test.

Since no human annotation exists for the quality of entire paragraphs (or documents), Meteor against reference translations was used as the quality label for this task. Meteor was calculated using its implementation within the Asiya toolkit, with the following settings: exact match, tokenised and case insensitive (Gimenez and Marquez, 2010).

Evaluation The evaluation of the paragraph-level task was the same as that for the sentence-level task. MAE and RMSE are reported as evaluation metrics for the scoring task, with MAE as the official metric for system ranking. For the ranking task, DeltaAvg and Spearman's ρ correlation are reported, with DeltaAvg as the official metric for system ranking. To evaluate the significance of the results, bootstrap resampling (1K times) with 95% confidence intervals was used. Pearson's r correlation scores with the Williams significance test are also reported.
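For illustration, the scoring metrics can be computed as below; DeltaAvg, the bootstrap significance tests and the Williams test are omitted here and follow the official task tooling. The predictions and gold labels are made-up values.

    import math
    from scipy.stats import pearsonr, spearmanr

    def mae(pred, gold):
        """Mean absolute error between predicted and gold quality scores."""
        return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

    def rmse(pred, gold):
        """Root mean squared error between predicted and gold quality scores."""
        return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

    # Hypothetical predicted and gold (Meteor) paragraph-level scores.
    pred = [0.31, 0.42, 0.25, 0.38, 0.29]
    gold = [0.35, 0.40, 0.20, 0.45, 0.31]

    rho, _ = spearmanr(pred, gold)
    r, _ = pearsonr(pred, gold)
    print(f"MAE x100  = {100 * mae(pred, gold):.2f}")   # reported x100 for readability
    print(f"RMSE x100 = {100 * rmse(pred, gold):.2f}")
    print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")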

Results Table 14 summarises the results of the ranking variant of Task 3.21 They are sorted from best to worst using the DeltaAvg metric scores as primary key and the Spearman's ρ rank correlation scores as secondary key. RTM-DCU submissions achieved the best scores: RTM-SVR was the winner for EN-DE, and RTM-FS-SVR for DE-EN. For EN-DE, the HIDDEN system did not show a significant difference against the baseline. For DE-EN, USHEF/QUEST-DISC-BO, USAAR-USHEF/BFF and HIDDEN were not significantly different from the baseline.

The results of the scoring variant are given in Table 15, sorted from best to worst using the MAE metric scores as primary key and the RMSE metric scores as secondary key. Again, the RTM-DCU submissions scored best for both language pairs. All systems were significantly better than the baseline. However, the difference between the baseline system and the submissions was much smaller in the scoring evaluation than in the ranking evaluation.

21 Results for MAE, RMSE and DeltaAvg are multiplied by 100 to improve readability.

Following the suggestion in Graham (2015), Table 16 shows an alternative ranking of systems considering Pearson's r correlation results. The alternative ranking differs from the official ranking in terms of MAE: for EN-DE, RTM-DCU/RTM-FS-SVR is no longer in the winning group, while for DE-EN, USHEF/QUEST-DISC-BO and USAAR-USHEF/BFF did not show a statistically significant difference against the baseline. However, as with Task 1, these results are the same as the official ones in terms of DeltaAvg.

4.6 Discussion

In what follows, we discuss the main findings of this year's shared task based on the goals we had previously identified for it.

Advances in sentence- and word-level QE

For sentence-level prediction, we used similar data and quality labels as in previous editions of the task: English-Spanish, news text domain and HTER labels to indicate post-editing effort. The main differences this year were: (i) the much larger size of the dataset, (ii) the way post-editing was performed, by a large number of crowdsourced translators, and (iii) the MT system used, an online statistical system. We will discuss items (i) and (ii) later in this section. Regarding (iii), the main implication of using an online system was that one could not have access to many of the resources commonly used to extract features, such as the SMT training data and lexical tables. As a consequence, surrogate resources were used for certain features, including many of the baseline ones, which made them less effective. To avoid relying on such resources, novel features were explored, for example those based on deep neural network architectures (word embeddings and continuous space language models by SHEFF-NN) and those based on pseudo-references (n-gram overlap and agreement features by LORIA).

While it is not possible to compare results directly with those published in previous years, for sentence level we can observe the following with respect to the corresponding task in WMT14 (Task 1.2):


System ID                    DeltaAvg ↑   Spearman's ρ ↑
English-German
• RTM-DCU/RTM-SVR            7.62         −0.62
RTM-DCU/RTM-FS-SVR           6.45         −0.67
USHEF/QUEST-DISC-REP         4.55         0.32
USAAR-USHEF/BFF              3.98         0.27
Baseline SVM                 1.60         0.14
HIDDEN                       1.04         0.05
German-English
• RTM-DCU/RTM-FS-SVR         4.93         −0.64
RTM-DCU/RTM-FS+PLS-SVR       4.23         −0.55
USHEF/QUEST-DISC-BO          1.55         0.19
Baseline SVM                 0.59         0.05
USAAR-USHEF/BFF              0.40         0.12
HIDDEN                       0.12         −0.03

Table 14: Official results for the ranking variant of the WMT15 quality estimation Task 3. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to bootstrap resampling (1K times) with 95% confidence intervals. The systems in the gray area are not different from the baseline system at a statistically significant level according to the same test.

System ID                    MAE ↓    RMSE ↓
English-German
• RTM-DCU/RTM-FS-SVR         7.28     11.96
• RTM-DCU/RTM-SVR            7.5      11.35
USAAR-USHEF/BFF              9.37     13.53
USHEF/QUEST-DISC-REP         9.55     13.46
Baseline SVM                 10.05    14.25
German-English
• RTM-DCU/RTM-FS-SVR         4.94     8.74
RTM-DCU/RTM-FS+PLS-SVR       5.78     10.70
USHEF/QUEST-DISC-BO          6.54     10.10
USAAR-USHEF/BFF              6.56     10.12
Baseline SVM                 7.35     11.40

Table 15: Official results for the scoring variant of the WMT15 quality estimation Task 3. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to bootstrap resampling (1K times) with 95% confidence intervals. The systems in the gray area are not different from the baseline system at a statistically significant level according to the same test.

• In terms of scoring, according to the primary metric, MAE, in WMT15 all systems except one were significantly better than the baseline. In both WMT14 and WMT15 only one system was significantly worse than the baseline. However, in WMT14 four others (out of nine) performed no differently than the baseline. This year, no system tied with the baseline: the remaining seven systems were significantly better than the baseline. One could say systems are consistently better this year. It is worth mentioning that the baseline remains the same, but as previously noted, the resources used to extract baseline features are likely to be less useful this year given the mismatch between the data used to produce them and the data used to build the online SMT system.

• In terms of ranking, in WMT14 one system was significantly worse than the baseline, and the four remaining systems were significantly better. This year, all eight submissions are significantly better than the baseline. This can once more be seen as progress from last year's results. These results, as well as the general ranking of systems, were also found when following Pearson's correlation as metric, as suggested by Graham (2015).


System ID                    Pearson's r ↑
English-German
• RTM-DCU/RTM-SVR            0.59
RTM-DCU/RTM-FS-SVR           0.53
USHEF/QUEST-DISC-REP         0.30
USAAR-USHEF/BFF              0.29
Baseline SVM                 0.12
German-English
• RTM-DCU/RTM-FS-SVR         0.52
RTM-DCU/RTM-FS+PLS-SVR       0.39
USHEF/QUEST-DISC-BO          0.10
USAAR-USHEF/BFF              0.08
Baseline SVM                 0.06

Table 16: Alternative results for the scoring variant of the WMT15 quality estimation Task 3. The winning submissions are indicated by a •. These are the top-scoring submission and those that are not significantly worse according to the Williams test with 95% confidence intervals. The systems in the gray area are not different from the baseline system at a statistically significant level according to the same test.


For the word-level task, a comparison with the corresponding WMT14 task is difficult to perform, as in WMT14 we did not have a meaningful baseline. The baseline used then for binary classification was to tag all words with the label "BAD". This baseline outperformed all the submissions in terms of F1 for the "BAD" class, but it cannot be considered an appropriate baseline strategy (see Section 4.4). This year the submissions were compared against the output of a real baseline system, and the set of baseline features was made available to participants. Although the baseline system itself performed worse than all the submitted systems, some other systems benefited from adding the baseline features to their feature sets (UAlacant, UGENT, HDCL).

Considering the feature sets and methods used, the number of participants in the WMT14 word-level task was too small to draw reliable conclusions: four systems for English-Spanish and one system for all other three language pairs. The larger number of submissions this year is already a positive result: 16 submissions from eight teams. Inspecting the systems submitted this year and last year, we can speculate about the most promising techniques. Last year's winning system used a neural network trained on pseudo-reference features (namely, features extracted from n-best lists) (Camargo de Souza et al., 2014). This year's winning systems are also based on pseudo-reference features (UAlacant) and deep neural network architectures (HDCL). Luong et al. (2013) had previously reported that pseudo-reference features improve the accuracy of word-level predictions. The two most recent editions of this shared task seem to indicate that the state of the art in word-level quality estimation relies upon such features, as well as on the ability to model the relationship between the source and target languages using large datasets.

Effectiveness of quality labels, features and learning methods for document-level QE

The paragraph-level prediction task received fewer submissions than the other two tasks: four submissions for the scoring variant and five for the ranking variant, for both language pairs. This is understandable, as it was the first time the task was run. Additionally, paragraph-level QE is still a fairly new task. However, we were able to draw some conclusions and learn valuable lessons for future research in the area.

By and large, most features are similar to those used for sentence-level prediction. Discourse-aware features showed only marginal improvements relative to the baseline system (USHEF systems for EN-DE and DE-EN). One possible reason for that is the way the training and test data sets were created, including paragraphs with only one sentence. Therefore, discourse features could not be fully explored, as they aim to model relationships and dependencies across sentences, as well as within sentences. In future, data will be selected more carefully in order to consider only paragraphs or documents with more sentences.


Systems applying feature selection techniques, such as USAAR-USHEF/BFF, did not obtain major improvements over the baseline. However, they provided interesting insights by finding a minimum set of baseline features that can be used to build models with the same performance as the entire baseline feature set. These are models with only three features, selected as the best combination by exhaustive search.

The winning submissions for both language pairs and variants, RTM-DCU, explored features based on source and target side information. These include distributional similarity, closeness of test instances to the training data, and indicators of translation quality. External data was used to select "interpretants", which contain data close to both training and test sets to provide context for similarity judgements.

In terms of quality labels, one problem observed in previous work on document-level QE (Scarton et al., 2015b) is the low variation of scores (in this case, Meteor) across instances of the dataset. Since the data collected for this task included translations from many different MT systems, this was not the case here. Table 17 shows the average and standard deviation (STDEV) values for the datasets (training and test sets together). Although the variation is substantial, the average value of the training set is a good predictor. In other words, if we consider the average of the training set scores as the prediction value for all data points in the test set, we obtain results as good as the baseline system: for our datasets, the MAE figure for EN-DE is 10 and for DE-EN 7, the same as the baseline system. We can only speculate that automatically assigned quality labels based on reference translations, such as Meteor, are not adequate for this task. Other automatic metrics tend to behave similarly to Meteor at the document level (Scarton et al., 2015b). Therefore, finding an adequate quality label for document-level QE remains an open issue. Having humans directly assign quality labels is much more complex than in the sentence- and word-level cases. Annotation of entire documents, or even paragraphs, becomes a harder, more subjective and much more costly task. For future editions of this task, we intend to collect datasets with human-targeted document-level labels obtained indirectly, e.g. through post-editing.
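The "predict the training mean" check described above takes only a few lines; a sketch, assuming train_labels and test_labels hold the Meteor scores of the training and test paragraphs:

    def mean_baseline_mae(train_labels, test_labels):
        """MAE obtained by predicting the training-set mean for every test paragraph.
        If this matches the trained baseline's MAE, the labels offer little learnable
        signal beyond their average value."""
        mean_prediction = sum(train_labels) / len(train_labels)
        return sum(abs(mean_prediction - y) for y in test_labels) / len(test_labels)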

No submission focused on exploring learning algorithms specifically targeted at document-level prediction.

             EN-DE              DE-EN
             AVG     STDEV      AVG     STDEV
Meteor (↑)   0.35    0.14       0.26    0.09

Table 17: Average metric scores for automatic metrics in the corpus for Task 3.


Differences between sentence-level and document-level QE

The differences between sentence- and document-level prediction have not been explored to a great extent. Apart from the discourse-aware features by USHEF, the baseline and other features explored by participating teams for document-level prediction were simple aggregations of sentence-level feature values.

Also, none of the submitted systems used sentence-level predictions as features for paragraph-level QE. Although this technique is possible in principle, its effectiveness has not yet been proved. Specia et al. (2015) report promising results when using word-level predictions for sentence-level QE, but inconclusive results when using sentence-level predictions for document-level QE. They considered BLEU, TER and Meteor as quality labels, all leading to similar findings. Once more, the use of inadequate quality labels for document-level prediction could have been the reason.

No submission evaluated different machine learning algorithms for this task. The same algorithms as those used for sentence-level prediction were applied by all participating teams.

Effect of training data sizes and quality for sentence- and word-level QE

As previously mentioned, the post-editions used for this year's sentence- and word-level tasks were obtained through a crowdsourcing platform where translators volunteered to post-edit machine translations. As such, one can expect that not all post-editions reach the highest standards of professional translation. Manual inspection of a small sample of the data, however, showed that the post-editions were of high quality, although stylistic differences are evident in some cases. This is likely due to the fact that different editors, with different styles and levels of expertise, worked on different segments. Another factor that may have influenced the quality of the post-editions is the fact that segments were fixed out of context.


For the word level, in particular, a potential issue is the fact that the labelling of the words was done completely automatically, using a tool for alignment based on minimum edit distance (TER).

On the positive side, this dataset is much larger than any we have used before for prediction at any level: nearly 12K segments for training/development, as opposed to a maximum of 2K in previous years. For sentence-level prediction we did not expect massive gains from larger datasets, as it has been shown that small amounts of data can be as effective or even more effective than the entire collection, if selected in a clever way (Beck et al., 2013a,b). However, it is well known that data sparsity is an issue for word-level prediction, so we expected a large dataset to improve results considerably for this task.

Unfortunately, having access to a large number of samples did not seem to bring much improvement in word-level prediction accuracy. The main reason for that was the fact that the number of erroneous words in the training data was too small compared to the number of correct words: 50% of the sentences had either zero incorrect words (15% of the sentences) or fewer than 15% incorrect words (35% of the sentences). Participants used various data manipulation strategies to improve results: filtering of the training data, as in the DCU-SHEFF systems; alternative labelling of the data which discriminates between the "OK" label at the beginning, middle, and end of a good segment; and insertion of additional incorrect words, as in the SAU-KERC submissions. Additionally, most participants in the word-level task leveraged additional data in some way, which points to the need for even larger but more varied post-edited datasets in order to make significant progress on this task.
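The imbalance figures quoted above can be reproduced directly from the tagged training data; a minimal sketch, assuming each sentence is given as a list of "OK"/"BAD" tags:

    def imbalance_profile(tagged_sentences):
        """Share of sentences with no BAD tags and with fewer than 15% BAD tags,
        plus the overall proportion of BAD tokens."""
        no_bad = few_bad = bad_tokens = total_tokens = 0
        for tags in tagged_sentences:
            n_bad = sum(1 for t in tags if t == "BAD")
            bad_tokens += n_bad
            total_tokens += len(tags)
            if n_bad == 0:
                no_bad += 1
            elif n_bad / len(tags) < 0.15:
                few_bad += 1
        n = len(tagged_sentences)
        return {"zero_bad": no_bad / n,
                "under_15pct_bad": few_bad / n,
                "bad_token_rate": bad_tokens / total_tokens}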

5 Automatic Post-editing Task

This year WMT hosted for the first time a shared task on automatic post-editing (APE) for machine translation. The task requires systems to automatically correct the errors present in a machine-translated text. As pointed out in Parton et al. (2012) and Chatterjee et al. (2015b), from the application point of view, APE components would make it possible to:

• Improve MT output by exploiting information unavailable to the decoder, or by performing deeper text analysis that is too expensive at the decoding stage;

• Cope with systematic errors of an MT system whose decoding process is not accessible;

• Provide professional translators with improved MT output quality to reduce (human) post-editing effort;

• Adapt the output of a general-purpose MT system to the lexicon/style requested in a specific application domain.

The first pilot round of the APE task focused on the challenges posed by the "black-box" scenario in which the MT system is unknown and cannot be modified. In this scenario, APE methods have to operate at the downstream level (that is, after MT decoding), by applying either rule-based techniques or statistical approaches that exploit knowledge acquired from human post-editions provided as training material. The objectives of this pilot were to: i) define a sound evaluation framework for the task, ii) identify and understand the most critical aspects in terms of data acquisition and system evaluation, iii) make an inventory of current approaches and evaluate the state of the art, and iv) provide a milestone for future studies on the problem.

5.1 Task description

Participants were provided with training and development data consisting of (source, target, human post-edition) triplets, and were asked to return automatic post-editions for a test set of unseen (source, target) pairs.

Data

Training, development and test data were created by randomly sampling from a collection of English-Spanish (source, target, human post-edition) triplets drawn from the news domain.22

Instances were sampled after applying a series of data cleaning steps aimed at removing duplicates as well as those triplets in which any of the elements (source, target, post-edition) was either too long or too short compared to the others, or included tags or other problematic symbols. The main reason for random sampling was to induce some homogeneity across the three datasets and, in turn, to increase the chances that correction patterns learned from the training set can also be applied to the test set.

22 The original triplets were provided by Unbabel (https://unbabel.com/).


The downside of losing the information yielded by text coherence (an aspect that some APE systems might take into consideration) has hence been accepted in exchange for higher error repetitiveness across the three datasets. Table 18 provides some basic statistics about the data.

The training and development sets consist of 11,272 and 1,000 instances, respectively. In each instance:

• The source (SRC) is a tokenized English sentence with a length of at least 4 tokens. This constraint on the source length was imposed in order to increase the chances of working with grammatically correct full sentences instead of phrases or short keyword lists;

• The target (TGT) is a tokenized Spanish translation of the source, produced by an unknown MT system;

• The human post-edition (PE) is a manually revised version of the target. PEs were collected by means of a crowdsourcing platform developed by the data provider.

The test data (1,817 instances) consists of (source, target) pairs with characteristics similar to those in the training set. Human post-editions of the test target instances were held out to measure system performance.
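As an illustration of the kind of cleaning filters described above (duplicate removal, minimum source length, length-ratio check), a sketch follows; the thresholds are illustrative assumptions, not the values actually used to build the dataset.

    def clean_triplets(triplets, min_src_len=4, max_len_ratio=3.0):
        """Keep (source, target, post-edit) triplets that pass simple sanity checks:
        no exact duplicates, source of at least `min_src_len` tokens, and no element
        more than `max_len_ratio` times longer than another."""
        seen, kept = set(), []
        for src, tgt, pe in triplets:
            key = (src, tgt, pe)
            if key in seen:
                continue
            seen.add(key)
            lengths = [len(s.split()) for s in (src, tgt, pe)]
            if lengths[0] < min_src_len:
                continue
            if max(lengths) > max_len_ratio * min(lengths):
                continue
            kept.append((src, tgt, pe))
        return kept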

The data creation procedure adopted, as well as the origin and the domain of the texts, pose specific challenges to the participating systems. As discussed in Section 5.4, the results of this pilot task can be partially explained in light of such challenges. This dataset, however, has three major advantages that made it suitable for the first APE pilot: i) it is relatively large (hence suitable for applying statistical methods), ii) it was not previously published (hence usable for a fair evaluation), and iii) it is freely available (hence easy to distribute and use for evaluation purposes).

Evaluation metric

System performance is evaluated by computing the distance between automatic and human post-editions of the machine-translated sentences present in the test set (i.e. for each of the 1,817 target test sentences). This distance is measured in terms of Translation Error Rate (TER) (Snover et al., 2006a), an evaluation metric commonly used in MT-related tasks (e.g. in quality estimation) to measure the minimum edit distance between an automatic translation and a reference translation.23 Systems are ranked based on the average TER calculated on the test set using the TERcom24 software: lower average TER scores correspond to higher ranks. Each run is evaluated in two modes, namely: i) case insensitive and ii) case sensitive. Evaluation scripts to compute TER scores in both modalities have been made available to participants through the APE task web page.25

Baseline

The official baseline is calculated by averaging the distances computed between the raw MT output and the human post-edits. In practice, the baseline APE system is a system that leaves all the test targets unmodified.26 Baseline results computed for both evaluation modalities (case sensitive/insensitive) are reported in Tables 20 and 21.
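The baseline is easy to reproduce given a sentence-level distance function. The sketch below uses plain word-level edit distance (insertions, deletions, substitutions) normalised by the reference length as a stand-in for TER; the official scores are computed with TERcom, which additionally handles block shifts.

    def edit_distance(hyp, ref):
        """Word-level Levenshtein distance (no shifts)."""
        d = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            prev, d[0] = d[0], i
            for j, r in enumerate(ref, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
        return d[-1]

    def ter_like(hyp, ref):
        """Approximate TER: number of edits divided by the reference length."""
        return edit_distance(hyp.split(), ref.split()) / max(1, len(ref.split()))

    def do_nothing_baseline(mt_outputs, post_edits):
        """Average distance between the unmodified MT output and its post-edition."""
        scores = [ter_like(mt, pe) for mt, pe in zip(mt_outputs, post_edits)]
        return 100 * sum(scores) / len(scores)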

Monolingual translation as another term of comparison. To get further insights into the progress with respect to previous APE methods, participants' results are also analysed with respect to another term of comparison: a re-implementation of the state-of-the-art approach first proposed by Simard et al. (2007).27 For this purpose, a phrase-based SMT system based on Moses (Koehn et al., 2007) is used. Translation and reordering models were estimated following the Moses protocol with the default setup, using MGIZA++ (Gao and Vogel, 2008) for word alignment. For language modelling we used the KenLM toolkit (Heafield, 2011) for standard n-gram modelling with an n-gram length of 5.

23 Edit distance is calculated as the number of edits (word insertions, deletions, substitutions, and shifts) divided by the number of words in the reference. Lower TER values indicate better MT quality.

24 http://www.cs.umd.edu/~snover/tercom/
25 http://www.statmt.org/wmt15/ape-task.html
26 In this case, since edit distance is computed between each machine-translated sentence and its human-revised version, the actual evaluation metric is the human-targeted TER (HTER). For the sake of clarity, since TER and HTER compute edit distance in the same way (the only difference is in the origin of the correct sentence used for comparison), henceforth we will use TER to refer to both metrics.

27 This is done based on the description provided in Simard et al. (2007). Our re-implementation, however, is not meant to officially represent that approach. Discrepancies with the actual method are indeed possible, due to our misinterpretation or to wrong guesses about details that are missing in the paper.


                 Tokens                        Types                      Lemmas
                 SRC       TGT       PE        SRC      TGT      PE       SRC      TGT     PE
Train (11,272)   238,335   257,643   257,879   23,608   25,121   27,101   13,701   7,624   7,689
Dev (1,000)      21,617    23,213    23,098    5,482    5,760    5,966    3,765    2,810   2,819
Test (1,817)     38,244    40,925    40,903    7,990    8,498    8,816    5,307    3,778   3,814

Table 18: Data statistics.

Finally, the APE system was tuned on the development set, optimizing TER with Minimum Error Rate Training (Och, 2003). The results of this additional term of comparison, computed for both evaluation modalities (case sensitive/insensitive), are also reported in Tables 20 and 21.

For each submitted run, the statistical significance of performance differences with respect to the baseline and the re-implementation of Simard et al. (2007) is calculated with the bootstrap test (Koehn, 2004).

5.2 Participants

Four teams participated in the APE pilot task, submitting a total of seven runs. Participants are listed in Table 19; a short description of their systems is provided in the following.

Abu-MaTran. The Abu-MaTran team submitted the output of two statistical post-editing (SPE) systems, both relying on the Moses phrase-based statistical machine translation toolkit (Koehn et al., 2007) and on sentence-level classifiers. The first element of the pipeline, the SPE system, is trained on the automatic translation of the News Commentary v8 corpus from English to Spanish, aligned with its reference. This translation is obtained with an out-of-the-box phrase-based SMT system trained on Europarl v7. Both the translation and post-editing systems use a 5-gram Spanish LM with modified Kneser-Ney smoothing, trained on News Crawl 2011 and 2012 with KenLM (Heafield, 2011). The second element of the pipeline is a binary classifier that selects the best translation between the given MT output and its automatic post-edition. Two different approaches are investigated: a regression model trained with a Support Vector Machine (SVM) with a radial basis function kernel over 180 hand-crafted features to estimate the sentence-level HTER score, and a Recurrent Neural Network (RNN) classifier using context word embeddings as input and classifying each word of a sentence as good or bad. An automatic translation to be post-edited is first decoded by the SPE system, then fed into one of the classifiers, identified as SVM180feat and RNN. The HTER estimator selects the translation with the lower score, while the binary word-level classifier selects the translation with fewer words labelled as bad. The official evaluation of the shared task shows an advantage of the RNN approach compared to the SVM.

FBK. The two runs submitted by FBK (Chatterjee et al., 2015a) are based on combining the statistical phrase-based post-editing approach proposed by Simard et al. (2007) and its most significant variant proposed by Bechara et al. (2011). The APE systems are built in an incremental manner: at each stage of the APE pipeline, the best configuration of a component is decided and then used in the next stage. The APE pipeline begins with the selection of the best language model from several language models trained on different types and quantities of data. The next stage addresses the possible data sparsity issues raised by the relatively small size of the training data. Indeed, an analysis of the original phrase table obtained from the training set revealed that a large part of its entries is composed of instances that occur only once in the training data. This has the obvious effect of collecting potentially unreliable "translation" (or, in the case of APE, correction) rules. The problem is exacerbated by the "context-aware" approach proposed by Bechara et al. (2011), which builds the phrase table by joining source and target tokens, thus breaking down the co-occurrence counts into smaller numbers. To cope with this problem, a novel feature (neg-impact) is designed to prune the phrase table by measuring the usefulness of each translation option: the higher the value of the neg-impact feature, the less useful the translation option. After pruning, the final stage of the APE pipeline tries to raise the capability of the decoder to select the correct translation rule through the introduction of new task-specific features integrated into the model.


ID            Participating team
Abu-MaTran    Abu-MaTran Project (Prompsit)
FBK           Fondazione Bruno Kessler, Italy (Chatterjee et al., 2015a)
LIMSI         Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France (Wisniewski et al., 2015)
USAAR-SAPE    Saarland University, Germany & Jadavpur University, India (Pal et al., 2015b)

Table 19: Participants in the WMT15 Automatic Post-editing pilot task.

These features measure the similarity and the reliability of the translation options and help to improve the precision of the resulting APE system.

LIMSI. For the first edition of the APE shared task, LIMSI submitted two systems (Wisniewski et al., 2015). The first one is based on the approach of Simard et al. (2007) and considers the APE task as monolingual translation between a translation hypothesis and its post-edition. This straightforward approach did not succeed in improving translation quality. The second submitted system implements a series of sieves, each applying a simple post-editing rule. The definition of these rules is based on an analysis of the most frequent error corrections and aims at: i) predicting word case; ii) predicting exclamation and interrogation marks; and iii) predicting verbal endings. Experiments with this approach show that this system also hurts translation quality. An in-depth analysis revealed that this negative result is mainly explained by two reasons: i) most of the post-editing operations are nearly unique, which makes it very difficult to generalize from a small amount of data; and ii) even when they are not, the high variability of post-editing, already pointed out by Wisniewski et al. (2013), results in predicting legitimate corrections that have not been made by the annotators, therefore preventing improvement over the baseline.
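The sieve architecture can be pictured as a chain of independent rule functions applied in sequence. The rules below are simplified placeholders (real case, punctuation and verb-ending sieves would need language-specific resources and would be derived from the error analysis described above).

    def fix_sentence_case(sentence):
        """Toy sieve: upper-case the first character of the sentence."""
        return sentence[:1].upper() + sentence[1:]

    def fix_final_punctuation(sentence):
        """Toy sieve: make sure the (tokenized) sentence ends with terminal punctuation."""
        if sentence.rstrip().endswith((".", "!", "?")):
            return sentence
        return sentence.rstrip() + " ."

    SIEVES = [fix_sentence_case, fix_final_punctuation]

    def post_edit(sentence, sieves=SIEVES):
        """Apply each post-editing sieve in sequence to the MT output."""
        for sieve in sieves:
            sentence = sieve(sentence)
        return sentence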

USAAR-SAPE. The USAAR-SAPE system (Pal et al., 2015b) is designed around three basic components: corpus preprocessing, hybrid word alignment, and a state-of-the-art phrase-based SMT system integrated with the hybrid word alignment. The preprocessing of the training corpus is carried out by stemming the Spanish MT output and the PE data using Freeling (Padró and Stanilovsky, 2012). The hybrid word alignment method combines different kinds of word alignment: GIZA++ word alignment with the grow-diag-final-and (GDFA) heuristic (Koehn, 2010), SymGiza++ (Junczys-Dowmunt and Szal, 2011), the Berkeley aligner (Liang et al., 2006), and edit distance-based aligners (Snover et al., 2006a; Lavie and Agarwal, 2007). These different word alignment tables (Pal et al., 2013) are combined by a mathematical union method. For the phrase-based SMT system, various maximum phrase lengths for the translation model and n-gram settings for the language model are explored. The best results in terms of BLEU (Papineni et al., 2002) score are achieved with a maximum phrase length of 7 for the translation model and a 5-gram language model.
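Combining several alignment tables "by a mathematical union method" amounts, in its simplest reading, to taking the union of the link sets produced by each aligner for every sentence pair; a sketch assuming alignments in the common Pharaoh "0-1 1-2 ..." format:

    def parse_links(line):
        """Parse one sentence's alignment in Pharaoh format, e.g. '0-0 1-2 2-1'."""
        return {tuple(map(int, link.split("-"))) for link in line.split()}

    def union_alignments(*aligner_outputs):
        """Per sentence, take the union of the alignment links proposed by each aligner."""
        merged = []
        for per_sentence in zip(*aligner_outputs):
            links = set()
            for line in per_sentence:
                links |= parse_links(line)
            merged.append(" ".join(f"{s}-{t}" for s, t in sorted(links)))
        return merged

    # Hypothetical usage with two aligners' outputs for two sentences:
    giza = ["0-0 1-1 2-3", "0-0 1-2"]
    berkeley = ["0-0 1-1 2-2", "0-1 1-2"]
    print(union_alignments(giza, berkeley))  # ['0-0 1-1 2-2 2-3', '0-0 0-1 1-2']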

5.3 Results

The official results achieved by the participating systems are reported in Tables 20 and 21. The seven submitted runs are sorted based on the average TER they achieve on the test data. Table 20 shows the results computed in case sensitive mode, while Table 21 provides the scores computed in case insensitive mode.

Both rankings reveal an unexpected outcome: none of the submitted runs was able to beat the baselines (i.e. average TER scores of 22.91 and 22.22 for the case sensitive and case insensitive modes, respectively). All differences with respect to such baselines, moreover, are statistically significant. In practice, this means that what the systems learned from the available data was not reliable enough to yield valid corrections of the test instances. A deeper discussion of the possible causes of this unexpected outcome is provided in Section 5.4.

Unsurprisingly, for all participants the case insensitive evaluation results are slightly better than the case sensitive ones. Although the two rankings are not identical, none of the systems was particularly penalized by the case sensitive evaluation. Indeed, individual differences between the two modes are always close to the same value (∼0.7 TER difference) measured for the two baselines.


ID                       Avg. TER
Baseline                 22.913
FBK Primary              23.228
LIMSI Primary            23.331
USAAR-SAPE               23.426
LIMSI Contrastive        23.573
Abu-MaTran Primary       23.639
FBK Contrastive          23.649
(Simard et al., 2007)    23.839
Abu-MaTran Contrastive   24.715

Table 20: Official results for the WMT15 Automatic Post-editing task – average TER (↓), case sensitive.

ID                       Avg. TER
Baseline                 22.221
LIMSI Primary            22.544
FBK Primary              22.551
USAAR-SAPE               22.710
Abu-MaTran Primary       22.769
LIMSI Contrastive        22.861
FBK Contrastive          22.949
(Simard et al., 2007)    23.130
Abu-MaTran Contrastive   23.705

Table 21: Official results for the WMT15 Automatic Post-editing task – average TER (↓), case insensitive.

In light of this, and considering the importance of case sensitive evaluation in some language settings (e.g. with German as target), future rounds of the task will likely prioritize this stricter evaluation mode.

Overall, the close results achieved by participants reflect the fact that, despite some small variations, all systems share the same underlying statistical approach of Simard et al. (2007). As anticipated in Section 5.1, in order to get a rough idea of the extent of the improvements over this state-of-the-art method, we replicated it and considered its results as another term of comparison in addition to the baselines. As shown in Tables 20 and 21, the performance achieved by our implementation of Simard et al. (2007) is 23.839 and 23.130 TER for the case sensitive and case insensitive evaluations, respectively. Compared to these scores, most of the submitted runs achieve better performance, with positive average TER differences that are always statistically significant. We interpret this as a good sign: despite the difficulty of the task, the novelties introduced by each system allowed them to make significant steps forward with respect to a prior reference technique.

5.4 Discussion

To better understand the results and gain useful insights about this pilot evaluation round, we perform two types of analysis. The first one is focused on the data and aims to understand the possible reasons for the difficulty of the task. In particular, by analysing the challenges posed by the origin and the domain of the text material used, we try to find indications for future rounds of the APE task. The second type of analysis focuses on the systems and their behaviour. Although they share the same underlying approach and achieve similar results, we aim to check whether interesting differences can be captured by a more fine-grained analysis that goes beyond rough TER measurements.

Data analysis

In this section we investigate the possible relation between participants' results and the nature of the data used in this pilot task (e.g. quantity, sparsity, domain and origin). For this purpose, we take advantage of a new dataset, the Autodesk Post-Editing Data corpus,28 which was publicly released after the organisation of the APE pilot task. Although it was not usable for this first round, its characteristics make it particularly suitable for our analysis purposes. In particular: i) the Autodesk data predominantly covers the domain of software user manuals (that is, a restricted domain compared to a general one like news), and ii) its post-edits come from professional translators (that is, at least in principle, a more reliable source of corrections compared to a crowdsourced workforce). To guarantee a fair comparison, English-Spanish (source, target, human post-edition) triplets drawn from the Autodesk corpus were split into training, development and test sets under the constraint that the total number of target words and the TER in each set should be similar to the APE task splits. In this setting, performance differences between systems trained on the two datasets will only depend on the different nature of the data (e.g. domain). Statistics of the training sets are reported in Table 22 (those concerning the APE task data are the same as in Table 18).

28 The corpus (https://autodesk.app.box.com/Autodesk-PostEditing) consists of parallel English source-MT/TM target segments post-edited into several languages (Chinese, Czech, French, German, Hungarian, Italian, Japanese, Korean, Polish, Brazilian Portuguese, Russian, Spanish), with between 30K and 410K segments per language.


                   APE Task   Autodesk
Tokens    SRC      238,335    220,671
          TGT      257,643    257,380
          PE       257,879    260,324
Types     SRC      23,608     11,858
          TGT      25,121     11,721
          PE       27,101     12,399
Lemmas    SRC      13,701     5,092
          TGT      7,624      3,186
          PE       7,689      3,334
RR        SRC      2.905      6.346
          TGT      3.312      8.390
          PE       3.085      8.482

Table 22: WMT APE Task and Autodesk training data statistics.


The impact of data sparsity. A key issue in most evaluation settings is the representativeness of the training data with respect to the test set used. In the case of the statistical approach at the core of all the APE task submissions, this issue is even more relevant given the limited amount of training data available. In the APE scenario, data representativeness relates to the extent to which the correction patterns learned from the training set can also be applied to the test set (as mentioned in Section 5.1, random sampling from an original data collection was applied in the data creation phase for this purpose). From this point of view, dealing with restricted domains such as software user manuals should be easier than working with news data. Indeed, restricted domains are more likely to feature smaller vocabularies, be more repetitive (or, in other terms, less sparse) and, in turn, determine a higher applicability of the learned error correction patterns.

To check the relation between task difficulty and data repetitiveness, we compared different monolingual indicators (i.e. number of types and lemmas, and repetition rate29 – RR) computed on the APE and the Autodesk source, target and post-edited sentences. Although both datasets have the same amount of target tokens, Table 22 shows that the APE training set has nearly double the number of types and lemmas compared to the Autodesk data, which indicates the presence of less repeated information. A similar conclusion can be drawn by observing that the Autodesk dataset has a repetition rate that is more than twice the value computed for the APE task data.

29 Repetition rate measures the repetitiveness inside a text by looking at the rate of non-singleton n-gram types (n = 1...4) and combining them using the geometric mean. A larger value means more repetition in the text. For more details see Cettolo et al. (2014).
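Following the definition in the footnote, repetition rate can be sketched as below. The published metric is computed over fixed-size sliding windows so that corpora of different sizes remain comparable; this simplified version runs over the whole token list at once.

    from collections import Counter

    def repetition_rate(tokens, max_n=4):
        """Rate of non-singleton n-gram types for n = 1..max_n, combined by geometric mean.
        Larger values mean a more repetitive text."""
        rates = []
        for n in range(1, max_n + 1):
            counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            non_singletons = sum(1 for c in counts.values() if c > 1)
            rates.append(non_singletons / max(1, len(counts)))
        product = 1.0
        for r in rates:
            product *= r
        return product ** (1.0 / max_n)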


This monolingual analysis does not provide any information about the level of repetitiveness of the correction patterns made by the post-editors, because it does not link the target and the post-edited sentences. To investigate this aspect, two instances of the re-implemented approach of Simard et al. (2007) introduced in Section 5.1 were trained on the APE and the Autodesk training sets, respectively. We consider the distribution of the frequency of the translation options in the phrase table a good indicator of the level of repetitiveness of the corrections in the data. For instance, a large number of translation options that appear just once or only a few times in the data indicates a higher level of sparseness. As expected due to the limited size of the training set, the vast majority of the translation options in both phrase tables are singletons, as shown in Table 23. Nevertheless, the Autodesk phrase table is more compact (731k versus 1,066k) and contains 10% fewer singletons than the APE task phrase table. This confirms that the APE task data is more sparse and suggests that it might be easier to learn more applicable correction patterns from the Autodesk domain-specific data.

To verify this last statement, the two APE systems were evaluated on their own test sets. As previously shown, the system trained on the APE task data is not able to improve over the performance achieved by a system that leaves all the test targets unmodified (see Table 20). On the contrary, starting from a baseline of 23.57, the system trained on the Autodesk data is able to reduce the TER by 3.55 points (20.02). Interestingly, the Autodesk APE system is able to correctly fix the target sentences and improve the TER by 1.43 points even with only 25% of the training data. These results confirm our intuitions about the usefulness of repetitive data and show that, at least in restricted-domain scenarios, automatic post-editing can be successfully used as an aid to improve the output of an MT system.

Professional vs. Crowdsourced post-editions

Differently from the Autodesk data, for which the post-editions were created by professional translators, the APE task data contains crowdsourced MT corrections collected from unknown (likely non-expert) translators.


                     Percentage of Phrase Pairs
Phrase Pair Count    APE 2015 Training    Autodesk
1                    95.2%                84.6%
2                    2.5%                 8.8%
3                    0.7%                 2.7%
4                    0.3%                 1.2%
5                    0.2%                 0.6%
6                    0.15%                0.4%
7                    0.10%                0.3%
8                    0.07%                0.2%
9                    0.06%                0.2%
10                   0.04%                0.1%
> 10                 0.3%                 0.9%
Total Entries        1,066,344            703,944

Table 23: Phrase pair count distribution in the two phrase tables built using the APE 2015 training data and the Autodesk dataset.

One risk, given the high variability of valid MT corrections, is that the crowdsourced workforce follows post-editing attitudes and criteria that differ from those of professional translators. Professionals tend to: i) maximize productivity by making only the necessary and sufficient corrections to improve translation quality, and ii) follow consistent translation criteria, especially for domain terminology. Such a tendency will likely result in coherent and minimally post-edited data from which learning and drawing statistics is easier. This is not guaranteed with crowdsourced workers, who do not have specific time or consistency constraints. This suggests that non-professional post-editions and the correction patterns learned from them will feature less coherence, higher noise and higher sparsity.

To assess the potential impact of these issues on data representativeness (and, in turn, on the task difficulty), we analyse a subset of the APE test instances (221 triples randomly sampled) in which the target sentences were post-edited by professional translators. The analysis focuses on TER scores computed between:

1. The target sentences and their crowdsourced post-editions (avg. TER = 26.02);

2. The target sentences and their professional post-editions (avg. TER = 23.85);

3. The crowdsourced post-editions and the professional ones, using the latter as references (avg. TER = 29.18).

The measured values indicate a tendency of non-professionals to correct more often, and differently, than the professional translators. Interestingly, and similarly to the findings of Potet et al. (2012), the crowdsourced post-editions feature a distance from the professional ones that is even higher than the distance between the original target sentences and the experts' corrections (29.18 vs. 23.85). If we consider the output of professional translators as a gold standard (made of coherent and minimally post-edited data), these figures suggest a higher difficulty in handling crowdsourced corrections.

Further insights can be drawn from the analysis of the word-level corrections produced by the two translator profiles. To this aim, word insertions, deletions, substitutions and phrase shifts are extracted using the TERcom software, similarly to Blain et al. (2012) and Wisniewski et al. (2013). For each error type, the ratio between the number of unique edit operations and the total number of operations performed is computed. This quantity provides us with a measure of the level of repetitiveness of the errors, with 100% indicating that all the error patterns are unique, and small values indicating that most of the errors are repeated. Our results show that non-experts generally have larger ratio values than the professional translators (insertion +6%, substitution +4%, deletion +4%). This seems to support our hypothesis that, independently of their quality, post-editions collected from non-experts are less coherent than those derived from professionals: it is unlikely that different crowdsourced workers will apply the same corrections in the same contexts. If this hypothesis holds, the difficulty of this APE pilot task can be partially ascribed to this unavoidable intrinsic property of crowdsourced data. This aspect, however, should be further investigated to draw definite conclusions.
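The per-type repetitiveness ratio used for this comparison can be sketched as follows, assuming the word-level edit operations have already been extracted from the TERcom output as (type, tokens...) tuples:

    from collections import defaultdict

    def edit_repetitiveness(edit_ops):
        """For each error type, the percentage of distinct edit operations among all
        operations of that type: 100% means every error pattern is unique, low values
        mean the same corrections recur."""
        by_type = defaultdict(list)
        for op in edit_ops:
            by_type[op[0]].append(op)
        return {t: 100.0 * len(set(ops)) / len(ops) for t, ops in by_type.items()}

    # Hypothetical extracted operations:
    ops = [("sub", "casa", "hogar"), ("sub", "casa", "hogar"), ("ins", "la"), ("del", "muy")]
    print(edit_repetitiveness(ops))  # {'sub': 50.0, 'ins': 100.0, 'del': 100.0}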

System/performance analysis

The TER results presented in Tables 20 and 21 show small differences between participants, but they do not shed light on the real behaviour of the systems. To this aim, in this section the submitted runs are analysed by taking into consideration the changes made by each system to the test instances (case sensitive evaluation mode). In particular, Table 24 provides the number of modified, improved and deteriorated sentences, together with the percentage of edit operations performed (insertions, deletions, substitutions, shifts).


                         Modified    Improved    Deteriorated    Edit operations
ID                       Sentences   Sentences   Sentences       Ins    Del    Sub    Shifts
FBK Primary              276         64          147             17.8   17.8   55.9   8.5
LIMSI Primary            339         75          217             19.4   16.8   55.2   8.6
USAAR-SAPE               422         53          229             17.6   17.4   56.7   8.4
LIMSI Contrastive        454         61          260             17.4   19.0   55.3   8.3
Abu-MaTran Primary       275         8           200             17.7   17.2   56.8   8.2
FBK Contrastive          422         52          254             18.4   17.0   56.2   8.4
Abu-MaTran Contrastive   602         14          451             17.8   16.4   57.7   8.0
(Simard et al., 2007)    488         55          298             18.3   17.0   56.4   8.3

Table 24: Number of test sentences modified, improved and deteriorated by each submitted run, together with the corresponding percentage of insertions, deletions, substitutions and shifts (case sensitive).

Looking at these numbers, the following conclusions can be drawn. Although it varies considerably between the submitted runs, the number of modified sentences is quite small. Moreover, a general trend can be observed: the best systems are the most conservative ones. This situation likely reflects the aforementioned data sparsity and coherence issues. Only a small fraction of the correction patterns found in the training set seems to be applicable to the test set, and the risk of performing corrections that are either wrong, redundant, or different from those in the reference post-editions is rather high.

From the system point of view, the context in which a learned correction pattern is applied is crucial. For instance, the same word substitution (e.g. "house" → "home") is not applicable in all contexts. While sometimes it is necessary (Example 1: "The house team won the match"), in some contexts it is optional (Example 2: "I was in my house") or wrong (Example 3: "He worked for a brokerage house"). Unfortunately, the unnecessary word replacement in Example 2 (human post-editors would likely leave it untouched) would increase the TER of the sentence exactly as the clearly wrong replacement in Example 3 does.

From the evaluation point of view, not penalising such correct but unnecessary corrections is also crucial. Similarly to MT, where a source sentence can have many valid translations, in the APE task a target sentence can have many valid post-editions. Indeed, nothing prevents some correct post-editions from being counted as "deteriorated" sentences in our evaluation simply because they differ from the human post-editions used as references. As in MT, this well-known variability problem might penalise good systems, thus calling for alternative evaluation criteria (e.g. based on multiple references or sensitive to paraphrase matches). Interestingly, for all the systems the number of modified sentences is higher than the sum of the improved and deteriorated ones. The difference is made up of modified sentences for which the corrections do not yield TER variations. This grey area makes the evaluation problem related to variability even more evident.

The analysis of the edit operations performed by each system is not particularly informative. Similarly to the overall performance results, the proportion of correction types they perform also reflects the adoption of the same underlying statistical approach. The distribution of the four types of edit operations is almost identical, with a predominance of lexical substitutions (55.7%-57.7%) and rather few phrasal shifts (8.0%-8.6%).

5.5 Lessons learned and outlook

The objectives of this pilot APE task were to: i) define a sound evaluation framework for future rounds, ii) identify and understand the most critical aspects in terms of data acquisition and system evaluation, iii) make an inventory of current approaches and evaluate the state of the art, and iv) provide a milestone for future studies on the problem. With respect to the first point, improving the evaluation is possible, but no major issues emerged that would require radical changes in future evaluation rounds. For instance, using multiple references or a metric sensitive to paraphrase matches to cope with variability in the post-editing would certainly help.

Concerning the most critical aspects of the evaluation, our analysis highlighted the strong dependence of system results on data repetitiveness/representativeness. This calls into question the actual usability of text material coming from general domains like news and, probably, of post-editions collected from crowdsourced workers (this aspect, however, should be further investigated to draw definite conclusions).


Nevertheless, it is worth noting that collecting a large, unpublished, public, domain-specific and professional-quality dataset is a goal that is hard to achieve and will always require compromise solutions.

Regarding the approaches proposed, this first experience was a conservative but, at the same time, promising first step. Although the participants addressed the task sharing the same statistical approach to APE, the slight variants they explored allowed them to outperform the widely used monolingual translation technique. Moreover, the analysis of the results also suggests a possible limitation of this state-of-the-art approach: by always applying all the applicable correction patterns, it runs the risk of deteriorating the very input translations that it was supposed to improve. This limitation, common to all the participating systems, is a clue to a major difference between the APE task and the MT framework. In MT the system must always process the entire source sentence by translating all of its words into the target language. In the APE scenario, instead, the system has another option for each word: keeping it untouched. A reasonable (and this year unbeaten) baseline is in fact a system that applies this conservative strategy to all the words. By raising this and other issues as promising research directions, attracting researchers' attention to a challenging application-oriented task, and establishing a sound evaluation framework to measure future advancements, this pilot has substantially achieved its goals, paving the way for future rounds of the APE evaluation exercise.

Acknowledgments

This work was supported in part by the MosesCore, QT21, EXPERT and CRACKER projects funded by the European Commission (7th Framework Programme and H2020).

We would also like to thank Unbabel for providing the data used in QE Tasks 1 and 2, and in the APE task.

References

Avramidis, E., Popovic, M., and Burchardt, A. (2015). DFKI's experimental hybrid MT system for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 66–73, Lisboa, Portugal. Association for Computational Linguistics.

Banerjee, S. and Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.

Bechara, H., Ma, Y., and van Genabith, J. (2011).Statistical Post-Editing for a Statistical MT Sys-tem. In Proceedings of the 13th Machine Trans-lation Summit, pages 308–315, Xiamen, China.

Beck, D., Shah, K., Cohn, T., and Specia, L.(2013a). SHEF-Lite: When less is more fortranslation quality estimation. In Proceedingsof the Eighth Workshop on Statistical MachineTranslation, pages 335–340, Sofia, Bulgaria.Association for Computational Linguistics.

Beck, D., Specia, L., and Cohn, T. (2013b). Re-ducing annotation effort for quality estimationvia active learning. In 51st Annual Meeting ofthe Association for Computational Linguistics:Short Papers, ACL, pages 543–548, Sofia, Bul-garia.

Bicici, E. (2013). Referential translation machinesfor quality estimation. In Proceedings of theEighth Workshop on Statistical Machine Trans-lation, Sofia, Bulgaria.

Bicici, E. and Way, A. (2014). Referential Transla-tion Machines for Predicting Translation Qual-ity. In Ninth Workshop on Statistical MachineTranslation, pages 313–321, Baltimore, Mary-land, USA.

Bicici, E., Liu, Q., and Way, A. (2015). ReferentialTranslation Machines for Predicting TranslationQuality and Related Statistics. In Proceedingsof the Tenth Workshop on Statistical MachineTranslation, pages 304–308, Lisboa, Portugal.Association for Computational Linguistics.

Blain, F., Schwenk, H., and Senellart, J. (2012).Incremental adaptation using translation infor-mation and post-editing analysis. In Interna-tional Workshop on Spoken Language Trans-lation (IWSLT), pages 234–241, Hong-Kong(China).

Bojar, O., Buck, C., Callison-Burch, C., Feder-mann, C., Haddow, B., Koehn, P., Monz, C.,Post, M., Soricut, R., and Specia, L. (2013).

36

Page 48 of 84

Page 49: Report on the First Quality Translation Shared TaskChris Hokamp (DCU), Amir Kamran (UvA), Matteo Negri (FBK), Khalil Sima’an (UvA), Lucia Specia (USFD), Miloš Stanojević (UvA),

Quality Translation 21D4.1: Report on the First Quality Translation Shared Task

Findings of the 2013 Workshop on StatisticalMachine Translation. In Proceedings of theEighth Workshop on Statistical Machine Trans-lation, pages 1–42, Sofia, Bulgaria. Associationfor Computational Linguistics.

Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., Soricut, R., Specia, L., and Tamchyna, A. (2014). Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.

Bojar, O. and Tamchyna, A. (2015). CUNI in WMT15: Chimera Strikes Again. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 79–83, Lisboa, Portugal. Association for Computational Linguistics.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2007). (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation (WMT07), Prague, Czech Republic.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2008). Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation (WMT08), Columbus, Ohio.

Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M., and Zaidan, O. F. (2010). Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation (WMT10), Uppsala, Sweden.

Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012). Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montreal, Canada. Association for Computational Linguistics.

Callison-Burch, C., Koehn, P., Monz, C., and Schroeder, J. (2009). Findings of the 2009 workshop on statistical machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation (WMT09), Athens, Greece.

Callison-Burch, C., Koehn, P., Monz, C., and Zaidan, O. (2011). Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland.

Camargo de Souza, J. G., Gonzalez-Rubio, J., Buck, C., Turchi, M., and Negri, M. (2014). FBK-UPV-UEDIN participation in the WMT14 quality estimation shared-task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 322–328, Baltimore, Maryland, USA. Association for Computational Linguistics.

Cap, F., Weller, M., Ramm, A., and Fraser, A. (2015). CimS – The CIS and IMS Joint Submission to WMT 2015 addressing morphological and syntactic differences in English to German SMT. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 84–91, Lisboa, Portugal. Association for Computational Linguistics.

Cettolo, M., Bertoldi, N., and Federico, M. (2014). The Repetition Rate of Text as a Predictor of the Effectiveness of Machine Translation Adaptation. In Proceedings of the 11th Biennial Conference of the Association for Machine Translation in the Americas (AMTA 2014), pages 166–179, Vancouver, BC, Canada.

Chatterjee, R., Turchi, M., and Negri, M. (2015a). The FBK Participation in the WMT15 Automatic Post-editing Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 210–215, Lisboa, Portugal. Association for Computational Linguistics.

Chatterjee, R., Weller, M., Negri, M., and Turchi, M. (2015b). Exploring the Planet of the APEs: a Comparative Study of State-of-the-art Methods for MT Automatic Post-Editing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China.

Cho, E., Ha, T.-L., Niehues, J., Herrmann, T., Mediani, M., Zhang, Y., and Waibel, A. (2015). The Karlsruhe Institute of Technology Translation Systems for the WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 92–97, Lisboa, Portugal. Association for Computational Linguistics.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Dusek, O., Gomes, L., Novak, M., Popel, M., and Rosa, R. (2015). New Language Pairs in TectoMT. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 98–104, Lisboa, Portugal. Association for Computational Linguistics.

Dyer, C., Lopez, A., Ganitkevitch, J., Weese, J., Ture, F., Blunsom, P., Setiawan, H., Eidelman, V., and Resnik, P. (2010). cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the Association for Computational Linguistics (ACL).

Espla-Gomis, M., Sanchez-Martinez, F., and Forcada, M. (2015a). UAlacant word-level machine translation quality estimation system at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 309–315, Lisboa, Portugal. Association for Computational Linguistics.

Espla-Gomis, M., Sanchez-Martinez, F., and Forcada, M. L. (2015b). Using on-line available sources of bilingual information for word-level machine translation quality estimation. In 18th Annual Conference of the European Association for Machine Translation, pages 19–26, Antalya, Turkey.

Federmann, C. (2012). Appraise: An Open-Source Toolkit for Manual Evaluation of Machine Translation Output. The Prague Bulletin of Mathematical Linguistics (PBML), 98:25–35.

Gao, Q. and Vogel, S. (2008). Parallel Implementations of Word Alignment Tool. In Proceedings of the ACL 2008 Software Engineering, Testing, and Quality Assurance Workshop, pages 49–57, Columbus, Ohio.

Gimenez, J. and Marquez, L. (2010). Asiya: An Open Toolkit for Automatic Machine Translation (Meta-)Evaluation. The Prague Bulletin of Mathematical Linguistics, 94:77–86.

Graham, Y. (2015). Improving Evaluation of Machine Translation Quality Estimation. In 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 1804–1813, Beijing, China.

Gronroos, S.-A., Virpioja, S., and Kurimo, M. (2015). Tuning Phrase-Based Segmented Translation for a Morphologically Complex Target Language. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 105–111, Lisboa, Portugal. Association for Computational Linguistics.

Gwinnup, J., Anderson, T., Erdmann, G., Young, K., May, C., Kazi, M., Salesky, E., and Thompson, B. (2015). The AFRL-MITLL WMT15 System: There's More than One Way to Decode It! In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 112–119, Lisboa, Portugal. Association for Computational Linguistics.

Ha, T.-L., Do, Q.-K., Cho, E., Niehues, J., Allauzen, A., Yvon, F., and Waibel, A. (2015). The KIT-LIMSI Translation System for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 120–125, Lisboa, Portugal. Association for Computational Linguistics.

Haddow, B., Huck, M., Birch, A., Bogoychev, N., and Koehn, P. (2015). The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 126–133, Lisboa, Portugal. Association for Computational Linguistics.

Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, United Kingdom. Association for Computational Linguistics.

Jean, S., Firat, O., Cho, K., Memisevic, R., and Bengio, Y. (2015). Montreal Neural Machine Translation Systems for WMT15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, Lisboa, Portugal. Association for Computational Linguistics.

Junczys-Dowmunt, M. and Szal, A. (2011). SyMGiza++: Symmetrized Word Alignment Models for Statistical Machine Translation. In SIIS, volume 7053 of Lecture Notes in Computer Science, pages 379–390. Springer.


Koehn, P. (2004). Statistical Significance Tests for Machine Translation Evaluation. In Lin, D. and Wu, D., editors, Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain.

Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit X.

Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In ACL 2007 Demonstrations, Prague, Czech Republic.

Koehn, P. and Monz, C. (2006). Manual and automatic evaluation of machine translation between European languages. In Proceedings of the NAACL 2006 Workshop on Statistical Machine Translation, New York, New York.

Kolachina, P. and Ranta, A. (2015). GF Wide-coverage English-Finnish MT system for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 141–144, Lisboa, Portugal. Association for Computational Linguistics.

Kreutzer, J., Schamoni, S., and Riezler, S. (2015). QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 316–322, Lisboa, Portugal. Association for Computational Linguistics.

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33:159–174.

Langlois, D. (2015). LORIA System for the WMT15 Quality Estimation Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 323–329, Lisboa, Portugal. Association for Computational Linguistics.

Lavie, A. and Agarwal, A. (2007). Meteor: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 228–231.

Liang, P., Taskar, B., and Klein, D. (2006). Alignment by Agreement. In HLT-NAACL, New York.

Logacheva, V., Hokamp, C., and Specia, L. (2015). Data enhancement and selection strategies for the word-level Quality Estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 330–335, Lisboa, Portugal. Association for Computational Linguistics.

Luong, N. Q., Besacier, L., and Lecouteux, B. (2014). LIG System for Word Level QE task at WMT14. In Ninth Workshop on Statistical Machine Translation, pages 335–341, Baltimore, Maryland, USA. Association for Computational Linguistics.

Luong, N. Q., Lecouteux, B., and Besacier, L. (2013). LIG system for WMT13 QE task: Investigating the usefulness of features in word confidence estimation for MT. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 384–389, Sofia, Bulgaria. Association for Computational Linguistics.

Marie, B., Allauzen, A., Burlot, F., Do, Q.-K., Ive, J., Knyazeva, E., Labeau, M., Lavergne, T., Loser, K., Pecheux, N., and Yvon, F. (2015). LIMSI@WMT'15: Translation Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 145–151, Lisboa, Portugal. Association for Computational Linguistics.

Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation. In ACL03, pages 160–167, Sapporo, Japan.

Padro, L. and Stanilovsky, E. (2012). FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.

Pal, S., Naskar, S., and Bandyopadhyay, S. (2013). A Hybrid Word Alignment Model for Phrase-Based Statistical Machine Translation. In Proceedings of the Second Workshop on Hybrid Approaches to Translation, pages 94–101, Sofia, Bulgaria.

Pal, S., Naskar, S., and van Genabith, J. (2015a). UdS-Sant: English–German Hybrid Machine Translation System. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 152–157, Lisboa, Portugal. Association for Computational Linguistics.

Pal, S., Vela, M., Naskar, S. K., and van Genabith, J. (2015b). USAAR-SAPE: An English–Spanish Statistical Automatic Post-Editing System. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 216–221, Lisboa, Portugal. Association for Computational Linguistics.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Morristown, NJ, USA.

Parton, K., Habash, N., McKeown, K., Iglesias, G., and de Gispert, A. (2012). Can Automatic Post-Editing Make MT More Meaningful? In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 111–118, Trento, Italy.

Peter, J.-T., Toutounchi, F., Wuebker, J., and Ney, H. (2015). The RWTH Aachen German-English Machine Translation System for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 158–163, Lisboa, Portugal. Association for Computational Linguistics.

Potet, M., Esperanca-Rodier, E., Besacier, L., and Blanchon, H. (2012). Collection of a large database of French-English SMT output corrections. In LREC, pages 4043–4048. European Language Resources Association (ELRA).

Quernheim, D. (2015). Exact Decoding with Multi Bottom-Up Tree Transducers. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 164–171, Lisboa, Portugal. Association for Computational Linguistics.

Raybaud, S., Langlois, D., and Smaili, K. (2011). "This sentence is wrong." Detecting errors in machine-translated sentences. Machine Translation, 25(1):1–34.

Rubino, R., Pirinen, T., Espla-Gomis, M., Ljubesic, N., Ortiz Rojas, S., Papavassiliou, V., Prokopidis, P., and Toral, A. (2015). Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 184–191, Lisboa, Portugal. Association for Computational Linguistics.

Scarton, C., Tan, L., and Specia, L. (2015a). USHEF and USAAR-USHEF participation in the WMT15 QE shared task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 336–341, Lisboa, Portugal. Association for Computational Linguistics.

Scarton, C., Zampieri, M., Vela, M., van Genabith, J., and Specia, L. (2015b). Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation. In The 18th Annual Conference of the European Association for Machine Translation, pages 121–128, Antalya, Turkey.

Schwartz, L., Bryce, B., Geigle, C., Massung, S., Liu, Y., Peng, H., Raja, V., Roy, S., and Upadhyay, S. (2015). The University of Illinois submission to the WMT 2015 Shared Translation Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 192–198, Lisboa, Portugal. Association for Computational Linguistics.

Shah, K., Logacheva, V., Paetzold, G., Blain, F., Beck, D., Bougares, F., and Specia, L. (2015). SHEF-NN: Translation Quality Estimation with Neural Networks. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 342–347, Lisboa, Portugal. Association for Computational Linguistics.

Shang, L., Cai, D., and Ji, D. (2015). Strategy-Based Technology for Estimating MT Quality. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 348–352, Lisboa, Portugal. Association for Computational Linguistics.

Simard, M., Goutte, C., and Isabelle, P. (2007). Statistical Phrase-Based Post-Editing. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT), pages 508–515, Rochester, New York.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006a). A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231, Cambridge, Massachusetts, USA.


Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006b). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-2006), Cambridge, Massachusetts.

Specia, L., Paetzold, G., and Scarton, C. (2015). Multi-level Translation Quality Prediction with QuEst++. In 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: System Demonstrations, pages 115–120, Beijing, China.

Specia, L., Shah, K., de Souza, J. G., and Cohn, T. (2013). QuEst – A translation quality estimation framework. In 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL-2013, pages 79–84, Sofia, Bulgaria.

Stanojevic, M., Kamran, A., and Bojar, O. (2015a). Results of the WMT15 Tuning Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 274–281, Lisboa, Portugal. Association for Computational Linguistics.

Stanojevic, M., Kamran, A., Koehn, P., and Bojar, O. (2015b). Results of the WMT15 Metrics Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273, Lisboa, Portugal. Association for Computational Linguistics.

Steele, D., Sim Smith, K., and Specia, L. (2015). Sheffield Systems for the Finnish-English WMT Translation Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 172–176, Lisboa, Portugal. Association for Computational Linguistics.

Tezcan, A., Hoste, V., Desmet, B., and Macken, L. (2015). UGENT-LT3 SCATE System for Machine Translation Quality Estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 353–360, Lisboa, Portugal. Association for Computational Linguistics.

Tiedemann, J., Ginter, F., and Kanerva, J. (2015). Morphological Segmentation and OPUS for Finnish-English Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 177–183, Lisboa, Portugal. Association for Computational Linguistics.

Williams, P., Sennrich, R., Nadejde, M., Huck, M., and Koehn, P. (2015). Edinburgh's Syntax-Based Systems at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 199–209, Lisboa, Portugal. Association for Computational Linguistics.

Wisniewski, G., Pecheux, N., and Yvon, F. (2015). Why Predicting Post-Edition is so Hard? Failure Analysis of LIMSI Submission to the APE Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 222–227, Lisboa, Portugal. Association for Computational Linguistics.

Wisniewski, G., Singh, A. K., Segal, N., and Yvon, F. (2013). Design and Analysis of a Large Corpus of Post-edited Translations: Quality Estimation, Failure Analysis and the Variability of Post-edition. Machine Translation Summit, 14:117–124.


A Pairwise System Comparisons by Human Judges

Tables 25–34 show pairwise comparisons between systems for each language pair. The numbers in each of the tables' cells indicate the percentage of times that the system in that column was judged to be better than the system in that row, ignoring ties. Bolding indicates the winner of the two systems.

Because there were so many systems and data conditions, the significance of each pairwise comparison needs to be quantified. We applied the Sign Test to measure which comparisons indicate genuine differences (rather than differences that are attributable to chance). In the following tables, ? indicates statistical significance at p ≤ 0.10, † indicates statistical significance at p ≤ 0.05, and ‡ indicates statistical significance at p ≤ 0.01, according to the Sign Test.

Each table contains final rows showing how likely a system would win when paired against a randomly selected system (the expected win ratio score) and the rank range according to bootstrap resampling (p ≤ 0.05). Gray lines separate clusters based on non-overlapping rank ranges.

[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System             score   rank
ONLINE-B            .61    1
UEDIN-JHU           .57    2
UEDIN-SYNTAX        .53    3-4
MONTREAL            .51    3-4
ONLINE-A            .43    5
CU-TECTO           -.12    6
TT-BLEU-MIRA-D     -.18    7-9
TT-ILLC-UVA        -.18    7-10
TT-BLEU-MERT       -.19    7-11
TT-AFRL            -.21    8-11
TT-USAAR-TUNA      -.22    9-11
TT-DCU             -.26    12-13
TT-METEOR-CMU      -.29    13-15
TT-BLEU-MIRA-SP    -.32    13-15
TT-HKUST-MEANT     -.32    13-15
ILLINOIS           -.35    15-16

Table 25: Head to head comparison, ignoring ties, for Czech-English systems
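The Sign Test marking used in Tables 25–34 can be illustrated with a small sketch. The win counts and helper names below are hypothetical, not part of the official evaluation scripts; the test itself is the standard two-sided sign test (a binomial test with p = 0.5 on the non-tied judgements).

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test: probability of a split at least this extreme,
    assuming both systems are equally likely to win each non-tied comparison."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def significance_marker(p):
    # same thresholds as used in Tables 25-34
    if p <= 0.01:
        return "‡"
    if p <= 0.05:
        return "†"
    if p <= 0.10:
        return "?"
    return ""

# Hypothetical head-to-head counts for one system pair (ties already removed)
p = sign_test_p(wins_a=380, wins_b=310)
print(round(p, 4), significance_marker(p))
```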


[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System             score   rank
CU-CHIMERA          .68    1
ONLINE-B            .51    2-3
UEDIN-JHU           .50    2-3
MONTREAL            .46    4
ONLINE-A            .42    5
UEDIN-SYNTAX        .26    6
CU-TECTO            .20    7
COMMERCIAL1         .11    8
TT-DCU             -.34    9-11
TT-AFRL            -.34    9-11
TT-BLEU-MIRA-D     -.34    9-11
TT-USAAR-TUNA      -.37    12
TT-BLEU-MERT       -.40    13
TT-METEOR-CMU      -.56    14
TT-BLEU-MIRA-SP    -.80    15

Table 26: Head to head comparison, ignoring ties, for English-Czech systems

[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System             score   rank
ONLINE-B            .56    1
UEDIN-JHU           .31    2-3
ONLINE-A            .29    2-4
UEDIN-SYNTAX        .25    3-5
KIT                 .22    4-5
RWTH                .14    6-7
MONTREAL            .09    6-7
ILLINOIS           -.17    8-10
DFKI               -.17    8-10
ONLINE-C           -.22    9-10
ONLINE-F           -.30    11
MACAU              -.48    12-13
ONLINE-E           -.54    12-13

Table 27: Head to head comparison, ignoring ties, for German-English systems


[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System             score   rank
UEDIN-SYNTAX        .35    1-2
MONTREAL            .33    1-2
PROMT-RULE          .26    3-4
ONLINE-A            .23    3-4
ONLINE-B            .14    5
KIT-LIMSI           .08    6
UEDIN-JHU           .03    7-9
ONLINE-F            .00    7-11
ONLINE-C           -.00    7-11
KIT                -.01    8-11
CIMS               -.03    9-11
DFKI               -.13    12-13
ONLINE-E           -.13    12-13
UDS-SANT           -.23    14
ILLINOIS           -.40    15
IMS                -.50    16

Table 28: Head to head comparison, ignoring ties, for English-German systems

                ONLINE-B  LIMSI-CNRS  UEDIN-JHU  MACAU  ONLINE-A  ONLINE-F  ONLINE-E
ONLINE-B        –         .50         .49        .47†   .44‡      .35‡      .22‡
LIMSI-CNRS      .50       –           .49        .46‡   .45‡      .37‡      .25‡
UEDIN-JHU       .51       .51         –          .47†   .46†      .35‡      .26‡
MACAU           .53†      .54‡        .53†       –      .48       .39‡      .28‡
ONLINE-A        .56‡      .55‡        .54†       .52    –         .38‡      .26‡
ONLINE-F        .65‡      .63‡        .65‡       .61‡   .62‡      –         .37‡
ONLINE-E        .78‡      .75‡        .74‡       .72‡   .74‡      .63‡      –
score           .49       .44         .41        .27    .22       -.42      -1.43
rank            1-2       1-3         1-3        4-5    4-5       6         7

Table 29: Head to head comparison, ignoring ties, for French-English systems

                LIMSI-CNRS  ONLINE-A  UEDIN-JHU  ONLINE-B  CIMS   ONLINE-F  ONLINE-E
LIMSI-CNRS      –           .45‡      .44‡       .45‡      .38‡   .36‡      .28‡
ONLINE-A        .55‡        –         .49        .48?      .45‡   .37‡      .32‡
UEDIN-JHU       .56‡        .51       –          .48?      .44‡   .41‡      .31‡
ONLINE-B        .55‡        .52?      .52?       –         .46‡   .40‡      .31‡
CIMS            .62‡        .55‡      .56‡       .54‡      –      .45‡      .36‡
ONLINE-F        .64‡        .63‡      .59‡       .60‡      .55‡   –         .41‡
ONLINE-E        .72‡        .68‡      .69‡       .69‡      .64‡   .59‡      –
score           .54         .30       .25        .21       -.00   -.33      -.97
rank            1           2-3       2-4        3-4       5      6         7

Table 30: Head to head comparison, ignoring ties, for English-French systems


[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System             score   rank
ONLINE-B            .67    1
PROMT-SMT           .28    2-4
ONLINE-A            .24    2-5
UU-UNC              .23    2-5
UEDIN-JHU           .18    4-7
ABUMATRAN-COMB      .16    5-7
UEDIN-SYNTAX        .14    5-8
ILLINOIS            .08    7-8
ABUMATRAN-HFS      -.08    9
MONTREAL           -.17    10
ABUMATRAN          -.27    11
LIMSI              -.43    12-13
SHEFFIELD          -.51    13-14
SHEFF-STEM         -.52    13-14

Table 31: Head to head comparison, ignoring ties, for Finnish-English systems


[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System                score   rank
ONLINE-B              1.06    1
ONLINE-A               .54    2
UU-UNC                 .21    3
ABUMATRAN-UNC-COM      .04    4
ABUMATRAN-COMB        -.05    5
AALTO                 -.14    6-7
UEDIN-SYNTAX          -.18    6-8
ABUMATRAN-UNC         -.21    6-8
CMU                   -.34    9
CHALMERS              -.92    10

Table 32: Head to head comparison, ignoring ties, for English-Finnish systems


[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System             score   rank
ONLINE-G            .49    1
ONLINE-B            .31    2
PROMT-RULE          .12    3-6
AFRL-MIT-PB         .11    3-6
AFRL-MIT-FAC        .11    3-6
ONLINE-A            .10    3-7
AFRL-MIT-H          .05    6-8
LIMSI-NCODE         .01    7-10
UEDIN-SYNTAX       -.02    8-10
UEDIN-JHU          -.03    8-10
USAAR-GACHA        -.21    11
USAAR-GACHA        -.27    12
ONLINE-F           -.78    13

Table 33: Head to head comparison, ignoring ties, for Russian-English systems

[Pairwise win percentages not reproduced; each system is listed with its expected win ratio score and bootstrap rank range.]

System             score   rank
PROMT-RULE          1.01   1
ONLINE-G             .52   2
ONLINE-B             .21   3
LIMSI-NCODE          .12   4-5
ONLINE-A             .07   4-5
UEDIN-JHU            .01   6
UEDIN-SYNTAX        -.13   7
USAAR-GACHA         -.27   8
USAAR-GACHA         -.33   9
ONLINE-F           -1.21   10

Table 34: Head to head comparison, ignoring ties, for English-Russian systems


B Results of the WMT15 Metrics Shared Task

Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273, Lisboa, Portugal, 17–18 September 2015. © 2015 Association for Computational Linguistics.

Results of the WMT15 Metrics Shared Task

Milos Stanojevic and Amir Kamran
University of Amsterdam, ILLC
{m.stanojevic,a.kamran}@uva.nl

Philipp Koehn
Johns Hopkins University
[email protected]

Ondrej Bojar
Charles University in Prague, MFF
[email protected]

Abstract

This paper presents the results of the WMT15 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11 research groups. In addition to that, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system level correlation (how well each metric's scores correlate with the WMT15 official manual ranking of systems) and in terms of segment level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence).

1 Introduction

Automatic machine translation metrics play a very important role in the development of MT systems and their evaluation. There are many different metrics of diverse nature and one would like to assess their quality. For this reason, the Metrics Shared Task is held annually at the Workshop on Statistical Machine Translation,1 starting with Koehn and Monz (2006) and following up to Machacek and Bojar (2014).

The systems' outputs, human judgements and evaluated metrics are described in Section 2. The quality of the metrics in terms of system level correlation is reported in Section 3. Section 4 is devoted to segment level correlation.

2 Data

We used the translations of the MT systems involved in the WMT15 Shared Translation Task (Bojar et al., 2015) together with reference translations as the test set for the Metrics Task. This dataset consists of 87 systems' outputs and 10 reference translations in 10 translation directions (English from and into Czech, Finnish, French, German and Russian). The number of sentences in system and reference translations varies among language pairs, ranging from 1370 for Finnish-English to 2818 for Russian-English. For more details, please see the WMT15 overview paper (Bojar et al., 2015).

1 http://www.statmt.org/wmt15

2.1 Manual MT Quality Judgements

During the WMT15 Translation Task, a large scale manual annotation was conducted to compare the translation quality of participating systems. We used these collected human judgements for the evaluation of the automatic metrics.

The participants in the manual annotation were asked to evaluate system outputs by ranking translated sentences relative to each other. For each source segment that was included in the procedure, the annotator was shown five different outputs to which he or she was supposed to assign ranks. Ties were allowed.

These collected rank labels for each five-tuple of outputs were then interpreted as pairwise comparisons of systems and used to assign each system a score that reflects how high that system was usually ranked by the annotators. Several methods have been tested in the past for the exact score calculation and WMT15 has adopted TrueSkill as the official one. Please see the WMT15 overview paper for details on how this score is computed.
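The conversion from ranked five-tuples to pairwise comparisons can be sketched as follows. This is only an illustration of the idea described above, not the official WMT evaluation code; the system names and ranks are hypothetical.

```python
from itertools import combinations

def pairwise_comparisons(ranking):
    """Expand one annotated 5-tuple (system -> rank, lower is better)
    into (winner, loser) pairs; tied systems contribute no comparison."""
    wins = []
    for sys_a, sys_b in combinations(ranking, 2):
        if ranking[sys_a] < ranking[sys_b]:
            wins.append((sys_a, sys_b))
        elif ranking[sys_b] < ranking[sys_a]:
            wins.append((sys_b, sys_a))
    return wins

# Hypothetical annotation of five outputs for one source segment (ties allowed)
print(pairwise_comparisons({"sysA": 1, "sysB": 2, "sysC": 2, "sysD": 4, "sysE": 5}))
```

Such pairwise wins and losses are what the score-calculation methods, including TrueSkill, aggregate into a single score per system.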

For the metrics task in 2014, we were still using the "Pre-TrueSkill" method called "> Others", see Bojar et al. (2011). Since we are now moving to the golden truth calculated by TrueSkill, we also report the average "Pre-TrueSkill" score in the relevant tables for comparison.


Metric                                    Participant
BEER, BEER TREEPEL                        ILLC – University of Amsterdam (Stanojevic and Sima'an, 2015)
BS                                        University of Zurich (Mark Fishel; no corresponding paper)
CHRF, CHRF3                               DFKI (Popovic, 2015)
DPMF, DPMFCOMB                            Chinese Academy of Sciences and Dublin City University (Yu et al., 2015)
DREEM                                     National Research Council Canada (Chen et al., 2015)
LEBLEU-DEFAULT, LEBLEU-OPTIMIZED          Lingsoft and Aalto University (Virpioja and Gronroos, 2015)
METEOR-WSD, RATATOUILLE                   LIMSI-CNRS (Marie and Apidianaki, 2015)
UOW-LSTM                                  University of Wolverhampton (Gupta et al., 2015a)
UPF-COBALT                                Universitat Pompeu Fabra (Fomicheva et al., 2015)
USAAR-ZWICKEL-*                           Saarland University (Vela and Tan, 2015)
VERTA-W, VERTA-EQ, VERTA-70ADEQ30FLU      University of Barcelona (Comelles and Atserias, 2015)

Table 1: Participants of WMT15 Metrics Shared Task

2.2 Participants of the Metrics Shared Task

Table 1 lists the participants of the WMT15 Shared Metrics Task, along with their metrics. We have collected 46 metrics from a total of 11 research groups.

Here we give a short description of each metric that performed the best on at least one language pair.

2.2.1 BEER and BEER TREEPEL

BEER is a trained metric, a linear model that combines features capturing character n-grams and permutation trees. BEER participated last year in sentence-level evaluation. The main additions this year are corpus-level aggregation of sentence-level scores and a syntactic version called BEER TREEPEL. BEER TREEPEL includes features checking the match of each type of arc in the dependency trees of the hypothesis and the reference.

BEER was the best for en-de and en-ru at the system level and for en-fi and en-ru at the sentence level. BEER TREEPEL was the best for system-level evaluation of ru-en.

2.2.2 BS

The metric BS has no corresponding paper, so we include a summary by Mark Fishel here: The BS metric was an attempt to move in a different direction from most state-of-the-art metrics and to reduce complexity and language-resource dependence to the minimum. The score is obtained from the number and lengths of "bad segments": continuous subsequences of words that are present only in the hypothesis or the reference, but not both. To account for morphologically complex languages and to smooth the score for sparse word forms, a poor man's lemmatization is added: the floor of one third of each word's characters is removed from the word's end. The final score is either the log-sum of the bad segment lengths (BS) or a simple sum (TOTAL-BS).

BS and DPMF were the best for system-level English-French evaluation.
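The bad-segment idea can be sketched roughly as follows. The word-trimming rule follows the description above; the use of difflib for the common-subsequence alignment and the exact form of the log-sum are assumptions of this illustration, not necessarily Fishel's implementation.

```python
import math
from difflib import SequenceMatcher

def trim(word):
    # "poor man's lemmatization": drop the floor of one third of the characters
    return word[:len(word) - len(word) // 3]

def bad_segment_lengths(hypothesis, reference):
    hyp = [trim(w) for w in hypothesis.split()]
    ref = [trim(w) for w in reference.split()]
    lengths = []
    for op, h1, h2, r1, r2 in SequenceMatcher(a=hyp, b=ref).get_opcodes():
        if op != "equal":            # words present in only one of the two sides
            if h2 > h1:
                lengths.append(h2 - h1)
            if r2 > r1:
                lengths.append(r2 - r1)
    return lengths

def bs_score(hypothesis, reference):
    # one reading of the "log-sum" of bad-segment lengths (lower is better);
    # summing the raw lengths instead gives the TOTAL-BS variant
    return sum(math.log(1 + n) for n in bad_segment_lengths(hypothesis, reference))

print(bs_score("the cat sat on a mat", "the cat is sitting on the mat"))
```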

2.2.3 CHRF3

CHRF3 calculates a simple F-score combination of the precision and recall of character n-grams of length 6. The F-score is calculated with β = 3, giving triple the weight to recall.

CHRF3 was the best for en-fi and en-cs at the system level and en-cs at the sentence level.
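A simplified sketch of the character n-gram F-score behind CHRF3 is given below. The official chrF implementation averages over n-gram orders up to 6; this illustration keeps a single order for brevity, and the input strings are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    chars = text.replace(" ", "")   # spaces are ignored
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, n=6, beta=3.0):
    """F-beta over character n-gram precision and recall (single order n).
    beta=3 gives triple the weight to recall, as in CHRF3."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())       # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(chrf("the cat is on the mat", "there is a cat on the mat"))
```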

2.2.4 DPMF and DPMFCOMB

DPMF is a syntax-based metric but, unlike many syntax-based metrics, it does not compute the score on substructures of the tree returned by a syntactic parser. Instead, DPMF parses the reference translation with a standard parser and trains a new parser on the tree of the reference translation. This new parser is then used for scoring the hypothesis. Additionally, DPMF uses the F-score of unigrams in combination with the syntactic score.

DPMFCOMB is a combination of DPMF with several other metrics available in the evaluation tool Asiya.2

DPMF and BS were the best for system-level evaluation of English-French. DPMF also tied for the best place with UOW-LSTM for French-English. DPMFCOMB was the best for fi-en, de-en and cs-en at the sentence level.

2 http://asiya-faust.cs.upc.edu/

2.2.5 DREEM

DREEM uses distributed word and sentence representations of three different kinds: one-hot representation, a distributed representation learned with a neural network and a distributed sentence representation learned with a recursive autoencoder. The final score is the cosine similarity of the representation of the hypothesis and the reference, multiplied with a length penalty.

DREEM was the best for fi-en system-level evaluation.

2.2.6 LEBLEU-OPTIMIZED

LEBLEU is a relaxation of the strict word n-gram matching that is used in standard BLEU. Unlike other similar relaxations, LEBLEU uses fuzzy matching of longer chunks of text that allows, for example, matching two independent words with a compound. LEBLEU-OPTIMIZED applies a fuzzy match threshold and n-gram length optimized for each language pair.

LEBLEU-OPTIMIZED was the best for en-de at the sentence level.

2.2.7 RATATOUILLE

RATATOUILLE is a metric combination of BLEU, BEER, Meteor and a few more metrics, out of which METEOR-WSD is a novel contribution. METEOR-WSD is an extension of Meteor that includes synonym mappings to languages other than English, based on alignments, and rewards semantically adequate translations in context.

RATATOUILLE was the best for sentence-level evaluation of the French-English pair in both directions.

2.2.8 UOW-LSTM

UOW-LSTM uses a dependency-tree recursive neural network to represent both the hypothesis and the reference with a dense vector. The final score is obtained from a neural network trained on judgements from previous years converted to similarity scores, taking into account both the distance and the angle of the two representations.

UOW-LSTM tied for the best place in fr-en system-level evaluation with DPMF.

2.2.9 UPF-COBALT

UPF-COBALT pays increased attention to syntactic context (for example arguments, complements, modifiers, etc.), both in aligning the words of the hypothesis and reference and in scoring the matched words. It relies on additional resources including stemmers, WordNet synsets, paraphrase databases and distributed word representations. The UPF-COBALT system-level score was calculated by taking the ratio of sentences in which each system from a set of competitors was assigned the highest sentence-level score.

UPF-COBALT was the best in system-level evaluation for de-en and, together with VERTA-70ADEQ30FLU, for cs-en.

2.2.10 VERTA-70ADEQ30FLU

VERTA-70ADEQ30FLU aims at the combination of adequacy and fluency features that use many sources of different linguistic information: synonyms, lemmas, PoS tags, dependency parses and language models. In previous work, VERTA's linguistic feature combinations were set depending on whether adequacy or fluency was evaluated. VERTA-70ADEQ30FLU is a weighted combination of the VERTA setups for adequacy (0.70) and fluency (0.30).

VERTA-70ADEQ30FLU was, together with UPF-COBALT, the best on cs-en at the system level.

2.2.11 Baseline Metrics

In addition to the submitted metrics, we have computed the following two groups of standard metrics as baselines for the system level:

• Mteval. The metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were computed using the script mteval-v13a.pl,3 which is used in the OpenMT Evaluation Campaign and includes its own tokenization. We ran mteval with the flag --international-tokenization since it performs slightly better (Machacek and Bojar, 2013).

• Moses Scorer. The metrics TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006) were computed using the Moses scorer, which is used in Moses model optimization. To tokenize the sentences, we used the standard tokenizer script as available in the Moses toolkit.

For the segment-level baseline, we have used the following modified version of BLEU:

• SentBLEU. The metric SentBLEU is computed using the script sentence-bleu, part of the Moses toolkit. It is a smoothed version of BLEU that correlates better with human judgements at the segment level.
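As a rough illustration of what a smoothed sentence-level BLEU looks like with a single reference: this is not the Moses sentence-bleu code, and the add-one smoothing and brevity-penalty details below are assumptions of the sketch rather than the exact scheme used there.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothed n-gram precisions (one reference)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        log_precision += math.log((overlap + 1.0) / (sum(h.values()) + 1.0))
    brevity = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(log_precision / max_n)

print(smoothed_sentence_bleu("the cat is on the mat", "there is a cat on the mat"))
```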

3 http://www.itl.nist.gov/iad/mig/tools/


We have normalized all metrics' scores such that better translations get higher scores.

For computing the scores, we used the same script as in last year's Metrics Task.

3 System-Level Results

Same as last year, we used the Pearson correlation coefficient as the main measure for system-level metrics correlation. We use the following formula to compute Pearson's r for each metric and translation direction:

r = \frac{\sum_{i=1}^{n} (H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (H_i - \bar{H})^2} \, \sqrt{\sum_{i=1}^{n} (M_i - \bar{M})^2}}   (1)

where H is the vector of human scores of all systems translating in the given direction, M is the vector of the corresponding scores as predicted by the given metric, and \bar{H} and \bar{M} are their respective means.
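Equation 1 can be computed directly without any statistics library; a minimal sketch with hypothetical system-level scores:

```python
import math

def pearson_r(human, metric):
    """Pearson correlation between human scores and metric scores (Equation 1)."""
    n = len(human)
    h_mean = sum(human) / n
    m_mean = sum(metric) / n
    covariance = sum((h - h_mean) * (m - m_mean) for h, m in zip(human, metric))
    h_norm = math.sqrt(sum((h - h_mean) ** 2 for h in human))
    m_norm = math.sqrt(sum((m - m_mean) ** 2 for m in metric))
    return covariance / (h_norm * m_norm)

# Hypothetical scores of five systems in one translation direction
human_scores = [0.61, 0.57, 0.53, 0.43, -0.12]
metric_scores = [27.1, 26.5, 26.9, 24.0, 20.3]
print(pearson_r(human_scores, metric_scores))
```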

Since we have normalized all metrics such that better translations get a higher score, we consider metrics with values of Pearson's r closer to 1 as better.

You can find the system-level correlations for translations into English in Table 2 and for translations out of English in Table 3. Each row in the tables contains the correlations of a metric in each of the examined translation directions. The upper part of each table lists metrics that participated in all language pairs; it is sorted by average Pearson correlation coefficient across translation directions. The lower part contains metrics limited to a subset of the language pairs, so the average correlation cannot be directly compared with other metrics any more. The best results in each direction are in bold. The reported empirical confidence intervals of system-level correlations were obtained through bootstrap resampling of 1000 samples (confidence level of 95%).
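The bootstrap confidence intervals can be obtained along the following lines. The resampling details are an assumption of this sketch rather than the exact procedure used; statistics.correlation (Pearson's r) requires Python 3.10+, and the scores are hypothetical.

```python
import random
from statistics import correlation  # Pearson's r

def bootstrap_ci(human, metric, samples=1000, confidence=0.95):
    """Empirical confidence interval of the system-level correlation,
    obtained by resampling the set of systems with replacement."""
    n = len(human)
    stats = []
    while len(stats) < samples:
        idx = [random.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:        # degenerate resample: correlation undefined
            continue
        stats.append(correlation([human[i] for i in idx], [metric[i] for i in idx]))
    stats.sort()
    lower = stats[int(samples * (1 - confidence) / 2)]
    upper = stats[int(samples * (1 + confidence) / 2) - 1]
    return lower, upper

human_scores = [0.61, 0.57, 0.53, 0.43, -0.12, -0.26, -0.35]
metric_scores = [27.1, 26.5, 26.9, 24.0, 20.3, 19.8, 18.9]
print(bootstrap_ci(human_scores, metric_scores))
```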

The move to TrueSkill golden truth slightly increased the correlations and changed the ranking of the metrics a little, but the general patterns hold. (The correlation between "Average" and "Pre-TrueSkill Average" is .999 for both directions.)

Both tables also include the average Spearman's rank correlation, which used to be the evaluation measure in the past. Spearman's rank correlation considers only the ranking of the systems and not the distances between them. It is thus more susceptible to instability if several systems have similar scores.
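Spearman's rank correlation is simply Pearson's r computed on the ranks of the scores, which is why only the ordering matters; a small sketch (hypothetical values, average ranks for ties):

```python
from statistics import correlation  # Pearson's r, Python 3.10+

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    return correlation(ranks(x), ranks(y))

print(spearman_rho([0.61, 0.57, 0.53, 0.43], [27.1, 26.9, 26.5, 24.0]))
```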

3.1 System-Level Discussion

As in the previous years, many metrics outperform BLEU both into as well as out of English. Note that the original BLEU was designed to work with 4 references and WMT provides just one; see Bojar et al. (2013) for details on BLEU correlation with a varying number of references, up to several thousands. This year, BLEU with one reference reaches an average correlation of .92 into English or .78 out of English. The best performing metrics get up to .98 into English and .92 out of English. CDER is the best of the baselines, reaching .94 into English and .81 out of English.

The winning metric for each language pair is different, with interesting outliers: DREEM performed best when evaluating English translations from Finnish, but on average 12 other metrics into English performed better, and DREEM appears to be among the worst metrics out of English. RATATOUILLE is fifth to tenth when evaluated by average Pearson but wins in both directions in average Spearman's rank correlation.

Two metrics confirm the effectiveness of character-level measures, especially the winners for out-of-English evaluation: CHRF3 and BEER. The metric CHRF3 is particularly interesting because it does not require any resources whatsoever. It is defined as a simple F-measure of character-level 6-grams (spaces are ignored), with recall weighted 3 times more than precision. The balance between precision and recall seems important depending on the morphological richness of the target language: for evaluations into English, CHRF (equal weights) performs better than CHRF3.

As we already observed in the past, the winning metrics are trained on previous years of WMT. This holds for DPMFCOMB, UOW-LSTM and BEER, including BEER TREEPEL. DPMF and UPF-COBALT are not combination or trained metrics of any kind; DPMF is based on dependency analysis of the candidate and reference sentences, and UPF-COBALT uses contextual information of compared words in the candidate and the reference.

We see an interesting difference in the performance of UOW-LSTM. It is the second metric in system-level correlation but falls among the worst ones in segment-level correlations, see Table 4 below. Gupta et al. (2015b) suggest that the discrepancy in performance could be caused by low inter-annotator agreement and by Kendall's τ not reflecting the distances in translation quality between candidates, an issue similar to what we see with Pearson vs. Spearman's rank correlations.

Another dense-representation metric, DREEM, seems to suffer a similar discrepancy when evaluating into English. Out of English, DREEM did not perform very well.

An untested speculation is that the dense sentence-level representation present in some form in both UOW-LSTM as well as in DREEM confuses the metrics in their judgements of individual sentences.

3.2 Comparison with BLEU

In Appendix A, we provide two correlation plotsfor each language pair. The first plot visualizesthe correlation of BLEU and manual judgements,the second plot shows the correlation for the bestperforming metric for that pair.

The BLEU plots include grey ellipses to indi-cate the confidence intervals of both BLEU as wellas manual judgements. The ellipses are tilted onlyto indicate that BLEU and the manual score aredependent variables. Only the width and heightof each ellipse represent a value, that is the confi-dence interval in each direction. The same verti-cal confidence intervals hold for plots in the right-hand column, but since we don’t have any con-fidence estimates for the individual metrics, weomit them.

Czech-English plots indicate that UPF-COBALT

was able to account for the very different be-haviour of the transfer-based deep-syntactic sys-tem CU-TECTO. It was also able to appreciate thehigher translation quality of montreal, UEDIN-*and online-b. The big cluster of systems labelledTT-* are submissions to the WMT15 Tuning Task(Stanojevic et al., 2015).

For English-Czech, we see that UEDIN-JHU andMONTREAL are overfit for BLEU. In terms ofBLEU, they are very close to the winning systemCU-CHIMERA (a combination of CU-TECTO andphrase-based Moses, followed by automatic post-editing). CHRF3 is able to recognize the overfittingfor MONTREAL, a neural-network based system,but not for UEDIN-JHU. CHRF3 also better recog-nizes the distance in quality between larger sys-

tems (from COMMERCIAL1 above) and the small-data tuning task systems.

For German-English, we see the same overfitting of UEDIN-JHU towards BLEU. While neither UPF-COBALT nor CHRF3 could recognize this for translations involving Czech, the issue is spotted by UPF-COBALT for systems involving German. Syntax-based systems like UEDIN-SYNTAX for English-German and (presumably) ONLINE-B for German-English are among those where the correlation improved most over BLEU.

The French dataset was in a different domain, which may explain why the best performing metric, DPMF, does not actually improve much over BLEU. DPMF uses a syntactic parser on the reference, and the performance of parsers on discussions is likely to be lower than on the generally used news domain.

In the Finnish results, we again see UEDIN-JHU and ABUMATRAN (Rubino et al., 2015) overvalued by BLEU. DREEM, based on distributed representations of words and sentences, is able to recognize this for translation into English but falls among the worst metrics in the other direction. For translation into Finnish, the character-based n-grams of CHRF3 are much more reliable. Variants of ABUMATRAN were again those most overvalued by BLEU. ABUMATRAN uses several types of morphological segmentation and reconstructs Finnish words from the segments by concatenation. ABUMATRAN is loaded with many other features, like web-crawled data, domain handling, and system combination of several approaches. The optimization towards BLEU (unreliable for Finnish, as we have learned in this task) could be among the main reasons behind the comparably lower manual scores.

For Russian, BEER is the best metric, in its syntax-aware variant BEER TREEPEL for evaluation into English. Compared to BLEU, the improvement in correlation is not that striking for Russian-English. (It would be interesting to know whether ONLINE-G is better than ONLINE-B because of English syntax or because it addresses source-side morphology better. BEER TREEPEL captures both aspects.) In the other direction, targeting Russian, BLEU was effectively unable to rank the systems at all. It is probably the character-level features in BEER that allow it to reach a very good correlation of .97.


Correlation coefficient               Pearson Correlation Coefficient                                                                  Spearman's
Direction                             fr-en      fi-en      de-en      cs-en      ru-en      Average    Pre-TrueSkill Avg.   Average
Considered Systems                    7          14         13         16         13

DPMFCOMB                              .995±.004  .958±.011  .973±.009  .991±.002  .974±.008  .978±.007   .970±.012           .882±.041
UOW-LSTM                              .997±.003  .976±.008  .960±.010  .983±.003  .963±.009  .976±.007  ≀.976±.011          ≀.916±.038
BEER TREEPEL                          .981±.008  .971±.010  .952±.012  .992±.002  .981±.008  .975±.008   .962±.014           .861±.051
DPMF                                  .997±.003  .951±.011  .960±.010  .984±.003  .973±.008  .973±.007  ≀.965±.012          ≀.893±.035
UPF-COBALT                            .987±.006  .962±.010  .981±.007  .993±.002  .929±.014  .971±.008  ≀.970±.012           .888±.040
METEOR-WSD                            .982±.007  .950±.012  .953±.011  .983±.003  .976±.008  .969±.008   .960±.014           .832±.051
BEER                                  .979±.008  .965±.010  .946±.012  .983±.003  .971±.009  .969±.009   .958±.015          ≀.838±.049
VERTA-70ADEQ30FLU                     .982±.007  .949±.012  .934±.014  .993±.002  .972±.010  .966±.009   .952±.015          ≀.883±.038
VERTA-W                               .977±.008  .955±.011  .928±.015  .988±.003  .964±.011  .963±.010   .949±.016           .873±.042
CHRF                                  .993±.005  .947±.012  .934±.014  .981±.004  .938±.013  .959±.009   .944±.016           .871±.037
RATATOUILLE                           .986±.006  .902±.016  .958±.011  .961±.005  .955±.011  .952±.010  ≀.956±.014          ≀.919±.039
VERTA-EQ                              .983±.007  .921±.015  .906±.017  .990±.003  .953±.012  .950±.011   .934±.017           .857±.041
DREEM                                 .950±.012  .977±.008  .889±.018  .986±.003  .929±.015  .946±.011   .927±.018           .825±.053
CDER                                  .983±.007  .966±.009  .890±.018  .960±.005  .920±.016  .944±.011   .923±.018           .814±.046
CHRF3                                 .979±.008  .903±.016  .956±.011  .968±.004  .898±.016  .941±.011  ≀.944±.016          ≀.818±.047
NIST                                  .980±.008  .894±.016  .901±.017  .973±.004  .910±.017  .932±.013   .906±.020          ≀.828±.055
LEBLEU-DEFAULT                        .955±.012  .900±.016  .916±.016  .947±.006  .908±.015  .925±.013  ≀.926±.019           .814±.049
LEBLEU-OPTIMIZED                      .984±.007  .900±.016  .916±.016  .976±.004  .842±.020  .923±.013  ≀.928±.018          ≀.855±.042
BS                                    .986±.007  .925±.014  .872±.019  .976±.004  .847±.021  .921±.013   .891±.021           .793±.045
PER                                   .978±.008  .871±.019  .846±.021  .963±.005  .931±.015  .918±.014  ≀.898±.021          ≀.811±.050
BLEU                                  .975±.009  .929±.014  .865±.020  .957±.006  .851±.022  .915±.014   .889±.021           .796±.052
TER                                   .979±.008  .872±.019  .890±.018  .907±.008  .907±.017  .911±.014   .884±.022           .768±.054
WER                                   .977±.009  .853±.020  .884±.018  .888±.008  .895±.018  .899±.015   .871±.023           .747±.057
USAAR-ZWICKEL-METEOR-MEDIAN           n/a        .936±.013  .961±.010  .976±.004  .965±.010  .959±.009   .955±.014           .871±.034
USAAR-ZWICKEL-METEOR-HARMONIC         n/a        .509±.032  .565±.030  .690±.013  .309±.034  .518±.027   .545±.041           .768±.033
USAAR-ZWICKEL-COSINE2METEOR-MEDIAN    n/a       −.220±.037 −.098±.037  .500±.015  .042±.035  .056±.031   .086±.046          −.038±.071
USAAR-ZWICKEL-METEOR-MEAN             n/a        .952±.011  .957±.011  .985±.003  .976±.008  .968±.008   .957±.014           .854±.034
USAAR-ZWICKEL-METEOR-ARIGEO           n/a        .952±.011  .957±.011  .985±.003  .976±.008  .968±.008   .957±.014           .854±.034
USAAR-ZWICKEL-METEOR-RMS              n/a        .958±.011  .944±.013  .988±.003  .974±.009  .966±.009   .947±.015           .861±.032
USAAR-ZWICKEL-COMET-RMS               n/a        .873±.019  .898±.016  .877±.009  .846±.019  .874±.016   .842±.025           .705±.050
USAAR-ZWICKEL-COMET-ARIGEO            n/a        .836±.021  .844±.020  .844±.010  .825±.021  .837±.018   .819±.028           .718±.049
USAAR-ZWICKEL-COSINE2METEOR-RMS       n/a       −.088±.038 −.302±.035  .390±.016  .379±.035  .095±.031   .087±.045           .038±.076
USAAR-ZWICKEL-COSINE-MEDIAN           n/a       −.414±.035 −.514±.033  .816±.010  .440±.035  .082±.028   .047±.041          −.020±.070
USAAR-ZWICKEL-COMET-MEAN              n/a        .836±.021  .844±.020  .844±.010  .825±.021  .837±.018   .819±.028           .718±.049
USAAR-ZWICKEL-COMET-HARMONIC          n/a        .445±.034  .525±.031  .602±.015  .307±.034  .470±.028   .487±.043           .561±.053
USAAR-ZWICKEL-COMET-MEDIAN            n/a       −.108±.038  .135±.036  .638±.013  .167±.035  .208±.030   .235±.046           .146±.069
USAAR-ZWICKEL-COSINE2METEOR-MEAN      n/a       −.119±.037 −.389±.034  .441±.016  .371±.035  .076±.031   .087±.045           .038±.076
USAAR-ZWICKEL-COSINE2METEOR-ARIGEO    n/a       −.119±.037 −.389±.034  .441±.016  .371±.035  .076±.031   .087±.045           .038±.076
USAAR-ZWICKEL-COSINE2METEOR-HARMONIC  n/a       −.341±.035 −.178±.038 −.050±.017  .253±.034 −.079±.031  −.083±.046           .025±.073
USAAR-ZWICKEL-COSINE-MEAN             n/a        nan        .002±.038  .906±.007  nan        nan         nan                 .133±.052
USAAR-ZWICKEL-COSINE-HARMONIC         n/a        nan       −.124±.038  .897±.007  nan        nan         nan                 .038±.048
USAAR-ZWICKEL-COSINE-RMS              n/a        nan        .064±.038  .910±.007  nan        nan         nan                 .146±.052

Table 2: System-level correlations of automatic evaluation metrics and the official WMT human scores when translating into English. The symbol "≀" indicates where the average is out of sequence compared to the main Pearson average.


Correlation coefficient               Pearson Correlation Coefficient                                                                  Spearman's
Direction                             en-fr      en-fi      en-de      en-cs      en-ru      Average    Pre-TrueSkill Avg.   Average
Considered Systems                    7          10         16         15         10

CHRF3                                 .932±.018  .878±.017  .848±.020  .977±.003  .946±.008  .916±.013   .899±.021           .835±.032
BEER                                  .961±.014  .808±.021  .879±.018  .962±.003  .970±.006  .916±.012  ≀.907±.018          ≀.891±.036
LEBLEU-DEFAULT                        .933±.018  .835±.020  .850±.019  .953±.004  .896±.011  .893±.014   .875±.021           .846±.042
LEBLEU-OPTIMIZED                      .933±.018  .803±.022  .868±.019  .952±.004  .908±.010  .893±.014  ≀.882±.021           .845±.043
RATATOUILLE                           .957±.015  .763±.025  .862±.019  .965±.003  .913±.010  .892±.014   .868±.021          ≀.915±.029
CHRF                                  .930±.018  .841±.021  .690±.027  .971±.003  .915±.010  .869±.016   .846±.023           .837±.027
METEOR-WSD                            .959±.014  .760±.024  .650±.029  .953±.004  .892±.011  .843±.017   .816±.024           .837±.036
CDER                                  .953±.015  .640±.029  .660±.028  .929±.004  .863±.012  .809±.018   .777±.025           .704±.051
NIST                                  .949±.015  .692±.028  .502±.032  .958±.003  .893±.003  .799±.018   .771±.026          ≀.769±.047
TER                                   .948±.015  .614±.032  .564±.031  .917±.005  .883±.011  .785±.019   .755±.026           .724±.050
WER                                   .941±.016  .608±.032  .568±.030  .910±.005  .884±.011  .782±.019   .752±.027           .702±.051
BLEU                                  .948±.016  .602±.030  .573±.030  .936±.004  .841±.013  .780±.019   .751±.027           .691±.052
PER                                   .949±.016  .603±.031  .316±.035  .908±.004  .858±.013  .727±.020   .696±.028           .609±.030
BS                                    .964±.013 −.336±.035  .714±.026  .953±.004  .852±.013  .629±.018   .625±.025          ≀.686±.049
DREEM                                 .871±.023  .385±.032 −.074±.039  .883±.006  .968±.006  .607±.021   .608±.031           .682±.039
DPMF                                  .964±.014  n/a        .724±.026  n/a        n/a        .844±.020   .827±.027           .823±.048
USAAR-ZWICKEL-METEOR-MEDIAN           n/a        n/a        .741±.025  n/a        n/a        .741±.025   .685±.038           .750±.046
USAAR-ZWICKEL-METEOR-MEAN             n/a        n/a        .635±.029  n/a        n/a        .635±.029   .581±.041           .615±.041
USAAR-ZWICKEL-METEOR-RMS              n/a        n/a        .542±.033  n/a        n/a        .542±.033   .494±.044           .541±.041
USAAR-ZWICKEL-COMET-HARMONIC          n/a        n/a        .396±.033  n/a        n/a        .396±.033   .386±.045           .309±.057
USAAR-ZWICKEL-METEOR-HARMONIC         n/a        n/a        .357±.032  n/a        n/a        .357±.032   .330±.048          ≀.550±.053
USAAR-ZWICKEL-COSINE-MEDIAN           n/a        n/a        .310±.036  n/a        n/a        .310±.036   .330±.048           .291±.071
USAAR-ZWICKEL-COMET-ARIGEO            n/a        n/a        .310±.037  n/a        n/a        .310±.037   .304±.048          ≀.671±.050
USAAR-ZWICKEL-COSINE2METEOR-MEDIAN    n/a        n/a        .044±.037  n/a        n/a        .044±.037   .031±.051          −.047±.066
USAAR-ZWICKEL-COSINE2METEOR-HARMONIC  n/a        n/a       −.004±.038  n/a        n/a       −.004±.038   .059±.050          ≀.009±.044
USAAR-ZWICKEL-COMET-MEDIAN            n/a        n/a       −.048±.038  n/a        n/a       −.048±.038  −.061±.050          ≀.032±.057
USAAR-ZWICKEL-COMET-RMS               n/a        n/a       −.117±.039  n/a        n/a       −.117±.039  −.127±.050          ≀.415±.054
USAAR-ZWICKEL-COMET-MEAN              n/a        n/a       −.126±.039  n/a        n/a       −.126±.039  −.135±.051           .412±.050
USAAR-ZWICKEL-COSINE2METEOR-ARIGEO    n/a        n/a       −.155±.036  n/a        n/a       −.155±.036  −.156±.050          −.168±.065
USAAR-ZWICKEL-COSINE2METEOR-MEAN      n/a        n/a       −.155±.036  n/a        n/a       −.155±.036  −.156±.050          −.168±.065
USAAR-ZWICKEL-COSINE2METEOR-RMS       n/a        n/a       −.197±.035  n/a        n/a       −.197±.035  −.188±.050         ≀−.188±.063
USAAR-ZWICKEL-METEOR-ARIGEO           n/a        n/a       −.419±.034  n/a        n/a       −.419±.034  −.336±.050          −.162±.071

Table 3: System-level correlations of automatic evaluation metrics and the official WMT human scores when translating out of English. The symbol "≀" indicates where the average is out of sequence compared to the main Pearson average.


4 Segment-Level Results

We measure the quality of metrics' segment-level scores using Kendall's τ rank correlation coefficient. In this type of evaluation, a metric is expected to predict the result of the manual pairwise comparison of two systems. Note that the golden truth is obtained from a compact annotation of five systems at once, while an experiment with text-to-speech evaluation techniques by Vazquez-Alvarez and Huckvale (2002) suggests that a genuine pairwise comparison is likely to lead to more stable results.

The basic formula for Kendall’s τ is:

\tau = \frac{|Concordant| - |Discordant|}{|Concordant| + |Discordant|} \qquad (2)

where Concordant is the set of all human comparisons for which a given metric suggests the same order and Discordant is the set of all human comparisons for which a given metric disagrees. The formula is not specific with respect to ties, i.e. cases where the annotation says that the two outputs are equally good.

The way in which ties (both in human and metric judgment) were incorporated in computing Kendall's τ has changed in each year of the WMT metrics task. Here we adopt the version from WMT14. For a detailed discussion of other options, see Machacek and Bojar (2014).

The method is formally described using the following matrix:

                Metric
            <      =      >
      <     1      0     −1
Human =     X      X      X
      >    −1      0      1

Given such a matrix C_{h,m} where h, m ∈ {<, =, >}⁴ and a metric, we compute the Kendall's τ for the metric in the following way:

We insert each extracted human pairwise comparison into exactly one of the nine sets S_{h,m} according to the human and metric ranks. For example, the set S_{<,>} contains all comparisons where the left-hand system was ranked better than the right-hand system by humans and it was ranked the other way round by the metric in question.

To compute the numerator of Kendall's τ, we take the coefficients from the matrix C_{h,m}, use them to multiply the sizes of the corresponding sets S_{h,m} and then sum them up. We do not include sets for which the value of C_{h,m} is X. To compute the denominator of Kendall's τ, we simply sum the sizes of all the sets S_{h,m} except those where C_{h,m} = X. To define it formally:

\tau = \frac{\sum_{h,m \in \{<,=,>\},\ C_{h,m} \neq X} C_{h,m}\,|S_{h,m}|}{\sum_{h,m \in \{<,=,>\},\ C_{h,m} \neq X} |S_{h,m}|} \qquad (3)

⁴ Here the relation < always means "is better than", even for metrics where the better system receives a higher score.

To summarize, the WMT14 matrix specifies to:

• exclude all human ties,

• count the metric's ties only for the denominator of Kendall's τ (thus giving no credit for a tie),

• count all cases of disagreement between human and metric judgements as Discordant,

• count all cases of agreement between human and metric judgements as Concordant.
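Putting equations (2)–(3) and the rules above together, the computation can be sketched in a few lines of Python. The input format (one (human, metric) relation per extracted pairwise comparison) is assumed here for illustration and is not the format used by the shared task scripts.

```python
def wmt14_kendall_tau(comparisons):
    """Kendall's tau over pairwise comparisons with the WMT14 tie handling.

    `comparisons` is assumed to be a list of (human, metric) relations,
    each one of '<', '=', '>', where '<' means "the left-hand system is better".
    """
    # Coefficient matrix C[h][m]; None corresponds to the X entries
    # (human ties are excluded entirely).
    C = {
        '<': {'<': 1, '=': 0, '>': -1},
        '=': {'<': None, '=': None, '>': None},
        '>': {'<': -1, '=': 0, '>': 1},
    }
    numerator, denominator = 0, 0
    for human, metric in comparisons:
        coeff = C[human][metric]
        if coeff is None:          # human tie: ignored completely
            continue
        numerator += coeff         # +1 concordant, -1 discordant, 0 metric tie
        denominator += 1           # metric ties still count in the denominator
    return numerator / denominator if denominator else 0.0

# Toy example: three extracted pairwise comparisons.
print(wmt14_kendall_tau([('<', '<'), ('<', '>'), ('>', '=')]))
```

The toy example yields τ = 0: one concordant pair, one discordant pair and one metric tie that only enlarges the denominator.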

You can find the segment-level correlations for translations into English in Table 4 and for translations out of English in Table 5. Again, the upper part of each table contains metrics participating in all language pairs and it is sorted by average τ across translation directions. The lower part contains metrics limited to a subset of the language pairs, so the average cannot be directly compared with other metrics any more.

4.1 Segment-Level Discussion

As usual, segment-level correlations are significantly lower than system-level ones. The highest correlation is reached by DPMFCOMB on Czech-to-English: .495 of Kendall's τ. The correlations reach on average .447 into English and .400 out of English.

DPMFCOMB is the clear winner into English, followed by BEER TREEPEL, both of which consider syntactic structure of the sentence, combined with several other independent features or metrics.

RATATOUILLE, also a combined metric, is the best option for evaluation to and from French.

Metrics considering character-level n-grams (BEER and CHRF3) are particularly good for evaluation out of English, and their margin seems to be the highest for English-to-Finnish, up to .06 points.


Direction                       fr-en      fi-en      de-en      cs-en      ru-en      Average
Extracted pairs                 29770      31577      40535      85877      44539

DPMFCOMB                        .395±.012  .445±.012  .482±.009  .495±.007  .418±.013  .447±.011
BEER TREEPEL                    .389±.014  .438±.010  .447±.008  .471±.007  .403±.014  .429±.011
RATATOUILLE                     .398±.010  .421±.011  .441±.010  .472±.007  .393±.013  .425±.010
UPF-COBALT                      .386±.012  .437±.013  .427±.011  .457±.007  .402±.013  .422±.011
BEER                            .393±.012  .422±.012  .438±.010  .457±.008  .396±.014  .421±.011
CHRF                            .383±.011  .417±.012  .424±.010  .446±.008  .384±.014  .411±.011
CHRF3                           .383±.013  .397±.011  .421±.010  .449±.008  .386±.013  .407±.011
METEOR-WSD                      .375±.012  .406±.010  .420±.011  .438±.008  .387±.012  .405±.010
DPMF                            .368±.012  .411±.011  .418±.011  .436±.008  .378±.011  .402±.011
LEBLEU-OPTIMIZED                .376±.013  .391±.010  .399±.010  .438±.008  .374±.012  .396±.011
LEBLEU-DEFAULT                  .373±.013  .383±.011  .402±.009  .436±.007  .376±.011  .394±.010
VERTA-EQ                        .388±.012  .369±.013  .410±.011  .447±.007  .346±.013  .392±.011
VERTA-70ADEQ30FLU               .374±.012  .365±.014  .418±.011  .438±.007  .344±.013  .388±.011
VERTA-W                         .383±.010  .344±.014  .416±.010  .445±.007  .345±.013  .387±.011
DREEM                           .362±.012  .340±.010  .368±.011  .423±.007  .348±.013  .368±.011
UOW-LSTM                        .332±.011  .376±.012  .375±.011  .385±.008  .356±.010  .365±.011
SENTBLEU                        .358±.013  .308±.012  .360±.011  .391±.006  .329±.011  .349±.011
TOTAL-BS                        .332±.013  .319±.013  .333±.010  .381±.007  .321±.011  .337±.011
USAAR-ZWICKEL-METEOR            n/a        .406±.011  .422±.011  .439±.008  .386±.012  .413±.011
USAAR-ZWICKEL-COMET             n/a        .021±.013  .050±.010  .072±.009  .084±.010  .057±.011
USAAR-ZWICKEL-COSINE2METEOR     n/a        .001±.013 −.011±.010  .020±.009  .041±.010  .013±.011
USAAR-ZWICKEL-COSINE            n/a       −.035±.013 −.019±.010  .090±.008  .014±.013  .012±.011

Table 4: Segment-level Kendall's τ correlations of automatic evaluation metrics and the official WMT human judgements when translating into English.


Direction                       en-fr      en-fi      en-de      en-cs      en-ru      Average
Extracted pairs                 34512      32694      54447      136890     49302

BEER                            .352±.010  .380±.010  .393±.010  .435±.006  .439±.010  .400±.009
CHRF3                           .335±.013  .373±.012  .398±.008  .446±.005  .420±.010  .395±.010
RATATOUILLE                     .366±.013  .318±.011  .381±.008  .429±.006  .436±.010  .386±.010
LEBLEU-OPTIMIZED                .347±.009  .368±.010  .399±.008  .410±.006  .404±.011  .386±.009
CHRF                            .342±.012  .359±.010  .372±.010  .444±.005  .410±.011  .385±.010
LEBLEU-DEFAULT                  .345±.010  .368±.010  .398±.009  .406±.006  .404±.012  .384±.009
METEOR-WSD                      .342±.012  .286±.010  .344±.007  .390±.006  .399±.010  .352±.009
DREEM                           .338±.012  .280±.011  .317±.010  .395±.006  .366±.010  .339±.010
SENTBLEU                        .318±.011  .227±.011  .294±.009  .360±.005  .347±.010  .309±.009
TOTAL-BS                        .297±.011  .223±.009  .278±.009  .345±.005  .356±.011  .300±.009
DPMF                            .335±.012  n/a        .350±.009  n/a        n/a        .343±.010
USAAR-ZWICKEL-METEOR            n/a        n/a        .342±.008  n/a        n/a        .342±.008
USAAR-ZWICKEL-COMET             n/a        n/a        .056±.019  n/a        n/a        .056±.009
USAAR-ZWICKEL-COSINE            n/a        n/a       −.007±.010  n/a        n/a       −.007±.010
USAAR-ZWICKEL-COSINE2METEOR     n/a        n/a       −.027±.019  n/a        n/a       −.027±.009

Table 5: Segment-level Kendall's τ correlations of automatic evaluation metrics and the official WMT human judgements when translating out of English.


                          2014         2015         Delta
BEER
  Average en→*            .319±.011    .401±.009     0.082
  en-cs                   .344±.009    .435±.006     0.091
  en-de                   .268±.009    .396±.008     0.128
  en-fr                   .292±.012    .352±.010     0.060
  en-ru                   .440±.013    .440±.012     0.000
  Average *→en            .362±.013    .423±.010     0.061
  cs-en                   .284±.016    .457±.008     0.173
  de-en                   .337±.014    .438±.010     0.101
  fr-en                   .417±.013    .393±.012    -0.024
  ru-en                   .333±.011    .406±.009     0.073
SENTBLEU
  Average en→*            .269±.011    .310±.009     0.041
  en-cs                   .290±.009    .360±.005     0.070
  en-de                   .191±.009    .296±.010     0.105
  en-fr                   .256±.012    .318±.011     0.062
  en-ru                   .381±.013    .347±.010    -0.034
  Average *→en            .285±.013    .351±.011     0.066
  cs-en                   .213±.016    .391±.006     0.178
  de-en                   .271±.014    .360±.011     0.089
  fr-en                   .378±.013    .358±.013    -0.020
  ru-en                   .263±.011    .340±.012     0.077
Average                                              0.07±0.06

Table 6: Kendall's τ scores for two metrics across years.


Only two segment-level metrics took part in both 2014 and 2015: BEER in a slightly improved implementation (with some small effect on the scores) and SENTBLEU in exactly the same implementation. Table 6 documents that this year the scores are on average slightly higher. The main reason probably lies in the test set, which may be somewhat easier this year. French is different: the correlations decreased somewhat this year, which can be easily explained by the domain change, news in 2014 and discussions in 2015. The increase should not be caused by the redundancy cleanup of WMT manual rankings, see Bojar et al. (2015), since the collapsed systems get a tie after expanding and our implementation ignores all tied manual comparisons.

5 Conclusion

In this paper, we summarized the results of the WMT15 Metrics Shared Task, which assesses the quality of various automatic machine translation metrics. As in previous years, human judgements collected in WMT15 serve as the golden truth and we check how well the metrics predict the judgements at the level of individual sentences as well as at the level of the whole test set (system-level).

Across the two types of evaluation and the 10 language pairs, we saw great performance of trained and combined metrics (DPMFCOMB, BEER, RATATOUILLE and others). Neural networks for continuous word and sentence representations have also shown their generalization power, with an interesting discrepancy in system- vs. segment-level performance of UOW-LSTM and, to a smaller degree, of DREEM.

We value the metric CHRF or CHRF3 highly for its extreme simplicity and very good performance at both system and segment level, and especially out of English. We are curious to see if CHRF3 has the potential of becoming "the BLEU for the next five years". It would be very interesting to test its usability in system tuning. It is known that in tuning, metrics putting too much attention on recall can be easily tricked, but perhaps a careful setting of CHRF's β will be sufficient.

The WMT Metrics Task again attracted a good number of participants and the majority of the submitted metrics are actually new ones. This is good news, indicating that MT metrics are an active field of research. Most, if not all, metrics come with source code, so it should be relatively easy to use them in one's own experiments. Still, we would expect much wider adoption of the metrics if they made it, for example, into the standard Moses scorer or at least into the Asiya toolkit.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no 645452 (QT21) and no 644402 (HimL). The work on this project was also supported by the Dutch organisation for scientific research STW grant nr. 12271.

References

Ondrej Bojar, Milos Ercegovcevic, Martin Popel, and Omar Zaidan. 2011. A Grain of Salt for the WMT Manual Evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ondrej Bojar, Matous Machacek, Ales Tamchyna, and Daniel Zeman. 2013. Scratching the Surface of Possible Translations. In Proc. of TSD 2013, Lecture Notes in Artificial Intelligence, Berlin / Heidelberg. Zapadoceska univerzita v Plzni, Springer Verlag.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Boxing Chen, Hongyu Guo, and Roland Kuhn. 2015. Multi-level Evaluation for Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Elisabet Comelles and Jordi Atserias. 2015. VERTa: a Linguistically-motivated Metric at the WMT15 Metrics Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 138–145, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Marina Fomicheva, Nuria Bel, Iria da Cunha, and Anton Malinovskiy. 2015. UPF-Cobalt Submission to WMT15 Metrics Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015a. Machine Translation Evaluation using Recurrent Neural Networks. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015b. ReVal: A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '15, Lisbon, Portugal.

Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings of the Workshop on Statistical Machine Translation, StatMT '06, pages 102–121, Stroudsburg, PA, USA. Association for Computational Linguistics.

Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2006. CDER: Efficient MT evaluation using block movements. In Proceedings of EACL, pages 241–248.

Matous Machacek and Ondrej Bojar. 2013. Results of the WMT13 Metrics Shared Task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 45–51, Sofia, Bulgaria, August. Association for Computational Linguistics.

Matous Machacek and Ondrej Bojar. 2014. Results of the WMT14 Metrics Shared Task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Benjamin Marie and Marianna Apidianaki. 2015. Alignment-based sense selection in METEOR and the RATATOUILLE recipe. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Maja Popovic. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Raphael Rubino, Tommi Pirinen, Miquel Espla-Gomis, Nikola Ljubesic, Sergio Ortiz Rojas, Vassilis Papavassiliou, Prokopis Prokopidis, and Antonio Toral. 2015. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Milos Stanojevic and Khalil Sima'an. 2015. BEER 1.1: ILLC UvA submission to metrics and tuning task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Milos Stanojevic, Amir Kamran, and Ondrej Bojar. 2015. Results of the WMT15 Tuning Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Yolanda Vazquez-Alvarez and Mark Huckvale. 2002. The reliability of the ITU-T P.85 standard for the evaluation of text-to-speech systems. In Proc. of ICSLP - INTERSPEECH.


Mihaela Vela and Liling Tan. 2015. Predicting Machine Translation Adequacy with Document Embeddings. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Sami Virpioja and Stig-Arne Gronroos. 2015. LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Hui Yu, Qingsong Ma, Xiaofeng Wu, and Qun Liu. 2015. CASICT-DCU Participation in WMT2015 Metrics Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.


A System-Level Correlation Plots

The following figures plot the system-level results of BLEU (left-hand plots) and the best performing metric for the given language pair (right-hand plots) against manual score. See the discussion in Section 3.2.

Czech-English

(Scatter plots of BLEU and of UPF-COBALT against the manual score for the Czech-English systems.)

English-Czech

(Scatter plots of BLEU and of CHRF3 against the manual score for the English-Czech systems.)

German-English

(Scatter plots of BLEU and of UPF-COBALT against the manual score for the German-English systems.)

English-German

(Scatter plots of BLEU and of BEER against the manual score for the English-German systems.)

French-English

(Scatter plots of BLEU and of DPMF against the manual score for the French-English systems.)

English-French

(Scatter plots of BLEU and of DPMF against the manual score for the English-French systems.)

Finnish-English

(Scatter plots of BLEU and of DREEM against the manual score for the Finnish-English systems.)

English-Finnish

(Scatter plots of BLEU and of CHRF3 against the manual score for the English-Finnish systems.)

Russian-English

(Scatter plots of BLEU and of BEER TREEPEL against the manual score for the Russian-English systems.)

English-Russian

(Scatter plots of BLEU and of BEER against the manual score for the English-Russian systems.)

C Results of the WMT15 Tuning Shared Task

Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 274–281, Lisboa, Portugal, 17-18 September 2015. © 2015 Association for Computational Linguistics.

Results of the WMT15 Tuning Shared Task

Milos Stanojevic and Amir Kamran
University of Amsterdam, ILLC
{m.stanojevic,a.kamran}@uva.nl

Ondrej Bojar
Charles University in Prague, MFF
[email protected]

Abstract

This paper presents the results of the WMT15 Tuning Shared Task. We provided the participants of this task with a complete machine translation system and asked them to tune its internal parameters (feature weights). The tuned systems were used to translate the test set and the outputs were manually ranked for translation quality. We received 4 submissions in the English-Czech and 6 in the Czech-English translation direction. In addition, we ran 3 baseline setups, tuning the parameters with standard optimizers for BLEU score.

1 Introduction

Almost all modern statistical machine translation (SMT) systems internally consider translation candidates from several aspects. Some of these aspects can be very simple and one parameter is sufficient to capture them, such as the word penalty incurred for every word produced or the phrase penalty controlling whether the sentence should be translated in fewer or more independent phrases, leading to more or less word-for-word translation. Other aspects try to assess e.g. the fidelity of the translation, the fluency of the output or the amount of reordering. These are far more complex and formally captured in a model such as the translation model or language model.

Both the simple penalties and the scores from the more complex models are called features and need to be combined into a single score to allow for ranking of translation candidates. This is usually done using a linear combination of the scores:

\mathrm{score}(e) = \sum_{m=1}^{M} \lambda_m h_m(e, f) \qquad (1)

where e and f are the candidate translation and the source, respectively, and h_m(·, ·) is one of the M penalties or models. The tuned parameters are λ_m ∈ R, called feature weights.
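The following is a minimal sketch of Equation (1) in Python, illustrating how the weighted feature scores of competing candidates are combined and ranked. The feature names and values are invented for the example and only loosely mirror typical dense Moses features.

```python
def model_score(features, weights):
    """Linear model score of Equation (1): sum over lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values h_m(e, f) for two translation candidates of the
# same source sentence (names and numbers are placeholders for illustration).
weights = {"lm": 0.07, "tm": 0.04, "word_penalty": -0.17, "phrase_penalty": -0.20}
candidates = [
    ("the chairman signed the agreement",
     {"lm": -12.3, "tm": -4.1, "word_penalty": -5.0, "phrase_penalty": -2.0}),
    ("the chairman has signed on the agreement",
     {"lm": -14.8, "tm": -3.7, "word_penalty": -7.0, "phrase_penalty": -3.0}),
]

# The decoder ranks candidates by the combined score; changing the feature
# weights can change which candidate wins.
best = max(candidates, key=lambda c: model_score(c[1], weights))
print(best[0])
```

Tuning, as described in the rest of this section, is the search for weight values under which the candidates preferred by this combined score are also the ones preferred by an evaluation metric.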

Feature weights have a tremendous effect on the final translation quality. For instance, the system can produce extremely long outputs, fabricating words just in order to satisfy a negatively-weighted word penalty, i.e. a bonus for each word produced. An inherent part of the preparation of MT systems is thus some optimization of the weight settings.

If we had to set the weights manually, we would have to try a few configurations and pick one that leads to reasonable outputs. The common practice is to use an optimization algorithm that examines many settings, evaluating the produced translations automatically against reference translations using some evaluation measure (traditionally called "metric" in the MT field). In short, the optimizer tunes model weights so that the final combined model score correlates with the metric score.

The metric score, in turn, is designed to correlate well with human judgements of translation quality, see Stanojevic et al. (2015) and the previous papers summarizing WMT metrics tasks. However, a metric that correlates well with humans on final output quality may not be usable in weight optimization for various technical reasons. BLEU (Papineni et al., 2002) was shown to be very hard to surpass (Cer et al., 2010) and this is also confirmed by the results of the invitation-only WMT11 Tunable Metrics Task (Callison-Burch et al., 2010).¹ Note however, that some metrics have been successfully used for system tuning (Liu et al., 2011; Beloucif et al., 2014).

The aim of the WMT15 Tuning Task² is to attract attention to the exploration of all the three aspects of model optimization: (1) the set of features in the model, (2) optimization algorithm, and (3) MT quality metric used in optimization.

¹ http://www.statmt.org/wmt11/tunable-metrics-task.html

² http://www.statmt.org/wmt15/tuning-task/


              Source                                   Sentences         Tokens          Types
                                                       cs       en       cs      en      cs      en
LM corpora    News Commentary v8                       162309   247966   3.6M    6.2M    162K    81K
TM corpora    Europarl v7, CCrawl and News Comm. v9    911952            17.7M   20.8M   652K    361K
Dev set       newstest2014                             3003              51K     60K     19K     13K
Test set      newstest2015                             2656              39K     47K     16K     11K

Table 1: Data used in the WMT15 tuning task.

              Dev               Test
Direction     Token    Type     Token    Type
en-cs         2570     2032     2003     1655
cs-en         3891     3415     3381     3011

Table 2: Out of vocabulary word counts.


For (1), we provide a fixed set of "dense" features and also allow participants to add additional "sparse" features. For (2), the optimization algorithm, task participants are free to use one of the available algorithms for direct loss optimization (Och, 2003; Zhao and Chen, 2009), which are usually capable of optimizing only a dozen of features, or one of the optimizers handling also very large sets of features (Cherry and Foster, 2012; Hopkins and May, 2011), or a custom algorithm. And finally for (3), participants can use any established evaluation metric or a custom one.

1.1 Tuning Task Assignment

Tuning task participants were given a complete model for the hierarchical variant of the machine translation system Moses (Hoang et al., 2009) and the development set (newstest2014), i.e. the source and reference translations. No "dev test" set was provided, since we expected that participants would internally evaluate various variants of their method by manually judging MT outputs. In fact, we offered to evaluate a certain number of translations into Czech for free to ease the participation for teams without any access to speakers of Czech; only one team used this service once.

A complete model consists of a rule table extracted from the parallel corpus, the default glue grammar and the language model extracted from the monolingual data. As such, this defines a fixed set of dense features. The participants were allowed to add any sparse features implemented in Moses Release 3.0 (corresponds to Github commit 5244a7b607) and/or to use any optimization algorithm and evaluation metric. Fully manual optimization was also not excluded but nobody seemed to take this approach.

Each submission in the tuning task consisted of the configuration of the MT system, i.e. the additional sparse features (if any) and the values of all the feature weights, λ_m.

2 Details of Systems Tuned

The systems that were distributed for tuning are based on the Moses (Hoang et al., 2009) implementation of the hierarchical phrase-based model (Chiang, 2005). The language models were 5-gram models with Kneser-Ney smoothing (Kneser and Ney, 1995) built using KenLM (Heafield et al., 2013). For word alignments, we used Mgiza++ (Gao and Vogel, 2008).

The parallel data used for training translation models consisted of the Europarl v7, News Commentary data (parallel-nc-v9) and CommonCrawl, as released for WMT14.³ We excluded CzEng because we wanted to keep the task small and accessible to more groups.

Since the test set (newstest2015) and the development set (newstest2014) are in the news domain, we opted to exclude Europarl from the language model data. We did not add any monolingual news on top of News Commentary, which is quite close to the news domain. In retrospect, we should also have added some of the monolingual news data as released by WMT, especially since we used a 5-gram LM.

Before any further processing, the data was tokenized (using the Moses tokenizer) and lowercased. We also removed sentences longer than 60 words or shorter than 4 words. Table 1 summarizes the final dataset sizes and Table 2 provides details on out-of-vocabulary items.

Aside from the dev set provided, the participants were free to use any other data for tuning (making their submission "unconstrained"), but no participant decided to do that. All tuning task submissions are therefore also constrained in terms of the WMT15 Translation Task (Bojar et al., 2015).

³ http://www.statmt.org/wmt14/translation-task.html


System        Participant
BLEU-*        baselines
AFRL          United States Air Force Research Laboratory (Erdmann and Gwinnup, 2015)
DCU           Dublin City University (Li et al., 2015)
HKUST         Hong Kong University of Science and Technology (Lo et al., 2015)
ILLC-UVA      ILLC – University of Amsterdam (Stanojevic and Sima'an, 2015)
METEOR-CMU    Carnegie Mellon University (Denkowski and Lavie, 2011)
USAAR-TUNA    Saarland University (Liling Tan and Mihaela Vela; no corresponding paper)

Table 3: Participants of WMT15 Tuning Shared Task

We leave all decoder settings (n-best list size, pruning limits etc.) at their default values. While the participants may have used different limits during tuning, the final test run was performed at our site with the default values. It is indeed only the feature weights that differ.


3 Tuning Task Participants

The list of participants and the names of the submitted systems are shown in Table 3, along with references to the details of each method.

USAAR-TUNA by Liling Tan and Mihaela Vela has no accompanying paper, so we sketch it here. The method sets each weight as the harmonic mean (2xy/(x+y)) of the weights proposed by batch MIRA and MERT. Batch MIRA and MERT are run side by side and the harmonic mean is taken and used in moses.ini at every iteration. The optimization stops when the averaged weights change only very little, which happened around iteration 17 or 18 in this case (Liling Tan, pc).
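A minimal sketch of the per-weight combination just described, with made-up weight vectors (the feature names are placeholders, not the actual Moses feature names of the task):

```python
def harmonic_mean_weights(mira_weights, mert_weights):
    """Combine two weight vectors feature by feature with the harmonic mean
    2xy / (x + y), as described for USAAR-TUNA above (sketch only)."""
    combined = {}
    for name in mira_weights:          # assumes both vectors share the same features
        x, y = mira_weights[name], mert_weights[name]
        combined[name] = 2 * x * y / (x + y) if (x + y) != 0 else 0.0
    return combined

# Made-up weights from one batch-MIRA and one MERT iteration.
print(harmonic_mean_weights({"lm": 0.10, "word_penalty": -0.30},
                            {"lm": 0.08, "word_penalty": -0.20}))
```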

ILLC-UVA (Stanojevic and Sima'an, 2015) was tuned using KBMIRA with a modified version of the BEER evaluation metric. The authors claim that standard trained evaluation metrics learn to give too much importance to recall and thus lead to overly long translations in tuning. For that reason they modify the training of BEER to value recall and precision equally. This modified version of BEER is used to train the MT system.

DCU (Li et al., 2015) is tuned with RED, an evaluation metric based on matching of dependency n-grams. The authors have tried tuning with both MERT and KBMIRA and found that KBMIRA gives better results, so the submitted system uses KBMIRA.

HKUST (Lo et al., 2015) is tuned with an improved version of MEANT. MEANT is an evaluation metric that pays more attention to the semantic aspect of translation. Better correlation on the sentence level was achieved by integrating distributional semantics into MEANT and handling failures of the underlying semantic parser. The submission of HKUST contained a bug that was discovered after the human evaluation period, so the corrected submission HKUST-LATE is evaluated only with BLEU.

METEOR-CMU (Denkowski and Lavie, 2011) is a system tuned for an adapted version of Meteor. Meteor's parameters are set to give an equal importance to precision and recall.

AFRL (Erdmann and Gwinnup, 2015) is the only submission trained with a new tuning algorithm "Drem" instead of the standard MERT or KBMIRA. Drem uses scaled derivative-free trust-region optimization instead of line search or (sub)gradient approximations. For weight settings that were not tested in the decoder yet, it interpolates the decoder output using the information of which settings produced which translations. The optimized metric is a weighted combination of NIST, Meteor and Kendall's τ.

In addition to the systems submitted, we provided three baselines:

• BLEU-MERT-DENSE – MERT tuning with BLEU without additional features

• BLEU-MIRA-DENSE – KBMIRA tuning with BLEU without additional features

• BLEU-MIRA-SPARSE – KBMIRA tuning with BLEU with additional sparse features

Since all the submissions including the baselines were subject to manual evaluation, we did not run the MERT or MIRA optimizations more than once (as is the common practice for estimating variance due to optimizer instability). We simply used the default settings and stopping criteria and picked the weights that performed best on the dev set according to BLEU.

Of all the submissions, only the submission METEOR-CMU used sparse features. For a more interesting comparison, we set our baseline (BLEU-MIRA-SPARSE) to use the very same set of sparse features. These features are automatically constructed using Moses' feature templates named PhraseLengthFeature0, SourceWordDeletionFeature0, TargetWordInsertionFeature0 and WordTranslationFeature0. They were made for the 50 most frequent words in the training data. For both language pairs these feature templates produce around 1000 features.

4 Results

We used the submitted moses.ini and (optionally) sparse weights files to translate the test set. The test set was not available to the participants at the time of their submission (not even the source side). We used the Moses recaser trained on the target side of the parallel corpus to recase the outputs of all the models.

Finally, the recased outputs were manually evaluated, jointly with regular translation task submissions of WMT (Bojar et al., 2015). This was not enough to reliably separate tuning systems in the Czech-to-English direction, so we asked task participants to provide some further rankings.

The resulting human rankings were used to compute the overall manual score using the TrueSkill method, same as for the main translation task (Bojar et al., 2015). We report two variants of the score: one is based on manual judgements related to tuning systems only and one is based on all judgements. Note that the actual ranking tasks shown to the annotators were identical, mixing tuning systems with regular submissions.

Tables 4 and 5 contain the results of the submitted systems sorted by their manual scores.

The horizontal lines represent the separation between clusters of systems that perform similarly. Cluster boundaries are established by the same method as for the main translation task. Interestingly, cluster boundaries for Czech-to-English vary as we change the set of judgements.

Some systems do not have the TrueSkill score because they were either submitted after the deadline (HKUST-LATE) or served as additional baselines and performed similarly to our baselines (USAAR-BASELINE-MIRA and USAAR-BASELINE-MERT).

5 Discussion

There are a few interesting observations that can be made about the baseline results. Various details of the submissions, including the exact weight settings, are in Table 6.

System Name           TrueSkill Score           BLEU
                      Tuning-Only    All
BLEU-MIRA-DENSE        0.153        -0.182       12.28
ILLC-UVA               0.108        -0.189       12.05
BLEU-MERT-DENSE        0.087        -0.196       12.11
AFRL                   0.070        -0.210       12.20
USAAR-TUNA             0.011        -0.220       12.16
DCU                   -0.027        -0.263       11.44
METEOR-CMU            -0.101        -0.297       10.88
BLEU-MIRA-SPARSE      -0.150        -0.320       10.84
HKUST                 -0.150        -0.320       10.99
HKUST-LATE             —             —           12.20

Table 4: Results on Czech-English tuning

System Name             TrueSkill Score           BLEU
                        Tuning-Only    All
DCU                      0.320        -0.342       4.96
BLEU-MIRA-DENSE          0.303        -0.346       5.31
AFRL                     0.303        -0.342       5.34
USAAR-TUNA               0.214        -0.373       5.26
BLEU-MERT-DENSE          0.123        -0.406       5.24
METEOR-CMU              -0.271        -0.563       4.37
BLEU-MIRA-SPARSE        -0.992        -0.808       3.79
USAAR-BASELINE-MIRA      —             —           5.31
USAAR-BASELINE-MERT      —             —           5.25

Table 5: Results on English-Czech tuning


5.1 Dense vs. Sparse Features

It is surprising how well the baseline based on KBMIRA and BLEU tuning (BLEU-MIRA-DENSE) performs on both language pairs. On Czech-English, it is better than all the other submitted systems while on English-Czech, only one system outperforms it (staying in the same performance cluster anyway).

Using BLEU-MIRA-DENSE for tuning dense features is becoming more common in the MT community, compared to the previous standard of using MERT. Our results confirm this practice. Preferring KBMIRA to MERT is often motivated by the possibility to include sparse features, but we see that even for dense features only, KBMIRA is better than MERT.

The sparse models, BLEU-MIRA-SPARSE and METEOR-CMU, however, perform rather poorly even though they were trained with KBMIRA. Both of the sparse submissions use the same set of features and the same tuning algorithm, although the optimization was run at different sites. The only difference is the metric they optimize. Tuning for Meteor (Denkowski and Lavie, 2011) gives better results than tuning for BLEU (Papineni et al., 2002). Unfortunately, we had no system with dense features tuned for Meteor so we could not see if Meteor outperforms BLEU in the dense-only setting as well.


Figure 1: Relation between the word penalty (x-axis, after L2 normalization) and the final manual score (y-axis) of systems translating from English to Czech; one point per submitted system.

Figure 2: PCA for English-Czech: systems plotted by their first two principal components (PC1 on the x-axis, PC2 on the y-axis). The darker the point, the higher the manual score.

Unfortunately, we had no system with dense features tuned for Meteor, so we could not see whether Meteor outperforms BLEU in the dense-only setting as well.

It is not clear why the sparse methods perform badly. One explanation could be the relatively small development set or some pruning settings. In any case, we find it unfortunate that sparse features in the hierarchical model harm performance in the default configuration.⁴

5.2 Some Observations on Weight Settings

We tried to find some patterns in the weight settings and the performance of the systems, but admittedly, it is difficult to make much sense of the few points in the 8-dimensional space.

For English-to-Czech, we can see a hint of a bell-like shape when normalizing the weights with the L2 norm and plotting the word penalty against the manual score, see Figure 1.

⁴ MERT and two MIRA runs reached a BLEU score no more than +0.02 points higher when the size of the n-best list was increased from 100 to 200, so the n-best list size does not seem to be the problem.

Czech-to-English:

System             Type    Manual  Test   Dev    LM0     PhrPen   TM0     TM1     TM2     TM3     Glue    WrdPen
                           Score   BLEU   BLEU
AFRL               dense    0.0700 12.20  14.83  0.1588  -0.3330  0.0545  0.0859  0.1958  0.1716  0.6309  -0.6227
bleu_MERT          dense    0.0870 12.11  14.64  0.0992  -0.0507  0.0688  0.0350  0.1296  0.0919  0.1820  -0.3428
bleu_MIRA_dense    dense    0.1530 12.28  14.85  0.0671  -0.1689  0.0363  0.0413  0.0747  0.0680  0.2982  -0.2454
bleu_MIRA_sparse   sparse  -0.1500 10.84  13.16  0.0906  -0.0568  0.0431  0.0556  0.0928  0.0933  0.3584  -0.2093
DCU                dense   -0.0270 11.44  13.58  0.0558  -0.1407  0.0360  0.0517  0.0856  0.0671  0.2481  -0.3150
HKUST_MEANT        dense   -0.1500 10.99  13.23  0.1333   0.0868  0.1318  0.0115  0.0534  0.1221  0.0500  -0.4110
HKUST_MEANT_LATE   dense        —  12.20  14.42  0.0638  -0.1696  0.0655  0.0217  0.0713  0.0677  0.3074  -0.2330
ILLC_UvA           dense    0.1080 12.05  14.57  0.0918  -0.1215  0.0452  0.0624  0.1103  0.0697  0.2295  -0.2696
METEOR_CMU         sparse  -0.1010 10.88  13.35  0.0936  -0.0103  0.0602  0.0509  0.1162  0.1187  0.2946  -0.2556
USAAR-Tuna         dense    0.0110 12.16  14.57  0.0789  -0.0715  0.0383  0.0575  0.1039  0.0744  0.1839  -0.2952

English-to-Czech:

System                   Type    Manual  Test  Dev   LM0     PhrPen   TM0      TM1     TM2     TM3     Glue    WrdPen
                                 Score   BLEU  BLEU
AFRL                     dense    0.3030  5.34  6.96  0.0543  -0.4326  -0.0025  0.0382  0.2696  0.0788  0.8332  -0.1878
bleu_MERT                dense    0.1230  5.24  7.11  0.0510  -0.1353   0.0048  0.0169  0.1772  0.0408  0.3508  -0.2231
bleu_MIRA_dense          dense    0.3030  5.31  7.20  0.0380  -0.2046  -0.0004  0.0286  0.1338  0.0320  0.3936  -0.1689
bleu_MIRA_sparse         sparse  -0.9920  3.79  5.19  0.0364  -0.1232  -0.0053  0.0350  0.0905  0.0480  0.5524  -0.1093
DCU                      dense    0.3200  4.96  6.87  0.0247  -0.1949  -0.0022  0.0367  0.1370  0.0345  0.3767  -0.1932
METEOR_CMU               sparse  -0.2710  4.37  5.86  0.0394  -0.0935  -0.0087  0.0331  0.1611  0.0673  0.4548  -0.1421
Saarland baseline mert   dense        —   5.25  7.16  0.0394  -0.1619  -0.0011  0.0218  0.1947  0.0211  0.3973  -0.1628
Saarland baseline mira   dense        —   5.31  7.11  0.0377  -0.2023  -0.0007  0.0293  0.1304  0.0344  0.3936  -0.1714
USAAR-Tuna               dense    0.2140  5.26  7.15  0.0386  -0.1799  -0.0008  0.0250  0.1562  0.0262  0.3954  -0.1670

Table 6: Detailed scores and weights of Czech-to-English (top) and English-to-Czech (bottom) systems.


Figure 3: PCA for Czech-English: systems plotted by their first two principal components (PC1 on the x-axis, PC2 on the y-axis). The darker the point, the higher the manual score.

The middle values seemed to be a good setting. For the other translation direction, and for other weights, no such clear relation is apparent.

We also tried to interpret the weight settings using principal component analysis (PCA), despite the low number of observations (ideally, we would like to have at least 40–80 systems; we have 7 or 9). Before running PCA, we normalized the weights with the L2 norm. A Cattell scree test indicated that two components would be appropriate to summarize the dataset. To make the components more interpretable, we applied varimax rotation.
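The analysis can be roughly reproduced with standard tools, as in the sketch below. It uses the English-to-Czech weight vectors from Table 6 and is only an approximation of the original analysis: the scree test, the exact rotation settings, and the plotting code are not specified here, so the resulting numbers will not match Table 7 exactly.

```python
import numpy as np

feature_names = ["LM0", "PhrPen", "TM0", "TM1", "TM2", "TM3", "Glue", "WrdPen"]
# English-to-Czech weight vectors, copied from Table 6 (one row per system).
W = np.array([
    [0.0543, -0.4326, -0.0025, 0.0382, 0.2696, 0.0788, 0.8332, -0.1878],  # AFRL
    [0.0510, -0.1353,  0.0048, 0.0169, 0.1772, 0.0408, 0.3508, -0.2231],  # bleu_MERT
    [0.0380, -0.2046, -0.0004, 0.0286, 0.1338, 0.0320, 0.3936, -0.1689],  # bleu_MIRA_dense
    [0.0364, -0.1232, -0.0053, 0.0350, 0.0905, 0.0480, 0.5524, -0.1093],  # bleu_MIRA_sparse
    [0.0247, -0.1949, -0.0022, 0.0367, 0.1370, 0.0345, 0.3767, -0.1932],  # DCU
    [0.0394, -0.0935, -0.0087, 0.0331, 0.1611, 0.0673, 0.4548, -0.1421],  # METEOR_CMU
    [0.0394, -0.1619, -0.0011, 0.0218, 0.1947, 0.0211, 0.3973, -0.1628],  # Saarland mert
    [0.0377, -0.2023, -0.0007, 0.0293, 0.1304, 0.0344, 0.3936, -0.1714],  # Saarland mira
    [0.0386, -0.1799, -0.0008, 0.0250, 0.1562, 0.0262, 0.3954, -0.1670],  # USAAR-Tuna
])

W = W / np.linalg.norm(W, axis=1, keepdims=True)   # L2-normalise each weight vector
X = W - W.mean(axis=0)                             # centre the data before PCA

# PCA via SVD; keep two components, as suggested by the scree test
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pc_scores = U[:, :2] * s[:2]                       # PC1/PC2 coordinates (cf. Figure 2)

def varimax(loadings, n_iter=100, tol=1e-6):
    """Plain varimax rotation of a loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    last = 0.0
    for _ in range(n_iter):
        L = loadings @ R
        u, sv, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        R = u @ vt
        if sv.sum() < last * (1 + tol):
            break
        last = sv.sum()
    return loadings @ R

# Loadings = correlations between each original weight dimension and each component
loadings = np.array([[np.corrcoef(X[:, j], pc_scores[:, c])[0, 1] for c in (0, 1)]
                     for j in range(X.shape[1])])
rotated = varimax(loadings)                        # roughly comparable to Table 7
for name, row in zip(feature_names, np.round(rotated, 2)):
    print(name, row)
```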

Figure 2 plots the two principal components of the set of systems for English-to-Czech. We see that the first component (PC1) explains the performance almost completely, with middle values being the best. Looking at the loadings (correlations of components with the original feature function dimensions) in Table 7, we learn that PC1 primarily accounts for the first two weights of the translation model (TM 0 and TM 1, which correspond to the phrase and lexically-weighted inverse probabilities, respectively), the word penalty (WrdPen), and the language model weight (LM0). Knowing that in almost all systems the weight of the word penalty is several times bigger than the weights of TM 0, TM 1, and LM0, we conclude that tuning of the word penalty (in balance with the LM weight) was the most apparent decisive factor in the English-Czech tuning task. The second component (PC2) primarily covers the weights of the remaining features, that is, the direct translation probabilities and the phrase penalty. Unfortunately, PC2 is not very informative about the final quality of the translation.

The Czech-to-English results in Figure 3 do not seem to lend themselves to any simple conclusion.

                       PC1    PC2
LM0                   -0.69   0.44
PhrasePenalty0         0.15  -0.63
TranslationModel0 0   -0.91  -0.13
TranslationModel0 1    0.91  -0.03
TranslationModel0 2   -0.55   0.72
TranslationModel0 3    0.36   0.75
TranslationModel1      0.42   0.84
WordPenalty0           0.84   0.27

Table 7: Loadings (correlations) of each component with each feature function for English-Czech

Based on the closeness of systems in the PCA plots, we can say that for English-Czech, two of the three best systems (BLEU-MIRA-DENSE and DCU) found similar settings, while AFRL stands out. The Czech-English results show that systems with very similar weight settings can give translations of very different quality. Again, AFRL stands out, this time leading to very good outputs.

6 Conclusion

This paper presented the WMT shared task on optimizing the parameters of a given hierarchical phrase-based system (the WMT Tuning Task) when translating from English to Czech and vice versa. The underlying system was intentionally restricted to a small data setting and, somewhat unusually, the data for the language model were smaller than the data for the translation model.

Overall, six teams took part in one or both directions, sticking to the constrained setting, with only METEOR-CMU and our baseline BLEU-MIRA-SPARSE using sparse features.

The submitted configurations were manually evaluated jointly with the systems of the main WMT translation task. Given the small data setting, we did not expect the tuning task systems to perform competitively with other submissions in the WMT translation task.

The results confirm that KBMIRA with the standard (dense) features optimized towards BLEU should be preferred over MERT. Two other systems (DCU and AFRL) performed equally well in English-to-Czech translation. The two systems using sparse features (METEOR-CMU and BLEU-MIRA-SPARSE) performed poorly, but the sample is too small to draw any conclusions from this. Overall, the variance in translation quality obtained using various weight settings is apparent and justifies the effort put into optimization techniques.


Since the task attracted a good number of submissions and was generally considered interesting and useful by our colleagues, we plan to run the task again for WMT in 2016. Next year's underlying systems will use all data available in the WMT constrained setting, to test the tuning methods in the range where state-of-the-art systems operate.

Acknowledgments

We are grateful to Christian Federmann and Matt Post for all the processing of the human evaluation and to the annotators who quickly helped us in getting additional judgements. Thanks also go to Matthias Huck for a thorough check of the paper; all outstanding errors are our own. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no 645452 (QT21) and no 644402 (HimL). The work on this project was also supported by the Dutch organisation for scientific research STW, grant nr. 12271.

References

Meriem Beloucif, Chi-kiu Lo, and Dekai Wu. 2014. Improving MEANT Based Semantically Tuned SMT. In Proc. of the 11th International Workshop on Spoken Language Translation (IWSLT 2014), pages 34–41, Lake Tahoe, California.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53, Uppsala, Sweden, July. Association for Computational Linguistics. Revised August 2010.

Daniel Cer, Christopher D. Manning, and Daniel Jurafsky. 2010. The best lexical metric for phrase-based statistical MT system optimization. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 555–563, Stroudsburg, PA, USA. Association for Computational Linguistics.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 427–436, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263–270, Ann Arbor, Michigan, June.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland, July. Association for Computational Linguistics.

Grant Erdmann and Jeremy Gwinnup. 2015. Drem: The AFRL Submission to the WMT15 Tuning Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Proc. of the ACL 2008 Software Engineering, Testing, and Quality Assurance Workshop.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria, August.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proceedings of IWSLT, pages 152–159, Tokyo, Japan, December.

Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1352–1362, Stroudsburg, PA, USA. Association for Computational Linguistics.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 181–184. IEEE.

Liangyou Li, Hui Yu, and Qun Liu. 2015. MT Tuning on RED: A Dependency-Based Evaluation Metric. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.


Chang Liu, Daniel Dahlmeier, and Hwee Tou Ng. 2011. Better Evaluation Metrics Lead to Better Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 375–384, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chi-kiu Lo, Philipp Dowling, and Dekai Wu. 2015. Improving evaluation and optimization of MT systems against MEANT. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 160–167, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318.

Milos Stanojevic and Khalil Sima’an. 2015. BEER 1.1: ILLC UvA submission to metrics and tuning task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Milos Stanojevic, Amir Kamran, Philipp Koehn, and Ondrej Bojar. 2015. Results of the WMT15 Metrics Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisboa, Portugal, September. Association for Computational Linguistics.

Bing Zhao and Shengyuan Chen. 2009. A simplex armijo downhill algorithm for optimizing statistical machine translation decoding parameters. In HLT-NAACL (Short Papers), pages 21–24. The Association for Computational Linguistics.
