
Using a New Analytic Measure for the Annotation and Analysis of MT Errors on Real Data

Arle Lommel [1], Aljoscha Burchardt [1], Maja Popović [1], Kim Harris [2], Eleftherios Avramidis [1], Hans Uszkoreit [1]

[1] DFKI / Berlin, Germany
[2] text&form / Berlin, Germany
[email protected]

kim [email protected]

Abstract

This work presents the new flexible Multidimensional Quality Metrics (MQM) framework and uses it to analyze the performance of state-of-the-art machine translation systems, focusing on "nearly acceptable" translated sentences. A selection of WMT news data and "customer" data provided by language service providers (LSPs) in four language pairs was annotated using MQM issue types and examined in terms of the types of errors found in it.

Despite criticisms of WMT data by the LSPs, an examination of the resulting errors and patterns for both types of data shows that they are strikingly consistent, with more variation between language pairs and system types than between text types. These results validate the use of WMT data in an analytic approach to assessing quality and show that analytic approaches represent a useful addition to more traditional assessment methodologies such as BLEU or METEOR.

1 Introduction

For a number of years, the Machine Translation (MT) community has used "black-box" measures of translation performance like BLEU (Papineni et al., 2002) or METEOR (Denkowski and Lavie, 2011). These methods have the advantage that they can provide automatic scores for MT output in cases where reference translations exist, by calculating the similarity between the MT output and the references. However, such metrics do not provide insight into the specific nature of the problems encountered in the translation output, and the scores are tied to the particularities of the reference translations.

© 2014 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

As a result of these limitations, there has been a recent shift towards the use of more explicit error classification and analysis (see, e.g., Vilar et al. (2006)) in addition to automatic metrics. The error profiles used, however, are typically ad hoc categorizations specific to individual MT research projects, thus limiting their general usability for research or their comparability with human translation (HT) results. In this paper, we report on annotation experiments that use a new, flexible error metric and that showcase a new type of MT research involving collaboration between MT researchers, human translators, and Language Service Providers (LSPs).

When we started to prepare our annotation experiments, we teamed up with LSPs and designed a custom error metric based on the "Multidimensional Quality Metrics" (MQM) framework developed by the QTLaunchPad project (http://www.qt21.eu/launchpad). The metric was designed to facilitate annotation of MT output by human translators while containing analytic error classes we considered relevant to MT research (see Section 2, below). This paper represents the first publication of results from the use of MQM for MT quality analysis.

Previous research in this area has used error categories to describe error types. For instance, Farrús et al. (2010) divide errors into five broad classes (orthographic, morphological, lexical, semantic, and syntactic). By contrast, Flanagan (1994) uses 18 more fine-grained error categories with additional language-pair-specific features, while Stymne and Ahrenberg (2012) use ten error types of intermediate granularity (and specifically address combinations of multiple error types). All of these categorization schemes are ad hoc creations that serve a particular analytic goal. MQM, however, provides a general mechanism for describing a family of related metrics that share a common vocabulary. It is based upon a rigorous examination of major human and machine translation assessment metrics (e.g., the LISA QA Model, SAE J2450, TAUS DQF, ATA assessment, and various tool-specific metrics), which served as the basis for a descriptive framework for declaring what a particular metric addresses. While the metric described in this paper is still very much a purpose-driven metric, it is declared in this general framework, which we propose for declaring specific metrics for general quality assessment and error annotation tasks.

For data, we chose WMT data (Bojar et al., 2013) to represent state-of-the-art MT output in research. However, LSPs frequently reported to us that the mostly journalistic WMT data does not represent their business data (mostly technical documentation) or typical applications of MT in business settings. In addition, it turned out that journalistic style often contains literary flourishes, idiosyncratic or mixed styles, and deep embedding (e.g., nested quotations) that sometimes make it very difficult to judge the output.

As a result, we decided to use both WMT data and customer MT data that LSPs contributed from their daily business to see whether the two text types generate different error profiles. This paper accordingly presents and compares the results we obtained for both types of sources. For practical purposes, we decided to analyze only "near miss" translations, i.e., translations that require only a small effort to be converted into acceptable translations. We excluded "perfect" translations as well as translations that human evaluators judged to have too many errors to be fixed easily (because these would be too difficult to annotate). We therefore had human evaluators select segments representing this especially business-relevant class of translations prior to annotation.

A total of nine LSPs participated in this task, with each LSP analyzing from one to three language pairs. Participating LSPs were paid up to €1000 per language pair. The following LSPs participated: Beo, Hermes, iDisc, Linguaserve, Logrus, Lucy, Rheinschrift, text&form, and Welocalize.

2 Error classification scheme

The Multidimensional Quality Metrics (MQM) framework[1] provides a flexible system for declaring translation quality assessment methods, with a focus on analytic quality, i.e., quality assessment that focuses on identifying specific issues/errors in the translated text and categorizing them.[2] MQM defines over 80 issue/error types (the expectation is that any one assessment task will use only a fraction of these), and for this task we chose a subset of these issues, as defined below.

• Accuracy. Issues related to whether the information content of the target is equivalent to the source.
  – Terminology. Issues related to the use of domain-specific terms.
  – Mistranslation. Issues related to the improper translation of content.
  – Omission. Content present in the source is missing in the target.
  – Addition. Content not present in the source has been added to the target.
  – Untranslated. Text inappropriately appears in the source language.
• Fluency. Issues related to the linguistic properties of the target without relation to its status as a translation.
  – Grammar. Issues related to the grammatical properties of the text.
    ∗ Morphology (word form). The text uses improper word forms.
    ∗ Part of speech. The text uses the wrong part of speech.
    ∗ Agreement. Items within the text do not agree for number, person, or gender.
    ∗ Word order. Words appear in the incorrect order.
    ∗ Function words. The text uses function words (such as articles, prepositions, "helper"/auxiliary verbs, or particles) incorrectly.
  – Style. The text shows stylistic problems.
  – Spelling. The text is spelled incorrectly.
    ∗ Capitalization. Words are capitalized that should not be, or vice versa.
  – Typography. Problems related to typographical conventions.
    ∗ Punctuation. Problems related to the use of punctuation.
  – Unintelligible. Text is garbled or otherwise unintelligible. Indicates a major breakdown in fluency.

[1] http://www.qt21.eu/mqm-definition/
[2] This approach stands in contrast to "holistic" methods that look at the text in its entirety and provide a score for the text as a whole in terms of one or more dimensions, such as overall readability, usefulness, style, or accuracy. BLEU, METEOR, and similar automatic MT evaluation metrics used for research can be considered holistic metrics that evaluate texts on the dimension of similarity to reference translations, since they do not identify specific, concrete issues in the translation. In addition, most of the options in the TAUS Dynamic Quality Framework (DQF) (https://evaluation.taus.net/about) are holistic measures.

Note that these items exist in a hierarchy. Annotators were asked to choose the most specific issue possible and to use higher-level categories only when it was not possible to use one deeper in the hierarchy. For example, if an issue could be categorized as Word order it could also be categorized as Grammar, but annotators were instructed to use Word order as it was more specific. Higher-level categories were to be used for cases where more specific ones did not apply (e.g., the sentence "He slept the baby" features a "valency" error, which is not a specific type in this hierarchy, so Grammar would be chosen instead).
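To make the hierarchy and the "most specific issue" rule concrete, the following minimal Python sketch encodes the subset used here as a child-to-parent map and normalizes an annotator's choice to the deepest applicable label. This is our own illustration, not an official MQM artifact or part of the annotation tooling; the function names are invented for the example.

```python
# Child -> parent map for the MQM subset used in this task (illustrative only).
PARENT = {
    "Accuracy": None, "Terminology": "Accuracy", "Mistranslation": "Accuracy",
    "Omission": "Accuracy", "Addition": "Accuracy", "Untranslated": "Accuracy",
    "Fluency": None, "Grammar": "Fluency", "Morphology (word form)": "Grammar",
    "Part of speech": "Grammar", "Agreement": "Grammar", "Word order": "Grammar",
    "Function words": "Grammar", "Style": "Fluency", "Spelling": "Fluency",
    "Capitalization": "Spelling", "Typography": "Fluency",
    "Punctuation": "Typography", "Unintelligible": "Fluency",
}

def ancestors(issue):
    """Return the chain of ancestors of an issue, nearest first."""
    chain, parent = [], PARENT[issue]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain

def normalize(issue, applicable):
    """Given the set of issue types judged applicable, return the most
    specific one, mirroring the guideline (prefer 'Word order' over 'Grammar')."""
    deepest = issue
    for candidate in applicable:
        if issue in ancestors(candidate):
            deepest = candidate
    return deepest

# 'Word order' is preferred over its ancestor 'Grammar'.
assert normalize("Grammar", {"Grammar", "Word order"}) == "Word order"
assert ancestors("Punctuation") == ["Typography", "Fluency"]
```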

3 Corpora

The corpus contains Spanish→English, German→English, English→Spanish, and English→German translations. To prepare the corpus, for each translation direction a set of translations was evaluated by expert human evaluators (primarily professional translators) and assigned to one of three classes:

1. perfect (class 1): no apparent errors.
2. almost perfect or "near miss" (class 2): easy to correct, containing up to three errors.
3. bad (class 3): more than three errors.

Both WMT and "customer" data[3] were rated in this manner, and pseudo-random selections were taken from the class 2 sentences (selections were constrained to prevent annotation of multiple translations for the same source segment within a given data set, in order to maximize the diversity of content from the data sources), as follows (a sketch of the selection constraint is given after the list):

• Calibration set. For each language pair we selected a set of 150 "near miss" (see below) translations from WMT 2013 data (Bojar et al., 2013).
  – For English→German and English→Spanish, we selected 40 sentences from the top-ranked SMT, RbMT, and hybrid systems, plus 30 of the human-generated reference translations.
  – For German→English and Spanish→English, we selected 60 sentences from the top-ranked SMT and RbMT systems (no hybrid systems were available for those language pairs), plus 30 of the human-generated reference translations.
• Customer data. Each annotator was provided with 200 segments of "customer" data, i.e., data taken from real production systems.[4] This data was translated by a variety of systems, generally SMT (some of the German data was translated using an RbMT system).
• Additional WMT data. Each annotator was also asked to annotate 100 segments of previously unannotated WMT data. In some cases the source segments for this selection overlapped with those of the calibration set, although the specific MT outputs chosen did not (e.g., if the SMT output for a given segment appeared in the calibration set, it would not reappear in this set, although the RbMT, hybrid, or human translation might). Note that the additional WMT data provided was different for each LSP in order to maximize coverage of annotations in line with other research goals; as such, this additional data does not factor into the inter-annotator agreement calculations (discussed below).

[3] WMT data was from the top-rated statistical, rule-based, and hybrid systems for 2013; customer data was taken from a variety of in-house systems (both statistical and rule-based) used in production environments.
[4] In all but one case the data was taken from actual projects; in the one exception, the LSP was unable to obtain permission to use project data and instead took text from a project that would normally not have been translated via MT and ran it through a domain-trained system.
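As an illustration of the constraint mentioned above (no more than one translation of a given source segment within a data set), the following Python sketch shows one way such a selection could be implemented. The field names and the random seed are assumptions for the example; this is not a description of the project's actual selection scripts.

```python
import random

def select_near_miss(candidates, n, seed=0):
    """Pseudo-randomly pick up to n class-2 ('near miss') segments such that
    no source segment is selected twice within the resulting set.
    Each candidate is assumed to be a dict with keys 'source_id', 'system',
    and 'quality_class' (illustrative structure only)."""
    rng = random.Random(seed)
    pool = [c for c in candidates if c["quality_class"] == 2]
    rng.shuffle(pool)
    chosen, used_sources = [], set()
    for cand in pool:
        if cand["source_id"] in used_sources:
            continue  # keep source segments unique to maximize diversity
        chosen.append(cand)
        used_sources.add(cand["source_id"])
        if len(chosen) == n:
            break
    return chosen
```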


It should be noted that in all cases we selected only translations for which the source was originally authored in the source language. The WMT shared task used human translations of some segments as the source for MT input: for example, a sentence authored in Czech might be translated into English by humans and then used as the source for a translation task into Spanish, a practice known as "relay" or "pivot" translation. As we wished to eliminate any variables introduced by this practice, we eliminated any data translated in this fashion from our task and focused only on translations with "native" sources.

3.1 Annotation

The annotators were provided with the data described above and given access to the open-source translate5 annotation environment (http://www.translate5.net). Translate5 provides the ability to mark arbitrary spans in segments with issue types and to make other annotations. All annotators were invited to attend an online training session or to view a recording of it, and were given written annotation guidelines. They were also encouraged to submit questions concerning the annotation task.

The number of annotators varied by segment and language pair, depending on whether the segments were included in the calibration sets or not:

• German→English: Calibration: 3; Customer + additional WMT: 1
• English→German: Calibration: 5; Customer + additional WMT: 1–3
• Spanish→English: Calibration: 4; Customer + additional WMT: 2–4
• English→Spanish: Calibration: 4; Customer + additional WMT: 1–3

After annotation was complete, some post-processing steps simplified the markup and extracted the issue types found by the annotators to permit comparison.
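The exact export format of translate5 is not described here, so the following sketch only illustrates the kind of post-processing step meant: collapsing span-level markup into per-segment issue counts so that annotators can be compared. The input record structure is an assumption for the example, not the real export format.

```python
from collections import Counter

def issue_profile(annotations):
    """Collapse span-level annotations into per-segment issue counts.
    'annotations' is assumed to be an iterable of records like
    {'segment_id': 17, 'start': 4, 'end': 9, 'issue': 'Word order'}."""
    profile = {}
    for record in annotations:
        seg = profile.setdefault(record["segment_id"], Counter())
        seg[record["issue"]] += 1
    return profile

# Example: two spans in segment 1, one in segment 2.
demo = [
    {"segment_id": 1, "start": 0, "end": 3, "issue": "Mistranslation"},
    {"segment_id": 1, "start": 7, "end": 8, "issue": "Function words"},
    {"segment_id": 2, "start": 2, "end": 5, "issue": "Word order"},
]
print(issue_profile(demo))
```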

3.2 Notes on the data

The annotators commented on a number of aspects of the data presented to them. In particular, they noted some issues with the WMT data. WMT data is widely used in MT evaluation tasks, and so enjoys some status as the universal data set for tasks such as the one described in this paper. The available translations represent the latest state-of-the-art systems available in the industry and are well established in the MT research community.

However, feedback from our evaluators indicated that WMT data has some drawbacks that must be considered when using it. Specifically, the text type (news data) is rather different from the sorts of technical text typically translated in production MT environments. News does not represent a coherent domain (it is, instead, a genre), but rather has more in common with general language. In addition, an examination of the human-generated reference segments revealed that the human translations often exhibited a good deal of "artistry" in their response to difficult passages, opting for fairly "loose" translations that preserved the broad sense but not the precise details.

The customer data used in this task does not all come from a single domain. Much of the data came from the automotive and IT (software UI) domains, but tourism and financial data were also included. Because we relied on the systems available to LSPs (and provided data in a few cases where they were not able to gain permission to use customer data), we were not able to compare different types of systems in the customer data and have instead grouped all results together.

An additional factor is that the sentences in the calibration sets were much longer (an average of 19.4 words, with a mode of 14, a median of 17, and a range of 3 to 77 words) than in the customer data (an average of 14.1 words, with a mode of 11, a median of 13, and a range of 1 to 50 words). We believe that this difference in length may account for some of the differences between the calibration and customer sets described below.
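For reference, statistics of this kind can be reproduced from tokenized segments with a few lines of Python; the whitespace tokenization is a simplifying assumption and will not exactly match the tokenization used for the reported figures.

```python
import statistics

def length_stats(segments):
    """Compute the word-count statistics reported above (mean, mode, median,
    range) from a list of segment strings, using whitespace tokenization."""
    lengths = [len(seg.split()) for seg in segments]
    return {
        "mean": round(statistics.mean(lengths), 1),
        "mode": statistics.mode(lengths),
        "median": statistics.median(lengths),
        "range": (min(lengths), max(lengths)),
    }
```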

4 Error analysis

In examining the aggregate results for all language pairs and translation methods, we found that four of the 21 error types constitute the majority (59%) of all issues found:

• Mistranslation: 21%
• Function words: 15%
• Word order: 12%
• Terminology: 11%

None of the remaining issues comprise more than 10% of annotations, and some were found so infrequently as to offer little insight. We also found that some of the hierarchical distinctions were of little benefit, which led us to revise the list of issues for future research (see Section 4.2 for more details).

4.1 Inter-Annotator Agreement

Because we had multiple annotators for most of the data, we were able to assess inter-annotator agreement (IAA) for the MQM annotation of the calibration sets. IAA was calculated using Cohen's kappa coefficient at the word level (i.e., checking whether the annotators agreed for each word). The results lie between 0.2 and 0.4 (considered "fair"), with averages of pairwise comparisons of 0.29 (de-en), 0.25 (es-en), 0.32 (en-de), and 0.34 (en-es), and an overall average of 0.30.
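A minimal sketch of the word-level agreement computation follows, assuming each annotator's output has been reduced to one issue label (or "none") per word; it mirrors the description above but is not the exact evaluation script used in the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Word-level Cohen's kappa for two annotators. labels_a and labels_b are
    equal-length sequences with one issue label per word ('none' where no
    issue was marked)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example with two annotators over six words.
a = ["none", "Word order", "none", "Mistranslation", "none", "none"]
b = ["none", "Word order", "none", "none", "none", "Spelling"]
print(round(cohens_kappa(a, b), 2))
```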

4.2 Modifications

This section addresses some of the lessons learned from an examination of the MQM annotations described in Section 4.1, with a special emphasis on ways to improve inter-annotator agreement (IAA). Although IAA does not appear to be a barrier to the present analytic task, we found a number of areas where the annotation could be improved and superfluous distinctions eliminated. For example, "plain" Typography appeared so few times that it offered no value separate from its daughter category Punctuation. Other categories appeared to be easily confused, despite the instructions given to the annotators. In particular, the distinction between "Terminology" and "Mistranslation" seemed to be driven largely by the length of the annotated issue: the average length of spans tagged as "Mistranslation" was 2.13 words (with a standard deviation of 2.43), versus 1.42 words (with a standard deviation of 0.82) for "Terminology". Although we had expected the two categories to exhibit a difference in the lengths of the spans to which they were applied, a close examination showed that the distinctions were not systematic with respect to whether actual terms were marked or not, indicating that the two categories were likely not clear or relevant to the annotators. In addition, "Terminology" as a category is problematic with respect to the general-domain texts in the WMT data sets, since no terminology resources are provided.

Based on these issues, we have undertaken the following actions to improve the consistency of future annotations and to simplify analysis of the present data.

• The distinction between Mistranslation and Terminology was eliminated. (For calculation purposes Terminology became a daughter of Mistranslation.)
• The Style/Register category was eliminated, since stylistic and register expectations were unclear, and was simply counted as general Fluency for calculation purposes.
• The Morphology (word form) category was renamed Word form, and Part of speech, Agreement, and Tense/mood/aspect were moved to become its children.
• Punctuation was removed, leaving only Typography, and all issues contained in either category were counted as Typography.
• Capitalization, which was infrequently encountered, was merged into its parent, Spelling.

In addition, to address a systematic problem with the Function words category, we added custom children to this category: Extraneous (for function words that should not appear), Missing (for function words that are missing from the translation), and Incorrect (for cases in which the incorrect function word is used). These were added to provide better insight into the specific problems and to address a tendency for annotators to categorize problems with function words as Accuracy issues when the function words were either missing or added. The revised issue type hierarchy is shown in Figure 1, and a sketch of the corresponding relabeling used for calculation purposes is given below.
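The refactoring described above amounts to a remapping of the original labels onto the revised hierarchy. The sketch below shows one plausible encoding of that mapping as stated in the bullets; it is our illustration, not the project's actual analysis code.

```python
# Remap original issue labels onto the revised hierarchy described above.
# The dictionary encodes only the changes; unlisted labels are kept as-is.
REMAP = {
    "Terminology": "Mistranslation",        # folded into Mistranslation
    "Style": "Fluency",                      # counted as general Fluency
    "Morphology (word form)": "Word form",   # renamed
    "Punctuation": "Typography",             # merged into Typography
    "Capitalization": "Spelling",            # merged into its parent Spelling
}

def remap(issue):
    return REMAP.get(issue, issue)

# An original 'Terminology' annotation is now counted as 'Mistranslation';
# 'Word order' is unchanged.
assert remap("Terminology") == "Mistranslation"
assert remap("Word order") == "Word order"
```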

[Figure 3: Average sentence-level error rates [%] for all language pairs, by issue type and translation method (SMT, RbMT, Hybrid, HT).]

This revised hierarchy will be used for ongoing annotation in our research tasks. We also realized that the guidelines given to annotators did not provide sufficient decision-making tools to help them select the intended issues.


Translation Quality
  Accuracy
    Mistranslation
    Omission
    Addition
    Untranslated
  Fluency
    Spelling
    Typography
    Grammar
      Word form
        Part of speech
        Agreement
        Tense/aspect/mood
      Word order
      Function words
        Extraneous
        Missing
        Incorrect
    Unintelligible

Figure 1: Revised issue-type hierarchy.

To address this problem, we created a decision tree to guide their annotations. We did not recalculate IAA on the present data set with the change in categories, since we have also changed the guidelines and both changes will together impact IAA. We are currently running additional annotation tasks using the updated error types that will result in new scores.

Refactoring the existing annotations according to the above description gives the results for each translation direction and translation method in the calibration sets, as presented in Figure 2 (with averages across all language pairs as presented in Figure 3). Figure 4 presents the same results for each language pair in the customer data. As previously mentioned, we were not able to break out results for the customer data by system type.

4.3 Differences between MT methods

Despite considerable variation between language pairs, an examination of the annotations revealed a number of differences in the output of different system types. While many of the differences are not unexpected, the detailed analytic approach taken in this experiment has enabled us to provide greater insight into the precise differences rather than relying on isolated examples. The overall results for all language pairs are presented in Figure 3 (which includes the results for the human-translated segments as a point of comparison).

The main observations for each translation method include:

• statistical machine translation
  – Performs the best in terms of Mistranslation.
  – Most likely to drop content (Omission); otherwise it would be the most accurate translation method considered.
  – Had the lowest number of Function words errors, indicating that SMT handles this aspect substantially better than the alternative systems.
  – Weak in Grammar, largely due to significant problems in Word order.
• rule-based machine translation
  – Generated the worst results for Mistranslation.
  – Was least likely to omit content (Omission).
  – Was weak for Function words; statistical enhancements (moving in the direction of hybrid systems) would offer considerable potential for improvement.
• hybrid machine translation (available only for English→Spanish and English→German)
  – Tends to perform in between SMT and RbMT in most respects.
  – Most likely method to produce mistranslated texts (Mistranslation).

When compared to the results of human translation assessment, it is apparent that all of the near-miss machine translations are somewhat more accurate than near-miss human translations and significantly less grammatical. Humans are far more likely to make typographic errors, but otherwise are much more fluent. Note as well that humans are more likely than MT systems to add information to translations, perhaps in an effort to render texts more accessible. Thus, despite substantial differences, all of the MT systems are overall more similar to each other than they are to human translation. However, when one considers that a far greater proportion of human translation sentences were in the "perfect" category and a far lower proportion in the "bad" category, and that these comparisons focus only on the "near miss" sentences, it is apparent that outside of the context of this comparison, human translation still maintains a much higher level of Accuracy and Fluency.


[Figure 2: Sentence-level error rates [%] for each translation direction and each translation method for WMT data (panels: Spanish→English, German→English, English→Spanish, English→German; methods: SMT, RbMT, Hybrid where available, and HT).]

In addition, a number of the annotators commented on the poor level of translation evident in the WMT human translations. Despite being professional translations, there were numerous instances of basic mistakes and interpretive translations that would generally be considered poor references for MT evaluation (since MT cannot make interpretive translations). However, at least in part, these problems may be attributed to the uncontrolled nature of the source texts, which tended to be more literary than is typical for industry uses of MT. In many cases the WMT source sentences presented translation difficulties for the human translators, and the meaning of the source texts was not always clear out of context. As a result, the WMT texts pose difficulties for both human and machine translators.

[Figure 4: Sentence-level error rates [%] for each translation direction (DE>EN, ES>EN, EN>DE, EN>ES) for customer data.]

4.4 Comparison of WMT and customer data

By contrast, the customer data was more likely to consist of fragments (such as "Drive vibrates" or section headings) or split segments (i.e., one logical sentence split by a carriage return, resulting in two fragments) that caused confusion for the MT systems. It also, in principle, should have had advantages over the WMT data because it was translated with domain-trained systems.

Despite these differences, however, the average profiles for all calibration data and all customer data across language pairs look startlingly similar, as seen in Figure 5. There is thus significantly more variation between language pairs and between system types than there is between the WMT data and the customer data in terms of error profiles. (Note, however, that this comparison addresses only the "near-miss" translations and cannot address profiles outside of this category; it also does not address the overall relative distribution into the different quality bands for the two text types.)

[Figure 5: Sentence-level error rates [%] for calibration vs. customer data (average of all systems and language pairs).]

5 Conclusions and outlook

The experiment described here shows that analytic quality analysis can provide a valuable adjunct to automatic methods like BLEU and METEOR. While more labor-intensive to conduct, analytic approaches provide insight into the causes of errors and suggest possible solutions. Our research treats the human annotation as the first phase in a two-step approach. In the first step, described in this paper, we use MQM-based human annotation to provide a detailed description of the symptoms of MT failure. This annotation also enables us to detect the system type- and language-specific distribution of errors and to understand their relative importance.

In the second step, which is ongoing, linguists and MT experts will use the annotations from the first step to gain insight into the causes of MT failures on the source side or into MT system limitations. For example, our preliminary research into English source-language phenomena indicates that -ing verbal forms, certain types of embedding in English (such as relative clauses or quotations), and non-genitive uses of the preposition "of" are particularly contributory to MT failures. Further research into MQM human annotation will undoubtedly reveal additional source factors that can guide MT development or suggest solutions to systematic linguistic problems. Although many of these issues are known to be difficult, it is only with the identification of concrete examples that they can be addressed.

In this paper we have shown that the symptoms of MT failure are the same between WMT and customer data, but it remains an open question whether the causes will prove to be the same. We therefore advocate a continuing engagement with language service providers and translators using these different types of data. These approaches will help further the acceptance of MT in commercial settings by allowing MT output to be compared to HT output, and will also help research to go forward in a more principled and requirements-driven fashion.

Acknowledgments

This work has been supported by the QTLaunchPad project (EU FP7 CSA No. 296347).

References

Bojar, O., Buck, C., Callison-Burch, C., Federmann, C., Haddow, B., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2013). Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44.

Denkowski, M. and Lavie, A. (2011). Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation.

Farrús, M., Costa-jussà, M. R., Mariño, J. B., and Fonollosa, J. A. (2010). Linguistic-based evaluation criteria to identify statistical machine translation errors. In Proceedings of EAMT 2010.

Flanagan, M. (1994). Error classification for MT evaluation. In Proceedings of AMTA, pages 65–72.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318.

Stymne, S. and Ahrenberg, L. (2012). On the practice of error analysis of machine translation evaluation. In Proceedings of LREC 2012, pages 1785–1790.

Vilar, D., Xu, J., D'Haro, L. F., and Ney, H. (2006). Error analysis of statistical machine translation output. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 697–702.
