Predicting Human Assessment of Machine Translation Quality ... ?· Predicting Human Assessment of Machine…

Download Predicting Human Assessment of Machine Translation Quality ... ?· Predicting Human Assessment of Machine…

Post on 09-Jul-2018




0 download

Embed Size (px)


<ul><li><p>International Journal of Computer Applications (0975 - 8887)Volume 59 - No. 10, December 2012</p><p>Predicting Human Assessment of Machine TranslationQuality by Combining Automatic Evaluation Metrics</p><p>using Binary ClassifiersMichael Paul and Andrew Finch and Eiichiro Sumita</p><p>MASTAR ProjectNational Institute of Information and Communcations Technology</p><p>3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 6190289</p><p>ABSTRACTThis paper presents a method to predict human assessments of ma-chine translation (MT) quality based on a combination of binaryclassifiers using a coding matrix. The multiclass categorizationproblem is reduced to a set of binary problems that are solved us-ing standard classification learning algorithms trained on the resultsof multiple automatic evaluation metrics. Experimental results us-ing a large-scale human-annotated evaluation corpus show that thedecomposition into binary classifiers achieves higher classificationaccuracies than the multiclass categorization problem. In addition,the proposed method achieves a higher correlation with humanjudgments on the sentence level compared to standard automaticevaluation measures.</p><p>General Terms:Machine Translation, Machine Learning, Machine Translation Evaluation</p><p>Keywords:Evaluation Metric Combination, Human Assessment Prediction-ifx</p><p>1. INTRODUCTIONThe evaluation of MT quality is a difficult task because theremay exist many possible ways to translate a given source sen-tence. Moreover, the usability of a given translation depends onnumerous factors such as the intended use of the translation, thecharacteristics of the MT software, and the nature of the transla-tion process. Early attempts tried to manually produce numericaljudgments of MT quality with respect to a set of reference trans-lations [25]. Recent human assessment of MT quality has beencarried out by either assigning a single grade on a scale of 5 or 7specifying the fluency or adequacy of a given translation [19] orrelatively ranking multiple translations of the same input to eachother [7].Although the human evaluation of MT output provides the mostdirect and reliable assessment, it is time consuming, costly andsubjective, i.e., evaluation results might vary from person to per-son for the same translation output due to different backgrounds,bilingual experience, and inconsistent judgments caused by thehigh complexity of the multiclass grading task.These drawbacks of human assessment schemes have encour-aged many researchers to seek reliable methods for estimatingsuch measures automatically. Various automatic evaluation mea-sures have been proposed to make the evaluation of MT out-puts cheaper and faster. However, automatic metrics have not yetproved able to consistently predict the usefulness of MT tech-nologies. Each automatic metric focuses on different aspects ofthe translation output and its correlation with human judges de-pends largely on the type of human assessment.</p><p>Moreover, recent evaluation campaigns on newswire [19, 18] andtravel data [17] investigated how well these evaluation metricscorrelate with human judgments. The results showed that highcorrelations to human judges were obtained for some metricswhen ranking MT system outputs on the document level. How-ever, none of the automatic metrics turned out to be satisfactoryin predicting the translation quality of a single translation.In order to overcome the above shortcomings of automaticevaluation metrics and combine their strengths, this paperinvestigates the usage of multiple evaluation metrics to predicthuman assessments of machine translation (MT) quality. Theproposed method applies multiple automatic evaluation metrics,such as BLEU and METEOR, to a given translation task anduses the obtained sentence-level metric scores as features of thestandard classification learning algorithms in order to predict thesubjective grades assigned by humans for a given translation.The proposed method reduces the complexity of such multiclasscategorization problems by (1) dividing the multiclass probleminto binary classification tasks, (2) learning discriminativemodels based on features extracted from multiple automaticevaluation metric results, (3) producing binary indicators oftranslation quality on the sentence level, and (4) solving themulticlass classification problem by combining the results of thebinary classifiers using a coding matrix. The main advantages ofthe proposed method are:</p><p>the reduction of classification ambiguity due to the decompo-sition of a multiclass classification task into a set of binaryclassification problems;</p><p>the combination of multiple automatic evaluation metricsto predict human judgments taking into account differentaspects of translation quality.</p><p>The framework for reducing multiclass to binary classificationand the combination of the binary results to solve the multiclassclassification problem are described in Section 2. The humanand automatic evaluation metrics investigated in this paper aredescribed in Section 3. Section 4 gives a brief overview of re-lated research on predicting human assessments and outlines themain differences of the proposed method. The proposed methodis described in Section 5 and its effectiveness is evaluated in Sec-tion 6 for English translations of Chinese and Japanese sourcesentences in the travel domain.</p><p>2. MULTICLASS TO BINARY CLASSIFIERREDUCTION</p><p>Multiclass learning problems try to find an approximate defini-tion of an unknown function whose range is a discrete set of val-ues. Previous research on margin classifiers, investigated the fea-sibility of reducing multiclass categorization problems to multi-</p><p>1</p></li><li><p>International Journal of Computer Applications (0975 - 8887)Volume 59 - No. 10, December 2012</p><p>ple binary problems that are solved using binary learning algo-rithms. The combination of the classifiers generated on the bi-nary problems using error-correcting output codes is then used tosolve the multiclass task [8]. The theoretical properties of com-bining binary classifiers to solve multiclass categorization prob-lems are investigated in [Allwein et al. 2000] [3].There are many ways in which a multiclass problem can be de-composed into a number of binary classification problems. Themost well-known approaches are the one-against-all and all-pairs. In the one-against-all approach, a classifier for each ofthe classes is trained where all training examples that belong tothat class are used as positive examples and all others as neg-ative examples. In the all-pairs approach, classifiers are trainedfor each pair of classes whereby all training examples that do notbelong to any of the classes in question are ignored [11]. Suchdecompositions of the multiclass problem can be represented bya coding matrixM, where each class c of the multiclass problemis associated with a row of binary classifiers b. If k is the numberof classes and l is the number of binary classification problems,the coding matrix is defined as:</p><p>M = ( mi,j ) i=1,...,k;j=1,...,lmi,j {1, 0, +1},</p><p>If the training examples that belong to class c are considered aspositive examples for a binary classifier b, then mc,b=+1. Sim-ilarly, if mc,b=1 the training examples of class c are used asnegative examples for the training of b. mc,b=0 indicates that therespective training examples are not used for the training of clas-sifier b [8, 3]. Examples of coding matrices for one-against-alland all-pairs (k=3, l=3) are given in Table 1.</p><p>Table 1. Coding Matrix Examplesone-against-all all-pairs</p><p>c1 c23 c2 c13 c3 c12 c1 c2 c1 c3 c2 c3c1 +1 1 1 c1 +1 +1 0c2 1 +1 1 c2 1 0 +1c3 1 1 +1 c3 0 1 1</p><p>In this paper, the above coding matrix approach is applied topredict the outcomes of subjective evaluation metrics introducedin Section 3.</p><p>3. ASSESSMENT OF TRANSLATION QUALITYVarious approaches on how to assess the quality of a translationhave been proposed. In this paper, human assessments of transla-tion quality with respect to the fluency, the adequacy and the ac-ceptability of the translation are investigated. Fluency indicateshow natural the evaluation segment sounds to a native speaker ofEnglish. For adequacy, the evaluator is presented with the sourcelanguage input as well as a gold standard translation and hasto judge how much of the information from the original trans-lation is expressed in the translation [26]. Acceptability judgeshow easy to understand the translation is [23]. The fluency, ad-equacy and acceptability judgments consist of one of the gradeslisted in Table 2.The high cost of such human evaluation metrics has triggered ahuge interest in the development of automatic evaluation metricsfor machine translation. Table 3 introduces some metrics that arewidely used in the MT research community.</p><p>4. PREDICTION OF HUMAN ASSESSMENTSMost of the previously proposed approaches to predict human as-sessments of translation quality utilize supervised learning meth-ods such as decision trees (DT), support vector machines (SVM),or perceptrons to learn discriminative models that are able tocome closer to human quality judgments. Such classifiers can</p><p>Table 2. Human Assessmentfluency adequacy</p><p>5 Flawless English 5 All Information4 Good English 4 Most Information3 Non-native English 3 Much Information2 Disfluent English 2 Little Information1 Incomprehensible 1 None</p><p>acceptability5 Perfect Translation4 Good Translation3 Fair Translation2 Acceptable Translation1 Nonsense</p><p>Table 3. Automatic Evaluation MetricsBLEU: the geometric mean of n-gram precision of the system</p><p>output with respect to reference translations. Scoresrange between 0 (worst) and 1 (best) [16]</p><p>NIST: a variant of BLEU using the arithmetic mean ofweighted n-gram precision values. Scores are positivewith 0 being the worst possible [9]</p><p>METEOR: calculates unigram overlaps between a translation andreference texts. For the experiments reported in thispaper, only exact matches were taken into account.Scores range between 0 (worst) and 1 (best) [4]</p><p>GTM: measures the similarity between texts by using aunigram-based F-measure. Scores range between 0(worst) and 1 (best) [24]</p><p>WER: Word Error Rate: the minimal edit distance betweenthe system output and the closest reference transla-tion divided by the number of words in the reference.Scores are positive with 0 being the best possible [14]</p><p>PER: Position independent WER: a variant of WER thatdisregards word ordering [15]</p><p>TER: Translation Edit Rate: a variant of WER that allowsphrasal shifts [22]</p><p>be trained on a set of features extracted from human-evaluatedMT system outputs.The work described in [Quirk 1994] [20] uses statistical mea-sures to estimate confidence on the word/phrase level and gath-ers system-specific features about the translation process itself totrain binary classifiers. Empirical thresholds on automatic eval-uation scores are utilized to distinguish between good and badtranslations. [Quirk 1994] also investigates the feasibility of var-ious learning approaches for the multiclass classification prob-lem for a very small data set in the domain of technical docu-mentation. [Akiba et al. 2001] [2] utilized DT classifiers trainedon multiple edit-distance features where combinations of lexical(stem, word, part-of-speech) and semantic (thesaurus-based se-mantic class) matches were used to compare MT system outputswith reference translations and to approximate human scores ofacceptability directly. [Kulesza and Shieber 2004] [13] traineda binary SVM classifier based on automatic scoring features inorder to distinguish between human-produced and machine-generated translations of newswire data instead of predictinghuman judgments directly.The proposed approach also utilizes a supervised learningmethod to predict human assessments of translation quality, butdiffers in the following two aspects:</p><p>(1) Reduction of Classification Ambiguity:The decomposition of a multiclass classification task into aset of binary classification problems reduces the complexityof the learning task, resulting in higher classification accuracy.</p><p>(2) Feature Set:Classifiers are trained on the results of multiple automatic</p><p>2</p></li><li><p>International Journal of Computer Applications (0975 - 8887)Volume 59 - No. 10, December 2012</p><p>evaluation metrics, thus taking into account different aspectsof translation quality addressed by each of the metrics. Themethod does not depend on a specific MT system nor on thetarget language.</p><p>5. HUMAN ASSESSMENT PREDICTION BYBINARY CLASSIFIER COMBINATION</p><p>The proposed prediction method is divided into three phases:(1) a learning phase in which binary classifiers are trainedon the feature set that is extracted from a database of humanand machine-evaluated MT system outputs, (2) a decompositionphase in which the optimal set of binary classifiers that maxi-mizes the classification accuracy of the recombination step on adevelopment set is selected, (3) an application phase in whichthe binary classifiers are applied to unseen sentences, and the re-sults of the binary classifiers are combined using the optimizedcoding matrix to predict a human score. A flow chart summariz-ing the major processing steps of each phase is given in Figure 1.</p><p>!"</p><p>#</p><p>$</p><p>ApplicationLearning/Decomposition</p><p>Fig. 1. Flow Chart of the Proposed Prediction Method</p><p>5.1 Learning PhaseDiscriminative models for the multiclass and binary classifica-tion problem are obtained by using standard learning algorithms.The proposed method is not limited to a specific classificationlearning method. For the experiments described in Section 6, astandard implementation of decision trees [21] was utilized.The feature set consists of the scores of the seven automatic eval-uation metrics listed in Table 3. All automatic evaluation metricswere applied to the input data sets consisting of English MT out-puts whose translation quality was manually assessed by humansusing the metrics introduced in Section 3. In addition to the met-ric scores, metric-internal features, like ngram-precision scores,and length ratios between references and MT outputs, were alsoutilized, resulting in a total of 54 training features.</p><p>5.2 Decomposition PhaseFor the experiments described in Section 6, both the one-against-all and the all-pairs binary classifiers described in Section 2 wereutilized . In addition, boundary classifiers were trained on thewhole training set. In this case, all training examples annotatedwith a class better than the class in question were used as positiveexamples, and all other training examples as negative examples.Table 4 lists the 17 binary classification problems that were uti-lized to decompose the human assessment problems introducedin Section 3.</p><p>Table 4. Decomposition of HumanAssessment of Translation Quality</p><p>type binary classifier</p><p>one-against-all 5, 4, 3, 2, 1all-pairs 5 4, 5 3, 5 2, 5 1,</p><p>4 3, 4 2, 4 1,3 2, 3 1,2 1</p><p>boundary 54 321, 543 21</p><p>In order to identify the optimal coding matrix for the respec-tive tasks, the binary classifiers were first ordered according totheir classification accuracy on the development set. In the sec-ond s...</p></li></ul>