

Machine Translation 16: 127–149, 2001. © 2002 Kluwer Academic Publishers. Printed in the Netherlands.


A Methodology for a Semi-Automatic Evaluation of the Lexicons of Machine Translation Systems

AHMED GUESSOUM
Department of Computer Science, University of Sharjah, P.O. Box 27272, Sharjah, UAE
E-mail: [email protected]

RACHED ZANTOUT
Department of Computer Science, Faculty of Sciences, University of Balamand, P.O. Box 100, Tripoli, Lebanon
E-mail: [email protected]

Abstract. The lexicon is a major part of any Machine Translation (MT) system. If the lexicon of an MT system is not adequate, this will affect the quality of the whole system. Building a comprehensive lexicon, i.e., one with a high lexical coverage, is a major activity in the process of developing a good MT system. As such, the evaluation of the lexicon of an MT system is clearly a pivotal issue for the process of evaluating MT systems. In this paper, we introduce a new methodology that was devised to enable developers and users of MT systems to evaluate their lexicons semi-automatically. This new methodology is based on the idea of the importance of a specific word or, more precisely, word sense, to a given application domain. This importance, or weight, determines how the presence of such a word in, or its absence from, the lexicon affects the MT system’s lexical quality, which in turn will naturally affect the overall output quality. The method, which adopts a black-box approach to evaluation, was implemented and applied to evaluating the lexicons of three commercial English–Arabic MT systems. A specific domain was chosen in which the various word-sense weights were determined by feeding sample texts from the domain into a system developed specifically for that purpose. Once this database of word senses and weights was built, test suites were presented to each of the MT systems under evaluation and their output rated by a human operator as either correct or incorrect. Based on this rating, an overall automated evaluation of the lexicons of the systems was deduced.

Key words: Arabic, coverage, evaluation, lexicons, translation quality.

1. Introduction

Software evaluation is proving ever more crucial with the frenzied growth of software development. No matter what the domain is, software evaluation can target the quality of the software under assessment, its cost, and/or performance. As a matter of fact, one can immediately see that testing and maintenance, two major parts of a software life cycle, are directly relevant to software evaluation, performed by the software developers during the development process. The additional aspects of software evaluation are performed after the software has been developed and may


be carried out by potential users and/or researchers. Any software can be evaluated for its intrinsic features, or for the purpose of comparing its features to those of similar software so as to select the most suitable in terms of the features targeted by a given institution or individual.

Natural Language Processing (NLP) systems also need to go through appropriate evaluation procedures. On the one hand, they need to be assessed by researchers and system developers for the technological improvements and novel research ideas put forward as solutions to the various problems faced in the area. On the other hand, they also need to be assessed by potential users and customers, whose number is continuously increasing. This is because NLP systems are proving more and more useful, and their quality nowadays definitely makes them worthwhile in the computer industry.

In terms of quality assessment, the evaluation of NLP systems has been divided into two main approaches: “glass-box” and “black-box” evaluations (Hutchins and Somers, 1992; Arnold et al., 1993; Nyberg et al., 1994). In the latter, the evaluator has access only to the input and output of the system under evaluation. In the former, the evaluator also has access to the various workings of the system and can thus assess each sub-part of the system independently and in association with the others. In addition to quality assessment, an NLP system, just like any other software, needs to be evaluated for its cost, performance, stability, and portability, among other criteria.

MT systems inherit from their super-class, NLP systems, most of the features in terms of their evaluation. They are most of the time assessed using a glass-box or black-box evaluation, though some researchers have also pointed out the need for component-based evaluation and detailed error analyses (Arnold et al., 1993; Nyberg et al., 1994). Since MT systems combine lexical analyzers, morphological analyzers, parsers, semantic disambiguation modules, generators, pragmatic modules, etc., it is important to be able to evaluate these various components individually as well as to evaluate the overall system.

Evaluation is clearly a complex task since it needs to take care of all the aspects mentioned above. It is even more complex in the case of MT systems. Indeed, the evaluation of MT systems introduces a number of additional complications, chief of which, in our opinion, is its currently subjective nature. Such subjectivity comes from the fact that all current evaluation methods known to the authors rely heavily on humans to produce grades that are used to evaluate the system under consideration. The assessment of the quality of some system (or component) output may depend on the evaluator’s background, skills, or even taste. For instance, given that MT systems are bilingual, or even multilingual, a need arises for good (human) evaluators/translators, that is, translators who have a good grasp of the source as well as the target languages. Such proficiency is found in varying degrees in different evaluators. This means that the evaluation process will be affected by the proficiency (or lack thereof) of the human evaluator in the various facets of the languages involved in the translation. Besides, the difference


between a score of 3 or 4 on a scale of 5 for some feature of the system under evaluation may not be that obvious to various evaluators, especially given the multilingual nature of the systems at stake. Even when strict and very clear rules are introduced at the beginning of the evaluation process, the exact number given to a system component under evaluation will vary from one human evaluator to another. Also, various MT systems may use different approaches to translation and may be used in application domains or settings in which their performance may be judged differently. Moreover, given that the evaluators often do not have access to the internal workings of the system under evaluation, and are therefore forced into black-box evaluation, their interpretations of their evaluations do not necessarily yield appropriate diagnoses of the observed errors.

The previous arguments explain why the evaluation of an MT system’s quality is far from being a simple task. In addition, one should note that MT systems most of the time give raw output whose quality necessitates post-processing by a human translator. Thus the evaluation of the performance of an MT system will also depend on the acquaintance of the human translator with the system. Besides, the overall cost of using a given MT system should also take into account the cost introduced by the use of a human post-processor (and, sometimes, even pre-processor), a cost which may vary, as explained in the previous point.

All the arguments above point to the need for the development of evaluation methodologies that minimize the amount of subjectivity as much as possible. We believe in fact that the best way to achieve this goal is to automate, as much as possible, any evaluation task or part thereof. In this paper we concentrate our effort on the evaluation of the language coverage of an MT system’s lexicon.

The lexicon is an important component of any MT system. It provides the system with the lexical data, that is, the specific information about each lexical item, word or idiomatic phrase, in the vocabulary of the source and target languages. Ideally, the lexicon should be correct and complete. “Correctness” means that a word gets translated correctly, while “completeness” means that the lexicon covers the entire source language. Of course, this last criterion is a theoretical one. What is more important in reality is to assess the percentage of the source language that a given MT system lexicon covers.

The aim of this paper is to introduce a new methodology for semi-automatically evaluating the language coverage of MT system lexicons. In Section 2, we introduce lexicons and their importance in MT systems and present some of the previous work done on their evaluation. In Section 3, we introduce our methodology for evaluating MT system lexical coverage. We use this methodology to evaluate three English–Arabic MT system lexicons. The results of this evaluation are presented and discussed in Section 4. We conclude and give suggestions for future work in Section 5.


2. Lexicons and MT Evaluation

A lexicon is a crucial part of an MT (or NLP) system. A prerequisite for a good MT (or NLP) system is to build a lexicon that adequately covers the language pair(s) between which the MT system will translate (or the languages the NLP system will process). There is an abundance of literature about the specifications of lexicons for NLP systems. Generally, a lexicon is regarded as the counterpart of the dictionary used by a human translator to map a word in the source language to its equivalent in the target language. However, for computers, it is necessary to include in a lexicon more information than that which is available to the human user of a dictionary. This is to aid the MT system with background (or world) knowledge, that is, the information that a human user of a dictionary would already know by experience.

The information contained in lexicons should be more explicit than that found in dictionaries. It is usually used for the purposes of syntactic and semantic processing. According to Hutchins and Somers (1992), part of the information that should be available with each word in a lexicon is the following:

• grammatical category: this will help the system analyze the source structure better and generate better target structures. Examples of grammatical categories are verbs, pronouns, and nouns;

• case frames: these will help the system understand the context in which the source-text word is used and know how to generate the correct word for that context in the target language;

• semantic features: these indicate not only the potential range of extralinguistic objects to which they may refer but also the appropriate semantic conjunction of words in sentences, such as boat and sea, play and ball;

• selection restrictions: these will help the computer decide which of the available forms of a word is best used in the specific context.

Being of prime importance to NLP and MT systems, lexicons have been studied quite thoroughly. The reader is referred to Dorr and Klavans (1994/5) for more technical details relevant to building lexicons.

Evaluation of an NLP/MT system should also target the assessment of the computational limitations of this system as well as its cost and benefits for the potential purchasers and users. A number of contributions exist, such as Dyson and Hannah (1987) and Lehrberger and Bourbeau (1987) on evaluation by users, and Nagao (1985), Melby (1988) and King and Falkedal (1990) on methodologies for MT evaluation. Vasconcellos (1988) is a collection of papers which includes a number of discussions of methods for MT system evaluation. Notable evaluations of MT systems are those of Systran (van Slype, 1979; Wilks, 1992), and of Logos (Sinaiko and Klare, 1972, 1973).

In van Slype (1979) a detailed study of the methods that had been developed for evaluating MT is presented. The report subdivides the evaluation features into macro-evaluation and micro-evaluation, each of which deals with various aspects of the evaluation process. These include cognitive aspects such as intelligibility,


fidelity; linguistic aspects such as lexical evaluation, semantic evaluation; and other aspects related to the system’s economics, improvability, and so on. These methods were all based on humans’ processing and assessment of the systems studied.

Major projects related to the evaluation of NLP systems have been supported, such as the DARPA project (White et al., 1994), the DiET project (Klein et al., 1999), TSNLP (Lehmann et al., 1996) and the European project EAGLES (EAGLES, 1995, 1998) for the development of diagnostic and evaluation tools for NLP applications. All of these projects have been concerned one way or another with the evaluation of NLP/MT systems and/or the generation of suitable test suites for such an evaluation. Nevertheless, these methods were based on humans’ assessments of various features of NLP/MT systems.

Evaluation of MT systems has been an active area and has produced abundant literature (see, for instance, Hutchins and Somers, 1992; Arnold et al., 1993; Nyberg et al., 1994; White et al., 1994; EAGLES, 1998). One also finds the results of evaluating a large set of MT systems in Mason and Rinsche (1995). In addition to what was mentioned in the previous section about the various approaches to evaluation, we can note that there seems to be an agreement that the evaluation should test for a number of aspects. One of these is “adequacy”, i.e., the extent to which the meaning of the source text is rendered in the translated text. Another is “fluency”, i.e., the extent to which the target text appeals to a native speaker in terms of well-formedness, grammatical correctness, absence of misspellings, adherence to common language usage of terms, and meaningfulness within the context (White et al., 1994). Yet another aspect that should be tested for is “informativeness”, which assesses the extent to which the translated text conveys enough information from the source text to enable evaluators to answer various questions about it. One can also test for “intelligibility” (Arnold et al., 1993), which is strongly related to informativeness, though directly affected by grammatical errors and mistranslated or missing words.

In Nyberg et al. (1994) the authors from the KANT team (Nyberg and Mitamura, 1992) introduce a methodology based on evaluation metrics for knowledge-based MT. The evaluation criteria they consider are “completeness”, which expresses that a system produces an output for every input; “correctness”, which expresses that a system produces a correct output for every input it is to translate; and “stylistics”, which expresses the appropriateness of the lexical, grammatical, and other choices made during the translation process. Concerning lexicon evaluation, the previous criteria were refined as follows. Lexical completeness was taken to mean that for every word or phrase in the translation domain the system under evaluation has source- and target-language lexicon entries. Lexical correctness refers to the fact that words are correctly chosen in the target sentence to realize the intended concept. Finally, in terms of stylistics, lexical appropriateness means that each word selected for output is the most appropriate (and correct) choice for the context. Based on the completeness, correctness, and stylistics criteria, the authors then defined four evaluation criteria, which measure, as percentages, the


Analysis Coverage, Analysis Correctness, Generation Coverage, and Generation Correctness. These four percentages then get multiplied, yielding the “Translation Correctness”, which measures the overall quality of the system.

3. Lexicon Evaluation Methodology

In order to evaluate lexicons of MT systems, it is necessary to know the types of errors that may occur when translating text using MT systems. In this paper, we focus on the errors in translation due to shortcomings in the system’s lexicon. There are two main types of lexical errors that may occur when translating from one language to another using an MT system. First, errors can occur because of a word missing from the lexicon. Second, errors can occur because the correct word sense does not exist in the lexicon even though the word may exist with a different sense.

3.1. CLASSIFICATION OF WORD SENSES

For most languages, a single word can have different senses according to the context of the sentence. In (1) below, the word bank appears in two sentences. In (1a), it gets translated as al-masraf ‘bank’ and in (1b) as al-jaanib ‘side’. Two completely different Arabic words are used to translate the same English word. This illustrates the idea that the same word in some language can have different senses in another. Therefore, if one of the senses does not exist in the lexicon, the translation may be incorrect depending on the context in which the word to be translated occurs.

(1) a. I waited inside the bank.

Intazartu daakhila al-masraf

b. I waited on the northern bank of the river.

Intazartu bil-jaanbi al-shimaaly mina al-nahr

The same discussion applies to the word types in (2). In addition to what was explained, we note that types in (2a) occurs as a noun, translated into Arabic as anwaa’, while in the second sentence types is a verb, translated as yaTba’.

(2) a. There are different types of operating systems.

Hunaaka anwaa’ mukhtalifa min anzimati al-tashgheel


b. He types his articles using a computer.

Huwa yaTba’ maqaalaatahu biwasitati al-haasibi al-aaly

From this, we see that the existence of a lexical occurrence of a specific word in a lexicon is not sufficient. Indeed, the lexicon coverage should be different depending on whether the lexicon covers all the different word senses or not. This means that, when evaluating lexicons, we should deal with word senses and not with words only.

In the following, a word sense carries the conventional meaning. The database of word senses will consist of tuples which contain the word sense, its translation, the number of its occurrences in the corpus, and any necessary additional fields. It is important here to highlight that the source word-sense representation depends on the source language of the MT system under evaluation. For instance, while we would represent word senses in their root form for MT systems that translate from English, French, Italian, or any language that has the same lexical behaviour, for a language like Arabic, actual lemmas would be used in the database of word senses if we were to evaluate an MT system that translates from Arabic to any other language. For our purposes, we will therefore represent English word senses in their root form in the database of word senses.
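As a minimal sketch, such a tuple could be represented as follows (in Python; the field names and the occurrence counts are illustrative, not the authors' actual schema):

```python
from dataclasses import dataclass

@dataclass
class WordSense:
    """One tuple in the database of word senses (field names are illustrative)."""
    lemma: str        # root form for English source words, actual lemma for Arabic
    sense_id: int     # distinguishes the different senses of the same lemma
    translation: str  # reference target-language equivalent
    occurrences: int  # frequency in the sample corpus for the chosen domain

# e.g. the two senses of "bank" from example (1); counts are hypothetical
bank_1 = WordSense("bank", 1, "al-masraf", 27)
bank_2 = WordSense("bank", 2, "al-jaanib", 4)
```

Note that the two records share a lemma but not a translation, which is exactly why the database must be keyed by word sense rather than by word.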

In our work, we will refer to domains in which the user translates texts. These domains are general concepts under which fall all the texts that the user intends to translate using some MT system. For example, biology, chemistry, and computer science are a few examples of different domains in which a user of an MT system might be interested in translating texts. These domains are not necessarily mutually exclusive and some might even be complete subsets of others. The idea of having different domains stems from the fact that different users may want to use an MT system in different settings. The main effect of this on evaluating lexicons is that the evaluator should take into consideration that a certain word sense might be important in some domain but not as important in other domains. For example, the English word user is a frequently used word sense in the domain of computing, whereas it may be rarely used in the domains of medicine and the arts. The English word formula is frequently used in the domains of mathematics and chemistry, whereas it is not as frequently used in computing or geography. This means that if the word formula is absent from the lexicon of an MT system which is used to translate texts in the computer domain, it should not affect the evaluation of such a lexicon as much as it would when the lexicon of the MT system is considered for translation in the mathematics domain. Therefore, the classification of word senses according to domains is a prerequisite to evaluating any lexicon. This classification should be done based on the specific domain for which the lexicon is being evaluated.

We can now state that a prerequisite to evaluating lexicons is first to specify the domain in which the evaluation is to take place. Second, a way should be available


for the evaluator to know the importance of each word sense in the selected domain. This can be done by providing the evaluator with a database of word senses and the frequency of their use in the domain in which the evaluation is being made. A simpler categorization would be to divide the word senses in this database into classes. For example, we might decide to use three classes of word senses, “frequent”, “normal” and “rare”. Specifying which word to assign to which class can be done by looking at the occurrence frequency of the specific word in the domain of interest. The idea of classes of word senses is especially important for large databases, for which storing an individual weight for each word sense would be memory-consuming.
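Such a frequency-based assignment of word senses to classes can be sketched as follows (the thresholds echo the worked example in Section 3.4; the function name and default values are our own):

```python
def classify(occurrences, frequent_min=20, normal_min=10):
    """Assign a word sense to a frequency class by its occurrence count.

    Thresholds mirror the example in Section 3.4: more than 19 occurrences
    is "frequent", 10-19 is "normal", fewer than 10 is "rare".
    """
    if occurrences >= frequent_min:
        return "frequent"
    if occurrences >= normal_min:
        return "normal"
    return "rare"
```

For example, `classify(27)` yields `"frequent"` while `classify(3)` yields `"rare"` under these thresholds.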

Using classes means that all words in a specific class would have weights of comparable importance. The weight of a class would therefore be calculated as a function of the weights of the individual words that make up that class. The weight of the class should reflect its importance as compared to other classes in the domain. Therefore, in the example of using the three classes (frequent, normal and rare), the frequent class should have the highest weight, while the rare class should have the lowest weight. The evaluation of the lexicon would then use these weights to rate the usefulness of the lexicon in the specific domain under consideration.

As mentioned earlier, it is proposed in this paper to use one of two methods for the classification of word senses. In the first method, the evaluator discriminates between the score of each word sense in one class and the scores of the other word senses in the same class based on their occurrences. For example, if the words w1 (with 4 occurrences) and w2 (with 2 occurrences) are in the same class, then this means that the word w1 will affect the overall class coverage twice as much as word w2. This method is referred to as the “local-discrimination” method in this paper. In the second method, the evaluator does not discriminate between the score of one word sense in one class and others in the same class (though their weights may be of fairly different orders if compared to words of other classes). For example, the word senses w1 and w2 in the previous example will be assigned the same score, so they will affect the overall class coverage equally. We will call this method the “no-local-discrimination” method. The difference between the two methods is that in the local-discrimination method, each word has its own weight, which reflects its own importance in the domain under consideration. This will give more precise results as far as evaluation is concerned. However, the additional calculations needed in the case of local discrimination, as compared to the improvement in the evaluation precision, might not be that important to the evaluator. In fact, the method used to build the database of word senses uses a finite number of texts to generate the weights associated with each word sense. This leads to the conclusion that there is an inaccuracy inherent in the weights regardless of the “randomness” in the selection of sample texts of a specific domain. This error might be deemed by the evaluator large enough to warrant assigning the same weight to a group of word senses whose weights are numerically close. In this case, the evaluator will use the no-local-discrimination method, in which a weight is assigned to each


class, based on the total number of occurrences of its constituent elements with respect to the total number of word senses in the database.
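The contrast between the two scoring schemes can be illustrated with the w1/w2 example above (a sketch; the dictionary representation is our own):

```python
# hypothetical occurrence counts for two senses in the same class, as in the text
occ = {"w1": 4, "w2": 2}
total = sum(occ.values())

# local discrimination: each sense contributes in proportion to its occurrences,
# so w1 affects the class coverage twice as much as w2
local_share = {w: n / total for w, n in occ.items()}

# no local discrimination: every sense in the class contributes equally
uniform_share = {w: 1 / len(occ) for w in occ}
```

Under local discrimination the share of w1 is twice that of w2; under no local discrimination both shares are equal.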

3.2. EVALUATING LEXICON COVERAGE

Our approach to evaluating the coverage of lexicons for a given MT system requires a number of steps.

3.2.1. Step 1

Step 1 involves calculating the size of each class of word senses, which will be used to calculate the weight of every word sense so as to rank its importance in the domain under consideration. The class size is the sum of the occurrences of all word senses in the class. The size of the whole database (DB) is the sum of the occurrences of all word senses in the whole database or, equivalently, the sum of the sizes of all the classes (3),

(3) size(C_j) = ∑_{m_i ∈ C_j} Occ(m_i)

where C_j is the class j of word senses (C_j ⊆ DB), m_i is the word sense i in class C_j (m_i ∈ DB), Occ(m_i) is the number of occurrences of m_i in DB, and size(C_j) is the total number of occurrences of word senses in class C_j.
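A direct transcription of equation (3) might look as follows (the word senses and their counts are hypothetical):

```python
def class_size(occurrences):
    """size(C_j): sum of the occurrences of all word senses in one class (equation (3))."""
    return sum(occurrences.values())

# hypothetical word-sense occurrence counts, grouped into three classes
classes = {
    "frequent": {"system": 25, "user": 21},
    "normal":   {"network": 12},
    "rare":     {"formula": 3, "coast": 1},
}
sizes = {name: class_size(occ) for name, occ in classes.items()}
db_size = sum(sizes.values())   # size(DB): equivalently, the sum of all class sizes
```

Here size(frequent) = 46, size(normal) = 12, size(rare) = 4, and size(DB) = 62.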

3.2.2. Step 2

Step 2 involves calculating the class weight for each class of word senses with respect to the whole database (4),

(4) CW(C_j) = size(C_j) / size(DB) = ∑_{m_i ∈ C_j} Occ(m_i) / ∑_{m_i ∈ DB} Occ(m_i)

where CW(C_j) is the weight of class C_j of word senses and size(DB) is the total number of occurrences of all word senses in the entire DB.
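Equation (4) can be sketched as follows (the class sizes are hypothetical, carried over from the Step 1 sketch):

```python
def class_weight(class_sz, db_sz):
    """CW(C_j) = size(C_j) / size(DB)  (equation (4))."""
    return class_sz / db_sz

sizes = {"frequent": 46, "normal": 12, "rare": 4}   # hypothetical class sizes
db_size = sum(sizes.values())
weights = {c: class_weight(s, db_size) for c, s in sizes.items()}
# since the class sizes partition size(DB), the class weights sum to 1
```

This normalization is what later lets equation (8) combine the per-class coverages into a single figure.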

3.2.3. Step 3

In step 3 we calculate the coverage of each class of word senses in the MT system lexicon separately. The class coverage could be computed in one of two ways according to the evaluator’s need. These are as follows.

In case of no local discrimination among class elements (i.e., word senses), we assign an equal score, when calculating the class coverage, to each word sense in the class (i.e., 1 if the word sense is available in the MT system lexicon and 0 otherwise). We can calculate the lexicon coverage of a class as in (5) and (6),

(5) Coverage(L, C_j) = ∑_{m_i ∈ C_j} d(m_i, L) / elements(C_j)

(6) d(m_i, L) = 1 if m_i ∈ L, 0 otherwise

where Coverage(L, C_j) is the coverage of class C_j in the lexicon L under evaluation (for the selected domain), elements(C_j) is the number of word senses in class C_j not counting multiple occurrences, and d(m_i, L) is a binary (boolean) function (6) that reflects the existence (or not) of a word sense in the lexicon. If the lexicon covers all word senses that are in class C_j then Coverage(L, C_j) would be 1. Each word of class C_j not covered by the lexicon would fail to contribute to the numerator of (5). The denominator in (5) is independent of L and is a function only of the number of word senses in class C_j.

In case of local discrimination among word senses, instead of using equation (5), we calculate the class coverage as in (7).

(7) Coverage(L, C_j) = ∑_{m_i ∈ C_j} [Occ(m_i) ∗ d(m_i, L)] / ∑_{m_i ∈ C_j} Occ(m_i)

The difference between equations (5) and (7) is that in (7), two words that belong to the same class can contribute differently to the coverage of the class C_j in the lexicon under evaluation, whereas in equation (5), the existence of any word would contribute equally. In (7), Coverage(L, C_j) is affected more by a word which has a high number of occurrences in C_j than by a word which occurs less frequently, even though both words may be in the same class.
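The two variants can be sketched side by side as follows (the class contents and the lexicon are hypothetical):

```python
def coverage_uniform(senses, lexicon):
    """Equation (5): every sense in the class scores equally (no local discrimination)."""
    return sum(1 for s in senses if s in lexicon) / len(senses)

def coverage_weighted(occ, lexicon):
    """Equation (7): a sense contributes in proportion to its occurrence count."""
    return sum(n for s, n in occ.items() if s in lexicon) / sum(occ.values())

# hypothetical class of three senses; "system" is missing from the lexicon
occ = {"system": 40, "user": 10, "screen": 10}
lexicon = {"user", "screen"}
u = coverage_uniform(occ, lexicon)    # 2/3: two of the three senses are covered
w = coverage_weighted(occ, lexicon)   # 1/3: the missing sense is the most frequent one
```

The drop from 2/3 to 1/3 shows how local discrimination penalizes the absence of a high-frequency sense more heavily.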

We should point out here that the local-discrimination case amounts to not considering classes at all, since only the word frequency with respect to the entire database is relevant; we could thus compute the overall MT system lexicon coverage (see next step) directly. However, we do provide equation (7) for comparative purposes and to double-check through the text examples that the methodology is sensitive to the difference between local discrimination and no local discrimination.

Using the above equations, the evaluator will be able to compute the overall coverage percentage of the lexicon of an MT system. Moreover, after calculating the values of the lexicon coverage for each class, the evaluator will be able to decide which classes of word senses are covered to a high percentage in the lexicon of an MT system. Calculating the lexicon coverage for each class is particularly useful, in addition to computing the overall lexicon coverage, in that it gives the evaluator some means of finding out which category of words suffers from weaknesses and whether it is worthwhile working on improving it. For instance, if the class of rare words is not well covered by the lexicon, it may not be important to enrich the lexicon. Such would not be the case had the class of frequent words (or even normal words) not been well covered. This highlights one aspect of the importance of the notion of class coverage, which also explains why we provide equation (7) though it is not strictly necessary.


3.2.4. Step 4

Finally we calculate the overall coverage for the MT system lexicon using the coverage and weight of each class as in (8).

(8) Coverage(L) = ∑j=1..N [CW(Cj) ∗ Coverage(L, Cj)]

Equation (8) thus provides one single value that can be used in order to assess the quality of an MT system and to compare lexicons across MT systems.
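As a sketch, equation (8) is simply a weighted sum, so its implementation is a one-liner; the list-based calling convention here is our own choice:

```python
def lexicon_coverage(class_weights, class_coverages):
    """Equation (8): overall lexicon coverage as the sum of each class's
    coverage weighted by that class's weight CW(Cj)."""
    return sum(w * c for w, c in zip(class_weights, class_coverages))
```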

3.3. MT SYSTEM LEXICON EVALUATION PROCEDURE

We summarise the above methodology in the procedure shown in Figure 1. This procedure consists of two main steps: a procedure build_wordsense_db that builds, off-line, the DB of word senses, classes, and their statistics with respect to a selected domain; and a procedure evaluate_lexicon, which uses the DB of word senses, classes, and statistics built in the previous step to actually evaluate any lexicon it is presented with. The two procedures are presented separately.
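As an illustrative sketch only (the authoritative definition is Figure 1), the two procedures might look as follows; the function signatures, the dictionary-based database, and the judge callback standing in for the human operator are all our assumptions:

```python
def build_wordsense_db(sample_texts, db=None):
    """Off-line step: accumulate occurrence counts for each word sense
    across the sample texts of the selected domain."""
    db = db if db is not None else {}
    for text in sample_texts:
        for sense in text:  # texts assumed pre-segmented into word senses
            db[sense] = db.get(sense, 0) + 1
    return db

def evaluate_lexicon(db, classify, translate, judge):
    """Black-box step: ask the MT system to translate each word sense;
    the judge callback models the human operator's correct/incorrect
    decision (the if-statement of Figure 1)."""
    per_class = {}
    for sense, occ in db.items():
        ok = judge(sense, translate(sense))  # human decision
        per_class.setdefault(classify(occ), []).append((occ, ok))
    return per_class  # class coverages are then computed from this
```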

Note that the evaluation of the MT system lexicon is performed in a pure black-box fashion. The MT system is automatically asked to translate each word sense, and a human operator decides whether the translation returned by the system is the proper lexical equivalent of the input. This step corresponds to the if-statement in the procedure evaluate_lexicon.

In the following section we show an example of how to apply the above procedure in order to calculate the coverage of the lexicon of a given MT system.

3.4. EXAMPLE

Consider the following case, where we have assumed the existence of three classes of word senses. Class 1 contains words that occur more than 19 times in the database; class 2 contains words that occur between 10 and 19 times; and class 3 includes words that occur fewer than 10 times. In Table I, the column labelled “Exist” tells whether or not a word sense belongs to the lexicon under evaluation. In order to calculate the coverage of the lexicon for the given MT system we follow the four steps listed in the previous section, as shown in Table II.

• Step 1: We calculate the size of each class of word senses in addition to the size of the whole database using equation (3).

• Step 2: We calculate the class weight for each class of word senses using equation (4).

• Step 3: We calculate the coverage of each class of word senses in the MT system lexicon separately, (a) in the case of no local discrimination using equation (5), and (b) in the case of local discrimination using equation (7).

138 AHMED GUESSOUM AND RACHED ZANTOUT

Figure 1. Summary of evaluation procedure.

• Step 4: We calculate the overall coverage for the MT system lexicon using equation (8), again without (a) or with (b) local discrimination.

The results show that this lexicon covers 60.8% of the word senses in the text domain in the case of no local discrimination among word senses of the same class. On the other hand, when discrimination based on occurrence ratios is applied between word senses in the same class, the coverage of word senses in the text domain becomes 52.8%. Note that these lexical coverage percentages are very low, which means that the lexicon of the given MT system is poor for the given application domain.

Table I. Some word senses, their occurrences, weights and classes.

       Class-1 words           Class-2 words           Class-3 words
       Occ.  Weight Exist      Occ.  Weight Exist      Occ.  Weight Exist
  w11   25   0.10   1     w21   10   0.04   1     w31    1   0.004  1
  w12   30   0.12   1     w22   11   0.04   1     w32    2   0.008  1
  w13   35   0.14   0     w23   16   0.06   1     w33    4   0.016  1
  w14   40   0.16   0     w24   13   0.05   1     w34    3   0.012  0
  w15   20   0.08   1     w25   15   0.06   0     w35    8   0.032  0
                          w26   15   0.06   0     w36    1   0.004  0
                                                  w37    1   0.004  0
  Tot  150   0.60   3           80   0.32   4          20   0.080  3

Table II. Results from worked example.

                            Class 1   Class 2   Class 3   Total
  Step 1   Size               150       80        20       250
  Step 2   Class weight       0.60      0.32      0.08
  Step 3   (a) Coverage       0.60      0.66      0.43
           (b) Coverage       0.50      0.63      0.35
  Step 4   (a)                                             0.608
           (b)                                             0.528

The reader’s attention is drawn to the point that the methodology presented here allows us to obtain quantified measures of lexicon coverage. The interpretation of the quality of the lexicon is the task of the MT system user. For instance, seeing that an evaluated system has performed at, say, 90% means that, statistically, almost 1 word sense out of 10 is not properly translated by the system. Obviously, for lexical coverage, a system must perform as close to 100% as possible. Nevertheless, our aim in this paper is to provide a way of obtaining objective quantified values which help assess the quality of an MT system’s lexicon. We leave it up to the user to interpret the result in whatever way suits any potential usage.
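The four steps of the worked example can be replayed in a few lines of code; the (occurrences, exists) encoding is our own, but the numbers are those of Tables I and II:

```python
# Word senses from Table I, as (occurrences, present-in-lexicon) pairs.
class1 = [(25, 1), (30, 1), (35, 0), (40, 0), (20, 1)]
class2 = [(10, 1), (11, 1), (16, 1), (13, 1), (15, 0), (15, 0)]
class3 = [(1, 1), (2, 1), (4, 1), (3, 0), (8, 0), (1, 0), (1, 0)]
classes = [class1, class2, class3]

# Step 1 (equation (3)): class sizes and database size.
sizes = [sum(occ for occ, _ in c) for c in classes]   # [150, 80, 20]
db_size = sum(sizes)                                  # 250

# Step 2 (equation (4)): class weights.
weights = [s / db_size for s in sizes]                # [0.60, 0.32, 0.08]

# Step 3: class coverage without (a) and with (b) local discrimination.
cov_a = [sum(1 for _, e in c if e) / len(c) for c in classes]            # eq. (5)
cov_b = [sum(o for o, e in c if e) / s for c, s in zip(classes, sizes)]  # eq. (7)

# Step 4 (equation (8)): overall lexicon coverage.
overall_a = sum(w * c for w, c in zip(weights, cov_a))
overall_b = sum(w * c for w, c in zip(weights, cov_b))
```

Rounded to three decimals, overall_a and overall_b reproduce the 0.608 and 0.528 of Table II.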

4. Evaluation of Various Arabic MT Systems’ Lexicons

One of our goals while developing our approach for evaluating MT system lexicons was to automate the evaluation process as much as possible. To this end, we have developed a tool for automating the major activities involved in evaluating the coverage of lexicons. This tool automatically calculates occurrences and weights for all word senses in a given text, as well as the weight and coverage of each class and the coverage of the lexicon of a given MT system with respect to some domain, as explained in the previous section.


4.1. A TOOL FOR COLLECTING WORD-SENSE STATISTICS

Building the database of frequently used words for different domains is not an easy task in itself. In our case, we have realized it by collecting sample texts chosen from a selected domain. A tool was built to extract each word sense from the sample texts and count the number of times this word sense occurs across the different sample texts. The tool also calculates the total number of word senses entered thus far and stores a flag value for each word sense to show its existence (or not) in the lexicon under evaluation. Additionally, as shown in Table III below, each word sense is assigned a weight that reflects the number of occurrences of this word sense as a percentage of the total number of occurrences of all word senses in all the sample texts that the tool has processed thus far. We should point out that, since the source language of the evaluated systems is English, this tool deals with word senses at the root level for a given word category. This means that different derivations of a word sense will add to the number of occurrences of the same root for a given word category. For example, in Table III, item 13 has two occurrences which may have come from hand and hands, which are occurrences of the word sense ‘hand’. As another example, if the tool reads the word happy in any of the forms happy, unhappy, happily, or happiest, it will be counted as an occurrence of the word happy. On the other hand, the same word can occur more than once in our database, depending on the sense and/or word category used in the texts processed by the tool. For example, we can see in Table III that the word force has two entries in the database. This is because the texts that were processed used two different senses of this word, one as a noun and one as a verb. The same can be said about the three different instances of the word form and the two different instances of the words format and hand. This is so in our database because we deal with statistics about word senses and not only words. Human assistance is currently required during the building of the database, both to differentiate between the different senses of the same word and to extract word-sense roots.
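The root-level counting described above can be sketched as follows; the root_of callback is a stand-in for the human operator (or a future morphological analyzer) who supplies the root for each surface form:

```python
from collections import Counter

def count_word_senses(tokens, root_of):
    """Count occurrences at the root level, so that e.g. 'hand' and
    'hands' (or 'happy', 'unhappy', 'happily') all increment the same
    database entry."""
    counts = Counter()
    for token in tokens:
        counts[root_of(token)] += 1
    return counts
```

With a toy root mapping, hand/hands collapse into one entry and the happy family into another.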

Deciding which word sense belongs to which class, and how many classes to use, was done by plotting the weights of all the different words and analyzing their distribution. In our case, we have found that dividing word senses into three classes gave a reasonable trade-off between storage requirements and evaluation correctness. The two extreme cases of classification would be to use only one class, which means that all words in our database are equally important, or to use as many classes as there are word senses, which is equivalent to assigning a different weight to each word sense.

Table III. A sample from the database of word senses.

      English word sense   Arabic equivalent(s)     Occ.  Exist
   1  force                quwah                      1     1
   2  force                yujbir, ijbar              4     1
   3  form                 shakl                     10     1
   4  form                 namudhaj, istimaarah       2     1
   5  form                 yakon, yunshi’             1     1
   6  format               Siyghah                    2     1
   7  format               yushakil                   2     1
   8  fortune              Haz, naSiib                1     0
   9  frame                iTaar                      1     1
  10  full                 taam, kaamil, maliy’      11     1
  11  guide                daliil                     1     1
  12  hand                 naaHiyah                   1     1
  13  hand                 yad                        3     1

Figure 2 depicts the statistical distribution of a randomly selected sample set of word senses. From Figure 2 we have found it reasonable to divide word senses into three categories. The first category encompasses all word senses corresponding to the prominent peaks (above 65 occurrences). The second category contains word senses which occur less often than the first type but still correspond to peaks in the graph (between 7 and 65 occurrences). The third category contains all the remaining words (fewer than 7 occurrences). Therefore, one possible classification of word senses is, as shown in Figure 3, to use three classes, namely “frequent”, “normal” and “rare”. Note that the number of classes and the boundaries between classes are decided by the evaluator and might be different for each domain. We believe, however, that using three classes is representative enough for our purpose. Note also that such a choice, although evaluator-dependent, affects all evaluations of MT systems homogeneously. This means that the comparison between MT systems is neither biased toward a specific MT system nor dependent on the human evaluator.

Figure 3 shows a graphical representation of the data collected in the database of word senses. In particular, it shows, for each word sense, how many times it has occurred in the corpus. For example, there are 100 word senses that have 3 occurrences each. In addition, this figure depicts the boundaries between, and the sizes of, the classes of word senses. Using the above information we classified the database of word senses into the following classes:

• Class 1, of frequent word senses, contains word senses with an occurrence ratio of more than 0.01 (i.e., word senses with more than 65 occurrences).

• Class 2, of normal word senses, contains word senses with an occurrence ratio between 0.001 and 0.01 (i.e., word senses with between 7 and 65 occurrences).

• Class 3, of rare word senses, contains word senses with an occurrence ratio of more than 0 and less than 0.001 (i.e., word senses with between 1 and 6 occurrences).
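Under the occurrence-count boundaries above (more than 65, 7–65, and 1–6 occurrences), class assignment reduces to a small threshold function; this sketch uses the raw counts rather than the ratios:

```python
def classify(occurrences):
    """Assign a word sense to one of the three classes chosen for the
    'Internet and Arabization' domain, by its number of occurrences."""
    if occurrences > 65:
        return "frequent"   # Class 1
    if occurrences >= 7:
        return "normal"     # Class 2
    return "rare"           # Class 3
```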


Figure 2. Graphical representation for the occurrences of a sample set of word senses.

Figure 3. Graphical representation for word sense occurrences and classes.

To summarize, we have developed a tool which helps in the evaluation of the lexicon of an MT system, and have implemented it using Delphi 3 for Windows 95. The tool implements the procedure build_wordsense_db; that is, it reads a text file and a domain and then generates an output file containing the following:

• cumulative statistics which show the occurrences of each word sense in all the text files entered since the system was first used;

• the total number of word senses processed using this tool; and

• a flag for each word sense to show its existence in the lexicon under evaluation.

4.2. EVALUATION OF ARABIC MT LEXICONS: THREE CASE STUDIES

For our purposes, we have evaluated the lexicons of three Arabic MT systems, Al-Mutarjim Al-Arabey, Arabtrans, and Al-Wafy, using a database of word senses automatically built using the tool mentioned above.

Figure 4 shows a screen displaying some results obtained using our evaluation tool. The interface available to the evaluator allows them to select the following:


Figure 4. Screen-shot of tool for calculating the coverage of the lexicon of an MT system.

• local discrimination or no local discrimination mode, using radio buttons; and

• the ranges of the three classes of word senses. (Note that “average” in the figure has been renamed “normal” in this paper.)

Using the procedure build_wordsense_db, we obtain the following:

• the sizes of classes C1, C2, and C3, as well as the sum of these sizes as the size of the database of word senses (using equation (3));

• the number of distinct word senses, i.e., not counting the redundant occurrences of word senses in the class under consideration;

• the weights of classes C1, C2, and C3 (using equation (4)).

Now, using the procedure evaluate_lexicon, we obtain the required lexicon coverage measures:

• the lexicon coverages of classes C1, C2, and C3 (using equation (5) or (7), depending on the choice made as to local or no local discrimination); and, to conclude,

• the coverage of the MT system lexicon (using equation (8)).

The collection of texts was selected from the domain of “Internet and Arabization” and has the following properties:

• The total number of word senses in this database, size(DB), is 6,308.

• The total number of word senses in this database counting one occurrence for each word sense, i.e., skipping the redundant occurrences of any word sense, is 1,319.


Table IV. Number of examples in each class, and weight.

                 Frequent (C1)  Normal (C2)  Rare (C3)  Total
  Class size         1520          2692        2096      6308
  Weight (%)         24.1          42.7        33.2

Table V. Summary of the evaluation results.

  Coverage                          Al-Mutarjim Al-Arabey  Arabtrans  Al-Wafy
  Without local     Class 1                1.000             1.000     1.000
  discrimination    Class 2                0.916             0.916     0.928
                    Class 3                0.904             0.901     0.884
                    Lexical coverage       0.932             0.931     0.930
  With local        Class 1                1.000             1.000     1.000
  discrimination    Class 2                0.926             0.922     0.934
                    Class 3                0.924             0.920     0.905
                    Lexical coverage       0.943             0.940     0.940

Using the classification of word senses shown in Table III and equation (3), the sizes of the classes are as shown in Table IV. Also shown is the weight for each class, calculated using equation (4).

Using equations (5), (7), and (8) we have evaluated the three MT systems mentioned earlier. The results of this evaluation are shown in Table V, where the first block of rows is for the case where no local discrimination is applied and the second block of rows for the case where local discrimination is applied.

As an enhancement of the above results, we have decided to remove from the database some word senses which are very frequent (let us call them “necessary”), e.g., a, an, the, of, in, on, this, I, he, she, it, they, we, am, is, are, be, was, were, etc. We have done so in order to measure their effect on the overall coverage of the lexicons, since we may assume that all the lexicons of the MT systems must cover them by default. After removing such words, we have calculated the coverage of the lexicons of the MT systems once more. The results are shown in Table VI.
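Filtering out the “necessary” words before recomputing coverage might look like the following sketch; the set below lists only the words enumerated above, and the dictionary layout of the database is our assumption:

```python
# Closed-class "necessary" words that every MT lexicon may be assumed
# to cover by default (only the words enumerated in the text).
NECESSARY = {"a", "an", "the", "of", "in", "on", "this", "i", "he", "she",
             "it", "they", "we", "am", "is", "are", "be", "was", "were"}

def drop_necessary(db):
    """Remove the necessary words from the word-sense database before
    re-running the coverage computation (lower-cased keys assumed)."""
    return {sense: occ for sense, occ in db.items() if sense not in NECESSARY}
```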

4.3. DISCUSSION OF THE EVALUATION RESULTS

Table VI. Evaluation of lexical coverage with “necessary” words not tested.

                                  Al-Mutarjim Al-Arabey  Arabtrans  Al-Wafy
  Without local discrimination           0.908             0.907     0.906
  With local discrimination              0.919             0.915     0.914

In the above evaluation process, we have evaluated the lexicons of the MT systems in the domain of the Internet and Arabization. This means that the evaluation results are specific to this domain. In other words, if we change the domain, the results might be different. This is so because a word sense which was classified as frequent in this domain might be classified as normal or rare in another domain, and vice versa. Our methodology thus emphasizes, and is sensitive to, this important factor in the evaluation of the lexicon coverage of a domain language. We have also applied the lexicon evaluation methodology both with and without local discrimination; this issue has already been discussed in Section 3.

From Tables V and VI, we can see that all three MT systems have adequate lexicons. Al-Mutarjim Al-Arabey has the best lexical coverage in both cases (with and without local discrimination), although it is worse than Al-Wafy on the coverage of class-2 word senses in the case of local discrimination. All three MT systems cover class-1 word senses completely. As to comparing Al-Wafy and Al-Mutarjim Al-Arabey, one can see that, overall, the latter performs better than the former. However, the two systems perform exactly alike on aspects related to grammatical, semantic, and pragmatic coverage.1 The comparison of the two systems, which were produced by the same company, has given further strength to our generalized evaluation methodology (which includes the lexicon). Indeed, it has allowed us to infer that the lexicon of Al-Mutarjim Al-Arabey is richer than that of Al-Wafy, whereas the two systems share the same remaining modules.

Comparing the results with and without local discrimination, we can see that all three systems have better coverage percentages when local discrimination is used. This is because all three systems cover the frequent senses well. Had one of the systems had a problem with the class of frequent word senses, this would have been reflected as a lower rating when using local discrimination. The improvement in the local-discrimination case is also partly due to the high coverage of class-2 and class-3 word senses.

Figure 5 depicts graphically the results of these evaluations. The figure shows that all the Arabic MT systems evaluated give a lexical coverage percentage of more than 93%, which is acceptably good for the chosen domain. Upon analysis, we have realized that most of the missed word senses in the test cases are words which have more than one sense. This points to the fact that, in the evaluated MT system lexicons, alternative senses of the same words have been overlooked, either intentionally or unintentionally, by the system lexicon developers.

Figure 5. Graphical representation of lexical coverage evaluation with and without local discrimination.

One issue that should be clarified here is the apparent contradiction between our results in this paper (the good coverage of all three lexicons) and the bad evaluations these MT systems received elsewhere (Anon, 1996; Jihad, 1996; Qendelft, 1997). Although the lexicon is an important part of an MT system, it is clearly not the only part. In our experience with the above three systems, the major flaws that gave them bad ratings were in areas such as grammatical and pragmatic coverage. This means that even though the developers of the systems have clearly put a lot of effort into the lexicons of their respective systems, the overall improvement that resulted was not satisfactory enough. As such, it is our recommendation that more effort be put into parts other than the lexicon in order to achieve noticeably better results. We stress once more that the results presented in this paper also show that the terminology of the area of Internet and Arabization is well covered by the three systems. We have yet to test the performance of the MT system lexicons on other domains.

5. Conclusion

The research reported in this paper was motivated by the fact that we were not aware of any automated or semi-automated approach for evaluating MT systems in general, and MT system lexicons more specifically. The research we have found so far on the evaluation of MT systems depends to a large extent on the human evaluator. We have therefore aimed at making the whole process as automated as possible, and we have concentrated in this work on the evaluation of MT system lexicons.

In this paper, we have introduced a new methodology for the evaluation of (MT system) lexicons. To do so, we needed to introduce a number of new principles. The whole methodology is based on the primordial concept of the word-sense occurrence ratio (or weight). This ratio gives a statistical assessment of the importance of some word sense in a given domain. This points to another new feature, namely that this ratio may be different for the same word sense if the application domain changes. We believe that this is important if we want an accurate assessment of lexicon coverage based on the existence or not of some word senses, depending on the targeted application domain. Based on the central notion of the word-sense occurrence ratio, we have divided all the word senses of a language into various classes based on the range in which the ratio falls. This class division gives us an immediate quantitative understanding of whether or not a word occurs frequently in some domain. To complete the methodology, we have introduced formulas that define the above concepts as well as the notions of class weight, class coverage and, finally, lexicon coverage.

Having developed the methodology, we have implemented it as two procedures. First, we have implemented a tool that takes as input sample texts (a corpus) from a given domain and incrementally produces a database of word senses, their occurrence ratios, and class weights in the selected domain. This means that the richer the corpus, the more precise the results will be. Second, we have implemented a tool that takes as input an MT system lexicon as well as the corresponding database of word-sense occurrence ratios and class weights. This tool produces the class coverage and lexicon coverage defined in the methodology. Finally, we have tested our tools on three Arabic MT systems and confirmed that the results reflect the intended assessment.

The main point that we would like to stress, besides the philosophy of the methodology, which is based on the notion of a word-sense occurrence ratio or weight, is its semi-automatic nature. We should also point out that, to the best of our knowledge, there is no formal evaluation of Arabic MT systems apart from newspaper or magazine articles (Anon, 1996; Jihad, 1996; Qendelft, 1997). These articles, unfortunately, have failed to present any systematic evaluations of the assessed systems.

In terms of implementation, future research should concentrate on integrating a morphological analyzer into the tool that produces statistics about word-sense occurrence ratios. Indeed, so far a human operator is needed to enter the root and the intended word sense for a given word during the process of building the database of word-sense occurrence ratios. The human operator can be replaced by a morphological analyzer of the language under consideration. Research should also concentrate on developing a more general methodology that deals with the evaluation of MT systems, not just lexicons. Another exercise which may benefit the area of lexicon evaluation is the testing of lexicons with respect to different domains. We have seen that the evaluated lexicons have performed well on the Internet and Arabization area. However, we will certainly have a better assessment of lexicons if we can test them with respect to different domains. In fact, we can use the methodology we have introduced to give (partial) evaluations of a lexicon with respect to the domains under consideration, as well as a combined evaluation of the lexicon in general based on its partial evaluations.


Acknowledgements

We would like to thank MSc student A. Al-Sikhan for having implemented for us the methodology introduced in this paper. We would also like to acknowledge the support of the Research Center of the College of Computer and Information Sciences of King Saud University (Grant RC1/418-419). Last but not least, we are very grateful to the anonymous referees for their useful comments, which have helped us improve the quality of this paper.

Note

1 This refers to work on the generalization of our evaluation methodology. This work has been done but is yet to be reported on.

References

Anon: 1996, [The Machine Translator: Al-Wafy], Arabuter 8, 27–28.

Arnold, Doug, R. Lee Humphreys and Louisa Sadler (eds): 1993, ‘Special Issue on Evaluation of MT Systems’, Machine Translation 8(1–2).

Dorr, Bonnie J. and Judith Klavans: 1994/5, ‘Special Issue: Building Lexicons for Machine Translation’, Machine Translation 9(3–4), 10(1–2).

Dyson, Mary C. and Jean Hannah: 1987, ‘Towards a Methodology for the Evaluation of Machine-Assisted Translation Systems’, Computers and Translation 2, 163–176.

EAGLES (Expert Advisory Group on Language Engineering Standards): 1995, ‘Evaluation of Natural Language Processing Systems’, EAGLES Document EAG-EWG-PR.2.

EAGLES: 1998, Proceedings of the Second EAGLES II Workshop on Evaluation in Human Language Technology, Geneva.

Hutchins, W. John and Harold L. Somers: 1992, An Introduction to Machine Translation, London: Academic Press.

Jihad, A.: 1996, [Has the Arabic Machine Translation Era Started?], Byte Middle East, November 1996, pp. 36–48.

King, M. and K. Falkedal: 1990, ‘Using Test Suites in Evaluation of Machine Translation Systems’, in COLING-90: Papers Presented to the 13th International Conference on Computational Linguistics, Helsinki, Finland, Vol. 2, pp. 211–216.

Klein, J., S. Lehmann, K. Netter and T. Wegst: 1999, ‘DiET in the Context of MT Evaluation’, in Rita Nübel and Uta Seewald-Heeg (eds), Evaluation of the Linguistic Performance of Machine Translation Systems, Proceedings of the Konvens’98, St. Augustin: Gardezi Verlag, pp. 107–126.

Lehmann, Sabine, Stephan Oepen, Sylvie Regnier-Prost, Klaus Netter, Veronika Lux, Judith Klein, Kirsten Falkedal, Frederik Fouvry, Dominique Estival, Eva Dauphin, Hervé Compagnion, Judith Baur, Lorna Balkan and Doug Arnold: 1996, ‘TSNLP – Test Suites for Natural Language Processing’, in COLING-96: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, pp. 711–716.

Lehrberger, John and Laurent Bourbeau: 1987, Machine Translation: Linguistic Characteristics of MT Systems and General Methodology of Evaluation, Amsterdam: John Benjamins.

Mason, Jane and Adriane Rinsche: 1995, Ovum Evaluates: Translation Technology Products, London: OVUM Ltd.

Melby, A. K.: 1988, ‘Lexical Transfer: Between a Source Rock and a Hard Target’, in COLING Budapest: Proceedings of the 12th International Conference on Computational Linguistics, Budapest, pp. 411–419.

Nagao, M.: 1985, ‘Evaluation of the Quality of Machine-Translated Sentences and the Control of Language’, Journal of the Information Processing Society of Japan 26, 1197–1202.

Nyberg, Eric H. and Teruko Mitamura: 1992, ‘The KANT System: Fast, Accurate, High-Quality Translation in Practical Domains’, in Proceedings of the fifteenth [sic] International Conference on Computational Linguistics, COLING-92, Nantes, pp. 1069–1073.

Nyberg, Eric H., Teruko Mitamura and Jaime G. Carbonell: 1994, ‘Evaluation Metrics for Knowledge-Based Machine Translation’, in COLING 94: The 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 95–99.

Qendelft, G.: 1997, [The Translation Program Al-Wafy Is Useful for Getting a General Understanding of a Letter Written in English], Al-Hayat (Saturday, 25 October).

Sinaiko, H. W. and G. R. Klare: 1972, ‘Further Experiments in Language Translation: Readability of Computer Translations’, ITL 15, 1–29.

Sinaiko, H. W. and G. R. Klare: 1973, ‘Further Experiments in Language Translation: A Second Evaluation of the Readability of Computer Translations’, ITL 19, 29–52.

van Slype, G.: 1979, ‘Systran: Evaluation of the 1978 Version of the Systran English–French Automatic System of the Commission of the European Communities’, The Incorporated Linguist 18, 86–89.

Vasconcellos, Muriel (ed.): 1988, Technology as Translation Strategy, Binghamton, NY: State University of New York at Binghamton (SUNY).

White, John S., Theresa O’Connell and Francis O’Mara: 1994, ‘The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches’, in Technology Partnerships for Crossing the Language Barrier: Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, pp. 193–205.

Wilks, Y.: 1992, ‘SYSTRAN: It Obviously Works, But How Much Can It Be Improved?’, in John Newton (ed.), Computers and Translation: A Practical Appraisal, London: Routledge, pp. 166–188.