Chapter 7 Evaluation and Results
7.1 Introduction
The evaluation of a Machine Translation system and the measurement of translation
performance is a difficult and complex task. Many factors are involved, the most
important being that a natural language is not exact in the way that mathematical
models and theories in science are. Commercial MT systems therefore cannot
translate all texts reliably. [158]
7.2 Types of Evaluation [95]
Three broad classes of MT evaluation strategy are listed below:
Typological Evaluation seeks to specify which particular linguistic constructions
the system handles satisfactorily and which it does not. The principal tool for such
an investigation is a test suite: a set of sentences which individually represent
specified constructions and hence constitute performance probes.
Declarative Evaluation seeks to specify how an MT system performs relative to
various dimensions of translation quality.
Operational Evaluation seeks to establish how effective an MT system is likely to
be (e.g. in terms of cost-effectiveness) as part of a given translation process.
7.2.1 Typological Evaluation
Typological evaluation is primarily of interest to system developers. Potential
users may not be familiar with the linguistic descriptions used, nor is it likely to be
apparent how frequently some missing or badly-handled construction might occur
in their particular text type. The system is tested with a suite of sentences that
illustrate the particular types of linguistic constructions it is likely to encounter in
its lifetime. If the system is intended to operate within a particular subject field, its
design will obviously reflect this sublanguage. The test-suite approach has an
advantage over the corpus-based approach: a corpus contains a large amount of
redundancy, i.e. most constructions will be encountered more than once, whereas
in a test suite each combination of concepts appears only once.
Once a corpus has been established for the task in hand, statistical information
regarding the type and frequency of the lexical and grammatical phenomena
contained therein should be obtained, in order to evaluate the capability of the
system to successfully translate the sentences contained in the corpus. If good
observed frequency data are not available, the system's potential will be either
over-estimated or under-estimated. At present, however, this statistical
information must almost certainly be gathered by hand (a laborious process), as
no tool capable of parsing texts in this way is available. Once the relative
frequency of the phenomena contained in the corpus has been established, the
test suite can be constructed.
7.3 Metrics used for Automatic Evaluation
A human evaluation of an MT system is rather time-consuming and exhausting,
which makes it impractical for developers; it takes human labour which cannot be
reused. Human evaluations of Machine Translation (MT) consider many aspects of
translation, including adequacy, fidelity and fluency. [159] A metric that evaluates
Machine Translation output represents the quality of that output. The quality of a
translation is inherently subjective; there is no objective or quantifiable "good".
Therefore, any such metric must assign quality scores that correlate with human
judgments of quality: a metric should give high scores to the translations that
humans score high, and low scores to those for which humans give low scores.
Human judgment is the benchmark for assessing automatic metrics, as humans
are the end-users of any translation output. [161]
Many automated measures have been proposed to facilitate fast and cheap
evaluation of MT systems. Most efforts focus on devising metrics that measure
the closeness of the output of an MT system to one or more human reference
translations; the closer it is, the better the translation is considered to be.
Some Methods of Automatic Evaluation of MT are discussed below:
BLEU (BiLingual Evaluation Understudy): The rationale behind the development
of BLEU is that human evaluation of Machine Translation can be time-consuming
and expensive. An automatic evaluation metric, on the other hand, can be used
for frequent tasks like monitoring incremental system changes during
development, which are infeasible in a manual evaluation. The quality of
translation is indicated as a number between 0 and 1 and is measured as
statistical closeness to a given set of good-quality human reference translations.
It does not directly take into account translation intelligibility or grammatical
correctness. The primary programming task in a BLEU implementation is to
compare the n-grams of the candidate with the n-grams of the reference
translation and count the number of matches. These matches are
position-independent; the more matches, the better the candidate translation. To
compute the modified n-gram precision for any n, each candidate n-gram count is
clipped by the maximum number of times that n-gram occurs in any reference
translation; the clipped counts are summed and divided by the total number of
candidate n-grams. On a multi-sentence test set, the modified n-gram precision is
computed by the formula:
p_n = ( Σ_{C ∈ Candidates} Σ_{n-gram ∈ C} Count_clip(n-gram) ) / ( Σ_{C' ∈ Candidates} Σ_{n-gram' ∈ C'} Count(n-gram') )
This means that a word-weighted average of the sentence-level modified
precision is used rather than a sentence-weighted average.
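To make the computation concrete, the following minimal Python sketch implements corpus-level modified n-gram precision as defined above (function and variable names are illustrative, not taken from any particular BLEU implementation):

```python
from collections import Counter

def modified_ngram_precision(candidates, reference_sets, n):
    """Corpus-level modified n-gram precision p_n.

    candidates:     list of candidate translations, each a list of tokens
    reference_sets: for each candidate, a list of reference translations
    """
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    clipped = 0  # numerator: candidate counts clipped by reference maxima
    total = 0    # denominator: total number of candidate n-grams
    for cand, refs in zip(candidates, reference_sets):
        cand_counts = ngram_counts(cand)
        max_ref = Counter()  # max count of each n-gram over all references
        for ref in refs:
            for gram, count in ngram_counts(ref).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total += sum(cand_counts.values())
    return clipped / total if total else 0.0
```

Note that full BLEU additionally combines p_1 through p_4 geometrically and applies a brevity penalty; the sketch covers only the modified precision step described here.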
NIST: NIST is another method for evaluating the quality of machine-translated
text. It is based on the BLEU metric, with some alterations: it also calculates how
informative a particular n-gram is, and its brevity penalty is modified so that small
variations in translation length do not impact the overall score very much.
METEOR: The current version of the METEOR automatic evaluation metric scores
Machine Translation hypotheses by aligning them to one or more reference
translations. Alignments are based on exact, stem, synonym and paraphrase
matches between words and phrases. A lexical similarity score is then calculated
from the alignment for each hypothesis-reference pair. The metric includes
several free parameters that are tuned to emulate various human judgment tasks,
including adequacy, ranking, and HTER.
Word Error Rate (WER): WER works at the word level. It was originally used for
measuring the performance of speech recognition systems, but is also used in
the evaluation of Machine Translation. The metric is based on the number of
words that differ between a piece of machine-translated text and a reference
translation, and is computed from the Levenshtein distance: the minimum number
of substitutions, deletions and insertions that have to be performed to convert
the automatic translation into a valid translation.
PER (Position-Independent Word Error Rate): A shortcoming of the WER
measure is that it does not allow reordering of words. To overcome this problem,
the position-independent word error rate (PER) compares the words in the two
sentences without taking word order into account. [162]
TER (Translation Edit Rate) [163]: TER measures the amount of post-editing that
a human would have to perform to change a system output so that it exactly
matches a reference translation. Possible edits include insertions, deletions, and
substitutions of single words, as well as shifts of word sequences. All edits have
equal cost.
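Since WER builds directly on the Levenshtein distance described above, a short sketch may help. This is the standard dynamic-programming formulation, with illustrative names (PER and TER differ only in how they treat word order and block shifts):

```python
def word_error_rate(hypothesis, reference):
    """WER: word-level Levenshtein distance divided by reference length."""
    h, r = hypothesis.split(), reference.split()
    # d[i][j] = edit distance between the first i words of the reference
    # and the first j words of the hypothesis
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

# One substituted word in a four-word reference gives WER = 0.25
print(word_error_rate("he was going home", "he was coming home"))
```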
7.4 Related Work
The ALPAC report in 1966 described a study comparing different levels of human
translation with Machine Translation output, using human subjects as judges. It
employed two variables, fidelity and intelligibility. Fidelity (or accuracy) is a
measure of how much information the translated sentence retains compared to
the original; intelligibility is a measure of how understandable the output of
automatic translation is. [164] The Advanced Research Projects Agency (ARPA)
created a methodology to evaluate Machine Translation systems, and continues
to perform evaluations based on this methodology. The evaluation programme was
instigated in 1991, and continues to this day. It involved testing several systems
based on different theoretical approaches: statistical, rule-based and
human-assisted. A number of methods for the evaluation of the output from these
systems were tested in 1992, including comprehension evaluation, quality-panel
evaluation, and evaluation based on adequacy and fluency. [161] The first
approach to metric combination based on human likeness was given by
Corston-Oliver, who used decision trees to distinguish between human-generated
('good') and machine-generated ('bad') translations. [165] They suggested using
classifier confidence scores directly as a quality indicator, and high levels of
classification accuracy were obtained. However, they focused on evaluating only
the well-formedness of automatic translations (i.e., sub-aspects of fluency).
Preliminary results using Support Vector Machines were also discussed. Kulesza
and Shieber extended the approach of Corston-Oliver to take into account
aspects of quality beyond fluency alone. Instead of decision trees, they trained
Support Vector Machines (SVMs), using features inspired by well-known metrics
such as BLEU, NIST, WER, and PER. Metric quality was evaluated both in terms
of classification accuracy and in terms of correlation with human assessment at
the sentence level, and a significant improvement with respect to standard
individual metrics was reported. [167]
In a different research line, Akiba et al. suggested directly predicting human
scores of acceptability, approached as a multiclass classification task. They used
decision-tree classifiers trained on multiple edit-distance features based on
combinations of lexical, morphosyntactic and lexical-semantic information (e.g.,
word, stem, part-of-speech, and semantic classes from a thesaurus). Promising
results were obtained in terms of local accuracy over an internal predefined set of
overall quality-assessment categories. Quirk presented a similar approach, also
aiming to approximate human quality judgements, with the particularity that
human references were not required. Recently, Paul extended these works so as
to account for separate aspects of quality: adequacy, fluency and acceptability.
They used SVM classifiers to combine the outcomes of different automatic
metrics at the lexical level (BLEU, NIST, METEOR, GTM, WER, PER and TER).
Also very recently, Albrecht and Hwa re-examined the SVM-classification
approach of Kulesza and Shieber and of Corston-Oliver and, inspired by the work
of Quirk, suggested a regression-based learning approach to metric combination,
with and without human references. [170-171] Their results outperformed those of
Kulesza and Shieber in terms of correlation with human assessments. In a
different approach, Ye suggested treating sentence-level MT evaluation as a
ranking problem. They used the Ranking SVM algorithm to sort candidate
translations. Assessments were based on a 1-4 scale similar to the overall quality
categories used by Akiba. [161-165]
7.5 Approach followed for Evaluation of Machine Translation System
Different metrics are used to evaluate the different stages of the Machine
Translation System. In stage 1, the tagging of words with their parts of speech is
evaluated. Stage 2 corresponds to the evaluation of the phrase chunker, followed
by stage 3, the evaluation of the final translator. These stages are discussed in
the sub-sections that follow.
7.5.1 Evaluation of Part-of-Speech Tagging
The most commonly used evaluation measure for part-of-speech tagging is
accuracy, expressed either as a percentage or as a value between 0 and 1. The
accuracy measure is defined as:

Accuracy = (Number of Words with Correct Tags) / (Total Number of Words Tagged)
For evaluation, the part-of-speech tagger was applied to a set of 1000 sentences
collected from crime news in various newspapers and from legal documents; the
sentences contain about 8665 words. The outcome was manually evaluated to
mark the correct and incorrect tag assignments. The tagset contains a total of
527 tags, and a set of 1000 sentences with 8922 words was used as the tagged
corpus for training the part-of-speech tagger. Table 7.1 provides the tagging
results.
Table 7.1 Part-of-Speech Tagging Results

Unknown Words   Incorrect Tags   Correct Tags   Total Words
501             407              7757           8665
Based on the part-of-speech tagging results given in the above table and the
accuracy definition, the following accuracy measures (in percentage) were
calculated.
Accuracy 1 represents the typical accuracy result for our tagger, i.e. the total
number of words having a unique correct tag divided by the total number of
words tagged:

Accuracy 1 = (No. of Words with Correct Tags) / (Total No. of Words)
In the sentences chosen for testing the system, there are some words which are
not recognized since they are not present in the morph database. These include
some proper nouns that are not recognized by the proper-noun gazetteer when
they are not preceded or followed by any of the special words stored in our
database for recognizing proper nouns. Accuracy 2 is calculated by dividing the
number of correct tags by the difference between the total words and the
unknown words:

Accuracy 2 = (No. of Words with Correct Tags) / (Total No. of Words − Unknown Words)

Accuracy results are shown in Table 7.2 below.
Table 7.2 Part-of-Speech Tagging Accuracy

Accuracy 1   Accuracy 2
89.52%       95.01%
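As a quick arithmetic check, a minimal sketch (variable names ours) reproduces the two accuracy figures of Table 7.2 from the counts in Table 7.1:

```python
total_words = 8665    # words in the 1000 test sentences (Table 7.1)
unknown_words = 501   # words not found in the morph database
correct_tags = 7757   # words assigned the correct tag

accuracy_1 = correct_tags / total_words                    # all words
accuracy_2 = correct_tags / (total_words - unknown_words)  # known words only

print(f"Accuracy 1 = {accuracy_1:.2%}")  # 89.52%
print(f"Accuracy 2 = {accuracy_2:.2%}")  # 95.01%
```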
Comparison with Existing Systems
The accuracy measure was compared with the accuracy of the rule-based tagger
developed at Punjabi University, Patiala. The rule-based tagger has an accuracy
of 80.29% including unknown words and 88.86% excluding unknown words. By
using the statistical approach, the tagging accuracy including unknown words is
increased to 89.52%, and excluding unknown words it is raised to 95.01%. One of
the popular POS taggers is the TnT tagger, which has been shown to have high
accuracy for English and some other languages: it provides an overall tagging
accuracy of 96.64%, and specifically 97.01% on known words. Dandapat et al.
reported 95% accuracy for Hindi using an HMM. [136] Shacham reported 87.27%
accuracy for Hebrew. [174] Bharati and Mannem reported 67-77% accuracy for
Hindi, Telugu, and Bengali using 24 tags and applying various statistical
techniques. [142] A hybrid POS tagger for Tamil, combining an HMM technique
with a rule-based system, had a precision of 97.2%. The accuracy of the tagger
for the Machine Translation System in the present study is thus comparable to
highly accurate taggers.
7.5.2 Evaluation of Phrase Chunking
Phrase chunking covers noun phrases, verb phrases, adjective phrases and
postpositional phrases. The accuracy of the phrase chunker can be calculated
using the formulas defined below:
Precision = (Number of Correct Proposed Chunks) / (Number of Proposed Chunks)

Recall = (Number of Correct Proposed Chunks) / (Number of Correct Chunks)
Precision can be seen as a measure of exactness or conformity, and recall as a
measure of completeness. In other words, precision tells how accurate the
system is and recall specifies how complete the system is. On a scale of 0 to 1, a
value close to 1 is desirable for both of these measures. The Fβ measure, or
simply F-measure (for β = 1), is the weighted harmonic mean of precision and
recall; the value of β allows precision and recall to be weighted differently:

Fβ = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)

In all the experiments conducted in this and the next section, β is set to 1, giving
equal weight to both precision and recall. When β = 1, the weighted harmonic
mean reduces to

F = (2 × Precision × Recall) / (Precision + Recall)
The phrase chunker for the present study was manually evaluated on 1000
sentences whose structure falls within our system's input scope. The results for
three phrase-chunk types are provided in Table 7.3 below. Within the input scope,
sentences containing only an adjective phrase are almost negligible, so adjective
phrases were not taken into account while evaluating the chunker.
Table 7.3 Phrase Chunking Counts

Phrase Type             Proposed Chunks   Correct Proposed Chunks   Correct Chunks
Noun Phrase             1534              1426                      1512
Postpositional Phrase   244               200                       278
Verb Phrase             1022              910                       964
Grand Total             2800              2536                      2754
Based on the values in Table 7.3, the precision, recall, and F-measure for the
different chunk types are shown in Table 7.4 below, expressed as percentages.
Table 7.4 Phrase Chunking Results

Phrase Group Type       Precision   Recall   F-measure
Noun Phrase             92.95       94.31    93.62
Postpositional Phrase   81.96       71.94    76.62
Verb Phrase             89.04       94.39    91.63
Average                 87.98       86.88    87.29
As per the above table, the average precision comes out to be 87.98%, the
average recall 86.88%, and the average F-measure 87.29%, which reveals that if
the words are tagged accurately and the structure of the sentences follows the
assumptions, the precision and recall of the phrase chunker can reach high
levels. Precision and recall for noun phrases and verb phrases are higher than for
postpositional phrases, because the structure of the postpositional phrase has
complexities whereas noun phrases and verb phrases are relatively simpler.
Performance can be increased further by training the system with more phrases.
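As a check on the arithmetic, a small sketch (names ours) reproduces the noun-phrase row of Table 7.4 from the counts in Table 7.3:

```python
def chunk_scores(proposed, correct_proposed, correct, beta=1.0):
    """Precision, recall and F-beta for one chunk type.

    proposed:         number of chunks the system proposed
    correct_proposed: proposed chunks that are correct
    correct:          number of chunks in the gold standard
    """
    precision = correct_proposed / proposed
    recall = correct_proposed / correct
    f_beta = ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
    return precision, recall, f_beta

# Noun-phrase row of Table 7.3: 1534 proposed, 1426 correct proposed, 1512 gold
p, r, f = chunk_scores(1534, 1426, 1512)
print(f"P = {p:.2%}, R = {r:.2%}, F = {f:.2%}")
# P = 92.96%, R = 94.31%, F = 93.63% (Table 7.4 rounds slightly differently)
```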
Comparison with Existing Systems
Singh et al. followed a rule-based approach for Hindi and reported 91% precision
and 100% recall using 5 phrase tags. [153] A Tamil text chunker achieved a
precision of 97.4%. [175] The Hindi phrase chunker developed at IIT Kharagpur
has a precision of 87.22% and a recall of 94.62%. The recall of our chunker can
be improved by the addition of more rules.
7.5.3 Evaluation of Final Translation
The collection of sentences to be given as input for the evaluation of a Machine
Translation System varies across research projects according to their specific
research considerations. The selection of the test set is discussed below.
7.5.3.1 Selection of a Set
Appropriate selection of the sentences used to evaluate a Machine Translation
System is a very important aspect of MT evaluation. The test sentences may
correspond to the following:
Test Corpora: A test corpus is a collection of naturally occurring text in electronic
form.
Test Suites: A test suite is a collection of artificially constructed inputs, where
each input is designed to probe a system's treatment of a specific phenomenon
or set of phenomena. Inputs may be sentences, sentence fragments, or even
sequences of sentences.
Test Collections: A test collection is a set of inputs associated with a
corresponding set of expected outputs.
For the present system, a random selection of sentences was made for use as
input in the evaluation process. These sentences correspond to crime news taken
either from various Punjabi newspapers or from First Information Reports (FIRs)
gathered from local police stations and lawyers. The original sentences were too
complex, so they were first divided into simple sentences following the
assumptions made for the system. Sentence length has been restricted to a
maximum of 12 words, and phrase length to a maximum of 6 words. The test data
set is shown in Table 7.5.
Table 7.5 Test Data Set for the Evaluation of the Punjabi to English Machine
Translation System

Total Sentences   1000
Total Words       8665
7.5.3.2 Selection of Tests for Evaluation
A number of tests are available for evaluating Machine Translation Systems. In
the evaluation procedure for the present Machine Translation System, both
qualitative (subjective) and quantitative tests have been applied. There are two
subjective tests, the Intelligibility Test and the Accuracy Test, and one
quantitative test, the Word Error Rate (WER). These tests are explained below.
7.5.3.2.1 Intelligibility Tests
A traditional way of assessing the quality of translation is to assign scores to
output sentences. This test checks the intelligibility of the MT system's output,
where the intelligibility of a translated sentence is affected by grammatical errors,
mistranslations and untranslated words. A four-point scale is most adequate, in
that it measures intelligibility only, has a low scatter, and is of a sufficiently
discriminatory character, since the evaluation covers several hundred sentences
and the average calculated as a percentage is sufficiently precise. Scoring scales
give top marks to those sentences that look like perfect target-language
sentences and bottom marks to those that are so badly degraded as to prevent
the average translator/evaluator from guessing what a reasonable sentence might
be in the context. In between these two extremes, output sentences are assigned
higher or lower scores depending on their degree of awfulness. [176] The scale is
given in Table 7.6.
Table 7.6 Score Sheet for Intelligibility Test

Score   Significance
3       The sentence is perfectly clear and intelligible. It is grammatically correct.
2       The sentence is generally clear and intelligible. Despite some inaccuracies, one can understand the information it conveys.
1       The general idea is intelligible only after considerable study. The sentence contains grammatical errors and/or poor word choice.
0       The sentence is unintelligible. The meaning of the sentence is not understandable.
7.5.3.2.2 Accuracy Test / Fidelity Measure
Measuring intelligibility gives only a partial view of translation quality: a highly
intelligible output sentence need not be a correct translation of the source
sentence. It is important to check whether the meaning of the source-language
sentence is preserved in the translation. This property is called accuracy or
fidelity [176]. Scoring for accuracy is normally done in combination with (but
after) scoring for intelligibility. As with intelligibility, some sort of scoring scheme
for accuracy must be devised. Whilst it might initially seem tempting to have just
simple 'Accurate' and 'Inaccurate' labels, this could be somewhat unfair to an MT
system which routinely produces translations that are only slightly deviant in
meaning. The evaluation procedure is fairly similar to the one used for the scoring
of intelligibility, except that the scorers obviously have to refer to the
source-language text (or a high-quality translation of it, in case they cannot speak
the source language), so that they can compare the meaning of input and output
sentences.
A four-point scale is used, in which the highest score is assigned to sentences
that are completely faithful and the lowest score to sentences that are not
understandable and unacceptable. The scale is given in Table 7.7.
Table 7.7 Score Sheet for Accuracy Test

Score   Significance
3       Completely faithful.
2       Fairly faithful: more than 50% of the original information passes into the translation.
1       Barely faithful: less than 50% of the original information passes into the translation.
0       Completely unfaithful. Does not make sense.
7.5.4 Experiments
To evaluate the system, about 30 evaluators were chosen. They are well
qualified, most of them in the teaching profession, and have knowledge of both
languages as well as of the rules for translating Punjabi sentences into English.
Some of them are more familiar with English and have less knowledge of Punjabi,
but know Hindi; these persons were given the experiments related to the
intelligibility tests. The ratings for the individual translated sentences were then
summed separately for intelligibility and accuracy to obtain average scores, and
the percentages of accurate and of intelligible sentences were calculated.
7.5.4.1 Intelligibility Evaluation
For the intelligibility evaluation, the evaluators had no knowledge of the source
language, Punjabi. They judged each sentence of the target language, English,
i.e. the output of the translator, on the basis of its comprehensibility. The target
user is taken to be a layman who is interested only in the comprehensibility of the
translations. Intelligibility in this case is affected by grammatical errors,
mistranslations, and untranslated words.
7.5.4.1.1 Scoring
The scoring is based on the degree of intelligibility and comprehensibility. A
four-point scale is used, in which the highest score is assigned to sentences that
read like perfect target-language sentences and the lowest score to sentences
that are not understandable. The details are as follows:

Score 3: The sentence is perfectly clear and intelligible. It is grammatically correct
and reads like ordinary text.
Score 2: The sentence is generally clear and intelligible. Despite some
inaccuracies, one can understand immediately what it means.
Score 1: The general idea is intelligible only after considerable study. The
sentence contains grammatical errors and/or poor word choice.
Score 0: The sentence is unintelligible. Studying the meaning of the sentence is
hopeless; even allowing for context, one feels that guessing would be too
unreliable.
7.5.4.1.2 Results
According to the responses of the 30 respondents, who were asked to judge the
translated sentences on the 4-point scale discussed above, the observations are
as given in Table 7.8.
Table 7.8 Summary of Respondents' Perception of Translated Sentences for
Intelligibility Rating (number of sentences receiving each score)

Respondent   Score 0   Score 1   Score 2   Score 3
1            187       110       342       361
2            215       215       198       372
3            118       171       302       409
4            154       129       206       511
5            186       186       225       403
6            146       169       310       375
7            195       195       311       299
8            164       110       301       425
9            138       151       226       485
10           125       164       392       319
11           168       169       263       400
12           237       105       281       377
13           165       146       298       391
14           115       95        346       444
15           210       116       272       402
16           138       148       275       439
17           106       118       265       511
18           188       188       274       350
19           165       147       370       318
20           154       118       307       421
21           147       163       375       315
22           118       167       314       401
23           105       200       318       377
24           118       105       408       369
25           125       94        336       445
26           201       156       252       391
27           148       154       230       468
28           180       117       242       461
29           198       168       218       416
30           141       121       321       417
Percentage   15.85     14.65     29.26     40.24
The responses of the evaluators were analyzed and the following results were
observed:

40.24% of sentences got score 3, i.e. they were perfectly clear and intelligible.
29.26% of sentences got score 2, i.e. they were generally clear and intelligible.
14.65% of sentences got score 1, i.e. they were hard to understand.
15.85% of sentences got score 0, i.e. they were not understandable.

So we can say that about 69.50% of the sentences, namely those with a score of
2 or above, are intelligible.
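The aggregation behind these percentages is straightforward: for each score, the counts are summed over all 30 respondents and divided by the total number of judgments (30 respondents × 1000 sentences). A minimal sketch, with only the first two rows of Table 7.8 written out:

```python
# per-respondent counts of sentences at scores 0, 1, 2, 3 (from Table 7.8)
rows = [
    (187, 110, 342, 361),
    (215, 215, 198, 372),
    # ... the remaining 28 rows of Table 7.8 ...
]

score_totals = [sum(col) for col in zip(*rows)]
judgments = sum(score_totals)  # 30 x 1000 over the full table
percentages = [100 * t / judgments for t in score_totals]
intelligible = percentages[2] + percentages[3]  # score 2 or above; 69.50 over the full table
```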
Sample Translations with Intelligibility Results

S No.   English Sentence                        Score
1       He told his name surinder singh         2
2       Bag of cloth was searched               3
3       I was present on moment                 1
4       I should be informed of case number     3
5       Mouth of bottle is sealed with lid      3
6       18 bottle whisky found                  2
7       I is three boys and a girl              0
8       Elder son of mohan is balwinder singh   2
9       Age of balwinder is 34 year             2
10      He works in grain market                3
7.5.4.2 Accuracy Evaluation / Fidelity Measure
For the accuracy evaluation, the evaluators were provided with the source text
along with the translated text. A highly intelligible output sentence need not be a
correct translation of the source sentence, so it is important to check whether the
meaning of the source-language sentence is preserved in the translation. This
property is called accuracy.
7.5.4.2.1 Scoring
The scoring is done on the basis of the degree of fidelity. A four-point scale is
used, in which the highest score is assigned to sentences that are completely
faithful to the source and the lowest score to sentences that are not
understandable and unacceptable. The scale is described below:
Weight    Description
Score 3   Completely faithful.
Score 2   Fairly faithful: more than 50% of the original information passes into the translation.
Score 1   Barely faithful: less than 50% of the original information passes into the translation.
Score 0   Completely unfaithful. It does not make any sense.
7.5.4.2.2 Results
According to the responses of the 30 respondents, who were asked to judge the
translated sentences on the 4-point scale discussed above, the observations are
as given in Table 7.9.
Table 7.9 Summary of Respondents' Perception of Translated Sentences for
Accuracy Rating (number of sentences receiving each score)

Respondent   Score 3   Score 2   Score 1   Score 0
1            412       262       211       115
2            422       276       200       102
3            471       230       210       89
4            472       202       220       106
5            434       280       175       111
6            507       202       170       121
7            432       307       139       122
8            525       228       162       85
9            452       208       175       165
10           365       316       205       114
11           436       280       165       119
12           431       303       145       121
13           365       312       218       105
14           411       260       232       97
15           459       230       218       93
16           359       290       253       98
17           552       230       135       83
18           332       403       167       98
19           459       260       151       130
20           440       235       212       113
21           422       270       210       98
22           444       236       217       103
23           325       349       217       109
24           346       322       182       150
25           332       367       180       121
26           441       270       192       97
27           547       208       153       92
28           539       221       150       90
29           552       220       142       86
30           411       290       165       134
Percentage   43.65     26.89     18.57     10.89
Table 7.9 shows the total number of sentences rated by each respondent in each
score category. 43.65% of the sentences got score 3, i.e. they are completely
faithful; 26.89% got score 2, i.e. fairly faithful; 18.57% got score 1, i.e. barely
faithful; and 10.89% got score 0, i.e. completely unfaithful.

So we can say that about 70.54% of the sentences are faithful, i.e. they are either
completely correct translations or convey more than 50% of the original
information; these are the sentences scoring 2 or above. The results also show
that sentences conveying no meaning at all form the smallest category, whereas
completely meaningful, correct sentences form the largest. This points towards
the acceptability of the system.
Some Sample Translations with Accuracy Results

1. Punjabi: ਉਸ ਨੂੰ ਗੁਰੂ ਨਾਨਕ ਦੇਵ ਹਸਪਤਾਲ ਵਿਚ ਦਾਖਲ ਕਰਵਾਇਆ ਗਿਆ
   Transliteration: us nūṃ gurū nānak dēv haspatāl vic dākhal karvāiā giā
   English: He was admitted in guru nanak dev hospital
   Score: 3

2. Punjabi: ਉਹ ਆਪਣੇ ਕੁਝ ਸਾਥੀਆਂ ਨਾਲ ਆਏ ਸਨ
   Transliteration: uh āpaṇē kujh sāthīāṃ nāl āē san
   English: They came with their some friends
   Score: 2

3. Punjabi: ਉਸ ਨੇ ਹਮਲਾ ਕੀਤਾ
   Transliteration: us nē hamlā kītā
   English: He attacked
   Score: 3

4. Punjabi: ਮੋਹਨ ਉਹਨਾ ਨੂੰ ਜਖਮੀ ਕਰਕੇ ਫਰਾਰ ਹੋ ਗਿਆ
   Transliteration: mōhan uhnā nūṃ jakhmī karkē pharār hō giā
   English: Mohan ran away by injuring them
   Score: 1

5. Punjabi: ਉਹਨੇ ਮੋਟਰ ਸਾਇਕਲ ਵਿਚ ਗੱਡੀ ਮਾਰ ਦਿਤੀ
   Transliteration: uhnē mōṭar sāikal vic gaḍḍī mār ditī
   English: He stuck car in motorcycle
   Score: 1

6. Punjabi: ਭਗਵਾਨ ਸਿੰਘ ਦੀ ਮੌਤ ਹੋ ਗਈ।
   Transliteration: bhagvān siṅgh dī maut hō gaī.
   English: Bhagwan singh died
   Score: 3

7. Punjabi: ਉਹ ਆਪਣੇ ਰਿਸ਼ਤਦਾਰਾਂ ਕੋਲ ਜਾ ਰਿਹਾ ਸੀ।
   Transliteration: uh āpaṇē rishtadārāṃ kōl jā rihā sī.
   English: He was going to his relatives
   Score: 2

8. Punjabi: ਉਹ ਵਿਆਹ ਦੇਖਣ ਜਾ ਰਿਹਾ ਸੀ।
   Transliteration: uh viāh dēkhaṇ jā rihā sī.
   English: He was going to see marriage
   Score: 3

9. Punjabi: ਮੈ ਬਿਸ਼ਨ ਸਿੰਘ ਪੁੱਤਰ ਹਮੀਰ ਸਿੰਘ ਖੰਨੇ ਦਾ ਰਹਿਣਵਾਲਾ ਹਾਂ
   Transliteration: mai bishan siṅgh puttar hamīr siṅgh khannē dā rahiṇvālā hāṃ
   English: I am resident of khanna bishan singh son hameer singh
   Score: 0

10. Punjabi: ਚੋਰ ਹਾਰਡਵੇਅਰ ਦਾ ਸਾਮਾਨ ਚੋਰੀ ਕਰਕੇ ਲੈ ਗਏ
    Transliteration: cōr hārḍavēar dā sāmān cōrī karkē lai gaē
    English: Thieves took luggage of hardware
    Score: 1
7.5.4.3 Word Error Analysis
Error analysis was done against a pre-classified error list. All the errors in the
translated text were identified and their frequencies noted; errors were simply
counted, not weighted. After analyzing the test sentences, 1129 words out of
8665 were found to be incorrect, i.e. the Word Error Rate is 13.02%.
Table 7.10 Percentage of Each Type of Error Out of the Total Errors Found

Type of Word Error              Number of Words   Percentage of Errors
Wrongly Translated Words        113               10.01%
Untranslated Words              338               29.93%
Wrong Choice of Words           375               33.21%
Addition and Removal of Words   303               26.83%
From the above table it can be concluded that most of the errors are due to the
wrong choice of words. The main reason is that word-sense ambiguity has not
been resolved, which is why the system has been limited to the legal domain;
even within that domain, certain words remain ambiguous. Since the types of
sentences and phrases are limited to a particular order, some words that do not
fit the phrase rules are left untranslated. Because of structural differences
between the two languages, some words such as 'to', 'has', 'have' and 'had' need
to be inserted, and some words need to be deleted from the sentences. As the
tagging has high accuracy, the percentage of wrongly translated words is low.
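As an arithmetic check on the error analysis, a minimal sketch (counts from Table 7.10, names ours):

```python
errors = {
    "wrongly translated": 113,
    "untranslated": 338,
    "wrong word choice": 375,
    "addition/removal": 303,
}
total_errors = sum(errors.values())  # 1129
total_words = 8665

wer_percent = 100 * total_errors / total_words  # about 13.0%
shares = {kind: 100 * n / total_errors for kind, n in errors.items()}
print(shares)  # wrong word choice is the largest share, about 33%
```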
7.6 Comparison with Other Existing Systems
The accuracy levels of other existing systems are compared with the present
system in Table 7.11.
Table 7.11 Comparison of the Present System with Other Existing Systems

MT System                   Accuracy                                     Test Used
Hinglish                    Satisfactory results in over 90% of cases   Accuracy Test
Mantra (English-Hindi)      93%                                          Accuracy Test
English-Arabic              85%                                          Accuracy Test
Hindi-to-Punjabi            94%                                          Intelligibility Test
                            90.84%                                       Accuracy Test
Punjabi-English (present)   69.50%                                       Intelligibility Test
                            70.54%                                       Accuracy Test
The comparison shows that the present system has lower accuracy than the
systems listed above. This is because word-level ambiguity is not resolved here;
after adding a WSD module, the accuracy of the system could be improved
considerably.
7.7 Conclusion
By applying subjective tests and a quantitative metric, the Machine Translation
System for the translation of legal documents from Punjabi to English has been
found to score 69.50% on the intelligibility test and 70.54% on the accuracy test.
The accuracy can be improved by training the system with a larger corpus and by
adding a word-sense disambiguation module. Improving the post-processing
module could raise the accuracy and intelligibility levels even further.