FP7-ICT-2011.4.2 Contract no.: 288342 www.xlike.org
Deliverable D3.3.2
Final machine translation based semantic annotation prototype
Editor:
Marko Tadić, UZG
Author(s): Marko Tadić, UZG; Matea Srebačić, UZG; Daša Berović, UZG; Danijela Merkler, UZG; Tin Pavelić, UZG.
Deliverable Nature: Prototype (P)
Dissemination Level: (Confidentiality)
Public (PU)
Contractual Delivery Date: M30
Actual Delivery Date: M36
Suggested Readers: All partners using the XLike Toolkit
Version: 0.5
Keywords: machine translation, semantic annotation, Cyc ontology
Disclaimer
This document contains material, which is the copyright of certain XLike consortium parties, and may not be reproduced or copied without permission.
All XLike consortium parties have agreed to full publication of this document.
The commercial use of any information contained in this document may require a license from the proprietor of that information.
Neither the XLike consortium as a whole, nor any individual party of the XLike consortium, warrants that the information contained in this document is capable of use, or that use of the information is free from risk, and accepts no liability for loss or damage suffered by any person using this information.
Full Project Title: XLike – Cross-lingual Knowledge Extraction
Short Project Title: XLike
Number and Title of Work package:
WP3 – Cross-lingual Semantic Annotation
Document Title: D3.3.2 Final machine translation based semantic annotation prototype
Editor (Name, Affiliation): Marko Tadić, UZG
Work package Leader (Name, affiliation)
Achim Rettinger, KIT
Estimation of PM spent on the deliverable:
8 PM
Copyright notice
© 2012–2014 Participants in project XLike
Executive Summary
The main goal of the XLike project is to extract knowledge from multilingual text documents by annotating statements in the sentences of a document with a cross-lingual knowledge base. The purpose of the final machine translation based semantic annotation prototype described here is to demonstrate how SMT systems could be used to translate from a natural language into a formal language; this translation would then be used as the semantic annotation of the natural language sentence. We describe further experiments using the Moses SMT system with extended translation models and present the evaluation of the results.
Table of Contents
Executive Summary
Table of Contents
List of Figures
List of Tables
Abbreviations
Definitions
1 Introduction
   1.1 Motivation
2 Related research: Semantic Parsing
3 Statistical Machine Translation techniques
   3.1 General Framework
   3.2 Early prototype: proof of concept
   3.3 Final prototype: additional experiments
4 Preparing the training data
   4.1 Generation of larger parallel corpus
   4.2 Generation of additional linguistic data
5 Using Moses
   5.1 Training Moses with a larger training set
   5.2 Training Moses with factor-based models
6 Evaluation of Translation
   6.1 Translation from English into CycL with En-EnSemRep-Model04
   6.2 Translation from English into CycL with Factor-based models
   6.3 Evaluation of the translation quality of En-EnSemRep-Model04
      6.3.1 Automatic evaluation of En-EnSemRep-Model04
      6.3.2 Human evaluation of En-EnSemRep-Model04
   6.4 Extrinsic evaluation
7 Conclusion
References
List of Figures
Figure 1. General diagram of a SMT system (from [4])
Figure 2. Example of training data in TMX format
Figure 3. Example of the English part of the factor-based training data in CoNLL format
Figure 4. Example of the CycL part of the factor-based training data
Figure 5. The training data uploaded as a parallel corpus to Let'sMT! platform
Figure 6. The En-EnSemRep-Model04 available for translation at Let'sMT! platform
Figure 7. Example of the English part of the factor-based training data in MOSES format
Figure 8. Example of the translation from English to CycL
Figure 9. Automatic evaluation of translation quality for En-EnSemRep-Model02 (above) and En-EnSemRep-Model04 (below) SMT systems
Figure 10. Sisyphos II screen with Comparative evaluation example of the better first translation
Figure 11. Sisyphos II screen with Comparative evaluation example of both translations equally good
Figure 12. Sisyphos II screen with Comparative evaluation example of both translations equally bad
Figure 13. Sisyphos II screen with Absolute evaluation scenario of 1000 Bloomberg sentences: Rubble example
Figure 14. Sisyphos II screen with Absolute evaluation scenario of 1000 Bloomberg sentences: Mainly non-fluent example
List of Tables
Table 1. Results of the human evaluation of translation quality of 1000 English sentences translated into CycL by the En-EnSemRep-Model04 SMT system
Table 2. Results of the human evaluation of translation quality of 1000 Bloomberg sentences translated into CycL by the En-EnSemRep-Model04 SMT system
Abbreviations
SL Source Language
TL Target Language
IL Interlingua
NL Natural Language
FL Formal Language
NL2FL Natural Language to Formal Language
MT Machine Translation
SMT Statistical Machine Translation
RBMT Rule Based Machine Translation
L Language
TM Translation Model
LM Language Model
NLP Natural Language Processing
SRL Semantic Role Labelling
WSD Word Sense Disambiguation
SP Semantic Parsing
Definitions
Parallel Corpus: A parallel corpus consists of documents that are translated directly into different languages.
Comparable Corpus: A comparable corpus, unlike a parallel corpus, contains no direct translations. Its documents may address the same topic and domain, but they can differ significantly in length, detail and style.
Source Language: The language of the text that is being translated.
Target Language: The language of the text into which the translation is being done.
Formal Language: An artificial language that uses a formally defined syntax. Its norm precedes its usage.
Natural Language: A language in which usage precedes the norm.
Language Pair: A unidirectional translation direction from the SL to the TL. Translation from La to Lb is one language pair and translation from Lb to La is another.
1 Introduction
In this deliverable we present the results of the research leading to the final machine translation based semantic annotation prototype. This part of the project was envisaged and covered by the research plan of WP3, namely Task T3.3.
1.1 Motivation
The main goal of the XLike project is to extract knowledge from multilingual text documents by different means, treating the documents at all possible levels: from the document collection, through documents as individual entities, down to the individual paragraphs and sentences that occur in these documents. The knowledge can be formally represented as statements in a formal language resembling a formal logic calculus or in any other semantically rich format (e.g. RDF triples), or as mappings from any of the mentioned levels of processing to a desired conceptual space (e.g. the Cyc ontology, Wikipedia, DBpedia, Linked Open Data, etc.).
Different work packages and their respective tasks within the XLike project examine different approaches to this problem, while Task T3.3, covered in this and the previous deliverable (D3.3.1), investigates how machine translation techniques could be exploited for cross-lingual semantic annotation.
The main idea behind this task is to investigate how the use of statistical machine translation (SMT) techniques could facilitate obtaining mappings between a text and its semantic representation(s). The development of this advanced prototype followed the idea presented in D3.3.1: would it be possible to train an SMT system to translate from a natural language as the source language into a formal language as the target language? The work presented here extends the early prototype that confirmed the proof of concept, i.e. that this idea, applicable in theory, really produces results usable by humans and/or machines for further processing once turned into a real SMT system. In this final prototype we used additional capabilities of SMT systems (such as larger training sets and factor-based SMT models) to train the translation model and the target language model.
2 Related research: Semantic Parsing
Work on natural language understanding within NLP has recently focused mainly on shallow semantic analysis (e.g. Semantic Role Labelling, SRL) and Word Sense Disambiguation (WSD). More ambitious, however, is the task of semantic parsing (SP): the construction of a complete semantic representation of a sentence, expressed in the form of FL statement(s). In this project task (T3.3) we tried to view this process as a translation from NL to FL.
While computing has a long tradition of translating from one FL into another FL (e.g. translating programming languages with compilers and/or interpreters), translation from a NL into a FL has usually been limited to translating commands formulated in NL into programming language commands/expressions or query language statements. The programming language commands/expressions were always strictly defined by the formal syntax of the programming language, while the syntax, and particularly the semantics, of query languages were mostly confined to limited domains with the use of predefined scripts or scenarios. There have been only a limited number of attempts to cover general purpose use of NL sentences and come up with their FL, i.e. semantic, representations. Two main research groups have worked on this task previously, mainly using existing state-of-the-art NLP techniques and machine learning approaches. One group was led at MIT by Michael Collins and his associate Luke Zettlemoyer, while at the University of Texas at Austin, Raymond Mooney led a group that included Yuk Wah Wong, Rohit J. Kate and Ruifang Ge.
The first attempts were mostly oriented towards a combination of rule-based approaches for processing the NL part of the task (e.g. the use of constituency-oriented context-free or context-sensitive grammars for parsing NL sentences) and rule-based or machine learning techniques for generating FL representations of the analysed sentences.
Luke Zettlemoyer's paper [9] introduced a framework that uses structured classification to perform semantic parsing. The follow-up paper [10] introduced a "relaxed grammar" to add flexibility, and it also slightly improved the learning algorithm to boost efficiency and make the algorithm incrementally applicable (it can learn instance by instance instead of in batch mode). Later, context-dependent parsing was proposed in [11], which was able to handle even discourse structures. Given a training corpus of NL sentences aligned with their FL representations, these approaches learn a model capable of translating sentences into the desired FL representation. However, a set of hand-crafted rules is used to learn syntactic categories and semantic representations of words based on combinatory categorial grammar (CCG). All these works use lambda calculus as the FL representation of NL sentences, while the training set sizes for the machine learning parts were between 800 and 4,500 sentences. Although the reported results were quite high, it should be noted that the experiments were conducted in a limited domain: air travel-planning dialogues (the ATIS domain).
Research on semantic parsing in Austin was more oriented towards semantic lexica, i.e. their automatic acquisition from NL texts [12]. Later, Wong and Mooney proposed a system called WASP [13,14], which makes use of some SMT techniques (e.g. using a synchronous grammar for word alignment to learn a semantic lexicon and rules for composing meaning representations). Kate and Mooney also developed a kernel-based semantic parsing system called KRISP [15,16,17], largely relying on machine learning techniques (such as SVMs). Both groups produced several PhD theses that contribute to research in this area [18,19,20].
In [21] a syntactic combinatory categorial (CCG) parser is also used to parse natural language sentences and to construct their semantic meaning as directed by the parse. The same parser is used for both tasks, and lambda calculus is again used as the FL representation of NL sentences.
However, none of these approaches proposed the use of the well-established general purpose SMT systems that were developed in the previous decade and made freely accessible, namely the Moses-based systems. One of the reasons can be seen in the size of the training data the mentioned SP systems had at their disposal: Moses-based systems require huge training sets of parallel texts in which one side of a pair is a NL sentence and its aligned counterpart is the FL representation of that sentence. These
training sets have to encompass millions of aligned NL sentences and their FL counterparts in order to obtain decent SMT results. The mentioned SP systems used only up to several thousand NL sentences, usually manually annotated with their FL semantic representations, thus forming a parallel corpus of NL sentences and their FL representations. The manual annotation was certainly labour-intensive and required highly skilled human annotators, thus preventing inexpensive large-scale processing. Such small corpora cannot be used to generate useful translation models in SMT systems.
We wanted to tackle this problem by trying a much larger parallel corpus for training an SMT system, which should be regarded here as just a special type of machine learning system. Such an early prototype system was developed and described in D3.3.1, while here we present the follow-up experiments in enlarging this system and evaluating its performance.
3 Statistical Machine Translation techniques
3.1 General Framework
The general SMT scenario involves collecting parallel data, aligning it at the sentence level, and using that data to train the SMT system and build a Translation Model (TM) for the transfer of words and phrases from the SL into the TL. In order to select between different probable translations and to produce the most appropriate (often also more natural) TL text, very large Language Models (LM) are used to adjust the final SMT system output. In Figure 1 the general SMT process is presented as a diagram.
Figure 1. General diagram of a SMT system (from [4])
In the depicted scenario a natural language (NL) serves as both the SL (Spanish) and the TL (English). Training is performed on a large Spanish-English parallel corpus from which the TM is built, while a large English monolingual corpus is used to build (train) the LM. The decoder applies the decoding algorithm to the TM outputs and uses the LM to select, as the final output, the most probable translation of the sentence s into the TL. This is the SMT process in a nutshell, and all SMT systems so far (including Moses) have been designed with a NL as the TL.
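The selection performed by the decoder can be summarised with the standard noisy-channel formulation of SMT (recalled here only as the textbook model behind Figure 1; the actual Moses decoder generalises it into a log-linear combination of feature functions):

\[
\hat{t} \;=\; \arg\max_{t} P(t \mid s) \;=\; \arg\max_{t} P(s \mid t)\, P(t)
\]

where s is the SL sentence, P(s | t) is estimated by the TM trained on the parallel corpus, and P(t) by the LM trained on the monolingual TL corpus.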
3.2 Early prototype: proof of concept
The idea behind this task is simple. Since the main goal of the XLike project is to build technology for extracting and representing knowledge from text cross-lingually in a language-independent (common) format, preferably a formally defined one, we suggested that this representation could be written in a formal language. In the Semantic Web community the representation of basic relations in the form of RDF triples has become a common way of representing knowledge involving concepts from a conceptual space or an ontology. However, populating conceptual spaces and ontologies with relations extracted from texts has been complicated and demanding, and it involves a lot of human effort if done in an entirely rule-based way. Wouldn't it be possible to apply an analogous shift in methodology here, like the one from rule-based
MT to statistical MT? This would involve the use of SMT techniques for automatic translation from natural language into formal language.
In D3.3.1 we showed that this initial idea works, based on the evaluation of the early prototype's SMT output with automatic scores (BLEU, NIST, TER) and human evaluation (Adequacy, Fluency). The results of the automatic evaluation were surprisingly good, so we checked them with human evaluation as well, which demonstrated that for more than 50% of the sentences either "full" or "major" content was conveyed. This led us to the additional experiments that we present here.
3.3 Final prototype: additional experiments
We planned the following experimental steps, intended to lead to the final prototype:
1. Generating a larger parallel corpus of aligned English-CycL "sentences" (1277K) than in the early prototype (650K).
2. Training a TM with the larger parallel corpus (1277K).
3. Translating the same test set of 1000 English sentences with the 1277K TM.
4. Comparative evaluation of the two SMT outputs (early and final prototype TMs) in order to determine the baseline TM.
5. Training TMs using additional linguistic information obtained from the English pipeline developed within WP2:
   a. training a TM (TMsynt) with the syntactic information that results from dependency parsing;
   b. training a TM (TMsrl) with the semantic role labels;
   c. training a TM (TMsynt+srl) with both the syntactic information and the semantic role labels.
6. Comparative evaluation of TMsynt, TMsrl and TMsynt+srl against the baseline TM, in order to check whether the additional linguistic information contributed to the quality of the SMT output.
7. For extrinsic evaluation purposes, absolute evaluation of the adequacy and fluency of 1000 English sentences extracted from on-line Bloomberg articles, in order to determine how the system behaves when dealing with English sentences from real-life texts.
However, as will be shown in the rest of this deliverable, some of the planned steps yielded poorer results than expected and their subsequent steps had to be abandoned.
4 Preparing the training data
The general process of generating FL "sentences" and their English counterparts from the Cyc ontology was described in detail in D3.3.1, so here we concentrate only on the additional features that were used in building the final prototype.
4.1 Generation of larger parallel corpus
The generation of English sentences aligned with FL "sentences" was done by the partners from IJS, since they operate the Cyc ontology as a whole. In the second generation run we obtained a filtered corpus that we call 650K, since it consisted of ca 650,000 aligned English-CycL "sentence" pairs. In the third generation run we obtained ca 1.87 million pairs of English and CycL "sentences". As in the early prototype development, we noticed that a lot of English sentences referred to relations between two concepts denoted by their IDs instead of by terms in plain English, so we filtered this output using the same procedure as in the data preparation for the early prototype. This filtering, applied to the third generation run, yielded 1,277,680 clean English-CycL aligned "sentence" pairs; we call this larger parallel corpus 1277K. This amount of data represents an acceptable quantity of parallel data for a thorough SMT experiment, particularly having in mind the monotonous nature of CycL as the TL. The training data were prepared in TMX format, an open XML industry-standard format for exchanging parallel textual data.
<tu>
  <tuv xml:lang="en">
    <seg>Zagreb, Croatia's longitude is 16 degrees</seg>
  </tuv>
  <tuv xml:lang="se">
    <seg>(#$longitude #$CityOfZagrebCroatia (#$Degree-UnitOfAngularMeasure 16.0))</seg>
  </tuv>
</tu>
<tu>
  <tuv xml:lang="en">
    <seg>Minnie Driver appeared in "Circle Of Friends"</seg>
  </tuv>
  <tuv xml:lang="se">
    <seg>(#$movieActors #$CircleOfFriends-TheMovie #$MinnieDriver)</seg>
  </tuv>
</tu>
<tu>
  <tuv xml:lang="en">
    <seg>lacrimal fluid is a type of bodily secretion</seg>
  </tuv>
  <tuv xml:lang="se">
    <seg>(#$genls #$LacrimalFluid #$Secretion-Bodily)</seg>
  </tuv>
</tu>
Figure 2. Example of training data in TMX format
Out of the prepared 1277K sentence pairs, a test set of 10,000 sentence pairs was set aside for evaluation purposes.
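To illustrate how such TMX data can be turned into the plain-text parallel files and the held-out test set used in the training described below, the following minimal Python sketch shows one possible way of doing it. It assumes the language codes visible in Figure 2 (the CycL side is stored under the code "se"); the file names and the fixed random seed are illustrative only and are not taken from our actual preparation scripts.

import random
import xml.etree.ElementTree as ET

# The xml:lang attribute as seen by ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(path, src_lang="en", tgt_lang="se"):
    """Extract aligned (English, CycL) segment pairs from a TMX file.
    In our data the CycL side is stored under the language code "se" (Figure 2)."""
    pairs = []
    for tu in ET.parse(path).getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None and seg.text:
                segs[lang] = " ".join(seg.text.split())
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

def write_parallel(pairs, prefix):
    """Write one segment per line into <prefix>.en and <prefix>.cycl."""
    with open(prefix + ".en", "w", encoding="utf-8") as f_en, \
         open(prefix + ".cycl", "w", encoding="utf-8") as f_cycl:
        for en, cycl in pairs:
            f_en.write(en + "\n")
            f_cycl.write(cycl + "\n")

if __name__ == "__main__":
    pairs = read_tmx_pairs("en2cyc_1277k.tmx")   # hypothetical file name
    random.seed(1)                               # illustrative seed for a reproducible split
    random.shuffle(pairs)
    write_parallel(pairs[:10000], "test")        # the 10,000 pairs held out for evaluation
    write_parallel(pairs[10000:], "train")       # the remaining pairs used for training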
4.2 Generation of additional linguistic data
To generate additional linguistic data for the English part of the English-CycL aligned sentence pairs, we used the English pipeline developed within WP2 (see D2.2.2). We produced three different translation
models (TMsynt, TMsrl, TMsynt+srl) that add the linguistic information to the English sentences. Figures 3 and 4 show examples of the factor-based training data.
No Token Lemma PoS MSD NE Dep. Synt. PWN sense WSD A1 A2
1 exactly exactly RB pos=adverb|type=general O 3 NMOD 00158309-r _ _ _
2 one one DT pos=determiner O 3 NMOD _ _ _ _
3 tail tail NN pos=noun|num=s O 4 SBJ 02157557-n _ A0 _
4 is be VBZ pos=verb|vform=personal|person=3 O 0 ROOT 02620587-v be.00 _ _
5 a 1 CD pos=number O 4 VC _ _ _ _
6 physical physical JJ pos=adjective O 7 NMOD 01778212-a _ _ _
7 part part NN pos=noun|num=s O 4 PRD 00720565-n officiate.01 A1 _
8 of of IN pos=preposition O 7 NMOD _ _ _ A1
9 every every DT pos=determiner O 10 NMOD _ _ _ _
10 snake snake NN pos=noun|num=s O 8 PMOD 01726692-n _ _ _
11 . . . pos=punctuation|type=period O 10 DEP _ _ _ _
Figure 3. Example of the English part of the factor-based training data in CoNLL format
(
  #$relationAllExistsUnique
  #$physicalParts
  #$Snake
  #$Tail-BodyPart
)
Figure 4. Example of the CycL part of the factor-based training data
The idea behind factor-based SMT is to use additional linguistic information in order to reduce the possible number of translation equivalents, relying on e.g. lemmas instead of tokens, the syntactic roles of words or phrases, and/or their semantic roles. We noticed in the previous experiment (D3.3.1) that the majority of "non-fluent" CycL statements had invalid CycL syntax, and we expected that additional linguistic information about syntactic and/or semantic roles would help the TM to produce adequate CycL relations, which would then generate CycL statements that follow the CycL syntax more closely.
5 Using Moses
The SMT system we used for the first part of this advanced experiment was the open-source SMT suite Moses¹. The Let'sMT! platform², an existing platform for generating SMT systems from one's own parallel data, was used for the first part of building the final prototype, i.e. for training with the larger 1277K training set. Since the Let'sMT! platform in its present form does not support factor-based SMT models, we had to use our own Moses installation for that part of the training and translation.
5.1 Training Moses with a larger training set
The prepared training data were fed into the Let'sMT! platform as a parallel corpus following the procedure of language name selection and data adaptation described in D3.3.1.
Figure 5. The training data uploaded as a parallel corpus to Let'sMT! platform
The en2cyc_1277k_train parallel corpus was used to train several SMT systems with different features in order to produce the final version of the system, En-EnSemRep-Model04, which was then used for this second experiment in translation from English to CycL.
¹ http://www.statmt.org/moses/
² http://www.letsmt.eu; see also the Let'sMT! project at http://www.letsmt.org.
Figure 6. The En-EnSemRep-Model04 available for translation at Let'sMT! platform
5.2 Training Moses with factor-based models
Since the Let'sMT! platform does not support factor-based translation models, we used our own Moses installation to produce these TMs. Training on a moderate server (2 Xeons @ 2.4 GHz with 64 GB RAM) took from several days up to a week for every TM. This is not comparable to the Let'sMT! platform, which uses Amazon cloud services on a flexible, on-demand basis and where the non-factor-based TMs were trained in several hours.
In order to train Moses for factor-based SMT, the output of the English WP2 pipeline (see the examples in Figures 3 and 4 in Section 4.2) had to be converted into the Moses training format for factor-based models. Figure 7 shows an example of the English part in this format.
exactly|3|NMOD one|3|NMOD tail|4|SBJ is|0|ROOT a|4|VC physical|7|NMOD part|4|PRD of|7|NMOD every|10|NMOD snake|8|PMOD .|10|DEP
Figure 7. Example of the English part of the factor-based training data in MOSES format
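The conversion from the CoNLL-style pipeline output (Figure 3) into this pipe-separated factored representation is essentially a column selection. The following Python sketch illustrates it under the column layout shown in Figure 3 (column 1 holds the token, columns 6 and 7 the dependency head and the syntactic function); the function names are ours and are not part of the WP2 pipeline or of Moses.

def read_conll_sentences(path):
    """Yield sentences from a CoNLL-style file; sentences are separated by blank lines.
    Each sentence is returned as a list of column lists (whitespace-separated columns)."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
            else:
                sentence.append(line.split())
    if sentence:
        yield sentence

def to_factored(sentence, factor_columns=(6, 7)):
    """Convert one sentence into the factored format of Figure 7 (token|head|function).
    Selecting the semantic role columns instead would give the TMsrl variant,
    and selecting both column sets the TMsynt+srl variant."""
    return " ".join(
        "|".join([cols[1]] + [cols[i] for i in factor_columns])
        for cols in sentence
    )

# For the sentence in Figure 3 this yields:
# exactly|3|NMOD one|3|NMOD tail|4|SBJ is|0|ROOT a|4|VC physical|7|NMOD
# part|4|PRD of|7|NMOD every|10|NMOD snake|8|PMOD .|10|DEP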
For the three different factor-based models we selectively used different types of data:
1. TMsynt: using the syntactic function tags (ROOT, SBJ, NMOD, etc.) in building the TM
2. TMsrl: using the semantic role tags (A0, A1, etc.) in building the TM
3. TMsynt+srl: using both the syntactic and the semantic tags in building the TM
6 Evaluation of Translation
This section describes the translation and the evaluation of the translations produced by the En-EnSemRep-Model04 SMT system and the factor-based SMT systems.
6.1 Translation from English into CycL with En-EnSemRep-Model04
The trained SMT systems can be used through the Let'sMT! platform service Translate only if they are in the running state. Similar to Google Translate, the Let'sMT! platform offers a SL box (where the user pastes the SL text) and a TL box (where the selected running SMT system provides the translation). Entire SL files can be translated without this web interface as well.
Figure 8. Example of the translation from English to CycL
6.2 Translation from English into CycL with Factor-based models
Our installation of Moses was set up in such a way that if the factor-based TM yields a translation output scored below a certain threshold, it does not produce the translation but returns the SL sentence instead. In our case this was the only outcome for all three factor-based TMs we trained: all output was below the threshold of ### and no TL output (i.e. no CycL statement) was produced. This effectively means that the output would have been evaluated by humans as rubble. Consequently, the evaluation of this output was not possible.
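For clarity, the behaviour of that set-up can be sketched as the following Python fragment (an illustration only; decode() stands in for the actual call to the Moses decoder, and the threshold is passed in as a parameter because the concrete value used in our configuration is not reproduced here):

def translate_or_fall_back(source_sentence, decode, score_threshold):
    """Return the CycL translation only when the decoder score reaches the threshold;
    otherwise return the SL sentence unchanged, which is how our installation behaved
    for all three factor-based TMs."""
    translation, score = decode(source_sentence)   # placeholder for the Moses decoder call
    if score < score_threshold:
        return source_sentence
    return translation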
The explanation for this undesired outcome should be sought in the amount of training data. When additional linguistic data are used, the number of training examples should grow at an even higher rate for factor-based TMs than for non-factor-based ones. Only in this way can Moses find, during its training phase, enough evidence for the different combinations of tokens and additional linguistic data. Our somewhat naïve assumption was that 1.2 million sentences with the highly simplified and repetitive syntactic structure of CycL statements would supply the factor-based TMs with enough evidence to generalise and to produce acceptable CycL statements out of English sentences. However, this expectation was not
confirmed by this experiment. Whether it would be possible without additional rule-based pre-editing or post-editing procedures remains to be checked in further investigations.
6.3 Evaluation of the translation quality of En-EnSemRep-Model04
As in D3.3.1, for the SMT output provided by En-EnSemRep-Model04 we applied both kinds of MT quality evaluation used in MT research: automatic evaluation and human evaluation.
6.3.1 Automatic evaluation of En-EnSemRep-Model04
At the end of the training process the Let'sMT! platform produces an automatic evaluation of the trained SMT system using the standard automatic evaluation measures such as the BLEU, NIST, TER and METEOR scores.
Figure 9. Automatic evaluation of translation quality for En-EnSemRep-Model02 (above) and En-EnSemRep-Model04 (below) SMT systems
As can be seen in Figure 9, the values of the automatic evaluation scores for En-EnSemRep-Model04 are higher in every cell, apart from the case-sensitive METEOR score, which is slightly lower. This was expected, since the 1277K training set, due to the monotonous nature of CycL as the TL, provided more evidence for the SMT system, and this is exactly why we took the direction of enlarging the training set and the TM in this final prototype. The automatic evaluation measures, which were developed primarily for evaluating a TL that is a NL, show very high values here. Such values are usually obtained for SMT systems trained to translate between very closely related natural languages (e.g. Swedish and English, Croatian and Serbian, etc.) with a large amount of regular lexical similarity and similar word order. However, the main reason for these values in the case of En-EnSemRep-Model04 should be sought in the very simple formal syntax of CycL, which probably artificially boosts the automatic evaluation measures.
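As a reminder of why a repetitive, highly regular TL such as CycL inflates n-gram based scores, recall the standard corpus-level BLEU definition (a textbook formulation, not specific to the Let'sMT! implementation):

\[
\mathrm{BLEU} \;=\; \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} \;=\; \min\!\big(1,\; e^{\,1 - r/c}\big)
\]

where p_n are the modified n-gram precisions (typically up to N = 4 with uniform weights w_n = 1/N), c is the total length of the candidate translations and r the total reference length. With a TL as monotonous as CycL, the higher-order n-gram precisions stay high even for partially incorrect outputs, which is consistent with the explanation given above.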
The omission of the TER score calculation by the Let'sMT! platform for En-EnSemRep-Model04 is somewhat surprising, since it was calculated earlier for the smaller model. Since TER (Translation Error Rate) is usually defined as an error metric for MT that measures the number of edits required to change a system output into one of the reference outputs [8], it might be that the Let'sMT! platform internally limits the number of allowed edits, and that the score was not calculated because of the larger number of editing operations. If we were primarily interested in SMT output evaluation, we would have given this fact a closer look and
tried to explain it, but here we are interested in producing an SMT system that can translate from NL to FL, and missing one of the four automatic evaluation scores did not affect our line of research at this point.
As with the early prototype evaluation (described in D3.3.1), we did not rely on the automatic evaluation alone, but performed a human evaluation as well.
6.3.2 Human evaluation of En-EnSemRep-Model04
In order to keep the results comparable with the early prototype, the same procedures and tools as in D3.3.1 were used for the human evaluation of this final machine translation based semantic annotation prototype. We used 1,000 sentences from the test set of 10,000 sentence pairs that had been set aside earlier (see Section 4.1). This human evaluation set of 1,000 sentences was translated with the En-EnSemRep-Model04 SMT system and the result was submitted to the evaluation process. The human evaluation was performed by three evaluators, each covering one third of the human evaluation set (i.e. 2 × 333 and 1 × 334 sentences).
The software used for the human evaluation was Sisyphos II, an open-source MT human evaluation package produced by the Munich-based LT company Linguatec within the ACCURAT project³, as a part of the ACCURAT Toolkit [7]. This suite of programs written in Java enables three different human evaluation scenarios: Absolute evaluation, Comparative evaluation and Post-editing evaluation.
In contrast to the early prototype evaluation, where only Absolute evaluation was possible since only one translation was produced, here we used the Comparative evaluation approach, which allowed us to compare the quality of the SMT output of two systems: En-EnSemRep-Model02 and En-EnSemRep-Model04.
The Comparative evaluation scenario is as follows. For each translated sentence, Sisyphos II displays the SL sentence and the two TL sentences, without any indication of which system produced which translation. In this way a possible bias of the human evaluator towards the first or the second translation is avoided. The human evaluator has four categories in which to place his/her judgement:
1. First translation better;
2. Both equally good;
3. Both equally bad;
4. Second translation better.
³ http://www.accurat-project.eu
Figure 10. Sisyphos II screen with Comparative evaluation example of the better first translation
Figure 11. Sisyphos II screen with Comparative evaluation example of both translations equally good
Figure 12. Sisyphos II screen with Comparative evaluation example of both translations equally bad
Cumulative results of the human Comparative evaluation are given in Table 1.

Category                    Occurrences    Percentage
First translation better    253            25.3%
Both equally good           291            29.1%
Both equally bad            211            21.1%
Second translation better   245            24.5%

Table 1. Results of the human evaluation of translation quality of 1000 English sentences translated into CycL by the En-EnSemRep-Model04 SMT system

The results in Table 1 show that the human evaluation scored the translation quality of the En-EnSemRep-Model04 SMT system much lower than the automatic evaluation did. The average comparative judgement falls into the category Both equally good (but close to Both equally bad), and the judgements are distributed almost equally across all four categories. This means that, as in the first experiment with the En-EnSemRep-Model02 SMT system, a good part of the content of the English sentences is conveyed into CycL, but not in accordance with the strict formal syntax of this FL. It also means that translation from English into CycL, as performed by either version of this SMT system, is not immediately applicable where statements with clean and regular CycL syntax are expected. Since the comparative human evaluation of the output of the smaller and the larger TM (En-EnSemRep-Model02 vs. En-EnSemRep-Model04) did not yield a significant difference in favour of the larger model, we can tentatively say that we have almost reached the saturation point in training, and it is questionable whether more training data would not simply start introducing noise. In addition, regarding system efficiency, the larger system took longer to train and required more computational resources in both the training and the translation phases.
6.4 Extrinsic evaluation
So far we have performed intrinsic evaluation, where the quality of the SMT system output was evaluated on an evaluation set of sentences taken from the same source as the training set. However, we also wanted to check how this SMT system behaves when confronted with a set of real-life sentences, i.e. sentences produced by humans in a real communicative scenario. To this end we randomly selected 1000 English sentences appearing in the on-line Bloomberg news on the same day and translated them with En-EnSemRep-Model04 from English (as the SL) into CycL (as the TL). The TL sentences were then evaluated by humans using the Absolute evaluation scenario described in more detail in D3.3.1.
Figure 13. Sisyphos II screen with Absolute evaluation scenario of 1000 Bloomberg sentences: Rubble example
Figure 14. Sisyphos II screen with Absolute evaluation scenario of 1000 Bloomberg sentences: Mainly non-fluent example
Cumulative results of the human Absolute evaluation of the translations of 1000 Bloomberg sentences are given in Table 2.

Category    Value                     Occurrences    Percentage
Adequacy    Full content conveyed     62             6.2%
            Major content conveyed    336            33.6%
            Some parts conveyed       349            34.9%
            Incomprehensible          253            25.3%
Fluency     Grammatical               39             3.9%
            Mainly fluent             124            12.4%
            Mainly non-fluent         358            35.8%
            Rubble                    479            47.9%

Table 2. Results of the human evaluation of translation quality of 1000 Bloomberg sentences translated into CycL by the En-EnSemRep-Model04 SMT system

The results in Table 2 show that the human evaluation of the En-EnSemRep-Model04 SMT system over 1000 real-life sentences gives an average Adequacy of Some parts conveyed (but close to Major content conveyed), while the average Fluency falls into the category Rubble: almost 48% of all translations are non-fluent CycL that breaches its syntactic rules, mostly due to mismatched parentheses. This means that a good part of the content of the English sentences is conveyed into CycL, but not in accordance with the strict formal syntax of this FL. It also means that translation from English into CycL, as performed by this SMT system, is not immediately applicable where statements with clean and regular CycL syntax are expected. This becomes apparent when the results are compared to the Absolute evaluation in the first evaluation scenario (D3.3.1). The share of Full content conveyed sentences dropped from 20.9% to 6.2%, i.e. by almost 15 percentage points. The share of Rubble CycL sentences grew from 40.7% to 47.9%, i.e. by more than 7 percentage points, while the share of Mainly non-fluent CycL sentences also grew, from 24.4% to 35.8%, i.e. by more than 11 percentage points. The bottom line is that more than 83% of all English sentences are translated either as Mainly non-fluent or as complete Rubble in CycL. This indicates that the application of SMT techniques in this scenario will not yield directly useful results for the semantic annotation of NL sentences.
7 Conclusion
In this deliverable we have reported on experiments that attempted to use an SMT system for translation from a NL as the SL into a FL as the TL. CycL was the FL of our choice because the training material could be produced inexpensively, by generating from the Cyc ontology a set of aligned pairs of English sentences with their respective CycL "sentences" as counterparts.
This parallel corpus served as the training material for the Moses-based SMT system that was used as the advanced, or final, prototype.
Judging by the automatic evaluation procedure, the scores of three standard automatic MT evaluation metrics (BLEU, NIST and METEOR) would suggest high-quality translation, since these scores were higher for the enlarged TM used to build the En-EnSemRep-Model04 system than for the previous early prototype, the En-EnSemRep-Model02 system. However, human evaluation applied intrinsically in the comparative evaluation scenario showed only slightly better performance of the enlarged TM, i.e. of En-EnSemRep-Model04 over En-EnSemRep-Model02.
On top of that, human evaluation was applied extrinsically in the absolute evaluation scenario on 1000 sentences randomly selected from Bloomberg texts. This evaluation showed that the application of En-EnSemRep-Model04 brought a drop in the number of sentences evaluated as conveying the full content, while the share of CycL statements evaluated as Mainly non-fluent or Rubble grew to more than 83%.
Given these extrinsic evaluation figures, it can be said that this approach cannot be usefully employed in the XLike processing platform at the present stage, at least for processing at the sentence level. Whether this approach could yield more acceptable results at the level of paragraphs or whole documents remains to be checked, but the current experimental setting did not allow the engagement of such large computing resources.
References
[1] Brown, P. F.; Della Pietra, S. A.; Della Pietra, V. J.; Mercer, R. L. (1993) The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, volume 19, number 2, pp. 263–311.
[2] Och, F. J.; Ney, H. (2003) A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics, volume 29, number 1, pp. 19–51.
[3] Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; Dyer, C.; Bojar, O.; Constantin, A.; Herbst, E. (2007) Moses: Open Source Toolkit for Statistical Machine Translation, Annual Meeting of the Association for Computational Linguistics (ACL2007), demonstration session, Prague, Czech Republic.
[4] Knight, K.; Koehn, P. (2003) What's New in Statistical Machine Translation (Tutorial slides), University of Southern California, pp. 4 (http://people.csail.mit.edu/people/koehn/publications/tutorial2003.pdf, accessed 2013-12-20).
[5] Davies, J.; Grobelnik, M.; Mladenić, D. (eds.) (2009) Semantic Knowledge Management, Springer, Berlin-Heidelberg.
[6] Buitelaar, P.; Cimiano, P. (eds.) (2008) Ontology Learning and Population: Bridging the Gap between Text and Knowledge, IOS Press, Amsterdam.
[7] Pinnis, M.; Ion, R.; Ştefănescu, D.; Su, F.; Skadiņa, I.; Vasiļjevs, A.; Babych, B. (2012) ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora. Proceedings of the ACL 2012 System Demonstrations, ACL, Jeju, South Korea, pp. 91–96.
[8] Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. (2006) A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the Association for Machine Translation in the Americas.
[9] Zettlemoyer, Luke S.; Collins, Michael (2005) Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence.
[10] Zettlemoyer, Luke S.; Collins, Michael (2007) Online learning of relaxed CCG grammars for parsing to logical form. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
[11] Zettlemoyer, Luke S.; Collins, Michael (2009) Learning Context-Dependent Mappings from Sentences to Logical Form. Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL2009).
[12] Thompson, Cynthia A.; Mooney, Raymond J. (2003) Acquiring word-meaning mappings for natural language interfaces. Journal of Artificial Intelligence Research, 18:1–44.
[13] Wong, Yuk Wah; Mooney, Raymond J. (2006) Learning for semantic parsing with statistical machine translation. Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL-2006), pp. 439–446.
[14] Wong, Yuk Wah; Mooney, Raymond J. (2007) Learning synchronous grammars for semantic parsing with lambda calculus. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007), pp. 960–967.
[15] Kate, R. J.; Mooney, R. J. (2006) Using string-kernels for learning semantic parsers. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL-06), pp. 913–920.
[16] Kate, R. J.; Mooney, R. J. (2007a) Learning language semantics from ambiguous supervision. Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI 2007), pp. 895–900.
[17] Kate, R. J.; Mooney, R. J. (2007b) Semi-supervised learning for semantic parsing using support vector machines. Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT-07), pp. 81–84.
[18] Zettlemoyer, Luke S. (2009) Learning to Map Sentences to Logical Form, PhD Thesis, MIT.
[19] Wong, Yuk Wah (2007) Learning for Semantic Parsing and Natural Language Generation Using Statistical Machine Translation Techniques, PhD Thesis, University of Texas at Austin.
[20] Kate, Rohit Jaivant (2007) Learning for Semantic Parsing with Kernels under Various Forms of Supervision, PhD Thesis, University of Texas at Austin.
[21] Baral, Chitta; Dzifcak, Juraj; Alvarez Gonzalez, Marcos; Zhou, Jiayu (2011) Using Inverse lambda and Generalization to Translate English to Formal Languages. Proceedings of the International Conference on Computational Semantics (IWCS) 2011, Oxford, pp. 35–44.