

  • Proceedings of the Workshop

    on Iberian Cross-Language

    Natural Language Processing Tasks

    (ICL 2011)

    held in conjunction with

    27th Conference of the Spanish Society

    for Natural Language Processing

    Editors

    Paolo Rosso Jorge Civera

    Alberto Barrón-Cedeño Anabela Barreiro

    Marta Vila Iñaki Alegria

    Huelva, Spain, September 7th 2011

  • Preface

In the Iberian Peninsula, five official languages co-exist: Basque, Catalan, Galician, Portuguese and Spanish. Fostering multilinguality and establishing strong links among the linguistic resources developed for each language of the region is essential. Additionally, some of these languages lack published resources, which fosters a strong inter-relation between them and higher-resourced languages, such as English and Spanish.

In order to favour the intra-relation among the peninsular languages, as well as the inter-relation between them and foreign languages, multilingual NLP tools for different purposes need to be developed. Interesting topics to be researched include, among others, the analysis of parallel and comparable corpora, the development of multilingual resources, and language analysis in bilingual environments and within dialectal variations.

With the aim of solving these tasks, statistical, linguistic and hybrid approaches are proposed. Therefore, the workshop addresses researchers from different fields of natural language processing/computational linguistics: text mining, machine learning, pattern recognition, information retrieval and machine translation.

The research in these proceedings includes work on all of the official languages of the Iberian Peninsula. Moreover, interactions with English are also included. Wikipedia has proven to be an interesting resource for different tasks and has been analysed or exploited in several contributions.

Most of the regions of the Peninsula are represented by the authors of the contributions. The distribution is as follows: Basque Country (2 authors), Catalonia (7 authors), Galicia (4 authors), Portugal (2 authors) and Valencia (5 authors). Interestingly, those regions where Spanish is the only official language are not represented. It is worth noting that authors working beyond the Peninsula have also contributed to this workshop, including: Argentina (3 authors), Finland (1 author), France (2 authors), Mexico (1 author), Singapore (1 author), and USA (6 authors).

The ICL workshop has been organised as one of the activities of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems; the EC WIQ-EI IRSES project (grant no. 269180) within the FP7 Marie Curie People Framework; the FPU Grant AP2008-02185 from the Spanish Ministry of Education; and the MICINN Text-Enterprise 2.0 (TIN2009-13391-C04-03) and Text-Knowledge 2.0 (TIN2009-13391-C04-04) projects within the Plan I+D+i.

    Organising Committee

Paolo Rosso - Universitat Politècnica de València
Alberto Barrón-Cedeño - Universitat Politècnica de València
Marta Vila - Universitat de Barcelona
Jorge Civera - Universitat Politècnica de València
Anabela Barreiro - INESC-ID Lisbon
Iñaki Alegria - Euskal Herriko Unibertsitatea


  • Program Committee

Eneko Agirre - University of the Basque Country
Amparo Alcina - Universitat Jaume I
Iñaki Alegria - Euskal Herriko Unibertsitatea
Jesús Andrés Ferrer - Universitat Politècnica de València
Alexandra Balahur - DLSI, University of Alicante
Anabela Barreiro - INESC-ID Lisbon
Alberto Barrón Cedeño - Universitat Politècnica de València
Yassine Benajiba - Philips Research North America, Briarcliff Manor
Davide Buscaldi - Université d'Orléans
Paula Carvalho - University of Lisbon, Faculty of Sciences, LASIGE
Jorge Civera - Universitat Politècnica de València
Paul Clough - University of Sheffield
Iria Da Cunha - Institut Universitari de Lingüística Aplicada, UPF
Víctor Darriba - University of Vigo
Patrick Drouin - Université de Montréal
Antonio Ferrández - Universidad de Alicante
Mikel Forcada - DLSI, Universitat d'Alacant
Atsushi Fujita - Future University Hakodate
Miguel Angel García - University of Jaen
Veronique Hoste - University College Ghent, Ghent University
Zornitsa Kozareva - Information Sciences Institute
Sobha L. - AU-KBC Research Centre
Gorka Labaka - University of the Basque Country
François Laureau - Macquarie University
Codrina Lauth - Fraunhofer Inst. for Intelligent Analysis and Information Systems
Els Lefever - University College Ghent, Ghent University
Antonia Martí - Universitat de Barcelona
Fernando Martínez - Universidad de Jaen
Raquel Martínez - UNED
Mikhail Mikhailov - University of Tampere
Manuel Montes-y-Gómez - INAOE
Lidia Moreno - Universitat Politècnica de València
Roberto Paredes - Universitat Politècnica de València
David Pinto - Benemérita Universidad Autónoma de Puebla
Horacio Rodriguez - Universitat Politècnica de Catalunya
Paolo Rosso - Universitat Politècnica de València
Horacio Saggion - Universitat Pompeu Fabra
Luís Sarmento - Universidade do Porto
Grigori Sidorov - CIC-IPN
Alberto Simões - Universidade do Minho
Thamar Solorio - University of Alabama at Birmingham
Mariona Taulé - Universitat de Barcelona
Dan Tufis - Research Inst. for Artificial Intelligence, Romanian Academy
Marta Vila - Universitat de Barcelona
Jesús Vilares - Universidade da Coruña


  • ICL 2011 Program Committee

Luís Villaseñor - INAOE
Michael Zock - CNRS-LIF


  • Table of Contents

    I Exploitation and Analysis of Comparable and Parallel Corpora

Measuring Comparability of Multilingual Corpora Extracted from Wikipedia . . . 8
Pablo Gamallo Otero, Isaac González López

Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase . . . 14
Joan Albert Silvestre-Cerdà, Mercedes García-Martínez, Alberto Barrón-Cedeño, Jorge Civera, Paolo Rosso

Pivot Strategies as an Alternative for Statistical Machine Translation Tasks Involving Iberian Languages . . . 22
Carlos Henríquez, Marta R. Costa-Jussà, Rafael E. Banchs, Lluis Formiga, José B. Mariño

    II Bilingual Resources and Methods

A Bilingual Summary Corpus for Information Extraction and other Natural Language Processing Applications . . . 28
Horacio Saggion, Sandra Szasz

Extracción automática de léxico bilingüe: experimentos en español y catalán . . . 35
Raphaël Rubino, Iria da Cunha, Georges Linarès

A Particle Swarm Optimizer to Cluster Parallel Spanish-English Short-text Corpora . . . 43
Diego Ingaramo, Marcelo Errecalde, Leticia Cagnina, Paolo Rosso

    III Cross-Language Semantics and Opinion Mining

Cross-language Semantic Relations between English and Portuguese . . . 49
Anabela Barreiro, Hugo Gonçalo Oliveira

Generación semiautomática de recursos de Opinion Mining para el gallego a partir del portugués y el español . . . 59
Paulo Malvar Fernández, José Ramom Pichel Campos

    IV Bilingualism and Dialectal Variation

Language Dominance Prediction in Spanish-English Bilingual Children Using Syntactic Information: A First Approximation . . . 64
Gabriela Ramirez-de-la-Rosa, Thamar Solorio, Manuel Montes-y-Gómez, Yang Liu, Aquiles Iglesias, Lisa Bedore, Elizabeth Peña

Recursos y métodos de sustitución léxica en las variantes dialectales en euskera . . . 70
Larraitz Uria, Mans Hulden, Izaskun Etxeberria, Iñaki Alegria


  • Measuring Comparability of Multilingual Corpora Extracted from Wikipedia∗

Midiendo la comparabilidad de corpus multilingües extraídos de la Wikipedia

Pablo Gamallo Otero
Centro de Investigación en Tecnoloxías da Información (CITIUS)
Universidade de Santiago de Compostela
Galiza, Spain
[email protected]

Isaac González López
Cilenis S.L., Language Engineering Solutions
Santiago de Compostela
Galiza, Spain
[email protected]

Resumen: Los corpus comparables son muy útiles en variadas tareas del procesamiento del lenguaje tales como la extracción de léxicos bilingües. Con la mejora de la calidad de los corpus comparables, podemos mejorar la calidad de la extracción. Este artículo describe algunas estrategias para construir corpus comparables a partir de la Wikipedia, y propone una medida de comparabilidad. Fueron realizados algunos experimentos utilizando la Wikipedia portuguesa, española e inglesa.
Palabras clave: Extracción de Información, Corpus Comparables, Léxicos Bilingües, Comparabilidad

Abstract: Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on the Portuguese, Spanish, and English Wikipedia.
Keywords: Information Extraction, Comparable Corpora, Bilingual Lexicons, Comparability

    1. Introduction

Wikipedia is a free, multilingual, and collaborative encyclopedia containing entries (called "articles") in almost 300 languages (281 in July 2011). English is the best-represented one, with about 3 million articles. However, Wikipedia is not a parallel corpus, since its articles are not translations from one language into another. Many works published in recent years have focused on its use and exploitation for multilingual tasks in natural language processing: extraction of bilingual dictionaries (Yu and Tsujii, 2009; Tyers and Pienaar, 2008), alignment and machine translation (Adafre and de Rijke, 2006; Tomás, Bataller, and Casacuberta, 2001), and multilingual information retrieval (Potthast, Stein, and Anderka, 2008). There also exists

∗ This work has been supported by the Ministerio de Educación y Ciencia of Spain, within the project OntoPedia, ref: FF12010-14986.

theoretical work analysing symmetries and asymmetries among the different multilingual versions of an entry/article in Wikipedia (Filatova, 2009).

In addition, multilingual articles of Wikipedia have been used as a source to build comparable corpora (Gamallo and González, 2010). The EAGLES (Expert Advisory Group on Language Engineering Standards) guidelines (see http://www.ilc.pi.cnr.it/EAGLES96/browse.html) define a "comparable corpus" as one which selects similar texts in more than one language or variety. One of the main advantages of comparable corpora is their versatility: they can be used in many linguistic tasks (Maia, 2003), like bilingual lexicon extraction (Gamallo and Pichel, 2008; Saralegui, San Vicente, and Gurrutxaga, 2008), information retrieval, and knowledge engineering. Besides, they can also be used as training corpora to improve statistical machine learning systems, in particular when parallel corpora are scarce for a given pair of languages. Another advantage concerns their availability. In contrast with parallel corpora, which require (not always available) translated texts, comparable corpora are easily retrieved from the web. Among the different web sources of comparable corpora, Wikipedia is likely the largest repository of similar texts in many languages. We only require the appropriate computational tools to make them comparable.

By taking into account the multilingual potentialities of Wikipedia, our main objective is to define a method to measure the similarity (or degree of comparability) of different comparable corpora built from Wikipedia. For this purpose, we first describe some strategies to extract monolingual corpora in Portuguese, Spanish, and English from Wikipedia, making use of some categories ("Archaeology", "Biology", "Physics", etc.) to make them comparable according to a specific topic. These strategies were described in detail in (Gamallo and González, 2010). Then, we propose a comparability measure to verify whether the corpora are lowly or highly comparable. For many extraction tasks, such as bilingual lexicon extraction, using highly comparable corpora often leads to better results. There are some works proposing comparability measures between monolingual corpora (Li and Gaussier, 2010; Saralegui and Alegria, 2007), based on the use of existing bilingual dictionaries. However, instead of exploiting dictionaries to compute the comparability degree, we take advantage of the translation equivalents inserted in Wikipedia by means of interlanguage links.

This paper is organized as follows. Section 2 introduces two strategies to build comparable corpora from Wikipedia. Next, in Section 3, we propose some comparability measures. Then, Section 4 describes some experiments performed in order to measure the comparability between different corpora built using the strategies defined in Section 2. The last section discusses future tasks that will be implemented in order to extend and improve our tools.

2. Two Strategies to Build Wikipedia-Based Comparable Corpora

The input to our strategies is CorpusPedia¹, a friendly and easy-to-use XML structure generated from Wikipedia dump files. In CorpusPedia, all the internal links found in the text are put in a vocabulary list identified with the tag links. In the same way, all the categories (or topics) used to classify each article are inserted in the tag category. In addition, there is a tag called translations which codifies a list of interlanguage links (i.e., links to the same articles in other languages) found in each article. Categories and translations are very useful features to build comparable corpora. Given these features, we developed two strategies aimed at extracting corpora with different degrees of comparability.
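The article fields described above (links, category, translations) can be sketched as follows. Note that the XML layout and the field names here are illustrative assumptions; the actual CorpusPedia schema is defined by the tool referenced in the footnote.

```python
# Sketch of reading a CorpusPedia-style article record. The XML layout
# below is hypothetical; the real schema is defined by the CorpusPedia tool.
import xml.etree.ElementTree as ET

SAMPLE = """
<article lang="en" title="Archaeology">
  <links>excavation stratigraphy artifact</links>
  <category>Archaeology</category>
  <translations>es:Arqueologia pt:Arqueologia</translations>
</article>
"""

def parse_article(xml_string):
    """Return the fields used to build comparable corpora:
    internal links (vocabulary), categories, interlanguage links."""
    node = ET.fromstring(xml_string)
    return {
        "lang": node.get("lang"),
        "title": node.get("title"),
        "links": node.findtext("links", "").split(),
        "categories": node.findtext("category", "").split(),
        # interlanguage links map a language code to the article title there
        "translations": dict(
            pair.split(":", 1) for pair in node.findtext("translations", "").split()
        ),
    }

article = parse_article(SAMPLE)
print(article["translations"])  # {'es': 'Arqueologia', 'pt': 'Arqueologia'}
```

The interlanguage-link dictionary is what both extraction strategies below rely on.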

Not-Aligned Corpus: This strategy extracts those articles in two languages having the same topic in common, where the topic is represented by a category and its translation (for instance, the English-Spanish pair "Archaeology-Arqueología"). It results in a not-aligned comparable corpus, consisting of texts in two languages. We call it "not-aligned" because the version of an article in one language may not have its corresponding version in the other language.

Aligned Corpus: The goal is to extract pairs of bilingual articles related by interlanguage links if at least one of the two contains a required category. This results in a comparable corpus that is aligned article by article.
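The two strategies can be sketched over parsed article records; this is a minimal illustration under assumed record fields ("lang", "title", "categories", "translations"), not the actual CorpusPedia API.

```python
# Sketch of the two extraction strategies over parsed article records.
# Record fields ("lang", "title", "categories", "translations") are
# illustrative assumptions, not the actual CorpusPedia interface.

def not_aligned_corpus(articles, topic_l1, topic_l2, l1, l2):
    """All articles in either language tagged with the topic category
    (or its translation); articles need not have a counterpart."""
    part1 = [a for a in articles if a["lang"] == l1 and topic_l1 in a["categories"]]
    part2 = [a for a in articles if a["lang"] == l2 and topic_l2 in a["categories"]]
    return part1, part2

def aligned_corpus(articles, topic_l1, topic_l2, l1, l2):
    """Pairs of articles related by an interlanguage link where at least
    one member carries the required category."""
    by_title = {(a["lang"], a["title"]): a for a in articles}
    pairs = []
    for a in articles:
        if a["lang"] != l1 or l2 not in a["translations"]:
            continue
        b = by_title.get((l2, a["translations"][l2]))
        if b and (topic_l1 in a["categories"] or topic_l2 in b["categories"]):
            pairs.append((a, b))
    return pairs

# Toy data: "Dig" pairs with "Excavación" through its interlanguage link;
# "Otra" carries the Spanish topic category but has no counterpart.
articles = [
    {"lang": "en", "title": "Dig", "categories": ["Archaeology"],
     "translations": {"es": "Excavación"}},
    {"lang": "es", "title": "Excavación", "categories": [],
     "translations": {"en": "Dig"}},
    {"lang": "es", "title": "Otra", "categories": ["Arqueología"],
     "translations": {}},
]
```

On this toy data, the not-aligned strategy keeps "Dig" and "Otra" independently, while the aligned strategy produces the single pair ("Dig", "Excavación").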

In Section 4, we will measure the degree of comparability of corpora built by means of these two strategies. Before that, we define how to measure comparability between Wikipedia-based corpora.

    3. Comparability Measures

For a comparable corpus $C$ of Wikipedia articles, constituted for instance by a Portuguese part $C_p$ and a Spanish part $C_s$, a comparability coefficient can be defined on the basis

1 The software to build CorpusPedia, as well as CorpusPedia files for English, French, Spanish, Portuguese, and Galician, is freely available at http://gramatica.usc.es/pln/


of finding, for each Portuguese term $t_p$ in the vocabulary $C_p^v$ of $C_p$, its interlanguage link (or translation) in the vocabulary $C_s^v$ of $C_s$. The vocabulary of a Wikipedia corpus is the set of "internal links" found in that corpus. So, the two corpus parts, $C_p$ and $C_s$, tend to have a high degree of comparability if many internal links in $C_p^v$ can be translated (by means of interlanguage links) into internal links in $C_s^v$. Let $Trans_{bin}(t_p, C_s^v)$ be a binary function which returns 1 if the translation of the Portuguese term $t_p$ is found in the Spanish vocabulary $C_s^v$. The binary Dice coefficient, $Dice_{bin}$, between the two parts of a comparable corpus $C$ is then defined as:

$$Dice_{bin}(C_p, C_s) = \frac{2 \sum_{t_p \in C_p^v} Trans_{bin}(t_p, C_s^v)}{|C_p^v| + |C_s^v|}$$

We consider that it is not necessary to define the counterpart of the translation function, since the number of ambiguous terms is very low in Wikipedia, and most cases of ambiguity are solved with the so-called "disambiguation pages".

To avoid a bias towards common internal links, that is, towards those links occurring in most articles, we define a specific version of the tf-idf weight for each term. In particular, $tfidf(t_p)$ is the frequency of term $t_p$ in the Portuguese part of the comparable corpus, multiplied by its inverse article frequency in the whole Portuguese Wikipedia. By taking into account the tf-idf of terms, we can define a weighted measure of comparability. Let $Trans_{tfidf}(t_p, C_s^v)$ be a function which returns the smallest value (min) of two tf-idf scores, $tfidf(t_p)$ and $tfidf(t_s)$, where $t_s$ is the Spanish translation of $t_p$ in the Spanish part $C_s$. The weighted Dice coefficient, $Dice_{tfidf}$, between the two parts of a comparable corpus $C$ is then defined as follows:

$$Dice_{tfidf}(C_p, C_s) = \frac{2 \sum_{t_p \in C_p^v} Trans_{tfidf}(t_p, C_s^v)}{\sum_{t_p \in C_p^v} tfidf(t_p) + \sum_{t_s \in C_s^v} tfidf(t_s)}$$

The experiments described in the next section will be performed with the two comparability measures defined here.
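The two coefficients can be computed directly from the vocabularies; a minimal sketch follows, assuming the interlanguage-link mapping and the tf-idf weights have been precomputed (variable names here are illustrative).

```python
# Minimal sketch of the two comparability coefficients defined above.
# The translation mapping (interlanguage links) and the tf-idf weights
# are assumed to be precomputed.

def dice_bin(vocab1, vocab2, trans):
    """Binary Dice: counts terms whose translation appears in the other part."""
    hits = sum(1 for t in vocab1 if trans.get(t) in vocab2)
    return 2.0 * hits / (len(vocab1) + len(vocab2))

def dice_tfidf(vocab1, vocab2, trans, tfidf1, tfidf2):
    """Weighted Dice: Trans_tfidf takes the smaller of the two tf-idf
    scores when the translation is present, 0 otherwise."""
    num = sum(min(tfidf1[t], tfidf2[trans[t]])
              for t in vocab1 if trans.get(t) in vocab2)
    den = sum(tfidf1[t] for t in vocab1) + sum(tfidf2[t] for t in vocab2)
    return 2.0 * num / den

# Toy vocabularies (sets of internal links) for a pt/en corpus pair.
vocab_pt = {"fisica", "energia"}
vocab_en = {"physics", "energy"}
trans = {"fisica": "physics"}   # "energia" has no interlanguage link here
tfidf_pt = {"fisica": 2.0, "energia": 1.0}
tfidf_en = {"physics": 3.0, "energy": 1.0}

print(dice_bin(vocab_pt, vocab_en, trans))                        # 0.5
print(dice_tfidf(vocab_pt, vocab_en, trans, tfidf_pt, tfidf_en))  # 4/7 ≈ 0.571
```

The toy numbers show the weighting effect: one translated link out of two terms per side gives a binary score of 2·1/4 = 0.5, while the weighted score 2·min(2.0, 3.0)/(3.0 + 4.0) = 4/7 reflects how much tf-idf mass the shared link carries.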

    4. Experiments and Results

Taking CorpusPedia as the input source, we performed several experiments to build different comparable corpora for three language pairs, namely Portuguese-Spanish,

Portuguese-English, and Spanish-English. These corpora were built using the two strategies described in Section 2 and five domain-specific seed terms (in the three languages) considered as representative of five domain topics: "Archaeology", "Linguistics", "Physics", "Biology", and "Sport".

Table 1 shows the (binary and tf-idf) Dice scores obtained from measuring the comparability degree of 30 different comparable corpora. For each corpus, the table also shows the size (in Mb) of its two parts. In particular, the first column introduces the two languages of the corpus (pt = Portuguese, sp = Spanish, en = English) and the type of strategy (aligned or not aligned) used to build it. In the second and third columns, we show the two Dice scores. The fourth column shows the size of the two parts of the corpus, and the last column contains the two seed terms employed to generate the corpus. In Table 2, we show the Dice scores as well as the size of nine pairs of monolingual corpora randomly generated from Wikipedia.

We can first observe that there are significant differences in terms of comparability between the Dice scores in Table 1 and those obtained from the randomly generated monolingual pairs in Table 2. It follows that corpora built by means of our strategies (not aligned and aligned) are actually comparable. Then, we should note that in the comparable corpora of Table 1, the Dice scores based on tf-idf are about 70% higher than those based on the binary function. By contrast, in randomly generated corpora (Table 2), there are no significant differences between Dice_bin and Dice_tfidf. This means that our tf-idf weighting makes the Dice similarity score higher if the two evaluated corpus parts are actually comparable.

As expected, not-aligned corpora tend to be larger than the aligned ones. However, if we just compare the smallest part of each corpus, the differences are not very important: the smallest parts of not-aligned corpora are only 15% larger than those of aligned corpora. This is in accordance with the fact that aligned corpora are more balanced in terms of size, since neither part is much larger than the other. As far as corpus size is concerned, let us note that, on average, the English parts are clearly larger than the Spanish ones, which are in turn slightly larger than the Portuguese ones. In general, English articles tend to have more words than Spanish and Portuguese articles. As suggested by one of the reviewers of this article, one of the reasons for the difference in size in the case of aligned corpora is that Spanish and Portuguese entries seem to be summaries of the English ones. So, to increase comparability between an aligned pair of articles, the longer article could be shortened by removing those parts which are not present in the other language, obtaining in this way a more comparable pair of articles.

Table 1: Dice similarity between several comparable corpora in Portuguese, Spanish, and English.

Corpora               Dice(bin)  Dice(tf-idf)  Size (Mb)   Seed terms
pt-sp (not aligned)   .068       .086          0.6/3.4     Arqueologia, Arqueología
pt-en (not aligned)   .041       .067          0.6/8.4     Arqueologia, Archaeology
sp-en (not aligned)   .090       .140          0.4/8.4     Arqueología, Archaeology
pt-sp (aligned)       .179       .199          0.4/0.2     Arqueologia, Arqueología
pt-en (aligned)       .127       .140          0.4/1.1     Arqueologia, Archaeology
sp-en (aligned)       .181       .226          2.0/2.9     Arqueología, Archaeology
pt-sp (not aligned)   .078       .129          0.8/1.7     Linguística, Lingüística
pt-en (not aligned)   .054       .136          0.8/5.1     Linguística, Linguistics
sp-en (not aligned)   .074       .170          1.7/5.1     Lingüística, Linguistics
pt-sp (aligned)       .140       .214          0.6/0.8     Linguística, Lingüística
pt-en (aligned)       .128       .194          0.5/1.2     Linguística, Linguistics
sp-en (aligned)       .150       .257          0.9/1.7     Lingüística, Linguistics
pt-sp (not aligned)   .200       .374          4.4/4.8     Física, Física
pt-en (not aligned)   .123       .287          4.4/12      Física, Physics
sp-en (not aligned)   .270       .403          4.8/12      Física, Physics
pt-sp (aligned)       .237       .390          3.6/4.7     Física, Física
pt-en (aligned)       .178       .348          3.8/11      Física, Physics
sp-en (aligned)       .220       .387          3.4/7.6     Física, Physics
pt-sp (not aligned)   .130       .227          2.4/1.5     Biologia, Biología
pt-en (not aligned)   .102       .193          2.4/9.4     Biologia, Biology
sp-en (not aligned)   .068       .129          1.5/9.4     Biología, Biology
pt-sp (aligned)       .197       .328          1.6/2.8     Biologia, Biología
pt-en (aligned)       .186       .308          1.8/4.5     Biologia, Biology
sp-en (aligned)       .213       .294          0.9/1.3     Biología, Biology
pt-sp (not aligned)   .083       .148          11/35       Desporto, Deporte
pt-en (not aligned)   .026       .085          11/333      Desporto, Sport
sp-en (not aligned)   .047       .136          35/333      Deporte, Sport
pt-sp (aligned)       .175       .266          9.7/15      Desporto, Deporte
pt-en (aligned)       .189       .334          11/20       Desporto, Sport
sp-en (aligned)       .206       .290          20/29       Deporte, Sport
pt-sp (not aligned)   .111       .192          3.8/9.3     Overall
pt-en (not aligned)   .069       .153          3.8/73      Overall
sp-en (not aligned)   .109       .195          9.3/73      Overall
pt-sp (aligned)       .185       .279          3.2/4.7     Overall
pt-en (aligned)       .161       .264          3.5/7.6     Overall
sp-en (aligned)       .194       .290          6.2/8.5     Overall

Table 2: Dice similarity between randomly generated pairs of monolingual corpora.

Corpora           Dice(bin)  Dice(tf-idf)  Size (Mb)
pt-sp1 (random)   .012       .012          2.2/0.9
pt-en1 (random)   .003       .003          2.2/0.4
sp-en1 (random)   .003       .003          0.9/0.4
pt-sp2 (random)   .016       .014          1.5/3.0
pt-en2 (random)   .017       .014          1.5/42
sp-en2 (random)   .017       .015          3.0/42
pt-sp3 (random)   .008       .006          0.2/0.5
pt-en3 (random)   .001       .001          0.2/1.4
sp-en3 (random)   .005       .005          0.5/1.4

Finally, as expected, aligned corpora are significantly more comparable (i.e., higher Dice coefficient) than not-aligned corpora. On average, Dice_tfidf increases the comparability of aligned corpora by 80% with regard to not-aligned ones. So, considering that aligned corpora only decrease 15% in size in relation to not-aligned corpora, we can conclude that the aligned strategy seems to be more appropriate to build comparable corpora from Wikipedia.

    5. Conclusions and Future Work

The emergence of multilingual resources, such as Wikipedia, makes it possible to design new methods and strategies to compile corpora from the web, methods that are more efficient and powerful than the traditional ones. In particular, the semi-structured information underlying Wikipedia turns out to be very useful to build comparable corpora. In this article, we proposed two strategies to build comparable corpora from Wikipedia and a way to measure their degree of comparability. The experiments led us to conclude that corpora aligned article by article are more comparable than not-aligned corpora. Besides, they consist of two balanced corpus parts in terms of size. Finally, they are not much smaller than not-aligned corpora.

In future work, we will focus on how to improve the strategies to build comparable corpora by extending coverage (more articles) without losing comparability. For this purpose, we will test and evaluate techniques to expand categories using a list of similar terms identified as hyponyms or co-hyponyms of the source category. In order to find hyponyms and co-hyponyms of a term, it will be necessary to build an ontology of categories using the semi-structured information of Wikipedia (Chernov et al., 2006; Ponzetto and Navigli, 2009; de Melo and Weikum, 2010). On the other hand, we will evaluate comparability in an indirect way. In particular, we will use the generated corpora in tasks requiring comparable corpora as input (e.g., bilingual lexicon extraction). The better the extracted lexicon, the more comparable the input corpus should be. Finally, we believe that our method for aligning pairs of articles could be useful for related tasks, such as the alignment of Wikipedia infoboxes in different languages (Adar, Skinner, and Weld, 2009).

Bibliography

Adafre, S.F. and M. de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 62-69.

Adar, Eytan, Michael Skinner, and Daniel S. Weld. 2009. Information arbitrage across multi-lingual Wikipedia. In Second ACM International Conference on Web Search and Data Mining, WSDM.

Chernov, Sergey, Tereza Iofciu, Wolfgang Nejdl, and Xuan Zhou. 2006. Extracting semantic relationships between Wikipedia categories. In SemWiki2006 - From Wiki to Semantics, Budva, Montenegro.

de Melo, Gerard and Gerhard Weikum. 2010. MENTA: inducing multilingual taxonomies from Wikipedia. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1099-1108.

Filatova, Elena. 2009. Directions for Exploiting Asymmetries in Multilingual Wikipedia. In CLEAWS3, pages 30-37, Colorado.

Gamallo, Pablo and Isaac González. 2010. Wikipedia as a multilingual source of comparable corpora. In LREC 2010 Workshop on Building and Using Comparable Corpora, pages 19-26, Valletta, Malta.

Gamallo, Pablo and José Ramom Pichel. 2008. Learning Spanish-Galician Translation Equivalents Using a Comparable Corpus and a Bilingual Dictionary. LNCS, 4919:413-423.

Li, Bo and Eric Gaussier. 2010. Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In 23rd International Conference on Computational Linguistics (COLING 2010), pages 644-652.

Maia, Belinda. 2003. What Are Comparable Corpora? In Workshop on Multilingual Corpora: Linguistic Requirements and Technical Perspectives, pages 27-34, Lancaster, UK.

Ponzetto, Simone Paolo and Roberto Navigli. 2009. Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 2083-2088.

Potthast, M., B. Stein, and M. Anderka. 2008. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522-530.

Saralegui, X. and I. Alegria. 2007. Similitud entre documentos multilingües de carácter científico-técnico en un entorno Web. In Procesamiento del Lenguaje Natural, page 39.

Saralegui, X., I. San Vicente, and A. Gurrutxaga. 2008. Automatic generation of bilingual lexicons from comparable corpora in a popular science domain. In LREC 2008 Workshop on Building and Using Comparable Corpora.

Tomás, J., J. Bataller, and F. Casacuberta. 2001. Mining Wikipedia as a Parallel and Comparable Corpus. In Language Forum, volume 1, page 34.

Tyers, F.M. and J.A. Pienaar. 2008. Extracting Bilingual Word Pairs from Wikipedia. In LREC 2008, SALTMIL Workshop, Marrakech, Morocco.

Yu, Kun and Junichi Tsujii. 2009. Bilingual dictionary extraction from Wikipedia. In Machine Translation Summit XII, Ottawa, Canada.


  • Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase∗

Extracting Parallel Corpora from Wikipedia on the basis of Phrase-Level Bilingual Alignment

Joan Albert Silvestre-Cerdà, Mercedes García-Martínez, Alberto Barrón-Cedeño, Jorge Civera and Paolo Rosso
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
[email protected], {jsilvestre,lbarron,jcivera,prosso}@dsic.upv.es

Resumen: Este artículo presenta una nueva técnica de extracción de corpus paralelos de la Wikipedia mediante la aplicación de técnicas de traducción automática estadística. En concreto, se han utilizado los modelos de alineamiento basados en palabras de IBM para obtener alineamientos bilingües a nivel de frase entre pares de documentos. Para su evaluación se ha generado manualmente un conjunto de test formado por pares de documentos inglés-español, obteniéndose resultados prometedores.
Palabras clave: corpus comparables, extracción de oraciones paralelas, traducción automática estadística

Abstract: This paper presents a proposal for extracting parallel corpora from Wikipedia on the basis of statistical machine translation techniques. We have used IBM word-level alignment models in order to obtain phrase-level bilingual alignments between document pairs. We have manually annotated a set of test English-Spanish comparable documents in order to evaluate the model. The obtained results are encouraging.
Keywords: comparable corpora, parallel sentence extraction, statistical machine translation

1. Introduction

The automatic extraction of parallel corpora from multilingual textual resources is, nowadays, a task of special interest due to the growing momentum of statistical machine translation. The web is an immense source of documents in multiple languages, with many possibilities for exploitation. However, finding parallel sentences on the web at large is a very sparse and extremely difficult task, although not an impossible one (Uszkoreit et al., 2010).

∗ This work was carried out in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems, partially funded by the EC (FEDER/FSE; WIQ-EI IRSES no. 269180 / FP7 Marie Curie People), by MICINN as part of the Text-Enterprise 2.0 project (TIN2009-13391-C04-03) within the Plan I+D+i, and by CONACyT grant 192021. It has also received support from the EC (FEDER/FSE) and MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 project (TIN2009-14511), from MITyC within the framework of the erudito.com project (TSI-020110-2009-439), from the Generalitat Valenciana under grants Prometeo/2009/014 and GV/2010/067, and from the "Vicerrectorado de Investigación de la UPV" under grant 20091027.

Wikipedia is one of the few web resources that explicitly provides a large amount of comparable multilingual text, since its contents are presented as articles in multiple languages describing the same concept. Our goal is thus to exploit the comparable contents of such documents in order to extract parallel sentences that can be used to train machine translation systems.

In this work we propose a heuristic approach to the extraction of parallel corpora from Wikipedia based on Statistical Machine Translation (SMT) techniques.

    Proceedings of the Workshop on Iberian Cross-Language Natural Language Processing Tasks (ICL 2011)


The next section reviews the previous work that inspired this paper. Section 3 then describes the proposed system in detail. Section 4 shows the experimental results and, finally, Section 5 draws a number of conclusions.

2. Related work

Owing to its growing necessity and importance, the automatic extraction of parallel corpora is currently a well-explored task, although the first works date back more than two decades (Brown, Lai, and Mercer, 1991; Gale and Church, 1991) and were limited to finding alignments between sentences in parallel texts. These works propose very fast but not very precise alignment methods, since they used only sentence-length information to detect relations between sentences. Later, Chen proposed using lexical information by means of a simple word-based statistical translation model, showing a significant improvement in the quality of the extracted alignments (Chen, 1993), and a few years later Moore combined both approaches (Moore, 2002). More recently, González proposed a sentence and word alignment model inspired by IBM Model 1 (González-Rubio et al., 2008).

With the problem of aligning sentences in parallel texts well studied, and given the growing demand for parallel corpora for SMT, the main efforts then focused on extracting parallel corpora (Eisele and Xu, 2010; Uszkoreit et al., 2010; Varga et al., 2005), and even monolingual ones (Barzilay and Elhadad, 2003; Quirk, Brockett, and Dolan, 2004), from the web. In this setting, Wikipedia has been a widely exploited resource, with a great variety of approaches, from heuristic methods (Adafre and de Rijke, 2006; Mohammadi and GhasemAghaee, 2010) to approaches based on statistical classification using linear combinations of features (Smith, Quirk, and Toutanova, 2010; Tomás et al., 2008). Some work has also been carried out on the monolingual side (Yasuda and Sumita, 2008). However, none of the previous works has explored the use of statistical translation models to score alignments in comparable resources such as Wikipedia, and it is precisely this experimental gap that the present work aims to fill.

3. System description

For the task of extracting parallel corpora from Wikipedia, we consider pairs of Wikipedia documents X = (x_1, ..., x_j, ..., x_|X|) ∈ 𝒳* and Y = (y_1, ..., y_i, ..., y_|Y|) ∈ 𝒴* describing the same concept, where x_j is the j-th sentence of document X, y_i is the i-th sentence of document Y, and 𝒳 and 𝒴 are the vocabularies of the languages in which the respective documents are written. We define (x_j, y_i) as an alignment between the j-th sentence of document X and the i-th sentence of document Y, and A as a finite set of alignments.

Initially we assume that A = (X × Y), that is, A contains every possible alignment between the sentences of X and those of Y. The probability of each alignment (x_j, y_i) ∈ A is computed according to IBM Model 4 (Brown et al., 1993), a word-level alignment model widely used in Statistical Machine Translation. An alignment receives a high probability if the degree of co-occurrence of the words composing the two sentences is high, and a low probability if the words involved show little or no correlation. It should be noted that the scores produced by the IBM models come from a product of probabilities, with as many factors as words in the target sentence y_i, so the score must be conveniently normalized so that it does not depend on sentence length. Otherwise, alignments with shorter target sentences y_i would tend to be more probable, possibly yielding alignments (x_j, y_i) with high probability even when, for example, |x_j| = 8 and |y_i| = 1.
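The length-normalization argument above can be made concrete. The sketch below uses the simpler IBM Model 1 lexical score as a stand-in for Model 4 (which the paper actually uses), with a hypothetical toy translation table `p_word`; it shows how dividing the log-score by the target length removes the bias towards short target sentences.

```python
import math

def ibm1_log_score(src_words, tgt_words, p_word):
    """Unnormalized IBM Model 1-style log score: one factor per target
    word, so longer target sentences accumulate lower scores."""
    log_p = 0.0
    for t in tgt_words:
        # average translation probability of t given the source words
        p = sum(p_word.get((s, t), 1e-9) for s in src_words) / len(src_words)
        log_p += math.log(p)
    return log_p

def normalized_log_score(src_words, tgt_words, p_word):
    """Length-normalized score: per-target-word log probability."""
    return ibm1_log_score(src_words, tgt_words, p_word) / len(tgt_words)
```

With a table such as {("la", "the"): 0.8, ("casa", "house"): 0.9}, the unnormalized score prefers the truncated target ["the"] over the full ["the", "house"], while the normalized score prefers the full sentence.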

Once all the alignments in the set A have been scored, the set of most probable alignments B ⊆ A is obtained by means of the following maximization:

(x_j, y_i) ∈ B  ⟺  p_IBM(x_j | y_i) > p_IBM(x_j | y_i′)   ∀ i′ = 1 … |Y|, i′ ≠ i;   ∀ j = 1 … |X|   (1)

That is, for each sentence x_j of document X, we keep the alignment (x_j, y_i) that maximizes the IBM Model 4 probability over every possible sentence y_i. This introduces an important restriction into the alignment process, but it allows us to define a baseline system that we plan to improve in the future by computing and then combining the alignments in both directions.

Finally, the set of filtered alignments C ⊆ B is generated, formed by those alignments whose score exceeds a certain threshold α, that is:

(x_j, y_i) ∈ C  ⟺  p_IBM(x_j | y_i) > α   (2)

The threshold α can be interpreted as a parameter that affects the quality of the extracted alignments: the higher the threshold, the stricter we are with the system, and consequently the fewer alignments are extracted. In Section 4 we study the influence of this parameter on the performance of our system.
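The two selection steps (eqs. 1 and 2) can be sketched as follows; `score` stands for any normalized alignment score (a stand-in for the IBM Model 4 probability), and the function returns the filtered set C as a dict mapping sentence-index pairs to scores.

```python
def extract_alignments(doc_x, doc_y, score, alpha):
    """For each source sentence x_j keep its best-scoring target sentence
    y_i (set B, eq. 1), then keep only the pairs whose score exceeds the
    threshold alpha (set C, eq. 2)."""
    best = {}
    for j, x in enumerate(doc_x):
        i, s = max(((i, score(x, y)) for i, y in enumerate(doc_y)),
                   key=lambda pair: pair[1])
        best[(j, i)] = s  # B: one best alignment per source sentence
    return {pair: s for pair, s in best.items() if s > alpha}  # C
```

Raising `alpha` shrinks C, trading recall for precision, which is exactly the behaviour studied in Section 4.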

4. Experiments

In order to evaluate the performance of our method for extracting parallel corpora from Wikipedia, we carried out an experimental study assessing the quality of the sentence pairs automatically extracted by our system on a test set that we had to build manually, given the lack of corpora adequately labelled for this task. The construction of this set, formed by pairs of Wikipedia documents in English and Spanish, is detailed in Sections 4.1 and 4.2.

IBM Model 4 was trained with MGIZA, a tool based on the popular GIZA++ that makes it possible to score a test set with already-trained models and that also supports parallel training. In order to minimize problems related to out-of-vocabulary words and to generalize the domain of the system, the IBM models were trained on a subset of sentence pairs, defined in (Sanchis-Trilles et al., 2010), of three reference corpora in the field of Statistical Machine Translation: Europarl-v5 (Koehn, 2005), News-Commentary and United Nations (Rafalovitch and Dale, 2009). The statistics of this subset are shown in Table 1. Worth noting are the large number of sentence pairs used to train the models, as well as the considerable vocabulary size of each language.

Table 1: Basic statistics of the corpus used to train the IBM models.

                           Training
    Language               En      Es
    Number of sentences        2.8M
    Vocabulary size        118K    164K
    Total running words     54M     58M

The rest of this section is structured as follows: Section 4.1 describes the document extraction procedure and preprocessing. Sections 4.2 and 4.3 then present the annotation methodology and the evaluation metrics, respectively. Finally, Section 4.4 reports the results obtained when evaluating the manually generated test set.

4.1. Document selection and preprocessing

Wikipedia hosts thousands of articles available in English and Spanish, covering an extremely broad domain. For that reason, and with the aim of running an optimistic test of the system, we selected document pairs whose domains resembled that of the corpus used to train the alignment model. Specifically, a total of 15 English-Spanish document pairs related to the economy and administrative processes of the European Union were selected. The plain text was extracted from these documents and then preprocessed by splitting sentences into lines (sentence splitting), isolating words and punctuation marks (tokenization) and converting to lowercase (lowercasing). The statistics of this corpus after preprocessing are shown in Table 2.
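The three preprocessing steps (sentence splitting, tokenization, lowercasing) can be approximated in a few lines; this naive regex version is only an illustrative sketch, since the paper does not specify which tools were used.

```python
import re

def preprocess(text):
    """Naive pipeline: split the text into sentences (one per line),
    isolate words and punctuation marks, and lowercase everything."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+|[^\w\s]", s.lower()) for s in sentences if s]
```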

4.2. Annotation methodology

This section describes the methodology followed to build the annotated evaluation set, starting from a set of previously preprocessed document pairs. The methodology is inspired by (Och and Ney, 2003), but taking alignments between sentences instead of alignments between words.

Table 2: Basic statistics of the manually built evaluation set.

                           Evaluation
    Language               En      Es
    Number of documents        15
    Number of sentences    661     341
    Possible alignments      22680
    Vocabulary size        3.4K    2.8K
    Total running words   24.5K   16.2K

Two annotators labelled the whole set of document pairs manually and independently. They were asked to annotate those alignments, among all the possible ones for each document pair, that held a parallelism relation.

Additionally, the annotators were instructed to assign each alignment to one of the following two sets:

P: Set of probable alignments. These are alignments between sentences that constitute similar, though not exact, translations expressing the same semantic idea, or alignments that form part of a 1-to-many or many-to-1 relation.

S: Set of sure alignments, with S ⊆ P. These are alignments between sentences that are exact or almost exact (parallel) translations.

In this setting, annotator 1 produces the sets S1 and P1, while annotator 2 produces S2 and P2. The sets S1, P1 and S2, P2 are then combined into S and P as follows:

    S = S1 ∩ S2

    P = P1 ∪ P2

The set P (which includes S) represents the sentence pairs that should be extracted by the system, and it is therefore taken as the reference for the task. For this particular corpus, the set S contains 10 alignments, while the set P comprises a total of 115 alignments.
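The combination of the two annotators' judgements is a pair of plain set operations; a toy example with hypothetical (j, i) index pairs:

```python
# Hypothetical alignments from each annotator, as (j, i) index pairs.
s1, p1 = {(0, 0), (3, 2)}, {(0, 0), (3, 2), (5, 4)}   # annotator 1
s2, p2 = {(0, 0)}, {(0, 0), (3, 2), (7, 5)}           # annotator 2

S = s1 & s2   # sure: kept only if both annotators marked it sure
P = p1 | p2   # probable: kept if either annotator proposed it
assert S <= P  # S is a subset of P by construction
```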

4.3. Evaluation metrics

The quality of the filtered set of alignments C obtained automatically by our system is evaluated with the Sentence Alignment Error Rate metric, clearly inspired by the one presented in (Och and Ney, 2003).

Given a pair of documents X and Y, the manually annotated sets of alignments S and P between both documents, and the filtered set of alignments C, the Sentence Alignment Error Rate (SAER) is defined as follows:

SAER(S, P, C) = 1 − (|C ∩ S| + |C ∩ P|) / (|C| + |S|)   (3)

As in (Och and Ney, 2003), we also use recall and precision to obtain further information about the performance of the system:

Recall = |C ∩ S| / |S|,   Precision = |C ∩ P| / |C|   (4)
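Equations 3 and 4 translate directly into code over Python sets of alignment pairs:

```python
def saer(C, S, P):
    """Sentence Alignment Error Rate (eq. 3)."""
    return 1.0 - (len(C & S) + len(C & P)) / (len(C) + len(S))

def recall(C, S):
    """Fraction of sure alignments recovered (eq. 4)."""
    return len(C & S) / len(S)

def precision(C, P):
    """Fraction of extracted alignments that are probable (eq. 4)."""
    return len(C & P) / len(C)
```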

4.4. Results

This section presents the results of the experiments carried out with our system on the manually generated evaluation set. In Section 3 we highlighted the need to study the influence of the parameter α, since it directly determines the quality of the extracted sentences. A high value of this threshold may prevent the system from extracting any alignment at all. Conversely, a small value of α translates into the extraction of a large number of sentence pairs and, ideally, into an increase in the number of correct alignments (True Positives, TP), although the number of False Positives (FP), i.e. alignments not present in the reference, generally grows in a larger proportion than the TPs. The key is thus to find a value of α that guarantees the highest possible True Positive rate (TPR) while minimizing the False Positive rate (FPR). Both rates are computed as follows:


[Figure 1: ROC curve showing the relation between True Positives (vertical axis) and False Positives (horizontal axis) as a function of the parameter α.]

TPR = TP / P = TP / (TP + FN)   (5)

FPR = FP / N = FP / (FP + TN)   (6)

where P represents the number of positive samples, equal to the number of True Positives (TP) plus the number of False Negatives (FN), while N represents the number of negative samples, equal to the number of False Positives (FP) plus the number of True Negatives (TN).
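The two rates are one-liners; plugging in the α = 1.1·10^-3 column of Table 3 (TP = 35, FN = 80, FP = 24, TN = 22541) reproduces the operating point chosen on the ROC curve.

```python
def rates(tp, fn, fp, tn):
    """True-positive rate (eq. 5) and false-positive rate (eq. 6)."""
    return tp / (tp + fn), fp / (fp + tn)
```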

To this end, we performed an exhaustive exploration of the parameter α and then plotted a ROC curve, shown in Figure 1, displaying the relation between True Positives (TPR, vertical axis) and False Positives (FPR, horizontal axis) as a function of the threshold α, whose value is inversely proportional to the displacement along both axes. It must be said that this exploration should have been carried out on a development set, but in its absence we had to use the evaluation set. In the future we plan to enlarge the corpus in order to be able to build a development set.

Several things are worth noting in Figure 1. First, the plot looks degenerate because the relative proportion of False Positives can never reach 1: it is upper-bounded by FP/(FP + TN), where FP ≤ |X| (at most there will be as many FPs as sentences in the input document) and TN can be as large as |X × Y| (the system may discard the set of all possible alignments), so the value of the quotient is very small. Second, we can observe that for higher values of the threshold α the False Positive rate becomes almost zero at a True Positive rate of 0.3, while for smaller values of α we can reach a TPR of 0.5 at an FPR of 0.02. In relative terms, this second operating point seems optimal, but if we consider absolute values we find differences of the order of hundreds of FPs. For this reason we choose the first one, with α = 1.1·10^-3.

Table 3 shows the values of the metrics presented in Section 4.3 after evaluating the test set, together with other statistics of interest, for the threshold value we considered optimal (α = 1.1·10^-3) and for two extreme cases, so as to better appreciate the influence of this parameter on the performance of the system. The first row shows the size of the filtered alignment set C, while the next four rows show the number of samples classified as True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). Finally, the values of the three metrics used to evaluate the system are shown: recall, precision and SAER.

Despite the simplicity of our approach, the table shows fairly acceptable results for the optimal value of α, with an alignment error rate of 0.36, a precision of 0.59 and, above all, a recall of 0.90, although the latter is not a reliable measure given that the corpus contains only 10 alignments labelled as sure. Some examples of the sentence pairs extracted by our system follow:

Table 3: System results on the manually generated test set, for α = {1·10^-4, 1.1·10^-3, 5·10^-2}.

                 α = 1·10^-4   α = 1.1·10^-3   α = 5·10^-2
    |C|              656            59              4
    TP                58            35              2
    TN             21967         22541          22563
    FP               598            24              2
    FN                57            80            113
    Recall          1.00          0.90           0.10
    Precision       0.09          0.59           0.50
    SAER            0.90          0.36           0.79

En: On 20 april 2005, the European Commission adopted the communication on Kosovo to the council “a european future for Kosovo” which reinforces the commission's commitment to Kosovo.

Es: El 20 de abril de 2005, la Comisión Europea adoptó la comunicación sobre kosovo en el consejo “un futuro europeo para Kosovo” que refuerza el compromiso de la comisión con Kosovo.

En: He added that the decisive factor would be the future and the size of the eurozone, especially whether Denmark, Sweden and the UK would have adopted the euro or not.

Es: Añadió que el factor decisivo será el futuro y el tamaño de la zona del euro, especialmente si Dinamarca, Suecia y el Reino Unido se unen al euro o no.

En: Montenegro officially applied to join the EU on 15 december 2008.

Es: Oficialmente, Montenegro pidió el acceso a la UE el 15 de diciembre de 2008.

Looking again at Table 3 and at the differences between the optimal case and the extreme cases, some interesting conclusions can be drawn. For α = 1·10^-4 no alignment is filtered out, i.e. C = B, and we thus see that our system will never be able to find 57 alignments that are indeed in the reference. To overcome this severe limitation we plan to obtain sentence alignments in both directions (X to Y, and Y to X) and then apply a heuristic algorithm inspired by (Och and Ney, 2003) that combines them, starting from the intersection of both alignment sets and adding further alignments. This will lead, first, to more robust alignments and, second, to capturing many-to-1, 1-to-many and even many-to-many relations between sentences.

5. Conclusions and Future Work

In this work we have presented a heuristic approach, alternative to the existing ones, for the automatic extraction of parallel corpora from the comparable multilingual contents offered by Wikipedia. The experimental evaluation has shown truly promising results for our initial system. As an extension of this work we plan to heuristically obtain sentence alignments in both directions in order to improve the quality of the system, an improvement we believe will be substantial. Another future alternative would be to employ in this task the variant of IBM Model 1 presented in (González-Rubio et al., 2008), since it would allow us to obtain bidirectional alignments in a non-heuristic way through Expectation-Maximization training (Dempster, Laird, and Rubin, 1977). Once these improvements are implemented, we will carry out a comparative study of our system against other state-of-the-art systems.

It is also worth noting that in this work we have adapted an existing methodology to the evaluation of sentence-level alignments. To this end, we have defined an annotation methodology suitable for building an evaluation set, as well as a set of metrics to quantify the performance of the system. As future work we intend to increase the size of the corpus and the number of annotators, in order to make the manual alignment annotation process more robust.

References

Adafre, S. F. and M. de Rijke. 2006. Finding Similar Sentences across Multiple Languages in Wikipedia. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 62–69.

Barzilay, Regina and Noemie Elhadad. 2003. Sentence Alignment for Monolingual Comparable Corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, pages 25–32, Stroudsburg, PA, USA. Association for Computational Linguistics.

Brown, P. F. et al. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Brown, Peter F., Jennifer C. Lai, and Robert L. Mercer. 1991. Aligning Sentences in Parallel Corpora. In Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, ACL '91, pages 169–176, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chen, Stanley F. 1993. Aligning Sentences in Bilingual Corpora Using Lexical Information. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, ACL '93, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Eisele, Andreas and Jia Xu. 2010. Improving Machine Translation Performance using Comparable Corpora. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC 2010, pages 35–41. ELRA.

Gale, William A. and Kenneth W. Church. 1991. A Program for Aligning Sentences in Bilingual Corpora. In Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, ACL '91, pages 177–184, Stroudsburg, PA, USA. Association for Computational Linguistics.

González-Rubio, Jesús, Germán Sanchis-Trilles, Alfons Juan, and Francisco Casacuberta. 2008. A Novel Alignment Model Inspired on IBM Model 1. In Proceedings of the 12th Conference of the European Association for Machine Translation, pages 47–56.

Koehn, P. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proc. of the MT Summit X, pages 79–86, September.

Mohammadi, M. and N. GhasemAghaee. 2010. Building Bilingual Parallel Corpora Based on Wikipedia. In Computer Engineering and Applications (ICCEA), 2010 Second International Conference on, volume 2, pages 264–268, March.

Moore, Robert C. 2002. Fast and Accurate Sentence Alignment of Bilingual Corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, AMTA '02, pages 135–144, London, UK. Springer-Verlag.

Och, Franz Josef and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51, March.

Quirk, Chris, Chris Brockett, and William Dolan. 2004. Monolingual Machine Translation for Paraphrase Generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 142–149.

Rafalovitch, Alexandre and Robert Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus.

Sanchis-Trilles, Germán, Jesús Andrés-Ferrer, Guillem Gascó, Jesús González-Rubio, Pascual Martínez-Gómez, Martha-Alicia Rocha, Joan-Andreu Sánchez, and Francisco Casacuberta. 2010. UPV-PRHLT English–Spanish System for WMT10. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 172–176, Uppsala, Sweden, July. Association for Computational Linguistics.

Smith, Jason R., Chris Quirk, and Kristina Toutanova. 2010. Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 403–411, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tomás, Jesús, Jordi Bataller, Francisco Casacuberta, and Jaime Lloret. 2008. Mining Wikipedia as a Parallel and Comparable Corpus. Language Forum, 34(1). Article presented at CICLing-2008, 9th International Conference on Intelligent Text Processing and Computational Linguistics, February 17 to 23, 2008, Haifa, Israel.

Uszkoreit, Jakob, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. 2010. Large Scale Parallel Document Mining for Machine Translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 1101–1109, Stroudsburg, PA, USA. Association for Computational Linguistics.

Varga, Dániel, László Németh, Péter Halácsy, András Kornai, Viktor Trón, and Viktor Nagy. 2005. Parallel Corpora for Medium Density Languages. In Proceedings of the RANLP 2005, pages 590–596.

Yasuda, Keiji and Eiichiro Sumita. 2008. Method for Building Sentence-Aligned Corpus from Wikipedia. In Proceedings of the 33rd AAAI Workshop on Artificial Intelligence (AAAI-08).


Pivot strategies as an alternative for statistical machine translation tasks involving Iberian languages∗

Estrategias pivote como alternativa a las tareas de traducción automática estadística entre idiomas ibéricos

Carlos Henríquez†, Marta R. Costa-jussà⋆, Rafael E. Banchs‡, Lluis Formiga† and José B. Mariño†

† Universitat Politècnica de Catalunya-TALP, C/Jordi Girona, 08034, Barcelona
{carlos.henriquez,lluis.formiga,jose.marino}@upc.edu

⋆ Barcelona Media Innovation Center, Av Diagonal, 177, 9th floor, 08018 Barcelona, Spain
[email protected]

‡ Institute for Infocomm Research, 1 Fusionopolis Way 21-01, Singapore
[email protected]

Resumen: Este artículo describe diferentes aproximaciones para construir sistemas de traducción automática estadística (SMT por sus siglas en inglés) entre idiomas de escasos recursos paralelos. La estrategia es especialmente interesante para España, un país con tres idiomas oficiales (catalán, vasco y gallego) aparte del castellano, en donde es difícil conseguir corpus paralelo entre cualquiera de los tres primeros pero es comparativamente fácil hacerlo entre castellano y cualquiera de ellos. Tal particularidad nos permite aprovechar el castellano como puente o pivote para construir sistemas que traduzcan entre catalán e inglés, por ejemplo. Estos sistemas son de gran utilidad para los idiomas minoritarios pues ayudan a darles una presencia global y a promover su uso. Como caso de uso, se describe un sistema catalán-inglés siguiendo la estrategia pivote de corpus sintético, la comparamos con una aproximación en cascada y comentamos sobre mejoras adicionales que pudieran implementarse para este par de idiomas en particular.
Palabras clave: idioma pivote, traducción automática estadística, corpus paralelo escaso, cascada, pseudo-corpus, modelos de traducción, frases, n-gramas

Abstract: This paper describes different pivot approaches to building SMT systems for language pairs with scarce parallel resources. The strategy is particularly interesting for Spain, a country with three official languages (Catalan, Basque, and Galician) besides Spanish, where it is difficult to find parallel corpora between any two of the first three languages but relatively easy to collect them between Spanish and any of them. This characteristic allows us to develop machine translation systems from major languages like English to, for instance, Catalan, using Spanish as a pivot. Such systems help these minority languages by giving them global presence and promoting their use in content collaboration. We describe an English-Catalan baseline system built following the synthetic-corpus approach, compare it with the transfer approach, and comment on future enhancements that could be implemented for this language pair.
Keywords: pivot language, statistical machine translation, scarce parallel corpora, cascade, pseudo-corpus, phrase-based, ngram-based, translation models


1. Motivation

Spain is a multilingual country with four official languages: Catalan, Euskera, Galician and Spanish. Catalan is spoken by 11.5 million people, Euskera by 1.2 million people, Galician by 3.2 million people and Spanish by 400 million people. Given the high number of Spanish speakers compared to the other languages, Spanish has many more linguistic and data resources.

The quantity of resources matters in statistical machine translation: the more parallel text we have, the better the translation quality. In order to cope with the lack of resources for translation, there is a good deal of research on pivot approaches, which consist of using a pivot language to perform a source-to-target translation (Bertoldi et al., 2008a) (Costa-jussà, Henríquez, and Banchs, 2011). For example, in order to translate from Galician to Catalan, we could use Spanish as the pivot language: there are many more resources for Galician-Spanish and Spanish-Catalan than between Galician and Catalan directly. The same holds when translating Catalan, Euskera or Galician into English. In this work, we introduce a state-of-the-art English-Catalan translation system recently built for the free web translator N-II1.

The main differences with respect to the Catalan-English SMT system presented in (de Gispert and Mariño, 2006) are that in this paper we use an extended corpus and we propose to build a hybrid system that uses an Ngram-based system for Catalan-Spanish and a phrase-based system for Spanish-English. The Ngram-based system outperforms the phrase-based system on Catalan-Spanish (Farrús et al., 2009), while the opposite holds for Spanish-English (Costa-Jussà and Fonollosa, 2009). Additionally, for the Catalan-Spanish system we are using a further competitive system based on rules and statistical features (Farrús et al., 2011).

The remainder of this paper is organized as follows. Section 2 gives a brief description of the phrase-based and Ngram-based translation approaches. Section 3 presents the pivot approaches used in this paper. Section 4 describes the English-Catalan SMT system. Section 5 compares the pivot strategies in terms of translation quality and Section 6 presents the most relevant conclusions.

∗ The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 247762 (FAUST) and from the Spanish Ministry of Science and Innovation through the Juan de la Cierva research program and the Buceador project (TEC2009-14094-C04-01).

1 Available at http://www.n-ii.org

2. Statistical Machine Translation approaches

As mentioned in the previous section, we are working with two SMT systems: the phrase-based (Koehn, Och, and Marcu, 2003) and the Ngram-based (Mariño et al., 2006; Casacuberta and Vidal, 2004) systems, which are briefly described as follows.

    2.1. Phrase-based

This approach to SMT performs the translation by splitting the source sentence into segments and assigning to each segment a bilingual phrase from a phrase table. Bilingual phrases are translation units that contain source words and target words, e.g. <unidad de traducción | translation unit>, and have different scores associated with them. These bilingual phrases are then selected to maximize a linear combination of feature functions. Such a strategy is known as the log-linear model (Och and Ney, 2002) and is formally defined as:

\hat{e} = \arg\max_{e} \left[ \sum_{m=1}^{M} \lambda_m h_m(e, f) \right]    (1)

where h_m are different feature functions with weights λ_m. The two main feature functions are the translation model (TM) and the target language model (LM). Additional models include POS target language models, lexical weights, word penalty and reordering models, among others.
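The decision rule in Eq. (1) can be illustrated with a small sketch. All feature names, weights and scores below are invented for illustration; a real decoder searches over an exponential hypothesis space rather than a short list:

```python
def loglinear_score(feature_values, weights):
    """Score a candidate translation as in Eq. (1):
    the sum over m of lambda_m * h_m(e, f)."""
    return sum(weights[name] * h for name, h in feature_values.items())

def decode(candidates, weights):
    """Pick the candidate maximizing the log-linear score (the arg max)."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))

# Invented weights and log-domain feature values, for illustration only
weights = {"tm": 1.0, "lm": 0.5, "word_penalty": -0.3}
candidates = [
    {"text": "translation unit", "features": {"tm": -1.2, "lm": -2.0, "word_penalty": 2}},
    {"text": "unit of translation", "features": {"tm": -1.5, "lm": -1.0, "word_penalty": 3}},
]
best = decode(candidates, weights)  # scores: -2.8 vs -2.9
```

In practice the weights λ_m are tuned on a development set (e.g. with MERT) rather than set by hand.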

Moses (Koehn et al., 2007) was used to build the phrase-based system.

    2.2. Ngram-based

The basis of the Ngram approach is the concept of tuple. Tuples are bilingual units with consecutive words on both the source and target sides that are consistent with the word alignment. They must provide a unique monotonic segmentation of the sentence pair, and a tuple cannot be inside another tuple in the same sentence. This unique segmentation allows us to see the translation model as a language model where the language is composed of tuples instead of words. That way, the context used in the translation model is bilingual and implicitly works as a language model with bilingual context as well. In fact, while a language model is required in phrase-based and hierarchical phrase-based systems, in Ngram-based systems it is considered just an additional feature.

This alternative approach to a translation model defines the probability as:

P(f, e) = \prod_{n=1}^{N} P\left( (f, e)_n \mid (f, e)_{n-1}, \ldots, (f, e)_1 \right)    (2)

where (f, e)_n is the n-th tuple of hypothesis e for the source sentence f.
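A minimal sketch of Eq. (2), treating the translation model as an n-gram language model over tuples. The tuples, probabilities and backoff floor below are toy values invented for illustration, not taken from the paper's models:

```python
def tuple_sequence_prob(tuples, ngram_prob, order=3):
    """Eq. (2): P(f, e) as a product of conditional tuple probabilities,
    approximated with a limited n-gram history (a language model over
    tuples rather than words)."""
    p = 1.0
    for i, t in enumerate(tuples):
        history = tuple(tuples[max(0, i - order + 1):i])
        p *= ngram_prob.get((history, t), 1e-6)  # crude floor for unseen events
    return p

# Toy tuples pairing a source side and a target side, invented probabilities
t1 = ("unidad", "unit")
t2 = ("de traducción", "of translation")
probs = {((), t1): 0.5, ((t1,), t2): 0.8}
p = tuple_sequence_prob([t1, t2], probs)  # 0.5 * 0.8
```

A real implementation would work in log probabilities with proper smoothing and backoff instead of a fixed floor.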

As additional features, we used a Part-Of-Speech (POS) language model for the target side and a target word bonus model.

We used the open source decoder MARIE (Crego, de Gispert, and Mariño, 2005) to build the Ngram-based system.

    3. Pivot Approaches

The best approaches to build an SMT system through a pivot language are the cascade system, also known as the transfer approach, and the pseudo-corpus or synthetic approach. Other pivot approaches do not outperform these two (Wu and Wang, 2007; Cohn and Lapata, 2007). The cascade and the pseudo-corpus approaches have been evaluated and compared in works such as (de Gispert and Mariño, 2006; Bertoldi et al., 2008a; Bertoldi et al., 2008b). These works have consistently shown that the pseudo-corpus approach is the best performing strategy.

    3.1. Cascade or transfer method

This approach considers the language pairs source-pivot and pivot-target independently. It consists in training and tuning two different SMT systems and combining them in a two-step process: first, we translate a source sentence using the source-pivot system; then, we use the resulting sentence as input for the pivot-target translation. A common variation of this strategy, presented in (Khalilov et al., 2008), considers an n-best output instead of the single-best during the first translation and then produces an m-best translation in the last step. At the end, m·n-best hypotheses are produced, which are reranked using Minimum Bayes Risk (MBR) (Kumar and Byrne, 2004), allowing the introduction of additional features such as new language models.
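The two-step process can be sketched as function composition. The toy "systems" below are single-word stand-ins for the trained source-pivot and pivot-target SMT models, using the Galician-Spanish-Catalan example from the introduction:

```python
def cascade_translate(sentence, src_to_pivot, pivot_to_tgt):
    """Cascade (transfer) pivot translation: translate source -> pivot,
    then feed the result to the pivot -> target system."""
    return pivot_to_tgt(src_to_pivot(sentence))

# Toy stand-ins for trained SMT systems, invented for illustration
def gl_to_es(s):  # Galician -> Spanish
    return s.replace("polbo", "pulpo")

def es_to_ca(s):  # Spanish -> Catalan
    return s.replace("pulpo", "pop")

out = cascade_translate("polbo", gl_to_es, es_to_ca)  # -> "pop"
```

The n-best variation replaces each function with one returning a list of hypotheses, composes every pair, and reranks the resulting m·n candidates.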

3.2. Pseudo-corpus or synthetic approach

Instead of considering the two language pairs independently, this approach produces a single source-target SMT system. Assuming we have a source-pivot and a pivot-target parallel corpus, we build and tune a pivot-target SMT system and use it to translate the pivot part of the source-pivot corpus. This results in a source-target synthetic corpus (hence the name), which is finally used to build the source-target SMT system. For the tuning process, we could also use a synthetic development corpus, but an actual source-target corpus is preferred, if possible. A simple variation of this approach is to build a pivot-source SMT system in order to translate the pivot part of the pivot-target corpus, and use the resulting source-target synthetic corpus to build the final system.
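The synthetic-corpus construction can be sketched as follows. The two-pair mini corpus and the stand-in pivot-to-target translator are invented for illustration; in the paper the translator is a full Spanish-Catalan SMT system:

```python
def build_synthetic_corpus(src_pivot_corpus, pivot_to_tgt):
    """Pseudo-corpus construction: translate the pivot side of a
    source-pivot parallel corpus, yielding synthetic source-target pairs."""
    return [(src, pivot_to_tgt(pivot)) for src, pivot in src_pivot_corpus]

# Toy pivot->target 'system' (Spanish -> Catalan), invented for illustration
def es_to_ca(s):
    return {"hola": "hola", "adiós": "adéu"}.get(s, s)

en_es = [("hello", "hola"), ("goodbye", "adiós")]
en_ca = build_synthetic_corpus(en_es, es_to_ca)
```

The resulting `en_ca` pairs are then used as training data for the final source-target system.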

4. Building an English-Catalan SMT using Spanish as pivot

We present an English-Catalan SMT baseline system using Spanish as the pivot language. In this case, the parallel corpus available for the Catalan-Spanish language pair was provided by the bilingual newspaper "El Periódico"2 and the English-Spanish corpus corresponds to the training corpora provided for the 2010 WMT translation task3, i.e. Europarl and News Commentary. We followed the synthetic approach described above to build the final system. Therefore, the Spanish part of the WMT corpus was translated into Catalan and an English-Catalan phrase-based SMT system was built using the resulting synthetic corpus. Table 1 shows a summary of the statistics of both corpora. We also used the Catalan-Spanish baseline together with the Spanish-English baseline system presented at the 2010 WMT (Henríquez Q. et al., 2010) to build the other direction and compare the different approaches on it.

2 http://www.elperiodico.es
3 http://www.statmt.org/wmt10/translation-task.html


Corpora                       Catalan    Spanish
Training     sents.           4.6M       4.6M
             Running words    96.94M     96.86M
             Vocabulary       1.28M      1.23M
Development  sents.           1966       1966
             Running words    46765      44667
             Vocabulary       9132       9426

Corpora                       Spanish    English
Training     sents.           1.18M      1.18M
             Running words    26.45M     25.29M
             Vocabulary       118073     89248
Development  sents.           1729       1729
             Running words    37092      34774
             Vocabulary       7025       6199
Test         sents.           2525       2525
             Running words    69565      65595
             Vocabulary       10539      8907

Table 1: Catalan-Spanish and Spanish-English corpora (M stands for millions)

4.1. Spanish-Catalan baseline system

As mentioned before, the Spanish-Catalan SMT system (named N-II) is based on the corpus provided by the bilingual newspaper "El Periódico". It is an Ngram-based SMT system that includes several improvements specific to the language pair: homonym disambiguation for the Catalan verb 'soler' and Catalan possessives, special treatment of pronominal clitics, upper-case words and the Catalan apostrophe, gender concordance, number and time categorization, and text processing for common mistakes found when writing in Catalan. The full description can be found in (Farrús et al., 2011).

4.2. English-Catalan system description

Once the Catalan translation of the Spanish section of the WMT corpus was obtained, a phrase-based SMT system was built using Moses as the decoder. Apart from the baseline pipeline, the system also includes a POS target language model computed with TnT (Brants, 2000) and number and time categorization similar to N-II, and the parallel corpus was aligned considering the Catalan lemmas computed with Freeling (Padró et al., 2010) and the English word stems obtained with Snowball4.

4 http://snowball.tartarus.org

Pivot approach   Direction   BLEU
Cascade          cat-eng     21.63
Cascade          eng-cat     24.29
Pseudo-corpus    cat-eng     23.19
Pseudo-corpus    eng-cat     26.97

Table 2: English-Catalan results

    5. Results

Table 2 shows the BLEU scores of the cascade and pseudo-corpus approaches in both directions. The test set was the one provided as internal test set during the WMT translation task. It is also important to mention that the score was computed using one reference.
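For reference, single-reference BLEU can be sketched as below. This is a simplified implementation (geometric mean of modified n-gram precisions times the brevity penalty, no smoothing, whitespace tokenization) and not the exact scorer used in the evaluation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    """Single-reference BLEU sketch: geometric mean of modified n-gram
    precisions (n = 1..max_n) times the brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if matches == 0:
            return 0.0  # a zero precision zeroes the geometric mean
        log_prec += math.log(matches / max(1, len(hyp) - n + 1))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec / max_n)

score = bleu("the cat sat on the mat", "the cat sat on the mat")  # 1.0
```

With a single reference, clipping reduces to the per-reference counts used above; scores in Table 2 are conventionally reported multiplied by 100.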

The final quality of the Catalan-English system is determined by the quality of the Spanish-English corpus, whose baseline has a BLEU around 24 (Henríquez Q. et al., 2010). The Catalan-Spanish baseline has a BLEU around 80 (Farrús et al., 2009). There is also a negative effect due to the difference in domain between the Catalan-Spanish corpus (a regional newspaper) and the Spanish-English corpus (Europarl).

Using paired bootstrap resampling (Koehn, 2004), we can see that for these systems the pseudo-corpus approach is better than the cascade approach with 95% statistical significance.
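The significance test can be sketched as follows. This is a simplification of the method: it resamples per-sentence scores and sums them, whereas the original recomputes corpus-level BLEU for each resampled test set; the scores below are invented:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004), simplified to work on
    per-sentence scores: draw test sets with replacement and count how
    often system A's total beats system B's on the same resampled set."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples  # fraction of samples where A > B

# Invented per-sentence scores for two systems on a shared test set
a = [0.30, 0.25, 0.40, 0.35, 0.28]
b = [0.20, 0.22, 0.30, 0.33, 0.27]
p_a_better = paired_bootstrap(a, b)  # 1.0 here: A wins on every sentence
```

A result of at least 0.95 corresponds to the 95% significance level reported above.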

    6. Conclusions and further work

We have presented an English-Catalan SMT system built using Spanish as pivot language, given the scarce resources for English-Catalan.

Similarly to previous research work, we have seen here that, in the particular translation task under consideration, the pseudo-corpus approach constitutes the best strategy for pivot translation. Although the cascade approach clearly performs worse than the pseudo-corpus approach, it could also be beneficial to consider a system combination of these two strategies to further boost the quality of the translations.

Further work should focus on building Spanish-pivot systems between all the official languages and English, as well as among them. The similarities between the languages (except Basque) and the availability of parallel corpora between Spanish and the others encourage the approach.


References

Bertoldi, N., R. Cattoni, M. Federico, and M. Barbaiani. 2008a. FBK @ IWSLT-2008. In Proc. of the International Workshop on Spoken Language Translation, pages 34-38, Hawaii, USA.

Bertoldi, Nicola, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. 2008b. Phrase-Based Statistical Machine Translation with Pivot Languages. In Proceedings of IWSLT.

Brants, T. 2000. TnT – a statistical part-of-speech tagger. In Proc. of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA.

Casacuberta, F. and E. Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(2):205-225.

Cohn, T. and M. Lapata. 2007. Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora. In Proc. of the ACL.

Costa-Jussà, M. R. and J. A. R. Fonollosa. 2009. Phrase and ngram-based statistical machine translation system combination. Applied Artificial Intelligence: An International Journal, 23(7):694-711, August.

Costa-jussà, M. R., C. Henríquez, and R. Banchs. 2011. Evaluación de estrategias para la traducción automática estadística de chino a castellano con el inglés como lengua pivote. In Proc. of the SEPLN, Huelva.

Crego, J. M., A. de Gispert, and J. B. Mariño. 2005. An Ngram-based Statistical Machine Translation Decoder. In Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech).

de Gispert, A. and J. B. Mariño. 2006. Catalan-English Statistical Machine Translation without Parallel Corpus: Bridging through Spanish. In Proc. of the LREC 5th Workshop on Strategies for developing Machine Translation for Minority Languages (SALTMIL'06), pages 65-68, Genova.

Farrús, M., M. R. Costa-jussà, J. B. Mariño, M. Poch, A. Hernández, C. Henríquez, and J. A. R. Fonollosa. 2011. Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan-Spanish language pair. Language Resources and Evaluation, 45(2):181-208.

Farrús, M., M. R. Costa-jussà, M. Poch, A. Hernández, and J. B. Mariño. 2009. Improving a Catalan-Spanish statistical translation system using morphosyntactic knowledge. In Proceedings of the European Association for Machine Translation 2009.

Henríquez Q., C. A., M. R. Costa-jussà, V. Daudaravicius, R. E. Banchs, and J. B. Mariño. 2010. Using collocation segmentation to augment the phrase table. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 104-108, Uppsala, Sweden, July.

Khalilov, M., M. R. Costa-Jussà, C. A. Henríquez, J. A. R. Fonollosa, A. Hernández, J. B. Mariño, R. E. Banchs, B. Chen, M. Zhang, A. Aw, and H. Li. 2008. The TALP & I2R SMT Systems for IWSLT 2008. In Proc. of the International Workshop on Spoken Language Translation, pages 116-123, Hawaii, USA.

Koehn, P. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, volume 4, pages 388-395.

Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL '07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180, Morristown, NJ, USA.

Koehn, P., F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics.

Kumar, S. and W. Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL'04), pages 169-176, Boston, USA, May.

Mariño, José B., Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. Ngram-based Machine Translation. Computational Linguistics, 32(4):527-549.

Och, F. J. and H. Ney. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Padró, Ll., M. Collado, S. Reese, M. Lloberes, and I. Castellón. 2010. FreeLing 2.1: Five Years of Open-Source Language Processing Tools. In Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010), La Valletta, Malta, May.

Wu, H. and H. Wang. 2007. Pivot Language Approach for Phrase-Based Statistical Machine Translation. In Proc. of the ACL, pages 856-863, Prague.


A Bilingual Summary Corpus for Information Extraction and other Natural Language Processing Applications∗

Un corpus bilingüe para la extracción de información y otras tareas de procesamiento de lenguaje natural

Horacio Saggion and Sandra Szasz
Universitat Pompeu Fabra
Departament de Tecnologies de la Informació i les Comunicacions
Grupo TALN
C/Tanger 122, 08018 Barcelona, Spain
[email protected], [email protected]

Resumen: We present a comparable bilingual corpus of Spanish-English pairs of summaries for three types of events: aviation accidents, rail accidents and earthquakes. Each summary is a text that succinctly describes a particular event. The corpus was manually annotated with semantic information about each event and is appropriate for experimentation in monolingual as well as cross-lingual information extraction.
Palabras clave: information extraction, bilingual corpus, summaries

Abstract: Cross-lingual information extraction, the task of extracting information from multiple multilingual sources, can benefit from the availability of a corpus of equivalent documents in various languages. We present a dataset of pairs of summaries in Spanish and English in various application domains and demonstrate its use in information extraction experiments. The dataset has been manually annotated with semantic information.
Keywords: cross-lingual information extraction, bilingual corpus, summaries

    1 Introduction

Cross-lingual information extraction, the task of extracting information from multiple multilingual sources, is a problem which has received considerably less attention than extraction from monolingual sources. In this paper, we are concerned with the creation of a dataset for the development and evaluation of cross-lingual information extraction systems. Our corpus is a set of pairs of summaries in Spanish and English in various domains. An example from the dataset is shown below:

17 julio 2006 Isla de Java: un maremoto de magnitud 7,7 Richter provoca un 'tsunami' que causó la muerte de 596 personas.

∗ We are grateful to the Programa Ramón y Cajal of the Ministerio de Ciencia e Innovación, Spain.

On 17 July at 03:19:25 p.m. local time an earthquake measuring 7.7 on the Richter scale struck offshore immediately south of West Java at a depth of 10 km. The areas affected by the earthquake and resultant tsunami included the districts of Taskimalaya, Ciamis, Sukabumi and Garut in West Java province, Cilacap, Kebumen and Banyumas in Central Java and the Gunung Kidul and Bantul districts in the province of Yogyakarta. No. Deaths 500.

These elements in the dataset are non-translated equivalent summaries which have been found on the Web. They report on the same event, in this case an earthquake, but because they are not translations of one another, they contain different information; for example, the Spanish summary reports 596 people dead while the English summary reports 500. The English summary is more verbose and contains information about the time of the event and the various locations affected by the tremor, so the two elements are complementary. The dataset can be used for training information extraction systems, studying template-to-text bilingual generation, and automatic knowledge modelling.

This paper gives an overview of the dataset and initial experiments showing its potential application. The rest of this paper is structured as follows: in Section 2 we discuss related work; in Section 3 we describe the dataset created; in Section 4 we illustrate how we have used the corpus; and in Section 5 we present our conclusions.

    2 Related Work

There are various multilingual datasets in the machine translation field, such as the Europarl Multilingual Corpus (Koehn, 2005) or the United Nations Parallel Corpus (Eisele and Chen, 2010). Related to the work presented here are those datasets prepared for text summarization or information extraction research. Among them we have identified the SummBank corpus (Saggion et al., 2002), created for the study of multi-lingual summarization in Chinese and English. The documents in this corpus are translations of one another and contain announcements of a local administration. The corpus has been used in text summarization and information retrieval experiments (Radev et al., 2003). Because of the content and annotation provided with the dataset, this corpus is probably less suitable for information extraction. The CAST corpus (Orăsan, Mitkov, and Hasler, 2003) contains newswire texts and popular science articles in English where annotations are added to indicate: (i) essential sentences, (ii) unessential fragments in sentences, and (iii) links between sentences when one sentence is needed to understand another. Because of the particular annotation schema used, the corpus has potential applications for sentence compression. The SumTime-Meteo corpus (Reiter and Sripada, 2002) provides weather summaries in English generated from numerical data and is potentially useful in data-to-text generation applications and information extraction. The Ziff-Davis corpus contains technical documents in English and their human-created summaries and has been used in text summarization experiments (Knight and Marcu, 2000). The dataset of the Message Understanding Conferences (ARPA, 1993) is probably the best known set for the development of information extraction systems.

3 Data Set Creation and Annotation

The dataset under development is a comparable corpus of Spanish and English summaries for four different domains: aviation accidents, rail accidents, earthquakes, and terrorist acts; this latter subset is still under development. Further domains will be incorporated in the future for researchers interested in evaluating the robustness and adaptation capabilities of different natural language processing techniques. In order to collect the summaries, a keyword search strategy was used to search for documents on the Internet using Google Search. Keywords per domain were defined and used to select a set of Web pages in Spanish; for example, the keywords "lista de terremotos" could be used to search for documents in the earthquake domain. The pages returned by the search engine were examined to verify whether they actually contained an event summary and, in that case, a document was created for the summary (it is not unusual to find multiple summaries in a single Web page). The documents were given names indicating the type of event and the date of the event/incident. A set of around 50 summaries per domain in Spanish was collected in this manner. After this, for each event summary originally in Spanish, the Internet was searched for an equivalent English summary (not a translation) using keywords in English, this time manually derived from the Spanish summary. For example, if an earthquake event mentioned a particular date and intensity, then those elements were used as keywords. Following this procedure we found equivalent English summaries for most of the Spanish ones.

For each domain (event or incident) a set of semantic components (i.e., slots) was identified based on intuition and on the actual data observed in a set of summaries for the domain. The slots/components making up the templates which model the domain are shown in Table 1.

Information            # Spa   # Eng
City                   23      16
Country                47      31
DateOfEarthquake       53      36
Depth                  1       4
Duration               1       3
Epicentre              7       7
Fatalities             50      35
Homeless               7       11
Injured                9       11
Magnitude              47      32
OtherPlacesAffected    27      29
Province               10      9
Region                 25      25
Survivors              1       2
TimeOfEarthquake       4       21
TotalVictims           2       0

Table 3: Number of semantic concepts in Spanish and English earthquake summaries

Corpus examples (pairs of summaries in the two languages) for the three domains are shown in Table 2. In order to manually annotate the summaries with semantic information, we have used the GATE annotation framework (Maynard et al., 2002). To fa-